Sunday, March 17, 2024

Hierarchical (and other) Indexes using LlamaIndex for RAG Content Enrichment


At our weekly This Week in Machine Learning (TWIML) meetings, (our leader and facilitator) Darin Plutchok pointed out a LinkedIn blog post on Semantic Chunking that has recently been implemented in the LangChain framework. Unlike more traditional chunking approaches that use number of tokens or separator tokens as a guide, this one chunks groups of sentences into semantic units by breaking them when the (semantic) similarity between consecutive sentences (or sentence-grams) falls below some predefined threshold. I had tried it earlier (pre-LangChain) and while results were reasonable, it needed a lot of processing, so I went back to what I was using before.

I was also recently exploring LlamaIndex as part of an effort to familiarize myself with the GenAI ecosystem. LlamaIndex supports hierarchical indexes natively, meaning it provides the data structures that make building them easier and more natural. Unlike the standard RAG index, which is just a sequence of chunks (and their vectors), hierarchical indexes cluster chunks into parent chunks, and parent chunks into grandparent chunks, and so on. A parent chunk would generally inherit or merge most of the metadata from its children, and its text would be a summary of its children's text contents. To illustrate my point about LlamaIndex data structures having natural support for this kind of setup, here are the definitions of the LlamaIndex TextNode (the LlamaIndex Document object is just a child of TextNode with an additional doc_id: str field) and the LangChain Document. Of particular interest is the relationships field, which allows pointers to other chunks using named relationships PARENT, CHILD, NEXT, PREVIOUS, SOURCE, etc. Arguably, the LlamaIndex TextNode could be represented more generally and succinctly by the LangChain Document, but the hooks do help to support hierarchical indexing more naturally.

# this is a LlamaIndex TextNode
class TextNode:
  id_: str = None
  embedding: Optional[List[float]] = None
  extra_info: Dict[str, Any]
  excluded_embed_metadata_keys: List[str] = None
  excluded_llm_metadata_keys: List[str] = None
  relationships: Dict[NodeRelationship, Union[RelatedNodeInfo, List[RelatedNodeInfo]]] = None
  text: str
  start_char_idx: Optional[int] = None
  end_char_idx: Optional[int] = None
  text_template: str = "{metadata_str}\n\n{content}"
  metadata_template: str = "{key}: {value}"
  metadata_separator: str = "\n"

# and this is a LangChain Document
class Document:
  page_content: str
  metadata: Dict[str, Any]

In any case, having discovered the hammer that is LlamaIndex, I began to see a lot of potential hierarchical index nails. One such nail that occurred to me was to use Semantic Chunking to cluster consecutive chunks rather than sentences (or sentence-grams), and then create parent nodes from these chunk clusters. Instead of computing cosine similarity between consecutive sentence vectors to build up chunks, we compute cosine similarity across consecutive chunk vectors and split them up into clusters based on some similarity threshold, i.e. if the similarity drops below the threshold, we terminate the cluster and start a new one.

Both LangChain and LlamaIndex have implementations of Semantic Chunking (for sentence clustering into chunks, not chunk clustering into parent chunks). LangChain's Semantic Chunking lets you set the threshold using percentiles, standard deviation and inter-quartile range, while the LlamaIndex implementation supports only the percentile threshold. But intuitively, here is how you could get an idea of the percentile threshold to use; thresholds for the other methods can be computed similarly. Assume your content has N chunks and K clusters (based on your understanding of the data or from other estimates); then, assuming a uniform distribution, there would be N/K chunks in each cluster, and roughly K of the consecutive similarity values would be cluster breaks. If K is roughly 20% of N, then your percentile threshold would be roughly 80.
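As a concrete version of that intuition, here is a minimal sketch (the function name and signature are mine, not from either framework), assuming chunk_vectors is an (N, d) numpy array of chunk embeddings:

import numpy as np

# With N chunks there are N-1 consecutive similarities, of which roughly
# K should be cluster breaks, so the break threshold sits at about the
# (100 * K / N)-th percentile of the similarity distribution.
def similarity_threshold(chunk_vectors, num_clusters):
    a, b = chunk_vectors[:-1], chunk_vectors[1:]
    sims = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(np.percentile(sims, 100.0 * num_clusters / len(chunk_vectors)))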

LlamaIndex provides an IngestionPipeline which takes a list of TransformComponent objects. My pipeline looks something like the one below. The last component is a custom subclass of TransformComponent; all you have to do is override its __call__ method, which takes a List[TextNode] and returns a List[TextNode].

from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

text_splitter = SentenceSplitter()
embedding_generator = HuggingFaceEmbedding()
summary_node_builder = SemanticChunkingSummaryNodeBuilder()  # custom, described below

transformations = [text_splitter, embedding_generator, summary_node_builder]
ingestion_pipeline = IngestionPipeline(transformations=transformations)
docs = SimpleDirectoryReader("/path/to/input/docs").load_data()
nodes = ingestion_pipeline.run(documents=docs)

My custom component takes the desired cluster size K at construction time. It uses the vectors computed by the (LlamaIndex provided) HuggingFaceEmbedding component to compute similarities between consecutive vectors and uses K to compute a threshold to use. It then uses the threshold to cluster the chunks, resulting in a list of lists of chunks List[List[TextNode]]. For each cluster, we create a summary TextNode and set its CHILD relationships to the cluster nodes, and the PARENT relationship of each child in the cluster to this new summary node. The text of the child nodes is first condensed using extractive summarization, then these condensed summaries are further summarized into one final summary using abstractive summarization. I used bert-extractive-summarizer with bert-base-uncased for the first, and a HuggingFace summarization pipeline with facebook/bart-large-cnn for the second. I suppose I could have used an LLM for the second step, but it would have taken more time to build the index, and I have been experimenting with ideas described in the DeepLearning.AI course Open Source Models with HuggingFace.
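Here is a condensed sketch of what that component might look like. This is my reconstruction rather than the actual code; in particular, summarize() is a hypothetical helper standing in for the extractive-then-abstractive summarization and metadata merging described above.

import numpy as np
from llama_index.core.schema import (
    NodeRelationship, RelatedNodeInfo, TextNode, TransformComponent,
)

class SemanticChunkingSummaryNodeBuilder(TransformComponent):
    num_clusters: int  # the desired K, supplied at construction

    def __call__(self, nodes, **kwargs):
        # cosine similarities between consecutive chunk vectors
        vecs = np.array([node.embedding for node in nodes])
        sims = np.sum(vecs[:-1] * vecs[1:], axis=1) / (
            np.linalg.norm(vecs[:-1], axis=1) * np.linalg.norm(vecs[1:], axis=1))
        threshold = np.percentile(sims, 100.0 * self.num_clusters / len(nodes))
        # close off a cluster wherever similarity drops below the threshold
        clusters, current = [], [nodes[0]]
        for sim, node in zip(sims, nodes[1:]):
            if sim < threshold:
                clusters.append(current)
                current = []
            current.append(node)
        clusters.append(current)
        # one summary (parent) node per cluster, with PARENT / CHILD
        # relationships wired up in both directions
        summary_nodes = []
        for cluster in clusters:
            parent = TextNode(text=summarize(cluster))  # hypothetical helper
            parent.relationships[NodeRelationship.CHILD] = [
                RelatedNodeInfo(node_id=child.node_id) for child in cluster]
            for child in cluster:
                child.relationships[NodeRelationship.PARENT] = RelatedNodeInfo(
                    node_id=parent.node_id)
            summary_nodes.append(parent)
        return nodes + summary_nodes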

Finally, I recalculate the embeddings for the summary nodes: I ran the summary node texts through the HuggingFaceEmbedding, but I suppose I could have done some aggregation (mean-pool / max-pool) on the child vectors as well.
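The aggregation alternative would have been a one-liner; a sketch, assuming cluster holds the child TextNodes of a given summary node parent:

import numpy as np

# hypothetical alternative: mean-pool the child vectors instead of
# re-embedding the summary text (max-pool would use np.max instead)
parent.embedding = np.mean(
    [child.embedding for child in cluster], axis=0).tolist()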

Darin also pointed out another instance of hierarchical indexing, proposed via RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval and described in detail by the authors in this LlamaIndex webinar. This is a little more radical than my idea of using semantic chunking to cluster consecutive chunks, in that it allows clustering of chunks across the entire corpus. One other important difference is that it allows for soft clustering, meaning a chunk can be a member of more than one cluster. They first reduce the dimensionality of the vector space using UMAP (Uniform Manifold Approximation and Projection) and then apply a Gaussian Mixture Model (GMM) to do the soft clustering. To find the optimal number of clusters K for the GMM, one can use a combination of AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion).
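Under my assumptions (the parameter values here are illustrative, not taken from the paper), that model selection might look something like this sketch:

import numpy as np
import umap
from sklearn.mixture import GaussianMixture

# reduce dimensionality with UMAP, then sweep K and score each GMM fit;
# soft memberships for the chosen K come from gmm.predict_proba(reduced)
def choose_num_clusters(chunk_vectors, max_k=50):
    reduced = umap.UMAP(n_components=10).fit_transform(chunk_vectors)
    best_k, best_bic = 2, np.inf
    for k in range(2, max_k + 1):
        gmm = GaussianMixture(n_components=k, random_state=42).fit(reduced)
        bic = gmm.bic(reduced)  # AIC available via gmm.aic(reduced)
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k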

In my case, when training the GMM, the AIC kept decreasing as the number of clusters increased, and the BIC had its minimum value for K=10, which corresponds roughly to the 12 chapters in my Snowflake book (my test corpus). But there was a lot of overlap, which would force me to implement some kind of logic to take advantage of the soft clustering, which I didn't want to do, since I wanted to reuse code from my earlier Semantic Chunking node builder component. Ultimately, I settled on 90 clusters by using my original intuition to compute K, and the resulting clusters seemed fairly well separated.

Using the results of the clustering, I built this as another custom LlamaIndex TransformComponent for hierarchical indexing. This implementation differs from the previous one only in the way it assigns nodes to clusters; all other details with respect to text summarization and metadata merging are identical.

For both these indexes, we have a choice: maintain the index as hierarchical and decide which layer(s) to query based on the question, or add the summary nodes into the same level as the other chunks and let vector similarity surface them when queries deal with cross-cutting issues that may be found together in these nodes. The RAPTOR paper reports that they don't see a significant gain using the first approach over the second. Because my query functionality is LangChain based, my approach has been to generate the nodes, then reformat them into LangChain Document objects and use LCEL to query the index and generate answers, so I haven't looked into querying from a hierarchical index at all.
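The reformatting step itself is small; a minimal sketch, assuming nodes is the List[TextNode] produced by the ingestion pipeline:

from langchain_core.documents import Document

# reformat LlamaIndex TextNodes into LangChain Documents for LCEL querying;
# which metadata keys to carry over is a design choice
lc_docs = [
    Document(page_content=node.get_content(), metadata=node.metadata)
    for node in nodes
]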

Looking back on this work, I am reminded of similar choices when designing traditional search pipelines. Often there is a choice between building functionality into the index to support a cheaper query implementation, or building the logic into the query pipeline, which may be more expensive but also more flexible. I think LlamaIndex started with the first approach (as evidenced by their blog posts Chunking Strategies for Large Language Models Part I and Evaluating Ideal Chunk Sizes for RAG Systems using LlamaIndex) while LangChain started with the second, though nowadays there is a lot of convergence between the two frameworks.
