This package, developed as part of our research detailed in the Chroma Technical Report, provides tools for text chunking and evaluation. It allows users to compare different chunking methods and includes implementations of several novel chunking strategies.
- Compare Chunking Methods: Evaluate and compare various popular chunking strategies.
- Novel Chunking Methods: Implementations of new chunking methods such as ClusterSemanticChunker and LLMChunker.
- Evaluation Framework: Tools to generate domain-specific datasets and evaluate retrieval quality in the context of AI applications.
You can immediately test the package via Google Colab.
You can install the package directly from GitHub:
pip install git+https://github.com/brandonstarxel/chunking_evaluation.git
This example shows how to implement your own chunking logic and evaluate its performance.
from chunking_evaluation import BaseChunker, GeneralEvaluation
from chromadb.utils import embedding_functions
# Define a custom chunking class
class CustomChunker(BaseChunker):
    def split_text(self, text):
        # Custom chunking logic: fixed-size chunks of 1200 characters
        return [text[i:i+1200] for i in range(0, len(text), 1200)]
# Instantiate the custom chunker and evaluation
chunker = CustomChunker()
evaluation = GeneralEvaluation()
# Choose embedding function
default_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="OPENAI_API_KEY",  # replace with your own OpenAI API key
    model_name="text-embedding-3-large"
)
# Evaluate the chunker
results = evaluation.run(chunker, default_ef)
print(results)
# {'iou_mean': 0.17715979570301696, 'iou_std': 0.10619791407460026,
# 'recall_mean': 0.8091207841640163, 'recall_std': 0.3792297991952294}
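To help read the printed metrics, here is a minimal sketch of how an intersection-over-union (IoU) and a recall score could be computed for a single query. It is an illustration under stated assumptions, not the package's implementation: it works on character indices rather than tokens, and the helper names `char_set` and `iou_and_recall` are hypothetical.

```python
def char_set(ranges):
    """Expand half-open (start, end) character ranges into a set of indices."""
    covered = set()
    for start, end in ranges:
        covered.update(range(start, end))
    return covered


def iou_and_recall(relevant_ranges, retrieved_ranges):
    """Return (IoU, recall) between relevant excerpts and retrieved chunks.

    Assumption: both inputs are lists of (start, end) character offsets
    into the same corpus. The real evaluation may be token-based.
    """
    relevant = char_set(relevant_ranges)
    retrieved = char_set(retrieved_ranges)
    overlap = len(relevant & retrieved)
    union = len(relevant | retrieved)
    iou = overlap / union if union else 0.0
    recall = overlap / len(relevant) if relevant else 0.0
    return iou, recall


# One 100-character relevant excerpt, fully covered by 400 retrieved characters
iou, recall = iou_and_recall([(100, 200)], [(0, 150), (150, 400)])
print(f"IoU={iou:.2f}, recall={recall:.2f}")  # IoU=0.25, recall=1.00
```

In this toy case the excerpt is fully retrieved, so recall is 1.0, but most of the retrieved text is irrelevant, so IoU is only 0.25. Large chunks tend to show this pattern of high recall with low IoU.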
# Alternatively, define and evaluate a custom embedding function
from chromadb import Documents, EmbeddingFunction, Embeddings

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # Embed the documents here; `embeddings` must hold one vector per document
        return embeddings

# Instantiate the embedding function
default_ef = MyEmbeddingFunction()

# Evaluate the embedding function with a chunker
results = evaluation.run(chunker, default_ef)
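The "embed the documents somehow" step is left open above. As a rough illustration of the expected shape (a list of texts in, one vector per text out), here is a toy hashed bag-of-words embedder. The class name `HashingEmbeddingFunction` and the `dim` parameter are hypothetical, and the class is deliberately standalone so it runs without chromadb; to pass it to `evaluation.run` you would presumably subclass chromadb's `EmbeddingFunction` as in the snippet above. It is not a substitute for a real embedding model.

```python
import hashlib
import math


class HashingEmbeddingFunction:
    """Toy embedder: hashed bag-of-words, L2-normalized (illustration only)."""

    def __init__(self, dim=64):
        self.dim = dim

    def __call__(self, input):
        embeddings = []
        for doc in input:
            vec = [0.0] * self.dim
            for word in doc.lower().split():
                # Map each word to a stable bucket via a cryptographic hash
                digest = hashlib.sha256(word.encode("utf-8")).digest()
                bucket = int.from_bytes(digest[:4], "big") % self.dim
                vec[bucket] += 1.0
            norm = math.sqrt(sum(v * v for v in vec))
            # Leave all-zero vectors (empty documents) untouched
            embeddings.append([v / norm for v in vec] if norm else vec)
        return embeddings


toy_ef = HashingEmbeddingFunction(dim=64)
vectors = toy_ef(["chunking strategies", "retrieval evaluation"])
print(len(vectors), len(vectors[0]))  # 2 64
```

Because the hash is deterministic, the same text always maps to the same vector, which makes this handy for smoke tests but meaningless for judging retrieval quality.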
This example demonstrates how to use our ClusterSemanticChunker and how you can evaluate it yourself.
from chunking_evaluation import GeneralEvaluation
from chunking_evaluation.chunking import ClusterSemanticChunker
from chromadb.utils import embedding_functions
# Instantiate evaluation
evaluation = GeneralEvaluation()
# Choose embedding function
default_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="OPENAI_API_KEY",  # replace with your own OpenAI API key
    model_name="text-embedding-3-large"
)
# Instantiate chunker and run the evaluation
chunker = ClusterSemanticChunker(default_ef, max_chunk_size=400)
results = evaluation.run(chunker, default_ef)
print(results)
# {'iou_mean': 0.18255175232840098, 'iou_std': 0.12773219595465307,
# 'recall_mean': 0.8973469551927365, 'recall_std': 0.29042203879923994}
Follow these steps to build a synthetic dataset from your own corpora for domain-specific evaluation.
- Initialize the Environment:
    from chunking_evaluation import SyntheticEvaluation

    # Specify the corpora paths and output CSV file
    corpora_paths = [
        'path/to/chatlogs.txt',
        'path/to/finance.txt',
        # Add more corpora files as needed
    ]
    queries_csv_path = 'generated_queries_excerpts.csv'

    # Initialize the evaluation
    evaluation = SyntheticEvaluation(corpora_paths, queries_csv_path, openai_api_key="OPENAI_API_KEY")
- Generate Queries and Excerpts:
    # Generate queries and excerpts, and save to CSV
    evaluation.generate_queries_and_excerpts()
- Apply Filters:
    # Apply filter to remove queries with poor excerpts
    evaluation.filter_poor_excerpts(threshold=0.36)

    # Apply filter to remove duplicates
    evaluation.filter_duplicates(threshold=0.6)
- Run the Evaluation:
    from chunking_evaluation import BaseChunker

    # Define a custom chunking class
    class CustomChunker(BaseChunker):
        def split_text(self, text):
            # Custom chunking logic
            return [text[i:i+1200] for i in range(0, len(text), 1200)]

    # Instantiate the custom chunker
    chunker = CustomChunker()

    # Run the evaluation on the filtered data
    results = evaluation.run(chunker)
    print("Evaluation Results:", results)
- Optional: If query generation fails to produce queries, try approximate excerpts:
    # Generate queries and excerpts using approximate excerpts, and save to CSV
    evaluation.generate_queries_and_excerpts(approximate_excerpts=True)
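To give a feel for what a similarity threshold such as the one passed to `filter_duplicates` controls, here is a small sketch of greedy near-duplicate removal over a list of queries. It is an assumption-laden illustration, not the package's implementation: it compares strings with the standard library's `difflib.SequenceMatcher`, whereas the package may well use embedding similarity, and the helper name `drop_near_duplicates` is hypothetical.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(queries, threshold=0.6):
    """Keep a query only if it is not too similar to any query kept so far.

    Assumption: similarity at or above `threshold` counts as a duplicate.
    """
    kept = []
    for query in queries:
        is_duplicate = any(
            SequenceMatcher(None, query.lower(), seen.lower()).ratio() >= threshold
            for seen in kept
        )
        if not is_duplicate:
            kept.append(query)
    return kept


queries = [
    "What was the Q3 revenue?",
    "what was the q3 revenue",      # near-duplicate of the first query
    "How do I reset my password?",
]
unique_queries = drop_near_duplicates(queries, threshold=0.6)
print(unique_queries)
```

Under this sketch, lowering the threshold removes more queries (anything loosely similar is treated as a duplicate), while raising it keeps more of them.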
The following will be installed along with the package:
- tiktoken
- fuzzywuzzy
- pandas
- numpy
- tqdm
- chromadb
- python-Levenshtein
- openai
- anthropic
- attrs
If you use this package in your research, please cite our technical report:
@techreport{smith2024evaluating,
title = {Evaluating Chunking Strategies for Retrieval},
author = {Smith, Brandon and Troynikov, Anton},
year = {2024},
month = {July},
institution = {Chroma},
url = {https://research.trychroma.com/evaluating-chunking},
}
We welcome contributions and are excited you'd like to get involved! Please open your pull request against the dev branch. We will test it there before merging it into main.