Hello, I am writing because I have been using your Statistical Chunker to perform semantic chunking on a large (but not too large) text dataset. The quality of the produced chunks is quite good, and the algorithm is fairly easy to understand on my end.
However, it took me several days to chunk my dataset on a consumer-grade CPU. Since my dataset has grown, I looked for alternatives and found the semchunk library. It seems to use a much greedier algorithm, trading chunk quality for speed.
Based on these observations, I set out to benchmark the execution times of both.
I used the first 1000 rows of PORTULAN/parlamento-pt, a dataset of old Portuguese legislative texts. The model used was marquesafonso/albertina-sts, the one I am using in my project, which is rather small.
Here are the results in terms of execution time:
| Library | Time (seconds) |
| --- | --- |
| semantic-chunkers | 13124.23 |
| semchunk | 8.81 |
Both chunkers were initialized first and then applied as lambda functions over a dataframe. The SemanticChunker wrapper class was set up as follows:
```python
# Imports assume the semantic-chunkers / semantic-router package layout.
from semantic_chunkers import StatisticalChunker
from semantic_router.encoders import HuggingFaceEncoder

class SemanticChunker:
    def __init__(self, model_name):
        self.chunker = StatisticalChunker(encoder=HuggingFaceEncoder(name=model_name), window_size=2, max_split_tokens=300)

    def chunk(self, text):
        # The chunker expects a list of documents; we pass a single one.
        chunks = self.chunker(docs=[text])
        return [" ".join(chunk.splits) for chunk in chunks[0]]
```
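For reference, the chunker was then applied over the dataframe roughly like this. This is only a minimal sketch; the dataframe and column names are hypothetical, not taken from the original benchmark:

```python
import pandas as pd

# Hypothetical dataframe with one legislative document per row.
df = pd.DataFrame({"text": ["Artigo 1. ...", "Artigo 2. ..."]})

chunker = SemanticChunker("marquesafonso/albertina-sts")
# One chunker call per row, mirroring the lambda-over-dataframe setup.
df["chunks"] = df["text"].apply(lambda text: chunker.chunk(text))
```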
More important than the comparison with the other library, in my opinion, is that it took approximately 3.65 hours to chunk 1000 legislative documents using the Statistical Chunker.
While the quality is very good in my experience, and in the past I have used multi-threading to speed up inference, such long run times make the method difficult to include in a workflow and make it more sensible to even consider character-based chunking techniques.
The purpose of this issue is not to bash the library but rather to share my feedback and findings with you, as well as to start a discussion on what could be causing these very long execution times and possible fixes.
Thanks for sharing this library openly btw!
I would just like to point out one glaring flaw in the above analysis, which is simply that you are comparing apples to oranges.
semchunk, despite being called a "semantic chunking" library, is in fact not one: it does not use semantic information or embeddings at all.
Meanwhile, the StatisticalChunker does use semantic information: it relies on an encoder to compute embedding vectors. Since you said you were running it on a CPU, that takes quite a lot of time, which is exactly what you are seeing. There are multiple ways to speed this up, the simplest being to attach a GPU and run on that, which should give 5x-35x speed-ups (depending on the model).
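As a rough illustration of the GPU route, here is a minimal sketch. The `device` argument is an assumption about the HuggingFaceEncoder signature and should be checked against the semantic-router version you have installed:

```python
import torch
from semantic_chunkers import StatisticalChunker
from semantic_router.encoders import HuggingFaceEncoder

# Assumption: the encoder accepts a `device` option; verify the exact
# parameter name against your installed HuggingFaceEncoder before relying on this.
device = "cuda" if torch.cuda.is_available() else "cpu"
encoder = HuggingFaceEncoder(name="marquesafonso/albertina-sts", device=device)
chunker = StatisticalChunker(encoder=encoder, window_size=2, max_split_tokens=300)
```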
StatisticalChunker would definitely produce higher quality chunks, but if you care more about ingestion speed, you can go with semchunk. Do note that semchunk has had issues chunking documents that are not very well formatted.
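For completeness, here is a minimal sketch of calling semchunk with a token counter built from the same tokenizer. Treat the exact call signature as an assumption and check the semchunk docs for your installed version:

```python
import semchunk
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("marquesafonso/albertina-sts")
# Assumption: semchunk.chunk(text, chunk_size, token_counter) as in recent
# semchunk releases; verify against your installed version.
chunks = semchunk.chunk(
    "O texto legislativo a dividir em pedaços...",  # hypothetical input text
    chunk_size=300,
    token_counter=lambda s: len(tokenizer.encode(s)),
)
```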
TL;DR: Two different approaches, no direct comparison possible