-
Notifications
You must be signed in to change notification settings - Fork 217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch split task to token based splitting #283
base: main
Are you sure you want to change the base?
Conversation
for _, row in df_filtered.iterrows(): | ||
content = row["metadata"]["content"] | ||
|
||
if content is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this necessarily need to be an error? Or could we just set it to an empty string and continue processing?
@drobison00 Thanks for the reviews! I have a couple questions: How do you think we should go about preloading the vocab files in the case that the user doesn't want to allow downloads and in the case they do, how should we go about passing along the huggingface token to access gated models? My thought was to pull it from an environment variable on the client side like we do with unstructured and adobe and pass it along as another parameter in the schema |
Also I can't seem to reproduce the test failure locally |
We have a flaky test. I meant to look into fixing it, but for now you can go into Actions and rerun the test. |
|
||
tokenizer_model = AutoTokenizer.from_pretrained(tokenizer, token=hf_access_token) | ||
|
||
split_docs = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Matching documents should be collected and processed all at once as opposed to iterating rows; and we should actually make this a multiprocessing stage so we're able to use the worker pool for CPU bound tasks. But we can hold of if this needs to go in sooner.
Description
Checklist