
make tokenize-shuffle more robust to long pages #271

Merged
2 commits merged into main on May 10, 2024

Conversation

@jeffreywpli (Collaborator) commented May 10, 2024

Current Issue:

Some sources (e.g. code repos, books) are prone to very long pages, which produce very long lists of tokens. However, the way we currently yield sequences is to simply add all of a page's tokens to the buffer and then repeatedly slice the buffer. When the buffer grows too large this becomes very slow, since many slices of a very large buffer have to be made.
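For reference, a minimal sketch of the pattern described above (illustrative names only, not the actual tokenize-shuffle code):

```python
# Hypothetical sketch of the current behavior: every page's tokens are appended
# to a single growing buffer, which is then repeatedly sliced into sequences.
def yield_sequences_naive(pages, seq_len):
    buffer = []
    for page_tokens in pages:
        buffer += page_tokens          # a very long page makes the buffer huge in one step
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]  # each slice copies the (possibly huge) remainder
```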

Solution:

Cap the buffer size by moving tokens from a page's token set into the buffer in bounded chunks, rather than appending the whole page at once.
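A rough sketch of what this could look like, using the same illustrative names as above plus an assumed max_buffer parameter (none of these names are taken from the actual code): tokens are fed from the page into the buffer in bounded chunks, so the buffer never grows far beyond the sequence length.

```python
def yield_sequences_bounded(pages, seq_len, max_buffer=2**20):
    # Sketch only; max_buffer and all names here are assumptions, not the real implementation.
    assert max_buffer >= seq_len
    buffer = []
    for page_tokens in pages:
        pos = 0
        while pos < len(page_tokens):
            # top the buffer up to the cap instead of appending the whole page at once
            take = min(len(page_tokens) - pos, max_buffer - len(buffer))
            buffer += page_tokens[pos:pos + take]
            pos += take
            while len(buffer) >= seq_len:
                yield buffer[:seq_len]
                del buffer[:seq_len]  # slicing a small, capped buffer stays cheap
```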

@Vaishaal (Contributor) left a comment


LGTM

@Vaishaal merged commit b47fd05 into main on May 10, 2024
2 checks passed