Complete sentence prediction #20

Open
orlink opened this issue Oct 20, 2024 · 2 comments

Comments

@orlink

orlink commented Oct 20, 2024

Hi Oliver,
Your library is like the gift that keeps on giving. Thank you again for it.

I noticed that the model tends to predict a sentence-ending punctuation mark at the end of the input text even when the input is unlikely to be a complete sentence. Even the model trained for task 1 and only for English behaves this way. This seems to happen because the tokenizer tokenizes each TSV file separately, and there are thousands of .tsv files, including small ones, all of which end with a full stop or question mark. This appears to teach the model to almost always put a full stop at the end of the input text.

I tried grouping the training data from the .tsv files into larger chunks in the ModelTrainer.to_dataset() method before passing them to the tokenizer. I've added a parameter to the method called 'min_tokenize_size' and set it to 10,000, which seems to balance the predictions better, at least for task 1, English only. I plan to try it for task 2, hoping the current accuracy in other respects won't be lost. Please let me know if you have any suggestions.
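
For reference, here is a minimal sketch of the grouping idea. The per-file data structure (a list of word/label sequences) and the min_tokenize_size name come from my description above, not from the actual ModelTrainer.to_dataset() code, so treat it as an illustration rather than the real implementation:

```python
# Minimal sketch of the grouping idea; the data structure and parameter name
# are assumptions based on this discussion, not the actual ModelTrainer code.
from typing import List, Tuple

Sample = Tuple[List[str], List[str]]  # (words, labels) for one .tsv file

def group_files(per_file_data: List[Sample], min_tokenize_size: int = 10_000) -> List[Sample]:
    """Concatenate per-file samples until each chunk holds at least
    min_tokenize_size words, so that file endings (which always carry a
    full stop or question mark) become rare in the tokenized chunks."""
    grouped: List[Sample] = []
    words: List[str] = []
    labels: List[str] = []
    for file_words, file_labels in per_file_data:
        words.extend(file_words)
        labels.extend(file_labels)
        if len(words) >= min_tokenize_size:
            grouped.append((words, labels))
            words, labels = [], []
    if words:  # keep the remainder as a final, smaller chunk
        grouped.append((words, labels))
    return grouped
```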

@oliverguhr
Owner

oliverguhr commented Oct 22, 2024

Wow, thanks for your effort!

I thought that I had mitigated this issue by moving a "sliding text window" over the data. Most of the TSV files were too long for the model anyway, so I used a fixed-size token window and slid it over the training data. I hoped the model could not overfit on file endings because they rarely occur in the windows. However, I did not combine the training files, at least if I recall correctly, so there are still some "text windows" that contain a file ending.
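
For anyone reading along, this is roughly what such a fixed-size sliding token window looks like with a Hugging Face fast tokenizer; the checkpoint name and the window/stride values here are illustrative and not necessarily the ones used in the training scripts:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# One long, already word-split text (repeated here just so it overflows the window).
words = ["this", "is", "an", "example", "sentence", "."] * 300

encodings = tokenizer(
    words,
    is_split_into_words=True,        # token-classification style input
    truncation=True,
    max_length=512,                  # fixed window size
    stride=100,                      # overlap between consecutive windows
    return_overflowing_tokens=True,  # emit every window, not just the first
)

# Each entry in encodings["input_ids"] is one window over the same text.
print(len(encodings["input_ids"]), "windows")
```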

If you fix this issue, I am happy to merge it :)

@orlink
Author

orlink commented Oct 22, 2024

By sliding window you must mean the max_length and stride tokenizer settings. Yeah, I guess it still overfits because there are several thousand TSV files per language, and the more languages and epochs, the easier it seems to overfit. What I noticed is that when I increased the training to four languages, the 10,000 minimum tokenize size was not enough to prevent it, so now I'm trying four languages with a 100,000 minimum tokenize size.

I think there is an optimal value that depends on the number of languages (and therefore TSV files) as well as the number of epochs, because a bit of overfitting may still be desirable for some applications, such as mine, so that the model puts a full stop at the end of short phrases like those used in medical records. I may end up making the parameter a function of the total number of files (items in the data array), along the lines of the sketch below. Will keep you posted. Thanks again.
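
Just to make that idea concrete, something along these lines is what I have in mind; the scaling factor is a placeholder to be tuned, not a tested value:

```python
def min_tokenize_size_for(num_files: int, tokens_per_file: int = 25) -> int:
    """Hypothetical heuristic: grow the minimum chunk size with the number
    of .tsv files so that file endings stay rare relative to the rest of
    the training tokens. The factor of 25 is a placeholder, not a tuned value."""
    return max(10_000, num_files * tokens_per_file)
```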
