Complete sentence prediction #20
Wow, thanks for your effort! I thought I had mitigated this issue by moving a "sliding text window" over the data. Most of the TSV files were too long for the model anyway, so I used a fixed-size token window and slid it over the training data. I hoped the model could not overfit on file endings because they rarely occur. However, I did not combine the training files, at least if I recall correctly, so there are some "text windows" that contain a file ending. If you fix this issue, I am happy to merge it :)
By sliding window you must mean the max_length and stride tokenizer settings. Yeah, I guess it still overfits because there are several thousand TSV files per language, and the more languages and epochs, the easier it seems to overfit. What I noticed is that when I increased it to four languages, the 10,000 minimum tokenize size was not enough to prevent it. So now I'm trying it with four languages and a 100,000 minimum tokenize size. I think there is an optimal number that depends on the number of languages (TSV files) as well as the number of epochs, because a bit of overfitting is perhaps still desirable for some applications, such as mine, so that the model puts a full stop at the end of short phrases like those used in medical records. I may end up making the parameter a function of the total number of files (items in the data array). Will keep you posted. Thanks again.
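For what it's worth, here is a rough sketch of how I picture the sliding window interacting with file endings. The helper name and parameters are illustrative, not the actual training code; note that I use `step` for the window step size, whereas the Hugging Face tokenizer's `stride` parameter names the *overlap* between windows instead:

```python
def sliding_windows(tokens, max_length=512, step=412):
    """Yield fixed-size, possibly overlapping windows over a token sequence.

    step = max_length - overlap. Only the final window can contain the
    end of the sequence, which is why end-of-file punctuation patterns
    are rare among the resulting training examples.
    """
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break  # last window already covers the sequence ending
        start += step
    return windows

# Small demo: ten tokens, window of 4, step of 2.
demo = sliding_windows(list(range(10)), max_length=4, step=2)
# The final token (the "file ending") appears in exactly one window.
```

If min_tokenize_size should scale with the corpus, a first guess could be something proportional to the number of files times windows per file, but that is exactly the tuning question above.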
Hi Oliver,
Your library is like the gift that keeps on giving, thank you again for it. I noticed that the model tends to predict a sentence-ending punctuation mark at the end of the input text even when the input is unlikely to be a complete sentence. Even the model trained for task 1 and only for English behaves this way. It seems to be because the tokenizer tokenizes each TSV file separately, and there are thousands of .tsv files, including small ones, and they all end on a full stop or question mark. This seems to teach the model to almost always put a full stop at the end of the input text. I tried grouping the training data from the .tsv files into larger chunks in the ModelTrainer.to_dataset() method before passing them to the tokenizer. I've added a parameter to the method called 'min_tokenize_size' and set it to 10,000, which seems to balance predictions better, at least for task 1, English only. I plan to try it for task 2, hoping the current accuracy in other respects won't be lost. Please let me know if you have any suggestions.
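The grouping I tried looks roughly like this (the helper name and the generic `length_fn` are illustrative, not the actual code in ModelTrainer.to_dataset(); in practice the size would be measured in tokens rather than characters):

```python
def group_texts(texts, min_tokenize_size=10_000, length_fn=len):
    """Concatenate per-file texts into chunks of at least
    `min_tokenize_size` units before tokenization.

    With one chunk per TSV file, every training sequence ends in
    sentence-final punctuation; merging files first makes those
    file endings far rarer in the tokenized data.
    """
    chunks, current, size = [], [], 0
    for text in texts:
        current.append(text)
        size += length_fn(text)
        if size >= min_tokenize_size:
            chunks.append(" ".join(current))
            current, size = [], 0
    if current:  # flush the remainder, even if below the minimum
        chunks.append(" ".join(current))
    return chunks

# Demo: three "files" merged into chunks of at least 6 characters.
demo = group_texts(["aaaa", "bbbb", "cc"], min_tokenize_size=6)
```

The trailing chunk can still fall below the minimum, so one file ending per language always survives; that seems acceptable given how much the overall frequency drops.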