Focus-file creation #21
Comments
For the time being, I think I can approximate a proper focus file by running both TICCL-unk and TICCL-anahash on only the corpus to be corrected, and then giving that more limited focus file to TICCL-indexer(NT). I will try this route.
Yes, this is what was agreed upon as to what anahash should do (when --ngrams is specified!):
'being present' means: in the provided input file of anahash.
I think you are mistaken here. If the background is already specified with TICCL-unk, it is merged there and so is used by TICCL-anahash. So it may well be correct that the background is not used when specified with TICCL-anahash, and this gives me the option I require: basing the focus file only on the input to be corrected. I will check this possibility.
I currently have two options to merge the frequency file of the corpus to be corrected with the background file (representing a lexicon or a large background corpus or some combination of both).
However, I seem to have only one option to create a focus file. This results in all the ngrams that do not fully meet the artifrq being incorporated in the focus file. (Unless I am mistaken and am overlooking another option...)
I would like the option to have only those ngrams from the corpus to be corrected that do not meet the artifrq incorporated in the focus file.
This might perhaps be achieved by deferring the inclusion of the background file until TICCL-anahash, and having this (or TICCL-unk?) produce the focus file.
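A minimal Python sketch of the selection requested above, for illustration only. The function name `focus_candidates`, the `corpus_only` switch and the toy frequency tables are hypothetical and not part of any TICCL tool. It also assumes that merging simply adds the background counts to the corpus counts, which may differ from what TICCL-unk actually does.

```python
from collections import Counter


def focus_candidates(corpus_freq, background_freq, artifrq, corpus_only):
    """Return the ngrams whose merged frequency stays below artifrq.

    corpus_only=False mimics the behaviour described in this issue: every
    ngram in the merged list that does not meet the artifrq is selected.
    corpus_only=True is the requested behaviour: only ngrams that occur in
    the corpus to be corrected are considered.
    """
    # Assumed merge rule: background counts plus corpus counts.
    merged = Counter(background_freq)
    merged.update(corpus_freq)

    pool = corpus_freq if corpus_only else merged
    return sorted(ngram for ngram in pool if merged[ngram] < artifrq)


# Toy data: the background contributes one validated bigram at the
# artifrq level and one low-frequency bigram of its own.
ARTIFRQ = 100_000_000
corpus = {"teh cat": 2, "the cat": 5}
background = {"the cat": ARTIFRQ, "a dgo": 3}

all_below = focus_candidates(corpus, background, ARTIFRQ, corpus_only=False)
corpus_below = focus_candidates(corpus, background, ARTIFRQ, corpus_only=True)
```

With this toy data, the first call also selects the low-frequency background bigram, while the second call keeps only the bigram that comes from the corpus to be corrected.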
It would be handier, too, if the focus file also listed the actual word forms included, for easy reference.
The latter would also enable TICCL-LDcalc to focus only on this (probably) single variant among the possible anagrams associated with a particular anagram value, rather than processing them all.
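As a rough illustration of a focus file that carries its word forms, here is a small Python sketch. Everything in it is an assumption: `anagram_value` is a toy key (sum of code points raised to the fifth power) chosen only because anagrams share the same value, and the `value~form1#form2` line layout is hypothetical rather than the format any TICCL tool is known to emit for focus files.

```python
from collections import defaultdict


def anagram_value(word):
    # Toy anagram key (an assumption): the order of the characters does
    # not matter, so all anagrams of a word map to the same number.
    return sum(ord(ch) ** 5 for ch in word)


def focus_entries(focus_words):
    """Group the focus word forms by their anagram value."""
    groups = defaultdict(list)
    for word in focus_words:
        groups[anagram_value(word)].append(word)
    return {value: sorted(forms) for value, forms in groups.items()}


def format_focus_line(value, forms, sep="#"):
    # Hypothetical layout: anagram value, a tilde, then the word forms.
    return f"{value}~{sep.join(forms)}"


# "teh" is in focus; its anagram "the" shares the value but is not listed,
# so a consumer would know which variant to process.
entries = focus_entries(["teh", "cat"])
lines = [format_focus_line(v, forms) for v, forms in sorted(entries.items())]
```

A downstream step reading such lines could restrict itself to the listed forms for each anagram value instead of expanding every anagram that shares the value.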
You may wish to regard the above as two separate issues and handle them accordingly.
Thank you!