Focus-file creation #21

Open
martinreynaert opened this issue Jul 4, 2018 · 3 comments

@martinreynaert
Collaborator

I currently have two options to merge the frequency file of the corpus to be corrected with the background file (representing a lexicon or a large background corpus or some combination of both).

However, I seem to have only one option to create a focus file. And this results in all the ngrams that do not completely make/meet the artifrq being incorporated in the focus file. (Unless I am mistaken and overlook another option...)

I would like to have the option to only have those ngrams from the corpus to be corrected that do not meet the artifrq to be incorporated in the focus file.
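One way to read this request, as a minimal sketch rather than a description of how TICCL works: the frequencies are still merged, but only word forms that occur in the corpus to be corrected are eligible for the focus file. The function names, the summing merge, and the toy frequencies are all assumptions made for illustration.

```python
def merge_frequencies(corpus, background):
    """Sum two word-form -> frequency tables (assumed merge strategy)."""
    merged = dict(background)
    for form, freq in corpus.items():
        merged[form] = merged.get(form, 0) + freq
    return merged


def focus_from_merged(merged, artifrq):
    """Current behaviour as described: every form below artifrq is a focus form."""
    return {form for form, freq in merged.items() if freq < artifrq}


def focus_from_corpus_only(corpus, merged, artifrq):
    """Requested behaviour: only corpus forms below artifrq are focus forms."""
    return {form for form in corpus if merged[form] < artifrq}


# Hypothetical toy data, not taken from any real TICCL run.
artifrq = 100_000_000
corpus = {"teh": 3, "the": 50}
background = {"the": artifrq, "zyzzyva": 2}

merged = merge_frequencies(corpus, background)
print(sorted(focus_from_merged(merged, artifrq)))               # ['teh', 'zyzzyva']
print(sorted(focus_from_corpus_only(corpus, merged, artifrq)))  # ['teh']
```

In this toy case the rare background form `zyzzyva` ends up in the focus set under the first function but not under the second, which is the difference the request is about.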

This might perhaps be achieved by deferring the inclusion of the background file until TICCL-anahash, and having TICCL-anahash (or TICCL-unk?) produce the focus file.

It would also be handier if the focus file listed the actual word forms included, for easy reference.

The latter would also enable TICCL-LDcalc to focus only on this (probably single) word form among the possible anagrams associated with a particular anagram value, rather than processing them all.

You may wish to regard the above as two separate issues and handle them accordingly.

Thank you!

@martinreynaert
Collaborator Author

For the time being, I think I can approximate a proper focus file by running both TICCL-unk and TICCL-anahash with only the corpus to be corrected, and then giving that more limited focus file to TICCL-indexer(NT).

I will try this path.

@kosloot
Collaborator

kosloot commented Jul 4, 2018

> However, I seem to have only one option to create a focus file. And this results in all the ngrams that do not completely make/meet the artifrq being incorporated in the focus file. (Unless I am mistaken and overlook another option...)

Yes, this is what was agreed upon for anahash (when --ngrams is specified!):
add to the foci file all n-grams for which the following holds:

  • at least one part is present but with LOW freq
  • and the lowercase variant of the part is NOT present OR also with low freq

'Being present' means: present in the input file provided to anahash.
The background corpus is NOT used in this lookup (maybe that is wrong?).
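The rule above can be sketched as a small predicate. This is an illustration of the stated conditions only, not the anahash implementation: it assumes that "low freq" means a frequency below artifrq, that n-gram parts are joined by an underscore, and that the frequency table holds exactly the entries of the anahash input file. The function names are hypothetical.

```python
def is_low(form, freqs, artifrq):
    """True if the form is present in the input and its frequency is below artifrq."""
    return form in freqs and freqs[form] < artifrq


def belongs_in_foci(ngram, freqs, artifrq, sep="_"):
    """Apply the two conditions to each part of the n-gram.

    The n-gram qualifies if at least one part is present with low frequency
    and the lowercase variant of that part is absent or also has low frequency.
    """
    for part in ngram.split(sep):
        if not is_low(part, freqs, artifrq):
            continue
        lower = part.lower()
        if lower not in freqs or freqs[lower] < artifrq:
            return True
    return False


# Hypothetical toy frequencies standing in for the anahash input file.
freqs = {"Teh": 2, "teh": 1, "The": 5, "the": 100, "cat": 100}
artifrq = 100

print(belongs_in_foci("Teh_cat", freqs, artifrq))  # True
print(belongs_in_foci("The_cat", freqs, artifrq))  # False
```

`Teh_cat` qualifies because `Teh` and its lowercase variant are both low; `The_cat` does not, because `the` reaches artifrq and `cat` is not low.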

@martinreynaert
Collaborator Author

I think you are mistaken here.

If the background is already specified with TICCL-unk, it is merged there and so is used by TICCL-anahash.

So it may well be correct that the background is not used when it is specified with TICCL-anahash, and this would provide the option I require: basing the focus file only on the input to be corrected.

I will check this possibility.
