Focus-file creation #21

martinreynaert · 2018-07-04T14:13:26Z

I currently have two options to merge the frequency file of the corpus to be corrected with the background file (representing a lexicon or a large background corpus or some combination of both).

However, I seem to have only one option to create a focus file. And this results in all the ngrams that do not completely make/meet the artifrq being incorporated in the focus file. (Unless I am mistaken and overlook another option...)

I would like an option to incorporate in the focus file only those ngrams from the corpus to be corrected that do not meet the artifrq.

This might perhaps be achieved by deferring the inclusion of the background file to TICCL-anahash, and having this (or TICCL-unk?) produce the focus file.

It would be handier, too, if the focus file also listed the actual word forms included, for easy reference.

The latter would also enable TICCL-LDcalc to focus on this (probably) single version among the possible anagrams associated with a particular anagram value, rather than processing them all.

You may wish to regard the above as two separate issues and handle them accordingly.

Thank you!

martinreynaert · 2018-07-04T14:24:56Z

For the time being, I think I can fake making the focus file proper by running both TICCL-unk and TICCL-anahash with only the corpus to be corrected and giving that more limited focus file to TICCL-indexer(NT).

I will try this path.

kosloot · 2018-07-04T15:18:27Z

> However, I seem to have only one option to create a focus file. And this results in all the ngrams that do not completely make/meet the artifrq being incorporated in the focus file. (Unless I am mistaken and overlook another option...)

Yes, this is what we agreed anahash should do (when --ngrams is specified!):
add to the foci file all n-grams for which the following holds:

at least one part is present but with LOW freq
and the lowercase variant of the part is NOT present OR also with low freq

'being present' means: in the provided input file of anahash.
The background corpus is NOT used in this lookup. (maybe that is wrong?)
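
To make the rule above concrete, here is a minimal Python sketch of how I read it. This is not the actual TICCL-anahash code. The function name `is_focus_ngram`, the `_` separator between the parts of an n-gram, and the reading of "LOW freq" as "below the artifrq value" are all assumptions made for illustration only.

```python
def is_focus_ngram(ngram, freqs, artifrq, sep="_"):
    """Return True if the n-gram would go into the foci file under the rule above.

    freqs   -- dict mapping word forms to frequencies (the anahash input file only)
    artifrq -- artificial frequency threshold; "low" is ASSUMED to mean < artifrq
    sep     -- ASSUMED separator between the parts of an n-gram
    """
    for part in ngram.split(sep):
        freq = freqs.get(part)
        if freq is None or freq >= artifrq:
            # Part is absent from the input, or present with a frequency that is not low.
            continue
        lower_freq = freqs.get(part.lower())
        if lower_freq is None or lower_freq < artifrq:
            # Lowercase variant is not present, or is present with low frequency too.
            return True
    return False


# Hypothetical frequency list; the words and counts are made up.
ARTIFRQ = 100_000_000
FREQS = {"teh": 2, "Teh": 1, "the": ARTIFRQ, "cat": ARTIFRQ, "Cat": 3}

for ngram in ("teh_cat", "Teh_cat", "the_cat", "Cat_the", "dog_the"):
    print(ngram, is_focus_ngram(ngram, FREQS, ARTIFRQ))
```

In this sketch the lookup uses only the frequencies passed in as `freqs`. If the background has already been merged into that list (for example by an earlier step), the merged frequencies are what the rule sees.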

martinreynaert · 2018-07-04T15:31:18Z

I think you are mistaken here.

If the background is specified with TICCL-unk already, it is merged and so is used by TICCL-anahash.

So maybe it is correct that it is not used when specified with TICCL-anahash and this provides the option which I require, to base the focus file only on the input to be corrected.

I will check this possibility.

martinreynaert assigned kosloot Jul 4, 2018
