TICCL-LDcalc output of frequency draw word pairs #42
Comments
I checked the code and it is not as easy as I thought.
Hi Ko,

The value in field 7 is the numerical difference between the anagram values (AVs) of the pair. It comes from the character confusion list that TICCL-lexstat produces from the alphabet, and it (usually) stands for a difference of at most two characters. TICCL-indexer(NT) attaches to each of these character confusion values the lower AV of every word pair (in fact: pair of sets of word anagrams) it identifies.

TICCL-LDcalc reads in these character confusion values (column 1 of the TICCL-indexer output). For each of them it takes the attached lower AVs, retrieves from the anahash the set of word anagrams, i.e. the word(s), associated with that value, and pairs it with a second set of word(s), also retrieved from the anahash. This second retrieval uses the sum of the character confusion value and the associated lower AV, so that sum is the higher AV. At least at the start of LDcalc, you therefore have both the lower and the higher value at hand. LDcalc then proceeds to look at the frequencies of the associated words, etc.

Hope this helps! Martin
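The lookup Martin describes can be sketched as follows. This is only an illustration, not TICCL's actual code: the hash function (sum of code points raised to the fifth power), the helper names and the example words are all assumptions made for the sketch. The point it shows is that the higher AV is never stored next to the confusion value; it is recovered as lower AV plus confusion value.

```python
from collections import defaultdict


def anagram_value(word: str) -> int:
    # Assumption for this sketch: an order-independent sum over characters,
    # so that all anagrams of a word share one value.
    return sum(ord(ch) ** 5 for ch in word)


def build_anahash(words):
    # Hypothetical anahash: maps each anagram value to its set of words.
    anahash = defaultdict(set)
    for word in words:
        anahash[anagram_value(word)].add(word)
    return anahash


def pair_sets(anahash, confusion_value, lower_av):
    # The higher AV is the sum of the confusion value and the lower AV.
    higher_av = lower_av + confusion_value
    lower_words = anahash.get(lower_av, set())
    higher_words = anahash.get(higher_av, set())
    return lower_words, higher_words, higher_av


# Hypothetical example: 'a' -> 'u' is a one-character confusion.
anahash = build_anahash(["cat", "act", "cut"])
confusion = ord("u") ** 5 - ord("a") ** 5
lower_words, higher_words, higher_av = pair_sets(
    anahash, confusion, anagram_value("cat")
)
```

Here `lower_words` holds the anagram set {"cat", "act"} and `higher_words` holds {"cut"}, so both AVs are available before any frequencies are consulted.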
I tried to implement this and installed the fix on maize and violet.
Hi Ko, Thank you! I see no difference on maize between the LDcalc-output of the previous version and the current one either:
And I re-ran the same on violet: no difference with the new maize output there either:
So something did not work as expected. To try and help solve this, I extracted the hapaxes from the corpus frequency list and ran TICCL-LDcalc only on those. That artificially produces nothing but draws. This is the command line:
We select 4 examples from the tail of the output:
Their AVs:
The word forms nearer to the modern canonical form (if there is or would be such) are consistently the lower AV forms. I would very much like to see the output reversed! Thanks! Martin
So I made a small change ON MAIZE ONLY! Happy testing.
Many thanks, Ko! Sure I will test this!!! Starting it up right now ;0) M.
Yes! They're all reversed now :0)
Btw, these are all words from Dutch 'Golden Age' notarial descriptions of house inventories about paintings. A 'zeestucxkien' would have been a small painting depicting a sea scene. Will now run the full thing ;0)
Nice, this looks good.
to
and:
To
Which doesn't look like progress to me, and the complete removal of:
As the left sides are 'out of the lexicon' and deleted after reversal. On a side note: shouldn't the
Reminder for @martinreynaert:
In TICCL-LDcalc it may happen that the frequencies of words in a retrieved pair are the same.
In the case of such a draw, it is actually more likely (for diverse reasons) that the word form having the larger anagram value is the 'variant' and the one having the lower one the 'correction candidate'. Please output these accordingly.
Thank you!
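The requested behaviour for draws can be sketched like this. It is a minimal illustration under stated assumptions, not the TICCL-LDcalc implementation: the function name and the example AVs are hypothetical, and it only covers pairs whose frequencies are already known to be equal.

```python
def order_draw(word_a, av_a, word_b, av_b):
    """Return (variant, correction_candidate) for an equal-frequency pair.

    Hypothetical helper: the word with the larger anagram value is
    reported as the variant, the one with the lower value as the
    correction candidate. If both values are equal (the words are
    anagrams of each other), the input order is kept.
    """
    if av_a >= av_b:
        return word_a, word_b
    return word_b, word_a


# Made-up AVs, for illustration only.
variant, candidate = order_draw("form_low", 100, "form_high", 250)
```

With these made-up values, `variant` is "form_high" and `candidate` is "form_low", which matches the reversed output confirmed in the comments.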