Main two things currently wrong with TICCL-LDcalc and TICCL-rank (and two more gripes...) #29
Comments
OK. All of the above probably constitutes an intertwined set of problems too complicated to be solved all at once. A few of them, viewed on their own, should be quite easy to solve. I suggest we solve these first and then proceed from there.
First, the wrong ordering of the CCs by TICCL-rank. Before we implemented the descending sort by frequency of the CCs, all was well. This sort should only have been implemented for best-first ranked (--clip=1) output lists anyway, and it is easily done by hand on the output of TICCL-rank. So we should either disable it now or correct it so that it is applied to best-first ranked lists only, respecting the actual best-first ranking according to the confidence.
Second, we need to figure out why and how bigrams such as tire_as and tire_on, being composed of validated words only, still end up in the 'long' ldcalc file, and prevent this from happening.
Third, if the ngram ranking feature is not yet operational, it should be made so, so that we can see what effect it has.
I think these are to be addressed first, if and when you have the time to do so, Ko. MRE
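The second point can be turned into a quick check. The following is a minimal sketch, not TICCL code: it assumes that a word form counts as validated when its frequency is at least the artifrq value, that ngram parts are joined by '_', and it uses a placeholder artifrq and made-up frequencies. The `fold_case` option reflects the suspicion voiced in this issue that capitalized word forms may not have received the artifrq.

```python
# Hypothetical placeholder; the artifrq actually used in the run is not stated here.
ARTIFRQ = 1_000_000_000


def is_validated(word, freqs, artifrq=ARTIFRQ, fold_case=False):
    """Assumption: a word form is 'validated' when its frequency >= artifrq."""
    if freqs.get(word, 0) >= artifrq:
        return True
    # Optionally fall back on the lowercased form for capitalized word forms.
    return fold_case and freqs.get(word.lower(), 0) >= artifrq


def ngram_is_fully_validated(ngram, freqs, artifrq=ARTIFRQ, fold_case=False):
    """True when every part of the '_'-joined ngram is a validated word form."""
    return all(is_validated(part, freqs, artifrq, fold_case)
               for part in ngram.split("_"))


# Made-up frequencies, for illustration only.
freqs = {"tire": 1_000_000_500, "as": 1_000_900_000, "ifle": 3}

print(ngram_is_fully_validated("tire_as", freqs))                  # True
print(ngram_is_fully_validated("ifle_is", freqs))                  # False
print(ngram_is_fully_validated("Tire_as", freqs))                  # False
print(ngram_is_fully_validated("Tire_as", freqs, fold_case=True))  # True
```

Any ngram in the 'long' ldcalc file for which such a check returns True would be a candidate for the filtering problem described above.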
I was mistaken before: the correct resolution for 'ifle' (taking into account the long s to f confusion) is 'isle'. Cf. the contexts:
reynaert@red:/reddata/PILOTS/MORSE/FOLIA/AONG$ grep --color 'ifle of' Morse.archiveorg_nietgetraind.xml.folia.xml
And for the plural:
reynaert@red:/reddata/PILOTS/MORSE/FOLIA/AONG$ grep --color 'ifles of' Morse.archiveorg_nietgetraind.xml.folia.xml
I wonder if this is still an issue, or whether it was solved somewhere along the line. (It may be...)
At least two things seem to have gone wrong here. On the basis of my own logs, I now conclude that A/ in this issue was definitely solved. It must have been: it was very clear what happened and what had made it happen. Output from not too long after this issue was posted also corroborates that this was solved. Note that the filename explicitly mentions a 'new' TICCL-LDcalc and a fix by Ko in TICCL-rank.
(LMdev) reynaert@violet:MORSE$ ls -l /reddata/PILOTS/MORSE/RUNAMALGAM3NEWLDCALC/zzz/TICCL/RUNAMALGAM3NEWLDCALC.wordfreqlist.1to3.tsv.tsv.clean.ldcalc.RANKED.FIXKORANK.SKIP9-10-11-13.ranked
(LMdev) reynaert@violet:MORSE$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAMALGAM3NEWLDCALC/zzz/TICCL/RUNAMALGAM3NEWLDCALC.wordfreqlist.1to3.tsv.tsv.clean.ldcalc.RANKED.FIXKORANK.SKIP9-10-11-13.ranked
(LMdev) reynaert@violet:MORSE$ grep --color '^Amarican' /reddata/PILOTS/MORSE/RUNAMALGAM3NEWLDCALC/zzz/TICCL/RUNAMALGAM3NEWLDCALC.wordfreqlist.1to3.tsv.tsv.clean.ldcalc.RANKED.FIXKORANK.SKIP9-10-11-13.ranked
The following is a long explanation of things going wrong currently. It offers no possible solutions yet. These will follow asap. I am trying to figure out the 'easiest fix'.
A/ We have recently adapted TICCL-rank to the needs of the new TICCL-chain by making it sort its best-first ranked (parameter --clip=1 ) output file numerically descending on the frequency of the Correction Candidate (CC). This has broken the correct working of TICCL-rank.
B/ We have also quite recently made TICCL-LDcalc output 'short' correction pairs to a new output file *short.ldcalc, and the ngrams from which the short correction pairs were derived to a new file with extension 'ambi'. This creates further problems for TICCL-rank, as we shall explain later.
C/ Furthermore, we do not know whether the new ranking feature, based on the number of observed ngrams in which a particular word form appears, is in fact operational in TICCL-LDcalc yet.
D/ We remain handicapped by the fact that we do not have an exhaustive description of the full ranking system as currently implemented in TICCL-LDcalc and TICCL-rank.
Addressing A/ : For a while we have been under the impression that TICCL 'just' misses the most obvious Correction Candidate. We think we have now found the cause of this.
We present output from TICCL-rank run with respectively --clip=1, --clip=5 and --clip=10 on TICCL-LDcalc output on the English book by Morse.
In CLIP5 we see clearly that the CCs are ranked according to their frequency and no longer according to the confidence score. In fact the highest confidence score is with the fifth ranked CC. In CLIP10 we see that the highest confidence score in CLIP5 is outranked by the even higher confidence score of CC 'Niles'.
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.ranked
nuiles#1#Naples#4000030272#2#0.998194
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP5.ranked
nuiles#1#Naples#4000030272#2#0.998194
nuiles#1#Miles#3000031696#2#0.998486
nuiles#1#Giles#2000014531#2#0.998273
nuiles#1#Jules#2000014280#2#0.99853
nuiles#1#suites#2000007207#2#0.9993
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP10.ranked
nuiles#1#Naples#4000030272#2#0.998194
nuiles#1#Miles#3000031696#2#0.998486
nuiles#1#Giles#2000014531#2#0.998273
nuiles#1#Jules#2000014280#2#0.99853
nuiles#1#suites#2000007207#2#0.9993
nuiles#1#Dulles#2000005022#2#0.998645
nuiles#1#Niles#2000004196#1#0.999699
nuiles#1#Wiles#2000001183#2#0.998052
nuiles#1#Nines#2000000883#2#0.998734
nuiles#1#Ailes#2000000578#2#0.999088
When we look at the appropriately sorted output of CLIP1000, we see that 'Niles' in fact has the highest confidence score. The CCs that now make up the 'best' ranked top 10 have swamped the actual desired correction 'miles'. Its capitalized version 'Miles', which was present in CLIP5, is now out of sight, too.
Current TICCL output (incorrectly sorted by CC frequency) for non-word word form 'nuiles':
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP1000.ranked | sort -gr -t '#' -k 4 |head -n 10
nuiles#1#Naples#4000030272#2#0.998194
nuiles#1#Miles#3000031696#2#0.998486
nuiles#1#Giles#2000014531#2#0.998273
nuiles#1#Jules#2000014280#2#0.99853
nuiles#1#suites#2000007207#2#0.9993
nuiles#1#Dulles#2000005022#2#0.998645
nuiles#1#Niles#2000004196#1#0.999699
nuiles#1#Wiles#2000001183#2#0.998052
nuiles#1#Nines#2000000883#2#0.998734
nuiles#1#Ailes#2000000578#2#0.999088
Output as should be sorted by highest confidence:
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP1000.ranked | sort -gr -t '#' -k 6 |head -n 10
nuiles#1#Niles#2000004196#1#0.999699
nuiles#1#Tules#2000000029#2#0.999486
nuiles#1#nuclei#1000008297#2#0.999478
nuiles#1#rules#1000152878#2#0.99946
nuiles#1#Rules#1000021220#2#0.999433
nuiles#1#suites#2000007207#2#0.9993
nuiles#1#nails#1000009554#2#0.999203
nuiles#1#Suites#1705034559#2#0.999194
nuiles#1#Nilus#1000000335#2#0.999176
nuiles#1#Yules#2000000019#2#0.999097
Anyway, the main thing is that currently even the best-first ranked CC offered with CLIP1 is not the one with the highest confidence score, but the one with the highest frequency, which is plainly wrong. This is an undesired artefact of the resorting implemented for TICCL-chain.
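The intended behaviour can be sketched outside TICCL-rank as follows. This is a minimal illustration, not the actual implementation: it assumes the '#'-separated layout seen in the listings above, with the CC in field 3, the CC frequency in field 4 and the confidence in field 6 (as the sort commands imply). The meaning of fields 2 and 5 is not relied upon. The function names are made up for this sketch.

```python
from collections import defaultdict


def parse_ranked_line(line):
    """Split one '#'-separated line of a *.ranked file into named fields."""
    variant, _f2, cc, cc_freq, _f5, confidence = line.strip().split("#")
    return {"variant": variant, "cc": cc,
            "cc_freq": int(cc_freq), "confidence": float(confidence)}


def clip_by_confidence(lines, clip=1):
    """Keep, per variant, the `clip` candidates with the highest confidence."""
    grouped = defaultdict(list)
    for line in lines:
        if line.strip():
            record = parse_ranked_line(line)
            grouped[record["variant"]].append(record)
    return {variant: sorted(records, key=lambda r: r["confidence"],
                            reverse=True)[:clip]
            for variant, records in grouped.items()}


# A few lines copied from the CLIP10 listing above.
SAMPLE = [
    "nuiles#1#Naples#4000030272#2#0.998194",
    "nuiles#1#Miles#3000031696#2#0.998486",
    "nuiles#1#suites#2000007207#2#0.9993",
    "nuiles#1#Niles#2000004196#1#0.999699",
    "nuiles#1#Ailes#2000000578#2#0.999088",
]

best = clip_by_confidence(SAMPLE, clip=1)
print(best["nuiles"][0]["cc"])  # Niles
```

A descending sort on CC frequency, if still wanted for TICCL-chain, could then be applied to the resulting best-first list as a whole, rather than to the candidates of a single variant.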
We see much the same for 'Amarican', though the result is less wrong: here the highest confidence score is at least given to the right correction.
TICCL sorted output:
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^Amarican' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP5.ranked |more
Amarican#1#America#4000475833#2#0.996842
Amarican#1#American#3001522167#1#0.998421
Amarican#1#Americas#3000025187#2#0.995474
Amarican#1#Américas#3000000831#2#0.991158
Amarican#1#African#2000256933#2#0.993263
Output re-sorted by descending confidence:
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^Amarican' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP5.ranked |sort -gr -t '#' -k 6 |more
Amarican#1#American#3001522167#1#0.998421
Amarican#1#America#4000475833#2#0.996842
Amarican#1#Americas#3000025187#2#0.995474
Amarican#1#African#2000256933#2#0.993263
Amarican#1#Américas#3000000831#2#0.991158
Nevertheless: the 'best-first ranked' candidate without parameter --clip is still the one obtained by highest frequency sorting:
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^Amarican' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.ranked |more
Amarican#1#America#4000475833#2#0.996842
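To find all variants affected in this way, the best-first output can be compared with a larger clip list. The sketch below is only a diagnostic under the same assumptions about the field layout (CC in field 3, confidence in field 6). The function names are hypothetical and not part of TICCL.

```python
def top_by_confidence(lines):
    """Map each variant to its (CC, confidence) with the highest confidence."""
    best = {}
    for line in lines:
        fields = line.strip().split("#")
        if len(fields) != 6:
            continue  # skip empty or malformed lines
        variant, cc, confidence = fields[0], fields[2], float(fields[5])
        if variant not in best or confidence > best[variant][1]:
            best[variant] = (cc, confidence)
    return best


def misranked(clip1_lines, clipn_lines):
    """Variants whose best-first CC differs from the top-confidence CC."""
    found = top_by_confidence(clip1_lines)
    expected = top_by_confidence(clipn_lines)
    return {variant: (found[variant][0], expected[variant][0])
            for variant in found
            if variant in expected and found[variant][0] != expected[variant][0]}


# Lines copied from the listings above.
CLIP1 = ["Amarican#1#America#4000475833#2#0.996842"]
CLIP5 = [
    "Amarican#1#America#4000475833#2#0.996842",
    "Amarican#1#American#3001522167#1#0.998421",
    "Amarican#1#Americas#3000025187#2#0.995474",
    "Amarican#1#Américas#3000000831#2#0.991158",
    "Amarican#1#African#2000256933#2#0.993263",
]

print(misranked(CLIP1, CLIP5))  # {'Amarican': ('America', 'American')}
```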
Addressing B/ : In prior runs, in which the foci file was not yet curtailed to the foreground corpus only, we found that 'tire' is often a confusable for 'the'. We are rather surprised that this is still the case, although many more pairs representing this confusion now seem to have been properly filtered out on the basis of their frequencies, i.e. because both are validated word forms. That it still happens in some cases is in itself another issue to be addressed. (This may be because capitalized word forms did not get the artifrq, at least in some of these cases.)
Example:
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^tire~the' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc.ambi
tire~the#first_tire~First_the#first_tire~first_the#tire_Great_Kanhaway~the_Great_Kanhaway#tire_Great_Kanhaway~the_great_Kanhaway#tire_Guisos_Mexico~the_Guisos_Mexico#tire_Guisos_Mexico~the_guisos_Mexico#tire_Guisos~the_Guisos#tire_Guisos~the_guisos#tire_Milliiippi~the_Milliiippi#tire_life~the_LIFE#tire_life~the_Life#tire_life~the_life#
As stated before, we are not currently attempting to solve confusables. But this example allows us to explain the issue currently at hand.
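For inspecting the ngram evidence behind a short pair, a line of the ambi file can be split into its parts. This is a small sketch that assumes the layout `variant~CC#ngram~ngram#...`, i.e. '#'-separated entries that each hold a '~'-separated pair, with the short pair itself first. The sample line is an abbreviated, illustrative version of the 'tire'/'the' case and the function name is made up.

```python
def parse_ambi_line(line):
    """Return the short pair and the list of ngram pairs it was derived from."""
    pairs = []
    for entry in line.strip().split("#"):
        if "~" in entry:  # ignore the empty entry after the trailing '#'
            left, right = entry.split("~", 1)
            pairs.append((left, right))
    if not pairs:
        raise ValueError("no 'variant~CC' pair found in line")
    return pairs[0], pairs[1:]


# Abbreviated, illustrative line in the assumed ambi layout.
SAMPLE_AMBI = "tire~the#first_tire~First_the#tire_life~the_life#"

short_pair, evidence = parse_ambi_line(SAMPLE_AMBI)
print(short_pair)     # ('tire', 'the')
print(len(evidence))  # 2
```

The number of ngram pairs per short pair obtained this way is the kind of count an ngram-based ranking feature could work with.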
The short forms have duly been added to the *short.ldcalc file, as we have recently decided to do. The pair 'tire'/'the' is here the first of the last nine of 52 such 'confusable' pairs in *short.ldcalc.
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^tire' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.short.ldcalc |tail -n 9
tire
00the000220100120tire
0tides1000007728100000988102311001tire
00ties00022010050tire
0tin0002201001tire
00tis00022010010tire
0toe0002201001tire
00tone00022010010tire
0wine0002200002tire
00wise000220000~1
[Another new issue, which seems to have popped up in the last week or so as a consequence of one of the latest adjustments to the workflow, is apparent here: for lots of these pairs the usual information, such as frequencies, is now missing.]
The issue we are inching towards is this: short word forms may well be 'properly' handled by *short.ldcalc and *ambi, but other pairs based on the actual bigram (mostly, if not exclusively, we suspect) are still incorporated in the regular 'long' *ldcalc file. We no longer see the actual 'tire_land' and 'tire_bay' examples we had a couple of weeks ago; the first delivered e.g. the CCs 'Ireland' and 'fireland' in the long ldcalc file. But the following examples are clear enough (granted: they should not be there, by virtue of the frequencies of their composing words alone):
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^tire_' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc
tire_as
4455Tijeras1000000109100000010923318930336251110044tire_as
55Treas10000000981000000124238036236572511100tire_as
4455treas10000000261000000124238036236572511100266tire_on
266Ireson10000000921000000092148343068382510100tire_on
266266Tiron10000000841000000084232073370562511100266tire_on
266Treon10000000411000000041238036236572511100tire_or
6565TREVOR10520000183025512626967251110065tire_or
65Trevor2000018197200001830255126269672511100tire_to
170187Tirito1000000000100000000010444521431251110~0
A non-word example concerns 'ifle':
We have 596 pairs containing this non-word in short.ldcalc.
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ cat /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.short.ldcalc |grep '^ifle~' |wc
596 596 21098
For the probably correct resolution 'rifle' we have the following evidence:
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^ifle~rifle' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc.ambi
ifle~rifle#The_ifle_is~the_rifle_is#The_ifle~The_rifle#The_ifle~the_rifle#and_the_ifle~and_the_rifle#ifle_is~rifle_is#ifle_of~rifle_of#ifle_on_the~rifle_on_the#ifle_on~rifle_on#ifle_or~rifle_or#small_ifle~small_rifle#the_ifle_of~the_rifle_of#the_ifle~The_rifle#the_ifle~the_rifle#
'Long' LDcalc nevertheless still retains a number of 'ifle' bigrams.
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^ifle_' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc
ifle_is
11Ifles142380362365725010001ifle_is
1ifles34238036236572501000ifle_on
11Flemon10000000021000000002280020708125101001ifle_on
1Fleron111197781063502500100ifle_on
11Flexon4444923474575025001001ifle_on
1Isleton10000000521000000052110889093722511100ifle_or
11Flexor1181128923474575025001001ifle_or
1flexor1010112892347457502500100
The problem with these is that TICCL-rank misses the possibly likeliest resolution, which is in short.ldcalc, and will rank the rest, probably delivering a False Positive.
I am not sure what would be best to do about this. I think for now we should keep both the short.ldcalc and ambi output. And still add the 'short' bigrams to 'long' ldcalc so that TICCL-rank has the data necessary to do its job well.
Given the inordinate amount of possible pairs for 'ifle' in short.ldcalc, I am not sure the very large background corpus containing also ngrams helps rather than obfuscates the situation. It seems that we should boost the evidence of validated ngrams present in the foreground corpus where and how possible.
Yet one more 'new' issue that bothers me is the fact that capitalized word forms seem to have gained prominence in the corrections. This is due to the fact that TICCL-anahash sorts the anagrams collected alphabetically, it seems. If at all possible, these should rather be sorted by frequency.
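If the alphabetical sort is indeed the cause, the effect is easy to reproduce: in a plain code-point sort, uppercase letters come before lowercase ones, so capitalized forms end up first regardless of frequency. The sketch below contrasts this with a sort by descending frequency. The frequencies are taken from the listings above, but grouping these four word forms together is only for illustration and does not reflect how TICCL-anahash actually groups anagrams.

```python
# Frequencies as they appear in the ranked listings above.
freqs = {
    "rules": 1000152878,
    "Rules": 1000021220,
    "suites": 2000007207,
    "Suites": 1705034559,
}

# Plain alphabetical (code-point) order: capitalized forms come first.
alphabetical = sorted(freqs)

# Descending frequency, with the word form itself as a tie-breaker.
by_frequency = sorted(freqs, key=lambda word: (-freqs[word], word))

print(alphabetical)  # ['Rules', 'Suites', 'rules', 'suites']
print(by_frequency)  # ['suites', 'Suites', 'rules', 'Rules']
```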
Another thing... This run had --low=4. Yet we find the couple 'ifles~riffles', word lengths 5 and 7 respectively, in short.ldcalc.
reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^ifles~' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc.ambi |grep 'ifles~riffles'
ifles~riffles#ifles_of~riffles_of#
How does that happen?
Addressing C/ : I need to know.
Addressing D/ : I need to know, too.
Further to the ranking features: now we have the foreground foci file: we should use this as another, strong ranking feature: if the CC is present: boost.
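What such a boost could look like is sketched below. Everything in it is hypothetical: since the full ranking system is not documented (see D/), the sketch simply adds a fixed bonus to the confidence of every CC found in the foreground foci set and re-sorts. The boost value, the contents of the foci set and the function name are assumptions for illustration only. The confidences are taken from the 'nuiles' listings above.

```python
def rank_with_foci_boost(candidates, foci, boost=0.002):
    """Sort (CC, confidence) pairs, favouring CCs present in the foci set.

    `boost` is an arbitrary illustrative value, not a TICCL parameter.
    """
    def boosted(candidate):
        cc, confidence = candidate
        return confidence + (boost if cc in foci else 0.0)

    return sorted(candidates, key=boosted, reverse=True)


# Confidences from the 'nuiles' listings; the foci set is an assumption.
candidates = [("Niles", 0.999699), ("suites", 0.9993), ("Miles", 0.998486)]
foreground_foci = {"miles", "Miles"}

ranked = rank_with_foci_boost(candidates, foreground_foci)
print([cc for cc, _ in ranked])  # ['Miles', 'Niles', 'suites']
```

How strong the boost should be relative to the other ranking features would have to be determined experimentally.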
Following up on mainly A/ and B/: I will post recommendations for remedial work asap.
MRE