Operation 'hemp' parameter in FoLiA-stats #29
Comments
I was notified that FoLiA-stats, as installed on the new server 'violet', should now be able to handle ligatures. I tested this on 'violet'. Note that this was the very first time I ran any FoLiA- or TICCL tool on this new machine. It seemed very slow, and it did not work, as can be seen from the output file:
reynaert@violet:/reddata$ grep 'F_r_a_n' /reddata/PILOTS/LEVITICUS/TESTFRQ/TESTFRQFOLIAtagdivNEW.hemp
The command run was:
reynaert@violet:/reddata$ /exp/sloot/usr/local/bin/FoLiA-stats --max-ngram=3 --separator='_' --collect --tags=div -t max --hemp=/reddata/PILOTS/LEVITICUS/TESTFRQ/TESTFRQFOLIAtagdivNEW.hemp -e folia.xml$ -o /reddata/PILOTS/LEVITICUS/TESTFRQ/TESTFRQFOLIAtagdivNEW /reddata/PILOTS/LEVITICUS/FOLIA/NOFOREIGN/
OK, closer examination of the provided data reveals that the 'ij' ISN'T a ligature but just 2 separate characters. So the patch to handle multi-byte characters didn't help. We really need to be more lax here and accept 2-character sequences too.
OK, I improved 'hemp' detection. The bi-gram 'ij' is now always accepted, and bi-grams with a trailing punctuation mark are accepted too, but they are assumed to END the 'hemp'.
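To make the rule above concrete, here is a minimal Python sketch of it. It is not the FoLiA-stats code (which is C++); the function name `find_hemps`, the `MIN_PARTS` threshold and the choice to drop the trailing punctuation mark from the collected hemp are all assumptions made for illustration.

```python
import unicodedata

# Hypothetical threshold: a run needs at least this many parts to count as a hemp.
MIN_PARTS = 2


def is_punct(ch):
    """True if ch is in one of the Unicode punctuation categories (P*)."""
    return unicodedata.category(ch).startswith("P")


def find_hemps(text, separator="_"):
    """Collect runs of spaced-out characters from whitespace-separated text."""
    hemps = []
    run = []

    def flush():
        # Emit the current run if it is long enough, then start over.
        if len(run) >= MIN_PARTS:
            hemps.append(separator.join(run))
        run.clear()

    for token in text.split():
        if len(token) == 1 and token.isalnum():
            # A single letter or a 1-digit number continues the run.
            run.append(token)
        elif token.lower() == "ij":
            # Exception: the bi-gram 'ij' (any case) is always accepted.
            run.append(token)
        elif len(token) == 2 and token[0].isalnum() and is_punct(token[1]):
            # Character plus trailing punctuation: accept the character
            # (assumption: without the punctuation) and END the hemp.
            run.append(token[0])
            flush()
        else:
            # Anything else breaks the run.
            flush()
    flush()
    return hemps


print(find_hemps("Z. F r a n k r ij k."))
print(find_hemps("F r a n s c h zal"))
```

With these assumptions, the leading 'Z.' forms a run of one part and is discarded, while the rest of the first line is collected as a single hemp that includes both the 'ij' and the final 'k'.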
@martinreynaert I would like to improve and clarify 'hemp' detection a bit, especially since we now use the same procedure in FoLiA-correct. I will use some corner cases to illustrate the difficulties. Take the following examples:
I suppose the hemp to be detected is
Some cases with a punctuated hemp:
1, 2 and 3 will give the hemp:
1-digit numbers can also be part of a hemp, as in:
NOTE: as an exception, the bi-gram 'ij' (and its case variants) is also part of a hemp.
To summarize:
Still waiting for an answer.
The hemp parameter in FoLiA-stats collects spaced-out words, i.e. words printed with a space between the letters. It currently breaks on ligatures (see example). It also fails to collect the last letter when that letter is followed by a punctuation mark, which happens often.
reynaert@black:/reddata/PILOTS/LEVITICUS$ grep 'F r a n' /reddata/PILOTS/LEVITICUS/FOLIA/NOFOREIGN/levit.03.NoForeigns.folia.xml.txt
F r a n s c h zal
Z. F r a n k r ij k.
uitgeoefend. Z. F r a n k r ij k.
F r a n k r ij k.
reynaert@black:/reddata/PILOTS/LEVITICUS$ cat TESTFRQ/TESTFRQFOLIAtagdiv.hemp |grep 'F_r_a_n'
F_r_a_n_k_r
1/ ligatures should be seen as single characters.
2/ a final character with a trailing punctuation mark should also be collected.
Perhaps both little issues might be solved by allowing for the 'occasional' two character sequence, given repetitions of single characters in historically emphasised text.
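As an illustration of the two problems, here is a toy Python model of a collector that accepts only single-character tokens. This is an assumption about the behaviour, not the actual FoLiA-stats code; the name `naive_hemps` is hypothetical. It is consistent with the 'F_r_a_n_k_r' line in the .hemp file above.

```python
def naive_hemps(text, separator="_"):
    """Toy collector: only single-letter tokens may be part of a hemp."""
    hemps, run = [], []
    # The empty string acts as a sentinel that flushes the last run.
    for token in text.split() + [""]:
        if len(token) == 1 and token.isalpha():
            run.append(token)
        else:
            # Any longer token ('ij', 'k.', ordinary words) ends the run.
            if len(run) >= 2:
                hemps.append(separator.join(run))
            run = []
    return hemps


print(naive_hemps("Z. F r a n k r ij k."))  # ['F_r_a_n_k_r']
```

In this model the two-character token 'ij' ends the run, and the final 'k.' is never collected because it is two characters long. Allowing the occasional two-character token inside a run would address both points.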