one token aligned multiple-times #209

Open
matyaskopp opened this issue Oct 22, 2024 · 1 comment

@matyaskopp
Member

Token to align (input):

$ grep ps2013-001-01-002-002.u1.p5.s1.w18 /net/work/people/kopp/ParCzech/audio-alignment/Data/audio-corresp-tsv-in/www.psp.cz/eknih/2013ps/audio/2013/11/25/2013112513581412.tsv 
se	ps2013-001-01-002-002.u1.p5.s1.w18	MiroslavaNemcova.1952

Alignment output (multiple alignments of one token):

$ grep ps2013-001-01-002-002.u1.p5.s1.w18 /net/work/people/kopp/ParCzech/audio-alignment/Data/audio-align-token/2013112513581412.tsv
se	-	False	ps2013-001-01-002-002.u1.p5.s1.w18	False	-	-	-	-	-	-
se	-	False	ps2013-001-01-002-002.u1.p5.s1.w18	False	-	-	-	-	-	-
se	-	False	ps2013-001-01-002-002.u1.p5.s1.w18	False	-	-	-	-	-	-
se	to	False	ps2013-001-01-002-002.u1.p5.s1.w18	True	2	1.000	612760.0	612910.0	150.0	75.000
se	se	False	ps2013-001-01-002-002.u1.p5.s1.w18	True	0	0.000	743260.0	743420.0	160.0	80.000
se	-	False	ps2013-001-01-002-002.u1.p5.s1.w18	False	-	-	-	-	-	-
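
The duplicate rows above can be detected per file with a small script. This is a minimal sketch, not part of the alignment pipeline: the column layout (token ID in the 4th column, aligned flag in the 5th, start and end times in the 8th and 9th) is an assumption read off the sample rows, and `find_multiply_aligned` is a hypothetical helper name.

```python
import csv

# Assumed column layout, inferred from the sample rows in this issue:
# 0 token, 1 recognized word, 2 flag, 3 token ID, 4 aligned flag,
# 5 distance, 6 normalized distance, 7 start, 8 end, 9 duration, 10 duration per character
TOKEN_ID, ALIGNED, START, END = 3, 4, 7, 8


def find_multiply_aligned(lines):
    """Return {token_id: [(start, end), ...]} for tokens with more than one aligned row."""
    hits = {}
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) > END and row[ALIGNED] == "True":
            span = (float(row[START]), float(row[END]))
            hits.setdefault(row[TOKEN_ID], []).append(span)
    return {tid: spans for tid, spans in hits.items() if len(spans) > 1}


TID = "ps2013-001-01-002-002.u1.p5.s1.w18"
SAMPLE = [
    f"se\t-\tFalse\t{TID}\tFalse\t-\t-\t-\t-\t-\t-",
    f"se\tto\tFalse\t{TID}\tTrue\t2\t1.000\t612760.0\t612910.0\t150.0\t75.000",
    f"se\tse\tFalse\t{TID}\tTrue\t0\t0.000\t743260.0\t743420.0\t160.0\t80.000",
    f"se\t-\tFalse\t{TID}\tFalse\t-\t-\t-\t-\t-\t-",
]

for token_id, spans in find_multiply_aligned(SAMPLE).items():
    print(token_id, spans)
```

For a real file, pass an open file handle instead of `SAMPLE`; rows with "-" placeholders are skipped because their aligned flag is not "True".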

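Only the first match in audio-align-token/2013112513581412.tsv (recognized as "to", 612760.0 to 612910.0) is used for the alignment in the TEI output, although the later row matches "se" exactly: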

<anchor synch="#ps2013-001-01-002-002.u1.p5.s1.w18.ab"/>
<w xml:id="ps2013-001-01-002-002.u1.p5.s1.w18" lemma="se" pos="PRON" msd="UPosTag=PRON|Case=Acc|PronType=Prs|Reflex=Yes|Variant=Short" ana="pdt:P7-X4----------">se</w>
<anchor synch="#ps2013-001-01-002-002.u1.p5.s1.w18.ae"/>
<!-- .... -->
<when xml:id="ps2013-001-01-002-002.u1.p5.s1.w18.ab" interval="612760.0" since="#ps2013-001-01-002-002.audio1.origin"/>
<when xml:id="ps2013-001-01-002-002.u1.p5.s1.w18.ae" interval="612910.0" since="#ps2013-001-01-002-002.audio1.origin"/>
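
To confirm which of the aligned rows ended up in the TEI, the anchor times can be compared with the candidate spans. This is only a sketch: the `<timeline>` wrapper element is an assumption added to make the fragment parseable (a real file would also carry the TEI namespace), and `tei_span` and `used_candidate` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The two <when> elements from this issue, wrapped in an assumed root element.
TEI_FRAGMENT = """<timeline>
<when xml:id="ps2013-001-01-002-002.u1.p5.s1.w18.ab" interval="612760.0" since="#ps2013-001-01-002-002.audio1.origin"/>
<when xml:id="ps2013-001-01-002-002.u1.p5.s1.w18.ae" interval="612910.0" since="#ps2013-001-01-002-002.audio1.origin"/>
</timeline>"""

# (recognized word, start, end) of the two rows marked True in the alignment output.
CANDIDATES = [("to", 612760.0, 612910.0), ("se", 743260.0, 743420.0)]


def tei_span(fragment, token_id):
    """Read the .ab/.ae anchor times of one token from a timeline fragment."""
    root = ET.fromstring(fragment)
    times = {w.get(XML_ID): float(w.get("interval")) for w in root.iter("when")}
    return times[token_id + ".ab"], times[token_id + ".ae"]


def used_candidate(fragment, token_id, candidates):
    """Return the index of the candidate whose span matches the TEI anchors, or None."""
    span = tei_span(fragment, token_id)
    for index, (_, start, end) in enumerate(candidates):
        if (start, end) == span:
            return index
    return None


index = used_candidate(TEI_FRAGMENT, "ps2013-001-01-002-002.u1.p5.s1.w18", CANDIDATES)
print(CANDIDATES[index][0])
```

With the data from this issue the matching candidate is the first one, i.e. the row recognized as "to" rather than the exact "se" match.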

Second example, where one token has two exact matches:

$ grep ps2013-001-01-002-002.u1.p3.s1.w26 audio-corresp-tsv-in/www.psp.cz/eknih/2013ps/audio/2013/11/25/2013112513581412.tsv
ten	ps2013-001-01-002-002.u1.p3.s1.w26	MiroslavaNemcova.1952

$ grep ps2013-001-01-002-002.u1.p3.s1.w26 /net/work/people/kopp/ParCzech/audio-alignment/Data/audio-align-token/2013112513581412.tsv 
ten	ten	False	ps2013-001-01-002-002.u1.p3.s1.w26	True	0	0.000	516570.0	516820.0	250.0	83.333
ten	-	False	ps2013-001-01-002-002.u1.p3.s1.w26	False	-	-	-	-	-	-
ten	-	False	ps2013-001-01-002-002.u1.p3.s1.w26	False	-	-	-	-	-	-
ten	-	False	ps2013-001-01-002-002.u1.p3.s1.w26	False	-	-	-	-	-	-
ten	ten	False	ps2013-001-01-002-002.u1.p3.s1.w26	True	0	0.000	693800.0	694080.0	280.0	93.333
ten	-	False	ps2013-001-01-002-002.u1.p3.s1.w26	False	-	-	-	-	-	-

matyaskopp added the bug (Something isn't working) and 🔔 audio labels on Oct 22, 2024
matyaskopp self-assigned this on Oct 22, 2024

@matyaskopp
Member Author

Tokens aligned more than once in ParCzech 4.0:

/net/work/people/kopp/ParCzech/audio-alignment/Data/audio-align-token$ ls | xargs grep '^[^\t]*\t[^-]*\t'|cut -f4|grep -v CONTEXT|sort | uniq -c|sort -n| grep -v '^ *1 ' > ~/double-aligned-tokens.log

Number of affected sentences:

$ cat ~/double-aligned-tokens.log | sed "s/.* //;s/.w.*//"|sort|uniq|wc -l
32464
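
The same count can be cross-checked without relying on how grep handles \t (without -P it may not be interpreted as a tab). The sketch below mirrors the intent of the two pipelines under stated assumptions: a row counts as aligned when its 2nd column contains no "-", the token ID is the 4th column, rows whose ID contains CONTEXT are skipped, and the sentence ID is the token ID without its trailing ".w…" part. All function names are hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_aligned(lines, counts=None):
    """Count aligned rows per token ID (2nd column has no '-', ID is not CONTEXT)."""
    counts = Counter() if counts is None else counts
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 4 and "-" not in cols[1] and "CONTEXT" not in cols[3]:
            counts[cols[3]] += 1
    return counts


def summarize(counts):
    """Return (tokens aligned more than once, IDs of the sentences containing them)."""
    tokens = {tid: n for tid, n in counts.items() if n > 1}
    sentences = {tid.rsplit(".w", 1)[0] for tid in tokens}
    return tokens, sentences


def scan_directory(directory):
    """Apply the count to every file in a directory such as audio-align-token."""
    counts = Counter()
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            with open(path, encoding="utf-8") as handle:
                count_aligned(handle, counts)
    return summarize(counts)
```

Usage would be along the lines of `tokens, sentences = scan_directory("audio-align-token")` followed by `len(sentences)`; whether this reproduces the figure above depends on the assumptions holding for the whole data set.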
