udpipe2teitok.pl tries to write to folder as if it were a file #6

Closed
SteffRhes opened this issue Dec 16, 2024 · 4 comments

Comments

@SteffRhes

description

The script fails to create its output files and instead creates folders named after the intended output files. For example, with the input file DEU_001.txt, the console output is:

perl Scripts/udpipe2teitok.pl --orgfolder=/veld/input/ --lang=de --model=german-hdt-ud-2.6-200830 --mixed
Using model: german-hdt-ud-2.6-200830

Treating /veld/input/DEU_001.txt
 - Running UDPIPE from http://lindat.mff.cuni.cz/services/udpipe/api/process/german-hdt-ud-2.6-200830
 - Writing tmp to udpipeDEU_001.conllu
 - Writing to udpipeDEU_001.xml

However, no files udpipeDEU_001.conllu or udpipeDEU_001.xml are created. Instead, there are folders with those names:

drwxr-xr-x. 1 steff steff    0 Dec 16 15:50 udpipeDEU_001.conllu
drwxr-xr-x. 1 steff steff    0 Dec 16 15:50 udpipeDEU_001.xml

I guess a / is missing? I.e. it should be udpipe/DEU_001.conllu?
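
A minimal sketch of the suspected cause. The variable names `tmpfolder` and `base` are hypothetical and not taken from udpipe2teitok.pl; the sketch only shows how joining a folder name and a file name without a separator would produce the name seen in the log.

```shell
# Hypothetical names; this illustrates the suspected path-joining problem,
# not the actual code in udpipe2teitok.pl.
tmpfolder="udpipe"
base="DEU_001"

# Joining without a separator gives the name printed in the console output.
broken="${tmpfolder}${base}.conllu"

# Joining with a separator gives the path that was probably intended.
expected="${tmpfolder}/${base}.conllu"

echo "$broken"
echo "$expected"
```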

how to reproduce

I've created a branch, udpipe2teitok_bug_reproduction, in my veld repos:

git clone --recurse-submodules --branch udpipe2teitok_bug_reproduction https://github.com/veldhub/veld_chain__demo_teitok-tools
cd veld_chain__demo_teitok-tools
docker compose -f veld_udpipe2teitok.yaml up

After this, folders rather than files are created in the code submodule:

cd code/veld_code__teitok-tools/
ls -l

which shows these four folders (for the two input files in https://github.com/veldhub/veld_chain__demo_teitok-tools/tree/udpipe2teitok_bug_reproduction/data/udpipe2teitok/in ):

drwxr-xr-x. 1 steff steff    0 Dec 16 15:56 udpipeDEU_001.conllu
drwxr-xr-x. 1 steff steff    0 Dec 16 15:56 udpipeDEU_001.xml
drwxr-xr-x. 1 steff steff    0 Dec 16 15:56 udpipeDEU_002.conllu
drwxr-xr-x. 1 steff steff    0 Dec 16 15:56 udpipeDEU_002.xml
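
To confirm the result without reading the ls output by eye, a small check like the following could be run in the same folder. The helper name `classify_path` is made up for this sketch; the file names are the ones from the listing above.

```shell
# Report whether a path is a directory, a regular file, or missing.
# classify_path is a hypothetical helper, not part of teitok-tools.
classify_path() {
  if [ -d "$1" ]; then
    echo "directory"
  elif [ -f "$1" ]; then
    echo "file"
  else
    echo "missing"
  fi
}

# Check the four expected outputs from the listing above.
for name in udpipeDEU_001.conllu udpipeDEU_001.xml \
            udpipeDEU_002.conllu udpipeDEU_002.xml; do
  printf '%s: %s\n' "$name" "$(classify_path "$name")"
done
```

With the bug present, each name should be reported as a directory rather than a file.
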
@SteffRhes changed the title from 'udpipe2teitok.pl tries to write to folder as if it were a file, instead of "into"' to 'udpipe2teitok.pl tries to write to folder as if it were a file' on Dec 16, 2024

@maartenpt
Collaborator

Should be solved now - it now creates the folder if it does not exist, and it does more checks to figure out how to properly name the output files.

@SteffRhes
Author

Thanks Maarten ... but ... err ...where's the commit? Under https://github.com/ufal/teitok-tools/commits/master/ I only see 78d93e9 as the most recent one?

Am I overlooking something?

@SteffRhes
Author

I saw a newer commit de76fa4 and merged it. The behavior changed but the bug is still there.

how to reproduce

I've simplified the bug reproduction by moving the sample data and a hard-coded explicit script call into the code repo.

To reproduce, run:

git clone --branch udpipe2teitok_bug_reproduction https://github.com/veldhub/veld_code__teitok-tools.git
cd veld_code__teitok-tools/
docker compose -f udpipe2teitok_bug_reproduction.yaml up

which executes this perl script (which should be executable without docker as well):

perl Scripts/udpipe2teitok.pl --orgfolder=data/in/ --model=german-hdt-ud-2.6-200830

where data/in/ is the folder containing a txt file.

what's the output:

The command output is:

udpipe2teitok_bug_reproduction-1  | Using model: german-hdt-ud-2.6-200830
udpipe2teitok_bug_reproduction-1  | Treating folder: data/in/
udpipe2teitok_bug_reproduction-1  | 
udpipe2teitok_bug_reproduction-1  | Treating data/in/DEU_001.txt
udpipe2teitok_bug_reproduction-1  |  - Running UDPIPE from http://lindat.mff.cuni.cz/services/udpipe/api/process/german-hdt-ud-2.6-200830
udpipe2teitok_bug_reproduction-1  |  - Writing tmp to udpipeDEU_001.conllu
udpipe2teitok_bug_reproduction-1  |  - Writing to xmlfiles/DEU_001.xml

After this, I have three new folders:

...
drwxr-xr-x. 1 steff steff    0 Jan 24 15:23 udpipe
drwxr-xr-x. 1 steff steff    0 Jan 24 15:23 udpipeDEU_001.conllu
...
drwxr-xr-x. 1 steff steff   22 Jan 24 15:23 xmlfiles

The two folders udpipe and udpipeDEU_001.conllu are empty, while xmlfiles contains DEU_001.xml, which has only the following content:

<TEI>
<teiHeader>
	<fileDesc>
		<profileDesc>
			<langUsage><language code="de">German</language></langUsage>
		</profileDesc>
	</fileDesc>
	<notesStmt><note n="orgfile"></note></notesStmt>
	<revisionDesc><change who="udpipe" when="2025-01-24">dependency parsed with the udpipe web-service using model german-hdt-ud-2.6-200830</change></revisionDesc>
</teiHeader>
<text>

</text>
</TEI>

interpretation

I don't know what causes the almost empty xml file. As for the conllu file, I suspect a missing / somewhere in the code: the log output contains udpipeDEU_001.conllu and xmlfiles/DEU_001.xml, and only the xml path has a /. That might also explain the two folders udpipe and udpipeDEU_001.conllu.

@SteffRhes
Author

SteffRhes commented Feb 28, 2025

Okay, specifying a tmp folder explicitly works. E.g.

perl Scripts/udpipe2teitok.pl --orgfolder=data/in/ --model=german-hdt-ud-2.6-200830 --tmpfolder=/tmp
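
A variant of the same workaround that avoids writing directly into /tmp. This assumes that any existing, writable directory behaves like /tmp here, which I have not verified; the flags are the ones from the call above.

```shell
# Assumption: --tmpfolder accepts any existing, writable directory,
# as it did with /tmp. mktemp -d creates a fresh, empty one.
tmpdir="$(mktemp -d)"
tmp_arg="--tmpfolder=${tmpdir}"
echo "Using ${tmp_arg}"

# Only run the script when called from the repository root, where it exists.
if [ -f Scripts/udpipe2teitok.pl ]; then
  perl Scripts/udpipe2teitok.pl --orgfolder=data/in/ \
    --model=german-hdt-ud-2.6-200830 "$tmp_arg"
fi
```

The intermediate conllu files should then end up in the freshly created directory, which can be removed afterwards.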
