Issues with download_udpos #1

Open
zphang opened this issue Apr 5, 2020 · 1 comment

zphang commented Apr 5, 2020

Hi,

I'm currently running the download script for XTREME. I'm running into some issues with the downloading and preprocessing of the UD data, and wanted to check if some of these are an issue with my setup or an issue with the provided code.

  1. The download script calls the third-party ud-conversion-tools file $REPO/third_party/ud-conversion-tools/conllu_to_conll.py. However, that file contains the line
    from lib.conll import CoNLLReader
    but the lib folder from ud-conversion-tools is not included in $REPO/third_party/ud-conversion-tools. I was able to get around this by separately cloning https://github.com/coastalcph/ud-conversion-tools and adding the clone to my PYTHONPATH.
  2. After correcting for the above, it looks like a good number of the preprocessing commands for UD are able to work, but a small number still run into some errors (or warnings). Are these to be expected? (These are just messages I grabbed during my run)
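
The workaround from item 1 can also be applied from inside Python instead of through the PYTHONPATH environment variable. This is a minimal sketch, assuming ud-conversion-tools was cloned to a local directory; the helper name `add_ud_tools_to_path` is hypothetical and not part of XTREME or ud-conversion-tools:

```python
import os
import sys


def add_ud_tools_to_path(clone_dir):
    """Make `lib.conll` importable from a local clone of ud-conversion-tools.

    `clone_dir` is assumed to be the root of the cloned repository, i.e. the
    directory that contains the `lib/` folder. Returns True if `lib/conll.py`
    was found and the directory is on sys.path, False otherwise.
    """
    clone_dir = os.path.abspath(clone_dir)
    if not os.path.isfile(os.path.join(clone_dir, "lib", "conll.py")):
        return False
    if clone_dir not in sys.path:
        # Prepend so this copy is found before any other package named `lib`.
        sys.path.insert(0, clone_dir)
    return True
```

Calling the helper before `from lib.conll import CoNLLReader` should have the same effect as exporting PYTHONPATH, and the return value makes a missing `lib` folder visible early.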

Case 1.

python /mypath/xtreme/third_party/ud-conversion-tools/conllu_to_conll.py /mypath/xtreme/download//udpos-tmp/ud-treebanks-v2.5/UD_Dutch-Alpino/nl_alpino-ud-train.conllu /mypath/xtreme/download//udpos-tmp/conll//nl//nl_alpino-ud-train.conll --lang nl --replace_subtokens_with_fused_forms --print_fused_forms
Traceback (most recent call last):
  File "/mypath/xtreme/third_party/ud-conversion-tools/conllu_to_conll.py", line 53, in <module>
    main()
  File "/mypath/xtreme/third_party/ud-conversion-tools/conllu_to_conll.py", line 41, in main
    orig_treebank = cio.read_conll_u(args.input)#, args.keep_fused_forms, args.lang, POSRANKPRECEDENCEDICT)
  File "/mypath/xtreme/ud-conversion-tools/lib/conll.py", line 350, in read_conll_u
    token_dict = {key: conv_fn(val) for (key, conv_fn), val in zip(self.CONLL_U_COLUMNS, parts)}
  File "/mypath/xtreme/ud-conversion-tools/lib/conll.py", line 350, in <dictcomp>
    token_dict = {key: conv_fn(val) for (key, conv_fn), val in zip(self.CONLL_U_COLUMNS, parts)}
  File "/mypath/xtreme/ud-conversion-tools/lib/conll.py", line 26, in parse_deps
    return [(int(pair[0]), pair[1]) for pair in dep_pairs]
  File "/mypath/xtreme/ud-conversion-tools/lib/conll.py", line 26, in <listcomp>
    return [(int(pair[0]), pair[1]) for pair in dep_pairs]
ValueError: invalid literal for int() with base 10: '5.1'

Case 2.

Not a tree after fused-form heuristics: غزة 15 - 8 ( اف ب ) - حذرت الجبهة الشعبية لتحرير فلسطين وحزب الخلاص الوطني ، الاسلامي القريب من حركة حماس ، من اية محاولات او اف منه الى وكالة فرانس برس الى " ضرورة الحفاظ على المصداقية في هذا الخصوص والا فان الدولة ستتحول الى ورقة استهلاكية تستخدم في المناسبات " .
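
For context, "not a tree" presumably means that the head pointers of the sentence no longer form a single dependency tree after the fused-form rewrite. The sketch below shows one way such a check can be written with the standard library only. It is an illustration under the single-root convention, not the check that ud-conversion-tools actually performs, and the function name `is_single_rooted_tree` is hypothetical:

```python
def is_single_rooted_tree(heads):
    """Check whether a list of head indices forms one dependency tree.

    `heads[i]` is the head of token i+1 (tokens are 1-indexed, as in CoNLL);
    a head of 0 marks attachment to the artificial root.
    """
    n = len(heads)
    # Exactly one token may attach to the root, and all heads must be in range.
    if n == 0 or heads.count(0) != 1 or any(h < 0 or h > n for h in heads):
        return False
    for token in range(1, n + 1):
        seen = set()
        node = token
        # Walk up towards the root; revisiting a node means there is a cycle.
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True
```

A sentence that fails this kind of check can still carry valid POS tags, since the tags do not depend on the head column.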

Case 3.

Traceback (most recent call last):
  File "/mypath/xtreme/third_party/ud-conversion-tools/conllu_to_conll.py", line 53, in <module>
    main()
  File "/mypath/xtreme/third_party/ud-conversion-tools/conllu_to_conll.py", line 48, in main
    s.filter_sentence_content(args.replace_subtokens_with_fused_forms, args.lang, current_pos_precedence_list,args.remove_node_properties,args.remove_deprel_suffixes,args.remove_arabic_diacritics)
  File "/mypath/xtreme/ud-conversion-tools/lib/conll.py", line 219, in filter_sentence_content
    self._keep_fused_form(posPreferenceDict)
  File "/mypath/xtreme/ud-conversion-tools/lib/conll.py", line 179, in _keep_fused_form
    deprel = self[localhead][ext_dep]["deprel"]
KeyError: 3

Thanks!

@JunjieHu
Owner

Hi @zphang
Thanks a lot for pointing out the issues w/ detailed cases!

  1. The .gitignore file made me miss the lib folder. I've just uploaded my modified conll.py file. As for the error in your Case 1, a very small number of words in some files have non-integer indices, so I filter them out here:
    https://github.com/JunjieHu/xtreme/blob/develop/third_party/ud-conversion-tools/lib/conll.py#L28

  2. That warning appears because the heuristic conversion breaks the single tree structure of the sentence. Since we are mostly working on the POS tagging task, that should be fine. I also commented out that warning.
    https://github.com/JunjieHu/xtreme/blob/develop/third_party/ud-conversion-tools/lib/conll.py#L229

  3. If you use my uploaded file, there should not be such errors. I just tested the download script one more time on a fresh machine.
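
The filtering described in item 1 could look roughly like the sketch below. It is an illustration, not the code at the linked line: the name `parse_deps` is taken from the traceback in Case 1, while the body is an assumption based on the CoNLL-U format, where the DEPS column is a `|`-separated list of `head:deprel` pairs and empty nodes carry decimal ids such as `5.1`.

```python
def parse_deps(field):
    """Parse a CoNLL-U DEPS field into a list of (head, deprel) pairs.

    Pairs whose head is not a plain integer (for example the empty-node
    id '5.1') are skipped instead of raising ValueError.
    """
    if field in ("_", ""):
        return []
    pairs = []
    for item in field.split("|"):
        # Split on the first ':' only, since relations such as
        # 'nsubj:pass' contain a colon themselves.
        head, _, deprel = item.partition(":")
        try:
            pairs.append((int(head), deprel))
        except ValueError:
            continue  # non-integer index: drop this pair
    return pairs
```

Dropping these pairs loses the enhanced-dependency information for the affected tokens, which should not matter when only the POS column is used downstream.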
