Code for the paper "How does Grammatical Gender Affect Noun Representations in Gender-Marking Languages?"

gonenhila/grammatical_gender

How does Grammatical Gender Affect Noun Representations in Gender-Marking Languages?

This project includes the experiments described in the paper:

"How does Grammatical Gender Affect Noun Representations in Gender-Marking Languages?"
Hila Gonen, Yova Kementchedjhieva and Yoav Goldberg, CoNLL 2019 (best paper).

Prerequisites

Debiased embeddings - ready to use

To use the non-debiased and debiased embeddings in German and in Italian, download the files from this folder (under embeddings, 8 files):

  • de: nondebiased, German
  • it: nondebiased, Italian
  • de_lemma_basic_all: debiased, German
  • it_lemma_to_fem: debiased, Italian

These files are preprocessed for fast loading. To load them, use the script source/save_embeds.py:

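# Empty dictionaries that are passed to load_and_normalize
vocab = {}
wv = {}
w2i = {}
# 'en' is the value used in the original example; `path` should point to the downloaded embedding files
load_and_normalize('en', path, vocab, wv, w2i)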
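
Once loaded, the embeddings can be queried for word similarity. The following is a minimal sketch, assuming that the loaded data provides one vector per word plus a word-to-row-index mapping; that layout is an assumption about the script's output and is not confirmed here. The helper `cosine_similarity` and the toy vectors are hypothetical and only stand in for the real embeddings.

```python
import math

def cosine_similarity(word_a, word_b, vectors, word_to_index):
    """Cosine similarity between two words.

    `vectors` is any indexable collection of equal-length float sequences,
    and `word_to_index` maps a word to its row in `vectors`.
    The vectors do not need to be normalized beforehand.
    """
    a = vectors[word_to_index[word_a]]
    b = vectors[word_to_index[word_b]]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for the real data (hypothetical words and values).
toy_vocab = ["katze", "hund", "tisch"]
toy_w2i = {word: index for index, word in enumerate(toy_vocab)}
toy_wv = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

similarity = cosine_similarity("katze", "hund", toy_wv, toy_w2i)
print(similarity)
```

If the loaded vectors are already unit-length, the division by the norms is redundant but harmless.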

Training debiased embeddings

If you want to train debiased embeddings from scratch, here are the steps to take:

  • Download the corpora from this folder (under data) into the data folder (or use your own)

  • Use the script create_pairs_word2vecf.py.

    This will create a dictionary that converts every word in the corpus to its new form. Using this dictionary, the script will create files of pairs for word2vecf.

    Usage for German:

     python create_pairs_word2vecf.py --lang de --lemmatize basic --input ../data/de_corpus_tokenized --output ../data/word2vecf_pairs/de_lemma_basic_all

    Usage for Italian:

     python create_pairs_word2vecf.py --lang it --lemmatize to_fem --input ../data/it_corpus_tokenized --output ../data/word2vecf_pairs/it_lemma_to_fem

  • Next, run word2vecf (scripts can be found here):

    Usage example for German:

     ./count_and_filter -train ../data/word2vecf_pairs/de_lemma_basic_all -cvocab ../data/word2vecf_pairs/de_lemma_basic_all_cv -wvocab ../data/word2vecf_pairs/de_lemma_basic_all_wv -min-count 100
     
     ./word2vecf -train ../data/word2vecf_pairs/de_lemma_basic_all -wvocab ../data/word2vecf_pairs/de_lemma_basic_all_wv -cvocab ../data/word2vecf_pairs/de_lemma_basic_all_cv -output ../data/embeddings/word2vecf/de_lemma_basic_all -dumpcv ../data/embeddings/word2vecf/de_lemma_basic_all_ctx -size 300 -negative 15 -threads 40 -iters 5
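
The pair-creation step above can be illustrated with a small sketch. word2vecf reads a text file with one `word context` pair per line. The code below is not the logic of create_pairs_word2vecf.py: it assumes a symmetric context window and assumes that the conversion dictionary is applied to the context words only, neither of which is stated in this README. The function `make_pairs`, the window size, and the toy mapping are all hypothetical.

```python
def make_pairs(tokens, convert, window=2):
    """Yield (word, context) pairs from one tokenized sentence.

    Each context word is replaced by its entry in `convert` when one exists
    (assumption: only contexts are converted, target words stay as they are).
    """
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                context = tokens[j]
                yield word, convert.get(context, context)

# Hypothetical conversion dictionary and sentence, for illustration only.
convert = {"die": "der", "kleine": "klein"}
tokens = ["die", "kleine", "katze"]

# One "word context" pair per line, the input format word2vecf expects.
lines = [f"{word} {context}" for word, context in make_pairs(tokens, convert, window=1)]
print("\n".join(lines))
```

Writing `lines` to a file would give input of the same shape as the pairs files passed to `count_and_filter` and `word2vecf` via `-train`.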

SimLex-999 inanimate pairs

The SimLex-999 inanimate pairs can be found under the data folder.

Cite

If you find this project useful, please cite the paper:

@inproceedings{grammatical_gonen,
    title = "How Does Grammatical Gender Affect Noun Representations in Gender-Marking Languages?",
    author = "Gonen, Hila and Kementchedjhieva, Yova and Goldberg, Yoav",
    booktitle = "Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)",
    year = "2019",
    pages = "463--471",
}

Contact

If you have any questions or suggestions, please contact Hila Gonen.

License

This project is licensed under the Apache License; see the LICENSE file for details.
