Exploring Gender Biases in Word2vec

ECE324 Project
Winter 2022

Authors

Contents

  1. Data Collection
  2. Gensim Word Embedding Model Training
  3. Custom Word2Vec Definition and Training
  4. Storing the Word Embedding Models
  5. Bolukbasi et al. Debiasing
  6. Zhao et al. Debiasing
  7. Savani et al. Debiasing
  8. Bias Measurements

1. Data Collection

Wikipedia:

  • wiki/Downloading_wiki.ipynb: Jupyter notebook that downloads 2020 Wikipedia articles from TensorFlow's cloud database. (Database) (Code credit)

Gutenberg Books:

  • gutenberg/gutenberg_data.py: Reads the URLs listed in the Gutenberg URL files and loads the data they point to. These functions are called in Word2Vec Model and Bias Measurements.ipynb.
  • Gutenberg URL files (Source):
    • gutenberg/gutenberg-test-urls.csv: test data URLs
    • gutenberg/gutenberg-train-urls.csv: train data URLs
    • gutenberg/gutenberg-validation-urls.csv: validation data URLs
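
As a rough illustration of how the URL files above could be consumed, the sketch below collects the URLs from one CSV file. The helper name `read_urls` and the assumption that each row holds a URL in its first column are hypothetical and not taken from gutenberg/gutenberg_data.py, so check the actual file layout before relying on it.

```python
import csv


def read_urls(csv_path):
    """Return the URLs listed in a CSV file (hypothetical helper).

    Assumes the URL sits in the first column. Rows whose first cell does
    not start with "http" (for example a header row) are skipped.
    """
    urls = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if row and row[0].strip().startswith("http"):
                urls.append(row[0].strip())
    return urls
```

Calling `read_urls("gutenberg/gutenberg-train-urls.csv")` would then give a list of strings that a downloader can iterate over.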

2. Gensim Word Embedding Model Training

  • Exploring_Gender_Biases_in_Word2Vec.ipynb defines the gensim word2vec model (Source) and trains the embeddings on the Gutenberg dataset.
  • wiki/Wiki_Word2Vec_Training.ipynb trains the embeddings on the Wikipedia dataset.

3. Custom Word2Vec Definition and Training

  • The custom word2vec model instance, along with the random perturbation algorithm functions, can be found in custom_word2vec.py.
  • The custom model was trained in the main notebook, Exploring_Gender_Biases_in_Word2Vec.ipynb.

4. Storing the Word Embedding Models

  • /models: For the gensim models, this folder stores the Wikipedia and Gutenberg models produced by the built-in save function in the gensim library. The custom models were saved using pickle, but were not added to this repository because of their large file size.
  • /embeddings/: This folder stores the word embedding dictionaries in .txt format, with one word and its embedding on each line.
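
For readers who want to inspect the .txt embeddings outside the notebooks, here is a minimal loader sketch. It assumes each line is whitespace-separated as `word v1 v2 ... vn` with no header line. The exact delimiter used by the files in /embeddings/ is not stated here, and `load_embeddings` is a hypothetical helper name.

```python
def load_embeddings(path):
    """Load a text embedding file into a dict of word -> list of floats.

    Assumes whitespace-separated lines of the form "word v1 v2 ... vn".
    """
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings
```

If the files turn out to be comma-separated, replacing `line.split()` with `line.strip().split(",")` should be enough.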

5. Bolukbasi et al. Debiasing

Code related to Bolukbasi et al. debiasing is found in the 2016debais folder. See the README in that folder for more details. (Code credit)
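
To give a feel for the core idea, the sketch below shows only the "neutralize" step of hard debiasing: the component of a word vector along a gender direction is removed and the result is renormalized. It is a simplified illustration, not the code in the folder. The gender direction is taken from a single he/she pair rather than from PCA over several definitional pairs, and the toy vectors and function names are invented for the example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def neutralize(word_vec, gender_dir):
    """Remove the component of word_vec along gender_dir, then renormalize."""
    g = normalize(gender_dir)
    proj = dot(word_vec, g)
    residual = [w - proj * gi for w, gi in zip(word_vec, g)]
    return normalize(residual)


# Toy 3-d vectors, invented for illustration only.
he, she = [1.0, 0.2, 0.0], [-1.0, 0.2, 0.0]
gender_dir = [a - b for a, b in zip(he, she)]  # simplification: one pair
nurse = [-0.4, 0.5, 0.7]
debiased = neutralize(nurse, gender_dir)
```

After this step `debiased` has unit length and is orthogonal to the gender direction. A vector lying entirely along the gender direction would leave a zero residual, which this sketch does not guard against.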

6. Zhao et al. Debiasing

Code related to Zhao et al. debiasing is found in the 2018debias folder, along with the debiased word embedding files in .txt format.

7. Savani et al. Debiasing

Code related to Savani et al. debiasing is found in the main notebook, Exploring_Gender_Biases_in_Word2Vec.ipynb, right after the custom word2vec model is trained.

8. Bias Measurements

The three bias measurements (direct, indirect, WEAT) and their results can be found in Exploring_Gender_Biases_in_Word2Vec.ipynb.
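
For orientation, two of these measurements can be sketched in a few lines. The direct bias follows the usual definition as the mean of |cos(w, g)|^c over words expected to be gender neutral, and the WEAT effect size is the difference in mean association between two target sets divided by the standard deviation of the associations over both sets. This is a sketch under assumptions, not the notebook's code: the function names are hypothetical, and using the sample standard deviation is one common choice that may differ from the notebook.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def direct_bias(neutral_vecs, gender_dir, c=1.0):
    """Mean of |cos(w, g)|^c over words expected to be gender neutral."""
    return mean([abs(cosine(w, gender_dir)) ** c for w in neutral_vecs])


def association(w, A, B):
    """WEAT s(w, A, B): mean cosine to attribute set A minus mean cosine to B."""
    return mean([cosine(w, a) for a in A]) - mean([cosine(w, b) for b in B])


def weat_effect_size(X, Y, A, B):
    """Difference of mean associations of X and Y, scaled by the std over both."""
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)


# Toy 2-d vectors, invented for illustration only.
A, B = [[1.0, 0.0]], [[-1.0, 0.0]]      # attribute sets
X = [[1.0, 0.1], [0.9, 0.2]]            # target set leaning toward A
Y = [[-1.0, 0.1], [-0.8, 0.3]]          # target set leaning toward B
effect = weat_effect_size(X, Y, A, B)
```

With these toy vectors `effect` is positive because X sits closer to A and Y closer to B. Swapping X and Y flips its sign.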