Skip to content

lukaselmer/hyperwords

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

56 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

# hyperwords: Hyperparameter-Enabled Word Representations #

hyperwords is a collection of scripts and programs for creating word representations, designed to facilitate academic
research and prototyping of word representations. It allows you to tune many hyperparameters that are pre-set or
ignored in other word representation packages.

hyperwords is free and open software. If you use hyperwords in scientific publication, we would appreciate citations:  
"Improving Distributional Similarity with Lessons Learned from Word Embeddings"
Omer Levy, Yoav Goldberg, and Ido Dagan. TACL 2015.


## Requirements ##
Running hyperwords may require a lot of computational resources:  
- disk space for independently pre-processing the corpus  
- internal memory for loading sparse matrices  
- significant running time; hyperwords is neither optimized nor multi-threaded

hyperwords assumes a *nix shell, and requires Python 2.7 (or later, excluding 3+) with the following packages installed:
numpy, scipy, sparsesvd.


## Quick-Start ##
1. Download the latest version from BitBucket, unzip, and make sure all scripts have running permissions (chmod 755 *.sh).
2. Download a text corpus of your choice.
3. To create word vectors...
    * ...with SVD over PPMI, use: *corpus2svd.sh*
    * ...with SGNS (skip-grams with negative sampling), use: *corpus2sgns.sh*
4. The vectors should be available in textual format under <output_path>/vectors.txt

To explore the list of hyperparameters, use the *-h* or *--help* option.


##Pipeline##
The following figure shows the hyperwords' pipeline:

**DATA:**  raw corpus  =>  corpus  =>  pairs  =>  counts  =>  vocab  
**TRADITIONAL:**  counts + vocab  =>  pmi  =>  svd  
**EMBEDDINGS:**  pairs  + vocab  =>  sgns  

**raw corpus  =>  corpus**  
- *scripts/clean_corpus.sh*
- Eliminates non-alphanumeric tokens from the original corpus.

**corpus  =>  pairs**  
- *corpus2pairs.py*  
- Extracts a collection of word-context pairs from the corpus.

**pairs  =>  counts**  
- *scripts/pairs2counts.sh*
- Aggregates identical word-context pairs.

**counts  =>  vocab**  
- *counts2vocab.py*  
- Creates vocabularies with the words' and contexts' unigram distributions.

**counts + vocab  =>  pmi**  
- *counts2pmi.py*  
- Creates a PMI matrix (*scipy.sparse.csr_matrix*) from the counts.

**pmi  =>  svd**  
- *pmi2svd.py*  
- Factorizes the PMI matrix using SVD. Saves the result as three dense numpy matrices.

**pairs  + vocab  =>  sgns**  
- *word2vecf/word2vecf*
- An external program for creating embeddings with SGNS. For more information, see:  
**"Dependency-Based Word Embeddings". Omer Levy and Yoav Goldberg. ACL 2014.**

An example pipeline is demonstrated in: *example_test.sh*


##Evaluation##
hyperwords also allows easy evaluation of word representations on two tasks: word similarity and analogies.

**Word Similarity**
- *hyperwords/ws_eval.py*
- Compares how a representation ranks pairs of related words by similarity versus human ranking.  
- 5 readily-available datasets

**Analogies**  
- *hyperwords/analogy_eval.py*
- Solves analogy questions, such as: "man is to woman as king is to...?" (answer: queen).  
- 2 readily-available datasets  
- Shows results of two analogy recovery methods: 3CosAdd and 3CosMul. For more information, see:  
**"Linguistic Regularities in Sparse and Explicit Word Representations". Omer Levy and Yoav Goldberg. CoNLL 2014.**

These programs assume that the representation was created by hyperwords, and can be loaded by
*hyperwords.representations.embedding.Embedding*. Dense vectors in textual format (such as the ones produced by word2vec
and GloVe) can be converted to hyperwords' format using *hyperwords/text2numpy.py*.