EmbeddingsTools

EmbeddingsTools.jl is a Julia package that provides additional tools for working with word embeddings, complementing existing packages such as Embeddings.jl. Note that compatibility with other packages is currently limited: in particular, type conversions are not yet implemented. Still, the package can be used as a standalone tool for working with embedding vectors.

Installation

You can install EmbeddingsTools.jl from GitHub through the Julia package manager. From the Julia REPL, type ] to enter the Pkg REPL mode and run:

pkg> add https://github.com/Marwolaeth/EmbeddingsTools.jl.git

Or, within your Julia environment, use the following command:

using Pkg
Pkg.add(url="https://github.com/Marwolaeth/EmbeddingsTools.jl.git")

Usage

The package is intended for reading local embedding files and currently supports only text formats (e.g., .vec) and binary Julia files. It can also perform basic operations on the loaded embeddings.

The embeddings are represented as either the WordEmbedding or the IndexedWordEmbedding type. Both types contain an embedding table and a token vocabulary, similar to the embedding objects in Embeddings.jl. They also have ntokens and ndims fields that store the dimensions of the embedding table. In addition, IndexedWordEmbedding objects carry an extra lookup dictionary that maps each token to a view of its embedding vector.

Indexing is useful when the embedding table must be aligned with a pre-existing vocabulary, such as the one obtained from a corpus of texts.
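
For a quick look at these fields, here is a minimal sketch, assuming embtable is a WordEmbedding loaded as shown in the next section:

# Inspect the structure of a loaded embedding object
embtable.ntokens     # number of tokens in the vocabulary
embtable.ndims       # dimensionality of the embedding vectors
embtable.vocab[1:5]  # the first five tokens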

Loading Word Embeddings

The original goal of the package was to let users read local embedding vectors into Julia, a feature we found quite limited in Embeddings.jl. For example, a user can manually download an embedding table, e.g., from the FastText repository or the RusVectōrēs project (a collection of Ukrainian and Russian embeddings), and then read it into Julia using:

using EmbeddingsTools

# download and unzip the embedding file
## unless you prefer to do it manually
download(
    "https://rusvectores.org/static/models/rusvectores4/taiga/taiga_upos_skipgram_300_2_2018.vec.gz",
    "taiga_upos_skipgram_300_2_2018.vec.gz"
)
run(`gzip -dk taiga_upos_skipgram_300_2_2018.vec.gz`);

# Load word embeddings from a file
embtable = read_vec("taiga_upos_skipgram_300_2_2018.vec")

The read_vec() function is the basic reader. It takes two arguments, path and delim (the delimiter), and creates a WordEmbedding object using CSV.jl. It reads the entire embedding table at once, which yields good performance thanks to its straightforward logic; however, it may fail on embeddings with more than 500,000 words.

read_embedding() is an alternative function that provides more control through keyword arguments. If max_vocab_size is specified, the function limits the vocabulary to that number of tokens. If a vector keep_words is provided, only those words are kept. If a word in keep_words is not found, the function returns a zero vector for that word.
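
For instance, a keep_words call might look like the following sketch (the tokens follow this model's POS-tagged vocabulary format; "мир_NOUN" is a hypothetical example):

# Load embedding vectors only for selected words;
# any word missing from the model gets a zero vector
embtable_kw = read_embedding(
    "taiga_upos_skipgram_300_2_2018.vec",
    keep_words=["человек_NOUN", "мир_NOUN"]
)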

If the file holds a WordEmbedding object in a Julia binary format (with the .jld extension, or the package-specific .emb or .wem formats), the entire embedding is loaded and the keyword arguments do not apply. You can also call the read_emb() function directly on binary files. See ?write_embedding for saving embedding objects to Julia binary files so they can be re-read faster in the future.
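
A save-and-reload round trip might look like the following sketch (the argument order of write_embedding is an assumption here; check ?write_embedding for the exact signature):

# Save the loaded embedding to a binary file for faster re-reading
# (hypothetical argument order; see ?write_embedding)
write_embedding("taiga_300.emb", embtable)

# Read it back directly
embtable_bin = read_emb("taiga_300.emb")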

# Load word embeddings for 10k most frequent words in a model
embtable = read_embedding(
    "taiga_upos_skipgram_300_2_2018.vec",
    max_vocab_size=10_000
)

Creating Embedding Indices

Some functions in EmbeddingsTools.jl behave differently depending on whether the embedding object contains a lookup dictionary. An object with a lookup dictionary has the type IndexedWordEmbedding and is considerably faster to operate on. An object without one has the type WordEmbedding; indexing it takes a bit of time and should be done only when necessary. To index an embedding object, call either the constructor IndexedWordEmbedding() or index() on it.

# These are equivalent
embtable_ind = IndexedWordEmbedding(embtable)
embtable_ind = index(embtable)

Querying Embeddings

We can use the get_vector() function with either an indexed or a plain embedding table to obtain the word vector for a given word:

get_vector(embtable, "человек_NOUN")
get_vector(embtable_ind, "человек_NOUN")

Limiting Embedding Vocabulary

Regardless of whether we read the embedding with a limited vocabulary size, we can limit it afterwards with the limit() function:

small_embtable = limit(embtable, 111)

Embedding Subspaces

At times, we may need to adjust an embedding table to match a set of words or tokens. This could be the result of pre-processing a corpus of text documents using the TextAnalysis.jl package. The subspace() function can be used to create a new WordEmbedding object from an existing embedding and a vector of strings containing the words or tokens of interest. The order of the new embedding vectors corresponds to the order of the input tokens. If a token is not present in the source embedding vocabulary, a zero vector is returned for that token.

It's important to note that the subspace() method performs much faster when used with an indexed embedding object.

words = embtable.vocab[13:26]
embtable2 = subspace(embtable_ind, words)

Dimensionality Reduction

The reduce_emb() function lets you reduce the dimensionality of embedding objects, whether they are indexed or not. You can choose between two reduction techniques via the method keyword: "pca" (the default) for Principal Component Analysis or "svd" for Singular Value Decomposition.

# Reduce the dimensionality of the word embeddings using PCA or SVD
embtable20 = reduce_emb(embtable, 20)
embtable20_svd = reduce_emb(embtable, 20, method="svd")

Compatibility

As of the current version, EmbeddingsTools.jl has limited compatibility with Embeddings.jl, the package that inspired this project. We are actively working on expanding compatibility and interoperability with a wider range of packages.

Contributing

We welcome contributions from the community to enhance the functionality and compatibility of EmbeddingsTools.jl. If you encounter any issues or have ideas for improvement, please feel free to open an issue or submit a pull request on our GitHub repository.

License

EmbeddingsTools.jl is provided under the MIT License. See the LICENSE file for more details.
