Thank you for your interest in contributing to Rusty-DAWG! There are lots of interesting algorithmic improvements under the hood to make Rusty-DAWG more efficient and scalable. Get in contact if you want to help out!
This file provides some high-level documentation about the library and guidance for code style and deployments.
Implements graph types for representing the DAWG. The only one currently supported is AvlGraph
: a memory-efficient representation of a graph by a single list of nodes and a single list of edges. The list of edges associated with a node are stored by a balanced binary tree. This follows a similar API to the petgraph Graph
class, but transitions are much more efficient with a large branching factor (O(b)
-> O(log b)
).
Builds a DAWG (represented as a graph) following the classical construction algorithm due to Blumer (1984).
Roughly speaking, the states/nodes of the DAWG represent different substrings (actually sets of "equivalent" substrings/indices) in the corpus. We associate additional information with each state via a weight to record information about that state, such as the length of the substring it represents and optionally its frequency in the corpus.
This is implemented by src/weight
. There are various options here depending on how much memory we want to use and which quantities we want the weights to track.
This module implements two tokenization strategies: either tokenize by whitespace and identify each token string to an integer ID (TokenIndex
) or assume the data has already been tokenized and is space-separated (NullTokenIndex
).
By default, main.rs will use NullTokenIndex
, but you can pass --tokenize
to use a custom TokenIndex
.
Much of the other modules are for n-gram language modeling and can be ignored for our purposes. src/lms
implements various types of n-gram language models on top of a DAWG. src/evaluator
implements logic to evaluate these n-gram language models on a test set. src/stat_utils
implements a library of basic statistical functions for n-gram language modeling.
Before contributing code, make sure to run format and clippy:
cargo clippy --fix
cargo fmt --
To test for possible other clippy issues, run:
cargo clippy --all-targets -- -D warnings \
-A clippy::comparison_chain \
-A clippy::upper-case-acronyms \
-A dead-code
Follow these steps to create a new release of Rusty-DAWG.
- Install
toml-cli
if you haven't already (cargo install toml-cli --version 0.2.3
). - Run the script
./scripts/release.sh
and follow the prompts.
If for some reason the GitHub Actions release workflow failed with an error that needs to be fixed, you'll have to delete both the tag and corresponding release from GitHub. After you've pushed a fix, delete the tag from your local clone with
git tag -l | xargs git tag -d && git fetch -t
Then repeat the steps above.