Skip to content

Latest commit

 

History

History
66 lines (40 loc) · 3.24 KB

CONTRIBUTING.md

File metadata and controls

66 lines (40 loc) · 3.24 KB

Contributing to Rusty-DAWG

Thank you for your interest in contributing to Rusty-DAWG! There are lots of interesting algorithmic improvements under the hood to make Rusty-DAWG more efficient and scalable. Get in contact if you want to help out!

This file provides some high-level documentation about the library and guidance for code style and deployments.

Core Packages

Graph Implementation: src/graph

Implements graph types for representing the DAWG. The only one currently supported is AvlGraph: a memory-efficient representation of a graph by a single list of nodes and a single list of edges. The list of edges associated with a node are stored by a balanced binary tree. This follows a similar API to the petgraph Graph class, but transitions are much more efficient with a large branching factor (O(b) -> O(log b)).

DAWG Construction Algorithm: src/dawg

Builds a DAWG (represented as a graph) following the classical construction algorithm due to Blumer (1984).

Weights of DAWG States: src/weight

Roughly speaking, the states/nodes of the DAWG represent different substrings (actually sets of "equivalent" substrings/indices) in the corpus. We associate additional information with each state via a weight to record information about that state, such as the length of the substring it represents and optionally its frequency in the corpus.

This is implemented by src/weight. There are various options here depending on how much memory we want to use and which quantities we want the weights to track.

Tokenization: src/tokenize

This module implements two tokenization strategies: either tokenize by whitespace and identify each token string to an integer ID (TokenIndex) or assume the data has already been tokenized and is space-separated (NullTokenIndex).

By default, main.rs will use NullTokenIndex, but you can pass --tokenize to use a custom TokenIndex.

Other Modules

Much of the other modules are for n-gram language modeling and can be ignored for our purposes. src/lms implements various types of n-gram language models on top of a DAWG. src/evaluator implements logic to evaluate these n-gram language models on a test set. src/stat_utils implements a library of basic statistical functions for n-gram language modeling.

Code style

Before contributing code, make sure to run format and clippy:

cargo clippy --fix
cargo fmt --

To test for possible other clippy issues, run:

cargo clippy --all-targets -- -D warnings \
    -A clippy::comparison_chain \
    -A clippy::upper-case-acronyms \
    -A dead-code

Publishing New Releases

Follow these steps to create a new release of Rusty-DAWG.

  1. Install toml-cli if you haven't already (cargo install toml-cli --version 0.2.3).
  2. Run the script ./scripts/release.sh and follow the prompts.

Fixing a Failed Release

If for some reason the GitHub Actions release workflow failed with an error that needs to be fixed, you'll have to delete both the tag and corresponding release from GitHub. After you've pushed a fix, delete the tag from your local clone with

git tag -l | xargs git tag -d && git fetch -t

Then repeat the steps above.