Have you ever wanted to quickly search for text in a massive corpus, such as a language model pretraining dataset? Now you can, using the magic of directed acyclic word graphs (DAWGs) implemented in Rust. Some ways the Rusty-DAWG library could be directly useful:
- Check which substrings of some text occurred in a pretraining corpus, and how often (see the sketch after this list)
- Deduplicate or decontaminate a pretraining corpus
- Build an unbounded-length n-gram model
- Build a retrieval language model
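To make the first use case concrete, here is a brute-force pure-Python sketch of the query semantics: counting how often every substring of a query occurs in a corpus of token IDs. The function names are illustrative, not part of the library; the point of a (C)DAWG index is that it answers these queries without rescanning the corpus.

```python
# Brute-force reference for the query a (C)DAWG accelerates: for each
# substring (n-gram) of `query`, how often does it occur in `corpus`?
# Illustrative only; an index answers this without scanning the corpus.

def count_occurrences(corpus: list, pattern: list) -> int:
    """Count (possibly overlapping) occurrences of `pattern` in `corpus`."""
    m = len(pattern)
    return sum(corpus[i:i + m] == pattern for i in range(len(corpus) - m + 1))

def substring_counts(corpus: list, query: list) -> dict:
    """Map every substring of `query` to its frequency in `corpus`."""
    counts = {}
    for i in range(len(query)):
        for j in range(i + 1, len(query) + 1):
            ngram = tuple(query[i:j])
            if ngram not in counts:
                counts[ngram] = count_occurrences(corpus, list(ngram))
    return counts

corpus = [1, 2, 3, 1, 2]                 # token IDs
print(substring_counts(corpus, [1, 2]))  # {(1,): 2, (1, 2): 2, (2,): 2}
```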
The key features of the Rusty-DAWG library are:
- Build a DAWG or CDAWG index on a corpus with a one-liner. The (C)DAWG can be saved in two formats: a graph stored in RAM and a graph stored on disk.
- Use Python bindings to load a saved (C)DAWG for doing fast n-gram search (you can also load it in Rust, but we recommend working with the Python API).
- An API server and web demo for visualizing n-gram search results over a pre-built (C)DAWG.
This library was started by Will Merrill and Yanai Elazar as part of an internship project at AI2. Ananya Jha, Rodney Kinney, Jackson Petty, Luca Soldaini, David Wadden, and Pete Walsh have all since contributed. We've also appreciated the support of Michal Guerquin, Johann Dahm, and other members of the Beaker team at AI2 for getting the library to run at very large scale, as well as the data structures expertise of Shunsuke Inenaga.
To install Rust, simply use the one-liner here.
To run tests, you can call Cargo (which should have been installed with Rust) from inside the repo directory:
cargo test
To compile an optimized release build, you can run:
cargo build --release
Note that the `--release` flag is very important for performance: the code will be 10-100x slower without it.
To run the benchmarking script, you need the WikiText-2/103 data. You can either download it to the `rusty-dawg/data` path or point to an existing directory (easy on Beaker, where you can use my copy of the data).
For the first option, download the data directory, unzip it, and put it in the root of the repository (i.e., `rusty-dawg/data`). Then you can run:
./scripts/benchmark.sh wikitext-2-raw
If the data is stored somewhere else, you can do:
DATA=/home/willm/splits ./scripts/benchmark.sh wikitext-2-raw
The core functionality of Rusty-DAWG is to build DAWGs and CDAWGs, which are indexing structures for large corpora. The CDAWG is a strict improvement over the DAWG, so we recommend using the CDAWG if you are building a new index from scratch.
To get started building a CDAWG on your corpus, we recommend adapting the `scripts/cdawg/run_pile.sh` script. This script was written to build a CDAWG on the Pile. Assuming you have access to the Pile in `.jsonl.gz` format (ask William Merrill), the script can be run as follows:
scripts/cdawg/run_pile.sh /home/willm/data/pile/00_0.json.gz /home/willm/cdawgs/00_0
Here the first argument is the path where the input corpus can be found.
By default, this should be a path to an input file in `.jsonl.gz` format, where each line looks like:
{"text": "this is a document", "meta": {"data": "here"}}
The `meta` key must be present, but `"meta": {}` can be specified if no metadata exists. If you wish to pass input data in a different format, you can change the `--data-reader` flag to a different option.
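For reference, here is a minimal standard-library sketch that writes a toy corpus in this format; the file name is just a placeholder:

```python
# Minimal sketch: write a toy corpus in the .jsonl.gz format described above.
# One JSON object per line; "meta" must be present (possibly empty).
import gzip
import json

documents = [
    {"text": "this is a document", "meta": {"data": "here"}},
    {"text": "another document with no metadata", "meta": {}},
]

# "corpus.jsonl.gz" is a placeholder path, not one the script requires.
with gzip.open("corpus.jsonl.gz", "wt", encoding="utf-8") as f:
    for doc in documents:
        f.write(json.dumps(doc) + "\n")
```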
The second argument is a path at which the output CDAWG will be created (along with a disk vector storing a copy of the training tokens and a log of CDAWG stats during building).
Other arguments in the `scripts/cdawg/run_pile.sh` script:
- `N_TOKENS`, `NODES_RATIO`, and `EDGES_RATIO`: These are used to allocate memory for the CDAWG. `N_TOKENS` should be an upper bound on the number of tokens in the dataset. `NODES_RATIO` and `EDGES_RATIO` should be upper bounds on the number of nodes and edges per input token. For the DAWG, these have upper bounds of 2 and 3; for the CDAWG, they will typically be (well) below 1 and 2. You can estimate these values for a large dataset by simply building on a smaller chunk of the data first and extrapolating (see the sketch after this list).
- Tokenizer: By default, this script uses the `gpt2` tokenizer. You might consider using a different tokenizer, since `gpt2` treats whitespace somewhat poorly.
- Cache size: This parameter simply controls how many bytes of text are read into RAM at once while decompressing the training data. It isn't that important, but if you run into RAM issues, you should lower it!
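As one way to pick `N_TOKENS`, here is a hedged sketch that upper-bounds the token count of a `.jsonl.gz` corpus with the Hugging Face `gpt2` tokenizer. The path and the 10% safety margin are assumptions for illustration, not something the script requires:

```python
# Sketch: estimate an upper bound for N_TOKENS by counting gpt2 tokens.
# Assumes the `transformers` package; "corpus.jsonl.gz" is a placeholder.
import gzip
import json

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

n_tokens = 0
with gzip.open("corpus.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        n_tokens += len(tokenizer.encode(doc["text"]))

# Pad the count so N_TOKENS remains a true upper bound.
print("suggested N_TOKENS:", int(1.1 * n_tokens))
```

For `NODES_RATIO` and `EDGES_RATIO`, the same idea applies: build on a small chunk, read off the nodes and edges per token from the build stats log, and extrapolate.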
The library is implemented in Rust, but DAWGs, once built, can be loaded and used easily in Python! You can even build DAWGs from scratch using the Python bindings, though we don't necessarily recommend that.
The Python bindings are generated using maturin. First install maturin in your Python environment:
pip install maturin
Then, you should be able to build (or rebuild) the Python bindings with:
source scripts/rebuild_bindings.sh
If the above doesn't work, you can alternatively `cd` into the Python bindings directory (`bindings/python`) and run:
make install
If the above still doesn't work, you can build the bindings manually in two steps:
python -m maturin build --release
pip install target/wheels/*.whl
After installing the bindings, you should be able to import the library:
from rusty_dawg import Cdawg, DiskCdawg
Refer to `scripts/cdawg/test_cdawg_matches_dawg.py` for an example of how to build and use a CDAWG in RAM with the Python bindings. To use a disk CDAWG, use `DiskCdawg` instead of `Cdawg`. `scripts/cdawg/test_load_cdawg.py` shows an example of how to load a pre-built `DiskCdawg`.
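For orientation only, the shape of in-RAM usage might look like the following; the constructor call here is an assumption, not the documented API, so defer to the test scripts above for the real calls:

```python
# CAUTION: illustrative sketch. The constructor below is an ASSUMED
# placeholder; see scripts/cdawg/test_cdawg_matches_dawg.py for the real API.
from rusty_dawg import Cdawg

tokens = [17, 42, 17, 42, 99]  # toy corpus of token IDs

cdawg = Cdawg(tokens)  # hypothetical: build an in-RAM CDAWG over the tokens
```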
If you found Rusty-DAWG useful, please cite it with either the ACL Anthology citation or the following:
@inproceedings{merrill-etal-2024-evaluating,
title = "Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG",
    author = "William Merrill and Noah A. Smith and Yanai Elazar",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://openreview.net/forum?id=NgWSakw55z",
}