Rust library for indexing and quickly searching large pretraining corpora

viking-sudo-rm/rusty-dawg

Rusty-DAWG

Have you ever wanted to quickly search for text in a massive corpus, such as a language model pretraining dataset? Now you can, using the magic of directed acyclic word graphs (DAWGs) implemented in Rust. Some ways the Rusty-DAWG library could be directly useful:

  1. Check which substrings of some text occurred in a pretraining corpus, and how often
  2. Deduplicate or decontaminate a pretraining corpus
  3. Build an unbounded-length n-gram model
  4. Build a retrieval language model

The key features of the Rusty-DAWG library are:

  1. Build a DAWG or CDAWG index on a corpus with a one-liner. The (C)DAWG can be saved in two formats: a graph stored in RAM and a graph stored on disk.
  2. Use Python bindings to load a saved (C)DAWG for doing fast n-gram search (you can also load it in Rust, but we recommend working with the Python API).
  3. An API server and web demo for visualizing n-gram search results over a pre-built (C)DAWG.

Authors

This library was started by Will Merrill and Yanai Elazar as part of an internship project at AI2. Ananya Jha, Rodney Kinney, Jackson Petty, Luca Soldaini, David Wadden, and Pete Walsh have all since contributed. We've also appreciated the support of Michal Guerquin, Johann Dahm, and other members of the Beaker team at AI2 in getting the library to run at very large scale, as well as the data structures expertise of Shunsuke Inenaga.

Getting Started

Installing Rust

Simply use the one-liner here.

Testing and Building Rusty-DAWG

To run tests, you can call Cargo (which should have been installed with Rust) from inside the repo directory:

cargo test

To compile an optimized release build, you can run:

cargo build --release

Note that the --release flag is very important for performance. The code will be 10-100x slower without it.

Running Benchmarking Script

To run the benchmarking script, you need the Wikitext2/103 data. You can either download it into rusty-dawg/data or point the script at an existing copy of the data (easy on Beaker, where you can use my copy).

For the default location, download the data directory, unzip it, and put it in the root of the repository directory (i.e., rusty-dawg/data). Then you can run:

./scripts/benchmark.sh wikitext-2-raw

If the data is stored somewhere else, you can do:

DATA=/home/willm/splits ./scripts/benchmark.sh wikitext-2-raw

Building Your CDAWG

The core functionality of Rusty-DAWG is to build DAWGs and CDAWGs, which are indexing structures for large corpora. The CDAWG is a strict improvement over the DAWG, so we recommend using the CDAWG if you are building a new index from scratch.
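To make the idea of a DAWG index concrete, here is a toy Python sketch of a suffix automaton (the classic DAWG construction) over a token sequence, with occurrence counts for arbitrary n-gram queries. It is only an illustration under simplifying assumptions: the class and method names are hypothetical, it is not Rusty-DAWG's Rust implementation or its Python API, and it does no edge compression as a CDAWG would.

```python
class SuffixAutomaton:
    """Toy DAWG (suffix automaton) over a sequence of hashable tokens.

    Hypothetical illustration only; not the Rusty-DAWG API.
    """

    def __init__(self, tokens):
        # Parallel lists with one entry per state; state 0 is the source.
        self.next = [{}]    # outgoing edges: token -> target state
        self.link = [-1]    # suffix (failure) links
        self.length = [0]   # length of the longest n-gram reaching the state
        self.count = [0]    # occurrence counts, filled in by _finalize()
        last = 0
        for tok in tokens:
            last = self._extend(last, tok)
        self._finalize()

    def _new_state(self, length, edges, link, count):
        self.next.append(edges)
        self.link.append(link)
        self.length.append(length)
        self.count.append(count)
        return len(self.next) - 1

    def _extend(self, last, tok):
        # Standard online construction: add one token, splitting a state
        # (cloning) when an existing state would otherwise become ambiguous.
        cur = self._new_state(self.length[last] + 1, {}, -1, 1)
        p = last
        while p != -1 and tok not in self.next[p]:
            self.next[p][tok] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
            return cur
        q = self.next[p][tok]
        if self.length[p] + 1 == self.length[q]:
            self.link[cur] = q
            return cur
        clone = self._new_state(
            self.length[p] + 1, dict(self.next[q]), self.link[q], 0
        )
        while p != -1 and self.next[p].get(tok) == q:
            self.next[p][tok] = clone
            p = self.link[p]
        self.link[q] = clone
        self.link[cur] = clone
        return cur

    def _finalize(self):
        # Propagate end-position counts up the suffix links, longest first.
        order = sorted(
            range(1, len(self.next)), key=lambda s: self.length[s], reverse=True
        )
        for s in order:
            self.count[self.link[s]] += self.count[s]

    def count_occurrences(self, query):
        """Return how many times the token sequence `query` occurs."""
        state = 0
        for tok in query:
            state = self.next[state].get(tok)
            if state is None:
                return 0
        return self.count[state]

    def num_nodes(self):
        return len(self.next)

    def num_edges(self):
        return sum(len(edges) for edges in self.next)


corpus = "a b a b c".split()
dawg = SuffixAutomaton(corpus)
print(dawg.count_occurrences(["a", "b"]))  # 2
print(dawg.count_occurrences(["b", "a"]))  # 1
print(dawg.count_occurrences(["c", "a"]))  # 0
```

Each query costs time proportional to its own length, regardless of corpus size, which is what makes this kind of index attractive for n-gram search over large corpora.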

To get started building a CDAWG on your corpus, we recommend adapting the scripts/cdawg/run_pile.sh script. This script was written to build a CDAWG on the Pile. Assuming you have access to the Pile in .jsonl.gz format (ask William Merrill), the script can be run as follows:

scripts/cdawg/run_pile.sh /home/willm/data/pile/00_0.json.gz /home/willm/cdawgs/00_0

Here the first argument is the path where the input corpus can be found. By default, this should be a path to an input file in .jsonl.gz format, where each line looks like:

{"text": "this is a document", "meta": {"data": "here"}}

The meta key must be present, but "meta": {} can be specified if no metadata exists. If you wish to pass input data in a different format, you can change the --data-reader flag to a different option.
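If your corpus is not already in this format, a short script can convert it. The following is a minimal sketch using only the Python standard library; the helper names (write_jsonl_gz, read_jsonl_gz) and the output path are hypothetical and not part of Rusty-DAWG.

```python
import gzip
import json


def write_jsonl_gz(path, documents):
    """Write (text, meta) pairs as gzip-compressed JSON lines.

    Hypothetical helper. A missing meta is written as {} because the
    "meta" key must always be present.
    """
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for text, meta in documents:
            record = {"text": text, "meta": meta if meta is not None else {}}
            f.write(json.dumps(record) + "\n")


def read_jsonl_gz(path):
    """Yield one parsed record per line, e.g. to sanity-check the output."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


if __name__ == "__main__":
    docs = [
        ("this is a document", {"data": "here"}),
        ("a document without metadata", None),
    ]
    write_jsonl_gz("corpus.jsonl.gz", docs)  # hypothetical output path
    for record in read_jsonl_gz("corpus.jsonl.gz"):
        print(record)
```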

The second argument is the path at which the output CDAWG will be created (along with a disk vector storing a copy of the training tokens and a log of CDAWG stats during building).

Other arguments in the scripts/cdawg/run_pile.sh script:

  • N_TOKENS, NODES_RATIO, and EDGES_RATIO: These are used to allocate memory for the CDAWG. N_TOKENS should be an upper bound on the number of tokens in the dataset. NODES_RATIO and EDGES_RATIO should be upper bounds on the number of nodes and the number of edges per input token. For the DAWG, these ratios are bounded above by 2 and 3, respectively; for the CDAWG, they will typically be (well) below 1 and 2. You can estimate these values for a large dataset by building on a smaller chunk of the data first and extrapolating.
  • Tokenizer: By default, this script uses the gpt2 tokenizer. You might consider using a different tokenizer, since gpt2 treats whitespace somewhat poorly.
  • Cache size: This parameter controls how many bytes of text are read into RAM at once while decompressing the training data. It isn't that important, but if you run into RAM issues, you should lower it!
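The extrapolation suggested for NODES_RATIO and EDGES_RATIO is simple arithmetic, sketched below. The function name, the sample numbers, and the safety margin are all assumptions for illustration; in practice the node and edge counts would come from the stats logged while building on the smaller chunk, and the per-token ratios may drift as the corpus grows, so treat the result as a rough starting point.

```python
import math


def estimate_ratios(sample_tokens, sample_nodes, sample_edges,
                    full_tokens, margin=1.2):
    """Extrapolate per-token node/edge ratios from a small build.

    Hypothetical helper. `margin` is an assumed safety factor so the
    estimates act as upper bounds rather than exact fits.
    """
    nodes_ratio = margin * sample_nodes / sample_tokens
    edges_ratio = margin * sample_edges / sample_tokens
    return {
        "N_TOKENS": full_tokens,
        "NODES_RATIO": nodes_ratio,
        "EDGES_RATIO": edges_ratio,
        # Implied capacities for the full dataset, rounded up.
        "max_nodes": math.ceil(nodes_ratio * full_tokens),
        "max_edges": math.ceil(edges_ratio * full_tokens),
    }


if __name__ == "__main__":
    # Made-up numbers for a small chunk; replace with your own measurements.
    estimate = estimate_ratios(
        sample_tokens=1_000_000,
        sample_nodes=250_000,
        sample_edges=900_000,
        full_tokens=2_000_000_000,
    )
    for name, value in estimate.items():
        print(name, value)
```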

Using CDAWGs for Inference in Python

The library is implemented in Rust, but DAWGs, once built, can be loaded and used easily in Python! You can even build DAWGs from scratch using the Python bindings, though we don't necessarily recommend that.

Building the Python Bindings

The Python bindings are generated using maturin. First install maturin in your Python environment:

pip install maturin

Then, you should be able to build (or rebuild) the Python bindings with:

source scripts/rebuild_bindings.sh

If the above doesn't work, you can instead cd into the Python bindings directory (bindings/python) and run:

make install

If that still doesn't work, you can build the bindings in two steps:

python -m maturin build --release
pip install target/wheels/*.whl

Using the Python Library

After installing the bindings, you should be able to import the library:

from rusty_dawg import Cdawg, DiskCdawg

Refer to scripts/cdawg/test_cdawg_matches_dawg.py for an example of how to build and use a CDAWG in RAM with the Python bindings. To work with a CDAWG stored on disk, use DiskCdawg in place of Cdawg. scripts/cdawg/test_load_cdawg.py shows an example of how to load a pre-built DiskCdawg.

Citation

If you found Rusty-DAWG useful, please cite it with either the ACL Anthology citation or the following:

@inproceedings{merrill-etal-2024-evaluating,
      title = "Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG",
      author={William Merrill and Noah A. Smith and Yanai Elazar},
      booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
      month = nov,
      year = "2024",
      address = "Miami, Florida, USA",
      publisher = "Association for Computational Linguistics",
      url = "https://openreview.net/forum?id=NgWSakw55z",
}
