Commit: readme
sangmichaelxie committed Nov 7, 2023
1 parent afd13bb commit 8428565
Showing 2 changed files with 51 additions and 50 deletions.
README.md (51 changes: 1 addition & 50 deletions)
@@ -20,7 +20,7 @@ Code related to the DSIR paper's experiments is in the `experimental/` directory.

## Quickstart

-Install from pip:
+Install with pip:
```
pip install data-selection
```
@@ -70,55 +70,6 @@ Subsequent resampling with the same target data is very cheap, and the runtime does not scale with the number of documents resampled:
- *Resample 10M documents*: 353.68 seconds
- *Resample 100M documents*: 352.69 seconds
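
These timings are consistent with resampling being a light operation over precomputed log importance weights rather than a second pass over the raw text. As a rough illustration of the idea (not the `data-selection` package's actual API), the sketch below hashes n-grams into a fixed number of buckets, scores each document under bag-of-n-grams target and raw models, and resamples without replacement via the Gumbel-top-k trick; the bucket count, hashing scheme, and model fitting are all assumptions made for the example.

```python
import hashlib
import numpy as np

def hashed_ngram_counts(text: str, num_buckets: int = 10_000) -> np.ndarray:
    """Hash unigrams and bigrams into a fixed number of buckets (illustrative only)."""
    counts = np.zeros(num_buckets)
    words = text.lower().split()
    ngrams = words + [" ".join(pair) for pair in zip(words, words[1:])]
    for ngram in ngrams:
        bucket = int(hashlib.md5(ngram.encode()).hexdigest(), 16) % num_buckets
        counts[bucket] += 1
    return counts

def log_importance_weight(counts: np.ndarray,
                          log_p_target: np.ndarray,
                          log_p_raw: np.ndarray) -> float:
    """log w(x) = sum_b counts_b * (log p_target(b) - log p_raw(b)) under bag-of-n-grams models."""
    return float(counts @ (log_p_target - log_p_raw))

def resample_indices(log_weights: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Pick k documents without replacement, with probability proportional to exp(log_weights),
    via the Gumbel-top-k trick. The cost depends on the raw dataset size rather than on k,
    consistent with resampling 10M vs. 100M documents taking about the same time."""
    gumbel_noise = rng.gumbel(size=log_weights.shape)
    return np.argsort(-(log_weights + gumbel_noise))[:k]
```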

## Pre-filtered datasets
Note: previous versions of the datasets had small validation and test splits (50,000 examples each), but we concatenated these onto the end of the train set (validation first, then test) to better align with the paper. The datasets should be further shuffled during preprocessing before training.
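
A minimal sketch of that shuffling step with HuggingFace Datasets (the dataset ID is the first one described below; the seed is arbitrary):

```python
from datasets import load_dataset

# Load the train split (validation and test were concatenated onto its end)
# and shuffle it before any further preprocessing.
dataset = load_dataset("stanford-crfm/DSIR-filtered-pile-50M", split="train")
dataset = dataset.shuffle(seed=42)
```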

### DSIR-filtered-pile-50M
- Target distribution: Wikipedia, BookCorpus2
- Selection method: DSIR (with importance resampling on hashed n-gram model importance weights)
- Raw dataset: The Pile
- Size: 80GB, 51.2M examples
- Used for the 128-token context models in the paper. The examples are long enough for 512- or 1024-token contexts and can also be used at shorter context lengths.
- The dataset contains 51.2M examples, most of which are selected from Pile subsets that are not Wikipedia or books-related (BookCorpus2, Books3, Gutenberg). 4% of the data is randomly selected from Wikipedia and books-related subsets. Every example concatenates 2 snippets, possibly from different sources, to ensure that the examples are long enough for longer context models (512 or 1024 tokens). Metadata about which sources the text comes from is included with every example.
- Available on HuggingFace at https://huggingface.co/datasets/stanford-crfm/DSIR-filtered-pile-50M. Use with HuggingFace Datasets:
```python
from datasets import load_dataset
dataset = load_dataset("stanford-crfm/DSIR-filtered-pile-50M")
```

### heuristic_classification-filtered-pile-50M
- Target distribution: Wikipedia, BookCorpus2
- Selection method: Heuristic classification (FastText binary classifier); an illustrative sketch of this filtering appears after the comparison table below
- Raw dataset: The Pile
- Size: 80GB, 51.2M examples
- Used for the 128-token context models in the paper. The examples are long enough for 512- or 1024-token contexts and can also be used at shorter context lengths.
- The dataset contains 51.2M examples, most of which are selected from Pile subsets that are not Wikipedia or books-related (BookCorpus2, Books3, Gutenberg). 4% of the data is randomly selected from Wikipedia and books-related subsets. Every example concatenates 2 snippets, possibly from different sources, to ensure that the examples are long enough for longer context models (512 or 1024 tokens). Metadata about which sources the text comes from is included with every example.
- Available on HuggingFace at https://huggingface.co/datasets/stanford-crfm/heuristic_classification-filtered-pile-50M. Use with HuggingFace Datasets:
```python
from datasets import load_dataset
dataset = load_dataset("stanford-crfm/heuristic_classification-filtered-pile-50M")
```
- Comparisons for training BERT-base models from scratch (50k steps, max token length 128, batch size 4096):

| GLUE dev | MNLI | QNLI | QQP | RTE | SST2 | MRPC | CoLA | STSB | Avg |
|---------------------------------------------------|------:|------:|------:|------:|------:|------:|------:|------:|------:|
| Random selection from The Pile | 82.63 | 86.9 | 89.57 | 67.37 | 90.05 | 87.40 | 49.41 | 88.63 | 80.25 |
| Heuristic classification (GPT-3/Pile/PaLM method) | 82.69 | 85.95 | 89.77 | 68.59 | 88.94 | 86.03 | 48.17 | 88.62 | 79.85 |
| DSIR | 83.07 | 89.11 | 89.80 | 75.09 | 90.48 | 87.70 | 54.00 | 89.17 | 82.30 |
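
For reference, the heuristic classification baseline in the table (the GPT-3/Pile/PaLM-style method) trains a binary classifier to separate target text from raw text and keeps documents that pass a noisy threshold. The sketch below is an illustrative reconstruction rather than the paper's exact pipeline: the training file, label names, and preprocessing are assumptions, and the Pareto noise with `alpha = 9` follows the recipe described in the GPT-3 paper.

```python
import fasttext
import numpy as np

# Train a fastText binary classifier on lines formatted like:
#   __label__target <text drawn from Wikipedia/BookCorpus2>
#   __label__raw    <text drawn from The Pile>
# "hc_train.txt" is a hypothetical file in this format.
model = fasttext.train_supervised(input="hc_train.txt")

rng = np.random.default_rng(0)

def keep_document(text: str, alpha: float = 9.0) -> bool:
    # fastText expects single-line input, so strip newlines before predicting.
    labels, probs = model.predict(text.replace("\n", " "))
    score = probs[0] if labels[0] == "__label__target" else 1.0 - probs[0]
    # Noisy threshold: keep iff a Pareto(alpha) draw exceeds 1 - score, so
    # high-scoring documents are almost always kept and low-scoring ones rarely are.
    return rng.pareto(alpha) > 1.0 - score
```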


## Pretrained models

In the table below, `{dataset}` can be replaced with one of `{ag, amazon, citation_intent, hyp, imdb, sciie, chemprot, rct-20k}` for the continued pretraining models.

| HuggingFace ID | Link | Dataset size | Max token length | Training steps | Architecture | Initialization | Description |
|---|---|---|---|---|---|---|---|
| dsir-bert-scratch-wiki_and_books | [Link](https://huggingface.co/sangmichaelxie/dsir-bert-scratch-wiki_and_books) | 6.5B tokens (51.2M examples) | 128 | 50000 | bert-base-uncased | scratch | BERT model trained on [DSIR-filtered-pile-50M](https://huggingface.co/datasets/stanford-crfm/DSIR-filtered-pile-50M/viewer/default/train?p=31445&row=3144531) |
| heuristiccls-bert-scratch-wiki_and_books | [Link](https://huggingface.co/sangmichaelxie/heuristiccls-bert-scratch-wiki_and_books) | 6.5B tokens (51.2M examples) | 128 | 50000 | bert-base-uncased | scratch | BERT model trained on Pile data filtered by heuristic classification |
| randomselect-bert-scratch | [Link](https://huggingface.co/sangmichaelxie/randomselect-bert-scratch) | 6.5B tokens (51.2M examples) | 128 | 50000 | bert-base-uncased | scratch | BERT model trained on random subset of The Pile |
| dsir-roberta-continuedpretrain-{dataset} | Link format: `https://huggingface.co/sangmichaelxie/dsir-roberta-continuedpretrain-{dataset}` | 6.4B tokens (25M examples) | 256 | 25000 | roberta-base | roberta-base | RoBERTa model with continued pretraining on data selected by DSIR with target={dataset} |
| heuristiccls-roberta-continuedpretrain-{dataset} | Link format: `https://huggingface.co/sangmichaelxie/heuristiccls-roberta-continuedpretrain-{dataset}` | 6.4B tokens (25M examples) | 256 | 25000 | roberta-base | roberta-base | RoBERTa model with continued pretraining on data selected by heuristic classification with target={dataset} |
| randomselect-roberta-continuedpretrain | [Link](https://huggingface.co/sangmichaelxie/randomselect-roberta-continuedpretrain) | 6.4B tokens (25M examples) | 256 | 25000 | roberta-base | roberta-base | RoBERTa model with continued pretraining on random subset of The Pile |
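
Any of the checkpoints in the table above can be loaded with the HuggingFace `transformers` API. A minimal sketch (the first model ID is taken from the table, and `imdb` is one of the listed `{dataset}` values):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "sangmichaelxie/dsir-bert-scratch-wiki_and_books"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Continued-pretraining checkpoints follow the {dataset} naming pattern, e.g.:
# AutoModelForMaskedLM.from_pretrained("sangmichaelxie/dsir-roberta-continuedpretrain-imdb")
```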

## Citation Information
Paper: <https://arxiv.org/abs/2302.03169>
experimental/README.md (50 changes: 50 additions & 0 deletions)
@@ -20,6 +20,56 @@ We provide scripts for training BERT-style masked language models on the selected data.
4. Evaluate the trained model by editing the evaluation job command in `glue_eval/run_eval_exps.sh` with the path to the model checkpoint. This script runs 5 seeds for each GLUE dataset. The results and finetuned models will be saved in a new `finetune_runs` directory inside the pretrained model checkpoint directory. Kick off the jobs by running `bash glue_eval/run_eval_exps.sh`.
5. Read the GLUE results by running `python read_glue_results.py --results_dir </path/to/checkpoint>/finetune_runs` in the `glue_eval` directory.
