
V1.0 Dataset

Location of the input files

This Google Drive location contains the following.

  • tf1_ckpt folder: contains TF1 checkpoint files

    • model.ckpt-28252.data-00000-of-00001
    • model.ckpt-28252.index
    • model.ckpt-28252.meta
  • tf2_ckpt folder: contains TF2 checkpoint files

    • model.ckpt-28252.data-00000-of-00001
    • model.ckpt-28252.index
  • bert_config.json: Config file which specifies the hyperparameters of the model

  • enwiki-20200101-pages-articles-multistream.xml.bz2 : Compressed file containing wiki data

  • enwiki-20200101-pages-articles-multistream.xml.bz2.md5sum: md5sum hash for the enwiki-20200101-pages-articles-multistream.xml.bz2 file

  • License.txt

  • vocab.txt: Contains WordPiece to id mapping

Alternatively, the TF2 checkpoint can be generated from the TF1 checkpoint using tf2_encoder_checkpoint_converter.py:

python3 tf2_encoder_checkpoint_converter.py \
  --bert_config_file=<path to bert_config.json> \
  --checkpoint_to_convert=<path to tf1 model.ckpt-28252> \
  --converted_checkpoint_path=<path to output tf2 model checkpoint>

Note that the checkpoint converter removes optimizer slot variables, so the resulting TF2 checkpoint is only about 1/3 the size of the TF1 checkpoint.
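
As a quick, optional sanity check, the on-disk sizes of the two data files can be compared; the relative paths below are illustrative and assume the tf1_ckpt and tf2_ckpt folders from the Google Drive listing are in the current directory.

# Optional: compare sizes; the TF2 data file should be roughly 1/3
# the size of the TF1 data file (paths are illustrative).
du -b tf1_ckpt/model.ckpt-28252.data-00000-of-00001 \
      tf2_ckpt/model.ckpt-28252.data-00000-of-00001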

Download and preprocess datasets

The dataset was prepared using Python 3.7.6, nltk 3.4.5 and the tensorflow/tensorflow:1.15.2-gpu docker image.
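
For reference, one way to reproduce that environment is sketched below; the /workspace mount point and the explicit pip install step are illustrative assumptions rather than part of the official setup.

# Start the stated TF 1.15.2 GPU image with the working directory mounted
# (the /workspace mount point is an arbitrary choice).
docker run -it --rm -v "$(pwd)":/workspace -w /workspace \
  tensorflow/tensorflow:1.15.2-gpu bash

# Inside the container: install the nltk version used for preprocessing.
pip install nltk==3.4.5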

Download and uncompress

The files at the time of v0.7 were available at https://dumps.wikimedia.org/enwiki/20200101/enwiki-20200101-pages-articles-multistream.xml.bz2

The files have since been removed from this link. Instead, they are now available at this Google Drive location. The file enwiki-20200101-pages-articles-multistream.xml.bz2, containing the Wikipedia dump, should be downloaded and uncompressed.

# TODO: Change this to correct commit when patch is merged
# Clone the training repo

git clone https://github.com/sgpyc/training
cd training
git checkout bert_fix

# TODO: Add HEAD Commit SHA

cd language_model/tensorflow/bert/cleanup_scripts

# Download and uncompress files from the Google Drive location
source download_and_umcompress.sh
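
If the dump is fetched manually instead of via the helper script, it can be verified and uncompressed as sketched below; this assumes the provided .md5sum file is in the standard md5sum -c format, and -k keeps the original archive.

# Manual alternative (illustrative): verify the archive, then uncompress it
# while keeping the original .bz2.
md5sum -c enwiki-20200101-pages-articles-multistream.xml.bz2.md5sum
bzip2 -dk enwiki-20200101-pages-articles-multistream.xml.bz2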

After downloading and uncompressing the files, confirm that the md5sums match the expected values; an example check is given after the table.

MD5sums of provided files:

| File | Size (bytes) | MD5 |
| --- | --- | --- |
| bert_config.json | 314 | 7f59165e21b7d566db610ff6756c926b |
| vocab.txt | 231,508 | 64800d5d8528ce344256daf115d4965e |
| model.ckpt-28252.index (tf1) | 17,371 | f97de3ae180eb8d479555c939d50d048 |
| model.ckpt-28252.meta (tf1) | 24,740,228 | dbd16c731e8a8113bc08eeed0326b8e7 |
| model.ckpt-28252.data-00000-of-00001 (tf1) | 4,034,713,312 | 50797acd537880bfb5a7ade80d976129 |
| model.ckpt-28252.index (tf2) | 6,420 | fc34dd7a54afc07f2d8e9d64471dc672 |
| model.ckpt-28252.data-00000-of-00001 (tf2) | 1,344,982,997 | 77d642b721cf590c740c762c7f476e04 |
| enwiki-20200101-pages-articles-multistream.xml.bz2 | 17,751,214,669 | 00d47075e0f583fb7c0791fac1c57cb3 |
| enwiki-20200101-pages-articles-multistream.xml | 75,163,254,305 | 1021bd606cba24ffc4b93239f5a09c02 |
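
The hashes can be computed with md5sum, for example as below; the relative paths are illustrative and assume the folder layout of the Google Drive listing above.

# Compute md5sums for comparison against the table above (adjust paths to
# your local directory structure).
md5sum bert_config.json vocab.txt \
       tf1_ckpt/model.ckpt-28252.index \
       tf1_ckpt/model.ckpt-28252.meta \
       tf1_ckpt/model.ckpt-28252.data-00000-of-00001 \
       tf2_ckpt/model.ckpt-28252.index \
       tf2_ckpt/model.ckpt-28252.data-00000-of-00001 \
       enwiki-20200101-pages-articles-multistream.xml.bz2 \
       enwiki-20200101-pages-articles-multistream.xml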

Extract

Run WikiExtractor.py to extract the wiki pages from the XML dump. The location, size, and naming of the generated files are described after the commands below.

Next, clone the WikiExtractor repo, check out the pinned commit, and extract the data from the XML.

git clone https://github.com/attardi/wikiextractor.git

cd wikiextractor

git checkout 3162bb6c3c9ebd2d15be507aa11d6fa818a454ac

# Back to <bert>/cleanup_scripts
cd .. 

# Run `WikiExtractor.py` to extract data from XML.
python wikiextractor/WikiExtractor.py wiki/enwiki-20200101-pages-articles-multistream.xml

The generated wiki pages file will be stored in <bert>/cleanup_scripts/text/<XX>/wiki_<nn> where <XX> are folders from AA to FE and <nn> ranges from 00 to 99.

For example: <bert>/cleanup_scripts/text/BD/wiki_37.

Each file is ~1MB, and each sub directory has 100 files from wiki_00 to wiki_99, except the last sub directory FE. For the 20200101 dump, the last file is FE/wiki_17.
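
A quick structural check of the extracted output is sketched below; it assumes WikiExtractor's default text/ output directory under <bert>/cleanup_scripts.

# Illustrative checks of the extraction layout.
ls text | wc -l                            # number of AA ... FE sub-directories
find text -type f -name 'wiki_*' | wc -l   # total number of wiki_<nn> shards
ls text/FE                                 # expect wiki_00 ... wiki_17 for the 20200101 dump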

Files in /cleanup_scripts/text/FE/:

| File | Size (bytes) | MD5 |
| --- | --- | --- |
| wiki_00 | 1,048,175 | d8ad2f6311e3692e9b5ec9d38bfe8707 |
| wiki_01 | 1,047,515 | f098c976543d39e9aa99f91d278686f8 |
| wiki_02 | 1,047,954 | fab7f42b8df1e3d8dd6db7d672e05cc3 |
| wiki_03 | 1,048,205 | c27cf920d8954f6b76576363d14945ba |
| wiki_04 | 1,047,729 | 0d5ccc12742c2123330b2205ab7bae99 |
| wiki_05 | 1,045,417 | 991f06e6fe50c99e6b50e6f778dc9181 |
| wiki_06 | 1,048,289 | d160d3edcd847b896b988c261d7b3951 |
| wiki_07 | 1,045,378 | 5e8a262f80575aad0f1b3f337fd0a2f9 |
| wiki_08 | 1,047,758 | bbeadd3b9045eb1468d5f546b5013b41 |
| wiki_09 | 1,048,314 | d9d6bf4d61259d7a7760f52da8ca03be |
| wiki_10 | 1,048,422 | a139da62c0cf443401162093a3c8018a |
| wiki_11 | 1,048,255 | 100bd5153de234e4769a6e9baf103d43 |
| wiki_12 | 1,048,548 | 3bda2c6eeea74ef37314e5e3f9d8dbff |
| wiki_13 | 1,046,253 | 9b8084d36640b536458345f6a6400d70 |
| wiki_14 | 1,036,170 | 7d5ca15dab637fc3d36124fd404e037a |
| wiki_15 | 1,048,378 | 9b6dea989a5ca2d46e6f0a0eb730197c |
| wiki_16 | 1,046,493 | ee7870f5dbd4de278825e9d32ee1fa78 |
| wiki_17 | 398,182 | fce4a6b8886e2796409a8588f3e88b75 |

Note: WikiExtractor.py replaces some of the tags present in the XML, such as {CURRENTDAY} and {CURRENTMONTHNAMEGEN}, with current values obtained from time.strftime (code). Hence, the preprocessed files may look slightly different each time WikiExtractor.py is invoked, which means the md5sum hashes of these files will also differ between runs.

Clean up and dataset separation

The scripts are located in cleanup_scripts. Specifically, the files clean.sh, cleanup_file.py, do_gather.py, seperate_test_set.py and do_sentence_segmentation.py are used for further preprocessing. The wrapper shell script process_wiki.sh calls these cleanup scripts and performs end-to-end preprocessing of the files:

./process_wiki.sh './text/*/wiki_??'

After running the process_wiki.sh script:

  • For every file <bert>/cleanup_scripts/text/<XX>/wiki_<nn>, four additional files are generated:

    • <bert>/cleanup_scripts/text/<XX>/wiki_<nn>.1
    • <bert>/cleanup_scripts/text/<XX>/wiki_<nn>.2
    • <bert>/cleanup_scripts/text/<XX>/wiki_<nn>.3
    • <bert>/cleanup_scripts/text/<XX>/wiki_<nn>.4
  • For the 20200101 wiki dump, there will be 502 files in the <bert>/results directory: part-00000-of-00500 through part-00499-of-00500, plus eval.md5 and eval.txt (a quick count is sketched below).
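
A minimal count of the generated outputs (illustrative; <path to bert> is a placeholder for the repository checkout):

# Expect 502 entries: 500 part files plus eval.md5 and eval.txt.
ls <path to bert>/results | wc -l
ls <path to bert>/results | head -3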

Files in /results directory:

| File | Size (bytes) | MD5 |
| --- | --- | --- |
| eval.md5 | 330,000 | 71a58382a68947e93e88aa0d42431b6c |
| eval.txt | 32,851,144 | 2a220f790517261547b1b45ed3ada07a |
| part-00000-of-00500 | 27,150,902 | a64a7c31eff5cd38ae6d94f7a6229dab |
| part-00001-of-00500 | 27,198,569 | 549a9ed4f805257245bec936563abfd0 |
| part-00002-of-00500 | 27,395,616 | 1a1366ddfc03aef9d41ce552ee247abf |
| ... | ... | ... |
| part-00497-of-00500 | 24,775,043 | 66835aa75d4855f2e678e8f3d73812e9 |
| part-00498-of-00500 | 24,575,505 | e6d68a7632e9f4aa1a94128cce556dc9 |
| part-00499-of-00500 | 21,873,644 | b3b087ad24e3770d879a351664cebc5a |

Each of the part-00xxx-of-00500 files and eval.txt contains one sentence of an article per line, with different articles separated by a blank line.
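
Given that layout, a shard can be spot-checked by counting sentence lines and blank-line article separators; the file name is taken from the table above, and <path to bert> is the same placeholder as before.

# Non-blank lines are sentences; blank lines separate articles.
grep -c -v '^$' <path to bert>/results/part-00000-of-00500   # sentence lines
grep -c '^$' <path to bert>/results/part-00000-of-00500      # article separators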