Tools to extract and clean the Wikipedia texts and transform them into a text corpus for self-supervised NLP model training. A prepared corpus for the German and English languages is also included (see below).
We use WikiExtractor to extract the Wikipedia database dumps. The texts are split into sentences with SoMaJo. Each line of the text corpus contains exactly one sentence, and articles are separated by a blank line.
If you want to remove the blank lines from the text corpus, you can use this command: sed -i '/^$/d' <filename>
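Because of this layout, a downstream script can stream the corpus article by article. The following is a minimal sketch with two assumptions that are not part of this repository: the helper function `iter_articles` is hypothetical, and the unzipped file is assumed to be named `dewiki-20220201-clean.txt`.

```python
# Hypothetical helper, not part of this repository: iterate over the corpus
# article by article, using the blank lines as article boundaries.
def iter_articles(path):
    """Yield each article as a list of sentences (one sentence per corpus line)."""
    sentences = []
    with open(path, encoding="utf-8") as corpus:
        for line in corpus:
            line = line.rstrip("\n")
            if line:
                sentences.append(line)
            elif sentences:  # blank line marks the end of the current article
                yield sentences
                sentences = []
    if sentences:  # last article if the file does not end with a blank line
        yield sentences


# Assumed file name of the unzipped German corpus.
for article in iter_articles("dewiki-20220201-clean.txt"):
    print(f"first article has {len(article)} sentences")
    break
```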
- size of the German text corpus (unzipped): 6.1 GB
- number of lines: 59,475,915
- download the individual parts:
- combine the parts:
cat dewiki-20220201-clean-part-* > dewiki-20220201-clean.zip
- optional check:
sha256sum dewiki-20220201-clean.zip
should return 09c47abf6200ecc342e04902e360773f9ba2d92abb64bfa20f22c63fd660edcf (a Python-based check is sketched below)
- unzip the text file:
unzip dewiki-20220201-clean.zip
- size of the English text corpus (unzipped): 14 GB
- number of lines: 146,709,087
- download the individual parts:
- combine the parts:
cat enwiki-20220201-clean-part-* > enwiki-20220201-clean.zip
- optional check:
sha256sum enwiki-20220201-clean.zip
should return 127e8645f1bc1944088df165b613333f84cdca6a24eee38b8cd7ac673352293b
- unzip the text file:
unzip enwiki-20220201-clean.zip
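If you prefer to check the downloads from Python instead of with sha256sum, the following minimal sketch computes the SHA-256 of a combined archive in chunks and compares it against the values listed above. The script itself is not part of this repository; its name and structure are arbitrary.

```python
# Hypothetical check script, not part of this repository: verify a combined
# archive against the SHA-256 values listed above without loading the whole
# multi-gigabyte file into memory.
import hashlib
import sys

EXPECTED = {
    "dewiki-20220201-clean.zip": "09c47abf6200ecc342e04902e360773f9ba2d92abb64bfa20f22c63fd660edcf",
    "enwiki-20220201-clean.zip": "127e8645f1bc1944088df165b613333f84cdca6a24eee38b8cd7ac673352293b",
}


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    filename = sys.argv[1]  # e.g. dewiki-20220201-clean.zip
    actual = sha256_of(filename)
    print("OK" if actual == EXPECTED[filename] else f"MISMATCH: {actual}")
```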
- download the raw Wikipedia dump and store it in the `data` directory:
  - German language: Select the most recent directory from https://dumps.wikimedia.org/dewiki/ and download the file called `dewiki-<yyyymmdd>-pages-articles.xml.bz2`. It is about 5.8 GB in size. We use `dewiki-20220201-pages-articles.xml.bz2`.
  - English language: Select the most recent directory from https://dumps.wikimedia.org/enwiki/ and download the file called `enwiki-<yyyymmdd>-pages-articles.xml.bz2`. It is about 18.1 GB in size. We use `enwiki-20220201-pages-articles.xml.bz2`.
- create and activate a new Python environment (for example with conda)
- install the dependencies:
pip install -r requirements.txt
- for the German data, run:
python -m wikiextractor.WikiExtractor data/dewiki-20220201-pages-articles.xml.bz2 -o data/dewiki-20220201
- for the English data, run:
python -m wikiextractor.WikiExtractor data/enwiki-20220201-pages-articles.xml.bz2 -o data/enwiki-20220201
- use the `process_wiki_files.py` script (a simplified sketch of this step is shown after this list):
  - edit `INPUT_DIR`, `OUTPUT_DIR` and, if needed, `LANGUAGE`
  - run the script
- concatenate the output files in `OUTPUT_DIR` by running: cat <OUTPUT_DIR>/* > my_clean_wiki_corpus.txt
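To make the cleaning step more concrete, here is a minimal sketch of what such a processing script can look like. It is not the repository's `process_wiki_files.py`; it assumes WikiExtractor's default output layout (files named `wiki_*` containing articles wrapped in `<doc ...> ... </doc>` tags) and uses SoMaJo to write one sentence per line with a blank line between articles. Joining tokens with single spaces is a simplification of the real detokenization.

```python
# Simplified sketch of the cleaning step, not the repository's
# process_wiki_files.py. Assumptions: WikiExtractor's default <doc ...> output
# format and SoMaJo's tokenize_text API.
import glob

from somajo import SoMaJo

INPUT_DIR = "data/dewiki-20220201"             # WikiExtractor output directory
OUTPUT_FILE = "data/dewiki-20220201-clean.txt"
LANGUAGE = "de_CMC"                            # use "en_PTB" for English

tokenizer = SoMaJo(LANGUAGE, split_sentences=True)

with open(OUTPUT_FILE, "w", encoding="utf-8") as out:
    for path in sorted(glob.glob(f"{INPUT_DIR}/**/wiki_*", recursive=True)):
        with open(path, encoding="utf-8") as extracted:
            paragraphs = []
            for line in extracted:
                line = line.strip()
                if line.startswith("<doc"):
                    paragraphs = []            # start of a new article
                elif line == "</doc>":
                    # end of the article: write one sentence per line
                    for sentence in tokenizer.tokenize_text(paragraphs):
                        # joining tokens with spaces is a simplification
                        out.write(" ".join(token.text for token in sentence) + "\n")
                    out.write("\n")            # blank line between articles
                elif line:
                    paragraphs.append(line)
```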
Like Wikipedia itself, the text corpus is published under the Creative Commons Attribution-ShareAlike 3.0 Unported license.
Copyright (c) 2022 Philip May
Licensed under the MIT License (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License by reviewing the file LICENSE in the repository.