# dataset

This submodule includes helper functions for downloading datasets and formatting them appropriately, as well as utilities for splitting data for training and testing.
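
For illustration, here is a minimal, self-contained sketch of a train/test split utility written in plain pandas. This is a hypothetical helper; the split utilities actually shipped in this submodule may differ in name and signature.

```python
import pandas as pd


def train_test_split_df(df: pd.DataFrame, test_fraction: float = 0.2, seed: int = 42):
    """Shuffle a DataFrame and split it into train and test partitions.

    Hypothetical helper for illustration only; not necessarily the
    implementation provided by this submodule.
    """
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    cutoff = int(len(shuffled) * (1.0 - test_fraction))
    return shuffled.iloc[:cutoff], shuffled.iloc[cutoff:]


# Example usage on a toy DataFrame.
toy = pd.DataFrame({"sentence": [f"example {i}" for i in range(10)]})
train_df, test_df = train_test_split_df(toy, test_fraction=0.2)
print(len(train_df), len(test_df))  # 8 2
```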

## Data Loading

There are dataloaders for several datasets. For example, the `snli` module lets you load the SNLI dataset into a pandas DataFrame, with an option to limit the number of rows loaded so that you can test algorithms and evaluate performance benchmarks quickly. Most datasets can be loaded by their train, dev, and test splits, for example:

```python
from utils_nlp.dataset.snli import load_pandas_df

# Load the first 1,000 rows of the SNLI training split into a pandas DataFrame.
df = load_pandas_df(DATA_FOLDER, file_split="train", nrows=1000)
```
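
As a further sketch, the snippet below collects all three splits at once. It assumes the dev and test splits are fetched through the same `load_pandas_df` call with a different `file_split` value; check the specific dataloader script for the splits it actually supports.

```python
from utils_nlp.dataset.snli import load_pandas_df

# Collect each split into its own DataFrame, following the
# train/dev/test convention described above.
splits = {
    split: load_pandas_df(DATA_FOLDER, file_split=split, nrows=1000)
    for split in ("train", "dev", "test")
}
print({name: len(df) for name, df in splits.items()})
```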

## Dataset List

| Dataset | Dataloader script |
| --- | --- |
| Microsoft Research Paraphrase Corpus | msrpc.py |
| The Multi-Genre NLI (MultiNLI) Corpus | multinli.py |
| The Stanford Natural Language Inference (SNLI) Corpus | snli.py |
| Wikigold NER | wikigold.py |
| The Cross-Lingual NLI (XNLI) Corpus | xnli.py |
| The STSbenchmark dataset | stsbenchmark.py |
| The Stanford Question Answering Dataset (SQuAD) | squad.py |
| CNN/Daily Mail (CNN/DM) Dataset | cnndm.py |
| Preprocessed CNN/Daily Mail (CNN/DM) Dataset for Extractive Summarization | cnndm.py |

## Dataset References

Please see Dataset References for notices and information regarding the datasets used.