# dataset

This submodule includes helper functions for downloading datasets and formatting them appropriately, as well as utilities for splitting data for training and testing.
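
For illustration, here is a minimal, self-contained sketch of a train/test split utility written in plain pandas. This is a hypothetical helper; the split utilities actually shipped in this submodule may differ in name and signature.

```python
import pandas as pd


def train_test_split_df(df: pd.DataFrame, test_fraction: float = 0.2, seed: int = 42):
    """Shuffle a DataFrame and split it into train and test partitions.

    Hypothetical helper for illustration only; not necessarily the
    implementation provided by this submodule.
    """
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    cutoff = int(len(shuffled) * (1.0 - test_fraction))
    return shuffled.iloc[:cutoff], shuffled.iloc[cutoff:]


# Example usage on a toy DataFrame.
toy = pd.DataFrame({"sentence": [f"example {i}" for i in range(10)]})
train_df, test_df = train_test_split_df(toy, test_fraction=0.2)
print(len(train_df), len(test_df))  # 8 2
```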

## Data Loading

There are dataloaders for several datasets. For example, the `snli` module lets you load the SNLI dataset into a pandas DataFrame, with an option to limit the number of rows loaded so that you can test algorithms and evaluate performance benchmarks quickly. Most datasets can be loaded by their train, dev, and test splits, for example:

```python
from utils_nlp.dataset.snli import load_pandas_df

# Load the first 1,000 rows of the SNLI training split into a pandas DataFrame.
df = load_pandas_df(DATA_FOLDER, file_split="train", nrows=1000)
```
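
As a further sketch, the snippet below collects all three splits at once. It assumes the dev and test splits are fetched through the same `load_pandas_df` call with a different `file_split` value; check the specific dataloader script for the splits it actually supports.

```python
from utils_nlp.dataset.snli import load_pandas_df

# Collect each split into its own DataFrame, following the
# train/dev/test convention described above.
splits = {
    split: load_pandas_df(DATA_FOLDER, file_split=split, nrows=1000)
    for split in ("train", "dev", "test")
}
print({name: len(df) for name, df in splits.items()})
```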

## Dataset List

| Dataset | Dataloader script |
| --- | --- |
| Microsoft Research Paraphrase Corpus | msrpc.py |
| The Multi-Genre NLI (MultiNLI) Corpus | multinli.py |
| The Stanford Natural Language Inference (SNLI) Corpus | snli.py |
| Wikigold NER | wikigold.py |
| The Cross-Lingual NLI (XNLI) Corpus | xnli.py |
| The STSbenchmark dataset | stsbenchmark.py |
| The Stanford Question Answering Dataset (SQuAD) | squad.py |
| CNN/Daily Mail (CNN/DM) Dataset | cnndm.py |
| Preprocessed CNN/Daily Mail (CNN/DM) Dataset for Extractive Summarization | cnndm.py |

## Dataset References

Please see Dataset References for notices and information regarding the datasets used.