allies_llmt_data

Data for the ALLIES Lifelong Learning Machine Translation evaluation campaign

Description

This dataset contains the documents from Europarl v10 and NewsCommentary v15. Each document has been renamed to the following format: corpus-date-time1-time2-suffix

corpus is either 'ep' for Europarl or 'nc' for NewsCommentary.
date is the document date in the formnat YYYYMMDD.
time1 and time2 (both optional) are integer values making sure that each document has a unique date-time1-time2.
suffix is the remaining of the originial document name.

For Europarl, the renaming simply consists in merging year, month and day values from original name and add a 'time1' tag to avoid multiple files with same timestamps. For NewsCommentary, the timestamps correspond to the 'article:published_time' extracted from the original html files.

Recreating the data

Recreate the tgz file then extract it:

For EN-FR:

cat en-fr.tgz.part* > en-fr.tgz
tar xzvf en-fr.tgz

For EN-DE:

cat en-de.tgz.part* > en-de.tgz
tar xzvf en-de.tgz

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
README.md		README.md
en-de.tgz.part00		en-de.tgz.part00
en-de.tgz.part01		en-de.tgz.part01
en-de.tgz.part02		en-de.tgz.part02
en-de.tgz.part03		en-de.tgz.part03
en-de.tgz.part04		en-de.tgz.part04
en-fr.tgz.part00		en-fr.tgz.part00
en-fr.tgz.part01		en-fr.tgz.part01
en-fr.tgz.part02		en-fr.tgz.part02
en-fr.tgz.part03		en-fr.tgz.part03
en-fr.tgz.part04		en-fr.tgz.part04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

allies_llmt_data

Description

Recreating the data

For EN-FR:

For EN-DE:

About

Releases

Packages

loicbarrault/allies_llmt_data

Folders and files

Latest commit

History

Repository files navigation

allies_llmt_data

Description

Recreating the data

For EN-FR:

For EN-DE:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages