The Tatoeba Challenge - Development Notes

Restore symbolic model links

git fetch origin <commit-hash-with-all-files>
git checkout FETCH_HEAD
find models/ -type l | tar -cf models-links.tar -T -
git stash
git checkout master
tar -xf models-links.tar
git add models/*/*.yml
git commit -am 'restored symbolic links'
git push origin master
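The key step above is archiving only the symbolic links with `find -type l` piped into `tar -T -`. The following is a minimal sandbox sketch of that round trip, run in a temporary directory rather than the real repository. The directory and file names (`eng-deu`, `model-v1.yml`, `latest.yml`) are hypothetical and chosen only for illustration.

```shell
# Hypothetical sandbox: all names below are illustrative, not taken from the repository.
tmp=$(mktemp -d)
mkdir -p "$tmp/src/models/eng-deu" "$tmp/dst"
echo "placeholder" > "$tmp/src/models/eng-deu/model-v1.yml"
ln -s model-v1.yml "$tmp/src/models/eng-deu/latest.yml"

# Archive only the symbolic links, using the same pipeline as in the steps above.
(cd "$tmp/src" && find models/ -type l | tar -cf "$tmp/models-links.tar" -T -)

# Unpack in another tree: the entry comes back as a link, not as a copy of its target.
(cd "$tmp/dst" && tar -xf "$tmp/models-links.tar")
restored_target=$(readlink "$tmp/dst/models/eng-deu/latest.yml")
echo "latest.yml -> $restored_target"
```

Because the archive holds only the links, the restored link dangles in the sandbox until its target exists in the destination tree. In the real workflow the targets are expected to be present on the branch being restored.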

Prerequisites

Required software:

Optional software:

  • terashuf: efficiently shuffle massive data sets
  • pigz: multithreaded gzip
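Since pigz is optional, a script can fall back to plain gzip when it is not installed. The snippet below is a sketch of that pattern only. The variable name `GZIP_CMD` is an assumption for illustration and may not match how the Makefile actually selects its compressor.

```shell
# Pick pigz when it is installed, otherwise fall back to plain gzip.
# GZIP_CMD is a hypothetical variable name used only in this sketch.
if command -v pigz >/dev/null 2>&1; then
    GZIP_CMD="pigz"
else
    GZIP_CMD="gzip"
fi

# Compress a small sample file and read it back to confirm the round trip.
sample=$(mktemp)
printf 'line one\nline two\n' > "$sample"
"$GZIP_CMD" -c "$sample" > "$sample.gz"
roundtrip=$(gzip -dc "$sample.gz")
echo "compressed with: $GZIP_CMD"
```

pigz writes standard gzip format, so the output can be read with `gzip -dc` regardless of which tool produced it.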

Data:

  • local copy of all OPUS data (set OPUS_HOME in the Makefile)
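Before starting a long build, it can help to confirm that the directory configured as OPUS_HOME actually exists. The helper below is a hypothetical pre-flight check, not part of the repository. It is demonstrated with a temporary directory standing in for the real OPUS copy.

```shell
# Hypothetical helper: report whether a candidate OPUS_HOME directory exists.
check_opus_home() {
    if [ -d "$1" ]; then
        echo "found"
    else
        echo "missing"
        return 1
    fi
}

# Demonstration with a temporary directory in place of the real OPUS data.
demo_dir=$(mktemp -d)
status_ok=$(check_opus_home "$demo_dir")
status_bad=$(check_opus_home "$demo_dir/does-not-exist" || true)
echo "existing dir: $status_ok, absent dir: $status_bad"
```

In practice the argument would be the same path that is set for OPUS_HOME in the Makefile.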

Compiling the corpus

  • make sure that the scripts in scripts/ work as they should and that all software is properly installed
  • run make all to compile the entire corpus and the README files (or, better, run the build in parallel, for example with four parallel jobs using make -j 4 all)
  • upload the data to ObjectStorage using a-tools at CSC:
module load allas
allas-conf
make upload
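For the parallel build mentioned in the steps above, the job count can be derived from the machine instead of being hard-coded. This is a sketch under the assumption that one job per online CPU core is a reasonable choice. Memory-heavy steps may need fewer jobs. The snippet only prints the command and does not run it.

```shell
# Number of online CPU cores; falls back to 4 (the example figure above) if unknown.
njobs=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)

# Assemble the build command for inspection; run it manually once it looks right.
build_cmd="make -j $njobs all"
echo "$build_cmd"
```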

The data set can also be compiled in stages, for example the test/dev sets and the training data sets separately:

make -j testdata
make -j traindata
make subsets

TODO