
Subs2Strudel Processed Languages

@doomlab doomlab released this 21 Oct 19:25
· 72 commits to main since this release
f091e9b

Concept - Feature Norms

This release contains the final STRUDEL concept-feature combinations for each language. The
language code at the beginning of each file name matches the code used by the Open
Subtitles Project.

The columns include:

  • feature_token: original token from the subtitles
  • feature_lemma: the lemma for the feature token
  • feature_pos: the part of speech for the original feature
  • characteristics: the information about part of speech tagging/parsing, such as number, tense, etc.
  • dependency_relation: the tag from the dependency parsing
  • concept_token: original token from the subtitles
  • concept_lemma: the lemma for the concept
  • concept_pos: the part of speech for the concept
  • freq: the frequency of each concept, feature, and part-of-speech combination
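
As a rough illustration of how these columns fit together, the sketch below sums freq over feature lemmas for a single concept lemma. It assumes comma-separated rows with a header line that matches the column names above. The delimiter, the helper name top_features, and the sample rows are assumptions made for illustration and are not taken from the release files.

```python
import csv
import io
from collections import Counter

# Made-up sample rows that only show the column layout (not data from the release).
SAMPLE = """feature_token,feature_lemma,feature_pos,characteristics,dependency_relation,concept_token,concept_lemma,concept_pos,freq
barks,bark,VERB,Number=Sing|Tense=Pres,nsubj,dog,dog,NOUN,12
barked,bark,VERB,Tense=Past,nsubj,dogs,dog,NOUN,5
loyal,loyal,ADJ,Degree=Pos,amod,dog,dog,NOUN,7
purrs,purr,VERB,Number=Sing|Tense=Pres,nsubj,cat,cat,NOUN,9
"""


def top_features(handle, concept_lemma, n=10, delimiter=","):
    """Return the n most frequent (feature_lemma, feature_pos) pairs for a concept.

    Hypothetical helper: the delimiter is an assumption and may need changing.
    """
    totals = Counter()
    for row in csv.DictReader(handle, delimiter=delimiter):
        if row["concept_lemma"] == concept_lemma:
            # Different inflected tokens collapse onto the same lemma here.
            totals[(row["feature_lemma"], row["feature_pos"])] += int(row["freq"])
    return totals.most_common(n)


result = top_features(io.StringIO(SAMPLE), "dog")
print(result)
```

Grouping on the lemma columns rather than the token columns merges inflected forms, which is usually what a feature-norm lookup wants; switch to feature_token and concept_token if the surface forms matter.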

Completed Languages

  • Afrikaans (af)
  • Arabic (ar)
  • Bulgarian (bg)
  • Catalan (ca)
  • Czech (cs)
  • Danish (da)
  • German (de)
  • Greek, Modern (1453–) (el)
  • English (en)
  • Spanish (es)
  • Estonian (et)
  • Basque (eu)
  • Farsi (fa)
  • Finnish (fi)
  • French (fr)
  • Galician (gl)
  • Hindi (hi)
  • Hebrew (he)
  • Croatian (hr)
  • Hungarian (hu)
  • Armenian (hy)
  • Indonesian (id)
  • Italian (it)
  • Japanese (ja)
  • Korean (ko)
  • Lithuanian (lt)
  • Latvian (lv)
  • Dutch, Flemish (nl)
  • Norwegian (no)
  • Polish (pl)
  • Romanian, Moldavian, Moldovan (ro)
  • Russian (ru)
  • Portuguese (pt)
  • Brazilian Portuguese (pt_br)
  • Slovak (sk)
  • Slovenian (sl)
  • Serbian (sr)
  • Swedish (sv)
  • Tamil (ta)
  • Turkish (tr)
  • Ukrainian (uk)
  • Urdu (ur)
  • Vietnamese (vi)
  • Cantonese (zh_cn)
  • Mandarin (zh_tw)

Format

These files are in UTF-8 format. They are not necessarily Excel friendly, but they can be opened in a regular text editor or read from a programming language.
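
A minimal sketch of reading such a file with an explicit UTF-8 decoding, so that non-ASCII characters survive regardless of the platform default. The helper name read_rows, the comma delimiter, and the shortened demo file are assumptions for illustration; the actual file names and delimiter in the release may differ.

```python
import csv
import os
import tempfile


def read_rows(path, delimiter=","):
    """Yield each row as a dict, decoding the file explicitly as UTF-8.

    Hypothetical helper: adjust the delimiter to match the real files.
    """
    with open(path, encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle, delimiter=delimiter)


# Round-trip demo on a temporary file with a shortened, made-up column set.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="utf-8", newline="") as out:
        out.write("concept_lemma,feature_lemma,freq\nmädchen,schön,3\n")
    rows = list(read_rows(demo_path))

print(rows)
```

Passing encoding="utf-8" avoids relying on the operating system's default encoding, which is a common source of garbled characters when the same file is opened on different machines.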