Concept - Feature Norms
This release contains the final STRUDEL concept-feature combinations for each language. The
language code at the beginning of each file name matches the code used by the Open
Subtitles Project.
The columns include:
- feature_token: the original feature token from the subtitles
- feature_lemma: the lemma for the feature token
- feature_pos: the part of speech for the original feature token
- characteristics: additional information from the part of speech tagging/parsing, such as number and tense
- dependency_relation: the tag from the dependency parsing
- concept_token: the original concept token from the subtitles
- concept_lemma: the lemma for the concept token
- concept_pos: the part of speech for the concept
- freq: the frequency of each concept, feature, and part of speech combination
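As a rough illustration of how these columns fit together, the sketch below sums freq over feature lemmas for one concept lemma. The sample rows are invented for the example and are not taken from the release, and the comma delimiter is an assumption, since the delimiter is not stated here. The helper name top_features is hypothetical.

```python
import csv
import io
from collections import Counter

# Invented sample rows for illustration only; not real data from the release.
# The comma delimiter is an assumption about the file layout.
SAMPLE = """feature_token,feature_lemma,feature_pos,characteristics,dependency_relation,concept_token,concept_lemma,concept_pos,freq
barks,bark,VERB,Number=Sing|Tense=Pres,nsubj,dog,dog,NOUN,12
barked,bark,VERB,Tense=Past,nsubj,dogs,dog,NOUN,5
furry,furry,ADJ,Degree=Pos,amod,dog,dog,NOUN,7
purrs,purr,VERB,Number=Sing|Tense=Pres,nsubj,cat,cat,NOUN,9
"""


def top_features(rows, concept_lemma, n=10):
    """Return the n most frequent feature lemmas for one concept lemma."""
    totals = Counter()
    for row in rows:
        if row["concept_lemma"] == concept_lemma:
            # freq is read as text, so convert it before summing
            totals[row["feature_lemma"]] += int(row["freq"])
    return totals.most_common(n)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(top_features(rows, "dog"))
```

Grouping on the lemma columns rather than the token columns merges inflected forms such as "barks" and "barked" into a single feature.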
Completed Languages
- Afrikaans (af)
- Arabic (ar)
- Bulgarian (bg)
- Catalan (ca)
- Czech (cs)
- Danish (da)
- German (de)
- Greek, Modern (1453–) (el)
- English (en)
- Spanish (es)
- Estonian (et)
- Basque (eu)
- Farsi (fa)
- Finnish (fi)
- French (fr)
- Galician (gl)
- Hindi (hi)
- Hebrew (he)
- Croatian (hr)
- Hungarian (hu)
- Armenian (hy)
- Indonesian (id)
- Italian (it)
- Japanese (ja)
- Korean (ko)
- Lithuanian (lt)
- Latvian (lv)
- Dutch, Flemish (nl)
- Norwegian (no)
- Polish (pl)
- Romanian, Moldavian, Moldovan (ro)
- Russian (ru)
- Portuguese (pt)
- Brazilian Portuguese (pt_br)
- Slovak (sk)
- Slovenian (sl)
- Serbian (sr)
- Swedish (sv)
- Tamil (ta)
- Turkish (tr)
- Ukrainian (uk)
- Urdu (ur)
- Vietnamese (vi)
- Chinese, Simplified (zh_cn)
- Chinese, Traditional (zh_tw)
Format
These files are in UTF-8 format. They are not necessarily Excel friendly, but they can be opened in a regular text editor, a source code editor, or a programming language.
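A minimal sketch of reading a file with an explicit UTF-8 encoding in Python, so that non-ASCII characters are decoded correctly regardless of the system default. The function name read_norms, the demo file name, and the comma delimiter are assumptions for illustration; adjust the delimiter argument to match the actual files.

```python
import csv
import os
import tempfile


def read_norms(path, delimiter=","):
    """Read a delimited file as a list of dicts, decoding it as UTF-8."""
    # newline="" lets the csv module handle line endings itself
    with open(path, encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))


# Demonstration on a small invented file (not part of the release).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="utf-8", newline="") as handle:
        handle.write("feature_lemma,concept_lemma,freq\nsüß,Apfel,3\n")
    demo_rows = read_norms(demo_path)

print(demo_rows[0]["feature_lemma"])
```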