All available data is stored in CoNLL-U format:
# newdoc
# newpar
# sent_id = 1
1 Кто кто PRON _ Case=Nom 3 nsubj _ _
2 нить нить NOUN _ Animacy=Inan|Case=Nom|Gender=Fem|Number=Sing 1 appos _ _
3 настраивал настраивать VERB _ Aspect=Imp|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act 0 root _ _
4 связку связка NOUN _ Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing 3 obj _ _
5 VisualSVN Visualsvn PROPN _ Foreign=Yes 4 flat:foreign _ _
6 Server Server PROPN _ Foreign=Yes 5 flat:foreign _ _
7 + плюс PUNCT _ _ 6 punct _ _
8 Trac Trac ADP _ _ 11 case _ _
9 0.11 0.11 NUM _ _ 11 nummod _ _
10 на на ADP _ _ 11 case _ _
11 Windows? Windows? SYM _ _ 12 obl _ _
12 Можете мочь VERB _ Aspect=Imp|Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin|Voice=Act 4 acl:relcl _ _
13 подсказать, подсказать ADV _ Degree=Pos 12 advmod _ _
14 а а CCONJ _ _ 19 cc _ _
15 то то PRON _ Animacy=Inan|Case=Nom|Gender=Neut|Number=Sing 19 mark _ _
16 заводится заводиться VERB _ Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Tense=Fut|VerbForm=Fin|Voice=Mid 15 fixed _ _
17 никак никак ADV _ Degree=Pos 19 advmod _ _
18 не не PART _ _ 19 advmod _ _
19 хочет хотеть VERB _ Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act 12 conj _ _
20 ((( ((( PUNCT _ _ 19 punct _ _
Enhanced dependencies and empty/secondary nodes should not be part of the output.
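The requirement above can be checked mechanically. The following is a minimal sketch, not an official tool of the shared task: the function name `strip_enhanced` is hypothetical, it assumes standard tab-separated CoNLL-U with 10 columns, and it interprets "empty nodes" as lines whose ID contains a dot (for example `1.1`), as in the CoNLL-U specification. It drops such lines and resets the DEPS column (enhanced dependencies) to `_`, leaving comments and blank lines untouched.

```python
from typing import Iterable, Iterator

NUM_COLUMNS = 10   # ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
DEPS_INDEX = 8     # zero-based position of the enhanced-dependency column


def strip_enhanced(lines: Iterable[str], sep: str = "\t") -> Iterator[str]:
    """Yield CoNLL-U lines without empty nodes and with DEPS set to '_'.

    Assumption: input is well-formed CoNLL-U; any token line that does not
    have exactly 10 columns raises ValueError instead of being guessed at.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        # Sentence separators and comment lines pass through unchanged.
        if not line or line.startswith("#"):
            yield line
            continue
        cols = line.split(sep)
        if len(cols) != NUM_COLUMNS:
            raise ValueError(f"expected {NUM_COLUMNS} columns: {line!r}")
        # Empty nodes have decimal IDs such as '1.1'; skip them entirely.
        if "." in cols[0]:
            continue
        cols[DEPS_INDEX] = "_"
        yield sep.join(cols)


if __name__ == "__main__":
    # Illustrative input only: the DEPS value and the empty node are made up.
    sample = [
        "# sent_id = 1",
        "1\tКто\tкто\tPRON\t_\tCase=Nom\t3\tnsubj\t3:nsubj\t_",
        "1.1\t_\t_\t_\t_\t_\t_\t_\t3:conj\t_",
        "",
    ]
    for out_line in strip_enhanced(sample):
        print(out_line)
```

Multiword-token range lines (IDs such as `1-2`) are kept by this sketch, since they are part of basic CoNLL-U; whether they count as "secondary nodes" for this task is not stated here, so adjust the filter if the organizers specify otherwise.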
- news UD_MorphoRuEval2017 1K
- social networks UD_MorphoRuEval2017 1K
- wiki UD_GSD 1K
- fiction UD_SynTagRus 1K
- poetry UD_Taiga 1K
- 17th century UD_MidRussian-RNC 1K
UD_SynTagRus-v02 - a harmonized version with semi-manual corrections
Russian data from the SynTagRus corpus (1.1M tokens, fiction, news, wiki, nonfiction). Source: UD Russian SynTagRus repository
Annotation:
- automatic (ETAP3), human correction in native SynTagRus, then re-tokenized and converted automatically to UD 2.x
- enhanced dependencies removed, minor fixes of lemmas, UPOS, features, and relations
- wiki UD_GSD
Russian Universal Dependencies Treebank annotated and converted by Google (96K tokens, wiki). Source: UD_Russian GSD repository
Annotation:
- automatic (GSD), human correction
Samples extracted from Taiga Corpus and MorphoRuEval-2017 text collections (38K tokens, blog, social, poetry, news). Source: UD_Russian Taiga repository
Annotation:
- manual
- news UD_RuEval2017 (Lenta.ru, 5K)
- fiction UD_RuEval2017 (magazines.gorky.media, 7K)
- social UD_RuEval2017 (VK, 5K)
Russian Corpus Data with manual verification, including SynTagRus, OpenCorpora, GICR, RNC.
Annotation:
- unified automatic morphology (AOT, Mystem, ABBYY Compreno...)
- UDPipe
- historical UD_OldRussian-RNC (dependencies manually corrected, 39K)
- historical UD_OldRussian-RNC (dependencies automatic, 3.2M)
A subcorpus of the Middle Russian corpus: texts of the 17th century (4M tokens; business & law, letters, Church Slavonic, hybrid)
Annotation:
- UPOS and features: hybrid automatic with partial manual post-correction
- lemmas: TBA
- dependencies: automatic (UDPipe)
The organizers provide additional data with fully automatic annotation:
Russian Corpus Data with manual verification, including SynTagRus, OpenCorpora, GICR, RNC.
Annotation:
- unified automatic morphology (AOT, Mystem, ABBYY Compreno...)
- UDPipe
Corpus of Russian tweets with sentiment annotation from http://study.mokoron.com/
Annotation:
- UDPipe pipeline (tokenization, morphology, syntax)
Recent dump of Russian Wikipedia, first 100,000 articles (to be extended)
Annotation:
- UDPipe pipeline (tokenization, morphology, syntax)
Comments from Russian YouTube Trends, April 2019
Annotation:
- UDPipe pipeline (tokenization, morphology, syntax)
Lenta.ru news, up to 2018
Annotation:
- symbol unification
- UDPipe pipeline (tokenization, morphology, syntax)
Stihi.ru poetry, part of the Taiga Corpus
Annotation:
- symbol unification
- UDPipe pipeline (tokenization, morphology, syntax)
Proza.ru fiction, part of the Taiga Corpus
Annotation:
- symbol unification
- UDPipe pipeline (tokenization, morphology, syntax)
Materials from https://magazines.gorky.media/, Taiga Corpus
Annotation:
- symbol unification
- UDPipe pipeline (tokenization, morphology, syntax)