Sentence Collector

Notes and files for sentence collection for Common Voice in Esperanto.



Useful Bash snippets for sentence extraction from old books

Delete all lines with three or fewer characters

awk 'length>3' alico4.txt > alico5.txt

Delete all lines containing certain letters

 sed '/[wWqQxXyY]/d' marta6.txt > marta7.txt
 sed '/[äÄöÖüÜßðÀÁÂÃÅÆÇÈÉÊËÌÍİÎÏÐÑÒÓÔÕØÙÚÛÝŽàáâãåæçèéêëìíîïñòóôõøùúûýþÿāăąćċčďđēĕėęěğġģħĩīĭįıķĸĺļľŀłńņṫšЎḃḋḟṁṗṡẁẃẅẛỳαβΓγΔδεζηΘθικΛλμνΞξΠπρΣσςτșếōůūŁşǐżőňựžịŌŏČŠřś]/d' europarl-de-wip.txt > eu2.txt
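
Inside a bracket expression every character stands for itself, so the | character is not needed as a separator there and would itself be matched. A small check on a made-up three-line sample:

```shell
# Made-up sample: one clean line, one containing "w", one containing a pipe.
sample=$(printf '%s\n' 'Saluton mondo' 'wow' 'a | b')

# With "|" inside the brackets, the line containing a pipe is deleted too.
with_pipes=$(printf '%s\n' "$sample" | sed '/[w|W]/d')

# Without it, only lines containing w or W are deleted.
without_pipes=$(printf '%s\n' "$sample" | sed '/[wW]/d')

printf 'with pipes:\n%s\n\nwithout pipes:\n%s\n' "$with_pipes" "$without_pipes"
```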

Delete lines longer than 102 characters

 sed '/^.\{102\}./d' alico3.txt > alico4.txt

Delete all lines with more than 14 words

awk 'NF<=14' europarl-de-wip.txt > lees14.txt
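
The length and word-count filters above can be folded into a single awk pass. A minimal sketch, assuming the same thresholds as the snippets in this section (more than 3 characters, at most 102 characters, at most 14 words); input.txt and filtered.txt are placeholder names. Depending on the awk implementation and locale, length may count bytes rather than characters, which matters for letters such as ĉ.

```shell
workdir=$(mktemp -d)

# Placeholder input: a too-short line, a good line, and a line with 15 words.
printf '%s\n' \
  'Ha.' \
  'La suno brilas forte.' \
  'a b c d e f g h i j k l m n o' > "$workdir/input.txt"

# Keep lines with more than 3 characters, at most 102 characters,
# and at most 14 whitespace-separated words.
awk 'length > 3 && length <= 102 && NF <= 14' \
  "$workdir/input.txt" > "$workdir/filtered.txt"

cat "$workdir/filtered.txt"
```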

Delete all lines beginning with a lowercase letter or punctuation

 sed '/^[[:lower:][:punct:]]/d' marta.txt > marta2.txt

Delete all lines containing numbers

sed '/[0-9]/d' filename.txt

Delete all lines containing abbreviations (two consecutive capital letters)

sed '/[A-Z][A-Z]/d' europarl-de-wip.txt > eu2.txt
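
Several delete rules can run in one sed call instead of writing an intermediate file per step. A sketch, assuming the digit, abbreviation and lowercase/punctuation rules should all apply; the sample lines are made up:

```shell
# Made-up sample lines; only the last one should survive all three rules.
sample=$(printf '%s\n' \
  'En 1887 aperis la libro.' \
  'La EU decidis tion.' \
  'kaj tiel plu.' \
  'La hundo dormas.')

# Each -e adds one delete rule; a line is dropped as soon as any rule matches.
cleaned=$(printf '%s\n' "$sample" | sed \
  -e '/[0-9]/d' \
  -e '/[A-Z][A-Z]/d' \
  -e '/^[[:lower:][:punct:]]/d')

printf '%s\n' "$cleaned"
```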

Strip leading spaces and tabs from each line

sed -e 's/^[ \t]*//' alico2.txt > alico3.txt
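
A quick check on made-up input that removing leading blanks leaves the rest of each line untouched. The sketch uses [[:blank:]], the POSIX class for space and tab, as an assumed portable spelling:

```shell
# Made-up input: one line indented with spaces, one with a tab, one not indented.
indented=$(printf '   tri spacoj\n\tunu tabo\nneniu\n')

# Remove any run of spaces/tabs at the start of each line; inner spaces stay.
stripped=$(printf '%s\n' "$indented" | sed 's/^[[:blank:]]*//')

printf '%s\n' "$stripped"
```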

Delete lines containing slashes

sed '\~/~d' OSCAR-corpus-eo_dedup.txt > 1.txt

Delete lines containing a given string

grep -v "our" t21.txt > t22.txt

Delete lines containing a stand-alone letter (here f or F)

grep -v " [fF] "

Delete lines containing a single character followed by a full stop

grep -v " .\."

Select only lines containing . ? or !

awk '/\./ || /\?/ || /!/'
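
The three patterns can also be written as one bracket expression, inside which . ? and ! need no escaping. A sketch on made-up lines:

```shell
# Made-up sample: a heading without end punctuation and three sentences.
sample=$(printf '%s\n' 'Enhavo' 'Kie vi estas?' 'Venu!' 'Jes.')

# Keep only lines that contain at least one of . ? !
selected=$(printf '%s\n' "$sample" | awk '/[.?!]/')

printf '%s\n' "$selected"
```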

Helpful regex:

Esperanto Alphabet:


Single letter in EO-Alphabet:

 All non-EO letters: [^\u0000-\u007BĉĈĝĜĥĤĵĴŝŜŭŬ„“‚‘’”–―—‑…«»]
 [^a-zA-ZĉĈĝĜĥĤĵĴŝŜŭŬ .,?!;–―“„‚‘…-]
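
A character class like the ones above can be used with grep to split a file into clean lines and lines to review. A minimal sketch, assuming a UTF-8 locale and a shortened class (letters plus basic punctuation only); the file names are placeholders:

```shell
workdir=$(mktemp -d)

# Placeholder input: two acceptable lines and one containing a digit.
printf '%s\n' \
  'Saluton, mondo!' \
  'Mi havas 3 katojn.' \
  'Kie vi estas?' > "$workdir/input.txt"

# Shortened class: ASCII letters, Esperanto letters, space and basic punctuation.
# The hyphen goes last so that it is read literally and not as a range.
non_eo='[^a-zA-ZĉĈĝĜĥĤĵĴŝŜŭŬ .,?!;-]'

# Lines with no character outside the class.
grep -v "$non_eo" "$workdir/input.txt" > "$workdir/clean.txt"
# Lines with at least one such character, kept for manual review.
grep "$non_eo" "$workdir/input.txt" > "$workdir/review.txt"

cat "$workdir/clean.txt"
```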

German Alphabet:

All non-German letters: [^\u0000-\u007BäöüßÄÖÜ„“‚‘’–]


Cyrillic alphabet: [аАбБвВгГдДеЕёЁжЖзЗиИйЙкКлЛмМнНоОпПрРсСтТуУфФхХцЦчЧшШщЩъЪыЫьЬэЭюЮяЯ]
Arabic alphabet: [\u0600-\u06FF] or [ء-ي]
Extended Arabic alphabet: /[\u0600-\u06ff]|[\u0750-\u077f]|[\ufb50-\ufbc1]|[\ufbd3-\ufd3f]|[\ufd50-\ufd8f]|[\ufd92-\ufdc7]|[\ufe70-\ufefc]|[\uFDF0-\uFDFD]/
Chinese and Japanese characters: [\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD]

Word endings:


Word patterns:

any letter, point, any letter point
One letter word:
Question marks in the middle of words:
Point in the middle of words (mainly domains):

100 random sentences

sort -R wiki.eo.sort.txt |head -n 100 > wiki.eo.sample.txt
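
With GNU coreutils, shuf -n draws a sample directly. sort -R orders lines by a random hash of each line, so identical lines stay next to each other, while shuf permutes the lines themselves. A sketch with placeholder file names and a small sample size:

```shell
workdir=$(mktemp -d)

# Placeholder input: ten distinct lines.
seq 1 10 | sed 's/^/frazo /' > "$workdir/all.txt"

# Draw 3 distinct lines at random (use -n 100 for a real sample).
shuf -n 3 "$workdir/all.txt" > "$workdir/sample.txt"

cat "$workdir/sample.txt"
```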

Cut file into equal chunks

split -l 1000 filename.txt prefix-
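
split expects the input file before the prefix; when only one operand is given, it is taken as the input file. A sketch with placeholder names and a small chunk size:

```shell
workdir=$(mktemp -d)

# Placeholder input: 25 lines.
seq 1 25 > "$workdir/sentences.txt"

# Write chunks of 10 lines each: chunk-aa, chunk-ab, chunk-ac.
split -l 10 "$workdir/sentences.txt" "$workdir/chunk-"

# Line counts per chunk; the last chunk holds the remainder.
wc -l "$workdir"/chunk-*
```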

Sort alphabetically and delete duplicates

sort -u wiki.eo.txt > wiki.eo.sort.txt

Count characters

sed 's/\(.\)/\1\n/g' wiki.eo.txt | sort | uniq -ic > characters.eo.txt

awk -vFS="" '{for(i=1;i<=NF;i++)w[tolower($i)]++}END{for(i in w) print i,w[i]}' > signs.txt
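
A frequency table is easier to scan when it is sorted by count. A sketch of one way to do that with grep -o, assuming GNU or BSD grep and a made-up input string; unlike the commands above it is case-sensitive, to keep the example short:

```shell
# Made-up input text.
text='abba-a'

# One character per line, count each, most frequent first.
# awk normalises the padded output of uniq -c to "count character".
counts=$(printf '%s\n' "$text" | grep -o . \
  | sort | uniq -c | sort -rn | awk '{print $1, $2}')

printf '%s\n' "$counts"
```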