Python data analysis project to visualize letter distributions across various text files

goralczm/letters_statistics

Letter Statistics

I was once curious about what words might be hiding: whether there are any patterns in them, or at least some statistics I could gather while training my skills.

This is my ongoing project, which is all about words and letters.

I am currently working with a few txt files in both Polish and English so that the two languages can be compared.

Experiments

After a few experiments with collecting letter occurrences, the most frequent letters, vowels, consonants, etc., I have come to the conclusion that there is a pattern behind it all. Just take a look at these charts comparing various files. The overall shapes are similar, and the percentage share of each letter is strikingly close across all the texts.

Chart comparing Lalka and Hamlet-PL

Chart comparing Pan Tadeusz and Lalka

Chart comparing Hamlet-EN and 'Romeo and Juliet'
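The comparison behind these charts can be sketched in a few lines of Python. This is only a minimal illustration, not the project's actual code: the function names `letter_percentages` and `distribution_gap` are hypothetical, and the sketch assumes that every character for which `str.isalpha()` is true counts as a letter, case-insensitively.

```python
from collections import Counter


def letter_percentages(text: str) -> dict[str, float]:
    """Return each letter's share of all letters in `text`, in percent."""
    # Assumption: a "letter" is any alphabetic character, compared in lowercase.
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {letter: 100 * n / total for letter, n in counts.items()}


def distribution_gap(a: dict[str, float], b: dict[str, float]) -> float:
    """Sum of absolute percentage differences over all letters (0 means identical)."""
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in set(a) | set(b))


# Tiny inline samples; real input would be the contents of the txt files.
first = letter_percentages("To be, or not to be")
second = letter_percentages("Być albo nie być")
print(round(distribution_gap(first, second), 2))
```

A small gap means the two letter distributions have a similar shape, which is what the charts show visually.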

Statistics

These are a few statistics that I have gathered so far.

| Statistic | English Words | Polish Words | Hamlet-EN | Hamlet-PL |
| --- | --- | --- | --- | --- |
| Total letters | 3494707 | 36090123 | 122288 | 144927 |
| Vowels | 1438930 (~41.17%) | 15835894 (~43.88%) | 51284 (~41.94%) | 59901 (~41.33%) |
| Consonants | 2055777 (~58.83%) | 20254229 (~56.12%) | 71004 (~58.06%) | 85026 (~58.67%) |
| Most frequent letter | "e" with 376456 occurrences (~10.77%) | "a" with 3388277 occurrences (~9.39%) | "e" with 16335 occurrences (~13.36%) | "a" with 12103 occurrences (~8.35%) |
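Statistics of this kind could be computed roughly as follows. This is a hedged sketch rather than the code that produced the numbers above: the `summarize` function and the `VOWELS` set are hypothetical, and the choice of vowels (including "y" and the Polish "ą", "ę", "ó") is an assumption that may differ from the one used in the project.

```python
from collections import Counter

# Assumption: this vowel set is illustrative; the project may define vowels differently.
VOWELS = set("aeiouyąęó")


def summarize(text: str) -> dict:
    """Count letters, vowels, and consonants, and find the most frequent letter."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    vowels = sum(n for ch, n in counts.items() if ch in VOWELS)
    # most_common(1) returns a list holding one (letter, count) pair.
    most_frequent = counts.most_common(1)[0] if counts else None
    return {
        "total": total,
        "vowels": vowels,
        "consonants": total - vowels,
        "most_frequent": most_frequent,
    }


stats = summarize("Hello, World!")
share = 100 * stats["vowels"] / stats["total"]
print(stats, f"vowels ~{share:.2f}%")
```

Dividing each count by the total gives the percentages shown in parentheses in the table.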

Sources of txt files:
