This repo contains a metric for evaluating the eloquence and vocabulary of text datasets, such as a collection of messages or entire mailboxes. The Hirsch index (h-index) is adapted as a measure of the eloquence of phrases. It is a useful parameter for assessing how suitable the data is for training neural networks in Natural Language Processing (NLP). The metric is mainly based on the frequency of recurring phrases within single text documents.
The core algorithm is implemented in C and also in Cython. Phrases can be counted at the message level or at the sentence level (in the latter case, a phrase may not span more than one sentence). The procedure for finding and counting phrases follows these rules:
- Start with the longest possible phrase length (= number of words in the message/sentence) as `phr_len`
- Search for a recurring sequence of words of this length
- If found, mark the sequences so that they can't be part of shorter phrases, and count their number of occurrences.
- Continue with step 2 for `phr_len = phr_len - 1` until `phr_len = 1`
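
The rules above can be sketched in pure Python. This is only an illustration, not the repo's C/Cython core: the function name `count_phrases` is hypothetical, and it assumes that a phrase counts as "recurring" when it has at least two non-overlapping occurrences, chosen greedily from left to right.

```python
from collections import defaultdict


def count_phrases(words):
    """Return (phrase, occurrences) tuples for one message or sentence.

    Hypothetical helper for illustration; `words` is a list of tokens.
    """
    n = len(words)
    used = [False] * n  # True once a word belongs to a counted phrase
    results = []
    # Step 1: start with the longest possible phrase length.
    for phr_len in range(n, 0, -1):
        # Step 2: collect all windows of this length made of unmarked words.
        starts = defaultdict(list)
        for i in range(n - phr_len + 1):
            if not any(used[i:i + phr_len]):
                starts[tuple(words[i:i + phr_len])].append(i)
        # Visit candidates in order of first appearance.
        for phrase, positions in sorted(starts.items(), key=lambda kv: kv[1][0]):
            chosen = []
            last_end = 0
            for i in positions:
                # Assumption: occurrences must not overlap each other and
                # must not reuse words marked earlier in this round.
                if i >= last_end and not any(used[i:i + phr_len]):
                    chosen.append(i)
                    last_end = i + phr_len
            # Step 3: a phrase that recurs is marked and counted.
            if len(chosen) >= 2:
                for i in chosen:
                    for j in range(i, i + phr_len):
                        used[j] = True
                results.append((" ".join(phrase), len(chosen)))
        # Step 4: the loop continues with phr_len - 1 until phr_len == 1.
    return results


if __name__ == "__main__":
    print(count_phrases("a b c x a b c y a b z a b".split()))
    # [('a b c', 2), ('a b', 2)]
```

In the example, the two occurrences of `a b c` are marked first, so only the two remaining occurrences of `a b` are counted as a shorter phrase.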
The result consists of:
- `matrix`, a NumPy matrix with 3 columns: phrase length, number of different phrases with this length, sum of the numbers of occurrences of all phrases with this length
- `tuples`, a list of the collected phrases with their numbers of occurrences
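
To make the three columns concrete, here is a minimal sketch of how such a summary could be derived from the phrase list. It is an assumption-laden illustration: the name `summarize` is hypothetical, phrases are assumed to be whitespace-separated strings, rows are assumed to be sorted by phrase length, and plain lists are used instead of a NumPy matrix to keep the example dependency-free.

```python
from collections import defaultdict


def summarize(tuples):
    """Build rows of [phrase length, distinct phrases, total occurrences].

    Hypothetical helper; `tuples` is a list of (phrase, occurrences) pairs.
    """
    per_length = defaultdict(lambda: [0, 0])
    for phrase, occurrences in tuples:
        length = len(phrase.split())          # phrase length in words
        per_length[length][0] += 1            # one more distinct phrase
        per_length[length][1] += occurrences  # add its occurrences
    return [[length, distinct, total]
            for length, (distinct, total) in sorted(per_length.items())]


if __name__ == "__main__":
    print(summarize([("a b c", 2), ("a b", 2), ("x y", 3)]))
    # [[2, 2, 5], [3, 1, 2]]
```

The resulting rows could be passed to `numpy.array` to obtain a matrix of the shape described above.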
The Hirsch index can be computed from `tuples`, and a plotting function for `matrix` is provided to visualize the phrase eloquence.
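
As a rough sketch of the idea, the classic h-index definition can be applied to phrases by treating each phrase like a paper and its number of occurrences like its citations: the index is the largest `h` such that at least `h` phrases occur at least `h` times each. Whether the repo uses exactly this adaptation is an assumption, and the function name `hirsch_index` is hypothetical.

```python
def hirsch_index(tuples):
    """Largest h such that at least h phrases occur at least h times each.

    Hypothetical helper; `tuples` is a list of (phrase, occurrences) pairs.
    """
    counts = sorted((occurrences for _, occurrences in tuples), reverse=True)
    h = 0
    for rank, occurrences in enumerate(counts, start=1):
        if occurrences >= rank:
            h = rank   # the top `rank` phrases all occur at least `rank` times
        else:
            break      # counts are sorted, so no later rank can qualify
    return h


if __name__ == "__main__":
    print(hirsch_index([("a", 5), ("b", 4), ("c", 3), ("d", 1)]))
    # 3
```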
Here are some instructions for getting your own Verne running on your local computer.
- Make sure you have the latest Python 2.7 release
- A C compiler is required if you want to use the C implementation of the algorithm instead of the Python version
- numpy
- spacy
- cython (only required for the C implementation; otherwise the Python implementation is used automatically)
- `git clone` this repo
- From the root of the repo, run `python cythonize_numerics.py build_ext --inplace` to precompile the C core.
To start the phrase counting on a given mailbox / collection of messages/texts, run
$ python counter.py <<path_to_mailbox_folder>>
I love open-source! And I try to reply to everyone who needs help using my projects. Also, you are of course free to integrate my project into your applications. However, if you get some profit from this or just want to encourage me to continue creating stuff, there are a few ways you can do it:
- Starring and sharing projects you like
- 🍲 Share your next meal with those less fortunate, because there is no reason not to do so!
- 📖 Buy me a book: I love books and I will always remember you 😉
- Bitcoin: You can send me bitcoins at this address:
xpub6DUNko8GTPePPgtbK1qfpiLCoujQXUBTi1qtfw7V2oBCdnk1H9d3if3pazmCy9QgENKSNPpHAXRZp8HLSG7pWwba5HRcHLC3TjbXYXXZh57
Thanks! ❤️