Phrase Frequency Counter

This repo contains a metric for evaluating the eloquence and the vocabulary used in text datasets, such as a collection of messages or entire mailboxes. The Hirsch index (h-index) is adapted as a measure of the eloquence of phrases. It is a useful parameter for assessing how suitable the data is for training neural networks in Natural Language Processing (NLP). The metric is mainly based on the frequency of recurring phrases within the individual text documents.

The core algorithm is implemented in C and also in Cython. Phrases can be counted at message level or at sentence level (in the latter case, a phrase may not span more than one sentence). The procedure for finding and counting phrases follows these rules:

1. Start with the longest possible phrase length (= number of words in the message/sentence) as phr_len.
2. Search for a recurring sequence of words of this length.
3. If found, mark the sequences so that they cannot be part of shorter phrases, and count their number of occurrences.
4. Continue with step 2 for phr_len = phr_len - 1 until phr_len = 1.
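
The procedure above can be sketched in plain Python. This is only an illustration under stated assumptions, not the C/Cython core of this repo: the function name count_phrases, the input format (a list of token lists, one per sentence or one for the whole message), and the rule that a phrase counts as recurring once it has at least two non-overlapping occurrences are all assumptions made for the example.

```python
from collections import defaultdict


def count_phrases(segments):
    """Count recurring phrases, longest first (illustrative sketch).

    segments: list of token lists, e.g. one list per sentence, or a single
    list holding the whole message (assumed input format).
    Returns a list of (phrase, occurrences) pairs; phrase is a tuple of words.
    """
    # used[s][i] is True once word i of segment s belongs to a counted phrase.
    used = [[False] * len(seg) for seg in segments]
    tuples = []
    max_len = max([len(seg) for seg in segments] or [0])

    for phr_len in range(max_len, 0, -1):
        # Group the start positions of every window of this length by content.
        positions = defaultdict(list)
        for s, seg in enumerate(segments):
            for i in range(len(seg) - phr_len + 1):
                positions[tuple(seg[i:i + phr_len])].append((s, i))

        for phrase, starts in positions.items():
            chosen = []
            last_end = {}  # per segment: end of the last chosen occurrence
            for s, i in starts:
                if any(used[s][i:i + phr_len]):
                    continue  # already part of a counted phrase
                if i < last_end.get(s, 0):
                    continue  # overlaps the previous occurrence of this phrase
                chosen.append((s, i))
                last_end[s] = i + phr_len
            # Assumption: "recurring" means at least two occurrences.
            if len(chosen) >= 2:
                for s, i in chosen:
                    used[s][i:i + phr_len] = [True] * phr_len
                tuples.append((phrase, len(chosen)))
    return tuples


sentences = [["see", "you", "soon"], ["see", "you", "soon"], ["thank", "you"]]
print(count_phrases(sentences))  # [(('see', 'you', 'soon'), 2)]
```

In this example the word "you" in the last sentence is not counted as a phrase of length 1, because its other two occurrences were already marked as part of the longer phrase.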

Return values

matrix: NumPy matrix with 3 columns: phrase length, number of distinct phrases of this length, and total number of occurrences of all phrases of this length
tuples: list of the collected phrases, each with its number of occurrences
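
To make the relation between the two return values concrete, here is a minimal sketch of how the three columns could be derived from a list of (phrase, occurrences) pairs. The helper name build_matrix_rows, the representation of a phrase as a tuple of words, and the choice to emit only lengths that actually occur are assumptions for illustration; the repo's own code may differ.

```python
def build_matrix_rows(tuples):
    """Aggregate (phrase, occurrences) pairs into rows of
    [phrase length, number of distinct phrases, total occurrences]."""
    stats = {}
    for phrase, count in tuples:
        row = stats.setdefault(len(phrase), [0, 0])
        row[0] += 1       # one more distinct phrase of this length
        row[1] += count   # add its occurrences to the total
    return [[length, stats[length][0], stats[length][1]]
            for length in sorted(stats)]


example = [
    (("see", "you", "soon"), 2),
    (("thank", "you"), 3),
    (("best", "regards"), 4),
]
print(build_matrix_rows(example))  # [[2, 2, 7], [3, 1, 2]]
```

Passing the resulting rows to numpy.array gives a 3-column array with the layout described above.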

From tuples the Hirsch index can be computed, and a plotting function for the matrix is provided to visualize the phrase eloquence.
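
As an illustration, the classic Hirsch index carries over to phrases as follows: h is the largest number such that h distinct phrases each occur at least h times. The sketch below assumes exactly this definition and a list of (phrase, occurrences) pairs as input; the function name hirsch_index is hypothetical, and the adaptation used in this repo may differ in detail.

```python
def hirsch_index(tuples):
    """Largest h such that h phrases occur at least h times each."""
    counts = sorted((count for _, count in tuples), reverse=True)
    h = 0
    for rank, count in enumerate(counts, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


example = [
    (("best", "regards"), 5),
    (("thank", "you"), 4),
    (("see", "you", "soon"), 3),
    (("hello",), 1),
]
print(hirsch_index(example))  # 3
```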

Getting Up-And-Running

Here are some instructions for getting your own Verne running on your local computer.

Prerequisites

Make sure you have the latest Python 2.7 release installed.
To use the C implementation of the algorithm instead of the Python version, a C compiler is required.

Python requirements:

numpy
spacy
cython (only required for the C implementation; otherwise the Python implementation is used automatically)
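
An automatic fallback of this kind is commonly written as a guarded import. The sketch below shows the general pattern only; the module name numerics is an assumption based on the build script's name, and load_backend is a hypothetical helper, not a function of this repo.

```python
def load_backend():
    """Return ("cython", module) if the compiled extension can be imported,
    otherwise ("python", None) so the caller uses the pure-Python code."""
    try:
        import numerics  # assumed name of the compiled Cython extension
        return "cython", numerics
    except ImportError:
        # Extension not built (e.g. no C compiler or no Cython available).
        return "python", None


backend_name, backend_module = load_backend()
print("Using backend: " + backend_name)
```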

Installation and Configuration

Clone this repo with git clone.
From the root of the repo, run python cythonize_numerics.py build_ext --inplace to precompile the C core.

Running the script

To start the phrase counting on a given mailbox / collection of messages/texts, run

$ python counter.py <<path_to_mailbox_folder>>

Support my projects 💝

I love open source, and I try to reply to everyone who needs help using my projects. You are of course free to integrate my project into your applications. However, if you profit from it or just want to encourage me to continue creating stuff, there are a few ways you can do so:

Starring and sharing projects you like
🍲 Share your next meal with those less fortunate, because there is no reason not to!
📖 Buy me a book: I love books and I will always remember you 😉
Bitcoin: You can send me bitcoins at this address: xpub6DUNko8GTPePPgtbK1qfpiLCoujQXUBTi1qtfw7V2oBCdnk1H9d3if3pazmCy9QgENKSNPpHAXRZp8HLSG7pWwba5HRcHLC3TjbXYXXZh57

Thanks! ❤️

harmening/phrase-frequency-counter
