Skip to content

A library to find potenitally misspelt or misworded quotes from a body of reference texts

License

Notifications You must be signed in to change notification settings

tahyonline/mismisquote

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MisMisQuote

Find Misspelt|Mispelt|Misspet and Misworded|Rewritten Quotes

A library to find potenitally misspelt or misworded text from a body of reference texts

Status

Proof of Concept (PoC) stage.

Top next step required: Optimisation is required for multi-layer (i.e., words of the sentence and letters of the word) mode currently in the two-layer-fuzzy branch.

Shift-And Algorithm

MisMisQuote is loosely based on the 'shift-and' algorithm for text matching.

The 'shift-and' algorithm uses a bitmap, i.e., a processor register, to represent how far we have matched the search string to the reference string.

At each new character, we shift the bitmap towards the more significant bits and put a 1 in the least significant bit.

Then we and the bitmap with the pre-calculated bitmaps of where that particular character comes up in the search string.

So, if we have matched 4 characters so far, then, after the shift we will have a 1 bit in the 5th position. Now if the current character matches, then we should have a 1 bit in the character's bitmap, too, so the bit would 'survive' the and and continue to 'bubble up'.

We have a match if we have a 1 in the nth bit, where n is the length of the search string.

Note: if you have a good reference for this algorithm, please let me know. I have used it about 20 years ago and back then it was easy to find.

Floats instead of Bits

To be able to score fuzzy matches, we use lists of floats to represent the bitmap, so at each position we can track not just full matches, but partial matches, too.

When we have some positive number in the nth position, we know that we have found 'something' and the score gives an idea how good the match is.

The threshold can be set to limit the quality of matches to the desired higher or lower accuracy.

About

A library to find potenitally misspelt or misworded quotes from a body of reference texts

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages