docsim

A Document Similarity Finder.

Problem

Required Document Similarity Algorithm Description

Given two documents, similarity is determined by comparing the ten most occurring words. For each document, determine the ten most occurring words. If the two documents have five or more words common in the set of ten most occurring words, then the documents are similar. Otherwise, the documents are not similar. When two different words occur the same amount of times, then the words are ordered alphabetically in the list of most occurring words. Since some words are less indicative of one’s writing style (e.g., I, is, the, a), an Exclusion List may be supplied containing words to exclude from the list of most occurring words.

Requirements

No third-party libraries may used (e.g., Spring, Apache Commons).
The document similarity algorithm should be pluggable.
Two different similarity algorithms must be implemented. One of the algorithms is described above and the other one is of your choosing.

Second Implementation

The second implementation evaluates the N most occurring phrases. It is similar to the word algorithm described above, but uses a sliding phrase window with a default length of 3 words.

Assumptions

Additional configuration parameters could be exposed to the command line interface as system properties. (i.e., numTopWords, matchThreshold, phraseSize.) Such parameters are available on the implementations but not exposed to the command line interface. Exposing these would require additional effort that does not seem pertinent to the exercise goals.

Usage Examples

java -jar target/docsim-${version}.jar src/test/resources/doc1.txt src/test/resources/doc2.txt

java -Dsimilarity.impl=com.timreardon.docsim.impl.MostOccurringPhrasesSimilarity -jar target/docsim-${version}.jar src/test/resources/doc1.txt src/test/resources/doc2.txt

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
src		src
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

docsim

Problem

Requirements

Second Implementation

Assumptions

Usage Examples

About

Releases

Packages

Languages

tequalsme/docsim

Folders and files

Latest commit

History

Repository files navigation

docsim

Problem

Requirements

Second Implementation

Assumptions

Usage Examples

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages