The program reads a given input containing articles, chapters, and books on applied sciences, mathematics, and information science published from 1900 to 2021, and finds the 10 most frequent words in it.

iclalsonmez/top-10-frequent-words

top-10-frequent-words

The program reads a given input containing articles, chapters, and books on applied sciences, mathematics, and information science published from 1900 to 2021, and finds the 10 most frequent words in it.

Since the input is large and its format suits the approach, I decided the best way to do this was with hashing and linked lists, so I created a simple hash map and added some useful methods. I started by creating two classes named "HashNode" and "HashMap". "HashNode" has a simple constructor and serves as the node of a basic linked list. "HashMap" is the main class of the header file; it holds the methods and the size of the map. The constructor is given a big prime number as the size of the map, so that multiplying values by the chosen prime and adding them all up has the best chance of producing a unique value.

In the main.cpp file, I first created two functions, lower() and isSuitable(). lower() converts the letters of a word to lower case, and isSuitable() checks whether the word contains a number or any quotation marks, so that only correct input is counted.

In the main function, the program reads the files "PublicationsDataSet.txt" and "stopwords.txt" and stores them in two different hash maps created according to their sizes. I parsed the input data by adding a newline every time a curly bracket or a comma appears. After finding the line "UnigramCount", the program decides what counts as a word and splits the words accordingly. If a word is not in the list yet and isSuitable() returns true for it, it is stored in the hash map using the methods of the class. If it is already in the map, its count value is updated by adding the new count.

In the end, we retrieve all entries and create a list named TopTenEntries, which is first filled with 0s.
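The hash map and the two helper functions described above could look roughly like the sketch below. It is not the repository's actual code: the method names `add`, `get`, and `contains`, the default bucket count of 10007, the multiplier 31 in the hash function, and the choice to treat both `'` and `"` as quotation marks are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// One node of a bucket's singly linked list: a word and its count.
struct HashNode {
    std::string key;
    int count;
    HashNode* next;
    HashNode(const std::string& k, int c) : key(k), count(c), next(nullptr) {}
};

// Hash map with separate chaining (hypothetical interface).
class HashMap {
public:
    // The bucket count is assumed to be a prime; 10007 is only an example.
    explicit HashMap(std::size_t bucketCount = 10007)
        : buckets_(bucketCount, nullptr), size_(0) {
        assert(bucketCount > 0);
    }
    ~HashMap() {
        for (HashNode* head : buckets_) {
            while (head) {
                HashNode* next = head->next;
                delete head;
                head = next;
            }
        }
    }
    HashMap(const HashMap&) = delete;
    HashMap& operator=(const HashMap&) = delete;

    // Adds `amount` to the count of `key`, inserting the key if it is absent.
    void add(const std::string& key, int amount) {
        std::size_t i = indexOf(key);
        for (HashNode* n = buckets_[i]; n; n = n->next) {
            if (n->key == key) {
                n->count += amount;
                return;
            }
        }
        HashNode* node = new HashNode(key, amount);
        node->next = buckets_[i];  // push onto the front of the chain
        buckets_[i] = node;
        ++size_;
    }

    // Returns the stored count, or 0 if the key is not present.
    int get(const std::string& key) const {
        for (const HashNode* n = buckets_[indexOf(key)]; n; n = n->next) {
            if (n->key == key) return n->count;
        }
        return 0;
    }

    bool contains(const std::string& key) const { return get(key) != 0; }
    std::size_t size() const { return size_; }

private:
    // Polynomial hash: multiply the running value by a small prime and add
    // each character, then reduce modulo the (prime) bucket count.
    std::size_t indexOf(const std::string& key) const {
        std::size_t h = 0;
        for (unsigned char c : key) h = h * 31 + c;
        return h % buckets_.size();
    }

    std::vector<HashNode*> buckets_;
    std::size_t size_;
};

// Returns a lower-case copy of the word.
std::string lower(std::string word) {
    for (char& c : word) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return word;
}

// True if the word is non-empty and has no digits or quotation marks.
bool isSuitable(const std::string& word) {
    if (word.empty()) return false;
    for (unsigned char c : word) {
        if (std::isdigit(c) || c == '"' || c == '\'') return false;
    }
    return true;
}
```

With this interface, counting a parsed word would amount to something like `if (isSuitable(w) && !stopwords.contains(lower(w))) counts.add(lower(w), n);`, where `stopwords` and `counts` are two `HashMap` instances and `n` is the unigram count read from the data set.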
With a for loop, we compare every entry in AllEntries with TopTenEntries one by one. If an entry in AllEntries has a number bigger than zero, we put it on the 9th line of our TopTenEntries list. We then check whether that number is bigger than the one on the line above; if it is, the word moves up to that line. This goes on until we have all the top 10 words sorted by the number of times they are used in the file.
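The selection step above can be sketched as follows. The `Entry` struct and the `topTen` function are hypothetical names, and "the 9th line" is assumed to mean index 9, the last of ten slots. One deliberate difference from the description: the sketch compares each candidate against the current last slot rather than against zero, so that a later, smaller entry cannot displace a word that already belongs in the top ten.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// A word together with the number of times it occurs (hypothetical type).
struct Entry {
    std::string word;
    int count;
};

// Returns the ten entries with the highest counts, sorted in descending order.
// Unused slots keep an empty word and a count of 0.
std::array<Entry, 10> topTen(const std::vector<Entry>& allEntries) {
    std::array<Entry, 10> top{};  // every slot starts as {"", 0}
    for (const Entry& e : allEntries) {
        // Skip entries that cannot beat the current 10th place.
        if (e.count <= top[9].count) continue;
        top[9] = e;
        // Move the new entry up while it is larger than the one above it.
        for (std::size_t i = 9; i > 0 && top[i].count > top[i - 1].count; --i) {
            std::swap(top[i], top[i - 1]);
        }
    }
    return top;
}
```

Each entry is moved at most nine positions, so the whole pass stays linear in the number of distinct words, which avoids sorting the full list just to read its first ten items.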
