This program reads an input file containing articles, chapters, and books about applied sciences, mathematics, and information science published from 1900 to 2021, and finds the 10 most frequent words in it.

Since the input is large and has a regular format, I decided the best approach was to use hashing and linked lists, so I created a simple hash map and added some useful methods. I started by creating two classes named “HashNode” and “HashMap”. “HashNode” has a simple constructor that creates a node of a basic linked list. “HashMap” is the main class of the header file; it holds the methods and the size of the map. The constructor is given a large prime number as the size of the map, so that multiplying values by the chosen prime number and adding them all up has the best chance of producing a unique value.

In the main.cpp file, I first created two functions, lower() and isSuitable(). lower() converts the letters of a word to lower case, and isSuitable() checks whether the word contains a number or any quotation marks, so that only clean words are counted.

In the main function, the program reads the files “PublicationsDataSet.txt” and “stopwords.txt” and stores them in two different hash maps sized according to the files. I parsed the input data by inserting a newline every time a curly bracket or a comma appears. After the line “UnigramCount” is found, the code decides what counts as a word and splits the words correctly. If a word is not in the map yet and isSuitable() returns true for it, it is stored in the hash map using the methods of the class. If it is already in the map, its count value is updated by adding the new count.

In the end, all entries are collected (AllEntries) and a list named TopTenEntries is created. TopTenEntries is first filled with 0s.
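The counting stage described above (the chained hash map plus the lower() and isSuitable() helpers) could look roughly like the sketch below. It is not the repository's actual code: the method names add() and get(), the capacity 10007, the multiplier 31 in the hash function, and the exact characters rejected by isSuitable() are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// One entry in a bucket's singly linked list.
struct HashNode {
    std::string key;
    int count;
    HashNode* next;
    HashNode(const std::string& k, int c) : key(k), count(c), next(nullptr) {}
};

class HashMap {
public:
    // 10007 is a prime picked for this sketch only (assumption).
    explicit HashMap(std::size_t capacity = 10007) : buckets_(capacity, nullptr) {
        assert(capacity > 0);
    }
    ~HashMap() {
        for (HashNode* head : buckets_) {
            while (head) {
                HashNode* next = head->next;
                delete head;
                head = next;
            }
        }
    }
    HashMap(const HashMap&) = delete;
    HashMap& operator=(const HashMap&) = delete;

    // Adds `amount` to the word's count, inserting the word if it is new.
    void add(const std::string& key, int amount) {
        std::size_t i = hash(key);
        for (HashNode* n = buckets_[i]; n; n = n->next) {
            if (n->key == key) { n->count += amount; return; }
        }
        HashNode* node = new HashNode(key, amount);
        node->next = buckets_[i];  // push onto the front of the chain
        buckets_[i] = node;
    }

    // Returns the stored count, or 0 if the word is absent.
    int get(const std::string& key) const {
        for (const HashNode* n = buckets_[hash(key)]; n; n = n->next) {
            if (n->key == key) return n->count;
        }
        return 0;
    }

private:
    // Multiply-and-add hash; the multiplier 31 is an assumption.
    std::size_t hash(const std::string& key) const {
        std::size_t h = 0;
        for (unsigned char c : key) h = h * 31 + c;
        return h % buckets_.size();
    }
    std::vector<HashNode*> buckets_;
};

// Returns a lower-case copy of the word.
std::string lower(std::string word) {
    for (char& c : word) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return word;
}

// Rejects empty words and words containing a digit or a quotation mark.
bool isSuitable(const std::string& word) {
    if (word.empty()) return false;
    for (unsigned char c : word) {
        if (std::isdigit(c) || c == '"' || c == '\'') return false;
    }
    return true;
}
```

With these helpers, counting one parsed word comes down to a call such as `map.add(lower(word), count)` guarded by `isSuitable(word)`. Collisions are resolved by walking the bucket's linked list, so two different words that hash to the same index are still counted separately.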
With a for loop, every entry in AllEntries is compared with TopTenEntries one by one. If an entry in AllEntries has a count bigger than zero, it is placed in the 9th line of the TopTenEntries list. Then we check whether its count is bigger than the count in the line above; if it is, the word moves up to that line. This goes on until all of the top 10 words are sorted by the number of times they are used in the file.
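One way to read the selection step above is as an insertion into a small sorted list: a candidate enters at the last slot and is swapped upward while its count is larger than the one above it. The sketch below follows that reading. The `Entry` struct, the function name `topEntries`, and the treatment of the "9th line" as the last slot (index 9 when `n` is 10) are assumptions, and the sketch only admits a candidate whose count beats the current last slot, which is slightly stricter than "bigger than zero".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// A word together with the number of times it was counted.
struct Entry {
    std::string word;
    int count;
};

// Returns the n entries with the largest counts, in descending order.
// Unused slots keep an empty word and a count of 0.
std::vector<Entry> topEntries(const std::vector<Entry>& allEntries,
                              std::size_t n = 10) {
    assert(n > 0);
    std::vector<Entry> top(n);  // value-initialized: every count starts at 0
    for (const Entry& e : allEntries) {
        // Skip entries that cannot displace the current last slot.
        if (e.count <= top[n - 1].count) continue;
        top[n - 1] = e;
        // Bubble the new entry upward while it beats the line above it.
        for (std::size_t i = n - 1; i > 0 && top[i].count > top[i - 1].count; --i) {
            std::swap(top[i], top[i - 1]);
        }
    }
    return top;
}
```

Each entry needs at most n comparisons, so the whole pass is linear in the number of distinct words when n is fixed at 10, and no full sort of AllEntries is required.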
iclalsonmez/top-10-frequent-words