To get the details for the term project instructions, click here.
Source Code (Licensed with GPL-3.0)
Visual Studio with C++ is needed to compile
Here is my report:
Used Libraries
#include <iostream>#include <string>#include <fstream>#include <sstream>#include <chrono> // to measure the elapsed time
Getting The Stop Words
Creating a Chained Hash Table
- I've made a chained hash table from scratch to easily insert the words.
- The size of the hash table is defined in main(). I preferred to use the size 1000003 because it is a prime number, useful in modulus operations to differentiate the words’ locations(avoid collisions) and I found out that the total number of words to be inserted is 804718.
- Every point of the hash table consists of linked links. If there is more than one node on a specific location, I would have a collision which would slow down the searching process but I would be sure every "proper" word is counted.
- Linked lists consists of nodes which has the following elements;
- string word
- unsigned long count
- node *link
Cleaning The Words
The program starts reading the test.csv file with ifstream
After program gets to the 6th comma separated “string” field in the current reading line from the file (tweet section from test.csv) and separates them into words, then it cleans the unnecessary characters from the words and organize them for insertion.
- If the length of the word ever becomes or is already one character do nothing and go to the next word.
- Erase every non-letter characters unless if it is apostrophe(‘).
- If the apostrophe is in the beginning of the word, erase it.
- After cleaning is done, make every letter lowercase for both searching for it in the stop words and the hash table.
- Search the word in stop words between the indexes we created earlier.
- If the stop word flag isn’t raised we insert it(The word’s count increment of already existing words will be handled in the insertNew function), otherwise go to the next word.
Hash Table Function
- I used the function below to insert the words into the most correct position possible with the help of prime numbers (to avoid collisions as much as possible).
- If the word is already in the list, search function increases the word’s
count
automatically.insertNew
function adds the new word withcount
1 to the beginning of the linked list. - I did not use another function to get the hash key because it is unnecessary in this case and this is really easy to understand in my opinion.
Getting The Most Used Word From The Linked List
- Since both integer value “count” and string value “word” can’t be returned from a function, I used addressing in the function of the list.
- After getting the most used word and its count from the currently reading array, hash table’s displayMostUsed function stores them to local arrays if and only if the word’s count is greater than any other count in the local array.
- I didn’t have to sort the top 10 array in displayMostUsed function because it shifts the counts towards the end of the array, as the new values come in.
My Output
I ran the program on a laptop with a i7-7700Q CPU, DDR4 SODIMM 16GB(2x8GB) RAM and a 256GB NVMe SSD to get this running time.
No comments:
Post a Comment