Ekin Karadag

I was given a term project for the "Data Structures and Algorithms (C++)" class which was to write a program to search through every word in 1.6M tweets, record every used word with the number of times they are used and display the top 10 words and the number of times they are used as fast as possible.

To get the details for the term project instructions, click here.

Source Code (Licensed with GPL-3.0)
Visual Studio with C++ is needed to compile

Here is my report:

Used Libraries

#include <iostream>

#include <string>

#include <fstream>

#include <sstream>

#include <chrono> // to measure the elapsed time

Getting The Stop Words

The program starts with reading the stopwords.txt file which consists of stop words at every line. Stop words can be found here

Then, it indexes the starting points of word’s first letters(e.g. starting point index of the stop words that start with letter a is 0, for b 61, for c 89 and so on)

Indexing point helps the program to decrease the time to search the word in the stop words array.

Program will only search the stop words between the starting index and until the next one (e.g. if the program wants to search the words that start with letter b it would search from stopWords[61] to stopWords[89])

I created a virtual ending point because if z is the starting letter, it would not now when to stop.

Creating a Chained Hash Table

I've made a chained hash table from scratch to easily insert the words.
The size of the hash table is defined in main(). I preferred to use the size 1000003 because it is a prime number, useful in modulus operations to differentiate the words’ locations(avoid collisions) and I found out that the total number of words to be inserted is 804718.
Every point of the hash table consists of linked links. If there is more than one node on a specific location, I would have a collision which would slow down the searching process but I would be sure every "proper" word is counted.
Linked lists consists of nodes which has the following elements;

string word
unsigned long count
node *link

Cleaning The Words

The program starts reading the test.csv file with ifstream

After program gets to the 6th comma separated “string” field in the current reading line from the file (tweet section from test.csv) and separates them into words, then it cleans the unnecessary characters from the words and organize them for insertion.

If the length of the word ever becomes or is already one character do nothing and go to the next word.
Erase every non-letter characters unless if it is apostrophe(‘).
If the apostrophe is in the beginning of the word, erase it.
After cleaning is done, make every letter lowercase for both searching for it in the stop words and the hash table.
Search the word in stop words between the indexes we created earlier.
If the stop word flag isn’t raised we insert it(The word’s count increment of already existing words will be handled in the insertNew function), otherwise go to the next word.

Hash Table Function

I used the function below to insert the words into the most correct position possible with the help of prime numbers (to avoid collisions as much as possible).

If the word is already in the list, search function increases the word’s count automatically. insertNew function adds the new word with count 1 to the beginning of the linked list.
I did not use another function to get the hash key because it is unnecessary in this case and this is really easy to understand in my opinion.

Getting The Most Used Word From The Linked List

Since both integer value “count” and string value “word” can’t be returned from a function, I used addressing in the function of the list.
After getting the most used word and its count from the currently reading array, hash table’s displayMostUsed function stores them to local arrays if and only if the word’s count is greater than any other count in the local array.
I didn’t have to sort the top 10 array in displayMostUsed function because it shifts the counts towards the end of the array, as the new values come in.

My Output

I ran the program on a laptop with a i7-7700Q CPU, DDR4 SODIMM 16GB(2x8GB) RAM and a 256GB NVMe SSD to get this running time.

Click here to contact me

Ekin Karadag

Pages

25 February 2019

I'm on HackerRank

24 February 2019

Chained Hash Table Data Structure