CSOC 2436: Project: Building a Text Search Engine from Scratch

Introduction:

In this project, you will build a bare-bones text search engine using your knowledge of hashing. Imagine you are a developer at a company that sells thousands of products every day and receives millions of reviews. Your CEO wants you to retrieve all the documents that contain the word "iphone" but not the word "camera", or that contain either the word "amazing" or the word "awesome". That is what we will build here.

Your Task:

Given a list of documents, you have to build the following:

  • A document index: If the given text file has N lines, index them from 1 to N (i.e., line 1 is the document at index 1, line 2 is the document at index 2, etc.). Preserve the original order of these indices, because your query output will report these document indices to indicate which documents contain the queried term.
  • A vocabulary list: A data structure containing all the unique words, all lowercased and stripped of special characters. The words must be sorted in ascending order using a lexicographic sort (i.e., compare the strings character by character from left to right, based on ASCII values).
  • A dictionary: You will build a dictionary where every word is mapped to a unique integer value. You will use the concept of hashing here.
  • A document matrix: You will build a document matrix containing, for each document, the set of hash values of its unique words, sorted in ascending order. You can use heap/quick/merge sort or even selection/insertion/bubble sort here. Keep in mind that this is integer sorting, since you are sorting hash values (integers).
  • Search results: Given a list of queries, you have to provide the list of documents that match each query. The resulting document list must be sorted in ascending order of document indices.
  • You will be tested on the document matrix and the search result only.

    Implementation Detail:

    You will be given two input files in argv[1] and argv[2]. Let's assume we have document.txt and instruction.txt as input. document.txt will contain several lines, each ending with a line break. First, you have to index the documents starting from 1 (not 0) up to N, where N is the total number of lines. Make sure you read every line sequentially.

    Next, you build a vocabulary list containing unique words only. Make sure all words are lowercase; if there are uppercase characters, convert them to lowercase. Since we are not doing advanced parsing, words may contain special characters, e.g., a period, a comma, or a semicolon. In other words, we define a "word" as any entity separated by spaces. You need to make sure your words contain only alphanumerics (i.e., letters a-z and digits 0-9), and remove any other special characters. Also, make sure the vocabulary list is sorted lexicographically.

    Hint: check out the function std::isalnum.
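For illustration only (the function name normalizeWord is our own choice, not a required design), the lowercasing and stripping described above could be sketched as:

```cpp
#include <cctype>
#include <string>

// Normalize a raw token: lowercase every letter and keep only
// alphanumeric characters (a-z, 0-9), dropping punctuation.
std::string normalizeWord(const std::string& raw) {
    std::string out;
    for (unsigned char c : raw) {
        if (std::isalnum(c))
            out.push_back((char)std::tolower(c));
    }
    return out;
}
```

Note that a token made entirely of punctuation (e.g. "...") normalizes to an empty string, which you should skip rather than add to the vocabulary.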

    Next, we will build a dictionary using our knowledge of linear probing in hashing. For each word, we will compute the hash key by summing the ASCII values of its characters and taking the remainder after dividing by the six-digit prime 999,883, which is our "bucket size". Your list of unique words will never exceed this number, which also means you will never fail to place a word in the dictionary. Let's assume we want to find the hash value of the word "game":

    A Simple Hash Function for English words with Collision Handling via Linear Probing:

    int hash = 0;
    const int N = 999883;
    string s = "game";
    for (int i = 0; i < s.length(); i++)
        hash = hash + (int)s[i];
    int hash_value = hash % N; // using the modulo operator

    In case of a collision, you will have to use linear probing, i.e., keep trying i = 0 to N-1 in (hash + i) % N until you find an empty slot.

    Example: Consider the sentence: "Stop spot and post sIlent listen". We will now calculate the hash values for all the words.

  • Step 1: lowercase every letter: "stop spot and post silent listen"
  • Step 2: split the sentence in words
  • Step 3: sort the words alphabetically. This is sorting based on lexicographic string ordering. We have sorted wordlist = ["and", "listen", "post", "silent", "spot", "stop"]
  • Step 4: Now, compute the hash_value for each of them (using the above detailed hash function):
  • hash("and") = (int('a') + int('n') + int('d') ) % N = ( 97 + 110 + 100 ) % 999883 = 307

    Since this is the first word entered, there is no chance of a collision.

    hash("listen") = (int('l') + int('i') + int('s') + int('t') + int('e') + int('n') ) % N = (108 + 105 + 115 + 116 + 101 + 110) % 999883 = 655

    Check if there is a collision in the 655th position. If not, move to the next.

    hash("post") = (int('p') + int('o') + int('s') + int('t')) % N = (112 + 111 + 115 + 116) % 999883 = 454

    Check if there is a collision in the 454th position. If not, move to the next.

    hash("silent") = (int('s') + int('i') + int('l') + int('e') + int('n') + int('t')) % N = (115 + 105 + 108 + 101 + 110 + 116) % 999883 = 655

    Now, there IS a collision at position 655. So we check for an available position at (hash + i) % N, for i from 1 to N-1. At i=1, we find an available position:

    hash("silent") = (655 + 1) % 999883 = 656

    For the next word, "spot", we will again have a collision, and i=1 will resolve it:

    hash("spot") = (454 + 1) % 999883 = 455

    Now, for the next word, "stop", we will have a collision, but i=1 will not resolve it, since that slot is occupied by "spot". So we check i=2 and find an available position:

    hash("stop") = (454 + 2) % 999883 = 456
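To make the walkthrough concrete, here is a small self-contained sketch of the sum-of-ASCII hash with linear probing. The table layout and the names baseHash, probeInsert, and hashAll are our own choices for illustration, not a required design:

```cpp
#include <string>
#include <vector>

const int BUCKETS = 999883; // bucket size from the spec

// Sum the ASCII values of the characters, then take the remainder.
int baseHash(const std::string& s) {
    int h = 0;
    for (char c : s) h += (int)c;
    return h % BUCKETS;
}

// Insert `word` with linear probing and return the slot (hash value)
// it lands in. Slots holding "" are treated as empty. If the word is
// already present, its existing slot is returned instead.
int probeInsert(std::vector<std::string>& table, const std::string& word) {
    int h = baseHash(word);
    for (int i = 0; i < BUCKETS; i++) {
        int slot = (h + i) % BUCKETS;
        if (table[slot].empty()) { table[slot] = word; return slot; }
        if (table[slot] == word) return slot; // already in the dictionary
    }
    return -1; // table full (cannot happen per the spec)
}

// Hash a sorted word list in order and return the assigned slots.
std::vector<int> hashAll(const std::vector<std::string>& words) {
    std::vector<std::string> table(BUCKETS);
    std::vector<int> slots;
    for (const auto& w : words) slots.push_back(probeInsert(table, w));
    return slots;
}
```

Running hashAll on the sorted example word list reproduces the values derived above: 307 for "and", 655 for "listen", 454 for "post", then 656, 455, and 456 after the collisions are resolved.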

    Next, you will build the document matrix, and this is the first output on which you will be evaluated. The document matrix will contain each document's words, not as strings but as their hash values. Do not include repeated values (i.e., treat each document as a "set of unique words"), and sort the hash values in ascending order.
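As an illustrative sketch (the function name matrixRow and the formatting are our own choices), one way to sort a document's hash values with a simple insertion sort and render one matrix row is:

```cpp
#include <string>
#include <vector>

// Sort one document's hash values ascending with insertion sort
// (no <algorithm> header) and format the row as "index->[v1, v2, ...]".
std::string matrixRow(int docIndex, std::vector<int> hashes) {
    for (size_t i = 1; i < hashes.size(); i++) {
        int key = hashes[i];
        size_t j = i;
        while (j > 0 && key < hashes[j - 1]) { hashes[j] = hashes[j - 1]; j--; }
        hashes[j] = key;
    }
    std::string row = std::to_string(docIndex) + "->[";
    for (size_t i = 0; i < hashes.size(); i++) {
        if (i) row += ", ";
        row += std::to_string(hashes[i]);
    }
    return row + "]";
}
```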

    Finally, you need to carry out the instructions provided in the second input file (instruction.txt). You need to account for three operations: "AND", "OR", and "NOT". Each instruction will involve at most two words.
    One word: If only one word is given as the instruction, print out all the document indices that contain the word (of course, you will first compute the hash value of the word).
    Two words: There are multiple cases for the two-word scenario, each with different operations. Assume we have two words in our query: w1 and w2
  • w1 AND w2: List all the document indices that contain both words w1 and w2 in them
  • w1 OR w2: List all the document indices that contain either word w1 or w2 in them
  • w1 AND (NOT w2): List all the document indices that contain the word w1 but do not contain w2
  • (NOT w1) AND (NOT w2): List all the document indices that contain neither w1 nor w2
  • (NOT w1) OR (NOT w2): List all the document indices that do not contain w1 or do not contain w2 (or both)
  • Please note, the final list of results of the AND, OR, NOT operations MUST be provided in the ascending order of document indices.

    Additionally, the following details need to be adhered to in order to ensure correctness:

    1. Use the hash function provided above with linear probing and bucket size 999,883.
    2. Compute the hash values of the words (strings) of all documents in the original sequence of their document indices (i.e., compute the hash values of the words in document 1 (line 1 refers to document 1), then document 2 (line 2 refers to document 2), then document 3 (line 3), etc.). The order is critical: hash values of strings are resolved by linear probing, so the sequence of hashing matters in order to obtain the correct hash values. Therefore, we hash the words of the documents in the original sequence of their document indices. The sequence of words within a document also matters, which is taken care of by sorting the words in a document prior to hashing (see point 3 below).
    3. To compute the hash values of the words (strings) in a document, first sort all words in that document as a set of strings (i.e., no duplicate words). Refer to Step 3 above; you can directly compare string objects when sorting, as string objects natively follow lexicographic ordering. Then apply linear probing with the hash function provided above so that the hash values match. You can use std::set if needed; however, since the words (strings) in a document are sorted first, removing duplicates is easy and can be done with a simple scan.
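Since the algorithm header is off-limits, the sort-then-deduplicate step in point 3 can be hand-rolled. A sketch using insertion sort and one scan (the function name sortedUniqueWords is our own choice):

```cpp
#include <string>
#include <vector>

// Insertion-sort a document's words lexicographically (std::string's
// operator< compares character by character), then drop adjacent
// duplicates with a single scan -- no <algorithm> header needed.
std::vector<std::string> sortedUniqueWords(std::vector<std::string> words) {
    for (size_t i = 1; i < words.size(); i++) {
        std::string key = words[i];
        size_t j = i;
        while (j > 0 && key < words[j - 1]) {
            words[j] = words[j - 1];
            j--;
        }
        words[j] = key;
    }
    std::vector<std::string> unique;
    for (size_t i = 0; i < words.size(); i++)
        if (unique.empty() || unique.back() != words[i])
            unique.push_back(words[i]);
    return unique;
}
```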


    Input and Output

    The program will take a document file as input. Additionally, there will be an instruction file telling you which search operations to perform. The sample input and output data with queries are available here. We also provide a complete walkthrough of the first example input and queries below.

    For example, consider doc1.txt and ins1.txt as input, where doc1.txt contains
    This is just a test.
    We will build a search engine.
    It will be a bit pristine, but trust me, it's fun.
    Some fun but not it will still be graded.
    and ins1.txt contains
    a
    will AND a
    fun OR pristine
    (NOT be) AND a
    engine
    Your output will consist of document_matrix.txt and instruction_output.txt. The document_matrix.txt file will contain the document indices and the corresponding sorted lists of word hash values.
    1->[97, 220, 440, 448, 454]
    2->[97, 222, 441, 528, 630, 631]
    3->[97, 199, 210, 221, 319, 329, 331, 336, 441, 578, 878]
    4->[199, 221, 329, 331, 337, 436, 441, 552, 615]
    And instruction_output.txt will have the output of each instruction.
    a-->[1, 2, 3]
    will AND a-->[2, 3]
    fun OR pristine-->[3, 4]
    (NOT be) AND a-->[1, 2]
    engine-->[2]

    The program takes 4 arguments. 1st arg: document list input file name; 2nd arg: query instruction file name; 3rd arg: file name where you write the document matrix; 4th arg: file name where you write the query results from the instruction file.
    For example:

    $ ./PROJ1 doc1.txt ins1.txt doc1_mat.txt ins1_out.txt
    

    Others

  • There will be ONLY three operations: AND, OR, NOT
  • AND and OR operation will not appear together
  • Every NOT operation will be wrapped in a pair of parentheses, one opening and one closing. There will not be any other parentheses
  • You are NOT allowed to use algorithm or map header files for this project.

    Run the program on Linux:

    Create a directory on the Linux server; its name must be proj1

    $ mkdir proj1
    

    Change your current directory to proj1

    $ cd proj1
    

    Run the shell script to test the program

    $ sh test_cpp.sh 
    

    Algorithm for operations

    You must perform AND operations in no more than O(n+m) time, where the two terms being combined via boolean AND have posting lists of m and n documents. We explain how to perform the AND operation in O(n+m) below. The same algorithm can be adapted for the OR and NOT operations. OR should also be done in O(m+n); NOT should not exceed O(N), where N is the total number of documents.

        p1 = posting list of documents (document indices) that contain the first word (sorted)
        p2 = posting list of documents (document indices) that contain the second word (sorted)
    
        p1 AND p2
        1.  answer ← { }
        2.  while p1 ≠ NULL AND p2 ≠ NULL
        3.      do if docID(p1) = docID(p2) then
        4.          ADD(answer, docID(p1))
        5.          p1 ← next(p1)
        6.          p2 ← next(p2)
        7.      else if docID(p1) < docID(p2) then
        8.          p1 ← next(p1)
        9.      else
        10.         p2 ← next(p2)
        11. return answer
    

    The OR operation should be done in a similar way with O(m+n) complexity.