CS 337: Lab 4

Inverted Indexes

At the core of any search engine is an indexing system; most commonly this takes the form of an "inverted index". In this lab you will implement and test the algorithms described in Chapter 2 of MacCormack for building and using an inverted index.

Data

To give you something to index, I have created a data set with the following characteristics. There are 1373 files that correspond, roughly, to the chapters in a collection of books by Edward Gibon, Walter Scott, and Jane Austen. The files are named [Gibon,Scott,Austen]One_NUMBER.txx, where NUMBER is a number; for instance, GibonOne_23.txx. The numbers are: Gibon 1..482; Scott 1..366; Austen 1..414. These files are available on the web at
    https://cs.brynmawr.edu/cs337/Lab04Data/FILENAME
where FILENAME is a name as described above. In Java and Go it is quite easy to read a file from the web; I assume it is similarly easy in Python. In C, not so much (you are better off reading a local copy; see below).

Perhaps better, these files are also on the lab machines at

    /home/gtowell/Public/337/Lab04Data/FILENAME
If you want to make a local copy of all of these files, the easiest thing will be to do the following:
    scp YOURUNIXNAME@goldengate.cs.brynmawr.edu:/home/gtowell/Public/337/lab04data.tgz . 
    tar xvf lab04data.tgz
I know that Macs have the tar command; I am not sure about Windows.

Each file is arranged with exactly one word per line. All punctuation has been removed; this created some new words, as removing "-" split things in ways the authors might not have intended. Note that some of these files are very short: the program I wrote to make the files did not understand things like the table of contents. Also, the program made a bunch of mistakes; for instance, in one case it turned the words "made" and "when" into the single string "madewhen".

Here is a complete Java class for reading the contents of a file (formatted as described above) into an ArrayList. You are not required to use Java in this lab; I simply make this code available as an example. If you use this class as a starting point, it will need adjustments depending on how you choose to use it.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.ArrayList;
    
    public class GGReader {
        // Read a one-word-per-line file into a list of words.
        public ArrayList<String> readFile(String fileName) {
            ArrayList<String> words = new ArrayList<>();
            try (BufferedReader br = new BufferedReader(new FileReader(fileName))) {
                String line;
                while ((line = br.readLine()) != null) {
                    words.add(line.trim());
                }
            } catch (Exception ee) {
                System.err.println(ee);
            }
            return words;
        }

        public static void main(String[] args) {
            GGReader myReader = new GGReader();
            ArrayList<String> words = myReader.readFile(args[0]);
            System.out.println(words);
        }
    }

Inverted Indices

An inverted index is a data structure that makes it easy to determine all of the documents that contain a given word. The idea is that for each word you keep a record of every document containing that word. MacCormack also recommends keeping track of the position of every occurrence of a word in the document; you will use locations, so do keep track of them. In Java you might set up a class something like the following to store the document id / location pairs.
        class WordDocumentLoc {
            String documentId;  // e.g., the file name
            int wordLoc;        // location (line number) of the word within the file
        }
    
For each occurrence of a word (case insensitive), create an instance of this class to hold the id of the document in which the word occurred (you can use the file name as the document ID) and the location of the word within that file. The location should be the line number. For instance, if you start counting line numbers at 1, then in AustenOne_30.txx "possible" would have one entry with a wordLoc of 6 and a second entry with a wordLoc of 559. In the same file, there would be 226 entries for the word "the", the first of which is at position 47.

Expanding this to the entire dataset, here are {document id, word location} pairs for two words: caligula and bonnie.
caligula [{GibonOne_275.txx 4043} {GibonOne_275.txx 4938} {GibonOne_276.txx 446} {GibonOne_276.txx 550} {GibonOne_276.txx 948} {GibonOne_276.txx 1100} {GibonOne_276.txx 4379} {GibonOne_285.txx 2018} {GibonOne_285.txx 2228} {GibonOne_306.txx 1031} {GibonOne_311.txx 1225} {GibonOne_332.txx 280} {GibonOne_399.txx 2853} {GibonOne_432.txx 1712} {GibonOne_448.txx 729} {GibonOne_448.txx 3538} {GibonOne_451.txx 3804} {GibonOne_451.txx 3823} {GibonOne_472.txx 178} {ScottOne_21.txx 7004} {ScottOne_21.txx 7021} {ScottOne_25.txx 1245} {ScottOne_25.txx 1268} {ScottOne_368.txx 1176}]
bonnie [{ScottOne_207.txx 41813} {ScottOne_29.txx 847} {ScottOne_291.txx 949} {ScottOne_311.txx 5611} {ScottOne_317.txx 145} {ScottOne_33.txx 420} {ScottOne_362.txx 1791} {ScottOne_385.txx 1334} {ScottOne_386.txx 2127} {ScottOne_392.txx 2092} {ScottOne_393.txx 13225} {ScottOne_45.txx 2162}]
Examining this data, "caligula" appeared twice in document GibonOne_275.txx and 5 times in document GibonOne_276.txx. "caligula" appeared in 11 other documents, never more than twice. "bonnie" never appeared more than once in any document and did not appear in any document that contained "caligula".

With all of this said, the algorithm for building an inverted index is as follows:
Given: F : a set of document IDs, one document per file 

Return: a map containing an inverted index. Keys in the map are words; values are lists of instances of something like WordDocumentLoc 

Let: M : a map -- initially empty.  

ForEach filename in F 
    let wordloc = 0
    ForEach w in [the words in the file named filename]
        add to M, for w, a new record {filename, wordloc} 
        set wordloc = wordloc + 1
return M
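
The pseudocode above translates almost line for line into Java. Here is a minimal sketch; the class and method names are my own, not required by the lab:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexBuilder {
    // One posting: which document the word occurred in, and where.
    static class WordDocumentLoc {
        String documentId;
        int wordLoc;
        WordDocumentLoc(String documentId, int wordLoc) {
            this.documentId = documentId;
            this.wordLoc = wordLoc;
        }
    }

    // Build an inverted index from a map of documentId -> list of words.
    static Map<String, List<WordDocumentLoc>> buildIndex(Map<String, List<String>> docs) {
        Map<String, List<WordDocumentLoc>> index = new HashMap<>();
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            int wordLoc = 0;
            for (String w : doc.getValue()) {
                String key = w.toLowerCase();   // the index is case insensitive
                index.computeIfAbsent(key, k -> new ArrayList<>())
                     .add(new WordDocumentLoc(doc.getKey(), wordLoc));
                wordLoc++;
            }
        }
        return index;
    }
}
```

In a real implementation, each document's word list would come from something like the GGReader class above.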
    
Question: How much memory does your inverted index use? This calculation might be tricky: neither Java nor Python will tell you, just by asking, how many bytes are in a data structure (Go will, but you need to worry a lot about pointers). Compare that to the original data. You may ignore overhead, for example, the space required for just the class in Java. How does the size of the inverted index compare to that of the data being indexed? What can you do to reduce the size of the inverted index? (The total space used on disk by the data is 44.6M.)

Using an Inverted Index

Getting a list of all the documents containing a word is trivial: ask the map for the list associated with the word, then collect the unique document IDs from that list. Similarly, getting the documents in which a set of words all occur is simply a matter of computing the intersection of the unique document IDs for each word. The intersection can be computed efficiently with the following algorithm:
    Given: W a set of words 
         : II an inverted index as above
    Return: a list of document ids for documents containing all of the words in W

    Let DI : a map from document ID to count (initially empty)
    ForEach w in W 
        Let FW = the list of WordDocumentLoc objects for w from II
        Let UF = the unique document IDs in FW

        ForEach uf in UF
            if uf is NOT in DI 
                add uf to DI with a count of 1
            else
                update uf in DI by incrementing its count
    Let R : a list of document IDs, initially empty
    ForEach docID, count in DI
        if count == len(W)
            add docID to R
    return R 
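
Here is a sketch of this intersection in Java. To keep it self-contained, the index here maps each word to a list of document IDs (one per occurrence, i.e., the documentId fields of the WordDocumentLoc records) rather than to the full records; the counting logic is the same:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class Intersector {
    // Return the IDs of documents that contain every word in words.
    static List<String> docsContainingAll(Map<String, List<String>> index,
                                          List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            List<String> postings = index.getOrDefault(w, List.of());
            Set<String> unique = new HashSet<>(postings);  // unique doc IDs for w
            for (String docId : unique) {
                counts.merge(docId, 1, Integer::sum);      // +1, or insert with 1
            }
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() == words.size()) {            // seen for every query word
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```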

Sorting

The only remaining issue is that a simple listing of all of the documents that contain all of the words is difficult to use because the list is unordered. We would like to present the documents in an order such that "better" documents come first. "Better" is hard to define precisely, but for this lab we will define it in terms of the distance between instances of each word: documents in which the closest pair of words is closer together are "better".

Initially, consider this question only for pairs of words. A brute-force algorithm for finding the closest pair of words is as follows:
    Given: W1: a list of the locations of word 1 in a document
           W2: a list of the locations of word 2 in the same document
    Return: the smallest difference between a number in W1 and a number in W2.
    
    Let: d = abs(first item in W1 - first item in W2)
    ForEach w1 in W1 
        ForEach w2 in W2 
           if abs(w1-w2) < d 
              d = abs(w1-w2)
    return d
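
As Java, this brute-force search might look like the following (locations as int arrays, both assumed non-empty):

```java
public class BruteClosest {
    // Smallest absolute difference between any element of w1 and any element of w2.
    static int closest(int[] w1, int[] w2) {
        int d = Math.abs(w1[0] - w2[0]);
        for (int a : w1) {
            for (int b : w2) {
                d = Math.min(d, Math.abs(a - b));  // O(len1 * len2) comparisons
            }
        }
        return d;
    }
}
```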
The following algorithm is far quicker than the brute-force one above:
    Given: W1: a list of the locations of word 1 in a document
           W2: a list of the locations of word 2 in the same document
    Return: the smallest difference between a number in W1 and a number in W2.

    sort the contents of W1
    sort the contents of W2
    
    Let: idx1 = 0 // index in W1
         idx2 = 0 // index in W2
         len1 = length of W1
         len2 = length of W2
         d    = abs(W1[idx1] - W2[idx2]) // distance between the first items in W1 and W2

    While idx1<len1 and idx2<len2
        if abs(W1[idx1]-W2[idx2]) < d
           d = abs(W1[idx1]-W2[idx2])
        if W1[idx1] < W2[idx2]
           idx1++
        else
           idx2++
    return d
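
A Java sketch of the sorted, two-pointer version:

```java
import java.util.Arrays;

public class FastClosest {
    // Smallest absolute difference between any element of w1 and any element of w2.
    // O(n log n) for the sorts plus O(len1 + len2) for the scan.
    static int closest(int[] w1, int[] w2) {
        w1 = w1.clone();   // avoid mutating the caller's arrays
        w2 = w2.clone();
        Arrays.sort(w1);
        Arrays.sort(w2);
        int idx1 = 0, idx2 = 0;
        int d = Math.abs(w1[0] - w2[0]);
        while (idx1 < w1.length && idx2 < w2.length) {
            d = Math.min(d, Math.abs(w1[idx1] - w2[idx2]));
            if (w1[idx1] < w2[idx2]) {
                idx1++;    // advance the pointer at the smaller value
            } else {
                idx2++;
            }
        }
        return d;
    }
}
```

After sorting, each list is scanned at most once, which is why this is so much faster than the brute-force version on long location lists.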
Putting all of this together, the goal is to take a pair of words and present a list of all of the documents containing both words, sorted by the proximity of the closest pair. Here are a couple of examples. Each example starts with a pair of words (in []), followed by zero or more rows of data. Each row contains an index, the distance between the closest pair of the words in that document, and a document ID (file name). The rows are sorted by minimum distance, ascending.
[elizabeth emma]
  1    16 AustenOne_23.txx
  2    17 AustenOne_22.txx
  3    40 AustenOne_4.txx
  4    87 AustenOne_7.txx
  5  2646 ScottOne_320.txx
  6  3062 AustenOne_180.txx

[roy clan]
  1     4 ScottOne_277.txx
  2    60 ScottOne_316.txx
  3    69 ScottOne_207.txx
  4   173 ScottOne_311.txx
  5   187 ScottOne_393.txx
  6   239 ScottOne_306.txx
  7   360 ScottOne_313.txx
  8   740 ScottOne_336.txx
  
[legend legion]
  1    49 GibonOne_423.txx
  2   816 ScottOne_178.txx
  3  3647 GibonOne_270.txx
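
Once each matching document has been reduced to a (minimum distance, document ID) pair, producing tables like those above is just a sort and a formatted print. A sketch, using my own (hypothetical) Result class:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ResultPrinter {
    // One (minimum distance, document ID) pair.
    static class Result {
        int distance;
        String docId;
        Result(int distance, String docId) {
            this.distance = distance;
            this.docId = docId;
        }
    }

    // Sort by minimum distance, ascending, and format each row like the
    // tables above: index, distance, document ID.
    static List<String> format(List<Result> results) {
        List<Result> sorted = new ArrayList<>(results);
        sorted.sort(Comparator.comparingInt((Result r) -> r.distance));
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            Result r = sorted.get(i);
            lines.add(String.format("%3d %5d %s", i + 1, r.distance, r.docId));
        }
        return lines;
    }
}
```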
  

Questions

  • How can this algorithm be used to find phrases? E.g., find only documents that contain the phrase "elizabeth hated" rather than just the documents that contain both of the words elizabeth and hated.
  • Without using information outside of the document (for instance, PageRank) how can the definition of "better" be improved?
  • Many systems choose to ignore very common words ("the", "of", ...). Ignoring the problems that doing so creates (i.e., what to do if the query is "the of"), why is this a good idea?

Extra Credit

(The extra credit does not require actually implementing this algorithm.) Below is an algorithm for extending the above minimum distance finder to handle N words. Describe the algorithm in sufficient detail that a reader can understand it. Be sure that your description addresses at least the following questions:
  • What is the time complexity of this algorithm? (It sucks.)
  • What can you do to improve the running time of the algorithm (perhaps while not changing the worst case time complexity)?
  • Are the numbers 0 and -1 "magic" in the line "best = closest(I, W, 0, d, -1, -1)"?
  • Why is the algorithm recursive? (I.e., can you write it non-recursively without using a stack?)
Illustrative examples will be useful.

Improvements are certainly possible. For instance, on my computer and on this dataset, a query for [rob rich the and] took about 223 seconds. After some optimization -- but without changing the basic algorithm -- the same query took about 60 milliseconds (yes, a speedup of about 4000 times!).

Given: I : an inverted index
       W : a list of words
       D : a list of document IDs for all documents in the collection that contain all of the words.
Return: a list containing [distance, document ID] where distance is the minimum length of a subsequence of the document that contains all of the words in W 

func findD(I, W, D)
    let R : be an initially empty list
    for d in D
        best = closest(I, W, 0, d, -1, -1)
        add to R [best, d]
    return R

------------------------------------
Given: I : an inverted index
       W : a list of words
       wi : location in W of the next word to use 
       d : a document ID
       lower : the lower bound 
       upper : the upper bound 
Return : the length of the smallest span of d containing one occurrence of each word in W
func closest(I,W,wi,d,lower,upper)
    if len(W) <= wi
        return upper-lower 
    wl = locations of W[wi] in d extracted from I
    best = length of d (in words)
    for wwll in wl
        let tl=lower 
        let tu=upper
        if wwll < tl or tl < 0
            tl=wwll 
        if wwll > tu
            tu=wwll
        let q = closest(I, W, wi+1, d, tl, tu)
        if q < best
            best = q
    return best
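
For concreteness, here is one way the recursive closest routine might look in Java for a single document, with that document's locations passed in as a map from word to positions. The parameter names, and passing the document length explicitly instead of looking it up, are my own choices:

```java
import java.util.List;
import java.util.Map;

public class NWordClosest {
    // Direct translation of the recursive pseudocode for one document.
    // locs maps each query word to its locations in that document;
    // docLength (the document's length in words) bounds the best span.
    static int closest(Map<String, int[]> locs, List<String> words,
                       int wi, int lower, int upper, int docLength) {
        if (wi >= words.size()) {
            return upper - lower;   // span covering one location of each word
        }
        int best = docLength;
        for (int loc : locs.get(words.get(wi))) {
            int tl = lower, tu = upper;
            if (loc < tl || tl < 0) tl = loc;  // widen the window downward
            if (loc > tu) tu = loc;            // widen the window upward
            int q = closest(locs, words, wi + 1, tl, tu, docLength);
            if (q < best) best = q;
        }
        return best;
    }
}
```

The initial call mirrors the pseudocode: closest(locs, words, 0, -1, -1, docLength).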

What to hand in

Write a report describing the problem of finding words, phrases, or sets of words in a large collection of documents. The report should describe all of the algorithms and data structures you employed -- in a language-agnostic manner -- and why those data structures and algorithms are required. Well-chosen examples will strengthen the report. Further, discuss topics like the implications of the size of the inverted index for a company like Google.

The target audience of this report should be someone familiar with programming but not the indexing problem.

Include, in an appendix, your implementations of each of the algorithms described above (or at least, those algorithms that you used). You need not include boilerplate code.

Include, as a separate appendix, your results for the following queries:

  1. all good --- for this query just show documents in which the words are at a distance of 1
  2. north heroism
  3. mary contrary