/home/gtowell/Public/CS337/lab04data.tgz

Make a local copy of this file as you did for the words file in the last lab. Then unpack the file:
tar xvzf lab04data.tgz

I know that Macs have the tar command; I am not sure about Windows. Alternately, if you are working on the lab computers you can use the files from my public space. Namely,
/home/gtowell/Public/CS337/Lab04Data/*

Each file is arranged with exactly one word per line. All punctuation has been removed. This created some new words, as removing "-" split hyphenated words in ways the authors might not have intended. Note that some of these files are very short; the program I wrote to make the files did not understand things like the table of contents. The program also made some mistakes; for instance, in one case it turned the words "made" and "when" into the single string "madewhen". Here is a complete Java class for reading the contents of a file (formatted as described above) into an ArrayList. You are not required to use Java in this lab; I simply make this code available as an example. If you use this class as a starting point, it will need adjustments depending on how you choose to use it.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;

public class GGReader {
    public ArrayList<String> readFile(String fileName) {
        ArrayList<String> aaa = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new FileReader(fileName))) {
            String line;
            while ((line = br.readLine()) != null) {
                aaa.add(line.trim());
            }
            return aaa;
        } catch (Exception ee) {
            System.err.println(ee);
            return aaa;
        }
    }

    public static void main(String[] args) {
        GGReader myReader = new GGReader();
        ArrayList<String> words = myReader.readFile(args[0]);
        System.out.println(words);
    }
}

Note that this code is exactly the code that I gave last week for reading the dictionary.
class WordDocumentLoc {
    String documentId;
    int wordLoc;
}

For each occurrence of a word, case insensitive, create an instance of this class to hold the ID of the document in which the word occurred (you can use the file name as the document ID) and the location of the word within that file. The location within the file should be the line number. For instance, if you start counting line numbers at 1, then in AustenOne_30.txx "possible" would have a wordLoc of 6 and a second entry with a wordLoc of 559. In the same file, there would be 226 entries for the word "the", the first of which is at position 47. Expanding this to the entire dataset, here are {document id, word location} pairs for two words: caligula and bonnie.
caligula [{GibonOne_275.txx 4043} {GibonOne_275.txx 4938} {GibonOne_276.txx 446} {GibonOne_276.txx 550} {GibonOne_276.txx 948} {GibonOne_276.txx 1100} {GibonOne_276.txx 4379} {GibonOne_285.txx 2018} {GibonOne_285.txx 2228} {GibonOne_306.txx 1031} {GibonOne_311.txx 1225} {GibonOne_332.txx 280} {GibonOne_399.txx 2853} {GibonOne_432.txx 1712} {GibonOne_448.txx 729} {GibonOne_448.txx 3538} {GibonOne_451.txx 3804} {GibonOne_451.txx 3823} {GibonOne_472.txx 178} {ScottOne_21.txx 7004} {ScottOne_21.txx 7021} {ScottOne_25.txx 1245} {ScottOne_25.txx 1268} {ScottOne_368.txx 1176}]

bonnie [{ScottOne_207.txx 41813} {ScottOne_29.txx 847} {ScottOne_291.txx 949} {ScottOne_311.txx 5611} {ScottOne_317.txx 145} {ScottOne_33.txx 420} {ScottOne_362.txx 1791} {ScottOne_385.txx 1334} {ScottOne_386.txx 2127} {ScottOne_392.txx 2092} {ScottOne_393.txx 13225} {ScottOne_45.txx 2162}]

Examining this data, "caligula" appeared twice in document GibonOne_275.txx and 5 times in document GibonOne_276.txx. "caligula" appeared in 11 other documents, never more than twice. "bonnie" never appeared more than once in any document and did not appear in any document that contained "caligula". With all of this said, the algorithm for building an inverted index is as follows:
Given:  F, a set of document IDs, one document per file
Return: A map containing an inverted index.
        Keys in the map are words. Values are a list of instances
        of something like WordDocumentLoc.

Let M : a map -- initially empty
ForEach filename in F
    let wordloc = 0
    ForEach w in [the words in filename]
        add to M for w a new record {filename, wordloc}
        set wordloc = wordloc + 1
return M

Question: How much memory does your inverted index use? This calculation might be tricky; neither Java nor Python will tell you just by asking how many bytes are in a data structure (Go will, but you need to worry a lot about pointers). Compare that to the original data. You may ignore overhead, for example, the space required for just the class in Java. How does the size of the inverted index compare to that of the data being indexed? What can you do to reduce the size of the inverted index? (The total space used on disk by the data is 44.6M.)
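The pseudocode above can be sketched in Java roughly as follows. This is a minimal illustration, not the required implementation: the class name IndexBuilder is my own, and to keep the sketch self-contained it builds the index from in-memory word lists rather than reading files (you would feed it the output of something like GGReader).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexBuilder {
    // One occurrence of a word: the document it appeared in and its position there.
    static class WordDocumentLoc {
        final String documentId;
        final int wordLoc;
        WordDocumentLoc(String documentId, int wordLoc) {
            this.documentId = documentId;
            this.wordLoc = wordLoc;
        }
    }

    // Build an inverted index from a map of document ID -> list of words.
    static Map<String, List<WordDocumentLoc>> build(Map<String, List<String>> docs) {
        Map<String, List<WordDocumentLoc>> index = new HashMap<>();
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            int wordLoc = 0;                       // position of the word within this document
            for (String w : doc.getValue()) {
                String key = w.toLowerCase();      // the index is case insensitive
                index.computeIfAbsent(key, k -> new ArrayList<>())
                     .add(new WordDocumentLoc(doc.getKey(), wordLoc));
                wordLoc++;
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Tiny made-up documents, just to exercise the builder.
        Map<String, List<String>> docs = new HashMap<>();
        docs.put("a.txx", List.of("The", "cat", "sat"));
        docs.put("b.txx", List.of("the", "dog"));
        Map<String, List<WordDocumentLoc>> index = build(docs);
        System.out.println(index.get("the").size()); // "the" occurs once in each document
    }
}
```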
Given:  W, a set of words
        II, an inverted index as above
Return: a list of document IDs for documents containing all of the words in W

Let DI : a map from document ID to number (initially empty)
ForEach w in W
    Let FW be the list of WordDocumentLoc objects for w from II
    Let UF = the unique document IDs in FW
    ForEach uf in UF
        if uf is NOT in DI
            add uf to DI with a count of 1
        else
            update uf in DI by incrementing its value
Let R : a list of document IDs, initially empty
for docID, count in DI
    if count == len(W)
        add docID to R
return R
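One way this counting scheme might look in Java is below. To keep the sketch short, I assume a simplified index mapping each word directly to a list of document IDs (one entry per occurrence) instead of WordDocumentLoc objects; the counting logic is the same either way. The class name AndQuery is my own.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AndQuery {
    // index maps word -> list of document IDs, one entry per occurrence.
    static List<String> containingAll(Map<String, List<String>> index, List<String> words) {
        Map<String, Integer> counts = new HashMap<>(); // doc ID -> number of query words it contains
        for (String w : words) {
            // Deduplicate so a word occurring many times in one document counts once.
            Set<String> uniqueDocs = new HashSet<>(index.getOrDefault(w, List.of()));
            for (String doc : uniqueDocs) {
                counts.merge(doc, 1, Integer::sum);
            }
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() == words.size()) {    // the document contained every query word
                result.add(e.getKey());
            }
        }
        Collections.sort(result);                  // deterministic output order
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> index = new HashMap<>();
        index.put("roy", List.of("ScottOne_277.txx", "ScottOne_316.txx"));
        index.put("clan", List.of("ScottOne_277.txx", "GibonOne_1.txx"));
        System.out.println(containingAll(index, List.of("roy", "clan")));
    }
}
```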
Given:  W1, a list of the locations of word 1 in a document
        W2, a list of the locations of word 2 in the same document
Return: the smallest difference between a number in W1 and a number in W2

Let d = abs(first item in W1 - first item in W2)
ForEach w1 in W1
    ForEach w2 in W2
        if abs(w1-w2) < d
            d = abs(w1-w2)
return d

The following algorithm is far quicker than the brute force one above:
Given:  W1, a list of the locations of word 1 in a document
        W2, a list of the locations of word 2 in the same document
Return: the smallest difference between a number in W1 and a number in W2

sort the contents of W1
sort the contents of W2
Let idx1 = 0    // index in W1
    idx2 = 0    // index in W2
    len1 = length of W1
    len2 = length of W2
    d = abs(W1[idx1] - W2[idx2])   // distance between the first items in W1 and W2
While idx1 < len1 and idx2 < len2
    if abs(W1[idx1] - W2[idx2]) < d
        d = abs(W1[idx1] - W2[idx2])
    if W1[idx1] < W2[idx2]
        idx1++
    else
        idx2++
return d

Putting all of this together, the goal is to take a pair of words and present a list of all of the documents containing both words, with the list sorted by proximity of the closest pair. Here are a couple of examples. Each example starts with a pair of words (contained in []). This is followed by zero or more rows of data. Each row contains: an index, the distance between the closest pair of items in the document, and a document ID (file name). The rows are sorted by minimum distance between a pair of words in the same document, ascending.
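The two-pointer version above translates almost directly into Java. This is a sketch under my own naming (MinDistance, minDiff); the key idea is that after sorting, only the pointer at the smaller current value needs to advance, since moving the larger one could only widen the gap.

```java
import java.util.Arrays;

public class MinDistance {
    // Smallest absolute difference between any element of w1 and any element of w2.
    static int minDiff(int[] w1, int[] w2) {
        int[] a = w1.clone();                  // avoid mutating the caller's arrays
        int[] b = w2.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        int idx1 = 0, idx2 = 0;
        int d = Math.abs(a[0] - b[0]);         // distance between the first items
        while (idx1 < a.length && idx2 < b.length) {
            d = Math.min(d, Math.abs(a[idx1] - b[idx2]));
            // Advance the pointer at the smaller value; advancing the other
            // pointer could only increase the difference.
            if (a[idx1] < b[idx2]) idx1++; else idx2++;
        }
        return d;
    }

    public static void main(String[] args) {
        // Made-up locations: the closest pair is 6 and 47, distance 41.
        System.out.println(minDiff(new int[]{6, 559}, new int[]{47, 600}));
    }
}
```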
[elizabeth emma]
1    16  AustenOne_23.txx
2    17  AustenOne_22.txx
3    40  AustenOne_4.txx
4    87  AustenOne_7.txx
5  2646  ScottOne_320.txx
6  3062  AustenOne_180.txx

[roy clan]
1    4  ScottOne_277.txx
2   60  ScottOne_316.txx
3   69  ScottOne_207.txx
4  173  ScottOne_311.txx
5  187  ScottOne_393.txx
6  239  ScottOne_306.txx
7  360  ScottOne_313.txx
8  740  ScottOne_336.txx

[legend legion]
1    49  GibonOne_423.txx
2   816  ScottOne_178.txx
3  3647  GibonOne_270.txx
Improvements are certainly possible. For instance, on my computer and on this dataset, a query for [rob rich the and] took about 223 seconds. After some optimization -- but without changing the basic algorithm -- the same query took about 60 milliseconds. (yes, a speedup of about 4000 times!)
Given:  I : an inverted index
        W : a list of words
        D : a list of document IDs for all documents in the collection that
            contain all of the words
Return: a list containing [distance, document ID] where distance is the minimum
        length of a subsequence of the document that contains all of the words in W

func findD(I, W, D)
    let R : an initially empty list
    for d in D
        best = closest(I, W, 0, d, -1, -1)
        add to R [best, d]
    return R

------------------------------------

Given:  I : an inverted index
        W : a list of words
        wi : location in W of the next word to use
        d : a document ID
        lower : the lower bound
        upper : the upper bound
Return: a distance apart of the set of words

func closest(I, W, wi, d, lower, upper)
    if len(W) <= wi
        return upper - lower
    wl = locations of W[wi] in d extracted from I
    best = length of d (in words)
    for wwll in wl
        let tl = lower
        let tu = upper
        if wwll < tl or tl < 0
            tl = wwll
        if wwll > tu
            tu = wwll
        let q = closest(I, W, wi+1, d, tl, tu)
        if q < best
            best = q
    return best
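The recursive closest routine might look like the sketch below. To stay self-contained it takes, for one document, a list of location lists (one per query word) rather than looking them up in the index, and it uses Integer.MAX_VALUE in place of "length of d (in words)"; both are simplifications of my own, not part of the specification above.

```java
import java.util.List;

public class Span {
    // Minimum width of a window containing at least one location of each word.
    // locs.get(wi) is the list of positions of word wi in the document.
    // lower/upper start at -1, meaning "no location chosen yet".
    static int closest(List<List<Integer>> locs, int wi, int lower, int upper) {
        if (wi >= locs.size()) {
            return upper - lower;              // one location chosen per word; report the span
        }
        int best = Integer.MAX_VALUE;          // stand-in for "length of d (in words)"
        for (int p : locs.get(wi)) {
            int tl = lower, tu = upper;
            if (p < tl || tl < 0) tl = p;      // extend the window downward if needed
            if (p > tu) tu = p;                // extend the window upward if needed
            best = Math.min(best, closest(locs, wi + 1, tl, tu));
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up document: word 0 at {10, 40}, word 1 at {15}, word 2 at {12, 90}.
        // The tightest window covering all three is [10, 15], width 5.
        List<List<Integer>> locs = List.of(List.of(10, 40), List.of(15), List.of(12, 90));
        System.out.println(closest(locs, 0, -1, -1));
    }
}
```

Note that this explores every combination of locations; for long location lists you would want to prune or use a sorted-merge strategy like the two-word algorithm above.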
Include, in an appendix, your implementations of each of the algorithms described above (or at least, those algorithms that you used). You need not include boilerplate code.
Include, as a separate appendix, your results for the following queries: