CS383: Lab 11 -- TF-IDF weights

In this lab you will be extending the Heaps' law lab from last week to create a TF-IDF vector space representation of every document in the Dickens collection.

Data

The data you will be using for this lab is identical to the data from last week. It is the full text of chapters from six novels by Charles Dickens. These files are available on the UNIX system at /home/gtowell/Public/CS383/Dickens. If you are so motivated, there is also a tar file of the complete file set at /home/gtowell/Public/CS383/split.tar, which is easy to copy to your own machine. Each of the files in the Dickens directory is formatted to make the task as easy as possible. Specifically, each file has one word per line, all lower case, with all punctuation removed. So all you need to do is read the file and build the concordance. You do not need to modify the text after reading it.

There are 402 files in the dataset.
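
Because each file holds one lower-case word per line, reading a document reduces to counting lines. The following is a minimal sketch, not a required design. The function names read_tokens and load_collection are hypothetical, and the sketch assumes that the files end in .txt (as Bleak1_40.txt does) and can be read as UTF-8.

```python
from collections import Counter
from pathlib import Path


def read_tokens(path):
    """Return the tokens in one file, assuming one lower-case word per line."""
    with open(path, encoding="utf-8") as handle:
        # Skip blank lines defensively; the text itself needs no cleaning.
        return [line.strip() for line in handle if line.strip()]


def load_collection(directory):
    """Map each .txt file name in `directory` to a Counter of its tokens."""
    # Assumption: every document in the directory has a .txt extension.
    return {
        path.name: Counter(read_tokens(path))
        for path in sorted(Path(directory).glob("*.txt"))
    }
```

Calling load_collection("/home/gtowell/Public/CS383/Dickens") should then give one Counter per file, which you can check against the 402 files stated above.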

Computing TF-IDF weights

To compute TF-IDF, do the following. (I describe this in terms of vectors; in many cases it may be more convenient to use hash tables.)
  1. Get a list of every token (word) that occurs at least 3 times across all of the documents.
  2. Put the tokens from the previous step into a fixed ordering. So, for example, the word "the" might be at position 42 in this ordering and the word "word" might be at position 4034. The exact positions do not matter. This is your "vector space".
  3. For each word in the vector space, determine the number of documents in which the word appears. Call this value df_i.
  4. For each document (j), count the number of times each word in the vector space occurs in the document. Call this tf_ij, where i = index of a token in the vector space and j = index of a document in the collection.
  5. Finally, compute the vector space representation of each document as
                            w_ij = tf_ij * idf_i = (tf_ij / max_i(tf_ij)) * log2(N_docs / df_i)

                            where i = index of a token in the vector space
                                  j = index of a document in the collection
                                  N_docs = number of documents in the collection
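
The five steps above can be sketched in Python as follows. This is one possible arrangement under stated assumptions, not the required solution. The function names and the docs structure (a dictionary mapping each document name to a Counter of its word counts) are hypothetical. The sketch also reads the maximum in the weight formula as the largest count among vector-space tokens within the same document, which is an interpretation.

```python
import math
from collections import Counter


def build_vector_space(docs, min_count=3):
    """Steps 1-2: tokens with at least `min_count` total occurrences, in a fixed order."""
    totals = Counter()
    for counts in docs.values():
        totals.update(counts)
    # Alphabetical order is arbitrary; any fixed ordering would do.
    vocab = sorted(tok for tok, n in totals.items() if n >= min_count)
    return {tok: pos for pos, tok in enumerate(vocab)}


def document_frequencies(docs, positions):
    """Step 3: number of documents in which each vector-space token appears."""
    df = Counter()
    for counts in docs.values():
        df.update(tok for tok in counts if tok in positions)
    return df


def tfidf_vector(counts, positions, df, n_docs):
    """Steps 4-5: weight vector for one document."""
    vector = [0.0] * len(positions)
    # Assumed reading of the max: largest count among vector-space tokens here.
    max_tf = max((counts[t] for t in counts if t in positions), default=0)
    if max_tf == 0:
        return vector  # document contains no vector-space tokens
    for tok, tf in counts.items():
        if tok in positions:
            vector[positions[tok]] = (tf / max_tf) * math.log2(n_docs / df[tok])
    return vector


# Toy collection of two documents (made-up counts, for illustration only).
docs = {"d1": Counter(a=4, b=2), "d2": Counter(a=1, c=3)}
positions = build_vector_space(docs)
df = document_frequencies(docs, positions)
vectors = {name: tfidf_vector(c, positions, df, len(docs)) for name, c in docs.items()}
```

In the toy collection, "b" occurs only twice in total, so it is left out of the vector space, and "a" appears in every document, so its weight is zero everywhere. Hash tables (dictionaries) carry the counts, and the positional list is built only at the end.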
                        

What to hand in

Send to gtowell@brynmawr.edu your vector representation of Bleak1_40.txt. This representation is purely positional, so also send a mapping from token to position.
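
No particular file format is specified for the submission, so the following is only one plausible layout: a sketch that writes the mapping and the vector as tab-separated lines ordered by position. The function name format_submission, the tab-separated layout, and the six-decimal precision are all assumptions.

```python
def format_submission(vector, positions):
    """Return (mapping_text, vector_text), one line per position, in position order."""
    ordered = sorted(positions.items(), key=lambda item: item[1])
    # Mapping lines: position, then token.
    mapping_lines = [f"{pos}\t{tok}" for tok, pos in ordered]
    # Vector lines: position, then weight rounded to six decimals.
    vector_lines = [f"{pos}\t{vector[pos]:.6f}" for _, pos in ordered]
    return "\n".join(mapping_lines), "\n".join(vector_lines)


# Made-up two-token example, for illustration only.
mapping_text, vector_text = format_submission([0.0, 1.0], {"a": 0, "c": 1})
```

Keeping the position on every line of both outputs makes it easy to check that weight k really belongs to token k.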