CS383: Lab 11 -- TF-IDF weights

In this lab you will extend last week's Heaps' law lab to create a TF-IDF vector space representation of every document in the Dickens collection.

Data

The data for this lab is identical to last week's: the full text of chapters from six novels by Charles Dickens. The files are available on the UNIX system at /home/gtowell/Public/383/Dickens. If you are so motivated, there is also a tar file of the complete file set, easy to copy to your own machine, at /home/gtowell/Public/383/split.tar. Each file in the Dickens directory is formatted to make the task as easy as possible: one word per line, all lower case, all punctuation removed. So all you need to do is read each file and build your concordance; you do not need to modify the text after reading.

There are 402 files in the dataset.
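Since the files are already tokenized, reading a document reduces to collecting its non-blank lines. A minimal sketch in Python (the helper name read_tokens and the docs dictionary are illustrative choices, not part of the assignment):

    import os

    DATA_DIR = "/home/gtowell/Public/383/Dickens"

    def read_tokens(path):
        # Each file is one word per line, lower case, punctuation removed.
        with open(path) as f:
            return [line.strip() for line in f if line.strip()]

    # Read every document in the collection, keyed by file name.
    docs = {name: read_tokens(os.path.join(DATA_DIR, name))
            for name in sorted(os.listdir(DATA_DIR))}
    print(len(docs))  # expect 402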

Computing TF-IDF weights

To compute TF-IDF, first read through the documents to determine the document frequency of every token (i.e., word) in the Dickens collection that appears at least 3 times (note that those 3 appearances could all be in the same document).
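One way to gather both counts in a single pass, continuing the Python sketch above (total_counts, doc_freq, and vocab are illustrative names):

    from collections import Counter

    total_counts = Counter()  # occurrences of each token anywhere in the collection
    doc_freq = Counter()      # df_i: number of documents containing each token
    for tokens in docs.values():
        total_counts.update(tokens)
        doc_freq.update(set(tokens))  # each document counts once toward df

    # Keep tokens appearing at least 3 times in the whole collection
    # (all 3 may be in a single document).
    vocab = [tok for tok, n in total_counts.items() if n >= 3]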

Assign each of these tokens to a position in the vector space. (Realistically, this step could be skipped with some modestly clever programming.) These indices define your Vector Space.
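For example, sorting the surviving tokens gives a stable ordering to enumerate (the "modestly clever" alternative would be to hand out indices lazily, the first time each token is seen):

    # Map each token to its position in the Vector Space.
    index_of = {tok: i for i, tok in enumerate(sorted(vocab))}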

Then read through the documents again, this time counting the number of times each Vector Space token appears in each document. Finally, use these counts to compute the TF-IDF weight for each token, where TF-IDF is computed as:

            w_ij = tf_ij * idf_i = (tf_ij / max_k(tf_kj)) * log2(N_docs / df_i)

            where i      = index of a token in the Vector Space
                  j      = index of a document in the collection
                  tf_ij  = number of times token i appears in document j
                  max_k(tf_kj) = largest count of any Vector Space token k in document j
                  df_i   = number of documents in which token i appears
                  N_docs = number of documents in the collection (402)
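Putting the pieces together, a sketch of the weight computation that continues the Python fragments above (tfidf_vector is an illustrative name; it returns a sparse dict of position -> weight):

    import math

    N_DOCS = len(docs)  # 402 for this collection

    def tfidf_vector(tokens):
        # Raw term frequencies, restricted to tokens in the Vector Space.
        tf = Counter(t for t in tokens if t in index_of)
        if not tf:
            return {}
        max_tf = max(tf.values())  # max_k(tf_kj) for this document
        return {index_of[tok]: (n / max_tf) * math.log2(N_DOCS / doc_freq[tok])
                for tok, n in tf.items()}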
        

What to hand in

Send your vector representation of Bleak1_40.txt to gtowell@brynmawr.edu. Since the representation is purely positional, also send a mapping from token to position.
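One plausible way to produce both deliverables, continuing the sketch above; the output file names and the one-entry-per-line format are assumptions, not a required format:

    # Sparse vector for Bleak1_40.txt: one "position weight" pair per line.
    vec = tfidf_vector(docs["Bleak1_40.txt"])
    with open("Bleak1_40.vec", "w") as out:
        for pos in sorted(vec):
            out.write(f"{pos}\t{vec[pos]:.6f}\n")

    # Token -> position mapping, one "position token" pair per line.
    with open("token_positions.txt", "w") as out:
        for tok, pos in sorted(index_of.items(), key=lambda kv: kv[1]):
            out.write(f"{pos}\t{tok}\n")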