CS383: Lab 12 - Using TF-IDF weights
In this lab you will be using the TF-IDF vector space representation of every document in the Dickens collection that you built last week to answer the following two questions:
- For each document in the collection, what is the most similar document to that document (other than itself) according to cosine similarity (equivalently, the smallest cosine distance)?
- For each document in the collection, what is the highest-weighted term in its vector?
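Both questions reduce to small operations on weight vectors. The following is a minimal sketch in Python, not a required approach. The function names and the toy vectors are hypothetical and chosen only for illustration, and treating an all-zero vector as having similarity 0 is an assumption made here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def most_similar(name, vectors):
    """Name of the document closest to `name`, excluding `name` itself."""
    target = vectors[name]
    others = (other for other in vectors if other != name)
    return max(others, key=lambda other: cosine_similarity(target, vectors[other]))


def top_term_index(vector):
    """Position of the largest weight in a vector."""
    return max(range(len(vector)), key=lambda i: vector[i])


# Toy three-term vectors; these names and weights are made up.
toy = {
    "doc_a": [1.0, 0.0, 2.0],
    "doc_b": [2.0, 0.0, 4.0],
    "doc_c": [0.0, 3.0, 0.0],
}
```

Here `most_similar("doc_a", toy)` picks `doc_b`, because the two vectors point in the same direction even though their lengths differ. `top_term_index` returns a position, which still has to be mapped back to a term.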
The data you will be using for this lab is identical to the data from last week: the full text of chapters from six novels by Charles Dickens. These files are available on the UNIX system at /home/gtowell/Public/383/Dickens. If you would rather work on your own machine, there is also a tar file of the complete file set at /home/gtowell/Public/383/split.tar that is easy to copy. Each of the files in the Dickens directory is formatted to make the task as easy as possible. Specifically, each file has one word per line, all in lower case, with all punctuation removed.
There are 402 files in the dataset.
Some of you may not have completed last week's lab. So, in addition to the text of each document, I have computed a vector space representation. That representation is available at /home/gtowell/Public/383/VS and in a tar file at /home/gtowell/Public/383/Dickens_VS.tar. My vector space representation has two parts:
- vs_terms.csv This file simply lists a location and a term. For instance, here is a part of the file:
The location gives the position of that term in the vector space. More importantly, it identifies the line that holds that term's weight in each of the files described next.
- *.txt.vs There are 402 of these files, one for every text file. Each file contains one number per line. That number is the TF-IDF weight of a term for the associated text file. Hence, the numbers correspond one-to-one with the terms in the vs_terms.csv file. For instance, the number in the first row of a .vs file is the vector space weight for the term "frowning".
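One possible way to read the two parts into memory is sketched below in Python. It rests on assumptions that you should check against the real files: that vs_terms.csv is comma-separated with no header row and has the term in the last field, and that its row order matches the line order of every .vs file. The helper names are hypothetical.

```python
import csv
from pathlib import Path


def load_terms(path):
    """Read the terms in file order (assumed row layout: location,term)."""
    with open(path, newline="") as handle:
        return [row[-1].strip() for row in csv.reader(handle) if row]


def load_vector(path):
    """Read one TF-IDF weight per line from a .vs file."""
    with open(path) as handle:
        return [float(line) for line in handle if line.strip()]


def load_collection(directory):
    """Map each *.txt.vs file name in `directory` to its weight vector."""
    paths = sorted(Path(directory).glob("*.txt.vs"))
    return {path.name: load_vector(path) for path in paths}
```

With the terms and vectors loaded this way, line i of any .vs file lines up with entry i of the term list, so a position in a vector can be turned back into a word by indexing the list. It is worth checking that every vector has the same length as the term list before relying on that.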
What to hand in
Send an email to email@example.com with the answers to the above two questions for each document in the collection. If you do not complete this within 80 minutes, then send the code you developed towards a solution.