credit: Jordan Boyd-Graber, Matthew J. Lavin
The goal of this assignment is for you to implement tf-idf to represent documents as vectors of tf-idf values. You will then use an existing implementation of SVD to project the document-term matrix to 2 dimensions. Here, we will be analyzing obituaries.
The starter code tfidf.py is available here, and the tests (tfidf_test.py) are available here. The data needed can be found under the data/ directory.
You’ll need to complete several functions in the TfIdf class:
I’ll talk about each of these functions and what you have to do for them. Each should be fairly easy; if you find yourself spending hours on one, you’re probably overthinking it or doing something by hand that Python can do for you.
They’re listed roughly in the order that you should complete them, but you obviously need to think about all of them before you can start.
You don’t need to do much here except create the data structures you’ll need for counting things later. I’d suggest taking a look at NLTK’s FreqDist and/or refreshing your memory on Python collections.
Here, you’ll need to take the string representations of words as you see them and keep track of how often you’ve seen each one, as in the sketch below.
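For example, here is a minimal sketch of that bookkeeping using collections.Counter (NLTK’s FreqDist behaves similarly). Everything here except train_seen, vocab, and unk_cutoff is a made-up name for illustration, not necessarily what the starter code uses:

    from collections import Counter

    # A minimal sketch of the bookkeeping; the class and method names here
    # are assumptions, not the starter code's actual names.
    class TfIdfSketch:
        def __init__(self, unk_cutoff=2):
            self.unk_cutoff = unk_cutoff
            self.train_seen = Counter()   # word string -> times seen in training
            self.vocab = {}               # filled in later: word string -> integer id
            self.term_counts = Counter()  # word string -> total count across documents
            self.doc_counts = Counter()   # word string -> number of documents containing it
            self.num_docs = 0
            self.total_tokens = 0

        def see_word(self, word):
            # Record one more sighting of this word during the training scan.
            self.train_seen[word] += 1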
Once we’ve done a scan over all of the documents, we can create a vocabulary, taking into the vocabulary all of the words that have appeared at least unk_cutoff times. You may want to free up the memory you used for train_seen here, but doing so is not necessary for getting a good grade.
The most important thing is that after this function has run, the vocab data member of TfIdf should map words (in string form) to their vocabulary id (an integer).
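Continuing the hypothetical TfIdfSketch above, the finalization could look roughly like this (finalize as a method name is an assumption):

    # A sketch of finalizing the vocabulary, as a method of the sketch class above.
    def finalize(self):
        # Keep only words seen at least unk_cutoff times; give each an integer id.
        kept = [word for word, count in self.train_seen.items()
                if count >= self.unk_cutoff]
        self.vocab = {word: word_id for word_id, word in enumerate(kept)}
        # Optional: release the raw training counts now that the vocabulary is fixed.
        self.train_seen = None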
Take in a document and keep track of the words that you see in appropriate data structures so that the next two functions work.
Return the term frequency of a word.
Return the inverse document frequency of a word.
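Continuing the hypothetical sketch, here is roughly how these three pieces could fit together. The starter code and tests define the exact variants (for example, whether tf is computed per document or over the whole corpus, and whether idf is smoothed), so treat these textbook-style formulas and method names as illustrations only:

    import math

    # Methods of the TfIdfSketch class above; names and formulas are assumptions.
    def add_document(self, words):
        self.num_docs += 1
        self.total_tokens += len(words)
        self.term_counts.update(words)      # total occurrences of each word
        self.doc_counts.update(set(words))  # each document counted once per word

    def term_freq(self, word):
        # Fraction of all counted tokens that are this word.
        return self.term_counts[word] / self.total_tokens

    def inv_docfreq(self, word):
        # Log of (total documents / documents containing the word).
        return math.log(self.num_docs / self.doc_counts[word])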
Please answer the following questions in a file called README.txt:
After training on data/obits.json, which words appear in a lot of documents and are thus not helpful query terms?
Modify tfidf.py so that you convert Karen Spärck Jones’s obituary into a document vector based on the vocabulary developed in training. Based on tf-idf, what words are indicative of her obituary? Why might you think these words are indicative of Karen Spärck Jones’s obituary?
For the second part of this assignment, we are going to use sklearn’s implementation of tf-idf and SVD. From the Programming Historian tf-idf example, download the lesson-files.zip.
I’d highly recommend reading that webpage, but it is optional.
In practice, you will not be using your own implementation of tf-idf in your projects. You will more likely use an implementation from a library like sklearn.
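For instance, the library route takes only a few lines; `texts` below is a stand-in for the real obituary strings:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # A minimal sketch; `texts` stands in for the real obituary strings.
    texts = ["first obituary text ...", "second obituary text ..."]
    vectorizer = TfidfVectorizer()
    dtm = vectorizer.fit_transform(texts)  # sparse document-term matrix of tf-idf values
    print(dtm.shape)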
Run the notebook called TF-IDF-lesson-code and answer the following questions. Please put your answers in README.txt.
Why do you think it might be better to use sklearn’s (or another library’s) implementation compared to your own?
Make sure you understand what is happening in each cell. Starting at the 4th cell (the one with the from sklearn.feature_extraction.text import TfidfVectorizer line), write a brief comment explaining what is happening in each cell.
Choose any obituary and compare the document representation in that notebook with the document representation for the same obituary from your tfidf.py implementation. What differences do you see?
What changes do you think you’d have to make to the settings of the TfidfVectorizer object in the notebook to make the results more similar to the results of your implementation?
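To ground that question, here are some TfidfVectorizer settings worth experimenting with. Which ones matter depends on the choices you made in your own implementation, so treat this particular configuration as a guess, not the answer:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Settings worth experimenting with; this configuration is only a guess.
    vectorizer = TfidfVectorizer(
        norm=None,           # skip length-normalizing the document vectors
        smooth_idf=False,    # raw idf instead of the smoothed "+1" variant
        sublinear_tf=False,  # raw term counts instead of 1 + log(tf)
        min_df=2,            # roughly analogous to a vocabulary cutoff like unk_cutoff
    )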
In the same notebook, use sklearn’s SVD to reduce the document-term matrix to just 2 dimensions. Then plot the documents in those 2 dimensions. You can keep the plot in the notebook.
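A minimal sketch, assuming dtm is the document-term matrix built earlier in the notebook; sklearn’s TruncatedSVD is used here because it works directly on sparse matrices:

    import matplotlib.pyplot as plt
    from sklearn.decomposition import TruncatedSVD

    # A sketch assuming `dtm` is the tf-idf document-term matrix from the notebook.
    svd = TruncatedSVD(n_components=2)
    reduced = svd.fit_transform(dtm)  # one 2-dimensional point per document

    plt.scatter(reduced[:, 0], reduced[:, 1])
    plt.xlabel("component 1")
    plt.ylabel("component 2")
    plt.title("Obituaries projected to 2 dimensions")
    plt.show()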
Below your graph, make sure to describe it. Do you see any distinct groups in the 2 dimensions?
We only covered SVD, but try using other dimensionality reduction techniques from sklearn, e.g. PCA, t-SNE, etc. Make sure to include a brief description of the other dimensionality reduction technique you use.
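A sketch of two alternatives, again assuming dtm from the notebook; note that PCA needs a dense array, hence the conversion first:

    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    # A sketch of two alternatives; `dtm` is assumed from the notebook.
    dense = dtm.toarray()
    pca_points = PCA(n_components=2).fit_transform(dense)
    tsne_points = TSNE(n_components=2).fit_transform(dense)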
Unfortunately, we did not apply any labels to the obituaries. Ideally, the name of each file would indicate who the obit was about, or we would have provided metadata mapping the file names to the name of each obituary’s subject. When working with real-world data, such labels are often missing. Come up with an algorithm to determine who each obituary is about, and then apply that algorithm to the data. You can use existing libraries like spacy to help you with this; one possible approach is sketched below. When doing this, remember the Pareto principle (80-20 rule).
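One possible (not required) heuristic, sketched with spacy’s named entity recognizer: assume the subject is the person mentioned most often in the text.

    from collections import Counter
    import spacy

    # One possible heuristic, not the required method: guess that the subject
    # is the PERSON entity mentioned most often in the obituary.
    nlp = spacy.load("en_core_web_sm")

    def guess_subject(text):
        doc = nlp(text)
        people = Counter(ent.text for ent in doc.ents if ent.label_ == "PERSON")
        return people.most_common(1)[0][0] if people else None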
It is very likely there are edge cases where your method fails or produces inaccurate results. This is okay and will often happen in research. An important question to constantly ask yourself is how much fine-tuning of your methods is necessary and to what extent small mistakes are tolerable.
The goal in this assignment is to understand obituaries using tf-idf. We could spend lots of time developing a foolproof heuristic that correctly and perfectly extracts the subjects of all of our obituaries. However, that would probably be a waste of time.
The Pareto principle states that for many outcomes, roughly 80% of consequences come from 20% of the causes (the “vital few”). Other names for this principle are the 80/20 rule, the law of the vital few, or the principle of factor sparsity. - https://en.wikipedia.org/wiki/Pareto_principle
For our purposes, we apply the Pareto principle to mean that we want a 20% effort solution that covers 80% of our data. Spending time on the remaining 20% of edge cases often isn’t worth it. In most computer science classes your code and solutions are evaluated by how well they cover edge cases. However, for our purposes here, the edge cases don’t necessarily matter and aren’t always worthwhile. It is important to remember the goal and keep our eyes on the prize.
With that said, it is important to sample our data to see whether our 20% effort approach covers 80% of the examples.
In lecture 6 we briefly spoke about clustering. Apply K-Means to the DTM (or to the representations with reduced dimensions); you can use the sklearn implementation, as in the sketch below. Do you find distinct clusters of obituaries? What other patterns do you see?
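A sketch, assuming reduced holds the 2-dimensional projections from the SVD step above; the number of clusters is a knob you should experiment with:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    # A sketch assuming `reduced` holds the 2-dimensional SVD projections
    # from above; the number of clusters here is only a guess.
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
    labels = kmeans.fit_predict(reduced)

    plt.scatter(reduced[:, 0], reduced[:, 1], c=labels)
    plt.title("K-Means clusters of obituaries")
    plt.show()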
In a README file called README.txt, answer the following questions:
Feel free to add additional/optional feedback, e.g. what did you like or dislike about the assignment?
Submit the following files to the assignment called HW02
on Gradescope:
tfidf.py
TF-IDF-lesson-code.ipynb
README.txt
Make sure to name these files exactly what we specify here. Otherwise, our autograders might not work and we might have to take points off.