credit: Jordan Boyd-Graber, Matthew J. Lavin
The goal of this assignment is for you to implement tf-idf to represent documents as vectors of tf-idf values. You will then use an existing implementation of SVD to project the document-term matrix to 2 dimensions. Here, we will be analyzing obituaries.
The starter code tfidf.py is available here, and the tests (tfidf_test.py) are available here. The data needed can be found under the data/ directory.
You’ll need to complete several functions in the TfIdf class:
I’ll talk about each of these functions and what you have to do for them. Each should be fairly easy; if you find yourself spending hours on one, you’re probably overthinking it or doing something by hand that Python can do for you.
They’re listed roughly in the order that you should complete them, but you obviously need to think about all of them before you can start.
You don’t need to do much here except create the data structures you’ll need for counting things later. I’d suggest taking a look at NLTK’s FreqDist and/or refreshing your memory on Python collections.
Here, you’ll need to take the string representations of words as you see them and keep track of how often you’ve seen each one, as in the sketch below.
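For example, here is a minimal sketch of that bookkeeping using collections.Counter (NLTK’s FreqDist behaves similarly). Everything here except train_seen, vocab, and unk_cutoff is a made-up name for illustration, not necessarily what the starter code uses:

    from collections import Counter

    # A minimal sketch of the bookkeeping; the class and method names here
    # are assumptions, not the starter code's actual names.
    class TfIdfSketch:
        def __init__(self, unk_cutoff=2):
            self.unk_cutoff = unk_cutoff
            self.train_seen = Counter()   # word string -> times seen in training
            self.vocab = {}               # filled in later: word string -> integer id
            self.term_counts = Counter()  # word string -> total count across documents
            self.doc_counts = Counter()   # word string -> number of documents containing it
            self.num_docs = 0
            self.total_tokens = 0

        def see_word(self, word):
            # Record one more sighting of this word during the training scan.
            self.train_seen[word] += 1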
Once we’ve done a scan over all of the documents, we can create a vocabulary, taking into the vocabulary all of the words that have appeared at least unk_cutoff times. You may want to free up the memory you used for train_seen here, but doing so is not necessary for getting a good grade.
The most important thing is that after this function has run, the vocab data member of TfIdf should map words (in string form) to their vocabulary id (an integer).
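Continuing the hypothetical TfIdfSketch above, the finalization could look roughly like this (finalize as a method name is an assumption):

    # A sketch of finalizing the vocabulary, as a method of the sketch class above.
    def finalize(self):
        # Keep only words seen at least unk_cutoff times; give each an integer id.
        kept = [word for word, count in self.train_seen.items()
                if count >= self.unk_cutoff]
        self.vocab = {word: word_id for word_id, word in enumerate(kept)}
        # Optional: release the raw training counts now that the vocabulary is fixed.
        self.train_seen = None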
Take in a document and keep track of the words that you see in appropriate data structures so that the next two functions work.
Return the term frequency of a word.
Return the inverse document frequency of a word.
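Continuing the hypothetical sketch, here is roughly how these three pieces could fit together. The starter code and tests define the exact variants (for example, whether tf is computed per document or over the whole corpus, and whether idf is smoothed), so treat these textbook-style formulas and method names as illustrations only:

    import math

    # Methods of the TfIdfSketch class above; names and formulas are assumptions.
    def add_document(self, words):
        self.num_docs += 1
        self.total_tokens += len(words)
        self.term_counts.update(words)      # total occurrences of each word
        self.doc_counts.update(set(words))  # each document counted once per word

    def term_freq(self, word):
        # Fraction of all counted tokens that are this word.
        return self.term_counts[word] / self.total_tokens

    def inv_docfreq(self, word):
        # Log of (total documents / documents containing the word).
        return math.log(self.num_docs / self.doc_counts[word])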
Please answer the following questions in a file called README.txt:
After training on data/obits.json, which words appear in a lot of documents and are thus not helpful query terms?
Modify tfidf.py so that you convert Karen Spärck Jones’s obituary into a document vector based on the vocabulary developed in training. Based on tf-idf, what words are indicative of her obituary? Why might you think these words are indicative of Karen Spärck Jones’s obituary?
For the second part of this assignment, we are going to use sklearn’s implementation of tf-idf and SVD. From the Programming Historian tf-idf example, download the lesson-files.zip.
I’d highly recommend reading that webpage, but it is optional.
In practice, you will not be using your own implementation of tf-idf in your projects. You will more likely use an implementation from a library like sklearn.
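For instance, the library route takes only a few lines; `texts` below is a stand-in for the real obituary strings:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # A minimal sketch; `texts` stands in for the real obituary strings.
    texts = ["first obituary text ...", "second obituary text ..."]
    vectorizer = TfidfVectorizer()
    dtm = vectorizer.fit_transform(texts)  # sparse document-term matrix of tf-idf values
    print(dtm.shape)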
Run the notebook called TF-IDF-lesson-code and answer the following questions. Please put your answers in README.txt.
Why do you think it might be better to use sklearn’s (or another library’s) implementation compared to your own?
Make sure you understand what is happening in each cell. Starting at the 4th cell (the one with the from sklearn.feature_extraction.text import TfidfVectorizer line), write a brief comment explaining what is happening in each cell.
Choose any obituary and compare the document representation in that notebook with the document representation for the same obituary from your tfidf.py implementation. What differences do you see?
What changes do you think you’d have to make to the settings of the TfidfVectorizer object in the notebook to make the results more similar to the results of your implementation?
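To ground that question, here are some TfidfVectorizer settings worth experimenting with. Which ones matter depends on the choices you made in your own implementation, so treat this particular configuration as a guess, not the answer:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Settings worth experimenting with; this configuration is only a guess.
    vectorizer = TfidfVectorizer(
        norm=None,           # skip length-normalizing the document vectors
        smooth_idf=False,    # raw idf instead of the smoothed "+1" variant
        sublinear_tf=False,  # raw term counts instead of 1 + log(tf)
        min_df=2,            # roughly analogous to a vocabulary cutoff like unk_cutoff
    )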
In the same notebook, use sklearn’s SVD to reduce the document-term matrix to just 2 dimensions. Then plot the documents in those 2 dimensions. You can keep the plot in the notebook.
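A minimal sketch, assuming dtm is the document-term matrix built earlier in the notebook; sklearn’s TruncatedSVD is used here because it works directly on sparse matrices:

    import matplotlib.pyplot as plt
    from sklearn.decomposition import TruncatedSVD

    # A sketch assuming `dtm` is the tf-idf document-term matrix from the notebook.
    svd = TruncatedSVD(n_components=2)
    reduced = svd.fit_transform(dtm)  # one 2-dimensional point per document

    plt.scatter(reduced[:, 0], reduced[:, 1])
    plt.xlabel("component 1")
    plt.ylabel("component 2")
    plt.title("Obituaries projected to 2 dimensions")
    plt.show()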
Below your graph, make sure to describe it. Do you see any distinct groups in the 2 dimensions?
We only covered SVD, but try using other dimensionality reduction techniques from sklearn, e.g. PCA, t-SNE, etc. Make sure to include a brief description of the other dimensionality reduction technique you use.
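A sketch of two alternatives, again assuming dtm from the notebook; note that PCA needs a dense array, hence the conversion first:

    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    # A sketch of two alternatives; `dtm` is assumed from the notebook.
    dense = dtm.toarray()
    pca_points = PCA(n_components=2).fit_transform(dense)
    tsne_points = TSNE(n_components=2).fit_transform(dense)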
Unfortunately, we did not apply any labels to the obituaries. Ideally, the name of each file would indicate who the obit was about, or we would have provided metadata mapping the file names to the name of each obituary’s subject. When working with real-world data, such labels are often missing. Come up with an algorithm to determine who each obituary is about, and then apply that algorithm to the data. You can use existing libraries like spacy to help you with this; one possible approach is sketched below. When doing this, remember the Pareto principle (80-20 rule).
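One possible (not required) heuristic, sketched with spacy’s named entity recognizer: assume the subject is the person mentioned most often in the text.

    from collections import Counter
    import spacy

    # One possible heuristic, not the required method: guess that the subject
    # is the PERSON entity mentioned most often in the obituary.
    nlp = spacy.load("en_core_web_sm")

    def guess_subject(text):
        doc = nlp(text)
        people = Counter(ent.text for ent in doc.ents if ent.label_ == "PERSON")
        return people.most_common(1)[0][0] if people else None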
It is very likely there are edge cases where your method fails or produces inaccurate results. This is okay and will often happen in research. An important question to constantly ask yourself is how much fine-tuning of your methods is necessary and to what extent small mistakes are tolerable.
The goal in this assignment is to understand obituaries using tf-idf. We could spend lots of time developing a foolproof heuristic that correctly and perfectly extracts the subjects of all of our obituaries. However, that would probably be a waste of time.
The Pareto principle states that for many outcomes, roughly 80% of consequences come from 20% of the causes (the “vital few”). Other names for this principle are the 80/20 rule, the law of the vital few, or the principle of factor sparsity. - https://en.wikipedia.org/wiki/Pareto_principle
For our purposes, we apply the Pareto principle to mean that we want a 20% effort solution that covers 80% of our data. Spending time on the remaining 20% of edge cases often isn’t worth it. In most computer science classes your code and solutions are evaluated by how well they cover edge cases. However, for our purposes here, the edge cases don’t necessarily matter and aren’t always worthwhile. It is important to remember the goal and keep our eyes on the prize.
With that said, it is important to sample our data to see whether our 20% effort approach covers 80% of the examples.
In lecture 6 we briefly spoke about clustering. Apply K-Means to the DTM (or to the representations with reduced dimensions); you can use the sklearn implementation, as in the sketch below. Do you find distinct clusters of obituaries? What other patterns do you see?
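A sketch, assuming reduced holds the 2-dimensional projections from the SVD step above; the number of clusters is a knob you should experiment with:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    # A sketch assuming `reduced` holds the 2-dimensional SVD projections
    # from above; the number of clusters here is only a guess.
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
    labels = kmeans.fit_predict(reduced)

    plt.scatter(reduced[:, 0], reduced[:, 1], c=labels)
    plt.title("K-Means clusters of obituaries")
    plt.show()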
In a README file called README.txt, answer the following questions:
Feel free to add additional/optional feedback, e.g. what did you like or dislike about the assignment?
Submit the following files to the assignment called HW02
on Gradescope:
tfidf.py
TF-IDF-lesson-code.ipynb
README.txt
Make sure to name these files exactly what we specify here. Otherwise, our autograders might not work and we might have to take points off.