Homework 3: Topic Modeling - LDA

Deadline: 2023-02-15 23:59:00 EDT

credit: Jordan Boyd-Graber, Maria Antoniak

The goal of this assignment is for you to 1) implement part of a Gibbs sampler to train an LDA model, 2) apply LDA to a corpus of Wikipedia documents, and 3) investigate some practical considerations when using LDA. You will not need to write much code, but you will need to understand how the other methods and instance variables in lda.py work.

The starter code (lda.py) is available here and the tests (test.py) are available here. The data you need can be found in this tar.gz.

Implementing a Gibbs Sampler

Code (20 Points)

You’ll need to complete several functions in the LdaTopicCounts class:

change_topic


Finish implementing the change_topic function so that it keeps track of the necessary counts. Remember that we need to maintain the associations of terms to topics, of documents to topics, and the specific topic assignment of each token. Make sure to update all three.
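To make the bookkeeping concrete, here is a minimal standalone sketch of the three updates. The variable names and signature below are hypothetical illustrations, not the ones lda.py actually uses:

    from collections import defaultdict

    # Standalone illustration of the counts change_topic has to keep in sync;
    # lda.py stores these in its own instance variables.
    term_topic = defaultdict(lambda: defaultdict(int))   # topic -> term -> count
    doc_topic = defaultdict(lambda: defaultdict(int))    # document -> topic -> count
    assignments = defaultdict(dict)                      # document -> token index -> topic

    def change_topic(doc, index, term, new_topic):
        """Move one token to new_topic, updating all three data structures."""
        old_topic = assignments[doc].get(index)
        if old_topic is not None:
            # Remove the token from its current topic before reassigning it.
            term_topic[old_topic][term] -= 1
            doc_topic[doc][old_topic] -= 1
        term_topic[new_topic][term] += 1
        doc_topic[doc][new_topic] += 1
        assignments[doc][index] = new_topic

    # Example: token 0 of document 0 ("dog") starts in topic 2, then moves to topic 5.
    change_topic(0, 0, "dog", 2)
    change_topic(0, 0, "dog", 5)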

sample_probs


Finish implementing the sample_probs function so that it returns a dictionary giving the conditional distribution of a token. To make debugging easier, we recommend returning an unnormalized distribution. In addition to using the unit tests (as usual), there’s also a directory of toy data you can try the code out on. After you’ve done these things, turn in your completed lda.py file to Gradescope.
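For reference, sample_probs computes the usual collapsed Gibbs conditional for one token: the probability of topic k is proportional to (count of topic k in the document + alpha) times (count of the term in topic k + beta), divided by (total tokens in topic k + vocabulary size times beta). The sketch below is a generic, self-contained version with hypothetical argument names; the real method works off the counts stored in the class.

    # Generic sketch of the unnormalized conditional for one token; argument
    # names are illustrative, not lda.py's API. The token being resampled is
    # assumed to have already been removed from the counts.
    def sample_probs(doc_topic_count, term_topic_count, topic_totals, term,
                     num_topics, vocab_size, alpha, beta):
        probs = {}
        for k in range(num_topics):
            doc_part = doc_topic_count.get(k, 0) + alpha          # how much this document likes topic k
            term_part = ((term_topic_count.get(k, {}).get(term, 0) + beta) /
                         (topic_totals.get(k, 0) + vocab_size * beta))  # how much topic k likes this term
            probs[k] = doc_part * term_part                       # unnormalized on purpose
        return probs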

Running on Real Data (10 Points)

In the data/ directory we provide a directory called wiki. Run your topic model on this set of 400 random Wikipedia pages and examine the topics. Upload your results as topics.pdf. You’ll need to run it for at least 1,000 iterations. Make sure to describe your results.

Run python lda.py --help to figure out what command line arguments are necessary. If you haven’t used argparse before, I’d recommend looking it up; it’s super powerful and helpful!
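An invocation will look something like the following; the flag names here are placeholders, so use whatever --help actually lists:

    python lda.py --doc_dir data/wiki --num_topics 10 --num_iterations 1000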

Design choices (10 Points)

At the end of lecture 6, we briefly discussed different design choices, e.g. choosing hyperparameters such as the Dirichlet priors or the number of topics, whether to remove stop words, etc.

Maria Antoniak’s great blog post Topic Modeling for the People provides more details about these considerations. Read the blog post and choose one of the options it describes to explore and implement.

For whichever option you pick, read one of the Learn more papers and write a brief summary of the paper in your README.txt. Then, modify lda.py so that you can test out the design choice you are exploring. Make sure that lda.py still passes the tests - I’d recommend adding command line arguments (using argparse) and then adding optional arguments to any of the existing methods you want to modify.
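As a hypothetical illustration of that pattern (the flag and keyword names below are made up for the example, not part of the starter code):

    import argparse

    # Add the new design choice as an optional flag so the default behavior,
    # and therefore the existing unit tests, stay unchanged.
    parser = argparse.ArgumentParser()
    parser.add_argument("--remove_stopwords", action="store_true",
                        help="Strip stop words from documents before sampling")
    args = parser.parse_args()

    # Then thread the choice through existing methods as an optional keyword
    # argument with a safe default, e.g.
    #     def tokenize(self, text, remove_stopwords=False): ...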

Note: if you choose the “Evaluate the topics” option, please implement one of the automatic metrics to evaluate the coherence of the resulting topics.
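One such metric is UMass coherence (Mimno et al., 2011), which scores a topic’s top words by how often they co-occur in the training documents. A rough sketch with hypothetical inputs, assuming docs is a list of token lists and top_words are one topic’s top words in rank order:

    import math
    from itertools import combinations

    def umass_coherence(top_words, docs):
        """UMass coherence for one topic: sum of log((D(w_i, w_j) + 1) / D(w_i))
        over pairs of top words, where D counts document co-occurrence and w_i
        is the higher-ranked word of the pair."""
        doc_sets = [set(d) for d in docs]
        def doc_freq(*words):
            return sum(1 for s in doc_sets if all(w in s for w in words))
        score = 0.0
        for w_i, w_j in combinations(top_words, 2):   # w_i is ranked above w_j
            # max(..., 1) guards against a top word that never appears in docs
            score += math.log((doc_freq(w_i, w_j) + 1) / max(doc_freq(w_i), 1))
        return score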

Extra Credit (5 points)

Mallet is a popular NLP toolkit maintained by David Mimno, a professor in Information Science at Cornell. Mallet is widely used for its implementation of LDA. Run Mallet on the same data and take note of how its runtime compares to your implementation. Upload the Mallet output file as mallet.txt.
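If you go this route, the command line usage looks roughly like the following; check the Mallet documentation for the exact flags and output options:

    bin/mallet import-dir --input data/wiki --output wiki.mallet \
        --keep-sequence --remove-stopwords
    bin/mallet train-topics --input wiki.mallet --num-topics 10 \
        --num-iterations 1000 --output-topic-keys mallet.txt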

For how to use Mallet and its Python wrapper written by Maria Antoniak, I’d recommend working through the Topic Modeling chapter in Melanie Walsh’s Cultural Analytics textbook. If you plan on using topic modeling for your final project, I’d highly recommend doing this now anyway.

Feedback (3 points)

In a README file, called README.txt, answer the following questions:

  1. How long did you spend on the assignment?
  2. What did you learn by doing this assignment?
  3. Briefly describe a Computational Text Analysis/Text as Data research question where using LDA would be useful (keep this answer to a maximum of 3 sentences).

Feel free to add additional/optional feedback, e.g. what you liked or disliked about the assignment.

Submitting

Submit the following files to the assignment called HW03 on Gradescope:

  1. lda.py
  2. README.txt
  3. topics.pdf

Make sure to name these files exactly as we specify here. Otherwise, our autograders might not work and we might have to take points off.