
Homework 1: Language Modeling

Deadline: 2023-01-30 23:59:00 EDT

The goal of this assignment is for you to implement statistical methods for training a language model. Remember, training an n-gram language model just means computing n-gram probabilities from counts. In this assignment we will use a bigram model, which means the probability of a sentence is computed as a product of bigram probabilities. We will train the LM on 57,340 sentences from the Brown corpus.

Note: You will need to download the Brown corpus from nltk. Make sure you are using nltk version 3.5 or higher. When you run lm.py, you will see instructions for installing the Brown corpus in Python.

The starter code (lm.py) is available here, and the tests (test.py) are available here.

Preparing Data (10 points)

We will use the Brown corpus (nltk.corpus.brown) as our training set and the Treebank (nltk.corpus.treebank) as our test set. Eventually, we’ll want to build a language model from the Brown corpus and apply it to the Treebank corpus. First, however, we need to prepare our corpus.

Note: In lecture we did not discuss how to deal with unknown words. Make sure to read section 3.4.1 Unknown Words of the textbook.
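
Below is a minimal sketch of this preparation, assuming a bigram model with start/stop padding. The names here (kUNK_CUTOFF, UNK, censor, and the lowercasing choice) are illustrative assumptions, not necessarily what the starter code uses:

    from collections import Counter

    from nltk.corpus import brown

    kUNK_CUTOFF = 2          # hypothetical count threshold for rare words
    UNK, START, STOP = "<UNK>", "<s>", "</s>"

    # First pass: count how often each word type appears in the training set.
    counts = Counter(word.lower() for sent in brown.sents() for word in sent)

    # Keep word types at or above the cutoff, plus the three special
    # tokens (see the FAQ below).
    vocab = {w for w, c in counts.items() if c >= kUNK_CUTOFF}
    vocab |= {UNK, START, STOP}

    def censor(sentence):
        """Map out-of-vocabulary words to UNK and pad the sentence with
        start/stop markers for a bigram model."""
        tokens = [w.lower() if w.lower() in vocab else UNK for w in sentence]
        return [START] + tokens + [STOP]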

Estimation (20 points)

After you’ve finalized the vocabulary, you need to add training data to the model. This is the most important step! Modify the add_train function so that, given a sentence, it keeps track of the counts you’ll need for the probability functions later. You will probably want to use default dictionaries or probability distributions. Finally, given the counts you’ve stored in add_train, you’ll need to implement the required probability estimates of a word given its context; see lm.py for the full list (the FAQ below mentions interpolated and discounted variants among them).
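
As a rough sketch of the kind of bookkeeping add_train might do for a bigram model (the class and attribute names here are assumptions, not the starter code’s actual structure):

    from collections import defaultdict

    class BigramCounts:
        def __init__(self):
            # _bigram_counts[context][word]: times `word` followed `context`
            self._bigram_counts = defaultdict(lambda: defaultdict(int))
            self._context_counts = defaultdict(int)

        def add_train(self, sentence):
            """Accumulate bigram and context counts from one censored,
            padded sentence (see the preparation sketch above)."""
            for context, word in zip(sentence, sentence[1:]):
                self._bigram_counts[context][word] += 1
                self._context_counts[context] += 1

        def mle(self, context, word):
            """Maximum-likelihood estimate of P(word | context)."""
            if self._context_counts[context] == 0:
                return 0.0
            return self._bigram_counts[context][word] / self._context_counts[context]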

Now, if you run the main section of lm.py, you’ll get per-sentence perplexity reports. Take a look at which sentences are particularly hard or easy (you don’t need to turn anything in for this part, however).
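
For reference, per-sentence perplexity is the exponentiated negative average log-probability of the sentence. A minimal sketch, assuming a hypothetical lm.log_prob(context, word) method that returns a natural-log probability (or kNEG_INF for zero):

    import math

    kNEG_INF = float("-inf")   # sentinel for log(0); see the FAQ below

    def sentence_perplexity(lm, sentence):
        """exp of the negative average log-probability over the bigrams
        of a censored, padded sentence."""
        log_total, n = 0.0, 0
        for context, word in zip(sentence, sentence[1:]):
            lp = lm.log_prob(context, word)
            if lp == kNEG_INF:
                return float("inf")   # any zero-probability bigram blows up
            log_total += lp
            n += 1
        return math.exp(-log_total / n)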

Exploration (10 points)

Try finding sentences from the test dataset that get really low perplexities for each of the estimation schemes (you may want to write some code to do this). Can you find any patterns? Turn in your findings and discussion in your README.txt.
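
One possible approach, reusing the censor and perplexity helpers sketched above and assuming a trained model lm for one estimation scheme:

    from nltk.corpus import treebank

    scored = []
    for sent in treebank.sents():
        padded = censor(sent)          # map OOV words to UNK, add markers
        scored.append((sentence_perplexity(lm, padded), " ".join(sent)))

    # Print the ten easiest sentences; short sentences made of very
    # frequent words tend to land here.
    for ppl, text in sorted(scored)[:10]:
        print(f"{ppl:8.2f}  {text}")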

Extra Credit

If you attempt any extra-credit extensions, make sure they don’t break the required code or functions that will be run by the autograder.

FAQ

Q: Why are there two passes over the data?

A: The first pass establishes the vocabulary, the second pass accumulates the counts. You could in theory do it in one pass, but it gets much more complicated.
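
Schematically, the two passes might look like this, reusing the censor helper and BigramCounts sketch from above:

    from nltk.corpus import brown

    # Pass 1 happened when we built `vocab` from the full corpus;
    # only after the vocabulary is frozen can censor() decide which
    # words become UNK. Pass 2 then accumulates the counts.
    lm = BigramCounts()
    for sent in brown.sents():
        lm.add_train(censor(sent))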

Q: What if the counts of <s> and </s> fall below the threshold?

A: They should always be included in the vocabulary.

Q: And what about words that fall below the threshold?

A: Words below the threshold are all mapped to the UNK token, so they are still represented; the vocab size will be the number of word types at or above the UNK threshold plus three (one for UNK, one for START, and one for END).

Q: What happens when I try to take the log of zero?

A: Return kNEG_INF instead.
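
For example, with a guarded log along these lines (kNEG_INF here is a stand-in for the constant defined in the starter code):

    import math

    kNEG_INF = float("-inf")

    def safe_log(p):
        # log(0) is undefined, so return the sentinel instead of raising
        return math.log(p) if p > 0 else kNEG_INF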

Q: Do I tune the hyperparameters for interpolation, discount, etc.?

A: No, that’s not part of this assignment.

Feedback (3 points)

In a README file, called README.txt, answer the following questions:

  1. How long did you spend on the assignment?
  2. What did you learn by doing this assignment?
  3. Briefly describe a Computational Text Analysis/Text as Data research question where language modeling would be useful (keep this answer to a maximum of 3 sentences).

Feel free to add additional, optional feedback, e.g., what you liked or disliked about the assignment.

Submitting

Submit the following files to the assignment called HW01 on Gradescope:

  1. lm.py
  2. README.txt

Make sure to name these files exactly as specified here; otherwise, our autograders might not work and we might have to take points off.