
Homework 1: Language Modeling

Deadline: 2023-01-30 23:59:00 EDT

The goal of this assignment is for you to implement statistical methods for training a language model. Remember, training an n-gram language model just means computing n-gram probabilities from counts. In this assignment we will use a bigram model, which means the probability of a sentence is computed as a product of bigram probabilities. We will train the LM on 57,340 sentences from the Brown corpus.

Note: You will need to download the Brown corpus from nltk. Make sure you are using nltk version 3.5 or higher. When you run lm.py, you will see instructions for installing the Brown corpus in Python.

The starter code (lm.py) is available here, and the tests (test.py) are available here.

Preparing Data (10 points)

We will use the Brown corpus (nltk.corpus.brown) as our training set and the Treebank (nltk.corpus.treebank) as our test set. Eventually, we’ll want to build a language model from the Brown corpus and apply it to the Treebank corpus. First, however, we need to prepare our corpus.

Note: In lecture we did not discuss how to deal with unknown words. Make sure to read section 3.4.1 Unknown Words of the textbook.
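
Below is a minimal sketch of this preparation, assuming a bigram model with start/stop padding. The names here (kUNK_CUTOFF, UNK, censor, and the lowercasing choice) are illustrative assumptions, not necessarily what the starter code uses:

    from collections import Counter

    from nltk.corpus import brown

    kUNK_CUTOFF = 2          # hypothetical count threshold for rare words
    UNK, START, STOP = "<UNK>", "<s>", "</s>"

    # First pass: count how often each word type appears in the training set.
    counts = Counter(word.lower() for sent in brown.sents() for word in sent)

    # Keep word types at or above the cutoff, plus the three special
    # tokens (see the FAQ below).
    vocab = {w for w, c in counts.items() if c >= kUNK_CUTOFF}
    vocab |= {UNK, START, STOP}

    def censor(sentence):
        """Map out-of-vocabulary words to UNK and pad the sentence with
        start/stop markers for a bigram model."""
        tokens = [w.lower() if w.lower() in vocab else UNK for w in sentence]
        return [START] + tokens + [STOP]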

Estimation (20 points)

After you’ve finalized the vocabulary, you need to add training data to the model. This is the most important step! Modify the add_train function so that, given a sentence, it keeps track of the counts you’ll need for the probability functions later. You will probably want to use default dictionaries or probability distributions. Finally, given the counts you’ve stored in add_train, you’ll need to implement the required probability estimates of a word given its context; see lm.py for the full list (the FAQ below mentions interpolated and discounted variants among them).
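
As a rough sketch of the kind of bookkeeping add_train might do for a bigram model (the class and attribute names here are assumptions, not the starter code’s actual structure):

    from collections import defaultdict

    class BigramCounts:
        def __init__(self):
            # _bigram_counts[context][word]: times `word` followed `context`
            self._bigram_counts = defaultdict(lambda: defaultdict(int))
            self._context_counts = defaultdict(int)

        def add_train(self, sentence):
            """Accumulate bigram and context counts from one censored,
            padded sentence (see the preparation sketch above)."""
            for context, word in zip(sentence, sentence[1:]):
                self._bigram_counts[context][word] += 1
                self._context_counts[context] += 1

        def mle(self, context, word):
            """Maximum-likelihood estimate of P(word | context)."""
            if self._context_counts[context] == 0:
                return 0.0
            return self._bigram_counts[context][word] / self._context_counts[context]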

Now, if you run the main section of lm.py, you’ll get per-sentence perplexity reports. Take a look at which sentences are particularly hard or easy (you don’t need to turn anything in for this part, however).
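
For reference, per-sentence perplexity is the exponentiated negative average log-probability of the sentence. A minimal sketch, assuming a hypothetical lm.log_prob(context, word) method that returns a natural-log probability (or kNEG_INF for zero):

    import math

    kNEG_INF = float("-inf")   # sentinel for log(0); see the FAQ below

    def sentence_perplexity(lm, sentence):
        """exp of the negative average log-probability over the bigrams
        of a censored, padded sentence."""
        log_total, n = 0.0, 0
        for context, word in zip(sentence, sentence[1:]):
            lp = lm.log_prob(context, word)
            if lp == kNEG_INF:
                return float("inf")   # any zero-probability bigram blows up
            log_total += lp
            n += 1
        return math.exp(-log_total / n)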

Exploration (10 points)

Try finding sentences from the test dataset that get really low perplexities for each of the estimation schemes (you may want to write some code to do this). Can you find any patterns? Turn in your findings and discussion in your README.txt.
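
One possible approach, reusing the censor and perplexity helpers sketched above and assuming a trained model lm for one estimation scheme:

    from nltk.corpus import treebank

    scored = []
    for sent in treebank.sents():
        padded = censor(sent)          # map OOV words to UNK, add markers
        scored.append((sentence_perplexity(lm, padded), " ".join(sent)))

    # Print the ten easiest sentences; short sentences made of very
    # frequent words tend to land here.
    for ppl, text in sorted(scored)[:10]:
        print(f"{ppl:8.2f}  {text}")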

Extra Credit

If you attempt any extra-credit extensions, make sure they don’t break the required code or functions that will be run by the autograder.

FAQ

Q: Why are there two passes over the data?

A: The first pass establishes the vocabulary, the second pass accumulates the counts. You could in theory do it in one pass, but it gets much more complicated.
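
Schematically, the two passes might look like this, reusing the censor helper and BigramCounts sketch from above:

    from nltk.corpus import brown

    # Pass 1 happened when we built `vocab` from the full corpus;
    # only after the vocabulary is frozen can censor() decide which
    # words become UNK. Pass 2 then accumulates the counts.
    lm = BigramCounts()
    for sent in brown.sents():
        lm.add_train(censor(sent))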

Q: What if the counts of <s> and </s> fall below the threshold?

A: They should always be included in the vocabulary.

Q: And what about words that fall below the threshold?

A: Words below the threshold are all mapped to the UNK token, so they are still represented; the vocab size will be the number of word types at or above the UNK threshold plus three (one for UNK, one for START, and one for END).

Q: What happens when I try to take the log of zero?

A: Return kNEG_INF instead.
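
For example, with a guarded log along these lines (kNEG_INF here is a stand-in for the constant defined in the starter code):

    import math

    kNEG_INF = float("-inf")

    def safe_log(p):
        # log(0) is undefined, so return the sentinel instead of raising
        return math.log(p) if p > 0 else kNEG_INF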

Q: Do I tune the hyperparameters for interpolation, discount, etc.?

A: No, that’s not part of this assignment.

Feedback (3 points)

In a README file, called README.txt, answer the following questions:

  1. How long did you spend on the assignment?
  2. What did you learn by doing this assignment?
  3. Briefly describe a Computational Text Analysis/Text as Data research question where language modeling would be useful (keep this answer to a maximum of 3 sentences).

Feel free to add additional, optional feedback, e.g., what you liked or disliked about the assignment.

Submitting

Submit the following files to the assignment called HW01 on Gradescope:

  1. lm.py
  2. README.txt

Make sure to name these files exactly as specified here; otherwise, our autograders might not work and we might have to take points off.