The goal of this assignment is for you to implement statistical methods for training a language model. Remember, training a language model just means computing n-gram probabilities. In this assignment, we will be using a bi-gram model. Remember, this means we compute the probability of a sentence as the product of the probabilities of each word given the word before it.
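For reference, the bi-gram approximation can be written as below. The notation is an assumption for illustration, not taken from the starter code: $w_0$ stands for the start symbol <s>, $w_{n+1}$ for the end symbol </s>, and $C(\cdot)$ for a count in the training corpus.

$$P(w_1, \dots, w_n) \approx \prod_{i=1}^{n+1} P(w_i \mid w_{i-1})$$

$$P_{\text{MLE}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}$$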
We will train the LM on 57,340 sentences from the Brown corpus.
Note: You will need to download the Brown corpus from nltk. Make sure you are using an nltk version that is 3.5 or higher. When you run lm.py, you will see instructions on how to install the Brown corpus in Python.
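As a rough sketch of what the download step usually looks like (the instructions printed by lm.py take precedence if they differ), the corpora can be fetched through nltk's downloader. The function name `download_corpora` is hypothetical. The Treebank download is included on the assumption that you will also need the test set locally.

```python
def download_corpora():
    """Fetch the corpora used in this assignment; return the Brown sentence count."""
    # Imported inside the function so the snippet loads even before nltk is installed.
    import nltk
    from nltk.corpus import brown

    print("nltk version:", nltk.__version__)  # should be 3.5 or higher
    nltk.download("brown")     # training set
    nltk.download("treebank")  # test set
    return len(brown.sents())
```

Calling `download_corpora()` once is enough; nltk caches the data on disk afterwards.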
The starter code lm.py is available here and the tests (test.py) are available here.
We will use the Brown corpus (nltk.corpus.brown) as our training set and the Treebank (nltk.corpus.treebank) as our test set. Eventually, we’ll want to build a language model from the Brown corpus and apply it on the Treebank corpus. First, however, we need to prepare our corpus.
train_seen function. Modify this function so that it will keep track of all of the tokens in the training corpus and their counts.
vocab_lookup function. This should return a unique identifier for a word, or a common “unknown” identifier for words that do not meet the unk_cutoff threshold. You can use strings as your identifier (e.g., leaving inputs unchanged if they pass the threshold) or you can replace strings with integers (this will lead to a more efficient implementation). The unit tests are engineered to accept both options.
Note: In lecture we did not discuss how to deal with unknown words. Make sure to read section 3.4.1 Unknown Words of the textbook.
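One way to picture the vocabulary step is the minimal sketch below. It is not the starter code's class: the name `VocabSketch`, the `finalize` method, and the spellings of the special tokens are assumptions made for illustration, and it uses strings as identifiers. It also assumes a word is kept when its count is at or above the cutoff.

```python
from collections import Counter

# Hypothetical spellings for the special tokens.
UNK = "<UNK>"
START = "<s>"
END = "</s>"


class VocabSketch:
    """Illustrative vocabulary with an unknown-word cutoff."""

    def __init__(self, unk_cutoff=2):
        self.unk_cutoff = unk_cutoff
        self.counts = Counter()
        self.finalized = False

    def train_seen(self, word, count=1):
        # First pass: tally how often each token appears in the training corpus.
        assert not self.finalized, "vocabulary is already finalized"
        self.counts[word] += count

    def finalize(self):
        # After this point the vocabulary no longer changes.
        self.finalized = True

    def vocab_lookup(self, word):
        # The start and end symbols always stay in the vocabulary.
        if word in (START, END):
            return word
        # Words at or above the cutoff keep their own identifier.
        if self.counts[word] >= self.unk_cutoff:
            return word
        # Everything else shares the common "unknown" identifier.
        return UNK


vocab = VocabSketch(unk_cutoff=2)
for token in "the cat saw the dog".split():
    vocab.train_seen(token)
vocab.finalize()
```

Here "the" appears twice and keeps its identifier, while "cat" appears once and maps to the unknown identifier, as does any word never seen in training.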
After you’ve finalized the vocabulary, then you need to add training data to the model. This is the most important step! Modify the add_train function so that given a sentence it keeps track of the necessary counts you’ll need for the probability functions later. You will probably want to use default dictionaries or probability distributions. Finally, given the counts that you’ve stored in add_train, you’ll need to implement probability estimates for contexts. These are the required probability estimates you’ll need to implement:
mle: Simple division of counts for that observation by total counts for the context
laplace: Add one to all counts
add_k: Add a specified parameter greater than zero to all counts
jelinek_mercer: Interpolate between probability distributions with parameter lambda
Now if you run the main section of the lm file, you’ll get per-sentence reports of perplexity. Take a look at what sentences are particularly hard or easy (you don’t need to turn anything in here, however).
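The four required estimators listed above can be sketched on toy data as follows. This is a sketch under stated assumptions, not the starter code's interface: the class name `BigramSketch` and the parameters `vocab_size` and `jm_lambda` are hypothetical, the functions return plain probabilities (the starter code may expect log probabilities), and the Jelinek-Mercer version assumes lambda weights the bigram estimate and that the lower-order distribution is the unigram MLE.

```python
from collections import Counter


class BigramSketch:
    """Illustrative bigram counts and estimators."""

    def __init__(self, vocab_size, jm_lambda=0.6):
        self.vocab_size = vocab_size
        self.jm_lambda = jm_lambda
        self.bigram_counts = Counter()   # (context, word) -> count
        self.context_counts = Counter()  # context -> count
        self.unigram_counts = Counter()  # predicted word -> count
        self.total = 0                   # number of predicted words seen

    def add_train(self, tokens):
        # tokens is assumed to be padded with start/end symbols and
        # already mapped through the vocabulary.
        for context, word in zip(tokens, tokens[1:]):
            self.bigram_counts[(context, word)] += 1
            self.context_counts[context] += 1
            self.unigram_counts[word] += 1
            self.total += 1

    def mle(self, context, word):
        # Count of the observation divided by the total count for the context.
        total = self.context_counts[context]
        return self.bigram_counts[(context, word)] / total if total else 0.0

    def add_k(self, context, word, k):
        # Add k to every count; the denominator grows by k per vocabulary entry.
        numerator = self.bigram_counts[(context, word)] + k
        denominator = self.context_counts[context] + k * self.vocab_size
        return numerator / denominator

    def laplace(self, context, word):
        # Laplace smoothing is add-k with k = 1.
        return self.add_k(context, word, 1.0)

    def jelinek_mercer(self, context, word):
        # Interpolate the bigram MLE with the unigram MLE.
        unigram = self.unigram_counts[word] / self.total if self.total else 0.0
        bigram = self.mle(context, word)
        return self.jm_lambda * bigram + (1.0 - self.jm_lambda) * unigram


lm = BigramSketch(vocab_size=5)
lm.add_train(["<s>", "a", "b", "</s>"])
lm.add_train(["<s>", "a", "a", "</s>"])
```

In this toy corpus the context "a" occurs three times and is followed by "b" once, so the MLE of "b" given "a" is 1/3, while Laplace smoothing with a vocabulary of five gives (1 + 1) / (3 + 5).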
Try finding sentences from the test dataset that get really low perplexities for each of the estimation schemes (you may want to write some code to do this). Can you find any patterns? Turn in your findings and discussion in your README.txt.
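A minimal sketch of the kind of helper code you might write for this search is shown below. It assumes perplexity is the exponential of the negative average natural-log probability per token, which may differ from how lm.py reports it, and `score_sentence` is a hypothetical callable that you would supply for each estimation scheme.

```python
import math


def perplexity(log_probs):
    """Perplexity from a non-empty list of natural-log token probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))


def rank_by_perplexity(sentences, score_sentence):
    """Return (perplexity, sentence) pairs, lowest perplexity first.

    score_sentence is a hypothetical callable that maps a sentence to the
    list of log probabilities of its tokens under one estimation scheme.
    """
    scored = [(perplexity(score_sentence(s)), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0])
```

Running `rank_by_perplexity` once per estimator and comparing the first few entries of each list is one way to look for patterns.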
Extra Credit (make sure these additions don’t screw up required code / functions that will be run by the autograder):
kneser_ney: Use discounting and prefixes with discount parameter $\delta$ and concentration parameter alpha to implement interpolated Kneser-Ney.
sample
Q: Why are there two passes over the data?
A: The first pass establishes the vocabulary, the second pass accumulates the counts. You could in theory do it in one pass, but it gets much more complicated.
Q: What if the counts of <s> and </s> fall below the threshold?
A: They should always be included in the vocabulary.
Q: And what about words that fall below the threshold?
A: They must also be represented, so the vocab size will be the number of distinct words at or above the UNK threshold plus three (one for UNK, one for START, and one for END).
Q: What happens when I try to take the log of zero?
A: Return kNEG_INF instead.
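A guarded logarithm along these lines is one way to do that. The value assigned to kNEG_INF here is only an assumed placeholder (use the constant defined in lm.py), the helper name `safe_log` is hypothetical, and the natural log is an assumption about the base.

```python
import math

kNEG_INF = -1e6  # assumed placeholder; use the constant defined in lm.py


def safe_log(prob):
    """Natural log of prob, or kNEG_INF when prob is zero."""
    return math.log(prob) if prob > 0.0 else kNEG_INF
```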
Q: Do I tune the hyperparameters for interpolation, discount, etc.?
A: No, that’s not part of this assignment.
In a README file, called README.txt, answer the following questions:
Feel free to add additional/optional feedback, e.g. what did you like or dislike about the assignment?
Submit the following files to the assignment called HW01 on Gradescope:
lm.py
README.txt
Make sure to name these files exactly what we specify here. Otherwise, our autograders might not work and we might have to take points off.