Homework 5: PyTorch (Logistic Regression, CBoW)

Deadline: 2023-03-20 23:59:00EDT

credit: Jordan Boyd-Graber

In this homework you’ll get more experience with PyTorch. You will complete the notebook from Monday’s lab (03/13) and implement stochastic gradient ascent for logistic regression, applying it to the task of determining whether documents are talking about hockey or baseball. Sound familiar? It should be!

Indeed, you will be doing exactly the same thing on exactly the same data as the previous homework. The only difference is that last time you had to implement logistic regression yourself; this time you can use PyTorch directly.

What you have to do

1. Complete the notebook from lab (20 points)

Complete the notebook from lab. Fill in the missing code and answer the questions in the notebook.

1.1 Similar words

In the last cells, we look at what words are most similar to father, king, africa, and baltimore according to the trained CBOW model.

In a new code cell, look at the words most similar to any 5 words of your choosing. Then, in a new text/markdown cell, briefly describe what you found.
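If you need a starting point, below is a minimal sketch of one way to query the trained embeddings with cosine similarity. It assumes your notebook exposes the embedding matrix as a tensor called embeddings and the vocabulary mappings word_to_ix / ix_to_word; these names are placeholders, so swap in whatever your notebook actually uses.

    import torch
    import torch.nn.functional as F

    def most_similar(word, embeddings, word_to_ix, ix_to_word, k=5):
        # Cosine similarity between the query word's vector and every row
        # of the (vocab_size x dim) embedding matrix.
        query = embeddings[word_to_ix[word]].unsqueeze(0)
        sims = F.cosine_similarity(query, embeddings)
        # Take k+1 so we can drop the query word itself (similarity 1.0).
        top = torch.topk(sims, k + 1).indices.tolist()
        return [ix_to_word[i] for i in top if i != word_to_ix[word]][:k]

    print(most_similar("father", embeddings, word_to_ix, ix_to_word))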

1.2 Biases in word embeddings

Some of the options in Reading 05 looked at biases in word embeddings. For example, the well-known paper Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings showed that the vector you get by subtracting the embedding for woman from the embedding for man is very close to the vector you get by subtracting homemaker from computer programmer.

Using the man-to-woman analogy, come up with three additional words that might reveal gender-based biases in the word embeddings you trained. Make sure to create a new code cell where you test the new words. Then, in a new text/markdown cell, briefly describe the biases you discovered (if you discovered any).
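One way to run the analogy test is the usual vector-offset trick (b - a + c). The sketch below reuses the hypothetical embeddings, word_to_ix, and ix_to_word names from the earlier sketch; again, adjust them to match your notebook.

    import torch
    import torch.nn.functional as F

    def analogy(a, b, c, embeddings, word_to_ix, ix_to_word, k=5):
        # Offset vector: b - a + c, e.g. woman - man + programmer.
        vec = (embeddings[word_to_ix[b]] - embeddings[word_to_ix[a]]
               + embeddings[word_to_ix[c]]).unsqueeze(0)
        sims = F.cosine_similarity(vec, embeddings)
        return [ix_to_word[i] for i in torch.topk(sims, k).indices.tolist()]

    print(analogy("man", "woman", "programmer", embeddings, word_to_ix, ix_to_word))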

Installing pytorch and necessary packages

You can install PyTorch on one of the CS lab machines by running:

conda install pytorch=1.13 torchvision -c pytorch

This will install the CPU version of PyTorch 1.13.

Running pip install -r requirements.txt will install the necessary packages.

2. Coding (15 points):

  1. Load in the data and create a data iterator. This will be the most difficult part. You may use the sklearn feature-creation functions, or you can directly create a feature matrix yourself.
  2. Create a logistic regression model with a softmax/sigmoid activation function. To make the unit tests work, we had to initialize a member of the SimpleLogreg class. Replace the None placeholder with an appropriate nn.Module.
  3. Optimize the function (remember to zero out gradients) and analyze the output. A sketch of one possible approach to these three steps appears after this list.
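The sketch below is not the official solution and does not match lr_pytorch.py line for line; it only illustrates the shape of the three steps. It assumes hypothetical variables train_docs and train_labels for the raw data, and it guesses that SimpleLogreg stores its layer in a member called linear; check the provided starter code and unit tests for the real member name and interface.

    import torch
    import torch.nn as nn
    from sklearn.feature_extraction.text import CountVectorizer

    # Step 1 (sketch): build a dense feature matrix and a data iterator.
    # train_docs (list of strings) and train_labels (list of 0/1 ints) are
    # placeholders for however you load the hockey/baseball data.
    vectorizer = CountVectorizer()
    X = torch.from_numpy(vectorizer.fit_transform(train_docs).toarray()).float()
    y = torch.tensor(train_labels).float()
    data_iterator = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(X, y), batch_size=32, shuffle=True)

    # Step 2 (sketch): logistic regression is one linear layer plus a sigmoid.
    class SimpleLogreg(nn.Module):
        def __init__(self, num_features):
            super().__init__()
            self.linear = nn.Linear(num_features, 1)   # member name is a guess

        def forward(self, x):
            return torch.sigmoid(self.linear(x))       # P(label = 1 | x)

    # Step 3 (sketch): minimize binary cross-entropy (equivalent to maximizing
    # the log-likelihood), zeroing gradients on every batch.
    model = SimpleLogreg(num_features=X.shape[1])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.BCELoss()

    for epoch in range(10):
        for xb, yb in data_iterator:
            optimizer.zero_grad()
            loss = criterion(model(xb).squeeze(1), yb)
            loss.backward()
            optimizer.step()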

3. Analysis (10 points):

  1. How does the setup differ from the model you learned “by hand” in terms of initialization, number of parameters, and activation?
  2. Look at the top features again. Do they look consistent with your results from the last homework?

What to turn in

Feedback (3 points)

In your README, answer the following questions:

  1. How long did you spend on the assignment?
  2. What did you learn by doing this assignment?
  3. Briefly describe a Computational Text Analysis/Text as Data research question where using LDA would be useful (keep this answer to a maximum of 3 sentences).

Feel free to add additional/optional feedback, e.g. what did you like or dislike about the assignment?

Submitting

Submit the following files to the assignment called HW05 on Gradescope:

  1. The completed notebook as a PDF (make sure all cells have been run)
  2. The completed notebook as a .ipynb file (make sure all cells have been run)
  3. lr_pytorch.py
  4. README (this can be a PDF if you include figures, which are better than text descriptions). The README should also include a link to your notebook.

Make sure to name the Python file exactly as we specify here. Otherwise, our autograders might not work and we might have to take points off.