Skip to main content

Homework 4: Logistic Regression

Deadline: 2023-03-01 23:59:00EDT

credit: Jordan Boyd-Graber

The goal of this assignment is for you to 1) implement stochastic gradient ascent for logistic regression, 2) apply it to the task of determining whether documents are talking about hockey or baseball, 3) discover discriminative covariates that are indicative of categories.

The starter code logreg.pyis available here and the tests (tests.py) is available here. The data needed can be found in this tar.gz.

You should not use any libraries that implement any of the functionality of logistic regression for this assignment; logistic regression is implemented in scikit learn and many other places, but you should do everything by hand now. You’ll be able to use library implementations of logistic regression in the future.

Code (35 Points)

  1. Understand how the code is creating feature vectors (this will help you code the solution and to do the later analysis). You don’t actually need to write any code for this.
  2. (Optional) Store necessary data in the constructor so you can do classification later.
  3. You’ll likely need to write some code to get the best/worst features (see below).
  4. Modify the sg update function to perform non-regularized updates.

Analysis (15 points)

In your README, make sure to answer the following questions:

  1. What is the role of the learning rate?
  2. How many passes over the data do you need to complete for the model to stabilize? The various metrics can give you clues about what the model is doing, but no one metric is perfect.
  3. Pretend you need to explain to your boss what features you use to discover a post is about baseball or hockey. How would you answer this question? What words are the best predictors of each class? How (mathematically) did you find them?
  4. What words aren’t good predictors of either class? How (mathematically) did you find them?

Recommendation: I’d highly recommend using numpy to help answer questions 3 and 4.

Extra Credit

  1. Use a schedule to update the learning rate.
    • Supply an appropriate argument to step parameter
    • Support it in your sg update
    • Show the effect in your analysis write up
  2. Use document frequency (provided in the vocabulary file) to modify the feature values to tf-idf. There are many variants of tf-idf; you are welcome to implement any of them. Please specify in your analysis which variant you choose.
    • Modify the Example to store the tfidf vector in a variable named x_tfidf
    • With the appropriate flag, use the x_tfidf vector rather than x in the update
    • Show the effect in your analysis write up
  3. Modify the sg update function to perform lazy regularized updates, which only update the weights of features when they appear in an example.
    • Show the effect in your analysis write up

Caution: When implementing extra credit, make sure your implementation of the regular algorithms doesn’t change.

Feedback (3 points)

In your README, answer the following questions:

  1. How long did you spend on the assignment?
  2. What did you learn by doing this assignment?
  3. Briefly describe a Computational Text Analysis/Text as Data research question where using LDA would be useful (keep this answer to a maximum of 3 sentences).

Feel free to add additional/optional feedback, e.g. what did you like or dislike about the assignment?

Submitting

Submit the following files to the assignment called HW04 on Gradescope:

  1. logreg.py
  2. README (this can be a pdf if you include figures, which are better than text).

Make sure to name the python file exactly what we specify here. Otherwise, our autograders might not work and we might have to take points off.

Hints

  1. As with the previous assignment, make sure that you debug on small datasets first (we’ve provided toy text in the data directory to get you started).
  2. Certainly make sure that you do the unregularized version first and get it to work well.
  3. Use numpy functions whenever you can to make the computation faster.
  4. For the tf-idf extra credit, use the vocab file: the second element is the document frequency. This requires you to modify the read_dataset function.

Unit tests

Before running your code on the data, make sure your solution passes the unit tests. Below is the result of running the unit test prior to implementing stochastic gradient ascent.

[0. 0. 0. 0. 0.]
[1. 4. 3. 1. 0.]
F
======================================================================
FAIL: test_unreg (__main__.TestKnn)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/apoliak/BMC/teaching/CTA/assignments/logistic_regression/release/tests.py", line 19, in test_unreg
    self.assertAlmostEqual(beta[0], .5)
AssertionError: 0.0 != 0.5 within 7 places (0.5 difference)

----------------------------------------------------------------------
Ran 1 test in 0.002s

FAILED (failures=1)

Once the tests run successfully, you should see the following printed out:

[0. 0. 0. 0. 0.]
[1. 4. 3. 1. 0.]
[0.5 2.  1.5 0.5 0. ]
[1. 4. 3. 1. 0.]
.
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK

Example

This is an example of what your runs should look like:

python logreg.py
Read in 1064 train and 133 test
Update 1	TP -728.031877	HP -89.799782	TA 0.497180	HA 0.533835
Update 6	TP -900.538122	HP -115.634421	TA 0.517857	HA 0.488722
Update 11	TP -704.708605	HP -96.961868	TA 0.579887	HA 0.518797
Update 16	TP -679.129712	HP -92.263830	TA 0.588346	HA 0.541353
Update 21	TP -722.469178	HP -89.796534	TA 0.608083	HA 0.639098
Update 26	TP -656.324879	HP -75.516536	TA 0.660714	HA 0.714286
Update 31	TP -658.569795	HP -76.169386	TA 0.654135	HA 0.691729
Update 36	TP -633.161013	HP -72.344206	TA 0.684211	HA 0.706767
Update 41	TP -701.607619	HP -80.218928	TA 0.679511	HA 0.669173
Update 46	TP -588.567683	HP -66.271052	TA 0.722744	HA 0.766917
Update 51	TP -582.332356	HP -67.719479	TA 0.719925	HA 0.714286
Update 56	TP -506.524431	HP -62.377003	TA 0.759398	HA 0.781955
Update 61	TP -513.759593	HP -63.221052	TA 0.761278	HA 0.729323
Update 66	TP -644.013282	HP -76.702718	TA 0.718985	HA 0.706767
Update 71	TP -608.637740	HP -84.982997	TA 0.737782	HA 0.706767
Update 76	TP -567.831518	HP -81.089895	TA 0.759398	HA 0.736842
Update 81	TP -529.522052	HP -77.503906	TA 0.774436	HA 0.729323
Update 86	TP -522.190913	HP -75.391605	TA 0.779135	HA 0.744361
Update 91	TP -545.689829	HP -79.972120	TA 0.763158	HA 0.736842
Update 96	TP -474.988277	HP -70.802779	TA 0.798872	HA 0.759398
Update 101	TP -477.839626	HP -70.785230	TA 0.796992	HA 0.759398
Update 106	TP -440.214535	HP -65.508160	TA 0.809211	HA 0.804511
Update 111	TP -440.076552	HP -64.801832	TA 0.798872	HA 0.729323
Update 116	TP -713.863979	HP -89.852378	TA 0.694549	HA 0.676692
Update 121	TP -535.636455	HP -74.382158	TA 0.769737	HA 0.744361
Update 126	TP -424.582210	HP -64.657032	TA 0.818609	HA 0.774436
Update 131	TP -405.196644	HP -60.364998	TA 0.825188	HA 0.759398
Update 136	TP -408.284788	HP -59.994036	TA 0.812970	HA 0.766917
Update 141	TP -385.104092	HP -55.891346	TA 0.835526	HA 0.819549
Update 146	TP -399.842426	HP -56.306737	TA 0.817669	HA 0.804511
Update 151	TP -382.629258	HP -53.963372	TA 0.836466	HA 0.804511
Update 156	TP -381.663687	HP -52.937715	TA 0.838346	HA 0.804511
Update 161	TP -394.553812	HP -53.425808	TA 0.827068	HA 0.812030
Update 166	TP -413.874440	HP -54.035435	TA 0.818609	HA 0.796992
Update 171	TP -351.797283	HP -46.774927	TA 0.854323	HA 0.864662
Update 176	TP -364.208360	HP -48.756012	TA 0.854323	HA 0.819549
Update 181	TP -344.609405	HP -47.936714	TA 0.858083	HA 0.842105
Update 186	TP -339.372977	HP -48.645582	TA 0.853383	HA 0.834586
Update 191	TP -332.315167	HP -45.289989	TA 0.874060	HA 0.864662
Update 196	TP -335.748269	HP -44.054527	TA 0.867481	HA 0.857143
Update 201	TP -330.232184	HP -43.688826	TA 0.868421	HA 0.857143
Update 206	TP -329.351848	HP -43.176223	TA 0.875000	HA 0.857143
Update 211	TP -338.508390	HP -45.469553	TA 0.869361	HA 0.849624
Update 216	TP -336.950406	HP -44.084169	TA 0.867481	HA 0.849624
Update 221	TP -330.332779	HP -44.034070	TA 0.875000	HA 0.864662
Update 226	TP -342.730943	HP -47.352928	TA 0.866541	HA 0.864662
Update 231	TP -352.033720	HP -49.484280	TA 0.859023	HA 0.842105
Update 236	TP -362.172670	HP -53.178104	TA 0.859023	HA 0.842105
Update 241	TP -342.646803	HP -50.095402	TA 0.866541	HA 0.864662
Update 246	TP -349.859715	HP -51.710507	TA 0.859962	HA 0.849624
Update 251	TP -330.852584	HP -47.375257	TA 0.871241	HA 0.849624
Update 256	TP -337.695348	HP -46.598531	TA 0.868421	HA 0.849624
Update 261	TP -327.696433	HP -47.230397	TA 0.881579	HA 0.842105
Update 266	TP -321.175652	HP -46.978466	TA 0.875940	HA 0.842105
Update 271	TP -316.744286	HP -46.907174	TA 0.875000	HA 0.849624
Update 276	TP -318.625257	HP -47.229407	TA 0.875940	HA 0.849624
Update 281	TP -317.328799	HP -47.629807	TA 0.877820	HA 0.849624
Update 286	TP -313.285825	HP -44.597210	TA 0.881579	HA 0.849624
Update 291	TP -316.658230	HP -44.702502	TA 0.875000	HA 0.849624
Update 296	TP -309.389970	HP -44.404765	TA 0.888158	HA 0.849624
Update 301	TP -312.964923	HP -45.210895	TA 0.886278	HA 0.842105
Update 306	TP -307.419423	HP -43.342098	TA 0.889098	HA 0.864662
Update 311	TP -293.251319	HP -44.257542	TA 0.894737	HA 0.827068
Update 316	TP -283.531862	HP -42.377066	TA 0.896617	HA 0.857143
Update 321	TP -295.098046	HP -43.586514	TA 0.888158	HA 0.857143
Update 326	TP -286.864337	HP -42.486508	TA 0.891917	HA 0.864662
Update 331	TP -288.522475	HP -42.915298	TA 0.893797	HA 0.857143
Update 336	TP -283.714986	HP -41.928494	TA 0.893797	HA 0.857143
Update 341	TP -285.996595	HP -40.765043	TA 0.895677	HA 0.857143
Update 346	TP -271.787967	HP -39.220396	TA 0.896617	HA 0.857143
Update 351	TP -273.281431	HP -40.019914	TA 0.895677	HA 0.849624
Update 356	TP -270.152709	HP -39.683134	TA 0.895677	HA 0.872180
Update 361	TP -270.811528	HP -39.983171	TA 0.897556	HA 0.872180
Update 366	TP -263.316336	HP -39.107378	TA 0.895677	HA 0.864662
Update 371	TP -260.932851	HP -37.912378	TA 0.903195	HA 0.864662
Update 376	TP -264.846119	HP -38.946195	TA 0.898496	HA 0.864662
Update 381	TP -255.685867	HP -37.146460	TA 0.906955	HA 0.857143
Update 386	TP -266.240217	HP -38.572019	TA 0.901316	HA 0.864662
Update 391	TP -252.600101	HP -37.053851	TA 0.906955	HA 0.864662
Update 396	TP -277.531764	HP -40.205220	TA 0.897556	HA 0.849624
Update 401	TP -299.160168	HP -42.171900	TA 0.887218	HA 0.842105
Update 406	TP -255.168044	HP -32.911552	TA 0.907895	HA 0.902256
Update 411	TP -255.919891	HP -32.992732	TA 0.907895	HA 0.909774
Update 416	TP -255.636806	HP -32.776485	TA 0.911654	HA 0.917293
Update 421	TP -249.305313	HP -32.098639	TA 0.906015	HA 0.872180
Update 426	TP -249.095167	HP -31.809436	TA 0.907895	HA 0.872180
Update 431	TP -247.576211	HP -31.154342	TA 0.913534	HA 0.894737
Update 436	TP -244.828296	HP -31.023336	TA 0.913534	HA 0.902256
Update 441	TP -244.647621	HP -31.268950	TA 0.912594	HA 0.902256
Update 446	TP -246.675548	HP -33.054302	TA 0.904135	HA 0.887218
Update 451	TP -237.280527	HP -31.704781	TA 0.917293	HA 0.909774
Update 456	TP -233.663726	HP -31.525557	TA 0.918233	HA 0.887218
Update 461	TP -233.673719	HP -31.507317	TA 0.918233	HA 0.887218
Update 466	TP -237.187700	HP -30.951204	TA 0.923872	HA 0.909774
Update 471	TP -237.968934	HP -31.225037	TA 0.922932	HA 0.909774
Update 476	TP -241.493792	HP -34.315493	TA 0.909774	HA 0.864662
Update 481	TP -234.887174	HP -32.732465	TA 0.914474	HA 0.872180
Update 486	TP -240.785778	HP -30.537650	TA 0.922932	HA 0.924812
Update 491	TP -236.276425	HP -32.770371	TA 0.917293	HA 0.894737
Update 496	TP -235.417560	HP -32.366469	TA 0.918233	HA 0.902256
Update 501	TP -240.166284	HP -33.564283	TA 0.911654	HA 0.909774
Update 506	TP -258.218223	HP -38.384819	TA 0.896617	HA 0.872180
Update 511	TP -229.093576	HP -31.213094	TA 0.916353	HA 0.902256
Update 516	TP -225.074420	HP -31.455123	TA 0.921053	HA 0.909774
Update 521	TP -224.186134	HP -31.001963	TA 0.925752	HA 0.917293
Update 526	TP -224.216961	HP -30.654317	TA 0.922932	HA 0.917293
Update 531	TP -224.760694	HP -30.463913	TA 0.921053	HA 0.917293
Update 536	TP -244.502547	HP -35.925176	TA 0.905075	HA 0.894737
Update 541	TP -216.257792	HP -29.450287	TA 0.925752	HA 0.909774
Update 546	TP -219.859926	HP -29.173745	TA 0.925752	HA 0.917293
Update 551	TP -238.202289	HP -31.547143	TA 0.918233	HA 0.924812
Update 556	TP -223.192968	HP -28.827227	TA 0.932331	HA 0.924812
Update 561	TP -229.294454	HP -30.189520	TA 0.922932	HA 0.909774
Update 566	TP -223.212876	HP -29.143003	TA 0.925752	HA 0.932331
Update 571	TP -220.508124	HP -28.311058	TA 0.930451	HA 0.917293
Update 576	TP -221.648705	HP -28.206051	TA 0.931391	HA 0.924812
Update 581	TP -218.971886	HP -27.944724	TA 0.930451	HA 0.924812
Update 586	TP -213.005446	HP -27.288646	TA 0.935150	HA 0.924812
Update 591	TP -209.871761	HP -27.600190	TA 0.936090	HA 0.924812
Update 596	TP -207.993175	HP -27.908688	TA 0.932331	HA 0.924812
Update 601	TP -219.682294	HP -29.891741	TA 0.928571	HA 0.924812
Update 606	TP -224.083987	HP -30.756422	TA 0.926692	HA 0.909774
Update 611	TP -198.293977	HP -28.263903	TA 0.939850	HA 0.932331
Update 616	TP -210.225359	HP -30.256640	TA 0.934211	HA 0.917293
Update 621	TP -189.285680	HP -26.920397	TA 0.938910	HA 0.932331
Update 626	TP -185.885916	HP -26.693071	TA 0.934211	HA 0.932331
Update 631	TP -184.754707	HP -26.536890	TA 0.936090	HA 0.924812
Update 636	TP -184.401972	HP -26.635137	TA 0.936090	HA 0.924812
Update 641	TP -243.733762	HP -41.992132	TA 0.916353	HA 0.879699
Update 646	TP -202.614916	HP -34.416480	TA 0.930451	HA 0.909774
Update 651	TP -195.117687	HP -32.646085	TA 0.936090	HA 0.909774
Update 656	TP -191.641899	HP -32.069876	TA 0.937030	HA 0.924812
Update 661	TP -197.845320	HP -34.776900	TA 0.936090	HA 0.894737
Update 666	TP -211.962363	HP -37.951163	TA 0.930451	HA 0.872180
Update 671	TP -215.861813	HP -38.958660	TA 0.928571	HA 0.872180
Update 676	TP -209.104598	HP -37.592822	TA 0.929511	HA 0.872180
Update 681	TP -200.823860	HP -35.991893	TA 0.931391	HA 0.887218
Update 686	TP -197.625074	HP -35.499161	TA 0.932331	HA 0.894737
Update 691	TP -198.688203	HP -35.909041	TA 0.934211	HA 0.879699
Update 696	TP -195.728332	HP -35.311607	TA 0.934211	HA 0.887218
Update 701	TP -184.278031	HP -32.210576	TA 0.941729	HA 0.894737
Update 706	TP -183.843677	HP -32.103085	TA 0.941729	HA 0.894737
Update 711	TP -183.431062	HP -31.912981	TA 0.942669	HA 0.894737
Update 716	TP -176.952392	HP -30.010054	TA 0.943609	HA 0.917293
Update 721	TP -177.748702	HP -29.913118	TA 0.945489	HA 0.924812
Update 726	TP -177.481537	HP -30.264986	TA 0.943609	HA 0.917293
Update 731	TP -188.138807	HP -34.021290	TA 0.945489	HA 0.917293
Update 736	TP -280.139552	HP -52.129680	TA 0.894737	HA 0.834586
Update 741	TP -201.545726	HP -38.295314	TA 0.937030	HA 0.894737
Update 746	TP -218.586819	HP -42.233102	TA 0.929511	HA 0.887218
Update 751	TP -202.794987	HP -39.780847	TA 0.938910	HA 0.887218
Update 756	TP -181.017139	HP -34.575664	TA 0.943609	HA 0.909774
Update 761	TP -170.328651	HP -30.073088	TA 0.944549	HA 0.909774
Update 766	TP -169.360699	HP -29.281683	TA 0.943609	HA 0.917293
Update 771	TP -169.122115	HP -28.121091	TA 0.951128	HA 0.924812
Update 776	TP -174.043514	HP -31.168049	TA 0.943609	HA 0.902256
Update 781	TP -168.128413	HP -24.533371	TA 0.945489	HA 0.947368
Update 786	TP -168.382129	HP -24.665370	TA 0.945489	HA 0.947368
Update 791	TP -167.152851	HP -24.577424	TA 0.948308	HA 0.947368
Update 796	TP -177.801247	HP -24.737861	TA 0.944549	HA 0.924812
Update 801	TP -168.469453	HP -23.295863	TA 0.948308	HA 0.947368
Update 806	TP -164.565087	HP -22.868267	TA 0.951128	HA 0.947368
Update 811	TP -163.537213	HP -22.877532	TA 0.953008	HA 0.947368
Update 816	TP -160.231437	HP -22.648307	TA 0.952068	HA 0.947368
Update 821	TP -157.483278	HP -22.685490	TA 0.949248	HA 0.939850
Update 826	TP -157.006715	HP -22.836286	TA 0.953008	HA 0.939850
Update 831	TP -156.877774	HP -22.625474	TA 0.953008	HA 0.939850
Update 836	TP -154.113836	HP -21.634553	TA 0.955827	HA 0.947368
Update 841	TP -153.803587	HP -22.432015	TA 0.954887	HA 0.939850
Update 846	TP -154.842451	HP -22.816448	TA 0.950188	HA 0.939850
Update 851	TP -151.456616	HP -25.947794	TA 0.960526	HA 0.917293
Update 856	TP -152.898779	HP -27.128185	TA 0.958647	HA 0.909774
Update 861	TP -151.672465	HP -26.257596	TA 0.960526	HA 0.924812
Update 866	TP -153.140256	HP -26.597070	TA 0.960526	HA 0.909774
Update 871	TP -152.001947	HP -27.514782	TA 0.960526	HA 0.924812
Update 876	TP -150.523445	HP -27.637190	TA 0.961466	HA 0.924812
Update 881	TP -149.950688	HP -27.628529	TA 0.959586	HA 0.932331
Update 886	TP -149.398130	HP -27.826662	TA 0.958647	HA 0.924812
Update 891	TP -157.314101	HP -28.664242	TA 0.953947	HA 0.932331
Update 896	TP -150.317655	HP -27.502255	TA 0.960526	HA 0.924812
Update 901	TP -147.562471	HP -28.082296	TA 0.962406	HA 0.917293
Update 906	TP -147.225982	HP -28.005794	TA 0.963346	HA 0.917293
Update 911	TP -147.406911	HP -27.964395	TA 0.961466	HA 0.917293
Update 916	TP -145.498971	HP -27.292979	TA 0.962406	HA 0.917293
Update 921	TP -159.307199	HP -28.638735	TA 0.954887	HA 0.917293
Update 926	TP -156.089930	HP -27.468569	TA 0.959586	HA 0.917293
Update 931	TP -148.359705	HP -27.892735	TA 0.965226	HA 0.924812
Update 936	TP -164.892126	HP -29.170329	TA 0.958647	HA 0.917293
Update 941	TP -161.556264	HP -28.584273	TA 0.959586	HA 0.917293
Update 946	TP -157.434058	HP -27.941456	TA 0.960526	HA 0.917293
Update 951	TP -140.535551	HP -23.529732	TA 0.966165	HA 0.954887
Update 956	TP -150.600977	HP -29.336520	TA 0.962406	HA 0.924812
Update 961	TP -146.216163	HP -27.631492	TA 0.964286	HA 0.917293
Update 966	TP -146.450307	HP -27.202680	TA 0.965226	HA 0.924812
Update 971	TP -143.741654	HP -26.406012	TA 0.964286	HA 0.932331
Update 976	TP -142.306727	HP -26.397807	TA 0.962406	HA 0.924812
Update 981	TP -146.722993	HP -27.991503	TA 0.965226	HA 0.924812
Update 986	TP -143.086637	HP -27.803336	TA 0.965226	HA 0.917293
Update 991	TP -144.620445	HP -28.091146	TA 0.964286	HA 0.924812
Update 996	TP -140.363950	HP -28.288579	TA 0.963346	HA 0.917293
Update 1001	TP -139.373096	HP -28.215446	TA 0.965226	HA 0.917293
Update 1006	TP -138.083490	HP -27.983285	TA 0.965226	HA 0.917293
Update 1011	TP -137.176496	HP -29.208996	TA 0.962406	HA 0.909774
Update 1016	TP -136.126425	HP -29.141944	TA 0.962406	HA 0.909774
Update 1021	TP -130.769707	HP -26.930035	TA 0.967105	HA 0.924812
Update 1026	TP -129.979771	HP -26.882370	TA 0.965226	HA 0.917293
Update 1031	TP -129.148996	HP -26.717335	TA 0.965226	HA 0.917293
Update 1036	TP -126.848435	HP -25.697901	TA 0.970865	HA 0.932331
Update 1041	TP -125.353078	HP -25.143287	TA 0.973684	HA 0.932331
Update 1046	TP -125.650765	HP -25.450108	TA 0.969925	HA 0.939850
Update 1051	TP -125.489756	HP -25.138841	TA 0.974624	HA 0.939850
Update 1056	TP -126.303435	HP -24.990931	TA 0.971805	HA 0.939850
Update 1061	TP -123.245353	HP -24.924609	TA 0.974624	HA 0.947368
Update 1064	TP -123.245353	HP -24.924609	TA 0.974624	HA 0.947368