Due: 2024-10-01 23:59:00EDT
The goals of this week’s assignment are:
In this exercise, you will work through linear and polynomial regression. Our data consists of inputs \(x_{i} \in \mathbb{R}\) and outputs \(y_{i} \in \mathbb{R}\), which are related through a target function \(y = f(x) + \epsilon\). Your goal is to learn a linear predictor \(h_{w}(x)\) that best approximates \(f(x)\).
This assignment is to be done individually. Discussing high-level ideas with others in the class is encouraged, but you should not be sharing code in any way.
PolynomialRegression.py, run_regression.pyregression_train.csv, regression_test.csv (in the data folder)writeup.tex or README.md (choose one format for your writeup)We used numpy before in Homework01, but we’ll use more of its functionality in this homework. It is a good skill to pick up since you will inevitably use \(\texttt{numpy}\) if you plan to do math in Python, e.g. for machine learning. You may find it useful to work through a numpy tutorial first.1
Here are some things to keep in mind as you complete this problem:
If you are seeing many errors at runtime, inspect your matrix operations to make sure that you are adding and multiplying matrices of compatible dimensions. Printing the dimensions of variables with the \(\texttt{X.shape}\) command will help you debug.
When working with \(\texttt{numpy}\) arrays, note that \(\texttt{numpy}\) interprets the \(\texttt{*}\) operator as element-wise multiplication. This is a common source of size incompatibility errors. If you want matrix multiplication, you need to use the \(\texttt{np.dot}\) function. For example, \(\texttt{A*B}\) does element-wise multiplication while \(\texttt{np.dot(A,B)}\) does a matrix multiply.
Be careful when handling \(\texttt{numpy}\) vectors (rank-1 arrays): the vector shapes \(1 × n\), \(n × 1\), and \(n\) are all different things. For simplicity in this assignment (unless otherwise indicated in the code), both column and row vectors are rank-1 arrays of shape \(n\), not rank-2 arrays of shape \(n × 1\) or shape \(1×n\).
Useful numpy functions:
\(\texttt{np.array([...])}\): takes in a list (or list-of-lists, etc) and turns it into a \(\texttt{numpy}\) array. Note that after this step, you cannot \(\texttt{append}\), etc – you have to use \(\texttt{numpy}\) operations to create new array objects.
\(\texttt{np.dot(A,B)}\), \(\texttt{np.matmul(A,B)}\): perform matrix multiplication. Inner dimensions of \(A\) and \(B\) should match.
\(\texttt{np.ones((p,q))}\): creates a 2D array of 1’s with the given shape (here it would be \(p×q\)). Similarly, you can use \(\texttt{np.zeros((p,q))}\) to create an array of 0’s. If you include only one dimension (i.e. \(\texttt{np.zeros(p)}\)), this will yield a vector.
\(\texttt{np.concatenate((A, B), axis=0)}\): concatenate two arrays along the given axis. Here it would be along the rows (axis 0), so \(A\) would be “on top”, and \(B\) below. If we concatenate along axis 1 then \(A\) would be on the “left” and \(B\) would be on the right.
\(\texttt{np.reshape(A, (p,q))}\): will reshape array \(A\) to have dimensions \(p × q\) (or whatever dimensions you need, provided the values of \(A\) can actually be reshaped in this way).
\(\texttt{np.linalg.pinv(A)}\): returns the pseudo-inverse of matrix \(A\) (we will use this just in case we have issues with a singular matrix).
Visualizations can help us understand the data. For this data set, you can use a scatter plot since the data has only two properties to plot (\(x\) and \(y\)).
plot_data(...) function. What do you observe? For example, can you make an educated guess on the effectiveness of linear and polynomial regression in predicting the data? Include your responses and plots of both training and test data in your writeup.Hint: As you implement the remaining exercises, use this plotting function as a debugging tool.
Recall that the objective of linear regression is to minimize the cost function:
\[J(w) = \dfrac{1}{2} \sum_{i=1}^{n} (h_{w}(x_{i}) - y)^{2}\]In this problem, we will use the matrix-vector form where
\[y = \begin{pmatrix} y_{1} \\ y_{2} \\ \ldots \\ y_{n} \end{pmatrix}, \hspace{25pt} X = \begin{pmatrix} --x_{1}-- \\ --x_{2}--\\ \ldots \\ --x_{n}-- \end{pmatrix}, \hspace{25pt} w = \begin{pmatrix} w_{1} \\ w_{2} \\ \ldots \\ w_{p} \end{pmatrix}\]where each example \(x_{i} = [1, x_{i1}, \ldots , x_{ip}]\). Our linear regression model is: \(h_{w}(x) = w^{T}x\).
Beginning with the simple linear regression case (\(p = 1\)), we have: \(h_{w}(x) = w_{0} + w_{1}x\)
(b)
Modify generate_polynomial_features(...) in PolynomialRegression to create the matrix \(X\) for a simple linear model.
Note that to take into account the intercept term (\(w_{0}\)), we can add an additional ‘‘feature’’ to each example and set it to one, e.g. \(x_{i0} = 1\). This is equivalent to adding an additional first column to \(X\) and setting it to all ones.
(c) Before tackling the harder problem of training the regression model, complete predict(...) in PolynomialRegression to predict \(y\) from \(X\) and \(w\).
(d) One way to solve linear regression is through stochastic gradient descent (SGD).
Recall that the parameters of our model are the \(w_{j}\) values. These are the values we will adjust to minimize cost \(J(w)\). In SGD, each iteration runs through the training set and performs the update
With each step of gradient descent, our parameters \(w_{j}\) come closer to the optimal values that will achieve the lowest cost \(J(w)\).
As we perform gradient descent, it is helpful to monitor the convergence by computing the
cost. Complete cost(...) in PolynomialRegression to calculate \(J(w)\).
Next, implement the gradient descent step in fit_SGD(...) in PolynomialRegression. The loop structure has been written for you, so you only need to supply the updates to w and the new predictions \(\hat{y}\) within each iteration. In your writeup, include a plot of the final fit to the training data for \(\alpha = 0.01\).
Hint: A good way to verify that gradient descent is working correctly is to look at the value of \(J(w)\) and check that it is decreasing with each step. If you set verbose=True when calling fit_SGD(...), then, assuming you have implemented gradient descent and cost(...) correctly, your value of J(w) should never increase and should converge to a steady value by the end of the algorithm.
Hint: With verbose=True, you may also find it useful to look at the 2D plots of the training data and the output of the trained regression model to verify that the model looks reasonable and is improving on each step.
So far, we have used a default learning rate (or step size) of \(\alpha = 0.01\). Try different \(\alpha = 10^{-4},10^{-3},10^{-2},10^{-1}\), and make a table of the coefficients and number of iterations until convergence (in your writeup). How do the coeffcients compare? How quickly does each algorithm converge?
(e) In class, we learned that the closed-form solution to linear regression is \(w = (X^{T} X)^{-1}X^{T}y.\) Using this formula, you will get an exact solution in one calculation: there is no “loop until convergence” like in gradient descent.
fit(...) in PolynomialRegression.time module to measure the run time of the two approaches. You can capture the current time with time.time() (differences between these values are measured in seconds). How do the run times compare? writeupNow let us consider the more complicated case of polynomial regression, where, for a polynomial of degree \(d\), our hypothesis is
\[h_{w}(x) = w^{T}x = w_{0} + w_{1}x + w_{2}x^{2} + \ldots +w_{d}x^d.\]Recall that polynomial regression can be considered as an extension of linear regression in which we replace our input matrix \(X\) with
\[\Phi = \begin{pmatrix} --\phi(x_{1})-- \\ --\phi(x_{2})--\\ \ldots \\ --\phi(x_{n})-- \end{pmatrix},\]where \(\phi(x)\) is a function such that \(\phi_{k}(x) = x^k\) for \(k = 0, \ldots , d\).
generate_polynomial_features(...) in PolynomialRegression to create an \(d + 1\) dimensional feature vector for each example (okay to assume \(p = 1\) for this part).Given \(n\) training examples, it is always possible to obtain a “perfect fit” (a fit in which all the data points are exactly predicted) by setting the degree of the regression to \(n - 1\) (for example, in Handout 2 we had \(n = 2\) points and a polynomial of deg \(d = 1\) fit them perfectly). Of course, we would expect such a fit to generalize poorly. In the remainder of this problem, you will investigate the problem of overfitting as a function of the degree of the polynomial, \(d\). To measure overfitting, we will use the Root-Mean-Square (RMS) error, defined as
\[E_{RMS} = \sqrt{\dfrac{2\mathbb{E}[w]}{n}},\]where
\[\mathbb{E}[w] = \dfrac{1}{2}\sum_{i=1}^{n}(\sum_{k=0}^{d} w_{k}\phi_{k}(x_{i}) - y)^2 = \dfrac{1}{2}\sum_{i=1}^{n}(h_{w}(x_{i})-y_{i})^2\]and \(n\) is the number of examples.
(g) Implement rmse_error(...) in PolynomialRegression.
Why do you think we might prefer RMSE over \(J(w)\) as a metric? writeup
(h) For \(d = 0, \ldots , 10\), use the closed-form solver to determine the best-fit polynomial regression model on the training data, and with this model, calculate the RMSE on both the training data and the test data. Generate a plot depicting how RMSE varies with model complexity (polynomial degree) – you should generate a single plot with both training and test error, and include this plot in your writeup. Which degree polynomial would you say best fits the data? Was there evidence of under/overfitting the data? Use your plot to defend your answer. writeup
Finally, for the degree you chose, show a plot of the final fit to the training data. writeup
Finally, we will explore the role of regularization. For this problem, we will use \(L_{2}\)-regularization so that our regularized objective function is
\[J(w) = \dfrac{1}{2}\sum_{i=1}^{n}(h_{w}(x_{i})- y_{i})^{2} + \dfrac{\lambda}{2}\sum_{k=1}^{d}w_{k}^2,\]again optimizing for parameters \(w\).
(i) Modify fit(...) in PolynomialRegression to incorporate \(L_{2}\)-regularization.
(j) Use your updated solver to find the coefficients that minimize the error for a tenth-degree polynomial (\(d = 10\)) given regularization factor \(\lambda = 0, 10^8, 10^7, \ldots , 10^1, 10^0\). Now use these coefficients to calculate the RMS error (unregularized) on both the training data and test data as a function of \(\lambda\). Generate a plot depicting how RMS error varies with \(\lambda\) (for your x-axis, let \(x = [0,1,2,\ldots,10]\) correspond to \(\lambda = [0,10^8,10^7,\ldots,10^0]\) so that \(\lambda\) is on a logistic scale, with regularization increasing as \(x\) increases). How does regularization affect training and test error? Which \(\lambda\) value appears to work best? writeup
How do I know if my code is working?
Run the provided unit tests provided in tests.py.
Initially, many of them will fail.
If your program is working correctly, all of them will pass. However, the converse is not true: passing all the tests is not sufficient to demonstrate that your code is working.
Using tests.py is strongly encouraged, this is similar to how your code will be graded on gradescope.
In a text file called README.md answer the following questions:
Submit the following files to the assignment called HW02 on Gradescope:
PolynomialRegression.pyrun_regression.pyMake sure to name these files exactly what we specify here. Otherwise, our autograders might not work and we might have to take points off. Add any other files that you created that are needed for your code to run.
Modified from lab by: Sara Mathieson, Ameet Soni.
Try out Numpy’s tutorial (https://numpy.org/doc/stable/user/quickstart.html). Those familiar with Matlab may find the “Numpy for Matlab Users” documentation (https://numpy.org/doc/stable/user/numpy-for-matlab-users.html) more helpful. ↩