
Homework 7: Support Vector Machine

Due: 2024-11-27 23:59:00EDT

Overview

The goals of this assignment:

Introduction

You will have or create the following files:

Dataset and Command line arguments

Note: the lab computers will have sklearn version 1.0.2, so that is the documentation you should be using for this assignment. Throughout the homework, consult the documentation frequently to find the appropriate methods for this assignment. Reading and using documentation is a very important skill that we will practice in homework 7, homework 8, and the final project. Make sure you really understand each line of code you’re writing and each method you’re using - there are fewer lines of code for this lab, but each line is doing a lot!

When importing sklearn modules, avoid importing with the * - this imports everything in the library (and these functions can be confused with user-defined functions). Instead, import functions and classes directly. For example:

from sklearn.datasets import fetch_openml, load_breast_cancer

Imports should be sorted and grouped by type (for example, I usually group imports from Python libraries separately from imports of my own libraries).
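As a rough illustration of that grouping (the exact ordering convention is a matter of taste, and the local module named in the last comment is hypothetical), the top of a file might look like this:

```python
# Standard-library imports, sorted alphabetically.
import argparse
import sys

# Third-party imports, each name imported explicitly (no wildcard).
from sklearn.datasets import fetch_openml, load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier

# Your own modules would form a third group here, for example a
# hypothetical `from my_helpers import plot_scores`.
print(load_breast_cancer.__name__, KNeighborsClassifier.__name__)
```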

Rather than parsing and processing data sets yourself, you will use sklearn’s pre-defined data sets. Details can be found here. At a minimum, your experiments will require the MNIST and 20 Newsgroups datasets. Both are multi-class tasks (10 and 20 classes, respectively). Both are also large and take time to run, so I recommend developing with the Wisconsin Breast Cancer dataset (example below).

For this assignment, your run_pipeline.py file should take in one command line argument (using the argparse library), the dataset name. This does not refer to a file, but allows the program to import the correct dataset (options: cancer, mnist, and news). Here is an example:

if args.dataset == "cancer":
    data = load_breast_cancer()

X = data['data']
y = data['target']
print(X.shape)
print(y.shape)

which outputs 569 examples with 30 features each:

(569, 30)
(569,)

The MNIST dataset is very large and takes a lot of time to run, so you can randomly select 1000 examples; you should also normalize the pixel values between 0 and 1 (instead of 0 and 255):

from sklearn import utils
from sklearn.datasets import fetch_openml

data = fetch_openml('mnist_784', data_home="/home/apoliak/Public/cs383-ml/sklearn-data/")
X = data['data']
y = data['target']
X, y = utils.shuffle(X, y)  # shuffle the rows and labels together
X = X[:1000]  # only keep 1000 examples
y = y[:1000]
X = X / 255  # normalize pixel values from [0, 255] to [0, 1]
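If you want the same 1000 examples on every run, one option is to give the shuffle a fixed seed. The sketch below is only an illustration: it uses a synthetic array as a stand-in for MNIST (so it runs without the download), and the seed value and the use of the n_samples argument are my choices, not requirements of the assignment.

```python
import numpy as np
from sklearn import utils

# Synthetic stand-in for MNIST: 5000 "images" of 784 pixel values in 0..255.
rng = np.random.default_rng(0)
X_full = rng.integers(0, 256, size=(5000, 784))
y_full = rng.integers(0, 10, size=5000)

# Shuffle rows and labels together and keep 1000 of them; a fixed
# random_state returns the same subset on every run.
X, y = utils.shuffle(X_full, y_full, random_state=42, n_samples=1000)

# Scale pixel values from [0, 255] to [0, 1].
X = X / 255

print(X.shape, y.shape)
print(X.min(), X.max())
```

Passing n_samples does the shuffle and the truncation in one call; slicing after the shuffle, as in the snippet above, is equivalent.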

The newsgroup dataset in vector form (i.e., bag of words) is obtained using:

data = fetch_20newsgroups_vectorized(subset='all', data_home="/home/apoliak/Public/cs383-ml/sklearn-data/")

No normalization is required; I suggest randomly sampling 1000 examples for this dataset as well. The data object also contains headers and target information, which you should examine to understand the data. For your analysis, it may be helpful to know the number of features, their types, and what classes are being predicted.
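One way to examine a data object is sketched below. It uses the breast cancer dataset because it ships with sklearn and needs no download; which keys the other datasets provide may differ, so print data.keys() for each one rather than assuming the same fields. Note that the vectorized newsgroups data is a sparse matrix rather than a dense array.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# The returned object behaves like a dictionary; list what it holds.
print(data.keys())

# Number of examples and features, and the feature type.
print(data["data"].shape)
print(data["data"].dtype)

# Names of the features and of the classes being predicted.
print(len(data["feature_names"]))
print(list(data["target_names"]))
```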

Coding Requirements

The coding portion is flexible - the goal is to be able to execute the experiments below. However, you should keep these requirements in mind:

$ python3 run_pipeline.py -d cancer
$ python3 run_pipeline.py -d mnist
$ python3 run_pipeline.py -d news

You may have cases for each one (since they need to be treated differently). If the user does not enter a dataset, rely on argparse to print a helpful message.
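A minimal argparse sketch for this interface follows. The -d flag and the args.dataset attribute come from the examples above; the long option name --dataset, the help text, and the build_parser helper are my own assumptions. Marking the argument as required and restricting it with choices lets argparse print the usage message for you.

```python
import argparse


def build_parser():
    """Build a parser with one required option: the dataset name."""
    parser = argparse.ArgumentParser(
        description="Run the tune/test pipeline on one dataset.")
    parser.add_argument(
        "-d", "--dataset",
        required=True,                        # argparse reports an error if missing
        choices=["cancer", "mnist", "news"],  # anything else is rejected
        help="which sklearn dataset to load",
    )
    return parser


# An explicit argument list is passed here so the sketch runs anywhere;
# in run_pipeline.py you would call parse_args() with no arguments.
args = build_parser().parse_args(["-d", "cancer"])
print(args.dataset)
```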

knn_clf = KNeighborsClassifier()
parameters = {"weights": ["uniform", "distance"], "n_neighbors": [1, 5, 11]}
test_results = runTuneTest(knn_clf, parameters, X, y)

Note that the hyper-parameters match the API for KNeighborsClassifier. In the dictionary, the key is the name of the hyper-parameter and the value is a list of values to try.
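To make the calling convention concrete, here is one possible shape for runTuneTest. This is a sketch under assumptions, not the required implementation: it assumes 5 stratified, shuffled folds with a random_state of 42, tuning with GridSearchCV on each training fold (the inner cv=3 is my choice), and numpy arrays for X and y. Follow the steps given in the assignment wherever they differ.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier


def runTuneTest(learner, parameters, X, y):
    """Tune on each training fold, then score on the held-out fold."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    test_scores = []
    for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
        X_train, y_train = X[train_idx], y[train_idx]
        X_test, y_test = X[test_idx], y[test_idx]

        # Search the hyper-parameter grid using the training fold only.
        clf = GridSearchCV(learner, parameters, cv=3)
        clf.fit(X_train, y_train)

        print(f"Fold {fold}:")
        print(clf.best_params_)
        print("Training Score:", clf.score(X_train, y_train))

        # The held-out fold is touched only once, for the final score.
        test_scores.append(clf.score(X_test, y_test))
    return test_scores


data = load_breast_cancer()
X, y = data["data"], data["target"]
knn_clf = KNeighborsClassifier()
parameters = {"weights": ["uniform", "distance"], "n_neighbors": [1, 5, 11]}
test_results = runTuneTest(knn_clf, parameters, X, y)
print(test_results)
```

Keeping the test fold out of the grid search is the point of the design: the reported accuracy then estimates how the tuned classifier does on data it never saw during tuning.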

Experiment 1: Random Forest vs SVM Generalization Error

Using run_pipeline.py, you will run both Random Forests and SVMs and compare which does better in terms of estimated generalization error.

Coding Details

Your program should read in the dataset using the command line, as discussed above. You should specify your parameters and classifier and call runTuneTest (see the above example), which follows this sequence of steps:

In main(), you should print the test accuracies for all 5 folds for both classifiers (pair up the accuracies for each fold for ease of comparison). The classifiers/hyper-parameters are defined as follows:

Code incrementally, and be sure to examine the results of your tuning (what were the best hyper-parameter settings? what were the scores across each parameter?) to ensure you have the pipeline correct. Since the analysis below is dependent on your results, I cannot provide sample output for this task. However, this is what is generated if I change my classifier to K-Nearest Neighbors using the parameters listed in the previous section (you can try to replicate this using a random_state of 42):

$ python3 run_pipeline.py -d cancer

-------------
KNN
-------------

Fold 1:
{'n_neighbors': 5, 'weights': 'distance'}
Training Score: 1.0

Fold 2:
{'n_neighbors': 11, 'weights': 'uniform'}
Training Score: 0.9317180616740088

Fold 3:
{'n_neighbors': 11, 'weights': 'uniform'}
Training Score: 0.9385964912280702

Fold 4:
{'n_neighbors': 11, 'weights': 'uniform'}
Training Score: 0.9429824561403509

Fold 5:
{'n_neighbors': 5, 'weights': 'uniform'}
Training Score: 0.9473684210526315

Fold, Test Accuracy
0, 0.9217391304347826
1, 0.9391304347826087
2, 0.9380530973451328
3, 0.9203539823008849
4, 0.911504424778761
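For the paired output of two classifiers, something along these lines would work. The helper name and column layout are my own, and the numbers are placeholders chosen to be obviously fake, not results from any classifier.

```python
def format_paired_accuracies(first_name, first_scores, second_name, second_scores):
    """Return one header line plus one line per fold with both accuracies."""
    lines = [f"Fold, {first_name}, {second_name}"]
    for fold, (a, b) in enumerate(zip(first_scores, second_scores)):
        lines.append(f"{fold}, {a:.4f}, {b:.4f}")
    return lines


# Placeholder values only; substitute the lists returned by runTuneTest.
rf_scores = [0.5, 0.75]
svm_scores = [0.25, 1.0]
lines = format_paired_accuracies("RF", rf_scores, "SVM", svm_scores)
for line in lines:
    print(line)
```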

Analysis

In Part 1 of your writeup (must be a PDF), you will analyze your results. At a minimum, your submission should include the following type of analysis:

Experiment 2: Learning Curves

Using generate_curves.py, you will generate learning curves for the above two classifiers. We will vary one of the hyper-parameters and see how the train and test accuracy change.

Coding Requirements

$ python3 generate_curves.py -d cancer
Neighbors,Train Accuracy,Test Accuracy
 1  1.0000  0.9051
 3  0.9561  0.9209
 5  0.9473  0.9227
 7  0.9438  0.9191
 9  0.9429  0.9315
11  0.9385  0.9315
13  0.9385  0.9280
15  0.9376  0.9245
17  0.9367  0.9174
19  0.9323  0.9174
21  0.9306  0.9157
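A table like the one above can be produced with sklearn's validation_curve, which fits the classifier once per hyper-parameter value and per fold. The sketch below assumes that function is an acceptable tool for this experiment and uses cv=3, which is my choice; it is not guaranteed to reproduce the numbers shown.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

data = load_breast_cancer()
X, y = data["data"], data["target"]

neighbors = list(range(1, 22, 2))  # 1, 3, ..., 21

# Each result has one row per hyper-parameter value and one column per fold.
train_scores, test_scores = validation_curve(
    KNeighborsClassifier(), X, y,
    param_name="n_neighbors", param_range=neighbors, cv=3)

# Average over the folds to get one number per hyper-parameter value.
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)

print("Neighbors,Train Accuracy,Test Accuracy")
for k, tr, te in zip(neighbors, train_mean, test_mean):
    print(f"{k:2d}  {tr:.4f}  {te:.4f}")
```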

Analysis

Analyze your results for experiment 2. At a minimum, you should have:

Optional Extensions

For the SVM method, do some investigation into the support vectors (the SVC class in sklearn has some attributes that allow you to see the support vectors). How many support vectors are typical for these datasets? How can you determine the size of the margin? Does there seem to be a relationship between the size of the margin and the quality of the test results?
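As a starting point, the sketch below fits an SVC to a tiny made-up 2D dataset (not one of the assignment datasets) and reads off the support vectors. For a linear kernel the margin width is 2/||w||, where w is the weight vector in coef_; that attribute is only available with kernel="linear", so other kernels need a different approach. The large C value is my choice, to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data invented for illustration.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1000.0)
clf.fit(X, y)

# Number of support vectors per class, and the vectors themselves.
print(clf.n_support_)
print(clf.support_vectors_)

# Margin width for a linear kernel: 2 / ||w||.
w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)
print(margin)
```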

There are many other parameters we did not tune in these methods, and many values we did not consider. Expand your analysis. Some suggestions: entropy vs. Gini for Random Forests, max tree depth for Random Forests, and other kernels for SVMs.
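These suggestions map directly onto parameter dictionaries in the same format used for tuning. The specific values below (tree depths, the set of kernels, the random_state) are illustrative assumptions; check the RandomForestClassifier and SVC documentation for your sklearn version before settling on a grid.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVC

# Random Forest: split criterion and maximum tree depth.
rf_clf = RandomForestClassifier(random_state=42)
rf_parameters = {"criterion": ["gini", "entropy"], "max_depth": [None, 5, 10]}

# SVM: alternative kernels.
svm_clf = SVC()
svm_parameters = {"kernel": ["linear", "poly", "rbf", "sigmoid"]}

# ParameterGrid shows how many combinations a grid search would try.
print(len(ParameterGrid(rf_parameters)))   # 6
print(len(ParameterGrid(svm_parameters)))  # 4
```

Grid sizes multiply quickly, so it is worth counting the combinations before running on the larger datasets.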

Submitting

Submit all files to HW07 on Gradescope.

Acknowledgements

Modified from assignments by: Sara Mathieson, Ameet Soni.