Homework 2: Decision Trees

Due: 2024-09-22 23:59:00 EDT

Overview

The goals of this week’s assignment are:

The main datasets we will be investigating are the heart disease and diabetes datasets (more info at the end of the assignment). This assignment is to be done individually. Discussing high-level ideas with others in the class is encouraged, but you should not be sharing code in any way.

Provided Code

Implement a Decision Tree

You will implement an ID3-like decision tree in this part of the homework. A few steps have been provided in the starter code. First investigate these steps and make sure the code is clear to you:

The next steps you will implement:

Usage

Your program should take in 2-3 command-line arguments:

  1. Path to the training dataset (-r)
  2. Path to the testing dataset (-e)
  3. (optional) Maximum depth of the tree (-d); if not given, the depth is unlimited

For example, the following command trains a decision tree of max depth 1 on the movies training set and tests it on the movies test set:

python run_dtree.py -r input/movies_train.arff -e input/movies_test.arff -d 1

Investigate the function parse_args in util.py, where the arguments are parsed. This function makes use of the argparse library, which we will be using throughout the semester to manage command-line arguments.
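
For reference, here is a minimal argparse sketch in the same spirit; the real parse_args in util.py may differ in structure, help text, and defaults.

# Minimal argparse sketch; the actual parse_args in util.py may differ.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Train and test a decision tree")
    parser.add_argument("-r", required=True, help="path to the training dataset (ARFF)")
    parser.add_argument("-e", required=True, help="path to the testing dataset (ARFF)")
    parser.add_argument("-d", type=int, default=None,
                        help="maximum depth of the tree (default: unlimited)")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(args.r, args.e, args.d)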

Program Style Requirements

Program Inputs

Program Main Behavior

Your program should process inputs, parse and index training and test examples, and induce a decision tree. There are very few restrictions on how you do this, but be sure your tree fits the behavior from class:

TENNIS

Info Gain:
     Outlook, 0.246750
 Temperature, 0.029223
    Humidity, 0.151836
        Wind, 0.048127

MOVIES

Info Gain:
         Type, 0.306099
       Length, 0.306099
     Director, 0.557728
Famous_actors, 0.072780
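
If you want to check numbers like these by hand, entropy and information gain follow directly from the label counts. The sketch below is illustrative, not the required API; the helper names entropy and info_gain are placeholders for whatever you write in Partition.py.

# Illustrative sketch of entropy and information gain from label counts.
# Function names here are hypothetical, not the required API.
from math import log2

def entropy(counts):
    """Entropy of a label distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, child_counts_list):
    """Parent entropy minus the weighted average entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child)
                   for child in child_counts_list)
    return entropy(parent_counts) - weighted

# Tennis root: [5, 9]; splitting on Outlook gives Sunny [3, 2],
# Overcast [0, 4], and Rain [2, 3].
print(f"{info_gain([5, 9], [[3, 2], [0, 4], [2, 3]]):.6f}")  # 0.246750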

Unit tests

dt_test.py contains unit tests that you should use to check your entropy functions. You should include helper methods in Partition.py for computing entropy. Some (but not all) of these helper methods are listed in dt_test.py.
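
As a concrete illustration, a test for an entropy helper might look like the sketch below; the helper name and expected value here are assumptions, and the tests in dt_test.py are the authoritative ones.

# Hypothetical example of an entropy unit test; see dt_test.py for the
# actual tests and the exact helper names they expect.
import unittest
from math import log2

def entropy(counts):  # same illustrative helper as the earlier sketch
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

class TestEntropy(unittest.TestCase):
    def test_tennis_root_entropy(self):
        # 9 positive and 5 negative examples at the tennis root
        self.assertAlmostEqual(entropy([5, 9]), 0.940286, places=6)

if __name__ == "__main__":
    unittest.main()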

Program output

When run, your program should print out the following:

For example, here is the result for the tennis dataset from Handout 3 (sort the child branch labels so that your ordering is consistent). At the root, there are 9 days on which tennis is played and 5 days on which it is not.

TENNIS, no max depth

$ python run_dtree.py -r input/tennis_train.arff -e input/tennis_test.arff
[5, 9]
Outlook=Overcast [0, 4]: 1
Outlook=Rain [2, 3]
|   Wind=Strong [2, 0]: -1
|   Wind=Weak [0, 3]: 1
Outlook=Sunny [3, 2]
|   Humidity=High [3, 0]: -1
|   Humidity=Normal [0, 2]: 1

14 out of 14 correct
accuracy = 1.0000

TENNIS, max depth = 1

$ python run_dtree.py -r input/tennis_train.arff -e input/tennis_test.arff -d 1
[5, 9]
Outlook=Overcast [0, 4]: 1
Outlook=Rain [2, 3]: 1
Outlook=Sunny [3, 2]: -1

10 out of 14 correct
accuracy = 0.7143

MOVIES, no max depth

$ python run_dtree.py -r input/movies_train.arff -e input/movies_test.arff
[3, 6]
Director=Adamson [0, 3]: 1
Director=Lasseter [3, 1]
|   Type=Animated [2, 0]: -1
|   Type=Comedy [1, 0]: -1
|   Type=Drama [0, 1]: 1
Director=Singer [0, 2]: 1

9 out of 9 correct
accuracy = 1.0000

MOVIES, max depth = 1

$ python run_dtree.py -r input/movies_train.arff -e input/movies_test.arff -d 1
[3, 6]
Director=Adamson [0, 3]: 1
Director=Lasseter [3, 1]: -1
Director=Singer [0, 2]: 1

8 out of 9 correct
accuracy = 0.8889
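
One way to reproduce this format is a short recursive printer. The Node class below is hypothetical and your tree representation may differ; the key details are the sorted branch labels, the "|   " indentation per level, and the [negative, positive] counts.

# Illustrative recursive printer for the output format above.
# The Node class is hypothetical; adapt to your own tree representation.
class Node:
    def __init__(self, feature=None, counts=None, children=None, label=None):
        self.feature = feature          # feature tested here (None at a leaf)
        self.counts = counts            # [negative, positive] example counts
        self.children = children or {}  # feature value -> child Node
        self.label = label              # predicted label at a leaf (1 or -1)

def print_tree(node, depth=0):
    for value in sorted(node.children):  # sorted for consistent ordering
        child = node.children[value]
        line = "|   " * depth + f"{node.feature}={value} {child.counts}"
        if child.children:               # internal node: recurse
            print(line)
            print_tree(child, depth + 1)
        else:                            # leaf: append the predicted label
            print(line + f": {child.label}")

# usage sketch: print the root counts first, then the branches
# print(root.counts)
# print_tree(root)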

Implementation suggestions

Analysis

Hyperparameter Tuning

Run experiments at various depth levels including, at a minimum, depths of 1, 2, 4, 7, and 10. What happens to the training set accuracy and test set accuracy as the depth varies? Is this what you expected? Discuss your observations in your README.md.
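
A small driver script can automate the sweep. The sketch below shells out to run_dtree.py; the diabetes file names are assumptions based on the naming pattern of the other datasets.

# Hypothetical driver that sweeps over tree depths by re-running the program;
# you could instead call your own training code directly.
import subprocess

for depth in [1, 2, 4, 7, 10]:
    print(f"--- max depth {depth} ---")
    # file names below are assumed to follow the movies/tennis pattern
    subprocess.run(["python", "run_dtree.py",
                    "-r", "input/diabetes_train.arff",
                    "-e", "input/diabetes_test.arff",
                    "-d", str(depth)])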

Plotting

Generate two learning curve graphs (i.e., line graphs) with an accompanying analysis of the graphs. On the x-axis, show the depth of a tree. On the y-axis, show the accuracy of the tree on both the training set and test set. You should have one graph for the diabetes data set and one graph for the heart disease data set. Describe what you can conclude from each graph. Be sure that your graphs are of scientific quality - including captions, axis labels, distinguishable curves, and legible font sizes.

Here is an example of creating a plot using matplotlib, but you are welcome to use a different program for this part.

# import plotting library
import matplotlib.pyplot as plt

# set up sequences of x and y values (make sure they are of the same length)
x = range(10)
y = [val**2 for val in x]

# plot the (x,y) values (considered as pairs), using blue (b) circles (o),
# connected by lines (-)
plt.plot(x,y,'bo-')

# title, axis labels, and legend
plt.title("My Plot")
plt.xlabel("x-axis label")
plt.ylabel("y-axis label")
plt.legend(["line 1"])

# I usually use "show" at first to see the plot, then save it later
plt.show()
#plt.savefig("my_plot.png", format='png')
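
Adapted to this assignment, the same pattern yields one graph with two curves. In the sketch below, the accuracy lists are placeholders to be replaced with your measured results.

# Sketch of a train/test learning curve over depth; the accuracy values
# are placeholders, so substitute your own measured results.
import matplotlib.pyplot as plt

depths = [1, 2, 4, 7, 10]
train_acc = [0.0] * len(depths)  # placeholder: your training accuracies
test_acc = [0.0] * len(depths)   # placeholder: your test accuracies

plt.plot(depths, train_acc, 'bo-')
plt.plot(depths, test_acc, 'rs--')
plt.title("Decision tree accuracy vs. maximum depth (diabetes)")
plt.xlabel("maximum depth")
plt.ylabel("accuracy")
plt.legend(["train", "test"])
plt.savefig("diabetes_curve.png", format='png')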

Extensions (Optional)

Multi-class prediction

Both data sets I have provided have two class labels. Decision trees easily extend to multi-class prediction. Find a relevant data set (e.g., at Kaggle or the UC Irvine Machine Learning Repository) and evaluate your algorithm with multiple discrete values allowed for labels.
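
Little of the tree code needs to change for this: entropy and plurality voting already generalize to any number of classes, as in this sketch (helper names are illustrative).

# Multi-class entropy and plurality label; a sketch, assuming labels are
# stored as a list of discrete values.
from collections import Counter
from math import log2

def entropy_of_labels(labels):
    """Entropy over any number of discrete classes."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def plurality_label(labels):
    """Most common label; used at leaves and for early stops."""
    return Counter(labels).most_common(1)[0][0]

print(entropy_of_labels(["a", "b", "b", "c"]))  # 1.5 for this 3-class example
print(plurality_label(["a", "b", "b", "c"]))    # b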

Min Leaf Size

Another approach to preventing overfitting is setting a minimum number of examples in a leaf. If a split would result in children below the threshold, the recursion stops at the current node and a leaf is created with the plurality label. This is often more flexible than a maximum depth - it allows variable-depth branches while still discouraging overly detailed hypotheses that fit only a few examples. Add minimum leaf size as an optional command-line argument.
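
In the recursion, this amounts to one extra check before accepting a split, as in the sketch below; split_allowed and the surrounding names are hypothetical.

# Sketch of the min-leaf-size stopping rule; names here are hypothetical.
def split_allowed(child_sizes, min_leaf):
    """A split is acceptable only if every child keeps >= min_leaf examples."""
    return all(size >= min_leaf for size in child_sizes)

# in build_tree, before recursing on the best feature (names hypothetical):
#   if not split_allowed([len(child) for child in children], args.min_leaf):
#       return make_leaf(plurality_label(examples))  # stop: plurality leaf
print(split_allowed([3, 2], min_leaf=2))  # True
print(split_allowed([3, 1], min_leaf=2))  # False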

Learning curves for training set size

Machine learning algorithms are often sensitive to the amount of training data - some algorithms thrive when there are large amounts of data but struggle when data is sparse (e.g., deep neural networks), while others plateau in performance even when more data is available. Evaluate your decision tree on the given training data by randomly subsampling it to create smaller training sets (e.g., use 10% of the training data). Choose at least 5 training set sizes and plot them on the x-axis, with accuracy on the y-axis. You'll probably want to run each training set size 3-5 times and average the results. Describe what you see when you compare training and test accuracy as the training set grows.
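
random.sample is one simple way to draw each subsample. In the sketch below, train_examples is a placeholder for your parsed training data, and the fractions are only suggestions.

# Sketch of random subsampling for training-set-size learning curves.
import random

train_examples = list(range(100))        # placeholder for your parsed examples
fractions = [0.1, 0.25, 0.5, 0.75, 1.0]  # at least 5 training set sizes
trials = 5                               # average over repeated random draws

for frac in fractions:
    k = max(1, int(frac * len(train_examples)))
    for _ in range(trials):
        subset = random.sample(train_examples, k)
        # train a tree on subset, evaluate on the test set, record accuracy
    # average the recorded accuracies for this fraction before plotting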

FAQ

How do I know if my code is working?
Run the unit tests provided in dt_test.py. Initially, many of them will fail. If your program is working correctly, all of them will pass. However, the converse is not true: passing all the tests is not sufficient to demonstrate that your code is working. Using dt_test.py is strongly encouraged; it is similar to how your code will be graded on Gradescope.

README.md

In a text file called README.md answer the following questions:

  1. How much time did you spend on the homework?
  2. What did you learn from this homework?
  3. (optional) What did you struggle with during this homework?
  4. (optional) Any other feedback you would like to share.

Submitting

Submit the following files to the assignment called HW02 on Gradescope:

  1. run_dtree.py
  2. Partition.py
  3. DecisionTree.py
  4. README.md

Make sure to name these files exactly as specified here. Otherwise, our autograders might not work and we might have to take points off. Add any other files that you created that are needed for your code to run.

Acknowledgements

Modified from a lab by Sara Mathieson and Ameet Soni.

Both data sets were made available by the UC Irvine Machine Learning Repository. The heart disease data set comes from a Cleveland Clinic study of coronary heart disease: patients were identified as having or not having heart disease, and the feature set is a subset of the original 76 factors studied. The diabetes data set comes from a study of females at least 21 years old, of Pima Indian heritage, aimed at understanding risk factors for diabetes. Both data sets were converted to ARFF format by Mark Craven at the University of Wisconsin.