Machine Learning Project

Project Description

This is an opportunity for you to explore an interesting machine learning problem of your choice. Your project may be based on a real-world data set, or it may be theoretical in nature but grounded in a real problem.

One of the best ways to identify a project topic is to choose a domain that interests you and identify problems in that domain. Let the problem drive your choice of technique, rather than the other way around.

You may complete the project as an individual or with a partner; however, I strongly encourage you to work with a partner on this project.

Your project will include three deliverables (turn in only one copy per team):

A one-page (single spaced) project proposal, due in hard copy on April 4th.
A presentation of your project to the class in the final exam slot during the week of May 1st.
A final project report in the format of a 4-6 page AAAI paper.

This final project is worth 25% of your course grade. The breakdown of that 25% is as follows:

Initial project proposal and meeting - 15%
Presentation - 25%
Final paper and project - 60%

You have only a little over one month to complete the project, so keep the scope small and start early!

Project Proposal (due April 4th)

Read the list of project ideas and potential data sets, and then describe your proposed project in a one-page (single spaced) proposal. This proposal is due in hard copy on the date listed above.

If you are doing a project based on a real-world data set, you are encouraged to use one of the data sets described below, because they have been successfully used for machine learning in the past. If you prefer to use a different data set, I will consider your proposal, but you must have access to this data already and present a clear proposal for what you would do with it.

Your proposal should include the following information:

Project title
Teaming information (if any)
Data set - one sentence description and source
Project idea, including a clear description of the problem and your approach to solving it
A brief description of the steps you will take to complete the project
A list of 1-3 related references that you will read

Each individual or team is required to meet with me for about 20 minutes during the week of April 4th to discuss the project.

Project Presentation (during the scheduled final exam slot; late submissions will not be accepted)

Your project presentation should be 15 minutes long (this is a hard cut-off) with 5 minutes for questions. You should cover all the topics described below for the project report.

Project Report (due Friday, May 6th for seniors and Wednesday, May 11th for non-seniors; late submissions will not be accepted)

Your final project report must be in the format of a 4-6 page AAAI paper. You should use one of the templates available at http://www.aaai.org/Publications/Templates/AuthorKit.zip. The strict 6-page limit includes all references. Your paper should describe your project in sufficient detail, including:

An abstract
An introduction, describing the problem you are solving, the motivation for it, and a brief summary of your approach
A brief survey of related work and background material on your project. This related work must include at least 2 conference or journal papers outside of the class readings.
A description of your technical approach, using proper mathematical notation and formatting
Your experimental methodology, description of your data set, the results you found (formatted as either tables or plots with proper labels and captions), and a discussion of your results. If you did a theoretical project, you should have an expanded technical approach instead of this section.
A brief conclusion
A list of references in properly formatted AAAI style

Project Ideas

You are welcome to use one of these ideas or come up with your own.

Extend an active learning technique (which queries the user for labels) to use other sources of feedback that are richer than binary labels, such as equivalence sets, distribution examples, measures of "typicality" of the instance, or some other idea of your own.
There are multiple ways to combine kernels together to create new kernels (addition, multiplication, etc.). Develop an SVM-based learning algorithm that tries a number of kernels and their combinations in a principled manner to find the optimal separator for a data set.
Write a supervised or semi-supervised algorithm for image segmentation and compare its performance to k-means-based image segmentation on the Berkeley image segmentation data set.
Multi-view learning is typically applied to supervised or semi-supervised classification scenarios. Instead, apply it to unsupervised clustering or constrained clustering.
Write a reinforcement learning agent to play Mario or Tetris using the RL-Glue framework. The framework is available at http://2009.rl-competition.org/software.php#download, and you might be interested in the steps described in http://www.cs.lafayette.edu/~taylorm/cs414/Project1.pdf (note that you only need to implement a single learner for this project).
Use the 20 newsgroups data set and write an algorithm for semi-supervised text classification based on a method besides naive Bayes. You might consult this page, which includes code for the semi-supervised naive Bayes text classifier discussed in class.
Design an algorithm for transfer learning that improves image classification in some categories of the Caltech 256 data set based on transfer from other categories, or object recognition in the MIT objects and scenes data set, or indoor scene recognition. Transfer could also be used to improve image segmentation in the Berkeley image segmentation data set.
Users often have an idea of the classifier they are looking for, even if the data does not directly support it. Design an interactive method for building a model in collaboration with a user. For example, the user may know that particular attributes should appear in the first few splits of a decision tree even though there isn't enough data to support that choice, so the tree could be built interactively with the user. Or the user may know that particular factors are especially important, which could bias the weights learned by logistic regression.
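
To make the kernel-combination idea above more concrete, here is a minimal sketch of one possible starting point. It builds sums and products of base kernels (both operations preserve kernel validity) and ranks the candidates by kernel-target alignment on a toy data set. This is only an illustration under stated assumptions: the function names (linear, rbf, add, mul, gram, alignment, rank_kernels), the toy XOR-style data, and the use of alignment as the selection criterion are all choices made for this example, not a required approach, and no SVM is actually trained here. A real project would plug the chosen kernel into an SVM and evaluate it on held-out data.

```python
import math
from itertools import combinations


def linear(x, z):
    """Linear kernel: the plain dot product."""
    return sum(a * b for a, b in zip(x, z))


def rbf(gamma):
    """Return an RBF (Gaussian) kernel with the given width parameter."""
    def k(x, z):
        return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))
    return k


def add(k1, k2):
    """The sum of two valid kernels is a valid kernel."""
    return lambda x, z: k1(x, z) + k2(x, z)


def mul(k1, k2):
    """The product of two valid kernels is a valid kernel."""
    return lambda x, z: k1(x, z) * k2(x, z)


def gram(kernel, X):
    """Gram matrix of pairwise kernel values, as a list of lists."""
    return [[kernel(x, z) for z in X] for x in X]


def alignment(K, y):
    """Kernel-target alignment between K and y y^T, for labels in {-1, +1}."""
    n = len(y)
    num = sum(K[i][j] * y[i] * y[j] for i in range(n) for j in range(n))
    k_norm = math.sqrt(sum(K[i][j] ** 2 for i in range(n) for j in range(n)))
    return num / (k_norm * n)  # the Frobenius norm of y y^T is n for +/-1 labels


def rank_kernels(base, X, y):
    """Score each base kernel and each pairwise sum and product; best first."""
    candidates = dict(base)
    for (n1, k1), (n2, k2) in combinations(base.items(), 2):
        candidates[f"{n1}+{n2}"] = add(k1, k2)
        candidates[f"{n1}*{n2}"] = mul(k1, k2)
    scores = {name: alignment(gram(k, X), y) for name, k in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy XOR-style data (hypothetical, for illustration only).
X = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
y = [1, 1, -1, -1]
base = {"linear": linear, "rbf": rbf(1.0)}
ranking = rank_kernels(base, X, y)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

On this toy data the linear kernel carries no information about the XOR labels, so its alignment is zero, while the candidates involving the RBF kernel score higher. Alignment is cheap to compute but is only a proxy for classification accuracy, so part of a project on this idea would be deciding how to search the space of combinations and how to validate the final choice.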

Here are some other sources of project ideas and data:

Carlos Guestrin's site for the CMU machine learning class includes further project ideas that will definitely need scaling down if you decide to use them for this class. This site also has a lot of interesting data sets.
Ray Mooney's lists of project ideas: newer list and older list.
Amy McGovern's list of project ideas.
The UC Irvine ML Repository