Clustering is used in pattern recognition to automatically group data based on a similarity metric. It is used the same way in "unsupervised learning": when you do not know in advance how the data is grouped, you define a measure of how different two items are and then use that measure to find groupings.
The data is in the directory:

    /home/gtowell/Public/CS337/Lab09Data

or in a compressed tar file at:

    /home/gtowell/Public/CS337/lab09data.tgz

The directory contains 3 files:

    cluster0.csv  cluster1.csv  cluster2.csv

all of which have data in CSV format. Each line contains numbers separated by commas, and every line has the same number of numbers (and the same number of commas). For example, the cluster0.csv file is:
```
1,2
1,3
10,11
11,12
12,11
10,10
1,4
4,1
6,6
4,4
```

(The other two files contain real numbers rather than integers.) You should interpret the numbers on a line as the coordinates of a point in M-dimensional space (in this case M=2).
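One straightforward way to read such a file into a list of N points, each with M coordinates, is sketched below in Python (the helper name `load_points` is mine, not part of the lab):

```python
def load_points(filename):
    """Read a CSV file in which each line is one M-dimensional point."""
    points = []
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if line:  # skip any blank lines
                points.append([float(x) for x in line.split(",")])
    return points

# For example, on the file shown above:
# load_points("cluster0.csv") -> [[1.0, 2.0], [1.0, 3.0], [10.0, 11.0], ...]
```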
```
Given:
    K            : the number of clusters to create (a small integer)
    Close_Enough : a stopping criterion (a small floating point number)
    DATA         : an MxN array of floats; M is the number of floats in a
                   data item, N is the number of data items
Return:
    A KxM array of floats; that is, K cluster centers (centroids)

Let:
    CM     = a KxM array of floats
    OLD_CM = a KxM array of floats
Initialize OLD_CM to all 0
Initialize CM to K randomly chosen elements from DATA
while true
    if Clus_Distance(CM, OLD_CM) < Close_Enough
        break
    Let: groups be an array of size K of something that can hold
         1xM arrays, initially empty
    For each item in DATA
        Find the citem in CM to which item is closest
        Add item to the group corresponding to that closest citem
    Copy CM into OLD_CM
    For j in 0..(K-1)
        CM[j] = centroid of group[j] (the average of the group)
return CM

Clus_Distance(cOne, cTwo)
    // cOne and cTwo are KxM 2-D arrays
    Let d = 0
    foreach c1, c2 in cOne, cTwo
        d += Distance(c1, c2)
    return d

Distance(item1, item2)
    // item1 and item2 are Mx1 vectors
    return the distance between the two vectors
```

See the output in Appendix A (below) for an example of an instance of this algorithm running on the cluster0 dataset. The 1-Norm run is very verbose and should help with understanding what this algorithm is doing.
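To make the control flow concrete, here is a minimal Python sketch of the pseudocode above. The names mirror the pseudocode; the choice of the 2-norm and the handling of empty groups are my assumptions, since the lab leaves the distance function (and that edge case) open:

```python
import math
import random

def distance(item1, item2):
    """2-norm (Euclidean) distance between two M-dimensional points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(item1, item2)))

def clus_distance(c_one, c_two):
    """Sum of distances between corresponding centers of two KxM arrays."""
    return sum(distance(c1, c2) for c1, c2 in zip(c_one, c_two))

def centroid(group):
    """Average of a non-empty list of M-dimensional points."""
    m = len(group[0])
    return [sum(p[j] for p in group) / len(group) for j in range(m)]

def k_means(data, k, close_enough):
    """Return K cluster centers for data, a list of N M-dimensional points."""
    m = len(data[0])
    old_cm = [[0.0] * m for _ in range(k)]  # OLD_CM initialized to all 0
    cm = random.sample(data, k)             # K randomly chosen data items
    while clus_distance(cm, old_cm) >= close_enough:
        groups = [[] for _ in range(k)]
        for item in data:
            # index of the center in CM to which item is closest
            closest = min(range(k), key=lambda j: distance(item, cm[j]))
            groups[closest].append(item)
        old_cm = cm
        # an empty group has no centroid; keeping its old center is one
        # possible way around that (assumption, not specified by the lab)
        cm = [centroid(g) if g else old_cm[j] for j, g in enumerate(groups)]
    return cm
```

With the hypothetical `load_points` helper from earlier, `k_means(load_points("cluster0.csv"), 2, 0.1)` would mirror the k=2, Close_Enough=0.1 run shown in Appendix A, up to the random initialization.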
There are two tricky parts to K-means.
What to hand in
1. A short (no more than 3 pages) formal report describing your findings. Use the outputs of your programs on the questions above to frame your narrative and conclusions. Do not simply put the answers as bullet points. Do include at least one figure or table in your report.
2. In the Appendix of the report, include a printout of the program that you wrote. Indicate, with comments, how parts of the program were used in your solution.
Appendix A

```
Item Content
ID  Content
--- -------
0   [1 2]
1   [1 3]
2   [10 11]
3   [11 12]
4   [12 11]
5   [10 10]
6   [1 4]
7   [4 1]
8   [6 6]
9   [4 4]

Doing K-Means with k=2
Stopping when old centers differ from new centers by less than 0.100000
Using 1-Norm
Randomly picked items 0 and 1 as initial centers
Iteration: 0
Old Cluster centers    : [[0 0] [0 0]]
Current Cluster centers: [[1 2] [1 3]]
distance from old centers to new centers: 7.00
Data item assignments ('groups' in algorithm): [[0 7] [1 2 3 4 5 6 8 9]]
Iteration: 1
Old Cluster centers    : [[1 2] [1 3]]
Current Cluster centers: [[2.5 1.5] [6.875 7.625]]
distance from old centers to new centers: 12.50
Data item assignments ('groups' in algorithm): [[0 1 6 7 9] [2 3 4 5 8]]
Iteration: 2
Old Cluster centers    : [[2.5 1.5] [6.875 7.625]]
Current Cluster centers: [[2.2 2.8] [9.8 10]]
distance from old centers to new centers: 6.90
Data item assignments ('groups' in algorithm): [[0 1 6 7 8 9] [2 3 4 5]]
Iteration: 3
Old Cluster centers    : [[2.2 2.8] [9.8 10]]
Current Cluster centers: [[2.8333333333333335 3.3333333333333335] [10.75 11]]
distance from old centers to new centers: 3.12
Data item assignments ('groups' in algorithm): [[0 1 6 7 8 9] [2 3 4 5]]
STOPPING
distance from old centers to new centers: 0.00
Final cluster centers: [[2.8333333333333335 3.3333333333333335] [10.75 11]]

// Same algorithm, just less output
Using 2-Norm:
Iteration: 0
distance from old centers: 5.40
Current Cluster centers: [[1 2] [1 3]]
Data item assignments: [[0 7] [1 2 3 4 5 6 8 9]]
Iteration: 1
distance from old centers: 9.06
Current Cluster centers: [[2.5 1.5] [6.875 7.625]]
Data item assignments: [[0 1 6 7 9] [2 3 4 5 8]]
Iteration: 2
distance from old centers: 5.10
Current Cluster centers: [[2.2 2.8] [9.8 10]]
Data item assignments: [[0 1 6 7 8 9] [2 3 4 5]]
Iteration: 3
distance from old centers: 2.21
Current Cluster centers: [[2.8333333333333335 3.3333333333333335] [10.75 11]]
Data item assignments: [[0 1 6 7 8 9] [2 3 4 5]]
Final cluster centers: [[2.8333333333333335 3.3333333333333335] [10.75 11]]

Using ∞ Norm
Iteration: 0
distance from old centers: 5.00
Current Cluster centers: [[1 2] [1 3]]
Data item assignments: [[0 2 3 4 5 7 8 9] [1 6]]
Iteration: 1
distance from old centers: 6.75
Current Cluster centers: [[7.25 7.125] [1 3.5]]
Data item assignments: [[2 3 4 5 8] [0 1 6 7 9]]
Iteration: 2
distance from old centers: 4.08
Current Cluster centers: [[9.8 10] [2.2 2.8]]
Data item assignments: [[2 3 4 5] [0 1 6 7 8 9]]
Iteration: 3
distance from old centers: 1.63
Current Cluster centers: [[10.75 11] [2.8333333333333335 3.3333333333333335]]
Data item assignments: [[2 3 4 5] [0 1 6 7 8 9]]
Final cluster centers: [[10.75 11] [2.8333333333333335 3.3333333333333335]]
```
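As a sanity check, the first reported distance in each run above can be reproduced from the initial centers (all zeros) and the first pair of centers [[1 2] [1 3]]. A minimal sketch in Python (the function names are mine, not from the lab):

```python
import math

old = [[0, 0], [0, 0]]  # initial OLD_CM (all zeros)
new = [[1, 2], [1, 3]]  # CM after random initialization to items 0 and 1

def one_norm(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def two_norm(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inf_norm(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def clus_distance(c_one, c_two, norm):
    # sum of distances between corresponding centers, per the pseudocode
    return sum(norm(c1, c2) for c1, c2 in zip(c_one, c_two))

print(clus_distance(old, new, one_norm))  # 7.0    (1-Norm run: 7.00)
print(clus_distance(old, new, two_norm))  # 5.398… (2-Norm run: 5.40)
print(clus_distance(old, new, inf_norm))  # 5.0    (∞ Norm run: 5.00)
```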
| Topic | Point Value |
| --- | --- |
| Introduction/Conclusion | 10 |
| K-means textual description | 10 |
| Answers to Questions (6 questions, 8 points each) | 48 |
| Figures | 10 |
| Appendices | 22 |