Clustering is used in pattern recognition to automatically group data based on a similarity metric. Clustering is also used in "unsupervised learning" in a similar fashion: given data whose groupings are unknown, you define how to measure how different two items are, then use that measure to find the groupings.
A simple but effective clustering algorithm is known as K-means. At a high level, the K-means algorithm operates by identifying K "centers", collecting all of the data closest to each of the K centers, and then moving each center to the centroid of the data close to it. Repeat with the new centers until the centers stop moving. In more detail, the algorithm is as follows:

    Given:
        K            : the number of clusters to create (a small integer)
        Close_Enough : a stopping criterion (a small floating-point number)
        DATA         : an MxN array of floats; M is the number of floats in a
                       data item, N is the number of data items
    Return:
        A KxM array of floats; that is, K cluster centers (centroids)

    Let:
        CM     = a KxM array of floats
        OLD_CM = a KxM array of floats
    Initialize OLD_CM to all 0
    Initialize CM to K randomly chosen elements from DATA
    while true
        if Clus_Distance(CM, OLD_CM) < Close_Enough
            break
        Let groups be an array of size K of something that can hold
            1xM arrays, initially empty
        For each item in DATA
            Find the citem in CM to which item is closest
            Add item to the group corresponding to that closest citem
        Copy CM into OLD_CM
        For j in 0..(K-1)
            CM[j] = centroid of group[j] (the average of the group)
    return CM

    Clus_Distance(cOne, cTwo)    // cOne and cTwo are KxM 2-D arrays
        Let d = 0
        foreach c1, c2 in cOne, cTwo
            d += Distance(c1, c2)
        return d

    Distance(item1, item2)       // item1 and item2 are Mx1 vectors
        return the distance between the two vectors

See the output in Appendix A (below) for an example of an instance of this algorithm running on the cluster0 dataset. The 1-Norm run is very verbose and should help with understanding what this algorithm is doing.
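The pseudocode above translates fairly directly into Python with NumPy. The sketch below is one possible translation, not supplied code: it stores data items as rows rather than columns (more idiomatic in NumPy), takes the norm as a parameter, and the function name `k_means` and its arguments are my own choices.

```python
import numpy as np

def k_means(data, k, close_enough=0.1, norm_ord=2, seed=None):
    """Cluster the rows of `data` (N items x M features) into k groups.

    norm_ord selects Distance: 1 -> 1-norm, 2 -> 2-norm, np.inf -> inf-norm.
    """
    rng = np.random.default_rng(seed)
    # Initialize CM to K randomly chosen elements from DATA.
    cm = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    old_cm = np.zeros_like(cm)  # Initialize OLD_CM to all 0.
    # Clus_Distance: sum of per-center distances under the chosen norm.
    while np.linalg.norm(cm - old_cm, ord=norm_ord, axis=1).sum() >= close_enough:
        # Assign every item to its closest center (the 'groups' step).
        dists = np.linalg.norm(data[:, None, :] - cm[None, :, :],
                               ord=norm_ord, axis=2)
        labels = dists.argmin(axis=1)
        old_cm = cm.copy()
        # Move each center to the centroid (mean) of its group.
        for j in range(k):
            if np.any(labels == j):  # leave a center put if its group is empty
                cm[j] = data[labels == j].mean(axis=0)
    return cm
```

Note the empty-group guard: the pseudocode does not say what to do when no item is closest to some center, and leaving that center where it is avoids taking the mean of nothing. With k=2 on the cluster0 data this typically converges to the same two centroids shown in Appendix A, though a different random initialization can land in a different local optimum.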
There are two tricky parts to K-means.
What to hand in
1. A short formal report describing your findings. Use the outputs of your programs on the questions above to frame your narrative and conclusions.
2. In the Appendix of the report, include a printout of the program that you wrote. Indicate (possibly with one or more comment lines) how parts of the program were used in your solution.
Appendix A

    ID  Content
    --  -------
    0   [1 2]
    1   [1 3]
    2   [10 11]
    3   [11 12]
    4   [12 11]
    5   [10 10]
    6   [1 4]
    7   [4 1]
    8   [6 6]
    9   [4 4]

    Doing K-Means with k=2
    Stopping when old centers differ from new centers by less than 0.100000
    Using 1-Norm
    Randomly picked items 0 and 1 as initial centers

    Iteration: 0
    Old Cluster centers    : [[0 0] [0 0]]
    Current Cluster centers: [[1 2] [1 3]]
    distance from old centers to new centers: 7.00
    Data item assignments ('groups' in algorithm): [[0 7] [1 2 3 4 5 6 8 9]]

    Iteration: 1
    Old Cluster centers    : [[1 2] [1 3]]
    Current Cluster centers: [[2.5 1.5] [6.875 7.625]]
    distance from old centers to new centers: 12.50
    Data item assignments ('groups' in algorithm): [[0 1 6 7 9] [2 3 4 5 8]]

    Iteration: 2
    Old Cluster centers    : [[2.5 1.5] [6.875 7.625]]
    Current Cluster centers: [[2.2 2.8] [9.8 10]]
    distance from old centers to new centers: 6.90
    Data item assignments ('groups' in algorithm): [[0 1 6 7 8 9] [2 3 4 5]]

    Iteration: 3
    Old Cluster centers    : [[2.2 2.8] [9.8 10]]
    Current Cluster centers: [[2.8333333333333335 3.3333333333333335] [10.75 11]]
    distance from old centers to new centers: 3.12
    Data item assignments ('groups' in algorithm): [[0 1 6 7 8 9] [2 3 4 5]]

    STOPPING
    distance from old centers to new centers: 0.00
    Final cluster centers: [[2.8333333333333335 3.3333333333333335] [10.75 11]]

    // Same algorithm, just less output
    Using 2-Norm:
    Iteration: 0  distance from old centers: 5.40
      Current Cluster centers: [[1 2] [1 3]]
      Data item assignments: [[0 7] [1 2 3 4 5 6 8 9]]
    Iteration: 1  distance from old centers: 9.06
      Current Cluster centers: [[2.5 1.5] [6.875 7.625]]
      Data item assignments: [[0 1 6 7 9] [2 3 4 5 8]]
    Iteration: 2  distance from old centers: 5.10
      Current Cluster centers: [[2.2 2.8] [9.8 10]]
      Data item assignments: [[0 1 6 7 8 9] [2 3 4 5]]
    Iteration: 3  distance from old centers: 2.21
      Current Cluster centers: [[2.8333333333333335 3.3333333333333335] [10.75 11]]
      Data item assignments: [[0 1 6 7 8 9] [2 3 4 5]]
    Final cluster centers: [[2.8333333333333335 3.3333333333333335] [10.75 11]]

    Using ∞-Norm:
    Iteration: 0  distance from old centers: 5.00
      Current Cluster centers: [[1 2] [1 3]]
      Data item assignments: [[0 2 3 4 5 7 8 9] [1 6]]
    Iteration: 1  distance from old centers: 6.75
      Current Cluster centers: [[7.25 7.125] [1 3.5]]
      Data item assignments: [[2 3 4 5 8] [0 1 6 7 9]]
    Iteration: 2  distance from old centers: 4.08
      Current Cluster centers: [[9.8 10] [2.2 2.8]]
      Data item assignments: [[2 3 4 5] [0 1 6 7 8 9]]
    Iteration: 3  distance from old centers: 1.63
      Current Cluster centers: [[10.75 11] [2.8333333333333335 3.3333333333333335]]
      Data item assignments: [[2 3 4 5] [0 1 6 7 8 9]]
    Final cluster centers: [[10.75 11] [2.8333333333333335 3.3333333333333335]]
| Topic | Point Value |
|---|---|
| Introduction/Conclusion | 10 |
| K-means textual description | 10 |
| Answers to Questions (6 questions, 8 points each) | 48 |
| Figures | 10 |
| Appendices | 22 |