Clustering is used in pattern recognition to automatically group data based on a similarity metric. It is used the same way in "unsupervised learning": when you do not know in advance how the data is grouped, you define a measure of how different two items are and then use that measure to find groupings.
The data is in the directory:

    /home/gtowell/Public/CS337/Lab09Data

or in a compressed tar file at:

    /home/gtowell/Public/CS337/lab09data.tgz

The directory contains 3 files:

    cluster0.csv  cluster1.csv  cluster2.csv

all of which have data in CSV format. Each line contains numbers separated by commas, and every line has the same number of numbers (and the same number of commas). For example, the cluster0.csv file is:
```
1,2
1,3
10,11
11,12
12,11
10,10
1,4
4,1
6,6
4,4
```

(The other two files contain real numbers rather than integers.) You should interpret the numbers on a line as the coordinates of a point in M-dimensional space (in this case M=2).
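One straightforward way to read such a file into a list of N points, each with M coordinates, is sketched below in Python (the helper name `load_points` is mine, not part of the lab):

```python
def load_points(filename):
    """Read a CSV file in which each line is one M-dimensional point."""
    points = []
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if line:  # skip any blank lines
                points.append([float(x) for x in line.split(",")])
    return points

# For example, on the file shown above:
# load_points("cluster0.csv") -> [[1.0, 2.0], [1.0, 3.0], [10.0, 11.0], ...]
```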
```
Given:
    K            : the number of clusters to create (a small integer)
    Close_Enough : a stopping criterion (a small floating point number)
    DATA         : an MxN array of floats; M is the number of floats in a
                   data item, N is the number of data items
Return:
    A KxM array of floats; that is, K cluster centers (centroids)

Let:
    CM     = a KxM array of floats
    OLD_CM = a KxM array of floats
Initialize OLD_CM to all 0
Initialize CM to K randomly chosen elements from DATA
while true
    if Clus_Distance(CM, OLD_CM) < Close_Enough
        break
    Let: groups be an array of size K of something that can hold
         1xM arrays, initially empty
    For each item in DATA
        Find the citem in CM to which item is closest
        Add item to the group corresponding to that closest citem
    Copy CM into OLD_CM
    For j in 0..(K-1)
        CM[j] = centroid of group[j] (the average of the group)
return CM

Clus_Distance(cOne, cTwo)
    // cOne and cTwo are KxM 2-D arrays
    Let d = 0
    foreach c1, c2 in cOne, cTwo
        d += Distance(c1, c2)
    return d

Distance(item1, item2)
    // item1 and item2 are Mx1 vectors
    return the distance between the two vectors
```

See the output in Appendix A (below) for an example of an instance of this algorithm running on the cluster0 dataset. The 1-Norm run is very verbose and should help with understanding what this algorithm is doing.
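To make the control flow concrete, here is a minimal Python sketch of the pseudocode above. The names mirror the pseudocode; the choice of the 2-norm and the handling of empty groups are my assumptions, since the lab leaves the distance function (and that edge case) open:

```python
import math
import random

def distance(item1, item2):
    """2-norm (Euclidean) distance between two M-dimensional points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(item1, item2)))

def clus_distance(c_one, c_two):
    """Sum of distances between corresponding centers of two KxM arrays."""
    return sum(distance(c1, c2) for c1, c2 in zip(c_one, c_two))

def centroid(group):
    """Average of a non-empty list of M-dimensional points."""
    m = len(group[0])
    return [sum(p[j] for p in group) / len(group) for j in range(m)]

def k_means(data, k, close_enough):
    """Return K cluster centers for data, a list of N M-dimensional points."""
    m = len(data[0])
    old_cm = [[0.0] * m for _ in range(k)]  # OLD_CM initialized to all 0
    cm = random.sample(data, k)             # K randomly chosen data items
    while clus_distance(cm, old_cm) >= close_enough:
        groups = [[] for _ in range(k)]
        for item in data:
            # index of the center in CM to which item is closest
            closest = min(range(k), key=lambda j: distance(item, cm[j]))
            groups[closest].append(item)
        old_cm = cm
        # an empty group has no centroid; keeping its old center is one
        # possible way around that (assumption, not specified by the lab)
        cm = [centroid(g) if g else old_cm[j] for j, g in enumerate(groups)]
    return cm
```

With the hypothetical `load_points` helper from earlier, `k_means(load_points("cluster0.csv"), 2, 0.1)` would mirror the k=2, Close_Enough=0.1 run shown in Appendix A, up to the random initialization.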
There are two tricky parts to K-means.
What to hand in
1. A short (no more than 3 pages) formal report describing your findings. Use the outputs of your programs on the questions above to frame your narrative and conclusions. Do not simply put the answers as bullet points. Do include at least one figure or table in your report.
2. In the Appendix of the report, include a printout of the program that you wrote. Indicate, with comments, how parts of the program were used in your solution.
Appendix A

```
Item Content
ID  Content
--- -------
0   [1 2]
1   [1 3]
2   [10 11]
3   [11 12]
4   [12 11]
5   [10 10]
6   [1 4]
7   [4 1]
8   [6 6]
9   [4 4]

Doing K-Means with k=2
Stopping when old centers differ from new centers by less than 0.100000
Using 1-Norm
Randomly picked items 0 and 1 as initial centers
Iteration: 0
Old Cluster centers    : [[0 0] [0 0]]
Current Cluster centers: [[1 2] [1 3]]
distance from old centers to new centers: 7.00
Data item assignments ('groups' in algorithm): [[0 7] [1 2 3 4 5 6 8 9]]
Iteration: 1
Old Cluster centers    : [[1 2] [1 3]]
Current Cluster centers: [[2.5 1.5] [6.875 7.625]]
distance from old centers to new centers: 12.50
Data item assignments ('groups' in algorithm): [[0 1 6 7 9] [2 3 4 5 8]]
Iteration: 2
Old Cluster centers    : [[2.5 1.5] [6.875 7.625]]
Current Cluster centers: [[2.2 2.8] [9.8 10]]
distance from old centers to new centers: 6.90
Data item assignments ('groups' in algorithm): [[0 1 6 7 8 9] [2 3 4 5]]
Iteration: 3
Old Cluster centers    : [[2.2 2.8] [9.8 10]]
Current Cluster centers: [[2.8333333333333335 3.3333333333333335] [10.75 11]]
distance from old centers to new centers: 3.12
Data item assignments ('groups' in algorithm): [[0 1 6 7 8 9] [2 3 4 5]]
STOPPING
distance from old centers to new centers: 0.00
Final cluster centers: [[2.8333333333333335 3.3333333333333335] [10.75 11]]

// Same algorithm, just less output
Using 2-Norm:
Iteration: 0
distance from old centers: 5.40
Current Cluster centers: [[1 2] [1 3]]
Data item assignments: [[0 7] [1 2 3 4 5 6 8 9]]
Iteration: 1
distance from old centers: 9.06
Current Cluster centers: [[2.5 1.5] [6.875 7.625]]
Data item assignments: [[0 1 6 7 9] [2 3 4 5 8]]
Iteration: 2
distance from old centers: 5.10
Current Cluster centers: [[2.2 2.8] [9.8 10]]
Data item assignments: [[0 1 6 7 8 9] [2 3 4 5]]
Iteration: 3
distance from old centers: 2.21
Current Cluster centers: [[2.8333333333333335 3.3333333333333335] [10.75 11]]
Data item assignments: [[0 1 6 7 8 9] [2 3 4 5]]
Final cluster centers: [[2.8333333333333335 3.3333333333333335] [10.75 11]]

Using ∞ Norm
Iteration: 0
distance from old centers: 5.00
Current Cluster centers: [[1 2] [1 3]]
Data item assignments: [[0 2 3 4 5 7 8 9] [1 6]]
Iteration: 1
distance from old centers: 6.75
Current Cluster centers: [[7.25 7.125] [1 3.5]]
Data item assignments: [[2 3 4 5 8] [0 1 6 7 9]]
Iteration: 2
distance from old centers: 4.08
Current Cluster centers: [[9.8 10] [2.2 2.8]]
Data item assignments: [[2 3 4 5] [0 1 6 7 8 9]]
Iteration: 3
distance from old centers: 1.63
Current Cluster centers: [[10.75 11] [2.8333333333333335 3.3333333333333335]]
Data item assignments: [[2 3 4 5] [0 1 6 7 8 9]]
Final cluster centers: [[10.75 11] [2.8333333333333335 3.3333333333333335]]
```
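As a sanity check, the first reported distance in each run above can be reproduced from the initial centers (all zeros) and the first pair of centers [[1 2] [1 3]]. A minimal sketch in Python (the function names are mine, not from the lab):

```python
import math

old = [[0, 0], [0, 0]]  # initial OLD_CM (all zeros)
new = [[1, 2], [1, 3]]  # CM after random initialization to items 0 and 1

def one_norm(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def two_norm(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inf_norm(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def clus_distance(c_one, c_two, norm):
    # sum of distances between corresponding centers, per the pseudocode
    return sum(norm(c1, c2) for c1, c2 in zip(c_one, c_two))

print(clus_distance(old, new, one_norm))  # 7.0    (1-Norm run: 7.00)
print(clus_distance(old, new, two_norm))  # 5.398… (2-Norm run: 5.40)
print(clus_distance(old, new, inf_norm))  # 5.0    (∞ Norm run: 5.00)
```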
| Topic | Point Value |
| --- | --- |
| Introduction/Conclusion | 10 |
| K-means textual description | 10 |
| Answers to Questions (6 questions, 8 points each) | 48 |
| Figures | 10 |
| Appendices | 22 |