The data points can be in as many dimensions as we want. For example, maybe we have measurements on a bunch of penguins. We could use just their heights as data values, and cluster them in one dimension to look for groupings. Or we could plot their heights vs. their beak lengths, and look for clusters in 2D. If we knew their wing lengths, we could plot points in 3D where each point represents a single penguin and its coordinates correspond to its height, beak length, and wing length.
Your solutions should be flexible enough to work with any number of dimensions. We will consult the user for the desired number of clusters, and also prompt them for the initial positions of the cluster centers. (This simplifies the assignment since we won't need to figure out how to generate random coordinates, and it also gives us more control during testing.) In the interests of simplifying the assignment, we'll also hard-code in the data values to be clustered rather than reading them from a file. Haskell can do file I/O like any respectable language, but we don't have the time to learn the details.
    -- Data type to hold information about a labeled data point. The constructor
    -- takes a label and a list of Doubles that is the point's location. It's
    -- general enough that the label can be of any type, and the list of Doubles
    -- can accommodate any number of dimensions in the data.
    data LabeledPoint a = Point a [Double] deriving (Ord, Show)

When checking to see if the centers have stopped moving, it would be nice to have a definition of == that checks whether two points are close to the same location without being identical. (Checking doubles with == is a bad idea in any language.) Therefore, instead of deriving the default implementation of ==, we'll define our own version for a change:

    -- Here we define our own version of == rather than asking for its default
    -- implementation. Points are == if their coordinates are within .0001 across
    -- all dimensions.
    instance Eq (LabeledPoint a) where
      Point _ ns == Point _ ms = and (map (\(n,m)->abs(n-m)<0.0001) (zip ns ms))

Make sure you understand how that function works, since similar techniques might be useful elsewhere in your implementation: The zip call pairs up the two points' coordinates. For example, if we had a point at [5,10,7] and one at [5,9,7], zipping the two lists would give [(5,5),(10,9),(7,7)]. The map turns that list of tuples into a list of booleans, [True, False, True] in this case. The and call returns true if all of the boolean values in the list are True. In this case our two hypothetical points would not be equal because their second dimensions are too far apart, but if they were within 0.0001 of each other they would be considered equal. Note that because == is defined on LabeledPoint, lists of points can be compared via == and it will return true if the points in the lists satisfy our definition of equality.
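To make the zip, map, and and steps concrete, the sketch below evaluates each stage separately on the two example points from the text. The names ps, qs, paired, closeFlags, allClose, and closeEnough are illustrative only and are not part of the starter code.

```haskell
-- Coordinates of the two example points from the text.
ps, qs :: [Double]
ps = [5, 10, 7]
qs = [5, 9, 7]

-- Step 1: pair up corresponding coordinates.
paired :: [(Double, Double)]
paired = zip ps qs

-- Step 2: test each pair against the 0.0001 tolerance.
closeFlags :: [Bool]
closeFlags = map (\(n, m) -> abs (n - m) < 0.0001) paired

-- Step 3: the points count as equal only if every flag is True.
allClose :: Bool
allClose = and closeFlags

-- The same check in one step; zipWith fuses the zip and the map.
-- (Hypothetical helper, not taken from the starter code.)
closeEnough :: [Double] -> [Double] -> Bool
closeEnough ns ms = and (zipWith (\n m -> abs (n - m) < 0.0001) ns ms)
```

One thing worth knowing about this technique: zip stops at the end of the shorter list, so two points with different numbers of coordinates are compared only on the dimensions they share. For instance, closeEnough [1] [1,2] evaluates to True.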
The starter code also contains a function that takes a list of data points and a list of centers, and prints them to the screen in a format that can be directly pasted into a Google Sheet and charted (as a scatter plot) if you want to see your clusters visually. Even if you don't end up using the function, it would be worth studying to see how it works. You might be able to borrow some techniques from it for use in your own code.
Make use of map, filter, and foldr1; they will simplify your programming tasks and make the resulting definitions shorter and easier to read. If you choose a different organization for your implementation, keep as many of the functions on the "pure side" as possible. The only two functions below that require side effects are the readCenters function, which reads the initial cluster center points from the user, and the main function, which does some additional input and, after clustering, prints the results.
distance: Takes two LabeledPoint values and returns the square of the distance between them. This will be used to determine the distance between a point and a cluster center, when picking the closest center. You could calculate the actual distance rather than the square of the distance, but all we really care about is which center is closest, and the squared distance will work just fine for that without requiring us to use sqrt, which is computationally expensive. This function should work no matter how many dimensions the points have, though you can assume they all have the same number of dimensions. Here are some sample interactions:

    > distance (Point "foo" [5]) (Point "bar" [7])
    4.0
    > distance (Point "us" [0,0]) (Point "them" [10,-10])
    200.0
    > distance (Point 1 [2,5,3]) (Point 2 [4,8,7])
    29.0
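One possible shape for distance, offered as a sketch rather than the reference solution: zipWith pairs the coordinates and squares each difference in one pass, and sum adds the results. The type signature is an assumption; it lets the two points carry labels of different types, which the description does not require.

```haskell
-- Repeated from the starter code so the sketch compiles on its own
-- (the deriving clause is reduced to Show to keep it short).
data LabeledPoint a = Point a [Double] deriving (Show)

-- Squared Euclidean distance; the labels are ignored.
distance :: LabeledPoint a -> LabeledPoint b -> Double
distance (Point _ ns) (Point _ ms) =
  sum (zipWith (\n m -> (n - m) * (n - m)) ns ms)
```

Because zipWith works on lists of any length, the same definition covers 1D, 2D, and higher-dimensional points.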
nearest: Takes a point and a list of points (cluster centers) and returns the label of the nearest point in the list. (The previous function will be helpful here.) You may assume the list contains at least one point. The interactions below only show 2D points, but the function should work with any number of dimensions. You may assume that all points have the same number of dimensions.

    > nearest (Point "p" [10,10]) [Point "a" [0,2],Point "b" [12,15],Point "c" [-3,5]]
    "b"
    > nearest (Point "p" [10,10]) [Point "a" [0,2]]
    "a"
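A sketch of nearest built on foldr1, under the assumption that distance is defined as the squared distance described earlier. The helper names closer and labelOf are made up for this example. The strict < comparison means that when two centers are equally close, the one later in the list is kept.

```haskell
-- Repeated from the starter code so the sketch compiles on its own.
data LabeledPoint a = Point a [Double] deriving (Show)

-- Squared distance between two points (one possible definition).
distance :: LabeledPoint a -> LabeledPoint b -> Double
distance (Point _ ns) (Point _ ms) =
  sum (zipWith (\n m -> (n - m) * (n - m)) ns ms)

-- Fold the list of centers down to the single closest one, then
-- return its label. foldr1 requires a non-empty list, which the
-- assignment allows us to assume.
nearest :: LabeledPoint a -> [LabeledPoint b] -> b
nearest p centers = labelOf (foldr1 closer centers)
  where
    closer c best = if distance p c < distance p best then c else best
    labelOf (Point l _) = l
```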
relabel: Takes a list of data points and a list of cluster centers and reassigns the points to their nearest cluster centers. More specifically, it returns a list of points in the same order as the original data points but potentially with different labels (the label of the nearest cluster center). The interactions below only show 1D, but your function should work with any number of dimensions. Here again you may assume all of the points have the same number of dimensions. The interactions below show a list of points being relabeled against a variety of different centers.

    > let points = [Point "x" [-2],Point "x" [1],Point "x" [15],Point "x" [5],Point "x" [17]]
    > relabel points [Point "left" [0],Point "right" [12]]
    [Point "left" [-2.0],Point "left" [1.0],Point "right" [15.0],Point "left" [5.0],Point "right" [17.0]]
    > relabel points [Point "right" [12],Point "left" [0]]
    [Point "left" [-2.0],Point "left" [1.0],Point "right" [15.0],Point "left" [5.0],Point "right" [17.0]]
    > relabel points [Point "left" [0],Point "right" [8]]
    [Point "left" [-2.0],Point "left" [1.0],Point "right" [15.0],Point "right" [5.0],Point "right" [17.0]]
    > relabel points [Point "left" [0],Point "right" [0]]
    [Point "right" [-2.0],Point "right" [1.0],Point "right" [15.0],Point "right" [5.0],Point "right" [17.0]]
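Relabeling is a natural fit for map: each point keeps its coordinates and takes the label that nearest reports. The sketch below assumes the distance and nearest definitions shown inside it, which are one possible implementation and not necessarily the one in the reference solution.

```haskell
-- Repeated from the starter code so the sketch compiles on its own.
data LabeledPoint a = Point a [Double] deriving (Show)

-- Squared distance between two points (one possible definition).
distance :: LabeledPoint a -> LabeledPoint b -> Double
distance (Point _ ns) (Point _ ms) =
  sum (zipWith (\n m -> (n - m) * (n - m)) ns ms)

-- Label of the closest center; ties go to the later center.
nearest :: LabeledPoint a -> [LabeledPoint b] -> b
nearest p centers = labelOf (foldr1 closer centers)
  where
    closer c best = if distance p c < distance p best then c else best
    labelOf (Point l _) = l

-- Replace every point's label with the label of its nearest center,
-- leaving the order and the coordinates untouched.
relabel :: [LabeledPoint a] -> [LabeledPoint b] -> [LabeledPoint b]
relabel points centers =
  map (\p@(Point _ coords) -> Point (nearest p centers) coords) points
```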
center: Takes a list of points and returns the coordinates of the center of the group. In my implementation, this function returns a list of doubles rather than a point. You may assume that there's at least one point in the list, and that all points have the same number of dimensions, but it should work in any number of dimensions. In 2D, for example, it would return a list where the X coordinate (the head of the output list) is the average of the X coordinates across all points, and the Y coordinate (the second value in the list) is the average of all Y coordinates. It might be helpful to think about ways to build a list containing the nth coordinates from across all of the points. If you could extend that function to collect coordinate values and average them, you could then map that across the dimensions. (Hint: You'll probably need to use fromIntegral to make the types work out right in your average calculations.)

    > center [Point 0 [10],Point 1 [12],Point 2 [14]]
    [12.0]
    > center [Point 0 [-5],Point 17 [5]]
    [0.0]
    > center [Point 4 [0,0],Point 5 [10,0],Point 6 [0,10],Point 7 [10,10]]
    [5.0,5.0]
    > center [Point "a" [5,5,5],Point "b" [10,0,15]]
    [7.5,2.5,10.0]
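The text suggests collecting the nth coordinate from every point and averaging it. The sketch below takes a slightly different route to the same result: Data.List's transpose turns the list of points into a list of per-dimension columns in one step, and each column is then averaged. The helper name average is an assumption made for this example.

```haskell
import Data.List (transpose)

-- Repeated from the starter code so the sketch compiles on its own.
data LabeledPoint a = Point a [Double] deriving (Show)

-- Average each dimension across all points. transpose converts
-- [[x1,y1],[x2,y2],...] into [[x1,x2,...],[y1,y2,...]].
center :: [LabeledPoint a] -> [Double]
center points = map average (transpose [coords | Point _ coords <- points])
  where
    -- length returns an Int, so fromIntegral is needed before dividing.
    average xs = sum xs / fromIntegral (length xs)
```

Given an empty list of points, this version returns an empty list of coordinates instead of failing, since transpose of an empty list is empty.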
recenter: Takes a list of data items (assigned to various clusters) and a list of cluster centers and finds the new center points for each of the cluster centers. It returns a list of points (cluster centers) in the same order as the original but with potentially different coordinates. (Hint: You basically need to map the previous function over the cluster centers to implement this.) Note that it's possible that a center could end up with no points assigned to it. My code doesn't handle that case properly and yours doesn't need to either.
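Following the hint, a sketch of recenter can map over the centers, use filter to pick out the points that share each center's label, and hand them to center. The Eq constraint on the label type and the helper name move are assumptions of this sketch, as is the transpose-based center it relies on.

```haskell
import Data.List (transpose)

-- Repeated from the starter code so the sketch compiles on its own.
data LabeledPoint a = Point a [Double] deriving (Show)

-- Per-dimension average of a group of points (one possible definition).
center :: [LabeledPoint a] -> [Double]
center points = map average (transpose [coords | Point _ coords <- points])
  where
    average xs = sum xs / fromIntegral (length xs)

-- Move each center to the mean of the points that carry its label.
-- Labels need Eq so that points can be matched to their center.
-- A center with no points ends up with an empty coordinate list here;
-- as in the original, that case is not handled properly.
recenter :: Eq a => [LabeledPoint a] -> [LabeledPoint a] -> [LabeledPoint a]
recenter points centers = map move centers
  where
    move (Point l _) = Point l (center (filter (\(Point l' _) -> l' == l) points))
```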
cluster: Takes a list of data points and a list of centers, and applies the K-Means algorithm to assign points to clusters. Given the helper functions above, this boils down to calling relabel and recenter over and over until the centers stop changing. The interactions below show how to do those steps manually. First we relabel the points (to get points'), then we recenter the cluster centers within their new groups (to get centers'). Two of the points are associated with the first cluster, and four are associated with the second at this point. After another round of relabeling and recentering, we discover that the solution has converged: centers' is the same as centers''. (Remember that using == on two lists of centers uses LabeledPoint's definition of == for each of the points in the lists, so the two lists wouldn't have to be identical, just pretty darned close.)

    > let points = [Point 0 [0], Point 0 [2], Point 0 [5], Point 0 [7], Point 0 [10], Point 0 [12]]
    > let centers = [Point 0 [2.5], Point 1 [7.1]]
    > let points' = relabel points centers
    > points'
    [Point 0 [0.0],Point 0 [2.0],Point 1 [5.0],Point 1 [7.0],Point 1 [10.0],Point 1 [12.0]]
    > let centers' = recenter points' centers
    > centers'
    [Point 0 [1.0],Point 1 [8.5]]
    > let points'' = relabel points' centers'
    > let centers'' = recenter points'' centers'
    > centers''
    [Point 0 [1.0],Point 1 [8.5]]
    > centers' == centers''
    True

You'll want to write a single (probably recursive) function that repeats the relabel and recenter calls until the centers stop changing, then returns a two-tuple containing both the labeled points and the final centers. Here's a call to cluster on the same inputs as were shown above:

    > cluster points centers
    ([Point 0 [0.0],Point 0 [2.0],Point 1 [5.0],Point 1 [7.0],Point 1 [10.0],Point 1 [12.0]],[Point 0 [1.0],Point 1 [8.5]])
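The relabel-and-recenter loop can be sketched as a recursive function with a guard that stops once the tolerance-based == reports no movement. Everything below other than the LabeledPoint type and its Eq instance is one possible implementation written for this example, not the reference solution, and the helper definitions are assumptions.

```haskell
import Data.List (transpose)

-- Type and tolerance-based equality, as in the starter code
-- (zipWith replaces the zip/map pair; Ord is omitted for brevity).
data LabeledPoint a = Point a [Double] deriving (Show)

instance Eq (LabeledPoint a) where
  Point _ ns == Point _ ms = and (zipWith (\n m -> abs (n - m) < 0.0001) ns ms)

distance :: LabeledPoint a -> LabeledPoint b -> Double
distance (Point _ ns) (Point _ ms) =
  sum (zipWith (\n m -> (n - m) * (n - m)) ns ms)

nearest :: LabeledPoint a -> [LabeledPoint b] -> b
nearest p centers = labelOf (foldr1 closer centers)
  where
    closer c best = if distance p c < distance p best then c else best
    labelOf (Point l _) = l

relabel :: [LabeledPoint a] -> [LabeledPoint b] -> [LabeledPoint b]
relabel points centers =
  map (\p@(Point _ coords) -> Point (nearest p centers) coords) points

center :: [LabeledPoint a] -> [Double]
center points = map average (transpose [coords | Point _ coords <- points])
  where
    average xs = sum xs / fromIntegral (length xs)

recenter :: Eq a => [LabeledPoint a] -> [LabeledPoint a] -> [LabeledPoint a]
recenter points centers = map move centers
  where
    move (Point l _) = Point l (center (filter (\(Point l' _) -> l' == l) points))

-- One round of relabel + recenter; stop when the new centers are
-- (approximately) where the old ones were, otherwise go again.
cluster :: Eq a => [LabeledPoint a] -> [LabeledPoint a]
        -> ([LabeledPoint a], [LabeledPoint a])
cluster points centers
  | centers' == centers = (points', centers')
  | otherwise           = cluster points' centers'
  where
    points'  = relabel points centers
    centers' = recenter points' centers
```

Note that the comparison centers' == centers looks only at coordinates, since the Eq instance ignores labels.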
readCenters: Takes an integer (the number of desired center points to be entered) and prompts the user for the details of the points. It returns a list containing the LabeledPoint values specified by the user. In my code I don't put any restrictions on the number of dimensions in the coordinates, or even enforce that they're the same across points, but you should feel free to write more robust code here if you wish. Also, while the functions above are general enough that the point labels could be of any type, readCenters needs to make a choice. You can see from the type signature at the end of the interactions below that mine treats all labels as strings. (Hint: you can read entire lists of doubles rather than inputting the coordinates individually.) In the interactions below, the text that follows each prompt is what the user typed.

    > readCenters 2
    Please enter a center label: foo
    Please enter a list of coordinates: [3, 4,5]
    Please enter a center label: bar
    Please enter a list of coordinates: [-1.5,0,1.73]
    [Point "foo" [3.0,4.0,5.0],Point "bar" [-1.5,0.0,1.73]]
    > :type readCenters
    readCenters :: Int -> IO [LabeledPoint String]
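A minimal sketch of readCenters, assuming the prompts shown in the sample interaction. The helper parseCoords is hypothetical; it relies on read to parse a whole list such as [3, 4,5] at once, as the hint suggests, and will raise an error on malformed input. The hFlush calls are there so that each prompt appears before the program waits for input when output is buffered.

```haskell
import System.IO (hFlush, stdout)

-- Repeated from the starter code so the sketch compiles on its own.
data LabeledPoint a = Point a [Double] deriving (Show)

-- Hypothetical helper: parse a bracketed list of numbers in one go.
-- No validation is done; bad input makes read fail at run time.
parseCoords :: String -> [Double]
parseCoords = read

-- Prompt for n centers, one label and one coordinate list each.
readCenters :: Int -> IO [LabeledPoint String]
readCenters n
  | n <= 0    = return []
  | otherwise = do
      putStr "Please enter a center label: "
      hFlush stdout
      label <- getLine
      putStr "Please enter a list of coordinates: "
      hFlush stdout
      coords <- fmap parseCoords getLine
      rest <- readCenters (n - 1)
      return (Point label coords : rest)
```

Keeping the parsing in a pure helper leaves only the prompting and reading on the side-effecting side, in line with the advice to keep as much code pure as possible.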
main: Your solution is required to have a main function that starts a run of your K-Means code. I'm including the framework of a main function in the starter code that shows how it hard-codes in a set of data points to be clustered, prints information about the points, prompts the user for information, then runs the clustering function and reports the results. Here's a sample run of my implementation:

    > main
    Welcome to k-Means
    There are 150 points to be labeled.
    It looks like we're working with 2 dimension(s).
    How many centers would you like? 3
    Please enter a center label: one
    Please enter a list of coordinates: [8,10]
    Please enter a center label: two
    Please enter a list of coordinates: [35, 5]
    Please enter a center label: three
    Please enter a list of coordinates: [17.5, 35]
    Final centers are: [Point "one" [9.363265306122447,9.60204081632653],Point "two" [30.13199999999999,9.862],Point "three" [19.55686274509804,24.617647058823533]]
    Labeled points: [Point "one" [12.1,7.2],Point "one" [8.7,11.0],Point "one" [6.1,3.9],Point "one" [11.7,8.2],Point "one" [7.6,8.5],Point "one" [4.3,11.0],Point "one" [7.5,11.7],Point "one" [6.8,6.3],Point "one" [8.6,14.4],Point "three" [17.0,17.6],Point "one" [12.8,16.3],Point "one" [4.2,10.9],Point "one" [2.8,6.6],Point "one" [2.6,15.0],Point "one" [10.1,10.0],Point "one" [2.0,15.6],Point "one" [7.5,8.4],Point "one" [10.3,11.7],Point "one" [14.2,7.8],Point "one" [9.8,3.7],Point "one" [8.6,9.1],Point "one" [9.2,8.2],Point "one" [9.8,7.3],Point "one" [9.6,9.6],Point "one" [10.2,14.5],Point "one" [9.1,9.0],Point "one" [13.0,9.7],Point "one" [6.3,9.4],Point "one" [10.2,10.1],Point "one" [11.1,13.8],Point "one" [9.4,12.0],Point "one" [13.1,5.2],Point "one" [5.9,12.9],Point "one" [12.1,9.3],Point "one" [3.7,13.3],Point "one" [10.8,10.0],Point "one" [9.1,3.2],Point "one" [12.2,-1.1],Point "one" [10.0,10.0],Point "one" [13.9,11.2],Point "one" [9.4,10.0],Point "one" [14.2,7.6],Point "one" [10.7,4.6],Point "one" [9.2,10.4],Point "one" [9.8,10.7],Point "one" [11.6,9.5],Point "one" [13.7,8.3],Point "one" [10.5,11.6],Point "one" [11.2,10.6],Point "one" [11.5,12.3],Point "three" [18.5,26.9],Point "three" [19.9,25.2],Point "three" [18.7,24.2],Point "three" [18.2,33.3],Point "three" [24.1,25.8],Point "three" [18.8,25.9],Point "three" [17.1,24.5],Point "three" [17.0,26.2],Point "three" [20.3,25.0],Point "three" [20.3,25.0],Point "three" [22.0,27.3],Point "three" [21.5,25.8],Point "three" [22.1,26.6],Point "three" [25.6,28.9],Point "three" [13.9,20.7],Point "three" [26.0,27.2],Point "three" [19.9,25.0],Point "three" [10.2,26.0],Point "three" [19.9,24.8],Point "three" [20.3,24.3],Point "three" [19.0,28.7],Point "three" [17.0,22.1],Point "three" [15.2,22.4],Point "three" [21.0,22.2],Point "three" [19.1,28.5],Point "three" [22.6,24.0],Point "three" [18.2,26.4],Point "three" [20.0,25.1],Point "three" [19.3,18.8],Point "three" [20.7,21.3],Point "three" [17.3,24.4],Point "three" [24.5,19.7],Point "three" [12.3,21.6],Point "three" [16.4,25.6],Point "three" [22.0,26.4],Point "three" [18.1,24.1],Point "three" [19.0,23.8],Point "three" [21.1,31.5],Point "three" [17.4,24.0],Point "three" [20.1,24.9],Point "three" [17.7,22.0],Point "three" [20.6,29.3],Point "three" [21.2,29.7],Point "three" [19.8,18.9],Point "three" [21.4,18.8],Point "three" [23.1,23.5],Point "three" [21.2,24.5],Point "three" [21.0,24.5],Point "three" [19.9,23.9],Point "three" [19.9,18.7],Point "two" [25.9,14.2],Point "two" [26.5,12.4],Point "two" [32.3,10.3],Point "two" [26.0,5.7],Point "two" [22.7,6.0],Point "two" [30.1,13.0],Point "two" [33.8,8.1],Point "two" [29.9,9.3],Point "two" [38.9,10.3],Point "two" [30.7,10.5],Point "two" [35.2,1.3],Point "two" [28.4,14.8],Point "two" [30.8,8.2],Point "two" [26.3,16.8],Point "two" [24.3,7.6],Point "two" [32.1,14.4],Point "two" [31.2,8.6],Point "two" [31.4,9.3],Point "two" [31.3,10.5],Point "two" [35.4,13.3],Point "two" [30.0,10.7],Point "two" [29.1,10.2],Point "two" [25.2,6.0],Point "two" [30.3,10.0],Point "two" [30.0,2.9],Point "two" [28.7,0.3],Point "two" [29.2,8.8],Point "two" [25.4,-1.5],Point "two" [24.0,4.5],Point "two" [30.2,14.3],Point "two" [33.5,11.2],Point "two" [31.9,10.7],Point "two" [28.1,10.4],Point "two" [29.9,8.7],Point "two" [31.6,10.4],Point "two" [29.3,8.5],Point "two" [29.3,13.7],Point "two" [27.2,11.3],Point "two" [33.0,7.7],Point "two" [30.0,10.0],Point "two" [28.4,15.1],Point "two" [37.1,13.3],Point "two" [37.6,13.1],Point "two" [29.6,10.7],Point "two" [31.9,14.2],Point "two" [29.5,9.7],Point "two" [30.1,9.5],Point "two" [29.6,9.6],Point "two" [34.1,14.7],Point "two" [29.6,9.8]]

I've put together snapshots of the clustering process on a separate page.
points_2D
.