# K-Means ClusteringÂ¶

In this notebook, we're going to learn about the K-Means clustering algorithm. We'll implement it ourselves and look at various applications.

## ClusteringÂ¶

*Clustering* is a form of unsupervised machine learning. Given a set $X$, the aim of clustering is to partition $X$ such that the elements of each partition are "close". I've put "close" in quotation marks because you can define "close" in any way that's convenient for the problem you're solving. Mathematically, we say that the *metric* we use to measure closeness is up to us.

A *metric* is a measure of closeness. Specifically, given a set $X$, we define a metric as a map $d: X \times X \rightarrow \mathbb{R}$ that satisfies the following for any $x, y, z \in X$:

- Positive Definiteness: $d(x,y) \geq 0$ with $d(x,y) = 0$ if and only if $x = y$.
- Symmetry: $d(x,y) = d(y,x)$
- Triangle Inequality: $d(x,y) \leq d(x,z) + d(z,y)$.

The pair $(X, d)$ is called a *metric space*.

We can reformulate the concept of clustering in terms of metric spaces. Given a metric space $(X, d)$, clustering aims to partition $X$ such that the elements of each partition are close according to the metric $d$.

It's useful to think of clustering in terms of metric spaces because they're so abstract. As long as you can represent elements of $X$ and the metric $d$ in code, you can run a clustering algorithm on that metric space.

### Clustering AlgorithmsÂ¶

A *clustering algorithm* attempts to find the optimal partition of $X$; that is, the partition where the elements of each set in the partition are closer than they would be in any other partition.

Now, if the optimal partition can consist of sets of any size, then the obvious answer to this problem is to partition $X$ into $n$ sets of size 1, where $n$ is the cardinality of $X$. That's not very interesting, so we typically choose a fixed number of sets $k$ that the optimal partition will have. $k$ is a *hyperparameter* of the clustering algorithm.

Mathematically, if $X = \{x_1, \ldots, x_n\}$, a *$k$-partition* of $X$ is a partition with exactly $k$ non-empty elements. We can write a $k$-partition as $S = \{S_1, \ldots, S_k\}$, where $X = \bigcup_{i=1}^k{S_i}$ and $S_j \cap S_i = \emptyset$ for any $j, i$.

So, given the space $(X, d)$ and $k$, let $\mathcal{S}$ be the set of all $k$-partitions of $X$. A *clustering algorithm* finds the $k$-partition $S^*$ that minimizes the metric $d$ within each partition, that is,

$$S^* = \text{argmin}_{S \in \mathcal{S}} \sum_i^k{\sum_{x_j \in S_i}{d(x_j, \mu_i)}}$$

Here, $\mu_i$ is the mean of $S_i$. In a metric space, $\mu_i$ is the element of $S_i$ that satisfies

$$\mu_i = \text{argmin}_{x_p \in S_i} \sum_{x_j \in S_i}{d(x_j, x_p)}$$

I should point out that $\mu_i$ doesn't necessarily have to be in $X$. If $X$ is a subset of $\mathbb{R}^n$ or $\mathbb{C}^n$, then $\mu_i$ can simply be calculated as the usual mean of the vectors in $S_i$.

In pretty much all real-world scenarios, $X$ will be rather large and it would be expensive to calculate $S^*$ by brute force. For that reason, there are algorithms that give us a reasonably good estimate of $S^*$ in a much shorter time. One of those algorithms is called K-Means.

## The K-Means Clustering AlgorithmÂ¶

The K-Means clustering algorithm works as follows:

- From $X$, choose $k$ initial cluster means, either randomly or by some heuristic.
- For $i = 0, \ldots, \text{maxiter}$
- Assign each data point in $X$ to the cluster mean that it's closest to
- Recompute the cluster means as the means of the newly formed clusters.
- If the new cluster means are sufficiently close ($\leq \text{tol}$) to the old cluster means, terminate early.

Let's walk through this algorithm in python code. We'll use the iris flower dataset, reduced to two principle components, as $X$. The iris dataset is a well-known dataset and is often used for testing machine learning models. You can read more about it here.

The TLDR is that the iris dataset contains 150 samples classified into 3 types of flowers: setosa, versicolor, and virginica. There are 5 features in the dataset, but we want to display our results on 2-d graphs. So, we'll reduce the dataset to just 2 principal components. I won't explain what a principal component is in this notebook; that's a very interesting subject for another time.

Here's a plot of the iris dataset for reference.

```
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
plt.rcParams["figure.dpi"] = 200
iris_dataset = load_iris()
pca = PCA(n_components=2)
data = pca.fit_transform(iris_dataset.data)
colors = np.array(['#7d4e88', '#42a19d', '#fde725'])
for k in np.unique(iris_dataset.target):
mask = iris_dataset.target == k
x = data[:,0][mask]
y = data[:,1][mask]
plt.scatter(x, y, s=5, c=colors[k], label=iris_dataset.target_names[k])
plt.title('The Iris Dataset')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.legend(loc='lower left')
plt.show()
```

The first step in implementing the K-Means algorithm is to define the hyperparameters and the set $X$.

We'll set $X$ to the $150 \times 2$ matrix containing the iris data. We'll also set the tolerance (`tol`

) to `0.00001`

and the maximum number of iterations `max_iter`

to 100.

Since there are 3 types of flowers in the iris dataset, we'll set $k$ (`n_clusters`

) to 3 and compare the clusters that the algorithm creates to the true clusters in the graph above.

```
X = data
max_iter = 100
tol = 1e-5
n_clusters = 3
print("First 5 rows\n", X[:5,:])
```

First 5 rows [[-2.68412563 0.31939725] [-2.71414169 -0.17700123] [-2.88899057 -0.14494943] [-2.74534286 -0.31829898] [-2.72871654 0.32675451]]

The next step is to randomly assign the initial cluster means and define a map where we'll store the clusters.

```
means = X[np.random.randint(X.shape[0], size=n_clusters),:]
clusters = {k: [] for k in range(n_clusters)}
```

Now for the main loop. On each iteration, we need to do 3 things.

First, we assign each sample in $X$ to the cluster mean that is closest. The distance between two vectors $x, y$ is the norm of their difference $\lVert x - y \rVert$. Since we're storing the means as a $3 \times 2$ NumPy array, we can quickly calculate the distances by taking the norm of the difference along axis 1.

Once we've built up the clusters, the second step is to recompute the cluster means as the means of the new clusters. If one of the clusters happens to be empty, we'll just use a random vector from $X$ instead.

Finally, we check to see if the newly computed means are within `tol`

of the means from the previous iteration. Again, we can do this by taking the norm of the difference along axis 1 and grabbing the maximum norm from that list.

```
for i in range(max_iter):
# Assign each sample to the cluster mean that is closest
for sample in X:
distances = np.linalg.norm(means - sample, axis=1)
classification = np.argmin(distances)
clusters[classification].append(sample)
# Recompute cluster means as the means of the new clusters
new_means = np.zeros_like(means)
for k, cluster in clusters.items():
if len(cluster) == 0:
new_means[k] = X[np.random.randint(X.shape[0]),:]
else:
new_means[k] = np.mean(cluster, axis=0)
# If old and new means are sufficiently close, terminate early
if np.max(np.linalg.norm(means - new_means, axis=1)) <= tol:
break
# Otherwise, assign new means, reset clusters and continue
means = new_means
clusters = {k: [] for k in range(n_clusters)}
```

Let's see how it did. We'll see which cluster each data point belongs to by iterating over the data and comparing it to each cluster mean.

```
classifications = []
for row in X:
classification = np.argmin(np.linalg.norm(means - row, axis=1))
classifications.append(classification)
classifications = np.array(classifications)
plt.scatter(data[:,0], data[:,1], s=5, c=colors[classifications])
for k in range(means.shape[0]):
plt.scatter(means[k,0], means[k,1], marker='+', c=colors[k], label='Cluster Mean' if k == 0 else None)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('K-Means Clustering on the Iris Dataset')
plt.legend(loc='upper left')
plt.show()
```

It looks like the clusters it picked out are pretty close to the original data!

Now you might be wondering, what was the point of that? We already know which group each point belongs to, so why run K-Means?

Well, first of all, imagine we were working with a dataset where we didn't know the correct groups. In that case, K-Means would help us make sense of the data by dividing it into groups of points that are close together. And remember, K-Means works on any metric space so even if you can't plot the data, K-Means can still give you groups that make sense according to the metric you're using.

Another important use of K-Means is that it gives us a *model* by which to label data points outside of the set $X$.

## Clustering Algorithms as a Predictive ModelÂ¶

One use of a clustering algorithm is that we can use the resulting $k$-partition $S^*$ (or approximation thereof) to classify elements outside of $X$.

In this case, we expand the metric space $X$ with a set $W$ where $d$ still applies. This gives us a new, larger metric space $(Y, d)$ where $Y = X \cup W$. We call $Y$ the *domain* of the clustering algorithm and we call $X$ the training data. When we run K-Means on $X$, we say that we're "fitting" a K-Means model to the training data.

Then, given $w \in W$, we can predict which $S_i$ $w$ belongs to by calculating

$$S_k = \text{argmin}_{S_i \in S^*} d(w, \mu_i)$$

Below, we group everything we've talked about so far into a class using sklearn's API.

```
from sklearn.base import BaseEstimator, ClusterMixin
class KMeans(BaseEstimator, ClusterMixin):
"""Basic k-means clustering class."""
def __init__(self, n_clusters=8, max_iter=100, tol=1e-5, norm=2):
"""Store clustering algorithm parameters.
Parameters:
n_clusters (int): How many clusters to compute.
max_iter (int): The maximum number of iterations to compute.
tol (float): The convergence tolerance.
norm (int): The order of the norm to use
"""
self.n_clusters = n_clusters
self.max_iter = max_iter
self.tol = tol
self.norm = norm
def fit(self, X, y=None):
"""Compute the cluster means from random initial conditions.
Parameters:
X ((n_samples, n_classes) ndarray): the data to be clustered.
"""
# Randomly assign initial means and set empty clusters
means = X[np.random.randint(X.shape[0], size=self.n_clusters),:]
clusters = {k: [] for k in range(self.n_clusters)}
for i in range(self.max_iter):
# Assign each sample to the cluster mean that is closest
for sample in X:
# The distance between two vectors is the norm of their
# difference. So we take the difference and compute
# the norm on each row.
distances = np.linalg.norm(means - sample, axis=1, ord=self.norm)
classification = np.argmin(distances)
clusters[classification].append(sample)
# Recompute cluster means as the means of the new clusters
new_means = np.zeros_like(means)
for k, cluster in clusters.items():
if len(cluster) == 0: # If the cluster is empty, use a random data point
new_means[k] = X[np.random.randint(X.shape[0]),:]
else:
new_means[k] = np.mean(cluster, axis=0)
# If old and new means are sufficiently close, terminate early
max_diff = np.max(np.linalg.norm(means - new_means, axis=1, ord=self.norm))
if max_diff <= self.tol:
break
# Otherwise, assign new means, reset clusters and continue
means = new_means
clusters = {k: [] for k in range(self.n_clusters)}
self.means = means
return self
def predict(self, W):
"""Classify each entry of W based on which cluster mean it belongs to.
Parameters:
W ((n_samples, n_classes) ndarray): the data to be clustered.
Returns:
((n_samples) ndarray): Integer labels from 0 to n_clusters for each entry of W.
"""
classifications = []
for row in W:
classification = np.argmin(np.linalg.norm(self.means - row, axis=1, ord=self.norm))
classifications.append(classification)
return np.array(classifications)
def fit_predict(self, X, y=None):
"""Fit to the data and return the resulting labels.
Parameters:
X ((n_samples, n_classes) ndarray): the data to be clustered.
"""
return self.fit(X).predict(X)
```

You'll notice that I added an additional hyperparameter, `norm`

. This allows us to change the metric we use from the regular norm (usually called the 2-norm and deonted $\lVert \cdot \rVert_2$) to other norms, like the 1-norm $\lVert \cdot \rVert_1$.

If you don't know, for any $x \in \mathbb{R}^n$ and $p \in \mathbb{N}$, the $p$-norm is defined as

$$\lVert x \rVert_p = (\lvert x_1 \rvert^p + \ldots + \lvert x_n \rvert^p)^{1/p}$$

As we talked about before, different metrics can provide better results in different contexts. Let's do an example.

## An Example with City PlanningÂ¶

Part of the city planning process is to figure out where fire stations should go. Each house needs to be within a reasonable distance of a fire station. K-Means can help with this.

Given the latitudes and longitudes of all the houses in Sacramento, K-Means will give us the optimal locations for fire stations throughout the city, so that no house is too far from a single station.

The function below accepts the number of clusters and the norm to use, and it plots the housing data with the cluster means as `+`

signs, which represent where the fire stations should go. It also tells us what the maximum distance any house is from a fire station under the 2-norm, so we can compare the distances at the end.

```
def plot_fire_stations(n_clusters, norm):
data = np.load('sacramento.npy')
model = KMeans(n_clusters=n_clusters, norm=norm)
labels = model.fit_predict(data)
means = model.means
plt.scatter(data[:,0], data[:,1], s=2, c=labels)
plt.scatter(means[:,0], means[:,1], s=50, c=np.unique(labels), marker='+')
plt.ylabel('Latitude')
plt.xlabel('Longitude')
plt.title(f'Optimal Sacramento Fire Station Placement: {norm}-norm')
plt.show()
```

Let's try it out with the regular 2-norm first.

```
plot_fire_stations(16, 2)
```

It's a bit difficult to interpret the max distance number, since our data is in latitude and longitude. We could say that no house is more than 0.19 latitude/longitude units away from a fire station. The plot helps to interpret this number. You can see that most houses are quite close to a station (marked by +'s).

The issue with using the 2-norm is that the 2-norm represents distance "as the crow flies", or in other words, from one point directly to another. Fire trucks don't travel as the crow flies, they travel along the streets. What metric could we use to account for this?

First, notice that the streets in the plot above appear to be in a grid. To get from one point to another in a grid system, you have to first travel along either the x or y axis, then travel along the other axis. The 1-norm is a metric that calculates the distance between two points as the total difference of the x and y coordinates. This is a much more accurate measure of distance in a grid system, so let's try again with the 1-norm.

```
plot_fire_stations(16, 1)
```

With the 1-norm, notice how the clusters are more diamond shaped. This is a reflection of the fact that it's quick to get to a fire station if you live on either of the streets that intersect it.