Project: MNIST with k-means

Project weight: 10 points

This project uses the MNIST database. The files (the same as for the k-NN project) can be downloaded here:

Objectives

Part 1.

  • Use the k-means algorithm to split MNIST images into 10 clusters.

  • Display centroids of the clusters. Do they resemble digits?

  • Check how useful the clusters are for classifying images of digits. That is, check how accurately we can predict which image corresponds to which digit if we know which cluster the image belongs to.
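The steps of Part 1 could be sketched roughly as follows. This is not the required solution; it uses sklearn's small built-in digits dataset (8x8 images) as a stand-in so the sketch is self-contained, whereas your project should load the MNIST files from the k-NN project.

```python
# Sketch of Part 1: cluster digit images into 10 groups with k-means,
# inspect the centroids, and see how well cluster membership predicts
# the digit. Uses sklearn's 8x8 digits data as a stand-in for MNIST.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans

X, y = load_digits(return_X_y=True)  # 1797 images, 64 features each

# Split the images into 10 clusters.
km = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Each centroid can be reshaped back into an image and displayed
# (e.g. with matplotlib's imshow) to see whether it resembles a digit.
centroids = km.cluster_centers_.reshape(10, 8, 8)

# Predict every image's digit as the majority digit of its cluster,
# then measure how accurate that prediction is.
cluster_to_digit = np.zeros(10, dtype=int)
for c in range(10):
    cluster_to_digit[c] = np.bincount(y[labels == c]).argmax()
pred = cluster_to_digit[labels]
accuracy = (pred == y).mean()
print(f"cluster-based accuracy: {accuracy:.3f}")
```

Note that the clusters carry no digit labels by themselves; the majority-vote mapping above is one simple way to turn them into a classifier so the "how useful are the clusters" question can be answered with a number.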

Part 2.

Note: For full credit, your report needs to investigate only one of the following approaches (2a or 2b). You may experiment with both of them, but this is not required.

An issue with the k-NN algorithm is that it can be slow if the set of training data is large and the data is high-dimensional. The goal of this part of the project is to investigate two different ways of dealing with these issues using k-means clustering. One way is to reduce the amount of training data, and the other is to reduce the dimensionality of the data.

Part 2a The goal of this part is to experiment with a possible way of making k-NN predictions faster by reducing the amount of training data.

  1. Split MNIST images into training and test data.

  2. For each digit 0-9 select all training images corresponding to this digit, and then use k-means to split these images into several (e.g. 100) clusters. Use centroids of these clusters as new training data for k-NN. This will make the amount of training data smaller, since every cluster of the original training data will be replaced by a single centroid.

  3. Investigate how the prediction accuracy and speed of k-NN using this new training data compare to the case when you use the training data without clustering. Also, compare k-NN trained on the centroid training data against k-NN trained on the same number of randomly selected training images: is there a difference in prediction accuracy?
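The steps above could be sketched as follows, again using sklearn's small digits dataset as a stand-in for MNIST and 10 clusters per digit instead of the 100 suggested for the full dataset. The variable names are illustrative, not prescribed.

```python
# Sketch of Part 2a: replace each digit's training images by k-means
# centroids, then train sklearn's k-NN on the much smaller set.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# For every digit 0-9, cluster that digit's training images and keep
# only the centroids as new training points labeled with the digit.
n_per_digit = 10  # the project suggests e.g. 100 for full MNIST
cents, cent_labels = [], []
for d in range(10):
    Xd = X_tr[y_tr == d]
    km = KMeans(n_clusters=n_per_digit, n_init=10, random_state=0).fit(Xd)
    cents.append(km.cluster_centers_)
    cent_labels.append(np.full(n_per_digit, d))
X_small = np.vstack(cents)           # 100 centroids replace ~1350 images
y_small = np.concatenate(cent_labels)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_small, y_small)
acc = knn.score(X_te, y_te)
print(f"accuracy with centroid training data: {acc:.3f}")
```

For the comparison in step 3, the same `KNeighborsClassifier` can be fit on the full `X_tr` and on a random subset of the same size as `X_small`, timing predictions in each case.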

Part 2b The goal of this part is to experiment with a possible way of making k-NN predictions faster by reducing the dimensionality of the MNIST data.

  1. Split MNIST images into training and test data.

  2. Use k-means to compute clusters for the training data. Then use these clusters to reduce the dimensionality of both the training and test data, and investigate how the prediction accuracy and speed of k-NN using the reduced data compare to predictions made without dimensionality reduction. Experiment with dimensionality reduction using different numbers of clusters and compare the results.
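One possible way to carry out the reduction in step 2 (an assumption on my part, not the only valid reading) is to represent each image by its distances to the cluster centroids, so the data goes from 784 dimensions (64 in this stand-in digits dataset) down to one dimension per cluster. This is exactly what sklearn's `KMeans.transform` computes:

```python
# Sketch of Part 2b: fit k-means on the training data, then represent
# every image (training and test) by its distances to the centroids.
# Uses sklearn's 8x8 digits data as a stand-in for MNIST.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit k-means on the training data only, then map both sets from
# 64 dimensions down to n_clusters dimensions.
n_clusters = 20
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_tr)
X_tr_red = km.transform(X_tr)  # shape (n_train, n_clusters)
X_te_red = km.transform(X_te)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_tr_red, y_tr)
acc = knn.score(X_te_red, y_te)
print(f"accuracy in {n_clusters} dimensions: {acc:.3f}")
```

Rerunning this with different values of `n_clusters` (and timing `knn.score`) gives the accuracy/speed comparison the project asks for; note that k-means is fit on the training data only, so the test data never influences the reduction.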

Note. For Parts 2a and 2b, use the k-NN classifier implemented in sklearn. It will run faster than the implementation you wrote yourself.