Unsupervised Learning and Clustering

26. Unsupervised Learning and Clustering#

Amanda R. Kube Jotte

The models that we’ve talked about so far in this book belong to a category of machine learning called Supervised Learning. In supervised learning algorithms, every observation \(x_i\) has an associated label/response \(y_i\). We learn the relationship between the inputs and response with the goal of predicting the response for new observations.

In this chapter, we will discuss the other main category of statistical learning, Unsupervised Learning. We call this set of algorithms unsupervised because we no longer have an associated response for each observation. We can still look for patterns in the data, but now we are looking for patterns between input variables or observations.

Just as we had different categories of supervised learning (regression and classification), there are different categories of unsupervised learning as well. Here, we will focus on one of the most commonly used categories of unsupervised learning, clustering. Clustering algorithms partition data into distinct subgroups with the goal of finding groups such that observations within the same group are similar to each other and different from observations in other groups.

Two common use cases of clustering are market segmentation and medical image analysis. Market segmentation refers to businesses grouping customers based on purchasing behavior and using that information to create targeted marketing strategies for particular groups of customers. An example of clustering in medicine might include grouping images of tumors based on physical characteristics and using this information to help identify different subtypes of cancers. It is also common to use clustering along with domain knowledge to label observations and then use those labels for supervised learning.

In the following sections, we will discuss two common clustering algorithms, K-Means Clustering and Hierarchical Clustering, their implementation, and use cases.