26.2. K-Means Clustering: A Larger Example#

Now that we understand the k-means clustering algorithm, let’s try an example with more features and use an elbow plot to choose \(k\). We will also show how you can (and should!) run the algorithm multiple times with different initial centroids because, as we saw in the animations from the previous section, the initialization can affect the final clustering.

Clustering Countries#

For this example, we will use a dataset[1] with information about countries across the world. It includes demographic, economic, environmental, and socio-economic information from 2023. This data and more information about it can be found here. The first few rows are shown below.

import pandas as pd

countries = pd.read_csv("../../data/world-data-2023.csv")
countries.head()
Country Density\n(P/Km2) Abbreviation Agricultural Land( %) Land Area(Km2) Armed Forces size Birth Rate Calling Code Capital/Major City Co2-Emissions ... Out of pocket health expenditure Physicians per thousand Population Population: Labor force participation (%) Tax revenue (%) Total tax rate Unemployment rate Urban_population Latitude Longitude
0 Afghanistan 60 AF 58.10% 652,230 323,000 32.49 93.0 Kabul 8,672 ... 78.40% 0.28 38,041,754 48.90% 9.30% 71.40% 11.12% 9,797,273 33.939110 67.709953
1 Albania 105 AL 43.10% 28,748 9,000 11.78 355.0 Tirana 4,536 ... 56.90% 1.20 2,854,191 55.70% 18.60% 36.60% 12.33% 1,747,593 41.153332 20.168331
2 Algeria 18 DZ 17.40% 2,381,741 317,000 24.28 213.0 Algiers 150,006 ... 28.10% 1.72 43,053,054 41.20% 37.20% 66.10% 11.70% 31,510,100 28.033886 1.659626
3 Andorra 164 AD 40.00% 468 NaN 7.20 376.0 Andorra la Vella 469 ... 36.40% 3.33 77,142 NaN NaN NaN NaN 67,873 42.506285 1.521801
4 Angola 26 AO 47.50% 1,246,700 117,000 40.73 244.0 Luanda 34,693 ... 33.40% 0.21 31,825,295 77.50% 9.20% 49.10% 6.89% 21,061,025 -11.202692 17.873887

5 rows × 35 columns

We want to see if we can cluster countries based on their characteristics. First, we need to do some cleaning. I don’t want to include Abbreviation, Calling Code, Capital/Major City, Largest city, Latitude, or Longitude in my analysis because they identify a given country rather than describe it. I also see some variables that are numeric in meaning but contain percentage signs, dollar signs, and commas. These characters cause the variable to be stored as a string, but I would like these values to be floats instead so that Python knows they have a numerical meaning.

The code used for this cleaning is hidden for brevity, but the resulting clean dataframe is shown below.

countries_clean = countries.drop(columns = ['Abbreviation', 'Calling Code', 'Capital/Major City', 'Largest city', 'Latitude', 'Longitude','Minimum wage'])

def str_to_num(my_input):
    '''Takes in a number in string format and removes dollar signs,
    commas, and percentage signs before returning it as a float or int

    If the string is not a number or the input is not a string,
    returns the input unchanged'''

    if isinstance(my_input, str):

        cleaned_input = my_input.strip() #strip leading and trailing whitespace
        cleaned_input = cleaned_input.removeprefix("$").removesuffix("%") #remove these characters if they are present
        
        if cleaned_input.isdigit():
            return int(cleaned_input)
        elif ("." in cleaned_input) and (cleaned_input.replace(".","").replace("-","").isdigit()): #is the only non-digit character a "."
            return float(cleaned_input)
        elif ("," in cleaned_input) and (cleaned_input.replace(",","").replace("-","").isdigit()): #is the only non-digit character a ","
            return int(cleaned_input.replace(",",""))
        elif ("." in cleaned_input) and ("," in cleaned_input) and (cleaned_input.replace(".","").replace(",","").replace("-","").isdigit()): #contains 2 non-digit characters "," and "."
            return float(cleaned_input.replace(",",""))
        else:
            return my_input
    else:
        return my_input
    
countries_clean = countries_clean.map(str_to_num) #apply this function to every cell in the dataframe
countries_clean = countries_clean.dropna(subset=countries_clean.columns.difference(['Official language','Currency-Code']), ignore_index=True) #remove rows with missing values in any numeric column
countries_clean.head()
Country Density\n(P/Km2) Agricultural Land( %) Land Area(Km2) Armed Forces size Birth Rate Co2-Emissions CPI CPI Change (%) Currency-Code ... Maternal mortality ratio Official language Out of pocket health expenditure Physicians per thousand Population Population: Labor force participation (%) Tax revenue (%) Total tax rate Unemployment rate Urban_population
0 Afghanistan 60 58.1 652230.0 323000.0 32.49 8672.0 149.90 2.3 AFN ... 638.0 Pashto 78.4 0.28 38041754.0 48.9 9.3 71.4 11.12 9797273.0
1 Albania 105 43.1 28748.0 9000.0 11.78 4536.0 119.05 1.4 ALL ... 15.0 Albanian 56.9 1.20 2854191.0 55.7 18.6 36.6 12.33 1747593.0
2 Algeria 18 17.4 2381741.0 317000.0 24.28 150006.0 151.36 2.0 DZD ... 112.0 Arabic 28.1 1.72 43053054.0 41.2 37.2 66.1 11.70 31510100.0
3 Angola 26 47.5 1246700.0 117000.0 40.73 34693.0 261.73 17.1 AOA ... 241.0 Portuguese 33.4 0.21 31825295.0 77.5 9.2 49.1 6.89 21061025.0
4 Argentina 17 54.3 2780400.0 105000.0 17.02 201348.0 232.75 53.5 ARS ... 39.0 Spanish 17.6 3.96 44938712.0 61.3 10.1 106.3 9.79 41339571.0

5 rows × 28 columns

Preprocessing the Data#

In the previous section, we wrote our own functions to implement the k-means algorithm. This is a useful exercise to make sure we understand how the algorithm works, but, as we know, there are libraries with optimized functions for these kinds of common analyses. The scikit-learn library (sklearn) has built-in functions for k-means clustering that are much faster than the functions we wrote. Let’s use these functions to cluster our countries dataset.

Before we can cluster the data, we need to do some preprocessing. Below, I import StandardScaler, which we can use to standardize our data.

from sklearn.preprocessing import StandardScaler

Next, we separate our numeric and categorical data for ease of preprocessing.

country_names = countries_clean['Country']
num_columns = countries_clean.drop(columns=['Country', 'Currency-Code', 'Official language'])
cat_columns = countries_clean[['Currency-Code', 'Official language']]

Now, we can use get_dummies from the pandas library to dummy code our categorical features. I set drop_first equal to True so that the first category is dropped and used as the reference level. I also set dummy_na equal to True, which creates a dummy variable indicating which values are missing.

cat_dummies = pd.get_dummies(cat_columns, drop_first=True, dummy_na=True)
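
If we want to see what dummy_na=True produced, one optional check (not required for the analysis) is to count how many rows are flagged in the missing-value dummy columns, whose names end in _nan:

#Optional check: count rows flagged as missing in each "_nan" dummy column
print(cat_dummies.filter(like='_nan').sum())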

Next, we need to initialize our StandardScaler and use it to scale our numeric features.

scaler = StandardScaler()
num_scaled = pd.DataFrame(scaler.fit_transform(num_columns),columns=num_columns.columns)
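
As a quick sanity check (again optional), each standardized column should now have a mean of approximately 0 and a standard deviation of approximately 1:

#Standardized columns should have mean ~0 and standard deviation ~1
print(num_scaled.mean().round(2).head())
print(num_scaled.std().round(2).head())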

Now, we can put our categorical and numerical data back together into one preprocessed dataframe using the concat function from pandas.

countries_proc = pd.concat([num_scaled,cat_dummies], axis = 1)
countries_proc.head()
Density\n(P/Km2) Agricultural Land( %) Land Area(Km2) Armed Forces size Birth Rate Co2-Emissions CPI CPI Change (%) Fertility Rate Forested Area (%) ... Official language_Swahili Official language_Swedish Official language_Tamil Official language_Thai Official language_Tok Pisin Official language_Turkish Official language_Ukrainian Official language_Urdu Official language_Vietnamese Official language_nan
0 -0.215644 0.818138 -0.115866 0.348798 1.264388 -0.232539 -0.077214 -0.280488 1.414957 -1.256779 ... False False False False False False False False False False
1 -0.156441 0.115295 -0.389922 -0.417215 -0.788460 -0.236728 -0.335232 -0.389776 -0.774075 -0.051855 ... False False False False False False False False False False
2 -0.270901 -1.088910 0.644351 0.334160 0.450584 -0.089375 -0.065003 -0.316917 0.301239 -1.317025 ... False False False False False False False False False False
3 -0.260376 0.321462 0.145437 -0.153746 2.081166 -0.206181 0.858093 1.516694 2.221443 0.791591 ... False False False False False False False False False False
4 -0.272217 0.640084 0.819584 -0.183020 -0.269053 -0.037369 0.615715 5.936791 -0.282503 -0.899936 ... False False False False False False False False False False

5 rows × 192 columns

Choosing K#

Now that our data has been preprocessed, we are ready to start clustering. First, we import the KMeans function from sklearn.cluster.

from sklearn.cluster import KMeans

The KMeans function takes in the number of clusters, \(k\), as n_clusters, the number of times the algorithm should be run with different initial centroids as n_init, and a random seed (as explained in Section 10.3) as random_state. It also takes in a maximum number of iterations as max_iter (default 300) and a convergence tolerance as tol (default \(10^{-4}\)). For more information about the function, see the scikit-learn documentation here.
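
For example, all of these arguments could be passed at once as shown below; the specific values here are only illustrative placeholders, not the settings we will use for the countries data.

kmeans_example = KMeans(
    n_clusters=3,     #k, the number of clusters (placeholder value)
    n_init=10,        #number of runs with different initial centroids
    random_state=42,  #random seed for reproducibility (arbitrary choice)
    max_iter=300,     #maximum iterations per run (the default)
    tol=1e-4,         #convergence tolerance (the default)
)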

As we mentioned in the previous section, when it is not obvious how many clusters to use, we can build an elbow plot to help us choose \(k\). Below, we use iteration to try different values (here 1 through 10) for \(k\). For each \(k\) we try, we initialize our KMeans() function with that \(k\) value and set n_init to 10, which runs the algorithm with 10 different random initial centroids and keeps the clustering with the smallest WCV. We fit this model to countries_proc and save the WCV, which can be found using the attribute .inertia_. The for loop below results in a list of WCV values which we can use to build our elbow plot.

import matplotlib.pyplot as plt

wcv = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10)
    kmeans.fit(countries_proc)
    wcv.append(kmeans.inertia_)

plt.plot(range(1, 11), wcv)
plt.xlabel("Number of Clusters: k")
plt.ylabel("Within-Cluster Variation")
plt.title('Elbow Plot for Choosing Number of Country Clusters');
(Figure: elbow plot of within-cluster variation versus the number of clusters \(k\).)

The elbow of this plot is not as clear as the one from the previous section; it looks to be somewhere between 3 and 5. We will choose \(k = 4\) clusters for our data, since 4 is in the middle of that range.

Training Our Model#

Now that we have chosen \(k\), we can use KMeans to cluster our dataset into 4 clusters. The attribute .labels_ shows us the cluster membership for each row of countries_proc.

kmeans = KMeans(n_clusters=4, n_init=10)
kmeans.fit(countries_proc)

kmeans.labels_
array([0, 2, 1, 0, 1, 1, 2, 2, 1, 1, 1, 2, 2, 2, 0, 1, 1, 2, 0, 0, 0, 1,
       0, 2, 2, 3, 1, 0, 2, 2, 2, 2, 0, 2, 1, 1, 2, 0, 1, 2, 2, 0, 0, 2,
       2, 0, 2, 1, 0, 1, 2, 2, 3, 1, 1, 1, 2, 2, 2, 2, 1, 1, 0, 1, 1, 0,
       2, 1, 2, 2, 0, 0, 1, 0, 2, 2, 1, 1, 1, 1, 0, 1, 1, 2, 1, 0, 0, 2,
       1, 1, 0, 1, 2, 1, 2, 2, 1, 2, 1, 0, 1, 0, 2, 0, 2, 2, 2, 1, 2, 2,
       1, 0, 2, 2, 2, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 2, 3, 2, 1, 0],
      dtype=int32)

We can investigate which countries were clustered together using the country_names data which we extracted from our original dataset.

Cluster 0 seems to contain mostly African countries.

country_names[kmeans.labels_ == 0]
0                           Afghanistan
3                                Angola
14                                Benin
18                         Burkina Faso
19                              Burundi
20                          Ivory Coast
22                             Cameroon
27                Republic of the Congo
32     Democratic Republic of the Congo
37                             Ethiopia
41                                Gabon
42                           The Gambia
45                                Ghana
48                               Guinea
62                                Kenya
65                                 Laos
70                           Madagascar
71                               Malawi
73                                 Mali
80                           Mozambique
85                                Niger
86                              Nigeria
90                     Papua New Guinea
99                               Rwanda
101                             Senegal
103                        Sierra Leone
111                               Sudan
117                            Tanzania
119                          East Timor
120                                Togo
124                              Uganda
131                              Zambia
Name: Country, dtype: object

Cluster 1 contains many Middle Eastern, Asian, and Eastern European countries, as well as South and Central American countries.

country_names[kmeans.labels_ == 1]
2                   Algeria
4                 Argentina
5                   Armenia
8                Azerbaijan
9                   Bahrain
10               Bangladesh
15                 Botswana
16                   Brazil
21               Cape Verde
26                 Colombia
34       Dominican Republic
35                    Egypt
38                     Fiji
47                Guatemala
49                 Honduras
53                Indonesia
54                     Iran
55                     Iraq
60                   Jordan
61               Kazakhstan
63                   Kuwait
64               Kyrgyzstan
67                  Lebanon
72                 Malaysia
76                   Mexico
77                  Moldova
78                 Mongolia
79                  Morocco
81                  Myanmar
82                    Nepal
84                Nicaragua
88                     Oman
89                 Pakistan
91                 Paraguay
93              Philippines
96                    Qatar
98                   Russia
100            Saudi Arabia
107            South Africa
110               Sri Lanka
115                   Syria
116              Tajikistan
118                Thailand
121     Trinidad and Tobago
122                 Tunisia
123                  Turkey
125                 Ukraine
126    United Arab Emirates
130                 Vietnam
Name: Country, dtype: object

Cluster 2 contains mostly European countries, along with other countries such as Australia, Canada, and New Zealand.

country_names[kmeans.labels_ == 2]
1                  Albania
6                Australia
7                  Austria
11                Barbados
12                 Belgium
13                  Belize
17                Bulgaria
23                  Canada
24                   Chile
28              Costa Rica
29                 Croatia
30                  Cyprus
31          Czech Republic
33                 Denmark
36                 Estonia
39                 Finland
40                  France
43                 Georgia
44                 Germany
46                  Greece
50                 Hungary
51                 Iceland
56     Republic of Ireland
57                  Israel
58                   Italy
59                 Jamaica
66                  Latvia
68               Lithuania
69              Luxembourg
74                   Malta
75               Mauritius
83             New Zealand
87                  Norway
92                    Peru
94                  Poland
95                Portugal
97                 Romania
102                 Serbia
104              Singapore
105               Slovakia
106               Slovenia
108            South Korea
109                  Spain
112               Suriname
113                 Sweden
114            Switzerland
127         United Kingdom
129                Uruguay
Name: Country, dtype: object

China, India, and the United States make up their own cluster.

country_names[kmeans.labels_ == 3]
25             China
52             India
128    United States
Name: Country, dtype: object

The map below shows which countries are assigned to each cluster. Interestingly, the clustering seems to have some geographic meaning. Countries close together on the map tend to belong to the same cluster.

import plotly.express as px
import numpy as np

dat = pd.DataFrame({'country_names': country_names, 'cluster': np.array([str(lab) for lab in kmeans.labels_])})

fig = px.choropleth(dat, locations="country_names",
                    locationmode='country names',
                    color="cluster", 
                    color_discrete_sequence=["#D81B60","#1E88E5","#FFC107","#004D40"],
                    hover_name="country_names",
                    category_orders={"cluster":["0","1","2","3"]})
fig.update_layout(legend_title_text='Cluster Membership')

fig.show()

Disadvantages of K-Means Clustering#

As we discussed previously, k-means clustering has several disadvantages. It does not always converge to a solution that provides the global minimum within-cluster variation. Because of this, it can also give differing solutions depending on the initial starting points. In addition, the k-means algorithm requires the user to specify the number of clusters, which may not always be obvious, especially for data with high dimensionality. In the next section, we will discuss another clustering method that does not require you to specify a number of clusters: hierarchical clustering.