ColabCodes

By Samuel Black

Implementing k-Means Clustering on the Iris Dataset in Python

k-Means clustering is one of the simplest and most popular unsupervised machine learning algorithms. It partitions a dataset into k distinct, non-overlapping clusters based on the data's features. In this blog, we will implement k-Means clustering in Python on the Iris dataset, a classic dataset in the field of machine learning.

k-Means Clustering on the Iris Dataset in Python

k-Means clustering is an unsupervised machine learning algorithm that partitions data into k distinct clusters based on feature similarity. When applied to the Iris dataset, which consists of 150 samples of iris flowers described by four features (sepal length, sepal width, petal length, and petal width), k-Means aims to group these samples into clusters that ideally correspond to the three species: Iris-setosa, Iris-versicolor, and Iris-virginica.

Implementing k-Means clustering in Python involves several steps. First, we import the necessary libraries: NumPy, Pandas, Matplotlib, and scikit-learn. We then load the dataset using datasets.load_iris() from scikit-learn and standardize the features so that each has a mean of zero and a standard deviation of one, which keeps any single feature from dominating the distance computation. Next, we apply the k-Means algorithm using KMeans from scikit-learn, setting n_clusters=3 to match the number of species. The algorithm iteratively assigns samples to the nearest cluster centroid and updates the centroids until convergence.

To visualize the results, we use Principal Component Analysis (PCA) to reduce the dataset to two dimensions, which lets us plot the clusters and centroids. Finally, we evaluate the clustering by comparing the predicted clusters to the actual species labels using a confusion matrix, visualized as a heatmap. Because k-Means numbers its clusters arbitrarily, cluster IDs need not line up with the species order, so the matrix shows which species fall into which cluster rather than a direct accuracy score. This approach demonstrates how k-Means can uncover natural groupings in the data and give insight into the structure of the Iris dataset.
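The assign-and-update loop described above can be sketched in plain NumPy. This is only an illustrative sketch, not scikit-learn's implementation: the function name lloyd_kmeans, the random initialisation from data points, and the toy two-blob data are all assumptions made for the example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Illustrative k-Means loop (hypothetical helper, not the scikit-learn code)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct samples chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every sample
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned samples
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Toy data (an assumption for the demo): two blobs in 2-D
rng = np.random.default_rng(1)
toy = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
toy_labels, toy_centroids = lloyd_kmeans(toy, k=2)
print(toy_centroids)
```

The result depends on the starting centroids, which is why library implementations such as scikit-learn's KMeans use smarter initialisation and can repeat the procedure from several starts.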


What is the Iris Dataset in Python?

The Iris dataset is one of the most famous datasets in the field of machine learning and statistics, often used as a beginner's dataset for learning and experimentation. It consists of 150 samples of iris flowers, with each sample described by four features: sepal length, sepal width, petal length, and petal width. These features are measured in centimeters. The dataset includes three different species of iris flowers: Iris-setosa, Iris-versicolor, and Iris-virginica, with 50 samples of each species.

The Iris dataset is particularly useful because it is small, well-understood, and easy to visualize, making it an excellent choice for demonstrating various machine learning techniques. The goal when working with this dataset is often to classify the samples into their respective species based on the four features, or to use unsupervised learning techniques like clustering to discover the natural groupings within the data.

Here is a brief overview of the dataset:


Features:

  1. Sepal length (cm)

  2. Sepal width (cm)

  3. Petal length (cm)

  4. Petal width (cm)


Classes:

  1. Iris-setosa

  2. Iris-versicolor

  3. Iris-virginica


The dataset was introduced by the British statistician and biologist Ronald A. Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems," and it has since become a staple in the machine learning community. It is readily available in many machine learning libraries, including scikit-learn in Python, where it can be loaded with a simple command, making it highly accessible for practitioners and learners.
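As a quick way to confirm the overview above, the snippet below loads the dataset from scikit-learn into a Pandas DataFrame and prints its shape, the class counts, and summary statistics. The DataFrame layout and the "species" column name are choices made for this sketch. Note that scikit-learn stores the class names as setosa, versicolor, and virginica, without the "Iris-" prefix.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Feature matrix as a DataFrame, using the column names bundled with the dataset
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer targets (0, 1, 2) to the species names
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(df.shape)                      # rows x columns (4 features + species)
print(df["species"].value_counts())  # number of samples per species
print(df.describe().round(2))        # per-feature summary statistics
```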


Steps to Implement k-Means Clustering

To implement k-Means clustering, first, import necessary libraries and load the dataset. Next, standardize the data and apply the k-Means algorithm to partition the data into clusters. Finally, visualize the clusters and evaluate the results using metrics like the confusion matrix.


  1. Import Libraries

  2. Load the Dataset

  3. Preprocess the Data

  4. Apply k-Means Clustering

  5. Visualize the Results

  6. Evaluate the Clustering


Let's dive into the implementation.


1. Import Libraries

First, we need to import the necessary libraries.


# Import Libraries

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from sklearn import datasets

from sklearn.cluster import KMeans

from sklearn.decomposition import PCA

from sklearn.preprocessing import StandardScaler


2. Load the Dataset

Next, we load the Iris dataset from the sklearn library.


# Load the iris dataset

iris = datasets.load_iris()

X = iris.data

y = iris.target


3. Preprocess the Data

It is good practice to standardize each feature to mean 0 and variance 1. k-Means relies on Euclidean distance, so features with a larger numeric spread would otherwise dominate the clustering.


# Standardize the features

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)
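To make the standardization step concrete, the sketch below recomputes it by hand as z = (x - mean) / std for each column and checks that the result matches StandardScaler. The variable name X_manual is introduced only for this comparison.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X = datasets.load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# StandardScaler applies z = (x - mean) / std per column,
# using the population standard deviation (ddof=0, NumPy's default)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))   # do the two computations agree?
print(X_scaled.mean(axis=0).round(6))    # per-feature means after scaling
print(X_scaled.std(axis=0).round(6))     # per-feature standard deviations after scaling
```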


4. Apply k-Means Clustering

Now we can apply k-Means clustering to the standardized data. We will use k=3 because we know there are three species in the dataset.


# Apply k-means clustering

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

kmeans.fit(X_scaled)

y_kmeans = kmeans.predict(X_scaled)
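Here k=3 comes from prior knowledge of the species. When no such knowledge is available, a common heuristic is to compare several values of k using the inertia (within-cluster sum of squares) and the silhouette score. The sketch below is one way to do that. The range of k values and the scores dictionary are assumptions made for this example.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(datasets.load_iris().data)

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X_scaled)
    # inertia_: sum of squared distances of samples to their closest centroid
    # silhouette_score: mean silhouette coefficient, in the range [-1, 1]
    scores[k] = (model.inertia_, silhouette_score(X_scaled, labels))
    print(f"k={k}: inertia={scores[k][0]:.1f}, silhouette={scores[k][1]:.3f}")
```

Inertia generally keeps falling as k grows, so look for the point where the improvement flattens out. The silhouette score may favor a different k than the number of species, because versicolor and virginica overlap in feature space.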

5. Visualize the Results

To visualize the clustering results, we can use PCA (Principal Component Analysis) to reduce the dimensionality of the data to 2D.


# Reduce dimensions to 2D for visualization

pca = PCA(n_components=2)

X_pca = pca.fit_transform(X_scaled)


# Plot the clusters

plt.figure(figsize=(8, 6))

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_kmeans, cmap='viridis', marker='o')

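# Project the centroids into the same PCA space as the points before plotting

centers_pca = pca.transform(kmeans.cluster_centers_)

plt.scatter(centers_pca[:, 0], centers_pca[:, 1], s=300, c='red', marker='x')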

plt.title('k-Means Clustering on Iris Dataset')

plt.xlabel('PCA Component 1')

plt.ylabel('PCA Component 2')

plt.show()


Output for the above code:

[Figure: k-Means Clustering on Iris Dataset]
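A 2D PCA plot is only a faithful picture if the two components retain most of the variance. One way to check this, sketched below, is the explained_variance_ratio_ attribute of the fitted PCA object.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(datasets.load_iris().data)
pca = PCA(n_components=2).fit(X_scaled)

# Fraction of the total variance captured by each principal component
ratios = pca.explained_variance_ratio_
print(ratios.round(3))
print(f"total retained: {ratios.sum():.3f}")
```

On the standardized Iris data the first two components retain well over 90% of the variance, so the scatter plot loses little information. On higher-dimensional datasets the retained share can be much lower, and apparent overlap in the plot may not reflect overlap in the full feature space.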

6. Evaluate the Clustering

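Finally, we can evaluate how well the clusters match the actual species using a confusion matrix. Keep in mind that k-Means assigns cluster IDs arbitrarily, so cluster 0 need not correspond to setosa. Read the matrix by looking for one dominant cell in each row rather than expecting the large counts to lie on the diagonal.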


from sklearn.metrics import confusion_matrix

import seaborn as sns


# Confusion matrix

cm = confusion_matrix(y, y_kmeans)


# Plot confusion matrix

plt.figure(figsize=(8, 6))

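# Columns are raw cluster IDs; k-Means numbering is arbitrary and does not follow species order

sns.heatmap(cm, annot=True, fmt="d", cmap='Blues', xticklabels=['Cluster 0', 'Cluster 1', 'Cluster 2'], yticklabels=iris.target_names)

plt.xlabel('k-Means cluster')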

plt.ylabel('True')

plt.title('Confusion Matrix of k-Means Clustering')

plt.show()


Output for the above code:

[Figure: Confusion Matrix of k-Means Clustering]
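Because cluster IDs are arbitrary, a raw confusion matrix cannot be read as accuracy directly. One possible remedy, sketched below, is to match clusters to species with the Hungarian algorithm (scipy.optimize.linear_sum_assignment) before computing accuracy, and to report the adjusted Rand index, which does not depend on the labeling. This is an addition to the workflow above, not part of the original code. The names cluster_to_species, y_aligned, and cm_aligned are introduced here for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, confusion_matrix
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
y = iris.target
X_scaled = StandardScaler().fit_transform(iris.data)
y_kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

# Rows are species, columns are raw cluster IDs
cm = confusion_matrix(y, y_kmeans)

# Hungarian algorithm: choose the one-to-one species/cluster pairing with the most matches
species_idx, cluster_idx = linear_sum_assignment(cm, maximize=True)

# cluster_to_species[c] is the species index paired with cluster c
cluster_to_species = np.empty(3, dtype=int)
cluster_to_species[cluster_idx] = species_idx
y_aligned = cluster_to_species[y_kmeans]

cm_aligned = confusion_matrix(y, y_aligned)
accuracy = np.trace(cm_aligned) / cm_aligned.sum()
ari = adjusted_rand_score(y, y_kmeans)  # unchanged by any relabeling of the clusters

print(cm_aligned)
print(f"accuracy after alignment: {accuracy:.3f}")
print(f"adjusted Rand index: {ari:.3f}")
```

After alignment the diagonal of cm_aligned counts the samples whose cluster matches their species. On Iris, setosa is usually separated cleanly, while some versicolor and virginica samples end up in each other's clusters.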

Full Code for Implementing k-Means Clustering on the Iris Dataset in Python

The full code to implement k-Means clustering on the Iris dataset in Python involves several key steps. First, import the necessary libraries such as NumPy, Pandas, Matplotlib, and scikit-learn, and load the Iris dataset. Standardize the data using StandardScaler for improved performance, then apply the k-Means algorithm from scikit-learn with n_clusters=3 to partition the data into clusters. Finally, visualize the results using PCA to reduce the dataset to two dimensions and evaluate the clustering with a confusion matrix to compare predicted clusters against actual species labels.


import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from sklearn import datasets

from sklearn.cluster import KMeans

from sklearn.decomposition import PCA

from sklearn.preprocessing import StandardScaler

from sklearn.metrics import confusion_matrix

import seaborn as sns


# Load the iris dataset

iris = datasets.load_iris()

X = iris.data

y = iris.target


# Standardize the features

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)


# Apply k-means clustering

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

kmeans.fit(X_scaled)

y_kmeans = kmeans.predict(X_scaled)


# Reduce dimensions to 2D for visualization

pca = PCA(n_components=2)

X_pca = pca.fit_transform(X_scaled)


# Plot the clusters

plt.figure(figsize=(8, 6))

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_kmeans, cmap='viridis', marker='o')

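# Project the centroids into the same PCA space as the points before plotting

centers_pca = pca.transform(kmeans.cluster_centers_)

plt.scatter(centers_pca[:, 0], centers_pca[:, 1], s=300, c='red', marker='x')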

plt.title('k-Means Clustering on Iris Dataset')

plt.xlabel('PCA Component 1')

plt.ylabel('PCA Component 2')

plt.show()


# Confusion matrix

cm = confusion_matrix(y, y_kmeans)


# Plot confusion matrix

plt.figure(figsize=(8, 6))

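# Columns are raw cluster IDs; k-Means numbering is arbitrary and does not follow species order

sns.heatmap(cm, annot=True, fmt="d", cmap='Blues', xticklabels=['Cluster 0', 'Cluster 1', 'Cluster 2'], yticklabels=iris.target_names)

plt.xlabel('k-Means cluster')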

plt.ylabel('True')

plt.title('Confusion Matrix of k-Means Clustering')

plt.show()


Conclusion

In conclusion, k-Means clustering is a simple and intuitive method for uncovering groupings within a dataset. By implementing k-Means on the Iris dataset, we walked through the complete process from data loading and preprocessing to clustering and evaluation. Standardizing the data keeps any single feature from dominating the distance computation, while PCA makes the cluster distribution visible in two dimensions. Comparing the clusters with the actual species labels in a confusion matrix shows how closely the groupings follow the species, bearing in mind that cluster IDs are arbitrary. The same steps carry over to other datasets, although the results depend on the choice of k, the feature scaling, and how well separated the underlying groups are.


Feel free to experiment with different numbers of clusters or different datasets to see how k-Means clustering performs on other data!
