

By Samuel Black, ColabCodes

Implementing k-Nearest Neighbors (kNN) on the Diabetes Dataset in Python

The k-Nearest Neighbors (kNN) algorithm is a straightforward yet powerful method used for classification and regression tasks in machine learning. Its simplicity lies in its approach: it classifies a data point based on the majority class among its k nearest neighbors. In this blog, we will walk through the implementation of kNN on the Diabetes dataset using Python, showcasing how to apply this algorithm to a real-world medical dataset.


k-Nearest Neighbors (kNN) in Python

k-Nearest Neighbors (kNN) is a versatile and intuitive machine learning algorithm used for classification and regression tasks. In Python, the scikit-learn library provides a robust implementation of kNN through the KNeighborsClassifier and KNeighborsRegressor classes. The kNN algorithm operates on the principle that similar data points are likely to be close to each other in the feature space. For classification, kNN assigns a class label to a data point based on the majority class among its k nearest neighbors, where k is a user-defined parameter. The distance between points is typically measured using metrics such as Euclidean or Manhattan distance. In Python, after importing the necessary libraries, you can load and preprocess your dataset, split it into training and testing sets, and initialize the kNN model. The model is then trained on the training data using the fit() method, and predictions are made on the test data with predict(). The performance of the kNN model can be evaluated using metrics such as accuracy and classification reports. kNN's simplicity and its ability to capture non-linear decision boundaries make it a strong baseline for many machine learning applications, though, like most distance-based methods, its performance can degrade in very high-dimensional feature spaces.
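
As a minimal sketch of this API (the synthetic data here is purely illustrative and separate from the Diabetes example below), the metric parameter of KNeighborsClassifier is how you switch between distance measures such as Euclidean and Manhattan:


from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative synthetic data, not the Diabetes dataset used below
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The default metric ('minkowski' with p=2) is Euclidean; 'manhattan' uses L1 distance
knn = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # mean accuracy on the held-out split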


The Diabetes Dataset in Python - sklearn

The Diabetes dataset, available in Python through the scikit-learn library, is a well-known dataset used in medical research and machine learning. It contains ten baseline physiological measurements for 442 diabetes patients, together with a continuous target that quantifies disease progression one year after baseline. Although it is framed as a regression problem, the target can be thresholded to turn it into a binary classification task, which is the approach taken in this post.


Features of the Diabetes Dataset

The dataset consists of the following features (in the version shipped with scikit-learn, each feature column has already been mean-centered and scaled); a quick inspection snippet follows the list:


  1. age: Age of the patient (years).

  2. sex: Sex of the patient.

  3. bmi: Body mass index (weight in kg/(height in m)^2).

  4. bp: Average blood pressure.

  5. s1 (tc): Total serum cholesterol.

  6. s2 (ldl): Low-density lipoproteins.

  7. s3 (hdl): High-density lipoproteins.

  8. s4 (tch): Total cholesterol / HDL ratio.

  9. s5 (ltg): Possibly the log of the serum triglycerides level.

  10. s6 (glu): Blood sugar level.


The target is a quantitative measure of disease progression one year after baseline; in this post we binarize it at its median to obtain a balanced two-class problem.
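
A quick way to confirm these details is to load the dataset and inspect its metadata (a minimal sketch using only the documented scikit-learn attributes):


from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
print(diabetes.feature_names)  # ['age', 'sex', 'bmi', 'bp', 's1', ..., 's6']
print(diabetes.data.shape)     # (442, 10): 442 patients, 10 features
print(diabetes.target[:5])     # continuous disease-progression scores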


k-Nearest Neighbors (kNN) on the Diabetes Dataset in Python: Step-by-Step Implementation

Implementing k-Nearest Neighbors (kNN) on the Diabetes dataset in Python involves loading the dataset, preprocessing it by splitting into training and testing sets, and training the kNN model using KNeighborsClassifier from scikit-learn. After training, the model's performance is evaluated using metrics such as accuracy and classification reports. This step-by-step process highlights kNN’s effectiveness in predicting diabetes outcomes based on various medical features.


Import Libraries

Begin by importing the required libraries for data manipulation, modeling, and evaluation.


import numpy as np

import pandas as pd

from sklearn.datasets import load_diabetes

from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score, classification_report

from sklearn.decomposition import PCA

from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt


Load and Prepare the Data

Load the Diabetes dataset and preprocess it. For simplicity, we convert the continuous target variable into a binary classification problem by thresholding it at its median. We also standardize the features, since kNN's distance calculations are sensitive to feature scale.

# Load the Diabetes dataset

diabetes = load_diabetes()

X = diabetes.data

y = diabetes.target


# Convert the target variable to binary classification

y_binary = (y > np.median(y)).astype(int)


# Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.3, random_state=42)


# Scale data

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)
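
Standardization matters for kNN because the algorithm is distance-based: a feature with a large numeric range can dominate the Euclidean distance and drown out the rest. (The scikit-learn version of this dataset actually ships pre-normalized, so the scaler changes little here, but it is good practice for distance-based models in general.) A small illustrative sketch with made-up numbers:


import numpy as np

# Two hypothetical patients; feature 0 spans [0, 1], feature 1 spans [0, 300]
a = np.array([0.1, 150.0])
b = np.array([0.9, 155.0])

# Without scaling, the distance is dominated by the large-range feature:
# sqrt(0.8**2 + 5.0**2) ~= 5.06, even though feature 0 differs by 80% of its range
print(np.linalg.norm(a - b))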


Initialize and Train the k-Nearest Neighbors (kNN) Model

Create a kNN classifier, train it on the training data, and make predictions on the test data.


# Initialize the kNN classifier

k = 5  # Number of neighbors

knn = KNeighborsClassifier(n_neighbors=k)


# Train the model

knn.fit(X_train, y_train)


# Make predictions on the test set

y_pred = knn.predict(X_test)


Evaluate the Model

Assess the performance of the kNN model using accuracy and other relevant metrics.


# Evaluate the model

accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {accuracy * 100:.2f}%')

print('Classification Report:')

print(classification_report(y_test, y_pred))


Output for the above code:

Accuracy: 70.68%
Classification Report:
              precision    recall  f1-score   support

           0       0.73      0.72      0.73        72
           1       0.68      0.69      0.68        61

    accuracy                           0.71       133
   macro avg       0.70      0.71      0.71       133
weighted avg       0.71      0.71      0.71       133
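
Beyond accuracy, a confusion matrix makes the error types explicit, which matters in a medical setting where false negatives and false positives carry different costs. A minimal addition to the evaluation above (reusing y_test and y_pred):


from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]] for binary labels (0, 1)
print(confusion_matrix(y_test, y_pred))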

Optimize the k-Nearest Neighbors (kNN) Model

To find the best k value, you can perform a grid search or cross-validation to test different values of k and select the one that gives the best performance.


from sklearn.model_selection import GridSearchCV


# Define a range of k values to test

param_grid = {'n_neighbors': list(range(1, 21))}


# Initialize GridSearchCV

grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')


# Fit the model

grid_search.fit(X_train, y_train)


# Get the best k value

best_k = grid_search.best_params_['n_neighbors']

print(f'Best k value: {best_k}')


Output for the above code:

Best k value: 19
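
Because GridSearchCV refits the best configuration on the full training set by default, grid_search.best_estimator_ can be evaluated directly on the held-out test data; a short sketch (the exact score will depend on the split):


# Evaluate the refitted best model on the test set
best_knn = grid_search.best_estimator_
y_pred_best = best_knn.predict(X_test)
print(f'Test accuracy with k={best_k}: {accuracy_score(y_test, y_pred_best) * 100:.2f}%')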

Plot Predictions - k-Nearest Neighbors (kNN)

The following code produces a plot showing the decision boundary created by a kNN model trained on a two-component PCA projection of the Diabetes training data, with the data points color-coded by class label. This visualization helps to show how the kNN algorithm partitions a reduced 2D space, providing insight into the model's decision-making process.


# Initialize PCA with 2 components

pca = PCA(n_components=2)

X_reduced = pca.fit_transform(X_train)


k = 5  # Number of neighbors

knn = KNeighborsClassifier(n_neighbors=k)

knn.fit(X_reduced, y_train)


# Create a mesh grid

x_min, x_max = X_reduced[:, 0].min() - 1, X_reduced[:, 0].max() + 1

y_min, y_max = X_reduced[:, 1].min() - 1, X_reduced[:, 1].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),

                     np.arange(y_min, y_max, 0.01))

Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])

Z = Z.reshape(xx.shape)


# Plot decision boundary

plt.figure(figsize=(10, 6))

plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.coolwarm)

plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_train, edgecolor='k', s=20, cmap=plt.cm.coolwarm)

plt.title('k-Nearest Neighbors (kNN) Decision Boundary on Diabetes Dataset')

plt.xlabel('Principal Component 1')

plt.ylabel('Principal Component 2')

plt.show()


Output for the above code:

[Figure: k-Nearest Neighbors (kNN) decision boundary on the Diabetes dataset, plotted over the first two principal components]
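
Note that the boundary above belongs to a kNN model fitted in the 2D PCA space, which is a different model from the 10-feature classifier evaluated earlier. If you want PCA and kNN treated as a single unit, so new samples are projected and classified in one step, scikit-learn's Pipeline is the idiomatic tool; a brief sketch:


from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Chain the PCA projection and the kNN classifier into one estimator
pca_knn = Pipeline([
    ('pca', PCA(n_components=2)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])
pca_knn.fit(X_train, y_train)
print(pca_knn.score(X_test, y_test))  # accuracy of the kNN model in the reduced 2D space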

Full Code for Implementing k-Nearest Neighbors (kNN) on the Diabetes Dataset in Python

The full code for implementing k-Nearest Neighbors (kNN) on the Diabetes dataset in Python involves several key steps. First, necessary libraries such as numpy, pandas, scikit-learn, and matplotlib are imported. The Diabetes dataset is loaded, and the continuous target variable is converted into a binary classification problem for simplicity. The dataset is then split into training and testing sets, and the features are standardized using StandardScaler to ensure uniform scaling. A kNN classifier is initialized with a specified number of neighbors (k), trained, and evaluated, and GridSearchCV is used to select the best k. To visualize the model's behavior, the training data is reduced to two dimensions using Principal Component Analysis (PCA), a second kNN classifier is fitted on the reduced data, and predictions over a mesh grid are used to plot its decision boundary alongside the data points. This code demonstrates the application of kNN to a real-world dataset and highlights the model's decision-making process through visualization.

Here’s the complete code:


import numpy as np

import pandas as pd

from sklearn.datasets import load_diabetes

from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score, classification_report

from sklearn.decomposition import PCA

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import GridSearchCV

import matplotlib.pyplot as plt


# Load the Diabetes dataset

diabetes = load_diabetes()

X = diabetes.data

y = diabetes.target


# Convert the target variable to binary classification

y_binary = (y > np.median(y)).astype(int)


# Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.3, random_state=42)


# Scale data

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)


# Initialize the kNN classifier

k = 5  # Number of neighbors

knn = KNeighborsClassifier(n_neighbors=k)


# Train the model

knn.fit(X_train, y_train)


# Make predictions on the test set

y_pred = knn.predict(X_test)


# Evaluate the model

accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {accuracy * 100:.2f}%')

print('Classification Report:')

print(classification_report(y_test, y_pred))


# Define a range of k values to test

param_grid = {'n_neighbors': list(range(1, 21))}


# Initialize GridSearchCV

grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')


# Fit the model

grid_search.fit(X_train, y_train)


# Get the best k value

best_k = grid_search.best_params_['n_neighbors']

print(f'Best k value: {best_k}')


# Reduce the training data to two dimensions for visualization

pca = PCA(n_components=2)

X_reduced = pca.fit_transform(X_train)


k = 5  # Number of neighbors

knn = KNeighborsClassifier(n_neighbors=k)

knn.fit(X_reduced, y_train)


# Create a mesh grid

x_min, x_max = X_reduced[:, 0].min() - 1, X_reduced[:, 0].max() + 1

y_min, y_max = X_reduced[:, 1].min() - 1, X_reduced[:, 1].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),

                     np.arange(y_min, y_max, 0.01))

Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])

Z = Z.reshape(xx.shape)


# Plot decision boundary

plt.figure(figsize=(10, 6))

plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.coolwarm)

plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_train, edgecolor='k', s=20, cmap=plt.cm.coolwarm)

plt.title('k-Nearest Neighbors (kNN) Decision Boundary on Diabetes Dataset')

plt.xlabel('Principal Component 1')

plt.ylabel('Principal Component 2')

plt.show()


Conclusion

Implementing k-Nearest Neighbors (kNN) on the Diabetes dataset in Python showcases the effectiveness of this algorithm in classification tasks. By following a structured approach—loading and preprocessing the data, standardizing features, reducing dimensionality with PCA, training the kNN model, and visualizing the decision boundary—this process highlights kNN's capability to handle real-world datasets. The visualization provides valuable insights into how the model classifies different data points and distinguishes between classes. Overall, kNN's simplicity and intuitive nature make it a powerful tool for predictive analytics, especially in medical data analysis where understanding patterns and classifications is crucial.


