The k-Nearest Neighbors (kNN) algorithm is a straightforward yet powerful method used for classification and regression tasks in machine learning. Its simplicity lies in its approach: it classifies a data point based on the majority class among its k nearest neighbors. In this blog, we will walk through the implementation of kNN on the Diabetes dataset using Python, showcasing how to apply this algorithm to a real-world medical dataset.
k-Nearest Neighbors (kNN) in Python
k-Nearest Neighbors (kNN) is a versatile and intuitive machine learning algorithm used for classification and regression tasks. In Python, the scikit-learn library provides a robust implementation of kNN through the KNeighborsClassifier and KNeighborsRegressor classes. The kNN algorithm operates on the principle that similar data points lie close to each other in the feature space. For classification, kNN assigns a class label to a data point based on the majority class among its k nearest neighbors, where k is a user-defined parameter. The distance between points is typically measured using metrics like Euclidean or Manhattan distance. In Python, after importing the necessary libraries, you load and preprocess your dataset, split it into training and testing sets, and initialize the kNN model. The model is then trained on the training data using the fit() method, and predictions are made on the test data using predict(). The performance of the kNN model can be evaluated using metrics such as accuracy and the classification report. kNN's simplicity and its ability to capture non-linear decision boundaries make it a powerful tool for many machine learning applications, though distances become less informative in very high-dimensional spaces (the curse of dimensionality), which is why feature scaling and dimensionality reduction often help.
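To make the majority-vote idea concrete, here is a minimal from-scratch sketch of a single kNN prediction in NumPy. The names knn_predict, X_toy, and y_toy are illustrative only; the rest of this post uses scikit-learn's optimized implementation.

import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of the k nearest neighbors
    return np.bincount(y_train[nearest]).argmax()

# Toy data: two small clusters of 2D points
X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y_toy = np.array([0, 0, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([0.9, 1.0]), k=3))  # prints 1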
The Diabetes Dataset in Python - sklearn
The Diabetes dataset, available in Python through the scikit-learn library via load_diabetes, is a well-known dataset used in medical research and machine learning. It contains baseline physiological measurements for 442 diabetes patients, and the target is a quantitative measure of disease progression one year after baseline. (Note that this is not the Pima Indians Diabetes dataset; scikit-learn's version has a continuous target and pre-standardized features.) Because the target is continuous, the dataset is naturally a regression problem, but it can be turned into a classification task, as we do below, by thresholding the target at its median.
Features of the Diabetes Dataset
The dataset consists of the following ten baseline features, each already mean-centered and scaled by scikit-learn (a snippet for inspecting them follows this list):
age: Age of the patient.
sex: Sex of the patient.
bmi: Body mass index.
bp: Average blood pressure.
s1 (tc): Total serum cholesterol.
s2 (ldl): Low-density lipoproteins.
s3 (hdl): High-density lipoproteins.
s4 (tch): Total cholesterol / HDL ratio.
s5 (ltg): Possibly the log of serum triglycerides level.
s6 (glu): Blood sugar level.
Target: A quantitative measure of disease progression one year after baseline; we convert it to a binary outcome (1 if above the median, 0 otherwise) for classification.
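As a quick check, you can inspect the feature names and array shapes directly from scikit-learn; this minimal sketch simply prints the metadata bundled with the dataset:

from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
print(diabetes.feature_names)  # ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
print(diabetes.data.shape)     # (442, 10)
print(diabetes.target.shape)   # (442,)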
k-Nearest Neighbors (kNN) on the Diabetes Dataset in Python: Step-by-Step Implementation
Implementing k-Nearest Neighbors (kNN) on the Diabetes dataset in Python involves loading the dataset, converting the continuous target to a binary label, splitting the data into training and testing sets, scaling the features, and training the kNN model using KNeighborsClassifier from scikit-learn. After training, the model's performance is evaluated using metrics such as accuracy and the classification report. This step-by-step process shows how kNN can predict diabetes outcomes from the baseline medical features.
Import Libraries
Begin by importing the required libraries for data manipulation, modeling, evaluation, and plotting.
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
Load and Prepare the Data
Load the Diabetes dataset and preprocess it. For simplicity, we convert the continuous target (disease progression) into a binary classification problem: 1 if a patient's score is above the median, 0 otherwise.
# Load the Diabetes dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
# Convert the target variable to binary classification
y_binary = (y > np.median(y)).astype(int)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.3, random_state=42)
# Scale data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
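Since the median split should give roughly balanced classes, a quick sanity check of the label counts is worthwhile. A minimal sketch (the exact counts depend on the random split):

# Sanity check: the median split should give roughly balanced classes
print(np.bincount(y_train))  # class counts in the training set
print(np.bincount(y_test))   # class counts in the test set (here [72 61])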
Initialize and Train the k-Nearest Neighbors (kNN) Model
Create a kNN classifier, train it on the training data, and make predictions on the test data.
# Initialize the kNN classifier
k = 5 # Number of neighbors
knn = KNeighborsClassifier(n_neighbors=k)
# Train the model
knn.fit(X_train, y_train)
# Make predictions on the test set
y_pred = knn.predict(X_test)
Evaluate the Model
Assess the performance of the kNN model using accuracy and other relevant metrics.
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')
print('Classification Report:')
print(classification_report(y_test, y_pred))
Output for the above code:
Accuracy: 70.68%
Classification Report:
              precision    recall  f1-score   support

           0       0.73      0.72      0.73        72
           1       0.68      0.69      0.68        61

    accuracy                           0.71       133
   macro avg       0.70      0.71      0.71       133
weighted avg       0.71      0.71      0.71       133
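Beyond accuracy, it is often worth printing a confusion matrix, especially with medical data where false negatives and false positives carry different costs. A minimal sketch using scikit-learn's confusion_matrix:

from sklearn.metrics import confusion_matrix

# Rows are true classes (0, 1); columns are predicted classes
print(confusion_matrix(y_test, y_pred))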
Optimize the k-Nearest Neighbors (kNN) Model
To find the best k value, you can perform a cross-validated grid search that tests different values of k and selects the one that gives the best average validation performance.
from sklearn.model_selection import GridSearchCV
# Define a range of k values to test
param_grid = {'n_neighbors': list(range(1, 21))}
# Initialize GridSearchCV
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
# Fit the model
grid_search.fit(X_train, y_train)
# Get the best k value
best_k = grid_search.best_params_['n_neighbors']
print(f'Best k value: {best_k}')
Output for the above code:
Best k value: 19
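Because GridSearchCV refits the best model on the whole training set by default (refit=True), you can evaluate the tuned classifier on the held-out test set directly. A minimal sketch (the resulting score will depend on your split):

# GridSearchCV (with refit=True, the default) retrains the best model on all of X_train
best_knn = grid_search.best_estimator_
y_pred_best = best_knn.predict(X_test)
print(f'Test accuracy with k={best_k}: {accuracy_score(y_test, y_pred_best) * 100:.2f}%')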
Plot Predictions - k-Nearest Neighbors (kNN)
The following code produces a plot showing the decision boundary created by a kNN model on the Diabetes dataset, with the data points color-coded by class label. Because the original data has ten features, it is first projected onto two principal components with PCA, and a fresh kNN model is trained on this 2D projection purely for visualization; this differs from the full 10-feature model evaluated above, but it helps illustrate how the kNN algorithm carves up the feature space when classifying points.
# Initialize PCA with 2 components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_train)
k = 5 # Number of neighbors
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_reduced, y_train)
# Create a mesh grid
x_min, x_max = X_reduced[:, 0].min() - 1, X_reduced[:, 0].max() + 1
y_min, y_max = X_reduced[:, 1].min() - 1, X_reduced[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                     np.arange(y_min, y_max, 0.01))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot decision boundary
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.coolwarm)
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_train, edgecolor='k', s=20, cmap=plt.cm.coolwarm)
plt.title('k-Nearest Neighbors (kNN) Decision Boundary on Diabetes Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
Output for the above code: a scatter plot of the PCA-projected training points, colored by class, over the shaded decision regions learned by kNN.
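As a side note, you can gauge how much signal survives the 2D projection by scoring the visualization model on the PCA-transformed test set. A minimal sketch (expect this to differ from the full 10-feature model's accuracy):

# Project the test set with the PCA already fitted on the training data
X_test_reduced = pca.transform(X_test)
print(f'Test accuracy of the 2D visualization model: {knn.score(X_test_reduced, y_test) * 100:.2f}%')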
Full Code for Implementing k-Nearest Neighbors (kNN) on the Diabetes Dataset in Python
The full code for implementing k-Nearest Neighbors (kNN) on the Diabetes dataset in Python brings together the steps above. First, the necessary libraries (numpy, pandas, scikit-learn, and matplotlib) are imported. The Diabetes dataset is loaded, and the continuous target variable is converted into a binary classification problem for simplicity. The dataset is then split into training and testing sets, and the features are standardized using StandardScaler to ensure uniform scaling. A kNN classifier is initialized with a specified number of neighbors (k), trained on the training data, and evaluated on the test set; GridSearchCV is then used to tune k. Finally, to visualize the model's behavior, the training data is reduced to two dimensions using Principal Component Analysis (PCA), a kNN classifier is fitted on the reduced data, and predictions over a mesh grid are used to plot the decision boundary together with the data points. This code demonstrates the application of kNN to a real-world dataset and highlights the model's decision-making process through visualization.
Here’s the complete code:
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
# Load the Diabetes dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
# Convert the target variable to binary classification
y_binary = (y > np.median(y)).astype(int)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.3, random_state=42)
# Scale data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Initialize the kNN classifier
k = 5 # Number of neighbors
knn = KNeighborsClassifier(n_neighbors=k)
# Train the model
knn.fit(X_train, y_train)
# Make predictions on the test set
y_pred = knn.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')
print('Classification Report:')
print(classification_report(y_test, y_pred))
# Define a range of k values to test
param_grid = {'n_neighbors': list(range(1, 21))}
# Initialize GridSearchCV
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
# Fit the model
grid_search.fit(X_train, y_train)
# Get the best k value
best_k = grid_search.best_params_['n_neighbors']
print(f'Best k value: {best_k}')
# Reduce the training data to 2 components with PCA for visualization
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_train)
k = 5 # Number of neighbors
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_reduced, y_train)
# Create a mesh grid
x_min, x_max = X_reduced[:, 0].min() - 1, X_reduced[:, 0].max() + 1
y_min, y_max = X_reduced[:, 1].min() - 1, X_reduced[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                     np.arange(y_min, y_max, 0.01))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot decision boundary
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.coolwarm)
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_train, edgecolor='k', s=20, cmap=plt.cm.coolwarm)
plt.title('k-Nearest Neighbors (kNN) Decision Boundary on Diabetes Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
Conclusion
Implementing k-Nearest Neighbors (kNN) on the Diabetes dataset in Python showcases the effectiveness of this algorithm in classification tasks. By following a structured approach—loading and preprocessing the data, standardizing features, reducing dimensionality with PCA, training the kNN model, and visualizing the decision boundary—this process highlights kNN's capability to handle real-world datasets. The visualization provides valuable insights into how the model classifies different data points and distinguishes between classes. Overall, kNN's simplicity and intuitive nature make it a powerful tool for predictive analytics, especially in medical data analysis where understanding patterns and classifications is crucial.