Diabetes is a chronic medical condition that affects millions of people worldwide. The management and prediction of diabetes are critical for improving patient outcomes. In this blog, we'll explore the Diabetes dataset, a classic dataset used in machine learning, to predict the onset of diabetes based on diagnostic measures. We'll use Python and its powerful data science libraries to analyze and model this data.
Diabetes Dataset with Python
The Diabetes dataset, also known as the Pima Indians Diabetes dataset, consists of 768 observations of female patients of Pima Indian heritage. The dataset includes several medical predictor variables and one target variable, indicating whether a patient has diabetes. The features include:
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skinfold thickness (mm)
Insulin: 2-Hour serum insulin (μU/mL)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 if non-diabetic, 1 if diabetic)
Loading the Dataset
First, we'll load the dataset using the pandas library, which is commonly used for data manipulation and analysis in Python.
import pandas as pd
# Load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv', header=None)
df.columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
# Display the first few rows
print(df.head())
Output of the above code:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1
DiabetesPedigreeFunction Age Outcome
0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1
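As a quick sanity check (a minimal sketch, assuming the file loaded without issues), we can confirm that the dimensions and data types match the description above: 768 rows and 9 columns.
# Confirm the dataset shape matches the description (768 observations, 9 columns)
print(df.shape)
# Inspect column data types and non-null counts
df.info()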
Data Exploration and Visualization
Understanding the distribution and relationships between features is essential. Let's explore the data through descriptive statistics and visualizations.
Descriptive statistics in pandas
Descriptive statistics in pandas provide a quick summary of the central tendency, dispersion, and shape of a dataset's distribution. By using the .describe() method, you can obtain key statistics such as mean, median, standard deviation, minimum, and maximum values for numerical columns. This summary helps in understanding the data's overall distribution and identifying any potential anomalies or outliers.
import seaborn as sns
import matplotlib.pyplot as plt
# Summary statistics
print(df.describe())
Output of the above code:
Pregnancies Glucose BloodPressure SkinThickness Insulin \
count 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479
std 3.369578 31.972618 19.355807 15.952218 115.244002
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000
75% 6.000000 140.250000 80.000000 32.000000 127.250000
max 17.000000 199.000000 122.000000 99.000000 846.000000
BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000
mean 31.992578 0.471876 33.240885 0.348958
std 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.078000 21.000000 0.000000
25% 27.300000 0.243750 24.000000 0.000000
50% 32.000000 0.372500 29.000000 0.000000
75% 36.600000 0.626250 41.000000 1.000000
max 67.100000 2.420000 81.000000 1.000000
Pairplot in Seaborn
A pairplot in Seaborn is a powerful visualization tool that creates a grid of scatter plots for each pair of features in a dataset, along with histograms for each feature. It helps in visualizing the relationships and distributions between multiple variables, making it easy to detect patterns, correlations, and potential outliers. Additionally, pairplots can include different hues to distinguish between categorical outcomes, providing deeper insights into data separation based on classes.
# Pairplot to visualize relationships
sns.pairplot(df, hue='Outcome')
plt.show()
Output of the above code: a grid of pairwise scatter plots and per-feature histograms, colored by Outcome (figure not reproduced here).
Correlation heatmap in seaborn
A correlation heatmap in Seaborn is a visual representation of the correlation matrix of numerical features in a dataset. It uses color gradients to indicate the strength and direction of relationships between variables, with darker or lighter shades showing stronger correlations. This tool helps quickly identify highly correlated features, which can be useful for feature selection and understanding the data's structure.
# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()
Output of the above code: a heatmap of pairwise feature correlations, annotated with correlation coefficients (figure not reproduced here).
Data Preprocessing in Python
Before modeling, we need to handle missing values and scale the features. In this dataset, missing measurements are recorded as zeros in several columns (for example, a Glucose or BMI of 0), which is not physiologically plausible.
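As a quick check (a minimal sketch; zero_cols is just a throwaway list of the affected columns), we can count how many zero entries each of these measurement columns contains:
# Count zero entries in columns where a value of zero is not physiologically plausible
zero_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
print((df[zero_cols] == 0).sum())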
Data Cleaning & Preprocessing in sklearn
Data preprocessing in scikit-learn involves transforming raw data into a format suitable for modeling. It includes tasks such as handling missing values, scaling features to ensure they have similar ranges, and encoding categorical variables into numerical formats. Effective preprocessing is crucial for improving model performance and ensuring accurate predictions.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np
# Replace zero values with NaN for specific columns
columns_to_replace = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[columns_to_replace] = df[columns_to_replace].replace(0, np.nan)
# Fill NaN values with column means
df.fillna(df.mean(), inplace=True)
# Split the data into training and testing sets
X = df.drop('Outcome', axis=1)
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
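To verify the scaling worked as intended (a minimal sketch using the arrays created above), we can check that each training feature now has a mean of roughly 0 and a standard deviation of roughly 1:
# Each scaled training column should have mean ~0 and standard deviation ~1
print(np.round(X_train.mean(axis=0), 3))
print(np.round(X_train.std(axis=0), 3))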
Building a Classification Model - Logistic Regression
We'll build a classification model using Logistic Regression, a simple yet effective algorithm for binary classification tasks.
Logistic Regression in sklearn
Logistic Regression in scikit-learn is a popular algorithm for binary classification tasks, predicting probabilities of class membership using a logistic function. It models the relationship between features and the binary target variable through a linear decision boundary.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
# Initialize and train the model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
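Logistic Regression also exposes class-membership probabilities, which is useful if you later want to adjust the decision threshold. A minimal sketch using the model trained above (y_proba is just an illustrative variable name):
# Predicted probability of the positive class (Outcome = 1) for the first few test samples
y_proba = model.predict_proba(X_test)[:, 1]
print(y_proba[:5])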
Model Evaluation in sklearn
The scikit-learn classification report provides detailed performance metrics for a classification model, including precision, recall, F1-score, and support for each class. Precision measures the accuracy of positive predictions, recall assesses the model's ability to identify all relevant instances, and the F1-score balances precision and recall. The support indicates the number of true instances for each class, providing context for the metrics.
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')
# Classification report
print(classification_report(y_test, y_pred))
Output of the above code:
Accuracy: 75.32%
              precision    recall  f1-score   support

           0       0.80      0.83      0.81        99
           1       0.67      0.62      0.64        55

    accuracy                           0.75       154
   macro avg       0.73      0.72      0.73       154
weighted avg       0.75      0.75      0.75       154
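A confusion matrix complements these metrics by showing the raw counts of correct and incorrect predictions for each class. A minimal sketch using scikit-learn's confusion_matrix (the exact counts depend on the split above):
from sklearn.metrics import confusion_matrix
# Rows are actual classes (0, 1); columns are predicted classes (0, 1)
print(confusion_matrix(y_test, y_pred))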
Feature Importance and Model Interpretation
Feature importance helps in understanding the contribution of each feature to the model's predictions. Because the features were standardized, the magnitudes of the Logistic Regression coefficients give a rough sense of relative importance; other models, such as Random Forests, can offer a clearer picture.
# For logistic regression, feature importance can be inferred from the coefficients
coefficients = pd.DataFrame({"Feature": X.columns, "Coefficient": model.coef_[0]})
coefficients['Absolute Coefficient'] = coefficients['Coefficient'].abs()
coefficients = coefficients.sort_values(by='Absolute Coefficient', ascending=False)
# Display feature importance
print(coefficients)
Output of the above code:
                    Feature  Coefficient  Absolute Coefficient
1                   Glucose     1.083488              1.083488
5                       BMI     0.679410              0.679410
7                       Age     0.394747              0.394747
0               Pregnancies     0.224880              0.224880
6  DiabetesPedigreeFunction     0.199964              0.199964
2             BloodPressure    -0.145412              0.145412
4                   Insulin    -0.097142              0.097142
3             SkinThickness     0.068521              0.068521
Alternatively, use Random Forest for feature importance
In sklearn, Random Forest can be used to assess feature importance by measuring how much each feature contributes to reducing impurity across all trees in the forest. This is done by aggregating the decrease in Gini impurity (or entropy) attributable to each feature across the ensemble of decision trees. Feature importance values from a Random Forest model help identify which features are most influential in making predictions.
from sklearn.ensemble import RandomForestClassifier
# Train a Random Forest model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
# Get feature importances
importances = rf_model.feature_importances_
indices = np.argsort(importances)[::-1]
# Print feature importance ranking
print("Feature ranking:")
for f in range(X_train.shape[1]):
    print(f"{f + 1}. Feature {X.columns[indices[f]]} ({importances[indices[f]]})")
Output of the above code:
Feature ranking:
1. Feature Glucose (0.2574371360871144)
2. Feature BMI (0.16682723948882405)
3. Feature Age (0.131210912998538)
4. Feature DiabetesPedigreeFunction (0.11896641692795754)
5. Feature Insulin (0.09398375802944853)
6. Feature BloodPressure (0.08419015561804571)
7. Feature SkinThickness (0.07397265325818526)
8. Feature Pregnancies (0.0734117275918865)
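To make this ranking easier to scan, we can plot the importances as a horizontal bar chart. This is a minimal sketch that reuses the matplotlib import from earlier; importance_df is just an illustrative helper name:
# Plot Random Forest feature importances, largest at the top
importance_df = pd.DataFrame({"Feature": X.columns, "Importance": importances})
importance_df = importance_df.sort_values(by="Importance", ascending=True)
plt.figure(figsize=(8, 6))
plt.barh(importance_df["Feature"], importance_df["Importance"])
plt.xlabel("Importance")
plt.title("Random Forest Feature Importance")
plt.show()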
Conclusion
In this blog, we delved into the Diabetes dataset, utilizing Python's powerful libraries to explore, preprocess, and model the data. We examined the dataset's characteristics, performed essential data cleaning, and built a Logistic Regression model to predict diabetes outcomes. Additionally, we discussed how Random Forests can reveal feature importance, enhancing our understanding of which factors most significantly influence predictions. This analysis demonstrates the potential of machine learning in healthcare, providing valuable insights that could lead to improved diagnostic tools and better management of diabetes. By leveraging these techniques, we pave the way for more accurate and effective medical decision-making.
Dive into the Diabetes dataset to experience firsthand how advanced algorithms can drive meaningful innovations in medical diagnostics.