End-to-end analysis and prediction of diabetes risk using the Pima Indians dataset.
This project analyzes the Pima Indians Diabetes Dataset to identify key indicators of diabetes risk. It progresses from Exploratory Data Analysis (EDA) to building predictive models in a Jupyter Notebook.
- Data Integrity: Identified that the dataset uses `0` as a placeholder for missing values in columns where a zero reading is biologically impossible (Glucose, Blood Pressure, BMI).
- Correlation: Found strong correlations between Glucose levels and Diabetes outcome.
- Outliers: Handled missing data by imputing median values, which are less sensitive to outliers than the mean and so preserve distribution integrity.
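
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the five-row sample is made up, and the column names (`Glucose`, `BloodPressure`, `BMI`, `Outcome`) are assumed to match the dataset's headers.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample; the real dataset has more rows and columns.
df = pd.DataFrame({
    "Glucose":       [148, 85, 0, 89, 137],
    "BloodPressure": [72, 66, 64, 0, 40],
    "BMI":           [33.6, 26.6, 23.3, 28.1, 0.0],
    "Outcome":       [1, 0, 1, 0, 1],
})

# Columns where a reading of 0 cannot be a real measurement (assumed names).
zero_as_missing = ["Glucose", "BloodPressure", "BMI"]

# Step 1: turn the 0 placeholders into explicit missing values.
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)

# Step 2: fill each column's gaps with that column's median,
# computed from the non-missing readings only.
medians = df[zero_as_missing].median()
df[zero_as_missing] = df[zero_as_missing].fillna(medians)

print(df)
```

Converting the zeros to `NaN` before computing the median matters: otherwise the placeholder zeros would pull the median down and distort the imputed values.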
- Python: Pandas for data manipulation.
- Visualization: Seaborn & Matplotlib for heatmaps and distribution plots.
- Machine Learning: Scikit-Learn (Linear Regression, Random Forest, Logistic Regression).
- Clone the repository.
- Install dependencies: `pip install pandas seaborn scikit-learn`.
- Run the script `metabolic_health_eda.py`.
We trained a Logistic Regression model to predict diabetes onset based on diagnostic measures.
- Overall Accuracy: 75.32%
- Key Insight: The model is highly effective at identifying healthy patients (Precision: 0.80) but requires further tuning to improve sensitivity for detecting positive diabetic cases (Recall: 0.62).
| | Predicted Healthy | Predicted Diabetic |
|---|---|---|
| Actual Healthy | High True Negatives | Low False Positives |
| Actual Diabetic | Moderate False Negatives | Moderate True Positives |
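
One common way to trade some precision for better recall on the diabetic class is to lower the decision threshold. The sketch below illustrates the idea on synthetic data generated with `make_classification`; the sample size, feature count, class weights, and the 0.35 threshold are all assumptions chosen for illustration, and the numbers it prints are unrelated to the results reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diagnostic features (NOT the real data).
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5,
    weights=[0.65, 0.35], random_state=42,
)

# Stratify so both classes appear in the same proportion in each split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predicted probability of the positive (diabetic) class.
proba = model.predict_proba(X_test)[:, 1]

# Standard 0.5 cut-off versus a lower, illustrative cut-off of 0.35.
pred_default = (proba >= 0.5).astype(int)
pred_lower = (proba >= 0.35).astype(int)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, pred_default)
recall_default = recall_score(y_test, pred_default)
recall_lower = recall_score(y_test, pred_lower)

print(cm)
print(f"Recall at 0.50: {recall_default:.2f}")
print(f"Recall at 0.35: {recall_lower:.2f}")
```

Lowering the threshold can only keep recall the same or raise it, because every case flagged at 0.5 is still flagged at 0.35. The cost is usually more false positives, so the threshold should be chosen with the screening context in mind.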
To understand patient profiles beyond simple "Sick/Healthy" labels, we applied K-Means Clustering (k=3) to segment the population.
Key Findings:
- Cluster 0 (Metabolic Syndrome): Young patients (avg age 29) with severe obesity (BMI 39) and highest insulin resistance. This group represents the highest intervention priority.
- Cluster 1 (Older Mothers): Older patients (avg age 45) with a history of multiple pregnancies (avg 7.3). Diabetes risk here is likely age-related.
- Cluster 2 (Healthy Baseline): Young patients with lower BMI and normal glucose levels. Only 13% of this group is diabetic.
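
A segmentation like this could be produced along the following lines. This is a hedged sketch on randomly generated data, so its clusters will not reproduce the profiles above; the feature selection, the use of `StandardScaler`, and the column names are assumptions rather than the project's confirmed setup.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data; values are random, not real patient records.
df = pd.DataFrame({
    "Age": rng.integers(21, 70, size=300),
    "BMI": rng.normal(32, 7, size=300),
    "Glucose": rng.normal(120, 30, size=300),
    "Pregnancies": rng.integers(0, 12, size=300),
    "Outcome": rng.integers(0, 2, size=300),
})

# Hypothetical feature choice for clustering.
features = ["Age", "BMI", "Glucose", "Pregnancies"]

# K-Means is distance-based, so put the features on a common scale first.
X_scaled = StandardScaler().fit_transform(df[features])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df["Cluster"] = kmeans.fit_predict(X_scaled)

# Per-cluster averages in original units; the mean of the 0/1 Outcome
# column is the share of diabetic patients in each cluster.
profile = df.groupby("Cluster")[features + ["Outcome"]].mean()
print(profile.round(2))
```

The cluster numbers returned by K-Means are arbitrary, so descriptive names such as "Healthy Baseline" are assigned only after inspecting a profile table like this one.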