EDA 4 Global Health data

The leading global health challenges in the artificial intelligence era

Machine learning (ML) is transforming global health by improving disease detection, optimizing healthcare delivery, and enhancing medical research. Here are some key applications and considerations:

Applications of ML in Global Health

Disease Diagnosis & Prediction
- ML models analyze medical images (e.g., X-rays, MRIs) to detect diseases like tuberculosis, cancer, and pneumonia.
- Predictive models help identify disease outbreaks and assess patient risks for conditions like diabetes or cardiovascular disease.

Epidemiology & Pandemic Response
- ML helps track and predict the spread of infectious diseases (e.g., COVID-19, malaria, dengue).
- AI-driven models analyze social media, hospital data, and environmental factors to detect early warning signs of outbreaks.

Drug Discovery & Treatment Personalization
- ML accelerates drug discovery by analyzing molecular data to identify potential new drugs.
- Personalized medicine tailors treatments based on genetic and clinical data.

Healthcare Access & Telemedicine
- AI chatbots and virtual assistants provide medical advice in remote areas with limited healthcare access.
- ML optimizes supply chain management for vaccines and medicines, reducing shortages.

Medical Imaging & Radiology
- AI-powered imaging tools assist radiologists in detecting abnormalities with high accuracy.
- ML enhances image segmentation and analysis in pathology, dermatology, and ophthalmology.

Mental Health & Well-being
- ML-powered apps analyze speech, text, and behavior patterns to detect signs of depression and anxiety.
- AI chatbots provide mental health support where human therapists are scarce.

Challenges & Ethical Considerations

Data Privacy & Security: Handling sensitive patient data responsibly to prevent misuse.
Bias in AI Models: ML models can inherit biases from training data, leading to disparities in healthcare outcomes.
Infrastructure & Accessibility: Low-income regions may lack the digital infrastructure needed for ML implementation.
Regulation & Trust: Ensuring ML applications comply with healthcare standards and ethical guidelines.

Beginner-Friendly Data Analysis Notebook

This notebook is designed for beginners and new learners who want to develop their data analysis and visualization skills using Python. Follow along, learn, and take small steps towards building confidence in handling datasets!

Each task has a partial solution to guide you, and you are encouraged to complete the remaining parts.

⚠️ Important Warning:

Before seeking help from tools like ChatGPT or online resources, try using your own skills, logic, and prior learning, or even ask your friends. Only use external help if you're truly stuck after a real effort. Practice makes progress!

Instructions:

Carefully read the task descriptions.
Complete the "Your Code Here" sections.
Execute the cells step-by-step to debug and validate your answers.
Reflect on your learning at the end.

Let's get started!

# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import warnings to ignore
import warnings
warnings.filterwarnings('ignore')

# Load the dataset
print("The dataset is loading...\n")
dataset_path = 'Global_Health_Statistics.csv'
df = pd.read_csv(dataset_path)
df.head()

Output

The dataset is loading...

   Country    Year  Disease Name         Disease Category  Prevalence Rate (%)  Incidence Rate (%)  Mortality Rate (%)  Age Group  Gender  Population Affected  ...  Hospital Beds per 1000  Treatment Type  Average Treatment Cost (USD)  Availability of Vaccines/Treatment  Recovery Rate (%)  DALYs  Improvement in 5 Years (%)  Per Capita Income (USD)  Education Index  Urbanization Rate (%)
0  Italy      2013  Malaria              Respiratory       0.95                 1.55                8.42                0-18       Male    471007               ...  7.58                    Medication      21064                         No                                  91.82              4493   2.16                        16886                    0.79             86.02
1  France     2002  Ebola                Parasitic         12.46                8.63                8.75                61+        Male    634318               ...  5.11                    Surgery         47851                         Yes                                 76.65              2366   4.82                        80639                    0.74             45.52
2  Turkey     2015  COVID-19             Genetic           0.91                 2.35                6.22                36-60      Male    154878               ...  3.49                    Vaccination     27834                         Yes                                 98.55              41     5.81                        12245                    0.41             40.20
3  Indonesia  2011  Parkinson's Disease  Autoimmune        4.68                 6.29                3.99                0-18       Other   446224               ...  8.44                    Surgery         144                           Yes                                 67.35              3201   2.22                        49336                    0.49             58.47
4  Italy      2013  Tuberculosis         Genetic           0.83                 13.59               7.01                61+        Male    472908               ...  5.90                    Medication      8908                          Yes                                 50.06              2832   6.93                        47701                    0.50             48.14

5 rows × 22 columns

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
df.head(3)

Output

    Country   Year   Disease   Category      Prev Rate (%)   Inc Rate (%)   Mort Rate (%)   Age     Gender   Pop Affected   Health Access (%)   Doctors/1000   Beds/1000   Treatment      Avg Cost (USD)   Vaccines Avail   Recovery (%)   DALYs    5yr Improvement (%)    Income (USD)   Edu Index   Urban Rate (%)
0   Italy     2013   Malaria   Respiratory   0.95            1.55           8.42            0-18    Male     471007         57.74               3.34           7.58        Medication     21064            No               91.82          4493     2.16                   16886          0.79        86.02
1   France    2002   Ebola     Parasitic     12.46           8.63           8.75            61+     Male     634318         89.21               1.33           5.11        Surgery        47851            Yes              76.65          2366     4.82                   80639          0.74        45.52
2   Turkey    2015   COVID-19  Genetic       0.91            2.35           6.22            36-60   Male     154878         56.41               4.07           3.49        Vaccination    27834            Yes              98.55          41       5.81                   12245          0.41        40.20

print("Dataset shape and size:")
print(df.shape, df.size)

Output

Dataset shape and size:
(1000000, 22) 22000000

print("Dataset columns:")
print(df.columns)

Output

Dataset columns:
Index(['Country', 'Year', 'Disease', 'Category', 'Prev Rate (%)', 'Inc Rate (%)', 'Mort Rate (%)', 'Age', 'Gender', 'Pop Affected', 'Health Access (%)', 'Doctors/1000', 'Beds/1000', 'Treatment', 'Avg Cost (USD)', 'Vaccines Avail', 'Recovery (%)', 'DALYs', '5yr Improvement (%)', 'Income (USD)', 'Edu Index', 'Urban Rate (%)'], dtype='object')

EDA

print("\n### Group by the country ###")
grouped_by_country = df.groupby('Country')
print(grouped_by_country.size())

Output

### Group by the country ###
Country
Argentina       49798
Australia       49953
Brazil          49687
Canada          50114
China           50066
France          49943
Germany         50176
India           49760
Indonesia       49756
Italy           49839
Japan           49764
Mexico          50080
Nigeria         50046
Russia          50532
Saudi Arabia    49958
South Africa    50408
South Korea     50181
Turkey          49901
UK              50125
USA             49913
dtype: int64

print("\n### Change column names and make short, meaningful ones ### \n")
# your code here

short_names = [
    'Country', 'Year', 'Disease', 'Category', 'Prev Rate (%)', 'Inc Rate (%)',
    'Mort Rate (%)', 'Age', 'Gender', 'Pop Affected', 'Health Access (%)',
    'Doctors/1000', 'Beds/1000', 'Treatment', 'Avg Cost (USD)', 'Vaccines Avail',
    'Recovery (%)', 'DALYs', '5yr Improvement (%)', 'Income (USD)', 'Edu Index', 'Urban Rate (%)'
]

df.columns = short_names  # Assign the new names by position; the order must match the original columns

# Display the updated DataFrame
df.head(3)

Output

### Change column names and make short, meaningful ones ###

    Country   Year   Disease   Category      Prev Rate (%)   Inc Rate (%)   Mort Rate (%)   Age     Gender   Pop Affected   Health Access (%)   Doctors/1000   Beds/1000   Treatment      Avg Cost (USD)   Vaccines Avail   Recovery (%)   DALYs    5yr Improvement (%)    Income (USD)   Edu Index   Urban Rate (%)
0   Italy     2013   Malaria   Respiratory   0.95            1.55           8.42            0-18    Male     471007         57.74               3.34           7.58        Medication     21064            No               91.82          4493     2.16                   16886          0.79        86.02
1   France    2002   Ebola     Parasitic     12.46           8.63           8.75            61+     Male     634318         89.21               1.33           5.11        Surgery        47851            Yes              76.65          2366     4.82                   80639          0.74        45.52
2   Turkey    2015   COVID-19  Genetic       0.91            2.35           6.22            36-60   Male     154878         56.41               4.07           3.49        Vaccination    27834            Yes              98.55          41       5.81                   12245          0.41        40.20

There are other ways to rename columns as well; try them too.
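One such alternative, sketched here on a tiny stand-in DataFrame, is `DataFrame.rename` with an explicit old-to-new mapping. The `toy` frame and `rename_map` are illustrative names that do not appear in the notebook; the long column names are taken from the first `df.head()` output.

```python
import pandas as pd

# Tiny stand-in frame using three of the original long column names.
toy = pd.DataFrame({
    "Disease Name": ["Malaria", "Ebola"],
    "Prevalence Rate (%)": [0.95, 12.46],
    "Mortality Rate (%)": [8.42, 8.75],
})

# Explicit mapping: each old name is paired with its new name,
# so the result does not depend on the column order.
rename_map = {
    "Disease Name": "Disease",
    "Prevalence Rate (%)": "Prev Rate (%)",
    "Mortality Rate (%)": "Mort Rate (%)",
}

# rename returns a new DataFrame and leaves `toy` unchanged.
renamed = toy.rename(columns=rename_map)
print(renamed.columns.tolist())
```

Compared with assigning a full list to `df.columns`, a mapping only touches the columns it names and cannot silently mislabel a column if the order changes.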

print("\n### Dataset Information ###")
df.info()

Output

### Dataset Information ###
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 22 columns):
 #   Column               Non-Null Count    Dtype
---  ------               --------------    -----
 0   Country              1000000 non-null  object
 1   Year                 1000000 non-null  int64
 2   Disease              1000000 non-null  object
 3   Category             1000000 non-null  object
 4   Prev Rate (%)        1000000 non-null  float64
 5   Inc Rate (%)         1000000 non-null  float64
 6   Mort Rate (%)        1000000 non-null  float64
 7   Age                  1000000 non-null  object
 8   Gender               1000000 non-null  object
 9   Pop Affected         1000000 non-null  int64
 10  Health Access (%)    1000000 non-null  float64
 11  Doctors/1000         1000000 non-null  float64
 12  Beds/1000            1000000 non-null  float64
 13  Treatment            1000000 non-null  object
 14  Avg Cost (USD)       1000000 non-null  int64
 15  Vaccines Avail       1000000 non-null  object
 16  Recovery (%)         1000000 non-null  float64
 17  DALYs                1000000 non-null  int64
 18  5yr Improvement (%)  1000000 non-null  float64
 19  Income (USD)         1000000 non-null  int64
 20  Edu Index            1000000 non-null  float64
 21  Urban Rate (%)       1000000 non-null  float64
dtypes: float64(10), int64(5), object(7)
memory usage: 167.8+ MB

print("\n### Checking for Missing Values ###")
print(df.isnull().sum())

Output

### Checking for Missing Values ###
Country                0
Year                   0
Disease                0
Category               0
Prev Rate (%)          0
Inc Rate (%)           0
Mort Rate (%)          0
Age                    0
Gender                 0
Pop Affected           0
Health Access (%)      0
Doctors/1000           0
Beds/1000              0
Treatment              0
Avg Cost (USD)         0
Vaccines Avail         0
Recovery (%)           0
DALYs                  0
5yr Improvement (%)    0
Income (USD)           0
Edu Index              0
Urban Rate (%)         0
dtype: int64

Task 1: Explore High Treatment Costs

We want to identify disease categories where the average treatment cost exceeds $5,000. The following code calculates the average treatment cost for each disease category. Complete the rest to filter the results.

print("\n### High Treatment Costs ###")
avg_treatment_cost = df.groupby('Category')['Avg Cost (USD)'].mean()

# Keep only the categories whose average treatment cost exceeds $5,000
high_cost_diseases = avg_treatment_cost[avg_treatment_cost > 5000]
print(high_cost_diseases)

Output

### High Treatment Costs ###
Category
Autoimmune        25124.453381
Bacterial         24958.060657
Cardiovascular    25019.609335
Chronic           25019.360949
Genetic           24991.234814
Infectious        25021.655117
Metabolic         24964.149170
Neurological      25017.514703
Parasitic         24972.076740
Respiratory       25066.355632
Viral             24959.072181
Name: Avg Cost (USD), dtype: float64

Task 2: Visualize Healthcare Access vs Mortality Rate

We will create a scatter plot to visualize the relationship between Healthcare Access (%) and Mortality Rate (%), with points colored by disease category.

This task helps you understand whether higher healthcare access correlates with lower mortality rates.
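A scatter plot shows the pattern visually; a correlation coefficient puts a number on it. The sketch below uses a small made-up table (the values are hypothetical and chosen so that mortality falls in a straight line as access rises), not the real dataset.

```python
import pandas as pd

# Hypothetical toy values, chosen only so the example runs on its own.
toy = pd.DataFrame({
    "Health Access (%)": [50.0, 60.0, 70.0, 80.0, 90.0],
    "Mort Rate (%)": [9.0, 7.5, 6.0, 4.5, 3.0],
})

# Pearson correlation: -1 is a perfect negative linear relationship,
# 0 is no linear relationship, +1 is a perfect positive one.
r = toy["Health Access (%)"].corr(toy["Mort Rate (%)"])
print(f"Pearson r: {r:.2f}")
```

Because the toy data is perfectly linear and decreasing, `r` comes out at -1 here. Running the same `.corr` call on the corresponding columns of `df` would show how strong the relationship is in the actual data.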

print("\n### Healthcare Access vs Mortality Rate ###")

# Scatter Plot
plt.figure(figsize=(12, 6))
sns.scatterplot(
    x='Health Access (%)',
    y='Mort Rate (%)',
    data=df,
    hue='Category',
    palette='YlOrRd'
)
plt.title("Healthcare Access vs Mortality Rate")
plt.xlabel("Healthcare Access (%)")
plt.ylabel("Mortality Rate (%)")
plt.show()

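Output

[Figure: scatter plot "Healthcare Access vs Mortality Rate"; x-axis: Healthcare Access (%), y-axis: Mortality Rate (%), points colored by Category]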

Task 3: Identify Top 5 Most Prevalent Diseases

Group the data by Disease Name and calculate the total prevalence rate. Find the top 5 diseases with the highest prevalence rate.

print("\n### Top 5 Most Prevalent Diseases ###")
prevalence = df.groupby('Disease')['Prev Rate (%)'].sum().sort_values(ascending=False)
top_5_diseases = prevalence.head(5)
print(top_5_diseases)

Output

### Top 5 Most Prevalent Diseases ###
Disease
Cholera     506925.25
HIV/AIDS    506703.20
COVID-19    506447.35
Dengue      505964.81
Cancer      505779.18
Name: Prev Rate (%), dtype: float64
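Summing a percentage across rows produces a score that grows with the number of rows a disease has, so it can be worth comparing the ranking with one based on the mean. The sketch below uses a made-up four-row table (the disease labels "A" and "B" are hypothetical) to show how the two rankings can disagree.

```python
import pandas as pd

# Disease A has three rows with a low rate; disease B has one row with a higher rate.
toy = pd.DataFrame({
    "Disease": ["A", "A", "A", "B"],
    "Prev Rate (%)": [1.0, 1.0, 1.0, 2.0],
})

by_disease = toy.groupby("Disease")["Prev Rate (%)"]

# Sum rewards diseases that appear in more rows.
total = by_disease.sum().sort_values(ascending=False)

# Mean compares the typical rate per row.
average = by_disease.mean().sort_values(ascending=False)

print("Top by sum: ", total.index[0])
print("Top by mean:", average.index[0])
```

In this toy table the sum ranks A first (3.0 against 2.0) while the mean ranks B first (2.0 against 1.0). In the notebook's dataset the row counts per disease may be close enough that both rankings agree, but that is worth checking rather than assuming.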

Step 1: Prepare the Data

We will use the following columns:

'Healthcare Access (%)' as the feature (X)
'Mortality Rate (%)' as the target (y)

Tasks:

Drop missing values for these columns.
Split the data into training and testing sets (80-20 split).

from sklearn.model_selection import train_test_split

# Drop rows with missing values in either column
df.dropna(subset=['Health Access (%)', 'Mort Rate (%)'], inplace=True)

# Select the feature and the target
X = df[['Health Access (%)']]
y = df['Mort Rate (%)']

# Split into training and test sets (80-20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Data Prepared Successfully!")
print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)

Output

Data Prepared Successfully!
Training set size: (800000, 1)
Testing set size: (200000, 1)

Step 2: Train a Linear Regression Model

We will use Linear Regression to predict 'Mortality Rate (%)' using 'Healthcare Access (%)'.

Tasks:

Import LinearRegression from sklearn.
Initialize the model and train it using the training data.

from sklearn.linear_model import LinearRegression

# Create the model and fit it to the training data
model = LinearRegression()
model.fit(X_train, y_train)

print("Model Trained Successfully!")

Output

Model Trained Successfully!

Step 3: Evaluate the Model

We will evaluate the model's performance using:

Mean Squared Error (MSE)
R-squared (R2) score

Tasks:

Make predictions on the test set.
Calculate the MSE and R2 score.

from sklearn.metrics import mean_squared_error, r2_score

# Predict on the test set
y_pred = model.predict(X_test)

# Compute the performance metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("\n### Model Performance ###")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R2 Score): {r2:.2f}")

Output

### Model Performance ###
Mean Squared Error (MSE): 8.19
R-squared (R2 Score): -0.00
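To read these numbers, it helps to compute both metrics by hand. The sketch below uses four made-up target values and a "model" that always predicts their mean; the helper names `mse` and `r2` are illustrative, not part of scikit-learn.

```python
import numpy as np

# Hypothetical actual values, used only for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])

# A "model" that ignores its inputs and always predicts the mean.
y_mean = np.full_like(y_true, y_true.mean())

def mse(y, y_hat):
    """Mean of the squared prediction errors."""
    return float(np.mean((y - y_hat) ** 2))

def r2(y, y_hat):
    """1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

print(mse(y_true, y_mean))  # equals the variance of y_true
print(r2(y_true, y_mean))   # 0.0 for a mean-only predictor
```

An R2 of about zero therefore means the fitted line predicts no better than always guessing the average mortality rate, and in that situation the MSE is roughly the variance of the target itself.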

Step 4: Visualize Predictions vs Actual Values

Tasks:

Create a scatter plot for actual vs predicted values.
Add a line for perfect predictions (y = x).

plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, color='red', alpha=0.5)  # Draw the points in red
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'b--', lw=2)  # Dashed blue line marks perfect predictions (y = x)

plt.title("Actual vs Predicted Mortality Rate")
plt.xlabel("Actual Mortality Rate (%)")
plt.ylabel("Predicted Mortality Rate (%)")
plt.show()

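Output

[Figure: scatter plot "Actual vs Predicted Mortality Rate"; x-axis: Actual Mortality Rate (%), y-axis: Predicted Mortality Rate (%), with a dashed blue y = x reference line]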

Step 5: Challenge

You are expected to add more features (e.g., 'Prevalence Rate (%)', 'Average Treatment Cost (USD)') to improve the model.
You are expected to use a different algorithm (e.g., Decision Tree, Random Forest).
You are expected to evaluate performance using Cross-Validation.

Your Tasks:

Expand the input features.
Train a new model using your choice of algorithm.
Compare the performance with Linear Regression.
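As one possible starting point for the Random Forest option, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` are made up for illustration (they are not the health dataset), and the hyperparameter values are arbitrary assumptions rather than tuned choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: the target depends linearly on the first
# feature, plus noise; the second feature is irrelevant.
X_demo = rng.uniform(0.0, 10.0, size=(500, 2))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0.0, 1.0, size=500)

# Limiting tree depth and averaging many trees reduces overfitting
# compared with a single fully grown tree.
rf_model = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=42)

# 5-fold cross-validated R2 on the synthetic data.
cv_scores_rf = cross_val_score(rf_model, X_demo, y_demo, cv=5, scoring='r2')
print(cv_scores_rf)
print(f"Mean R2 Score: {np.mean(cv_scores_rf):.2f}")
```

On this synthetic data the forest scores well because a real signal exists. Swapping in the notebook's own feature and target columns shows whether the same holds for the health data.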

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Select the expanded feature set and drop rows with missing feature values
X_new = df[['Health Access (%)', 'Prev Rate (%)', 'Avg Cost (USD)']].dropna()

# Keep only the target rows that survived the drop so X and y stay aligned
y_new = df.loc[X_new.index, 'Mort Rate (%)']

# Split into training and test sets (80-20)
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new, y_new, test_size=0.2, random_state=42)

# Create the Decision Tree model and fit it to the training data
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train_new, y_train_new)

# Predict on the test set
y_pred_dt = dt_model.predict(X_test_new)

# Compute the performance metrics
mse_dt = mean_squared_error(y_test_new, y_pred_dt)
r2_dt = r2_score(y_test_new, y_pred_dt)

print("\n### Decision Tree Model Performance ###")
print(f"Mean Squared Error (MSE): {mse_dt:.2f}")
print(f"R-squared (R2 Score): {r2_dt:.2f}")


# Evaluate the Decision Tree with 5-fold cross-validation
cv_scores_dt = cross_val_score(dt_model, X_new, y_new, cv=5, scoring='r2')

print("\n### Cross-Validation R2 Scores (Decision Tree) ###")
print(cv_scores_dt)
print(f"Mean R2 Score: {np.mean(cv_scores_dt):.2f}")


# Cross-validation for the Linear Regression model (single-feature X and y from Step 1)
cv_scores_lr = cross_val_score(model, X, y, cv=5, scoring='r2')

print("\n### Cross-Validation R2 Scores (Linear Regression) ###")
print(cv_scores_lr)
print(f"Mean R2 Score: {np.mean(cv_scores_lr):.2f}")

# Compare the two models
print("\n### Performance Comparison ###")
print(f"Linear Regression - Mean R2: {np.mean(cv_scores_lr):.2f}")
print(f"Decision Tree - Mean R2: {np.mean(cv_scores_dt):.2f}")

Output

### Decision Tree Model Performance ###
Mean Squared Error (MSE): 16.65
R-squared (R2 Score): -1.03

### Cross-Validation R2 Scores (Decision Tree) ###
[-1.03504351 -1.0374017  -1.03589653 -1.03985854 -1.04207183]
Mean R2 Score: -1.04

### Cross-Validation R2 Scores (Linear Regression) ###
[-5.33452368e-06 -6.77185006e-06 -3.17034059e-06 -1.67578961e-05
 -1.13589043e-05]
Mean R2 Score: -0.00

### Performance Comparison ###
Linear Regression - Mean R2: -0.00
Decision Tree - Mean R2: -1.04
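One plausible reading of the Decision Tree's R2 near -1 is that the features carry little or no information about the target, so a fully grown tree ends up memorizing noise. The sketch below simulates that situation with numpy only. It assumes, purely for illustration, a target drawn at random from a uniform range; the names `y_pred_memorized` and `y_pred_mean` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumption for illustration: the target is pure noise, unrelated to any feature.
y_train = rng.uniform(0.0, 10.0, size=100_000)
y_test = rng.uniform(0.0, 10.0, size=20_000)

# A fully grown tree on uninformative features behaves roughly like returning
# the target of some arbitrary training row for each test row.
y_pred_memorized = rng.choice(y_train, size=y_test.size)

# Baseline: always predict the training mean.
y_pred_mean = np.full(y_test.size, y_train.mean())

def r2(y, y_hat):
    """1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

r2_memorized = r2(y_test, y_pred_memorized)
r2_mean = r2(y_test, y_pred_mean)
print(f"Memorizing predictor R2: {r2_memorized:.2f}")
print(f"Mean predictor R2: {r2_mean:.2f}")
```

When prediction and actual value are two independent draws from the same distribution, the expected squared error is twice the variance, which gives an R2 close to -1; the mean predictor stays close to 0. That matches the pattern in the output above, where the Decision Tree's MSE is about double the Linear Regression's. Limiting tree depth (for example with `max_depth`) is one way to check whether any real signal is present.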
