
EDA 4 Global Health data

The leading global health challenges in the artificial intelligence era

Machine learning (ML) is transforming global health by improving disease detection, optimizing healthcare delivery, and enhancing medical research. Here are some key applications and considerations:

Applications of ML in Global Health

  1. Disease Diagnosis & Prediction

    • ML models analyze medical images (e.g., X-rays, MRIs) to detect diseases like tuberculosis, cancer, and pneumonia.
    • Predictive models help identify disease outbreaks and assess patient risks for conditions like diabetes or cardiovascular disease.
  2. Epidemiology & Pandemic Response

    • ML helps track and predict the spread of infectious diseases (e.g., COVID-19, malaria, dengue).
    • AI-driven models analyze social media, hospital data, and environmental factors to detect early warning signs of outbreaks.
  3. Drug Discovery & Treatment Personalization

    • ML accelerates drug discovery by analyzing molecular data to identify potential new drugs.
    • Personalized medicine tailors treatments based on genetic and clinical data.
  4. Healthcare Access & Telemedicine

    • AI chatbots and virtual assistants provide medical advice in remote areas with limited healthcare access.
    • ML optimizes supply chain management for vaccines and medicines, reducing shortages.
  5. Medical Imaging & Radiology

    • AI-powered imaging tools assist radiologists in detecting abnormalities with high accuracy.
    • ML enhances image segmentation and analysis in pathology, dermatology, and ophthalmology.
  6. Mental Health & Well-being

    • ML-powered apps analyze speech, text, and behavior patterns to detect signs of depression and anxiety.
    • AI chatbots provide mental health support where human therapists are scarce.

Challenges & Ethical Considerations

  • Data Privacy & Security: Handling sensitive patient data responsibly to prevent misuse.
  • Bias in AI Models: ML models can inherit biases from training data, leading to disparities in healthcare outcomes.
  • Infrastructure & Accessibility: Low-income regions may lack the digital infrastructure needed for ML implementation.
  • Regulation & Trust: Ensuring ML applications comply with healthcare standards and ethical guidelines.

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

# import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Beginner-Friendly Data Analysis Notebook

This notebook is designed for beginners and new learners who want to develop their data analysis and visualization skills using Python. Follow along, learn, and take small steps towards building confidence in handling datasets!

Each task has a partial solution to guide you, and you are encouraged to complete the remaining parts.

⚠️ Important Warning:

Before seeking help from tools like ChatGPT or online resources, try using your own skills, logic, and prior learning, or even ask your friends. Only use external help if you're truly stuck after a real, hard effort. Practice makes progress!

Instructions:

  • Carefully read the task descriptions.
  • Complete the "Your Code Here" sections.
  • Execute the cells step-by-step to debug and validate your answers.
  • Reflect on your learning at the end.

Let's get started!

# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings to keep the notebook output clean
import warnings
warnings.filterwarnings('ignore')

# Load the dataset
print("The dataset is loading...\n")
dataset_path = 'Global_Health_Statistics.csv'
df = pd.read_csv(dataset_path)
df.head()

Output

The dataset is loading...

Country Year Disease Name Disease Category Prevalence Rate (%) Incidence Rate (%) Mortality Rate (%) Age Group Gender Population Affected ... Hospital Beds per 1000 Treatment Type Average Treatment Cost (USD) Availability of Vaccines/Treatment Recovery Rate (%) DALYs Improvement in 5 Years (%) Per Capita Income (USD) Education Index Urbanization Rate (%)
0 Italy 2013 Malaria Respiratory 0.95 1.55 8.42 0-18 Male 471007 ... 7.58 Medication 21064 No 91.82 4493 2.16 16886 0.79 86.02
1 France 2002 Ebola Parasitic 12.46 8.63 8.75 61+ Male 634318 ... 5.11 Surgery 47851 Yes 76.65 2366 4.82 80639 0.74 45.52
2 Turkey 2015 COVID-19 Genetic 0.91 2.35 6.22 36-60 Male 154878 ... 3.49 Vaccination 27834 Yes 98.55 41 5.81 12245 0.41 40.20
3 Indonesia 2011 Parkinson's Disease Autoimmune 4.68 6.29 3.99 0-18 Other 446224 ... 8.44 Surgery 144 Yes 67.35 3201 2.22 49336 0.49 58.47
4 Italy 2013 Tuberculosis Genetic 0.83 13.59 7.01 61+ Male 472908 ... 5.90 Medication 8908 Yes 50.06 2832 6.93 47701 0.50 48.14

5 rows × 22 columns

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
df.head(3)

Output


Country Year Disease Category Prev Rate (%) Inc Rate (%) Mort Rate (%) Age Gender Pop Affected Health Access (%) Doctors/1000 Beds/1000 Treatment Avg Cost (USD) Vaccines Avail Recovery (%) DALYs 5yr Improvement (%) Income (USD) Edu Index Urban Rate (%)
0 Italy 2013 Malaria Respiratory 0.95 1.55 8.42 0-18 Male 471007 57.74 3.34 7.58 Medication 21064 No 91.82 4493 2.16 16886 0.79 86.02
1 France 2002 Ebola Parasitic 12.46 8.63 8.75 61+ Male 634318 89.21 1.33 5.11 Surgery 47851 Yes 76.65 2366 4.82 80639 0.74 45.52
2 Turkey 2015 COVID-19 Genetic 0.91 2.35 6.22 36-60 Male 154878 56.41 4.07 3.49 Vaccination 27834 Yes 98.55 41 5.81 12245 0.41 40.20

print("Dataset shape and size:")
print(df.shape, df.size)

Output

Dataset shape and size:
(1000000, 22) 22000000

print("Dataset columns:")
print(df.columns)

Output

Dataset columns:
Index(['Country', 'Year', 'Disease', 'Category', 'Prev Rate (%)', 'Inc Rate (%)', 'Mort Rate (%)', 'Age', 'Gender', 'Pop Affected', 'Health Access (%)', 'Doctors/1000', 'Beds/1000', 'Treatment', 'Avg Cost (USD)', 'Vaccines Avail', 'Recovery (%)', 'DALYs', '5yr Improvement (%)', 'Income (USD)', 'Edu Index', 'Urban Rate (%)'], dtype='object')

EDA

print("\n### Group by the country ###")
grouped_by_country = df.groupby('Country')
print(grouped_by_country.size())

Output

### Group by the country ###
Country
Argentina 49798
Australia 49953
Brazil 49687
Canada 50114
China 50066
France 49943
Germany 50176
India 49760
Indonesia 49756
Italy 49839
Japan 49764
Mexico 50080
Nigeria 50046
Russia 50532
Saudi Arabia 49958
South Africa 50408
South Korea 50181
Turkey 49901
UK 50125
USA 49913
dtype: int64

print("\n### Change column names and make short, meaningful ones ### \n")
# your code here

short_names = [
'Country', 'Year', 'Disease', 'Category', 'Prev Rate (%)', 'Inc Rate (%)',
'Mort Rate (%)', 'Age', 'Gender', 'Pop Affected', 'Health Access (%)',
'Doctors/1000', 'Beds/1000', 'Treatment', 'Avg Cost (USD)', 'Vaccines Avail',
'Recovery (%)', 'DALYs', '5yr Improvement (%)', 'Income (USD)', 'Edu Index', 'Urban Rate (%)'
]

df.columns = short_names  # Assign the short names; their order must match the existing columns

# Display the updated DataFrame
df.head(3)

Output


### Change column names and make short, meaningful ones ###

Country Year Disease Category Prev Rate (%) Inc Rate (%) Mort Rate (%) Age Gender Pop Affected Health Access (%) Doctors/1000 Beds/1000 Treatment Avg Cost (USD) Vaccines Avail Recovery (%) DALYs 5yr Improvement (%) Income (USD) Edu Index Urban Rate (%)
0 Italy 2013 Malaria Respiratory 0.95 1.55 8.42 0-18 Male 471007 57.74 3.34 7.58 Medication 21064 No 91.82 4493 2.16 16886 0.79 86.02
1 France 2002 Ebola Parasitic 12.46 8.63 8.75 61+ Male 634318 89.21 1.33 5.11 Surgery 47851 Yes 76.65 2366 4.82 80639 0.74 45.52
2 Turkey 2015 COVID-19 Genetic 0.91 2.35 6.22 36-60 Male 154878 56.41 4.07 3.49 Vaccination 27834 Yes 98.55 41 5.81 12245 0.41 40.20

There are other ways to rename columns as well; try them too.
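One such alternative, sketched below on a tiny stand-in DataFrame, is `DataFrame.rename` with an explicit old-to-new mapping. Unlike assigning a list to `df.columns`, it does not depend on column order and leaves unmapped columns untouched. The three long names come from the first `df.head()` output; the variable names `toy`, `rename_map`, and `renamed` are illustrative only.

```python
import pandas as pd

# Toy frame using three of the original long column names
# (values copied from the first two rows shown earlier).
toy = pd.DataFrame({
    "Disease Name": ["Malaria", "Ebola"],
    "Prevalence Rate (%)": [0.95, 12.46],
    "Average Treatment Cost (USD)": [21064, 47851],
})

# Explicit old -> new mapping; the order of the columns does not matter.
rename_map = {
    "Disease Name": "Disease",
    "Prevalence Rate (%)": "Prev Rate (%)",
    "Average Treatment Cost (USD)": "Avg Cost (USD)",
}

# rename returns a new DataFrame and leaves `toy` unchanged.
renamed = toy.rename(columns=rename_map)
print(renamed.columns.tolist())
```

Because the mapping names each column explicitly, a mistake in one entry affects only that column instead of shifting every name after it.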

print("\n### Dataset Information ###")
df.info()

Output

### Dataset Information ###
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 1000000 non-null object
1 Year 1000000 non-null int64
2 Disease 1000000 non-null object
3 Category 1000000 non-null object
4 Prev Rate (%) 1000000 non-null float64
5 Inc Rate (%) 1000000 non-null float64
6 Mort Rate (%) 1000000 non-null float64
7 Age 1000000 non-null object
8 Gender 1000000 non-null object
9 Pop Affected 1000000 non-null int64
10 Health Access (%) 1000000 non-null float64
11 Doctors/1000 1000000 non-null float64
12 Beds/1000 1000000 non-null float64
13 Treatment 1000000 non-null object
14 Avg Cost (USD) 1000000 non-null int64
15 Vaccines Avail 1000000 non-null object
16 Recovery (%) 1000000 non-null float64
17 DALYs 1000000 non-null int64
18 5yr Improvement (%) 1000000 non-null float64
19 Income (USD) 1000000 non-null int64
20 Edu Index 1000000 non-null float64
21 Urban Rate (%) 1000000 non-null float64
dtypes: float64(10), int64(5), object(7)
memory usage: 167.8+ MB
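
The `+` in the memory figure indicates that pandas did not measure the contents of the `object` (text) columns. A minimal sketch, on made-up data, of how `memory_usage(deep=True)` gives a fuller figure and how converting low-cardinality text columns to `category` can shrink it. The toy values and the resulting byte counts are assumptions for illustration, not measurements of this dataset.

```python
import pandas as pd

# Toy frame with two repetitive text columns (10,000 rows).
toy = pd.DataFrame({
    "Country": ["Italy", "France", "Turkey", "Indonesia"] * 2500,
    "Gender": ["Male", "Female", "Other", "Male"] * 2500,
})

# deep=True also counts the memory held by the strings themselves.
before = toy.memory_usage(deep=True).sum()

# Store each distinct label once and keep small integer codes per row.
compact = toy.astype({"Country": "category", "Gender": "category"})
after = compact.memory_usage(deep=True).sum()

print(f"text columns:     {before:,} bytes")
print(f"category columns: {after:,} bytes")
```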

print("\n### Checking for Missing Values ###")
print(df.isnull().sum())

Output


### Checking for Missing Values ###
Country 0
Year 0
Disease 0
Category 0
Prev Rate (%) 0
Inc Rate (%) 0
Mort Rate (%) 0
Age 0
Gender 0
Pop Affected 0
Health Access (%) 0
Doctors/1000 0
Beds/1000 0
Treatment 0
Avg Cost (USD) 0
Vaccines Avail 0
Recovery (%) 0
DALYs 0
5yr Improvement (%) 0
Income (USD) 0
Edu Index 0
Urban Rate (%) 0
dtype: int64

Task 1: Explore High Treatment Costs

We want to identify disease categories where the average treatment cost exceeds $5,000. The following code calculates the average treatment cost for each disease category. Complete the rest to filter the results.

print("\n### High Treatment Costs ###")
avg_treatment_cost = df.groupby('Category')['Avg Cost (USD)'].mean()

# Keep only the categories whose average treatment cost exceeds $5,000
high_cost_diseases = avg_treatment_cost[avg_treatment_cost > 5000]
print(high_cost_diseases)

Output

### High Treatment Costs ###
Category
Autoimmune 25124.453381
Bacterial 24958.060657
Cardiovascular 25019.609335
Chronic 25019.360949
Genetic 24991.234814
Infectious 25021.655117
Metabolic 24964.149170
Neurological 25017.514703
Parasitic 24972.076740
Respiratory 25066.355632
Viral 24959.072181
Name: Avg Cost (USD), dtype: float64

Task 2: Visualize Healthcare Access vs Mortality Rate

We will create a scatter plot to visualize the relationship between Healthcare Access (%) and Mortality Rate (%), with points colored by disease category.

This task helps you understand whether higher healthcare access correlates with lower mortality rates.
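
A scatter plot with very many points can be hard to read, so a single number is a useful companion. The sketch below computes Pearson's correlation coefficient with `Series.corr` on synthetic data in which the two columns are generated independently (an assumption made purely for illustration). On the notebook's DataFrame the equivalent call would be `df['Health Access (%)'].corr(df['Mort Rate (%)'])`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: the two columns are drawn independently,
# so the correlation should come out close to zero.
rng = np.random.default_rng(0)
n = 10_000
toy = pd.DataFrame({
    "Health Access (%)": rng.uniform(50, 100, n),
    "Mort Rate (%)": rng.uniform(0.1, 10, n),
})

# Pearson correlation: -1 = perfect inverse, 0 = no linear relation, +1 = perfect direct.
r = toy["Health Access (%)"].corr(toy["Mort Rate (%)"])
print(f"Pearson r = {r:.3f}")
```

A value near -1 would support the idea that higher access goes with lower mortality; a value near 0 means there is no linear relationship to find.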

print("\n### Healthcare Access vs Mortality Rate ###")

# Scatter plot
plt.figure(figsize=(12, 6))
sns.scatterplot(
    x='Health Access (%)',
    y='Mort Rate (%)',
    data=df,
    hue='Category',
    palette='YlOrRd'
)
plt.title("Healthcare Access vs Mortality Rate")
plt.xlabel("Healthcare Access (%)")
plt.ylabel("Mortality Rate (%)")
plt.show()

Output

01-healthcare-access-vs.png

Task 3: Identify Top 5 Most Prevalent Diseases

Group the data by Disease Name and calculate the total prevalence rate. Find the top 5 diseases with the highest prevalence rate.

print("\n### Top 5 Most Prevalent Diseases ###")
prevalence = df.groupby('Disease')['Prev Rate (%)'].sum().sort_values(ascending=False)
top_5_diseases = prevalence.head(5)
print(top_5_diseases)

Output


### Top 5 Most Prevalent Diseases ###
Disease
Cholera 506925.25
HIV/AIDS 506703.20
COVID-19 506447.35
Dengue 505964.81
Cancer 505779.18
Name: Prev Rate (%), dtype: float64
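
Because every row carries its own prevalence rate, a summed rate grows with the number of rows a disease has, so a ranking by total partly reflects row counts. A hedged sketch on a toy frame shows how `sum` and `mean` can rank diseases differently; the numbers are invented for illustration.

```python
import pandas as pd

# Toy data: Cholera has three low-rate rows, Dengue has one high-rate row.
toy = pd.DataFrame({
    "Disease": ["Cholera", "Cholera", "Cholera", "Dengue"],
    "Prev Rate (%)": [2.0, 3.0, 2.5, 6.0],
})

# Compute row count, total, and average side by side.
summary = toy.groupby("Disease")["Prev Rate (%)"].agg(["count", "sum", "mean"])
print(summary)

top_by_sum = summary["sum"].idxmax()    # favours the disease with more rows
top_by_mean = summary["mean"].idxmax()  # favours the disease with higher rates
print("Top by sum: ", top_by_sum)
print("Top by mean:", top_by_mean)
```

Comparing the `count` column with the two rankings is a quick way to judge whether a "top 5" list is driven by the rates or by how often each disease appears.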

Step 1: Prepare the Data

We will use the following columns:

  • 'Health Access (%)' (healthcare access) as the feature (X)
  • 'Mort Rate (%)' (mortality rate) as the target (y)

Tasks:

  1. Drop missing values for these columns.
  2. Split the data into training and testing sets (80-20 split).

from sklearn.model_selection import train_test_split

# Drop rows with missing values in the two columns we use
df.dropna(subset=['Health Access (%)', 'Mort Rate (%)'], inplace=True)

# Select the feature and the target
X = df[['Health Access (%)']]
y = df['Mort Rate (%)']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Data Prepared Successfully!")
print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)

Output

Data Prepared Successfully!
Training set size: (800000, 1)
Testing set size: (200000, 1)

Step 2: Train a Linear Regression Model

We will use Linear Regression to predict 'Mortality Rate (%)' using 'Healthcare Access (%)'.

Tasks:

  1. Import LinearRegression from sklearn.
  2. Initialize the model and train it using the training data.

from sklearn.linear_model import LinearRegression

# Create the model and fit it to the training data
model = LinearRegression()
model.fit(X_train, y_train)

print("Model Trained Successfully!")

Output

Model Trained Successfully!

Step 3: Evaluate the Model

We will evaluate the model's performance using:

  • Mean Squared Error (MSE)
  • R-squared (R2) score

Tasks:

  1. Make predictions on the test set.
  2. Calculate the MSE and R2 score.

from sklearn.metrics import mean_squared_error, r2_score

# Make predictions on the test set
y_pred = model.predict(X_test)

# Compute the performance metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("\n### Model Performance ###")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R2 Score): {r2:.2f}")

Output

### Model Performance ###
Mean Squared Error (MSE): 8.19
R-squared (R2 Score): -0.00
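
An R2 of about zero means the model predicts no better than always guessing the mean of the target. A minimal sketch, on synthetic data where the feature carries no information about the target (an assumption chosen to mimic the result above, not a claim about how the dataset was built), compares `LinearRegression` with scikit-learn's `DummyRegressor` mean baseline.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: the target is drawn independently of the feature.
rng = np.random.default_rng(42)
X_toy = rng.uniform(50, 100, size=(5000, 1))   # plays the role of healthcare access
y_toy = rng.uniform(0.1, 10, size=5000)        # plays the role of mortality rate

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Baseline that ignores X and always predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
linear = LinearRegression().fit(X_tr, y_tr)

r2_baseline = r2_score(y_te, baseline.predict(X_te))
r2_linear = r2_score(y_te, linear.predict(X_te))
print(f"Baseline R2: {r2_baseline:.3f}")
print(f"Linear   R2: {r2_linear:.3f}")
```

If a model's score is indistinguishable from the baseline's, the chosen feature is not helping, and adding or engineering other features is a more promising next step than tuning the model.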

Step 4: Visualize Predictions vs Actual Values

Tasks:

  1. Create a scatter plot for actual vs predicted values.
  2. Add a line for perfect predictions (y = x).

plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, color='red', alpha=0.5)  # actual vs predicted values, drawn in red
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'b--', lw=2)  # dashed blue line marks perfect predictions (y = x)

plt.title("Actual vs Predicted Mortality Rate")
plt.xlabel("Actual Mortality Rate (%)")
plt.ylabel("Predicted Mortality Rate (%)")
plt.show()

Output

02-actual-vs-predicted.png

Step 5: Challenge

  1. You are expected to add more features (e.g., 'Prevalence Rate (%)', 'Average Treatment Cost (USD)') to improve the model.
  2. You are expected to use a different algorithm (e.g., Decision Tree, Random Forest).
  3. You are expected to evaluate performance using Cross-Validation.

Your Tasks:

  • Expand the input features.
  • Train a new model using your choice of algorithm.
  • Compare the performance with Linear Regression.

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Select the expanded feature set
feature_cols = ['Health Access (%)', 'Prev Rate (%)', 'Avg Cost (USD)']

# Drop rows with missing values; .dropna() returns a new frame,
# which avoids modifying a slice of df in place
X_new = df[feature_cols].dropna()
y_new = df.loc[X_new.index, 'Mort Rate (%)']  # keep X and y aligned on the same rows

# Split into training and test sets
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new, y_new, test_size=0.2, random_state=42)



# Create and train the Decision Tree model
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train_new, y_train_new)

# Make predictions on the test set
y_pred_dt = dt_model.predict(X_test_new)

# Compute the performance metrics
mse_dt = mean_squared_error(y_test_new, y_pred_dt)
r2_dt = r2_score(y_test_new, y_pred_dt)

print("\n### Decision Tree Model Performance ###")
print(f"Mean Squared Error (MSE): {mse_dt:.2f}")
print(f"R-squared (R2 Score): {r2_dt:.2f}")


# Evaluate the Decision Tree with 5-fold cross-validation
cv_scores_dt = cross_val_score(dt_model, X_new, y_new, cv=5, scoring='r2')

print("\n### Cross-Validation R2 Scores (Decision Tree) ###")
print(cv_scores_dt)
print(f"Mean R2 Score: {np.mean(cv_scores_dt):.2f}")


# Cross-validate the Linear Regression model for comparison
cv_scores_lr = cross_val_score(model, X, y, cv=5, scoring='r2')

print("\n### Cross-Validation R2 Scores (Linear Regression) ###")
print(cv_scores_lr)
print(f"Mean R2 Score: {np.mean(cv_scores_lr):.2f}")

# Compare the two models
print("\n### Performance Comparison ###")
print(f"Linear Regression - Mean R2: {np.mean(cv_scores_lr):.2f}")
print(f"Decision Tree - Mean R2: {np.mean(cv_scores_dt):.2f}")

Output

### Decision Tree Model Performance ###
Mean Squared Error (MSE): 16.65
R-squared (R2 Score): -1.03

### Cross-Validation R2 Scores (Decision Tree) ###
[-1.03504351 -1.0374017 -1.03589653 -1.03985854 -1.04207183]
Mean R2 Score: -1.04

### Cross-Validation R2 Scores (Linear Regression) ###
[-5.33452368e-06 -6.77185006e-06 -3.17034059e-06 -1.67578961e-05
-1.13589043e-05]
Mean R2 Score: -0.00

### Performance Comparison ###
Linear Regression - Mean R2: -0.00
Decision Tree - Mean R2: -1.04
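
A strongly negative R2 from an unconstrained decision tree is typical of overfitting: the tree memorises noise in the training folds and then predicts that noise on unseen rows. One thing to try, sketched here on synthetic noise data (an assumption for illustration only), is limiting `max_depth` and comparing the cross-validated scores.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-ins: three features that carry no information about the target.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(2000, 3))
y_toy = rng.uniform(0.1, 10, size=2000)

# Unconstrained tree: grows until every training row is fit exactly.
deep_tree = DecisionTreeRegressor(random_state=42)
# Depth-limited tree: at most 2**3 = 8 leaves, so it cannot memorise each row.
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=42)

r2_deep = cross_val_score(deep_tree, X_toy, y_toy, cv=5, scoring="r2").mean()
r2_shallow = cross_val_score(shallow_tree, X_toy, y_toy, cv=5, scoring="r2").mean()

print(f"Unconstrained tree - mean R2: {r2_deep:.2f}")
print(f"max_depth=3 tree   - mean R2: {r2_shallow:.2f}")
```

Limiting depth does not create signal where none exists; it only stops the tree from doing worse than the mean baseline. The same comparison can be repeated with `RandomForestRegressor` from `sklearn.ensemble` as the challenge suggests.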
