Introduction to Statistical Analysis in Python: EDA, Visualization, and Inference

Introduction to Statistical Analysis in Python: EDA, Visualization, and Inference#

Introduction#

In this notebook, we will explore statistical analysis using Python, focusing on three critical components: Exploratory Data Analysis (EDA), Data Visualization, and Statistical Inference. We’ll use Python’s most powerful data analysis libraries: pandas for data manipulation, matplotlib and seaborn for static visualization, plotly for interactive graphics, scipy for statistical testing, and statsmodels for regression analysis.

Learning Outcomes#

Exploratory Data Analysis (EDA)#

Load, inspect, and clean datasets; create derived variables and demographic indicators
Perform comprehensive analysis (univariate, bivariate, multivariate) and handle outliers

Data Visualization#

Master matplotlib, seaborn, and plotly for statistical and interactive visualizations
Apply best practices to create publication-ready, multi-panel graphics

Statistical Inference#

Formulate and test hypotheses using scipy and statsmodels; perform common statistical tests
Build multiple linear regression models, calculate confidence intervals, and check assumptions

Dataset Description#

This dataset was created by combining population density data and building footprints. The population density data is age-group specific from Meta. The building footprints were downloaded from Google’s Open Buildings dataset. To generate the data at cell-level, I utilized GIS packages in Python. You will learn about this processing in the next session when we cover Module 3 (Spatial Data Processing).

Dataset Overview:

Original population density data: Gridded Population Density by Age Group
Building Footprints: Google Open Buildings
Download link: Download data file from here

Variable Descriptions#

Geographic Identifiers#

Variable	Type	Description
`cell_id`	String	Unique identifier for each administrative cell
`province_name`	String	Province name (5 provinces: Kigali, Eastern, Western, Northern, Southern)
`district_name`	String	District name within province (30 districts total)
`sector_name`	String	Sector name within district (administrative subdivision)
`cell_name`	String	Cell name (smallest administrative unit)

Demographic Variables (2020 Population Estimates)#

Variable	Description
`general_2020`	Total population in the cell
`elderly_60_plus_2020`	Population aged 60 years and above
`children_under_five_2020`	Population under 5 years of age
`youth_15_24_2020`	Population aged 15-24 years
`men_2020`	Male population
`women_2020`	Female population

Infrastructure Variable#

Variable	Type	Description
`building_count`	Float	Number of buildings/structures in the cell

Import Required Packages#

from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

DIR_DATA = Path.cwd().parents[1] / "data"
FILE_CELL_POP = DIR_DATA / "population" / "rwa-cell-pop.csv"

1. EXPLOLATORY DATA ANALYSIS#

1.1 Data Cleaning and Preprocessing#

TASK-0: Load the dataset and inspect its structure.

TASK-1: The column names with ‘_2020’ suffix are verbose and hard to work with. Rename them to be more concise and readable.

INSTRUCTIONS:

Create a dictionary that maps old column names to new names
Use pandas .rename() method to rename the columns
Print the new column names to verify the changes

NEW VARIABLE NAMES TO USE:

‘general_2020’ → ‘total_pop’
‘elderly_60_plus_2020’ → ‘elderly_pop’
‘children_under_five_2020’ → ‘children_pop’
‘youth_15_24_2020’ → ‘youth_pop’
‘men_2020’ → ‘male_pop’
‘women_2020’ → ‘female_pop’

TASK-2: Create meaningful demographic indicators from the population data.

PART A: BASIC POPULATION MEASURES Create these variables using the NEW column names:

‘population’ - total population (alias for clarity)
‘gender_ratio’ - men per 100 women (male_pop / female_pop * 100)
‘density_proxy’ - people per building (population / building_count)

PART B: AGE STRUCTURE PROPORTIONS (as decimals 0-1, not percentages)

‘pct_elderly’ - proportion of elderly population
‘pct_children’ - proportion of children population
‘pct_youth’ - proportion of youth population

PART C: DEMOGRAPHIC INDICATORS

‘dependency_ratio’ - (children + elderly) / working age population Note: working age = total - children - elderly

1.2. Univariate Analysis#

In univariate analysis, we focus on understanding and summarizing a single variable at a time. The goal is to explore its distribution, central tendency (like mean or median), and variability (such as range or standard deviation). This helps us understand the basic characteristics of individual variables before looking at relationships between them.

Summary statistics
Histograms and basic charts for single variables

# Descriptive Statistics for Population Variables
# Ensure that the variable names used below match the ones you renamed earlier in the notebook.  
# Update them as needed to reflect any changes.
pop_columns = ['total_population', 'elderly_60_plus_2020', 'children_under_five_2020', 
               'youth_15_24_2020', 'men_2020', 'women_2020', 'building_count']

print("Descriptive statistics for population variables:")
desc_stats = df[pop_columns].describe()
print(desc_stats.round(2))

# Distribution analysis
print("\nDistribution characteristics:")
for col in ['total_population', 'gender_ratio', 'people_per_building']:
    if col in df.columns:
        print(f"\n{col}:")
        print(f"  Mean: {df[col].mean():.2f}")
        print(f"  Median: {df[col].median():.2f}")
        print(f"  Std Dev: {df[col].std():.2f}")
        print(f"  Skewness: {df[col].skew():.2f}")
        print(f"  Kurtosis: {df[col].kurtosis():.2f}")

# Visualization of key distributions
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Distribution of Key Population Variables', fontsize=16)

# Total population distribution
axes[0,0].hist(df['total_population'], bins=50, alpha=0.7, color='skyblue', edgecolor='black')
axes[0,0].set_title('Total Population Distribution')
axes[0,0].set_xlabel('Population')
axes[0,0].set_ylabel('Frequency')


# Age proportions
axes[1,0].hist(df['elderly_proportion'], bins=50, alpha=0.7, color='gold', edgecolor='black')
axes[1,0].set_title('Proportion of Elderly (60+)')
axes[1,0].set_xlabel('Proportion')
axes[1,0].set_ylabel('Frequency')

# Children proportion
axes[1,1].hist(df['children_proportion'], bins=50, alpha=0.7, color='lightcoral', edgecolor='black')
axes[1,1].set_title('Proportion of Children (<5)')
axes[1,1].set_xlabel('Proportion')
axes[1,1].set_ylabel('Frequency')

# People per building
axes[1,2].hist(df['people_per_building'], bins=50, alpha=0.7, color='mediumpurple', edgecolor='black')
axes[1,2].set_title('People per Building')
axes[1,2].set_xlabel('People/Building')
axes[1,2].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

EXERCISE-1: CREATE MORE PLOTS FOR THE FOLLOWING VARIABLES

children_proportion
bulding_count

1.3. Bivariate Analysis#

Bivariate analysis explores the relationship between two variables. This helps identify potential associations, trends, or patterns — such as how one variable might influence or correlate with another. Depending on the variable types (numerical or categorical), different methods and visualizations are used.

Summary statistics (e.g., correlation coefficients, cross-tabulations)
Scatter plots, grouped bar charts, box plots, and line charts

# Correlation analysis
print("Correlation matrix for key variables:")
corr_vars = ['total_population', 'elderly_proportion', 'children_proportion', 
             'youth_proportion', 'gender_ratio', 'people_per_building', 'dependency_ratio']
correlation_matrix = df[corr_vars].corr()
print(correlation_matrix.round(3))

# Visualize correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, fmt='.3f', cbar_kws={'shrink': 0.8})
plt.title('Correlation Matrix: Population Demographics')
plt.tight_layout()
plt.show()

# Population by administrative level
print("\nPopulation distribution by administrative levels:")

# By Province
province_pop = df.groupby('province_name')['total_population'].agg(['sum', 'mean', 'count']).round(2)
province_pop.columns = ['Total_Pop', 'Avg_Cell_Pop', 'Cell_Count']
print("By Province:")
print(province_pop.sort_values('Total_Pop', ascending=False))

# Scatter plots for key relationships
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Key Bivariate Relationships', fontsize=16)

# Population vs Building Count
axes[0,0].scatter(df['building_count'], df['total_population'], alpha=0.6, color='blue')
axes[0,0].set_xlabel('Building Count')
axes[0,0].set_ylabel('Total Population')
axes[0,0].set_title('Population vs Building Count')


# Children vs Elderly proportions
axes[1,0].scatter(df['children_proportion'], df['elderly_proportion'], alpha=0.6, color='green')
axes[1,0].set_xlabel('Children Proportion')
axes[1,0].set_ylabel('Elderly Proportion')
axes[1,0].set_title('Age Structure Relationship')

# Dependency ratio vs Population
axes[1,1].scatter(df['total_population'], df['dependency_ratio'], alpha=0.6, color='purple')
axes[1,1].set_xlabel('Total Population')
axes[1,1].set_ylabel('Dependency Ratio')
axes[1,1].set_title('Dependency Ratio vs Population')

plt.tight_layout()
plt.show()

EXERCISE 2: Correlation Analysis#

create a correlation matrix using: Use these variables: population, pct_elderly, pct_children, pct_youth, gender_ratio, density_proxy, dependency_ratio
Dispaly the matrix as a heatmap

1.3 Multivariate Exploration#

EXERCISE-3: CREATE CATEGORIAL VARIABLE FOR POPULATION

Use quantiles to understand population distribution. Check ```pd.qcut`` function on how to create categorial variable with quantiles
Lets call this variable 'pop_size_category'

df['pop_category'] = pd.qcut(df['total_population'], q=4, labels=['Very Low', 'Low', 'High', 'Very High'])

# =====================================
# CROSS TABULATION 1: Province vs Population Size
# =====================================
print("\n1. CROSS TABULATION: Province vs Population Size Categories")
print("-" * 60)

crosstab1 = pd.crosstab(df['province_name'], df['pop_size_category'], margins=True)
print("Frequency Table:")
print(crosstab1)

print("\nPercentage by Province (Row Percentages):")
crosstab1_pct = pd.crosstab(df['province_name'], df['pop_size_category'], 
                           normalize='index') * 100
print(crosstab1_pct.round(1))

print("\nPercentage by Population Size (Column Percentages):")
crosstab1_col_pct = pd.crosstab(df['province_name'], df['pop_size_category'], 
                               normalize='columns') * 100
print(crosstab1_col_pct.round(1))

2. Static Visualizations#

Static visualizations are non-interactive plots or charts that present data in a fixed format. They are ideal for printed reports, slide decks, and simple summaries where interactivity is not required. These visualizations help convey trends, comparisons, and distributions clearly and efficiently.

Common examples include:

Bar charts
Line plots
Histograms
Box plots
Scatter plots

# Basic scatter plot with customization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Population vs Buildings
ax1.scatter(df['building_count'], df['total_population'], alpha=0.6, color='steelblue')
ax1.set_xlabel('Number of Buildings')
ax1.set_ylabel('Total Population')
ax1.set_title('Population vs Infrastructure')
ax1.grid(True, alpha=0.3)

# Add trend line
z = np.polyfit(df['building_count'], df['total_population'], 1)
p = np.poly1d(z)
ax1.plot(df['building_count'], p(df['building_count']), "r--", alpha=0.8)

# Plot 2: Distribution
ax2.hist(df['total_population'], bins=30, alpha=0.7, color='lightcoral', edgecolor='black')
ax2.axvline(df['total_population'].mean(), color='red', linestyle='--', linewidth=2)
ax2.set_xlabel('Total Population')
ax2.set_ylabel('Frequency')
ax2.set_title('Population Distribution')

plt.tight_layout()
plt.show()

# Correlation heatmap
corr_vars = ['total_population', 'elderly_proportion', 'children_proportion', 'gender_ratio']
correlation_matrix = df[corr_vars].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='RdBu_r', center=0, 
            square=True, ax=axes[1,0])
axes[1,0].set_title('Correlation Heatmap')

# Scatter plot with regression
sns.scatterplot(data=df, x='elderly_proportion', y='children_proportion', 
                hue='province_name', ax=axes[1,1])
sns.regplot(data=df, x='elderly_proportion', y='children_proportion', 
            scatter=False, ax=axes[1,1], color='red')
axes[1,1].set_title('Age Structure Relationship')
axes[1,1].legend(bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.show()

3. Inference and Basic Modelling#

This section focuses on using statistical methods to draw conclusions about populations based on sample data (inference) and on building simple models to explain or predict outcomes (basic modeling).

We introduce techniques such as hypothesis testing to assess whether observed patterns are statistically significant, and basic regression models to explore relationships between variables.

Key topics may include:

Confidence intervals and hypothesis tests
t-tests and chi-square tests
Simple linear regression
Model interpretation and evaluation

3.1 Hypothesis Testing#

print("\n1. HYPOTHESIS TESTING")
print("-" * 30)

# One-sample t-test: Is gender ratio significantly different from 100?
print("ONE-SAMPLE T-TEST: Gender Ratio vs Population Parity (100)")
print("-" * 50)

t_stat, p_value = stats.ttest_1samp(df['gender_ratio'], 100)
mean_ratio = df['gender_ratio'].mean()

print(f"Sample mean: {mean_ratio:.2f}")
print(f"Test statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Result: {'Reject H0' if p_value < 0.05 else 'Fail to reject H0'} (α = 0.05)")
print(f"Interpretation: Gender ratio is {'significantly different from' if p_value < 0.05 else 'not significantly different from'} 100")

# Two-sample t-test: Compare children proportion between two largest provinces
print(f"\nTWO-SAMPLE T-TEST: Children Proportion Comparison")
print("-" * 50)

# Get two largest provinces by number of cells
province_counts = df['province_name'].value_counts()
prov1, prov2 = province_counts.index[:2]

group1 = df[df['province_name'] == prov1]['children_proportion']
group2 = df[df['province_name'] == prov2]['children_proportion']

t_stat2, p_value2 = stats.ttest_ind(group1, group2)

print(f"Comparing: {prov1} vs {prov2}")
print(f"{prov1} mean: {group1.mean():.3f} (n={len(group1)})")
print(f"{prov2} mean: {group2.mean():.3f} (n={len(group2)})")
print(f"t-statistic: {t_stat2:.3f}")
print(f"P-value: {p_value2:.4f}")
print(f"Result: {'Significant difference' if p_value2 < 0.05 else 'No significant difference'}")

# Chi-square test: Association between province and population size category
print(f"\nCHI-SQUARE TEST: Province vs Population Size")
print("-" * 50)

# Create population size categories
df['pop_category'] = pd.cut(df['total_population'], 
                           bins=[0, 1000, 3000, float('inf')], 
                           labels=['Small', 'Medium', 'Large'])

# Create contingency table
contingency_table = pd.crosstab(df['province_name'], df['pop_category'])
print("Contingency Table:")
print(contingency_table)

chi2, p_val_chi, dof, expected = stats.chi2_contingency(contingency_table)
print(f"\nChi-square statistic: {chi2:.3f}")
print(f"P-value: {p_val_chi:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"Result: {'Significant association' if p_val_chi < 0.05 else 'No significant association'}")

3.2 Regression with StatsModels#

# Simple linear regression
print("SIMPLE LINEAR REGRESSION")
print("-" * 25)

# Predict total population from building count
X = df['building_count']
y = df['total_population']

# Add constant for intercept
X_with_const = sm.add_constant(X)
model_simple = sm.OLS(y, X_with_const).fit()

print("Model: Total Population ~ Building Count")
print(f"R-squared: {model_simple.rsquared:.3f}")
print(f"Coefficient (slope): {model_simple.params[1]:.2f}")
print(f"P-value for slope: {model_simple.pvalues[1]:.4f}")
print(f"Interpretation: Each additional building is associated with {model_simple.params[1]:.1f} more people")

# Multiple linear regression using statsmodels formula API
print(f"\nMULTIPLE LINEAR REGRESSION")
print("-" * 25)

# Predict dependency ratio from multiple factors
formula = 'dependency_ratio ~ children_proportion + elderly_proportion + C(province_name)'
model_multiple = smf.ols(formula, data=df).fit()

print("Model: Dependency Ratio ~ Children % + Elderly % + Province")
print(f"R-squared: {model_multiple.rsquared:.3f}")
print(f"Adjusted R-squared: {model_multiple.rsquared_adj:.3f}")
print(f"F-statistic p-value: {model_multiple.f_pvalue:.4f}")

print("\nKey Coefficients:")
coeffs = model_multiple.params
pvals = model_multiple.pvalues
for var in ['children_proportion', 'elderly_proportion']:
    if var in coeffs:
        print(f"  {var}: {coeffs[var]:.3f} (p={pvals[var]:.4f})")

print(f"\nModel Summary:")
print(f"  Variables: {len(model_multiple.params)} coefficients")
print(f"  Significant predictors: {sum(model_multiple.pvalues < 0.05)} at α=0.05")

model_multiple.summary()

EXERCISE-3: PREDICTING PEOPLE PER BUILDING FROM DEMOGRAPHIC VARIABLES

Build a model which uses all demographic variables to predict people per building
Print model summary