Programming Assignment-2

Programming Assignment-2#

The goal of this assingment is to allow you to practice several the following things in Python:

Perfoming typical data processing (or preprocessing if you prefer). This includes all the typical data wraning such as creating news variables, combining several datasets and more
Running explolatory data analysis including basic plotting of variables
Perfoming basic inferential statisticals using statsmodels and scipy to run hypythesis testing and build simple statistial or econometric models.

Datasets#

For this assignment, you will use the following datasets:

Rwanda Health Indicators#

The Excel file was generated by combining multiple CSV files, each containing data on different health indicators for Rwanda, So that each sheet in the file represent one such indicator. See below some of the input files which were used:

access-to-health-care_subnational_rwa
child-mortality-rates_subnational_rwa
dhs-mobile_subnational_rwa

You can download the dataset from here.

Nights lights Data#

Please download it here and check the documentation in the cells below.

Popupation Dataset#

Please download it here and check the documentation and metadata in the class notebooks.

Submission Guidelines#

Please guidelines and complete all steps in the GitHub Workflow
Once you have completed your assignment, push chanegs to your repository.
Send a link (copy from within GitHub) to your notebook to the tutors/teaching assistants

Import Required Packages#

from pathlib import Path
import pandas as pd

Setup Input Folders#

As usual, it is good practice to set up input folders using the pathlib package. In this section, make sure to define the folders where your data is stored on your machine.

I find it helpful to set up the working directory and input data folders right at the start of the notebook. To keep things organized, I use the naming convention: FILE_{NAME} for files and DIR_{NAME} for folders. We use capital letters because these are global variables that will be referenced throughout the notebook.

We’ll be using the pathlib library, which offers several advantages over traditional string-based path handling:

Cross-platform compatibility - automatically handles path separators (/ vs \) across different operating systems
Object-oriented approach - paths are objects with useful methods rather than strings
Intuitive syntax - use / operator to join paths naturally: parent_dir / "subfolder" / "file.txt"
Built-in path operations - methods like .exists(), .is_file(), .parent, .stem, and .suffix
Safer path manipulation - reduces errors from manual string concatenation and splitting

This is the recommended approach for managing file paths in modern Python development.

# Uncomment the following lines and add your code to define the directories and files
DIR_DATA = Path.cwd().parents[1].joinpath("data")
FILE_EXCEL = DIR_DATA/"RW-Health-Data.xlsx"

# Population by enumeration area (EA) for Malawi
# FILE_POP_MW = ADD YOUR CODE

Part 1: Processing Excel Files#

The primary goal is to preprocess an Excel file with multiple sheets into a unified CSV dataset that consolidates multiple indicators. Having all indicators in a single file at the same analytical unit (national, subnational) is more efficient than managing separate files and enables easier cross-indicator analysis.

Task 1: Generate National-Level Summaries#

For each indicator, compute a single national-level value using appropriate aggregation functions such as mean, sum or count. For this one, all available indicators can be summarized at national level, so we will have a CSV file with one row and

Expected Output Structure#

DataFrame display in Jupyter Notebook
CSV file with columns:

indicator_name: Name of the indicator
aggregated_value: Computed national value
indicator_year: Survey year or something similar
survey_name: Name of the survey where information is coming from
aggregation_method: Statistical method used (optional)

Task 2: Subnational-Level Indicator Dataset#

Create a merged dataset for indicators with subnational data (ADM2/ADM3 levels), ensuring spatial alignment and consistent administrative boundaries.

Expected Output Structure#

indicator_name: Name of the indicator
aggregated_value: Computed national value
indicator_year: Survey year or something similar
survey_name: Name of the survey where information is coming from
aggregation_method: Statistical method used (optional)

This structure enables both single-indicator and multi-indicator analysis at the subnational level.

Introduction to Nightlights Dataset#

What is Nightlight Data?#

Nightlight data is satellite imagery capturing artificial light emissions from Earth’s surface during nighttime. Satellites like VIIRS collect this data regularly, providing an objective, real-time measure of human economic activity and development.

Raw Data: Radiance Measurements#

The fundamental measurement in nightlight data is radiance - the amount of light energy detected by satellite sensors, measured in nanowatts per square centimeter per steradian (nW/cm²/sr). Each pixel in satellite imagery contains a radiance value representing the light intensity from that specific location on Earth’s surface.

Annual Composite Generation#

This dataset was created from annual composite images using VIIRS nightlight files for Rwanda. Annual composites are generated by:

Aggregating daily/monthly observations throughout each year (2015, 2020, 2024)
Filtering out temporary light sources (fires, lightning, aurora)
Removing cloud-affected observations to ensure clear measurements
Averaging or taking median values to create stable, representative annual measurements
Masking techniques to exclude areas with unreliable data

The files used include both average composites (average_masked) and median composites (median_masked), with cloud-free versions (vcmslcfg) preferred over cloud-inclusive versions (vcmcfg) for more accurate measurements.

Why Use Nightlight Data?#

Consistent global coverage - Available everywhere, regardless of local data quality
Real-time updates - More current than traditional economic statistics
Objective measurement - Not subject to reporting biases
High resolution - Captures local development patterns
Proxy for development - Light intensity correlates with economic activity, infrastructure, and quality of life

Dataset Overview#

6,507 observations across Rwanda’s administrative cells
Three time periods: 2015, 2020, 2024
Cell-level data - Rwanda’s smallest administrative units
Allows temporal analysis of development trends

Variable Definitions#

Administrative Identifiers#

cell_id - Unique identifier for linking with other datasets
province_name - Province (5 total in Rwanda)
district_name - District (30 total in Rwanda)
sector_name - Administrative level between district and cell
cell_name - Specific cell name

Core Nightlight Measurements#

`total_nightlight`#

Sum of all radiance values within cell boundaries
Key indicator of overall economic activity/development
Higher values = more total development

`mean_nightlight`#

Average radiance per pixel
Indicates development intensity regardless of cell size
Useful for comparing cells of different areas

`median_nightlight`#

Middle radiance value of all pixels (less sensitive to outliers)
Better represents typical lighting in unevenly developed areas

`max_nightlight`#

Highest radiance within cell
Indicates major infrastructure (hospitals, commercial centers)

`min_nightlight` & `std_nightlight`#

Minimum radiance and standard deviation
High std = uneven development within cell

Spatial Coverage Indicators#

`pixel_count`#

Total pixels in cell (indicates geographic size)
Used to normalize other measurements

`lit_pixel_count`#

Number of pixels with detectable light (radiance > 0)
Shows spatial extent of development

`lit_pixel_percentage`#

Percentage of cell area with lighting
Formula: (lit_pixel_count ÷ pixel_count) × 100
0% = completely dark, 100% = fully developed

`year`#

Time period: 2015, 2020, or 2024

Part-2: Demographic and Nightlights Data#

Part A: Varible Generation and Data Integration#

Population Dataset Variables (`rwa-cell-pop.csv`):#

Create the following derived variables:

dependency_ratio - (children_under_five_2020 + elderly_60_plus_2020) / working_age_population * 100
people_per_building - general_2020 / building_count
working_age_population - general_2020 - children_under_five_2020 - elderly_60_plus_2020
infrastructure_index - Your own formula that incorporates people_per_building and other relevant variables to measure infrastructure adequacy. Document and justify your infrastructure_index methodology, explaining how people_per_building and other variables contribute to measuring infrastructure pressure.

Nightlight Dataset Variables (`cell-ntl-2015-2020-2024.csv`):#

Create the following temporal and development indicators:

nightlight_change_2015_2024 - Percentage change in total nightlight from 2015 to 2024
mean_nightlight_change_2015_2024 - Percentage change in mean nightlight from 2015 to 2024
lit_pixel_percentage - Use existing or calculate: (lit_pixel_count / pixel_count) * 100

Data Integration:#

Merge the datasets using the appropriate column.

Part B: Exploratory Data Analysis#

Correlation Analysis:#

Correlation Heatmap: Create a heatmap showing correlations between 10 key variables (mix of demographic, infrastructure, and nightlight variables).
Report the top 3 variable pairs with the highest correlations and interpret their relationships.
Identify unexpected correlations and discuss potential explanations.

Nightlight Trend Analysis:#

District Ranking: Report the top 5 districts with the highest nightlight growth (2015-2024) and bottom 5 districts with the most decline or lowest growth.
Lit Pixel Analysis: Compare these districts using lit_pixel_percentage changes to understand whether growth represents intensification or spatial expansion.
Create visualizations showing nightlight trends for these extreme districts.

Part C: Modeling#

Multivariate Linear Regression:#

Model Development: Build a multivariate linear regression model predicting population density using both demographic and nightlight variables as predictors. Explore as many variables as possible at the beginning.
Variable Selection: Test different combinations of variables and report the top 3 most predictive variables of population density.
Model Evaluation: Report R-squared, coefficients, and statistical significance. Interpret what these results tell us about population-infrastructure relationships.

Notes and Other Requirements#

Please follow the genral guidelines below when preparing your analysis..

Statistical Analysis:#

Properly handle missing data and outliers
Use appropriate statistical tests and report p-values
Calculate and interpret correlation coefficients
Validate regression assumptions (normality, homoscedasticity)

Data Management:#

Document all data cleaning and aggregation steps using markdown
Ensure consistent district naming across datasets

Visualization Standards:#

Create clear, publication-quality heatmaps with appropriate color scales
Design effective time series plots for nightlight trends
Include proper axis labels, titles, and legends
Use consistent formatting across all visualizations

Reporting Requirements:#

Clearly state the top 3 most predictive variables with statistical justification
Provide ranked lists for nightlight growth districts with supporting metrics
Include model performance statistics and interpretation
Document all methodological choices and assumptions

Programming Assignment-2

Contents

Programming Assignment-2#

Datasets#

Rwanda Health Indicators#

Nights lights Data#

Popupation Dataset#

Submission Guidelines#

Import Required Packages#

Setup Input Folders#

Part 1: Processing Excel Files#

Task 1: Generate National-Level Summaries#

Expected Output Structure#

Task 2: Subnational-Level Indicator Dataset#

Expected Output Structure#

Introduction to Nightlights Dataset#

What is Nightlight Data?#

Raw Data: Radiance Measurements#

Annual Composite Generation#

Why Use Nightlight Data?#

Dataset Overview#

Variable Definitions#

Administrative Identifiers#

Core Nightlight Measurements#

total_nightlight#

mean_nightlight#

median_nightlight#

max_nightlight#

min_nightlight & std_nightlight#

Spatial Coverage Indicators#

pixel_count#

lit_pixel_count#

lit_pixel_percentage#

year#

Part-2: Demographic and Nightlights Data#

Part A: Varible Generation and Data Integration#

Population Dataset Variables (rwa-cell-pop.csv):#

Nightlight Dataset Variables (cell-ntl-2015-2020-2024.csv):#

Data Integration:#

Part B: Exploratory Data Analysis#

Correlation Analysis:#

Nightlight Trend Analysis:#

Part C: Modeling#

Multivariate Linear Regression:#

Notes and Other Requirements#

Statistical Analysis:#

Data Management:#

Visualization Standards:#

Reporting Requirements:#

`total_nightlight`#

`mean_nightlight`#

`median_nightlight`#

`max_nightlight`#

`min_nightlight` & `std_nightlight`#

`pixel_count`#

`lit_pixel_count`#

`lit_pixel_percentage`#

`year`#

Population Dataset Variables (`rwa-cell-pop.csv`):#

Nightlight Dataset Variables (`cell-ntl-2015-2020-2024.csv`):#