Programming Assignment-2#

The goal of this assingment is to allow you to practice several the following things in Python:

  1. Perfoming typical data processing (or preprocessing if you prefer). This includes all the typical data wraning such as creating news variables, combining several datasets and more

  2. Running explolatory data analysis including basic plotting of variables

  3. Perfoming basic inferential statisticals using statsmodels and scipy to run hypythesis testing and build simple statistial or econometric models.

Datasets#

For this assignment, you will use the following datasets:

Rwanda Health Indicators#

The Excel file was generated by combining multiple CSV files, each containing data on different health indicators for Rwanda, So that each sheet in the file represent one such indicator. See below some of the input files which were used:

  • access-to-health-care_subnational_rwa

  • child-mortality-rates_subnational_rwa

  • dhs-mobile_subnational_rwa

You can download the dataset from here.

Nights lights Data#

  • Please download it here and check the documentation in the cells below.

Popupation Dataset#

  • Please download it here and check the documentation and metadata in the class notebooks.

Submission Guidelines#

  • Please guidelines and complete all steps in the GitHub Workflow

  • Once you have completed your assignment, push chanegs to your repository.

  • Send a link (copy from within GitHub) to your notebook to the tutors/teaching assistants

Import Required Packages#

from pathlib import Path
import pandas as pd

Setup Input Folders#

As usual, it is good practice to set up input folders using the pathlib package. In this section, make sure to define the folders where your data is stored on your machine.

I find it helpful to set up the working directory and input data folders right at the start of the notebook. To keep things organized, I use the naming convention: FILE_{NAME} for files and DIR_{NAME} for folders. We use capital letters because these are global variables that will be referenced throughout the notebook.

We’ll be using the pathlib library, which offers several advantages over traditional string-based path handling:

  • Cross-platform compatibility - automatically handles path separators (/ vs \) across different operating systems

  • Object-oriented approach - paths are objects with useful methods rather than strings

  • Intuitive syntax - use / operator to join paths naturally: parent_dir / "subfolder" / "file.txt"

  • Built-in path operations - methods like .exists(), .is_file(), .parent, .stem, and .suffix

  • Safer path manipulation - reduces errors from manual string concatenation and splitting

This is the recommended approach for managing file paths in modern Python development.

# Uncomment the following lines and add your code to define the directories and files
DIR_DATA = Path.cwd().parents[1].joinpath("data")
FILE_EXCEL = DIR_DATA/"RW-Health-Data.xlsx"

# Population by enumeration area (EA) for Malawi
# FILE_POP_MW = ADD YOUR CODE

Part 1: Processing Excel Files#

The primary goal is to preprocess an Excel file with multiple sheets into a unified CSV dataset that consolidates multiple indicators. Having all indicators in a single file at the same analytical unit (national, subnational) is more efficient than managing separate files and enables easier cross-indicator analysis.

Task 1: Generate National-Level Summaries#

For each indicator, compute a single national-level value using appropriate aggregation functions such as mean, sum or count. For this one, all available indicators can be summarized at national level, so we will have a CSV file with one row and

Expected Output Structure#

  1. DataFrame display in Jupyter Notebook

  2. CSV file with columns:

  • indicator_name: Name of the indicator

  • aggregated_value: Computed national value

  • indicator_year: Survey year or something similar

  • survey_name: Name of the survey where information is coming from

  • aggregation_method: Statistical method used (optional)

Task 2: Subnational-Level Indicator Dataset#

Create a merged dataset for indicators with subnational data (ADM2/ADM3 levels), ensuring spatial alignment and consistent administrative boundaries.

Expected Output Structure#

  • indicator_name: Name of the indicator

  • aggregated_value: Computed national value

  • indicator_year: Survey year or something similar

  • survey_name: Name of the survey where information is coming from

  • aggregation_method: Statistical method used (optional)

This structure enables both single-indicator and multi-indicator analysis at the subnational level.

Introduction to Nightlights Dataset#

What is Nightlight Data?#

Nightlight data is satellite imagery capturing artificial light emissions from Earth’s surface during nighttime. Satellites like VIIRS collect this data regularly, providing an objective, real-time measure of human economic activity and development.

Raw Data: Radiance Measurements#

The fundamental measurement in nightlight data is radiance - the amount of light energy detected by satellite sensors, measured in nanowatts per square centimeter per steradian (nW/cm²/sr). Each pixel in satellite imagery contains a radiance value representing the light intensity from that specific location on Earth’s surface.

Annual Composite Generation#

This dataset was created from annual composite images using VIIRS nightlight files for Rwanda. Annual composites are generated by:

  • Aggregating daily/monthly observations throughout each year (2015, 2020, 2024)

  • Filtering out temporary light sources (fires, lightning, aurora)

  • Removing cloud-affected observations to ensure clear measurements

  • Averaging or taking median values to create stable, representative annual measurements

  • Masking techniques to exclude areas with unreliable data

The files used include both average composites (average_masked) and median composites (median_masked), with cloud-free versions (vcmslcfg) preferred over cloud-inclusive versions (vcmcfg) for more accurate measurements.

Why Use Nightlight Data?#

  • Consistent global coverage - Available everywhere, regardless of local data quality

  • Real-time updates - More current than traditional economic statistics

  • Objective measurement - Not subject to reporting biases

  • High resolution - Captures local development patterns

  • Proxy for development - Light intensity correlates with economic activity, infrastructure, and quality of life

Dataset Overview#

  • 6,507 observations across Rwanda’s administrative cells

  • Three time periods: 2015, 2020, 2024

  • Cell-level data - Rwanda’s smallest administrative units

  • Allows temporal analysis of development trends


Variable Definitions#

Administrative Identifiers#

  • cell_id - Unique identifier for linking with other datasets

  • province_name - Province (5 total in Rwanda)

  • district_name - District (30 total in Rwanda)

  • sector_name - Administrative level between district and cell

  • cell_name - Specific cell name

Core Nightlight Measurements#

total_nightlight#

  • Sum of all radiance values within cell boundaries

  • Key indicator of overall economic activity/development

  • Higher values = more total development

mean_nightlight#

  • Average radiance per pixel

  • Indicates development intensity regardless of cell size

  • Useful for comparing cells of different areas

median_nightlight#

  • Middle radiance value of all pixels (less sensitive to outliers)

  • Better represents typical lighting in unevenly developed areas

max_nightlight#

  • Highest radiance within cell

  • Indicates major infrastructure (hospitals, commercial centers)

min_nightlight & std_nightlight#

  • Minimum radiance and standard deviation

  • High std = uneven development within cell

Spatial Coverage Indicators#

pixel_count#

  • Total pixels in cell (indicates geographic size)

  • Used to normalize other measurements

lit_pixel_count#

  • Number of pixels with detectable light (radiance > 0)

  • Shows spatial extent of development

lit_pixel_percentage#

  • Percentage of cell area with lighting

  • Formula: (lit_pixel_count ÷ pixel_count) × 100

  • 0% = completely dark, 100% = fully developed

year#

  • Time period: 2015, 2020, or 2024

Part-2: Demographic and Nightlights Data#

Part A: Varible Generation and Data Integration#

Population Dataset Variables (rwa-cell-pop.csv):#

Create the following derived variables:

  • dependency_ratio - (children_under_five_2020 + elderly_60_plus_2020) / working_age_population * 100

  • people_per_building - general_2020 / building_count

  • working_age_population - general_2020 - children_under_five_2020 - elderly_60_plus_2020

  • infrastructure_index - Your own formula that incorporates people_per_building and other relevant variables to measure infrastructure adequacy. Document and justify your infrastructure_index methodology, explaining how people_per_building and other variables contribute to measuring infrastructure pressure.

Nightlight Dataset Variables (cell-ntl-2015-2020-2024.csv):#

Create the following temporal and development indicators:

  • nightlight_change_2015_2024 - Percentage change in total nightlight from 2015 to 2024

  • mean_nightlight_change_2015_2024 - Percentage change in mean nightlight from 2015 to 2024

  • lit_pixel_percentage - Use existing or calculate: (lit_pixel_count / pixel_count) * 100

Data Integration:#

Merge the datasets using the appropriate column.

Part B: Exploratory Data Analysis#

Correlation Analysis:#

  1. Correlation Heatmap: Create a heatmap showing correlations between 10 key variables (mix of demographic, infrastructure, and nightlight variables).

  2. Report the top 3 variable pairs with the highest correlations and interpret their relationships.

  3. Identify unexpected correlations and discuss potential explanations.

Nightlight Trend Analysis:#

  1. District Ranking: Report the top 5 districts with the highest nightlight growth (2015-2024) and bottom 5 districts with the most decline or lowest growth.

  2. Lit Pixel Analysis: Compare these districts using lit_pixel_percentage changes to understand whether growth represents intensification or spatial expansion.

  3. Create visualizations showing nightlight trends for these extreme districts.

Part C: Modeling#

Multivariate Linear Regression:#

  1. Model Development: Build a multivariate linear regression model predicting population density using both demographic and nightlight variables as predictors. Explore as many variables as possible at the beginning.

  2. Variable Selection: Test different combinations of variables and report the top 3 most predictive variables of population density.

  3. Model Evaluation: Report R-squared, coefficients, and statistical significance. Interpret what these results tell us about population-infrastructure relationships.

Notes and Other Requirements#

Please follow the genral guidelines below when preparing your analysis..

Statistical Analysis:#

  • Properly handle missing data and outliers

  • Use appropriate statistical tests and report p-values

  • Calculate and interpret correlation coefficients

  • Validate regression assumptions (normality, homoscedasticity)

Data Management:#

  • Document all data cleaning and aggregation steps using markdown

  • Ensure consistent district naming across datasets

Visualization Standards:#

  • Create clear, publication-quality heatmaps with appropriate color scales

  • Design effective time series plots for nightlight trends

  • Include proper axis labels, titles, and legends

  • Use consistent formatting across all visualizations

Reporting Requirements:#

  • Clearly state the top 3 most predictive variables with statistical justification

  • Provide ranked lists for nightlight growth districts with supporting metrics

  • Include model performance statistics and interpretation

  • Document all methodological choices and assumptions