Programming Assignment-2#
The goal of this assingment is to allow you to practice several the following things in Python:
Perfoming typical data processing (or preprocessing if you prefer). This includes all the typical data wraning such as creating news variables, combining several datasets and more
Running explolatory data analysis including basic plotting of variables
Perfoming basic inferential statisticals using statsmodels and scipy to run hypythesis testing and build simple statistial or econometric models.
Datasets#
For this assignment, you will use the following datasets:
Rwanda Health Indicators#
The Excel file was generated by combining multiple CSV files, each containing data on different health indicators for Rwanda, So that each sheet in the file represent one such indicator. See below some of the input files which were used:
access-to-health-care_subnational_rwa
child-mortality-rates_subnational_rwa
dhs-mobile_subnational_rwa
You can download the dataset from here.
Nights lights Data#
Please download it here and check the documentation in the cells below.
Popupation Dataset#
Please download it here and check the documentation and metadata in the class notebooks.
Submission Guidelines#
Please guidelines and complete all steps in the GitHub Workflow
Once you have completed your assignment, push chanegs to your repository.
Send a link (copy from within GitHub) to your notebook to the tutors/teaching assistants
Import Required Packages#
from pathlib import Path
import pandas as pd
Setup Input Folders#
As usual, it is good practice to set up input folders using the pathlib
package. In this section, make sure to define the folders where your data is stored on your machine.
I find it helpful to set up the working directory and input data folders right at the start of the notebook. To keep things organized, I use the naming convention: FILE_{NAME}
for files and DIR_{NAME}
for folders. We use capital letters because these are global variables that will be referenced throughout the notebook.
We’ll be using the pathlib
library, which offers several advantages over traditional string-based path handling:
Cross-platform compatibility - automatically handles path separators (
/
vs\
) across different operating systemsObject-oriented approach - paths are objects with useful methods rather than strings
Intuitive syntax - use
/
operator to join paths naturally:parent_dir / "subfolder" / "file.txt"
Built-in path operations - methods like
.exists()
,.is_file()
,.parent
,.stem
, and.suffix
Safer path manipulation - reduces errors from manual string concatenation and splitting
This is the recommended approach for managing file paths in modern Python development.
# Uncomment the following lines and add your code to define the directories and files
DIR_DATA = Path.cwd().parents[1].joinpath("data")
FILE_EXCEL = DIR_DATA/"RW-Health-Data.xlsx"
# Population by enumeration area (EA) for Malawi
# FILE_POP_MW = ADD YOUR CODE
Part 1: Processing Excel Files#
The primary goal is to preprocess an Excel file with multiple sheets into a unified CSV dataset that consolidates multiple indicators. Having all indicators in a single file at the same analytical unit (national, subnational) is more efficient than managing separate files and enables easier cross-indicator analysis.
Task 1: Generate National-Level Summaries#
For each indicator, compute a single national-level value using appropriate aggregation functions such as mean, sum or count. For this one, all available indicators can be summarized at national level, so we will have a CSV file with one row and
Expected Output Structure#
DataFrame display in Jupyter Notebook
CSV file with columns:
indicator_name
: Name of the indicatoraggregated_value
: Computed national valueindicator_year
: Survey year or something similarsurvey_name
: Name of the survey where information is coming fromaggregation_method
: Statistical method used (optional)
Task 2: Subnational-Level Indicator Dataset#
Create a merged dataset for indicators with subnational data (ADM2/ADM3 levels), ensuring spatial alignment and consistent administrative boundaries.
Expected Output Structure#
indicator_name
: Name of the indicatoraggregated_value
: Computed national valueindicator_year
: Survey year or something similarsurvey_name
: Name of the survey where information is coming fromaggregation_method
: Statistical method used (optional)
This structure enables both single-indicator and multi-indicator analysis at the subnational level.
Introduction to Nightlights Dataset#
What is Nightlight Data?#
Nightlight data is satellite imagery capturing artificial light emissions from Earth’s surface during nighttime. Satellites like VIIRS collect this data regularly, providing an objective, real-time measure of human economic activity and development.
Raw Data: Radiance Measurements#
The fundamental measurement in nightlight data is radiance - the amount of light energy detected by satellite sensors, measured in nanowatts per square centimeter per steradian (nW/cm²/sr). Each pixel in satellite imagery contains a radiance value representing the light intensity from that specific location on Earth’s surface.
Annual Composite Generation#
This dataset was created from annual composite images using VIIRS nightlight files for Rwanda. Annual composites are generated by:
Aggregating daily/monthly observations throughout each year (2015, 2020, 2024)
Filtering out temporary light sources (fires, lightning, aurora)
Removing cloud-affected observations to ensure clear measurements
Averaging or taking median values to create stable, representative annual measurements
Masking techniques to exclude areas with unreliable data
The files used include both average composites (average_masked
) and median composites (median_masked
), with cloud-free versions (vcmslcfg
) preferred over cloud-inclusive versions (vcmcfg
) for more accurate measurements.
Why Use Nightlight Data?#
Consistent global coverage - Available everywhere, regardless of local data quality
Real-time updates - More current than traditional economic statistics
Objective measurement - Not subject to reporting biases
High resolution - Captures local development patterns
Proxy for development - Light intensity correlates with economic activity, infrastructure, and quality of life
Dataset Overview#
6,507 observations across Rwanda’s administrative cells
Three time periods: 2015, 2020, 2024
Cell-level data - Rwanda’s smallest administrative units
Allows temporal analysis of development trends
Variable Definitions#
Administrative Identifiers#
cell_id
- Unique identifier for linking with other datasetsprovince_name
- Province (5 total in Rwanda)district_name
- District (30 total in Rwanda)sector_name
- Administrative level between district and cellcell_name
- Specific cell name
Core Nightlight Measurements#
total_nightlight
#
Sum of all radiance values within cell boundaries
Key indicator of overall economic activity/development
Higher values = more total development
mean_nightlight
#
Average radiance per pixel
Indicates development intensity regardless of cell size
Useful for comparing cells of different areas
median_nightlight
#
Middle radiance value of all pixels (less sensitive to outliers)
Better represents typical lighting in unevenly developed areas
max_nightlight
#
Highest radiance within cell
Indicates major infrastructure (hospitals, commercial centers)
min_nightlight
& std_nightlight
#
Minimum radiance and standard deviation
High std = uneven development within cell
Spatial Coverage Indicators#
pixel_count
#
Total pixels in cell (indicates geographic size)
Used to normalize other measurements
lit_pixel_count
#
Number of pixels with detectable light (radiance > 0)
Shows spatial extent of development
lit_pixel_percentage
#
Percentage of cell area with lighting
Formula:
(lit_pixel_count ÷ pixel_count) × 100
0% = completely dark, 100% = fully developed
year
#
Time period: 2015, 2020, or 2024
Part-2: Demographic and Nightlights Data#
Part A: Varible Generation and Data Integration#
Population Dataset Variables (rwa-cell-pop.csv
):#
Create the following derived variables:
dependency_ratio
-(children_under_five_2020 + elderly_60_plus_2020) / working_age_population * 100
people_per_building
-general_2020 / building_count
working_age_population
-general_2020 - children_under_five_2020 - elderly_60_plus_2020
infrastructure_index
- Your own formula that incorporatespeople_per_building
and other relevant variables to measure infrastructure adequacy. Document and justify yourinfrastructure_index
methodology, explaining howpeople_per_building
and other variables contribute to measuring infrastructure pressure.
Nightlight Dataset Variables (cell-ntl-2015-2020-2024.csv
):#
Create the following temporal and development indicators:
nightlight_change_2015_2024
- Percentage change in total nightlight from 2015 to 2024mean_nightlight_change_2015_2024
- Percentage change in mean nightlight from 2015 to 2024lit_pixel_percentage
- Use existing or calculate:(lit_pixel_count / pixel_count) * 100
Data Integration:#
Merge the datasets using the appropriate column.
Part B: Exploratory Data Analysis#
Correlation Analysis:#
Correlation Heatmap: Create a heatmap showing correlations between 10 key variables (mix of demographic, infrastructure, and nightlight variables).
Report the top 3 variable pairs with the highest correlations and interpret their relationships.
Identify unexpected correlations and discuss potential explanations.
Nightlight Trend Analysis:#
District Ranking: Report the top 5 districts with the highest nightlight growth (2015-2024) and bottom 5 districts with the most decline or lowest growth.
Lit Pixel Analysis: Compare these districts using
lit_pixel_percentage
changes to understand whether growth represents intensification or spatial expansion.Create visualizations showing nightlight trends for these extreme districts.
Part C: Modeling#
Multivariate Linear Regression:#
Model Development: Build a multivariate linear regression model predicting population density using both demographic and nightlight variables as predictors. Explore as many variables as possible at the beginning.
Variable Selection: Test different combinations of variables and report the top 3 most predictive variables of population density.
Model Evaluation: Report R-squared, coefficients, and statistical significance. Interpret what these results tell us about population-infrastructure relationships.
Notes and Other Requirements#
Please follow the genral guidelines below when preparing your analysis..
Statistical Analysis:#
Properly handle missing data and outliers
Use appropriate statistical tests and report p-values
Calculate and interpret correlation coefficients
Validate regression assumptions (normality, homoscedasticity)
Data Management:#
Document all data cleaning and aggregation steps using markdown
Ensure consistent district naming across datasets
Visualization Standards:#
Create clear, publication-quality heatmaps with appropriate color scales
Design effective time series plots for nightlight trends
Include proper axis labels, titles, and legends
Use consistent formatting across all visualizations
Reporting Requirements:#
Clearly state the top 3 most predictive variables with statistical justification
Provide ranked lists for nightlight growth districts with supporting metrics
Include model performance statistics and interpretation
Document all methodological choices and assumptions