Programming Assignment-2#
The goal of this assignment is to let you practice the following in Python:
Performing typical data processing (or preprocessing, if you prefer). This includes typical data wrangling tasks such as creating new variables, combining several datasets, and more
Running exploratory data analysis, including basic plotting of variables
Performing basic inferential statistics using statsmodels and scipy to run hypothesis tests and build simple statistical or econometric models.
Datasets#
For this assignment, you will use the following datasets:
Rwanda Health Indicators#
The Excel file was generated by combining multiple CSV files, each containing data on a different health indicator for Rwanda, so that each sheet in the file represents one such indicator. Some of the input files that were used are listed below:
access-to-health-care_subnational_rwa
child-mortality-rates_subnational_rwa
dhs-mobile_subnational_rwa
You can download the dataset from here.
Submission Guidelines#
Please
Import Required Packages#
from pathlib import Path
import pandas as pd
Setup Input Folders#
As usual, it is good practice to set up input folders using the pathlib package. In this section, make sure to define the folders where your data is stored on your machine.
I find it helpful to set up the working directory and input data folders right at the start of the notebook. To keep things organized, I use the naming convention FILE_{NAME} for files and DIR_{NAME} for folders. We use capital letters because these are global variables that will be referenced throughout the notebook.
We’ll be using the pathlib library, which offers several advantages over traditional string-based path handling:
Cross-platform compatibility - automatically handles path separators (/ vs \) across different operating systems
Object-oriented approach - paths are objects with useful methods rather than strings
Intuitive syntax - use the / operator to join paths naturally: parent_dir / "subfolder" / "file.txt"
Built-in path operations - methods and attributes like .exists(), .is_file(), .parent, .stem, and .suffix
Safer path manipulation - reduces errors from manual string concatenation and splitting
This is the recommended approach for managing file paths in modern Python development.
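The operations listed above can be tried out in a small sketch. The folder and file names here (a temporary `data` folder and `example-indicators.csv`) are made up for illustration and are not part of the assignment data.

```python
from pathlib import Path
import tempfile

# Work inside a temporary directory so the example runs anywhere
# without touching real data.
with tempfile.TemporaryDirectory() as tmp:
    DIR_DATA = Path(tmp) / "data"          # join paths with the / operator
    DIR_DATA.mkdir()

    # Hypothetical file name, used only for this demonstration
    FILE_EXAMPLE = DIR_DATA / "example-indicators.csv"
    FILE_EXAMPLE.write_text("indicator,value\n")

    file_exists = FILE_EXAMPLE.exists()    # does the path exist on disk?
    is_file = FILE_EXAMPLE.is_file()       # is it a regular file?
    parent_name = FILE_EXAMPLE.parent.name # name of the containing folder
    stem = FILE_EXAMPLE.stem               # file name without the extension
    suffix = FILE_EXAMPLE.suffix           # the extension, including the dot

print(file_exists, is_file, parent_name, stem, suffix)
```

Checking `.exists()` on DIR_DATA and each FILE_{NAME} variable right after defining them is a cheap way to catch a wrong path before any data loading starts.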
# Uncomment the following lines and add your code to define the directories and files
DIR_DATA = Path.cwd().parents[1].joinpath("data")
FILE_EXCEL = DIR_DATA / "RW-Health-Data.xlsx"
# Population by enumeration area (EA) for Malawi
# FILE_POP_MW = ADD YOUR CODE
Part 1: Processing Excel Files#
The primary goal is to preprocess an Excel file with multiple sheets into a unified CSV dataset that consolidates multiple indicators. Having all indicators in a single file at the same analytical unit (national, subnational) is more efficient than managing separate files and enables easier cross-indicator analysis.
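One possible starting point is to read every sheet at once and stack the sheets into a single long table. The sketch below assumes that the sheets share column names, which may not hold for the real workbook. The `Location` and `Value` columns, the `source_sheet` column, and the toy sheets are all hypothetical. `pd.read_excel(..., sheet_name=None)` returns a dictionary mapping sheet names to DataFrames.

```python
import pandas as pd


def load_sheets(file_excel):
    """Read every sheet of a workbook into a {sheet_name: DataFrame} dict."""
    # sheet_name=None asks pandas to load all sheets at once
    return pd.read_excel(file_excel, sheet_name=None)


def stack_sheets(sheets):
    """Stack a {sheet_name: DataFrame} dict into one long DataFrame."""
    frames = [
        df.assign(source_sheet=name)  # hypothetical column recording the origin
        for name, df in sheets.items()
    ]
    return pd.concat(frames, ignore_index=True)


# Toy stand-in for load_sheets(FILE_EXCEL); names and values are made up
toy_sheets = {
    "sheet_a": pd.DataFrame({"Location": ["X", "Y"], "Value": [1.0, 2.0]}),
    "sheet_b": pd.DataFrame({"Location": ["X"], "Value": [3.0]}),
}

df_long = stack_sheets(toy_sheets)
print(df_long)
```

With the real file, `stack_sheets(load_sheets(FILE_EXCEL))` would be the equivalent call, after checking which columns each sheet actually contains.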
Task 1: Generate National-Level Summaries#
For each indicator, compute a single national-level value using an appropriate aggregation function such as mean, sum, or count. For this task, all available indicators can be summarized at the national level, so we will have a CSV file with one row and
Expected Output Structure#
DataFrame display in Jupyter Notebook
CSV file with columns:
indicator_name: Name of the indicator
aggregated_value: Computed national value
indicator_year: Survey year or similar
survey_name: Name of the survey the information comes from
aggregation_method: Statistical method used (optional)
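A minimal sketch of how a table with these columns could be built is shown below. It assumes the data has already been reshaped into a long format with one row per observation. The input column `value`, the indicator and survey names, the numbers, and the `AGG_METHODS` mapping are all hypothetical; the right aggregation for each real indicator is a judgment you need to make yourself.

```python
import pandas as pd

# Toy long-format data; names and values are made up for illustration
df_long = pd.DataFrame(
    {
        "indicator_name": ["ind_a", "ind_a", "ind_b", "ind_b"],
        "survey_name": ["survey_x"] * 4,
        "indicator_year": [2020, 2020, 2020, 2020],
        "value": [10.0, 20.0, 5.0, 7.0],
    }
)

# Hypothetical choice of aggregation function for each indicator
AGG_METHODS = {"ind_a": "mean", "ind_b": "sum"}

group_cols = ["indicator_name", "indicator_year", "survey_name"]
rows = []
for (name, year, survey), grp in df_long.groupby(group_cols):
    method = AGG_METHODS[name]
    rows.append(
        {
            "indicator_name": name,
            "aggregated_value": grp["value"].agg(method),
            "indicator_year": year,
            "survey_name": survey,
            "aggregation_method": method,
        }
    )

df_national = pd.DataFrame(rows)
print(df_national)
```

Calling `df_national.to_csv(...)` with `index=False` on a path built from DIR_DATA would then write the CSV file.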
Task 2: Subnational-Level Indicator Dataset#
Create a merged dataset for indicators with subnational data (ADM2/ADM3 levels), ensuring spatial alignment and consistent administrative boundaries.
Expected Output Structure#
indicator_name: Name of the indicator
aggregated_value: Computed national value
indicator_year: Survey year or similar
survey_name: Name of the survey the information comes from
aggregation_method: Statistical method used (optional)
This structure enables both single-indicator and multi-indicator analysis at the subnational level.
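One way to approach the merging step is sketched below. It covers only consistency of administrative unit names, not spatial alignment of boundaries. The key column `adm2_name`, the unit names, and the indicator columns are hypothetical. The example normalizes the key, merges with an outer join, and uses the `indicator` and `validate` options of `DataFrame.merge` to surface units that do not match.

```python
import pandas as pd

# Toy subnational tables; unit names and values are made up
df_a = pd.DataFrame(
    {"adm2_name": ["District A ", "district b"], "ind_a": [1.0, 2.0]}
)
df_b = pd.DataFrame(
    {"adm2_name": ["District A", "District B", "District C"],
     "ind_b": [3.0, 4.0, 5.0]}
)


def clean_key(names):
    """Strip whitespace and unify capitalization of admin-unit names."""
    return names.str.strip().str.title()


df_a["adm2_name"] = clean_key(df_a["adm2_name"])
df_b["adm2_name"] = clean_key(df_b["adm2_name"])

# validate raises an error if a unit appears more than once in either table;
# indicator adds a _merge column showing where each row came from
df_sub = df_a.merge(
    df_b, on="adm2_name", how="outer", validate="one_to_one", indicator=True
)

# Units present in only one of the two tables need manual review
unmatched = df_sub[df_sub["_merge"] != "both"]
print(df_sub)
print(unmatched["adm2_name"].tolist())
```

Reviewing `unmatched` before dropping the `_merge` column helps separate genuine gaps in coverage from spelling differences in unit names.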