Programming Assignment-2

The goal of this assignment is to let you practice the following in Python:

  1. Performing typical data processing (or preprocessing if you prefer). This includes common data wrangling tasks such as creating new variables and combining several datasets.

  2. Running exploratory data analysis, including basic plotting of variables.

  3. Performing basic inferential statistics using statsmodels and scipy to run hypothesis tests and build simple statistical or econometric models.

Datasets

For this assignment, you will use the following datasets:

Rwanda Health Indicators

The Excel file was generated by combining multiple CSV files, each containing data on a different health indicator for Rwanda, so that each sheet in the file represents one indicator. Some of the input files that were used are listed below:

  • access-to-health-care_subnational_rwa

  • child-mortality-rates_subnational_rwa

  • dhs-mobile_subnational_rwa

You can download the dataset from here.

Submission Guidelines

  • Please

Import Required Packages

from pathlib import Path
import pandas as pd

Set Up Input Folders

As usual, it is good practice to set up input folders using the pathlib package. In this section, make sure to define the folders where your data is stored on your machine.

I find it helpful to set up the working directory and input data folders right at the start of the notebook. To keep things organized, I use the naming convention: FILE_{NAME} for files and DIR_{NAME} for folders. We use capital letters because these are global variables that will be referenced throughout the notebook.

We’ll be using the pathlib library, which offers several advantages over traditional string-based path handling:

  • Cross-platform compatibility - automatically handles path separators (/ vs \) across different operating systems

  • Object-oriented approach - paths are objects with useful methods rather than strings

  • Intuitive syntax - use / operator to join paths naturally: parent_dir / "subfolder" / "file.txt"

  • Built-in path operations - methods like .exists(), .is_file(), .parent, .stem, and .suffix

  • Safer path manipulation - reduces errors from manual string concatenation and splitting

This is the recommended approach for managing file paths in modern Python development.
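The features listed above can be seen in a minimal sketch like the one below. The folder name `project` and the `*_EXAMPLE` variables are hypothetical and only serve to illustrate the API; nothing here needs the file to exist on disk.

```python
from pathlib import Path

# Hypothetical folder and file names, used only to illustrate the API
DIR_EXAMPLE = Path("project") / "data"
FILE_EXAMPLE = DIR_EXAMPLE / "RW-Health-Data.xlsx"

print(FILE_EXAMPLE.name)      # file name with extension
print(FILE_EXAMPLE.stem)      # file name without extension
print(FILE_EXAMPLE.suffix)    # extension, including the leading dot
print(FILE_EXAMPLE.parent)    # containing folder
print(FILE_EXAMPLE.exists())  # True only if the file is actually on disk
```

Calling `.exists()` on your own input file right after defining it is a cheap way to catch a wrong path before any data loading starts.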

# Define the data directory and the input Excel file; adjust these paths to match your machine
DIR_DATA = Path.cwd().parents[1] / "data"
FILE_EXCEL = DIR_DATA / "RW-Health-Data.xlsx"

# Population by enumeration area (EA) for Malawi
# FILE_POP_MW = ADD YOUR CODE

Part 1: Processing Excel Files

The primary goal is to preprocess an Excel file with multiple sheets into a unified CSV dataset that consolidates multiple indicators. Having all indicators in a single file at the same analytical unit (national, subnational) is more efficient than managing separate files and enables easier cross-indicator analysis.
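One possible starting point, sketched under assumptions: `pd.read_excel` with `sheet_name=None` returns a dictionary mapping each sheet name to a DataFrame, and those frames can then be stacked into one table. The helper name `stack_sheets`, the `value` column, and the sheet names below are hypothetical; the real sheets may need column harmonization before they can be stacked.

```python
import pandas as pd


def stack_sheets(sheets: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Stack per-indicator sheets into one long DataFrame.

    `sheets` maps sheet name -> DataFrame, which is the shape returned by
    pd.read_excel(..., sheet_name=None).
    """
    frames = []
    for sheet_name, df in sheets.items():
        # Record which sheet (indicator) each row came from
        frames.append(df.assign(indicator_name=sheet_name))
    return pd.concat(frames, ignore_index=True)


# With the real file (needs an Excel engine such as openpyxl installed):
# sheets = pd.read_excel(FILE_EXCEL, sheet_name=None)

# Tiny made-up sheets so the sketch runs without the file
sheets = {
    "indicator_a": pd.DataFrame({"value": [1.0, 2.0]}),
    "indicator_b": pd.DataFrame({"value": [3.0]}),
}
combined = stack_sheets(sheets)
print(combined)
```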

Task 1: Generate National-Level Summaries

For each indicator, compute a single national-level value using an appropriate aggregation function such as mean, sum, or count. For this task, all available indicators can be summarized at the national level, so we will have a CSV file with one row and

Expected Output Structure

  1. DataFrame display in Jupyter Notebook

  2. CSV file with columns:

  • indicator_name: Name of the indicator

  • aggregated_value: Computed national value

  • indicator_year: Survey year, or the closest available reference year

  • survey_name: Name of the survey the data comes from

  • aggregation_method: Statistical method used (optional)
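A minimal sketch of how such a table could be built, assuming each sheet has already been loaded into a DataFrame. The input column names (`value`, `survey_year`, `survey_name`), the helper `national_summary`, and all the numbers are made up for illustration; the real sheets will likely use different column names, and the right aggregation method depends on the indicator.

```python
import pandas as pd


def national_summary(sheets, methods, default="mean"):
    """Return one row per indicator with its national-level value."""
    rows = []
    for name, df in sheets.items():
        method = methods.get(name, default)
        rows.append(
            {
                "indicator_name": name,
                # Series.agg accepts names such as "mean", "sum" or "count"
                "aggregated_value": df["value"].agg(method),
                "indicator_year": df["survey_year"].max(),
                "survey_name": df["survey_name"].iloc[0],
                "aggregation_method": method,
            }
        )
    return pd.DataFrame(rows)


# Made-up sheets standing in for the real Excel data
sheets = {
    "indicator_a": pd.DataFrame(
        {"value": [10.0, 20.0], "survey_year": [2019, 2019],
         "survey_name": ["Survey X", "Survey X"]}
    ),
    "indicator_b": pd.DataFrame(
        {"value": [5.0, 7.0], "survey_year": [2020, 2020],
         "survey_name": ["Survey Y", "Survey Y"]}
    ),
}
summary = national_summary(sheets, methods={"indicator_b": "sum"})
print(summary)

# Writing the result out (the output file name is hypothetical):
# summary.to_csv(DIR_DATA / "national-summary.csv", index=False)
```

Keeping the aggregation method in a per-indicator dictionary makes the choice explicit and easy to review, since a mean is rarely appropriate for counts and a sum is rarely appropriate for rates.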

Task 2: Subnational-Level Indicator Dataset

Create a merged dataset for indicators with subnational data (ADM2/ADM3 levels), ensuring spatial alignment and consistent administrative boundaries.

Expected Output Structure

  • indicator_name: Name of the indicator

  • aggregated_value: Computed value for the subnational unit

  • indicator_year: Survey year, or the closest available reference year

  • survey_name: Name of the survey the data comes from

  • aggregation_method: Statistical method used (optional)

This structure enables both single-indicator and multi-indicator analysis at the subnational level.
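As a rough illustration of the two modes of analysis, the sketch below keeps the data in long format, harmonizes the administrative unit names, and then pivots to a wide table. The `adm2_name` column, the indicator names, and the values are all hypothetical; the real data may identify units by code rather than by name, in which case the code is the safer join key.

```python
import pandas as pd

# Made-up long-format rows; "adm2_name" is a hypothetical column
long_df = pd.DataFrame(
    {
        "adm2_name": [" district a", "District B", "District A", "district b "],
        "indicator_name": ["ind_1", "ind_1", "ind_2", "ind_2"],
        "aggregated_value": [0.4, 0.6, 30.0, 45.0],
    }
)

# Harmonize spelling so the same unit matches across indicators
long_df["adm2_name"] = long_df["adm2_name"].str.strip().str.title()

# Single-indicator analysis: filter the long table
ind_1 = long_df[long_df["indicator_name"] == "ind_1"]

# Multi-indicator analysis: one row per unit, one column per indicator
wide_df = long_df.pivot(
    index="adm2_name", columns="indicator_name", values="aggregated_value"
)
print(wide_df)
```

Note that `pivot` raises an error if a unit appears more than once for the same indicator, which doubles as a check that the administrative names were harmonized consistently.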