Data Processing with Pandas#

What is Data Processing or Preprocessing?#

This notebook introduces essential data processing techniques in Python using pandas and other helpful libraries.

Data processing (or preprocessing) refers to the steps we take to transform raw data into a clean, structured format suitable for downstream tasks such as analysis, modeling, or visualization. Common preprocessing steps include:

  1. Ingesting raw data from various sources

  2. Cleaning data by handling missing values and removing irrelevant columns

  3. Performing transformations such as type conversions, variable creation, and reshaping

  4. Merging and combining datasets from different sources

In this notebook, we’ll go beyond the basics by addressing common challenges that arise at each of these stages.

Learning Outcomes#

By the end of this notebook, participants will be able to:

  • Load and explore datasets using pandas

  • Work with large files and manage memory efficiently

  • Clean inconsistent or messy data types

  • Combine datasets using joins and merges

Import Required Libraries#

from pathlib import Path
import pandas as pd

Datasets#

In this notebook, we will be working with the following datasets:

  1. hh_data_ml.csv. A sample of a census dataset containing household-level data.

  2. RW-Health-Data.xlsx. A compilation of health indicators for Rwanda, including national-level and subnational-level data.

  3. simulated_cdrs. Folder with multiple small files simulating call detail records (CDRs).

  4. rw-pop-density-gridded. A folder with population density data for Rwanda, including children under five and elderly populations.

Download all the datasets, unzip them if needed, and keep all the data in the project data folder.

Data Inputs Setup#

In this section, make sure to define the folders where your data is stored on your machine.
I find it helpful to set up the working directory and input data folders right at the start of the notebook.
To keep things organized, I use the naming convention: FILE_{NAME} for files and DIR_{NAME} for folders.

We’ll be using the pathlib library—it’s the recommended approach for managing file paths in Python.

# ========================================================================
# ENSURE THAT YOU MODIFY THESE PATHS TO MATCH YOUR LOCAL SETUP
# =========================================================================
DIR_DATA = Path.cwd().parents[1] / "data"

# Large CSV file
FILE_HH_ML = DIR_DATA / "hh_data_ml.csv"
# Multiple CSV files 
DIR_MULTIPLE_CSV = DIR_DATA / "simulated_cdrs"

# Rwanda Health indicators Excel file
FILE_RWANDA_HEALTH_INDICATORS = DIR_DATA / "RW-Health-Data.xlsx"

# Rwanda population density files
DIR_RWANDA_POPULATION_DENSITY = DIR_DATA / "population/rw-pop-density-gridded"

You can check whether a file or folder exists using the following code:

DIR_RWANDA_POPULATION_DENSITY.exists()
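To catch path problems early, you might also loop over all the inputs defined above and flag anything that is missing. A minimal sketch using the variables defined in this notebook:

# Verify that all expected inputs exist before running the rest of the notebook
for path in [FILE_HH_ML, DIR_MULTIPLE_CSV, FILE_RWANDA_HEALTH_INDICATORS, DIR_RWANDA_POPULATION_DENSITY]:
    status = "OK" if path.exists() else "MISSING"
    print(f"{status}: {path}")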

1. Data Ingestion and Preprocessing#

During data ingestion, several challenges can come up. A dataset may be very large, either in the number of rows or the number of columns. We can also have datasets with mixed data types in a single column, which creates its own problems. We will look at the following examples:

  1. Loading a large CSV file using chunking

  2. Loading multiple CSV files

  3. Loading a data file with a specialised data format and many columns

Loading a Large CSV File Using Chunking#

When working with very large CSV files, loading the entire file into memory at once may not be feasible due to limited RAM. In such cases, chunking allows us to read the file in smaller, more manageable pieces using the chunksize parameter in pandas.read_csv().

Chunking is especially useful for:

  • Processing datasets that exceed available memory

  • Applying operations incrementally (e.g., filtering or aggregating)

  • Streaming and processing data in pipelines

Key Considerations When Choosing Chunk Size#

  • Memory Constraints: Adjust chunk size based on available memory. Larger chunks are faster but require more RAM.

  • I/O Performance: Test different sizes to balance read speed and overhead from frequent reads.

  • Task Complexity: Simpler operations (e.g., counting) can use larger chunks; complex transformations may require smaller chunks for efficiency.

  • Final Aggregation: Plan how to combine results from all chunks after processing.

# Initialize an empty list to store chunks
chunks = []
chunk_size = 10000  # Adjust chunk size based on your system's memory

# Read the CSV file in chunks
cnt = 0
for chunk in pd.read_csv(FILE_HH_ML, chunksize=chunk_size, sep='|'):
    # Process each chunk if needed
    chunks.append(chunk)
    cnt += 1
    if cnt == 100:
        print(f"Processed {cnt} chunks...")
        break
Processed 100 chunks...
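Once the chunks have been collected, you still need to combine or aggregate them. A minimal sketch of both options, assuming (for the first one) that the selected chunks fit comfortably in memory:

# Option 1: combine the collected chunks into a single DataFrame
df_hh = pd.concat(chunks, ignore_index=True)
print(df_hh.shape)

# Option 2: aggregate incrementally so the full dataset never sits in memory,
# e.g. count the total number of rows across the whole file
n_rows = sum(len(chunk) for chunk in pd.read_csv(FILE_HH_ML, chunksize=chunk_size, sep='|'))
print(f"Total rows: {n_rows}")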

EXERCISE-1:

  • Read the pandas documentation on pd.concat: what inputs does it take, and what arguments are available?

  • Try different chunk sizes (e.g., 5000, 10000, 20000) and measure the time taken to read the file.

  • Use the time module to measure the time taken for each chunk size.

  • For this dataset, what other pandas strategies could you use to deal with the large file size?

EXERCISE-2: Further preprocessing during chunking loop

  • Remove unnecessary columns. Identify at least one column to remove

  • Filter/subset rows. In the urban_rural column, value 1 is Rural and 2 is Urban; use it to keep only households from one of the two categories.

Hints: In pandas, you can subset data to select rows based on a condition in multiple ways, as illustrated in the short example below.

  1. Use df.query: df.query("age > 30 and gender == 'Male'")

  2. Boolean masking: df[df['age'] > 30]
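For example, with a small hypothetical DataFrame containing age and gender columns, both approaches return the same rows:

# Hypothetical DataFrame to illustrate both subsetting approaches
df_example = pd.DataFrame({
    "age": [25, 40, 35, 52],
    "gender": ["Male", "Female", "Male", "Male"],
})

# 1. df.query with a string expression
subset_query = df_example.query("age > 30 and gender == 'Male'")

# 2. Boolean masking with the same condition
subset_mask = df_example[(df_example["age"] > 30) & (df_example["gender"] == "Male")]

print(subset_query.equals(subset_mask))  # True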

Loading Multiple CSV Files from a Directory#

When dealing with multiple CSV files, we can use a combination of pathlib and pandas to:

  • List all CSV files in a directory

  • Read each file into a dataframe

  • Combine all dataframes efficiently

Some key considerations:

  • Parallel Processing: For a large number of files, consider using parallel processing (see the sketch after the error-handling example below)

  • Memory Management: Monitor memory usage when combining multiple files

  • File Structure: Ensure consistent column structure across files

  • Error Handling: Implement robust error handling for corrupt/invalid files

Process without Error handling#

# Create a list of files from the folder:
files = list(DIR_MULTIPLE_CSV.glob("*.csv"))
files[:10]
# Process without Error handling 
df_list = []
for file in DIR_MULTIPLE_CSV.iterdir():
    print(f"Processing file: {file.parts[-1]}")
    df = pd.read_csv(file)
    df_list.append(df)
# Combine all DataFrames into a single DataFrame
df_combined = pd.concat(df_list, ignore_index=True)

Add Error Handling#

When dealing with multiple files, it’s essential to handle potential errors gracefully as many things can go wrong, such as:

  • File not found

  • Incorrect file format

  • Read errors (e.g., permission issues)

  • Data type mismatches and issues

### Let's Add Error Handling to Deal with Bad Files
df_list = []
for file in DIR_MULTIPLE_CSV.iterdir():
    print(f"Processing file: {file.parts[-1]}")
    try:
        df = pd.read_csv(file)
        df_list.append(df)
    except Exception as e:
        print(f"Error processing {file.parts[-1]}: {e}")
# Combine all DataFrames into a single DataFrame
df_combined = pd.concat(df_list, ignore_index=True)
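As noted in the key considerations above, reading a large number of files can also be parallelised. A minimal sketch using concurrent.futures from the standard library (the max_workers value is an arbitrary choice; threads work well here because reading CSVs is largely I/O-bound):

from concurrent.futures import ThreadPoolExecutor

def read_one(path):
    """Read a single CSV file, returning None if it cannot be parsed."""
    try:
        return pd.read_csv(path)
    except Exception as e:
        print(f"Error processing {path.name}: {e}")
        return None

csv_files = list(DIR_MULTIPLE_CSV.glob("*.csv"))
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(read_one, csv_files))

# Keep only the files that were read successfully
df_combined = pd.concat([df for df in results if df is not None], ignore_index=True)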

EXERCISE-3:

  • Instead of pathlib, use the glob package and a list comprehension to load the CSV files into a pandas DataFrame

  • What if there are other file types in the folder, such as Word or PDF files? Write code that can handle this problem.

2. Data Cleaning and Processing#

Data cleaning and transformation are crucial steps in preparing data for analysis. They involve identifying and correcting errors, inconsistencies, and formatting issues to ensure that the dataset is accurate, complete, and analysis-ready.

Typical data cleaning and processing tasks#

  • Handling missing values

    • Remove, impute, or flag missing entries

  • Removing duplicates

    • Identify and drop duplicate rows

  • Correcting data types

    • Convert columns to appropriate types (e.g., strings to dates, objects to numeric)

  • Cleaning text data

    • Strip whitespace, fix capitalization, remove unwanted characters

  • Handling invalid entries

    • Detect and fix out-of-range values or categorical mismatches

  • Standardizing formats

    • Ensure consistency in units, naming conventions, or column formats

  • Creating new variables

    • Derive new columns based on existing data

  • Reshaping data

    • Use pivot, melt, or stack/unstack to restructure the dataset

  • Merging and joining datasets

    • Combine multiple data sources using merge, concat, or join
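As a quick illustration, here is how a few of these tasks look on a small hypothetical DataFrame (the column names and values are made up for the example):

# Hypothetical raw data with messy text, wrong types, missing values and duplicates
df_raw = pd.DataFrame({
    "name": [" Alice ", "bob", "Alice ", None],
    "age": ["34", "29", "34", "41"],
    "visit_date": ["2020-01-05", "2020-02-17", "2020-01-05", "not recorded"],
})

# Clean text data: strip whitespace and standardize capitalization
df_raw["name"] = df_raw["name"].str.strip().str.title()

# Correct data types: invalid entries become NaN / NaT instead of raising errors
df_raw["age"] = pd.to_numeric(df_raw["age"], errors="coerce")
df_raw["visit_date"] = pd.to_datetime(df_raw["visit_date"], errors="coerce")

# Handle missing values and remove duplicate rows
df_clean = df_raw.dropna(subset=["name"]).drop_duplicates()
print(df_clean)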

Task-1#

Generate a single population density dataset from multiple files. We have the following CSV files with population density numbers for different age groups:

  • rwa_children_under_five_2020.csv

  • rwa_elderly_60_plus_2020.csv

  • rwa_general_2020.csv

  • rwa_men_2020.csv

  • rwa_women_2020.csv

  • rwa_youth_15_24_2020.csv

The objective is to combine all these variables into a single table.

Inspect the Datasets to Check Columns#

df1 = pd.read_csv(DIR_RWANDA_POPULATION_DENSITY/"rwa_children_under_five_2020.csv")
df2 = pd.read_csv(DIR_RWANDA_POPULATION_DENSITY/"rwa_elderly_60_plus_2020.csv")
df3 = pd.read_csv(DIR_RWANDA_POPULATION_DENSITY/"rwa_general_2020.csv")

We can verify whether all files have the same number of rows and columns#

# Get list of all CSV files in the directory
population_files = [file for file in DIR_RWANDA_POPULATION_DENSITY.iterdir() if file.suffix == '.csv']

# Create a dictionary to store dataframes and their info
dfs = {}
shapes = {}

# Load each CSV file and store shape information
for file in population_files:
    try:
        name = file.stem  # Get filename without extension
        df = pd.read_csv(file)
        dfs[name] = df
        shapes[name] = df.shape
        print(f"{name}: {df.shape}")
    except Exception as e:
        print(f"Error loading {file.name}: {e}")

# Check if all dataframes have same dimensions
first_shape = list(shapes.values())[0]
all_same = all(shape == first_shape for shape in shapes.values())
print(f"\nAll files have same dimensions: {all_same}")

# Show column names for each dataframe
print("\nColumn names in each file:")
for name, df in dfs.items():
    print(f"\n{name}:")
    print(df.columns.tolist())

Check number of unique observations in each dataset#
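Since the merge below relies on latitude and longitude, it is worth checking that each file has one row per coordinate pair. A minimal check on the three DataFrames loaded above (assuming each file has latitude and longitude columns, as the merge step expects):

# Count unique (latitude, longitude) pairs per DataFrame
for name, df in {"children_under_five": df1, "elderly_60_plus": df2, "general": df3}.items():
    n_unique = df.drop_duplicates(subset=["latitude", "longitude"]).shape[0]
    print(f"{name}: {len(df)} rows, {n_unique} unique latitude/longitude pairs")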

Processing Strategy#

Given what we have, we will use the longitude and latitude columns to merge all the files. This is okay for now, but since this is a spatial dataset there are better ways to do this, which we will see on Friday.

We can do the following to try to merge all of the files.

  • Load each CSV file into a pandas DataFrame

  • Create a new column lat_lon to hold a unique identifier

  • Check the number of unique observations in each DataFrame

  • Merge the DataFrames using the created lat_lon column

  • Check that all rows merged

# Dictionary to store DataFrames
pop_dfs = {}

# Load each CSV file and create lat_lon column
for file in DIR_RWANDA_POPULATION_DENSITY.glob('*.csv'):
    # Get name without extension as key
    name = file.stem
    
    # Read CSV file
    df = pd.read_csv(file)
    
    # Create lat_lon column
    df['lat_lon'] = df['latitude'].astype(str) + '_' + df['longitude'].astype(str)
    
    # Store in dictionary
    pop_dfs[name] = df
    
    # Print unique counts
    print(f"{name}:")
    print(f"Total rows: {len(df)}")
    print(f"Unique lat_lon combinations: {df['lat_lon'].nunique()}\n")

# Start with the first dataframe
merged_df = pop_dfs[list(pop_dfs.keys())[0]]

# Merge all other dataframes
for name in list(pop_dfs.keys())[1:]:
    # Get value column name from the file name
    value_col = [col for col in pop_dfs[name].columns if col not in ['latitude', 'longitude', 'lat_lon']][0]
    
    # Merge on lat_lon
    merged_df = merged_df.merge(
        pop_dfs[name][['lat_lon', value_col]], 
        on='lat_lon', 
        how='outer'
    )

# Drop the lat_lon column used for merging
merged_df = merged_df.drop('lat_lon', axis=1)

# Display info about final merged dataset
print("\nFinal merged dataset info:")
print(merged_df.info())

EXERCISE-4:

  • Instead of a dictionary, lets use a list to store the DataFrames and then merge them sequentially.

  • Find out a way to check to make sure all rows merged.

3. Group or Individual Exercises in Data Processing with Pandas#

3.1 Processing Excel File#

About the Dataset#

The Excel file was generated by combining multiple CSV files, each containing data on different health indicators for Rwanda, such as:

  • access-to-health-care_subnational_rwa

  • child-mortality-rates_subnational_rwa

  • dhs-mobile_subnational_rwa

The dataset (RW-Health-Data.xlsx) is included in the project data folder described in the Datasets section above.
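To get started, you can load every sheet of the workbook at once; pandas returns a dictionary of DataFrames when sheet_name=None. This assumes each indicator sits on its own sheet, since the file was built from separate CSVs:

# Load all sheets into a dict of DataFrames keyed by sheet name
health_sheets = pd.read_excel(FILE_RWANDA_HEALTH_INDICATORS, sheet_name=None)
print(list(health_sheets.keys()))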

Task 1: Generate National-Level Summaries#

For each indicator, your goal is to compute a single national-level value. Depending on the nature of the indicator, you may use aggregation functions such as mean, median, or sum.

The final output should be a dataframe printed in Jupyter Notebook as well as a saved CSV file with the following columns:

  • indicator_name: The name of the indicator

  • value: The aggregated national value. You may name this column based on your chosen aggregation method, e.g., mean_indicator_name or median_indicator_name.

Task 2: Subnational Level Indicator Dataset#

For indicators with subnational (administrative level 2 or 3) data available, let's merge them and create a dataset with all the available indicators. The output dataset should have the following columns:

Summary#

In this notebook you explored how to ingest datasets and perform basic preprocessing steps.

Pushing to GitHub#

Since we have a GitHub repository for this project, let's make sure we push the changes we have made to our repo on GitHub.
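A minimal sequence of commands, assuming the repository is already cloned and a remote is configured (run these from a terminal, or prefix them with ! in a notebook cell; the commit message is just an example):

!git status
!git add .
!git commit -m "Add data processing notebook"
!git push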