Data Processing with Pandas#
What is Data Processing or Preprocessing?#
This notebook introduces essential data processing techniques in Python using pandas
and other helpful libraries.
Data processing (or preprocessing) refers to the steps we take to transform raw data into a clean, structured format suitable for downstream tasks such as analysis, modeling, or visualization. Common preprocessing steps include:
Ingesting raw data from various sources
Cleaning data by handling missing values and removing irrelevant columns
Performing transformations such as type conversions, variable creation, and reshaping
Merging and combining datasets from different sources
In this notebook, we’ll go beyond the basics by addressing common challenges that arise at each of these stages.
Learning Outcomes#
By the end of this notebook, participants will be able to:
Load and explore datasets using pandas
Work with large files and manage memory efficiently
Clean inconsistent or messy data types
Combine datasets using joins and merges
Import Required Libraries#
from pathlib import Path
import pandas as pd
Datasets#
In this notebook, we will be working with the following datasets:
hh_data_ml.csv. A sample census dataset containing household-level data.
RW-Health-Data.xlsx. A compilation of health indicators for Rwanda, including national-level and subnational-level data.
simulated_cdrs. A folder with multiple small files simulating call detail records (CDRs).
rw-pop-density-gridded. A folder with population density data for Rwanda, including children under five and elderly populations.
Download all the datasets. If needed, unzip them, and keep all the data in the project data folder.
Data Inputs Setup#
In this section, make sure to define the folders where your data is stored on your machine.
I find it helpful to set up the working directory and input data folders right at the start of the notebook.
To keep things organized, I use the naming convention `FILE_{NAME}` for files and `DIR_{NAME}` for folders.
We'll be using the `pathlib` library, which is the recommended approach for managing file paths in Python.
# ========================================================================
# ENSURE THAT YOU MODIFY THESE PATHS TO MATCH YOUR LOCAL SETUP
# =========================================================================
DIR_DATA = Path.cwd().parents[1] / "data"
# Large CSV file
FILE_HH_ML = DIR_DATA / "hh_data_ml.csv"
# Multiple CSV files
DIR_MULTIPLE_CSV = DIR_DATA / "simulated_cdrs"
# Rwanda health indicators Excel file
FILE_RWANDA_HEALTH_INDICATORS = DIR_DATA / "RW-Health-Data.xlsx"
# Rwanda population density files
DIR_RWANDA_POPULATION_DENSITY = DIR_DATA / "population/rw-pop-density-gridded"
You can check whether a file or folder exists using the following code:
DIR_RWANDA_POPULATION_DENSITY.exists()
1. Data Ingestion and Preprocessing#
During data ingestion, several challenges can come up, such as loading a dataset that is very large due to the number of rows or columns. We can also have datasets with mixed data types in a single column, which can create problems. We will look at the following examples:
Loading a large CSV file using chunking
Loading multiple CSV files
Loading a data file with a specialised data format and many columns
Loading a Large CSV File Using Chunking#
When working with very large CSV files, loading the entire file into memory at once may not be feasible due to limited RAM. In such cases, chunking allows us to read the file in smaller, more manageable pieces using the `chunksize` parameter in `pandas.read_csv()`.
Chunking is especially useful for:
Processing datasets that exceed available memory
Applying operations incrementally (e.g., filtering or aggregating)
Streaming and processing data in pipelines
Key Considerations When Choosing Chunk Size#
Memory Constraints: Adjust chunk size based on available memory. Larger chunks are faster but require more RAM.
I/O Performance: Test different sizes to balance read speed and overhead from frequent reads.
Task Complexity: Simpler operations (e.g., counting) can use larger chunks; complex transformations may require smaller chunks for efficiency.
Final Aggregation: Plan how to combine results from all chunks after processing.
# Initialize an empty list to store chunks
chunks = []
chunk_size = 10000  # Adjust chunk size based on your system's memory

# Read the CSV file in chunks
cnt = 0
for chunk in pd.read_csv(FILE_HH_ML, chunksize=chunk_size, sep='|'):
    # Process each chunk if needed
    chunks.append(chunk)
    cnt += 1
    if cnt == 100:
        print(f"Processed {cnt} chunks...")
        break
Processed 100 chunks...
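The loop above only collects the first 100 chunks into a list. As a minimal sketch (using the `chunks` list and `chunk_size` defined above, and the `urban_rural` column from the exercises below purely as an illustration), you can either combine the chunks with `pd.concat` or aggregate incrementally so the full file never has to be held in memory:
# Combine the collected chunks into a single dataframe
df_hh = pd.concat(chunks, ignore_index=True)
print(df_hh.shape)

# Alternative: aggregate incrementally so the full file never sits in memory.
# NOTE: 'urban_rural' is used here purely for illustration.
row_count = 0
category_counts = pd.Series(dtype="float64")
for chunk in pd.read_csv(FILE_HH_ML, chunksize=chunk_size, sep='|'):
    row_count += len(chunk)
    category_counts = category_counts.add(chunk['urban_rural'].value_counts(), fill_value=0)
print(f"Total rows: {row_count}")
print(category_counts)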
EXERCISE-1:
Read the pandas documentation on `pd.concat`: what inputs does it take, and what arguments are available?
Try different chunk sizes (e.g., 5000, 10000, 20000) and measure the time taken to read the file. Use the `time` module to measure the time taken for each chunk size.
For this dataset, what other `pandas` strategy would you use to deal with the large file size?
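A possible starting point for the timing part (a sketch only, assuming the same file and separator as above):
import time

# Time a full chunked read for several chunk sizes
for size in [5000, 10000, 20000]:
    start = time.time()
    n_rows = 0
    for chunk in pd.read_csv(FILE_HH_ML, chunksize=size, sep='|'):
        n_rows += len(chunk)
    elapsed = time.time() - start
    print(f"chunksize={size}: read {n_rows} rows in {elapsed:.2f} seconds")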
EXERCISE-2: Further preprocessing during the chunking loop
Remove unnecessary columns. Identify at least one column to remove.
Filter/subset rows. Given that in the `urban_rural` column the value 1 is Rural and 2 is Urban, keep only the households in one of the two categories.
Hints: In pandas, you can subset rows based on a condition in several ways.
Use `df.query`, e.g. df.query("age > 30 and gender == 'Male'")
Use boolean masking, e.g. df[df['age'] > 30]
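For example, a minimal sketch that applies both hints inside the chunking loop (the column to drop is a placeholder you should replace, and the 1 = Rural coding is taken from the exercise text):
filtered_chunks = []
for chunk in pd.read_csv(FILE_HH_ML, chunksize=chunk_size, sep='|'):
    # Drop a column you consider unnecessary ('some_column' is a placeholder)
    chunk = chunk.drop(columns=["some_column"], errors="ignore")
    # Keep only rural households, assuming 1 = Rural in the urban_rural column
    chunk = chunk[chunk["urban_rural"] == 1]
    filtered_chunks.append(chunk)

df_filtered = pd.concat(filtered_chunks, ignore_index=True)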
Loading Multiple CSV Files from a Directory#
When dealing with multiple CSV files, we can use a combination of `pathlib` and `pandas` to:
List all CSV files in a directory
Read each file into a dataframe
Combine all dataframes efficiently
Some key considerations:
Parallel Processing: For a large number of files, consider using parallel processing (a sketch follows the error-handling example below)
Memory Management: Monitor memory usage when combining multiple files
File Structure: Ensure a consistent column structure across files
Error Handling: Implement robust error handling for corrupt/invalid files
Process without Error handling#
# Create a list of files from the folder:
files = list(DIR_MULTIPLE_CSV.glob("*.csv"))
files[:10]
# Process without error handling
df_list = []
for file in DIR_MULTIPLE_CSV.iterdir():
    print(f"Processing file: {file.parts[-1]}")
    df = pd.read_csv(file)
    df_list.append(df)

# Combine all DataFrames into a single DataFrame
df_combined = pd.concat(df_list, ignore_index=True)
Add Error Handling#
When dealing with multiple files, it’s essential to handle potential errors gracefully, as many things can go wrong, such as:
File not found
Incorrect file format
Read errors (e.g., permission issues)
Data type mismatches and issues
# Let's add error handling to deal with bad files
df_list = []
for file in DIR_MULTIPLE_CSV.iterdir():
    print(f"Processing file: {file.parts[-1]}")
    try:
        df = pd.read_csv(file)
        df_list.append(df)
    except Exception as e:
        print(f"Error processing {file.parts[-1]}: {e}")

# Combine all DataFrames into a single DataFrame
df_combined = pd.concat(df_list, ignore_index=True)
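The considerations above mention parallel processing. Here is a minimal sketch using Python's standard `concurrent.futures` module; it reads the CSV files in worker threads, which mainly helps when the reads are I/O-bound, and reuses the same error handling as above:
from concurrent.futures import ThreadPoolExecutor

def read_one(path):
    """Read a single CSV file, returning None if it cannot be read."""
    try:
        return pd.read_csv(path)
    except Exception as e:
        print(f"Error processing {path.name}: {e}")
        return None

# Read files in parallel threads, then drop any that failed
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(read_one, DIR_MULTIPLE_CSV.glob("*.csv")))

df_combined = pd.concat([df for df in results if df is not None], ignore_index=True)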
EXERCISE-3:
Instead of `pathlib`, use the `glob` module and a list comprehension to load the CSV files into a pandas DataFrame.
What if there are other file types in the folder, such as Word files or PDFs? Write code which can handle this problem.
2. Data Cleaning and Processing#
Data cleaning and transformation are crucial steps in preparing data for analysis. They involve identifying and correcting errors, inconsistencies, and formatting issues to ensure that the dataset is accurate, complete, and analysis-ready.
Typical data cleaning and processing tasks#
Handling missing values: remove, impute, or flag missing entries
Removing duplicates: identify and drop duplicate rows
Correcting data types: convert columns to appropriate types (e.g., strings to dates, objects to numeric)
Cleaning text data: strip whitespace, fix capitalization, remove unwanted characters
Handling invalid entries: detect and fix out-of-range values or categorical mismatches
Standardizing formats: ensure consistency in units, naming conventions, or column formats
Creating new variables: derive new columns based on existing data
Reshaping data: use `pivot`, `melt`, or `stack/unstack` to restructure the dataset
Merging and joining datasets: combine multiple data sources using `merge`, `concat`, or `join`
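To make a few of these tasks concrete, here is a small illustrative sketch on a toy dataframe (the data and column names are invented purely for demonstration):
# Toy dataframe with common data quality problems
toy = pd.DataFrame({
    "name": [" Alice", "bob ", "bob ", None],
    "joined": ["2021-01-05", "2021-02-10", "2021-02-10", "2021-03-15"],
    "score": ["10", "12", "12", "not available"],
})

# Handle missing values: flag rows with a missing name
toy["name_missing"] = toy["name"].isna()

# Remove duplicate rows
toy = toy.drop_duplicates()

# Correct data types: strings to dates, objects to numeric
toy["joined"] = pd.to_datetime(toy["joined"])
toy["score"] = pd.to_numeric(toy["score"], errors="coerce")

# Clean text data: strip whitespace and fix capitalization
toy["name"] = toy["name"].str.strip().str.title()

# Create a new variable derived from existing data
toy["joined_month"] = toy["joined"].dt.month
toy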
Task-1#
Generate a single population density dataset from multiple files. We have the following CSV files with population density numbers for different age groups:
rwa_children_under_five_2020.csv
rwa_elderly_60_plus_2020.csv
rwa_general_2020.csv
rwa_men_2020.csv
rwa_women_2020.csv
rwa_youth_15_24_2020.csv
The objective is to combine all these files into a single table with all the variables.
Inspect the Datasets to Check Columns#
df1 = pd.read_csv(DIR_RWANDA_POPULATION_DENSITY/"rwa_children_under_five_2020.csv")
df2 = pd.read_csv(DIR_RWANDA_POPULATION_DENSITY/"rwa_elderly_60_plus_2020.csv")
df3 = pd.read_csv(DIR_RWANDA_POPULATION_DENSITY/"rwa_general_2020.csv")
We can verify whether all files have the same number of rows and columns#
# Get list of all CSV files in the directory
population_files = [file for file in DIR_RWANDA_POPULATION_DENSITY.iterdir() if file.suffix == '.csv']

# Create a dictionary to store dataframes and their info
dfs = {}
shapes = {}

# Load each CSV file and store shape information
for file in population_files:
    try:
        name = file.stem  # Get filename without extension
        df = pd.read_csv(file)
        dfs[name] = df
        shapes[name] = df.shape
        print(f"{name}: {df.shape}")
    except Exception as e:
        print(f"Error loading {file.name}: {e}")

# Check if all dataframes have the same dimensions
first_shape = list(shapes.values())[0]
all_same = all(shape == first_shape for shape in shapes.values())
print(f"\nAll files have same dimensions: {all_same}")

# Show column names for each dataframe
print("\nColumn names in each file:")
for name, df in dfs.items():
    print(f"\n{name}:")
    print(df.columns.tolist())
Check number of unique observations in each dataset#
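A minimal sketch, assuming the `dfs` dictionary built in the previous cell and that each file has `latitude` and `longitude` columns:
# Count unique latitude/longitude pairs in each dataframe
for name, df in dfs.items():
    n_unique = df[["latitude", "longitude"]].drop_duplicates().shape[0]
    print(f"{name}: {len(df)} rows, {n_unique} unique latitude/longitude pairs")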
Processing Strategy#
Given what we have, we will try using the `longitude` and `latitude` columns to merge all the files. This is okay for now, but since this is a spatial dataset there are better ways to do this, which we will see on Friday.
We can do the following to merge all of the files:
Load each CSV file into a pandas dataframe
Create a new column `lat_lon` to hold a unique identifier
Check the number of unique observations in each dataframe
Merge the dataframes using the created `lat_lon` column
Check that all rows merged
# Dictionary to store DataFrames
pop_dfs = {}

# Load each CSV file and create lat_lon column
for file in DIR_RWANDA_POPULATION_DENSITY.glob('*.csv'):
    # Get name without extension as key
    name = file.stem
    # Read CSV file
    df = pd.read_csv(file)
    # Create lat_lon column
    df['lat_lon'] = df['latitude'].astype(str) + '_' + df['longitude'].astype(str)
    # Store in dictionary
    pop_dfs[name] = df
    # Print unique counts
    print(f"{name}:")
    print(f"Total rows: {len(df)}")
    print(f"Unique lat_lon combinations: {df['lat_lon'].nunique()}\n")
# Start with the first dataframe
merged_df = pop_dfs[list(pop_dfs.keys())[0]]

# Merge all other dataframes
for name in list(pop_dfs.keys())[1:]:
    # Get the value column (the one that is not latitude, longitude or lat_lon)
    value_col = [col for col in pop_dfs[name].columns if col not in ['latitude', 'longitude', 'lat_lon']][0]
    # Merge on lat_lon
    merged_df = merged_df.merge(
        pop_dfs[name][['lat_lon', value_col]],
        on='lat_lon',
        how='outer'
    )

# Drop the lat_lon column used for merging
merged_df = merged_df.drop('lat_lon', axis=1)

# Display info about the final merged dataset
print("\nFinal merged dataset info:")
print(merged_df.info())
EXERCISE-4:
Instead of a dictionary, let's use a list to store the DataFrames and then merge them sequentially.
Find a way to check that all rows merged.
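A hint for the second part (a sketch, assuming the `pop_dfs` dictionary and the file names listed above): pandas' `merge` accepts an `indicator` argument that records where each row came from, and a `validate` argument that raises an error if the merge keys are duplicated:
left = pop_dfs["rwa_general_2020"]
right = pop_dfs["rwa_women_2020"]
check = left.merge(
    right.drop(columns=["latitude", "longitude"]),
    on="lat_lon",
    how="outer",
    indicator=True,         # adds a '_merge' column: 'both', 'left_only' or 'right_only'
    validate="one_to_one",  # raises an error if lat_lon is duplicated on either side
)
print(check["_merge"].value_counts())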
3. Group or Individual Exercises in Data Processing with Pandas#
3.1 Processing Excel File#
About the Dataset#
The Excel file was generated by combining multiple CSV files, each containing data on different health indicators for Rwanda, such as:
access-to-health-care_subnational_rwa
child-mortality-rates_subnational_rwa
dhs-mobile_subnational_rwa
You can download the dataset from here.
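A useful first step (a sketch, assuming each indicator sits on its own sheet of the workbook) is to load every sheet into a dictionary of dataframes and inspect what is available:
# sheet_name=None returns a dict mapping sheet name -> DataFrame
health_sheets = pd.read_excel(FILE_RWANDA_HEALTH_INDICATORS, sheet_name=None)

for sheet_name, sheet_df in health_sheets.items():
    print(sheet_name, sheet_df.shape)
    print(sheet_df.columns.tolist())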
Task 1: Generate National-Level Summaries#
For each indicator, your goal is to compute a single national-level value. Depending on the nature of the indicator, you may use aggregation functions such as mean, median, or sum.
The final output should be a dataframe printed in Jupyter Notebook as well as a saved CSV file with the following columns:
`indicator_name`: The name of the indicator
`value`: The aggregated national value. You may name this column based on your chosen aggregation method, e.g., `mean_indicator_name` or `median_indicator_name`.
Task 2: Subnational Level Indicator Dataset#
For indicators with subnational (administrative level 2 or 3) data available, let's merge them and create a dataset with all those available indicators. The output dataset should have the following columns:
Summary#
In this notebook you explored how to ingest datasets and perform basic preprocessing steps.
Pushing to GitHub#
Since we have a GitHub repository for this project, let's make sure we push the changes we have made to GitHub.