Introduction to Data Processing with Pandas#
Pandas is one of the most powerful and widely-used Python libraries for working with structured data. It provides fast, flexible, and intuitive tools to load, explore, clean, transform, and analyze data — especially tabular data like spreadsheets or CSV files. Whether you’re handling messy survey results, merging data from multiple sources, or preparing data for machine learning, pandas offers a robust foundation for data wrangling and analysis in Python. Its syntax is readable and expressive, making it ideal for both beginners and professionals.
This notebook introduces the fundamentals of working with tabular data using the pandas library.
Learning Outcomes#
By the end of this tutorial, participants will be able to:
Understand Pandas Core Data Structures#
DataFrames and Series
Create DataFrames and Series from other Python data structures (e.g., lists)
Load and Explore Data#
Load and Explore Data from different sources
Understand data structure using .head(), .info(), .describe()
Preprocess Data#
Identify and handle missing values using isna(), fillna(), dropna()
Convert data types using .astype() and to_datetime()
Rename and reorder columns
Transform and Manipulate Data#
Create new columns and apply functions with .apply() and vectorized operations
Use conditional logic for filtering and assignment
Subset and Filter Data#
Select columns and rows using [], .loc[], .iloc[]
Use .query() and boolean indexing to filter rows
Group and Aggregate Data#
Group data using .groupby() and apply aggregation functions
Calculate summaries like mean, median, and counts by groups
Merge and Join Datasets#
Combine datasets using merge(), concat(), and join()
Understand inner, outer, left, and right joins
import numpy as np
import pandas as pd
from pathlib import Path
Input Folders Setup#
To keep things fast, this notebook mostly uses synthetic data, with a few exceptions.
# ===========================================================
# MAKE SURE YOU CHANGE THE PATHS BELOW TO YOUR LOCAL SETUP
# ==========================================================
DIR_DATA = Path.cwd().parents[1] / "data"
FILE_EMPLOYEES = DIR_DATA / "synthetic-data-employees.csv"
FILE_CUSTOMERS = DIR_DATA / "synthetic-messy-customer-data.csv"
FILE_SALES = DIR_DATA / "synthetic-sales-data.csv"
# Multiple CSV files
DIR_MULTIPLE_CSV = DIR_DATA / "simulated_cdrs"
# Population Density Datasets
DIR_POPULATION_DENSITY = DIR_DATA / "population/rw-pop-density-gridded"
Getting Help with Pandas Functions#
Understanding how to use functions and methods in Python—especially in pandas—requires consulting the documentation. It’s essential to read the documentation to learn what each function does and what its arguments mean.
In pandas, there are several convenient ways to access help:
help(pd.DataFrame) – Use the built-in help() function to view documentation for any pandas class or function.
pd.read_csv? or ?pd.read_csv – In Jupyter Notebook or IPython, placing a ? before or after a function displays its signature and docstring.
df.apply? – If df is your DataFrame, you can inspect the apply method using the same approach.
These tools are invaluable when exploring new methods or troubleshooting unexpected behavior.
For more detailed and up-to-date information, refer to the official pandas API documentation. It provides comprehensive explanations, examples, and parameter details for every function in the library.
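For example, in a notebook cell you might run the following (the ? syntax only works in IPython/Jupyter, while help() works anywhere):
# Inspect pandas documentation from a notebook cell
help(pd.DataFrame.head)   # prints the full docstring for the head() method
pd.read_csv?              # shows signature and docstring (IPython/Jupyter only)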
EXERCISE-0: Getting Help on Pandas Functions
Use the following tasks to familiarize yourself with how to explore pandas functions and understand their parameters:
Use help(pd.read_csv) to read the documentation for loading CSV files.
Use df.head? to learn how the head() method works.
Try help(pd.merge) and identify what the arguments how and on do.
Use df.groupby? to check how to group your data and what options are available.
Pandas Core Data Structures#
Series (1-dimensional)#
DataFrame (2-dimensional)#
We’ll focus primarily on the DataFrame data structure. However, you’re encouraged to explore how to convert between DataFrames and Series for greater flexibility.
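As a quick illustration, here is a minimal sketch (using toy values similar to the next section) of how a Series and a DataFrame relate and how to convert between them:
# A Series is a labelled 1-dimensional array
ages = pd.Series([25, 30, 35], name="age")
# A DataFrame is a 2-dimensional table; each column is itself a Series
df_demo = pd.DataFrame({"name": ["Alice", "Bob", "Charlie"], "age": ages})
# Selecting a single column returns a Series; to_frame() converts it back into a DataFrame
age_series = df_demo["age"]
age_frame = age_series.to_frame()
print(type(age_series), type(age_frame))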
Creating DataFrames from Python Data Structures#
# Create a pandas DataFrame from a dictionary
# Step-1: Create a dictionary with sample data
data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["New York", "Los Angeles", "Chicago"]}
# Step-2: Create the DataFrame from the dictionary
df_from_dict = pd.DataFrame(data)
DataFrame from a List#
list_data = [["Alice", 25, "New York"],
["Bob", 30, "Los Angeles"],
["Charlie", 35, "Chicago"] ]
df_from_list = pd.DataFrame(list_data, columns=["name", "age", "city"])
# Method 1: From dictionary
students_dict = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [23, 22, 24, 23],
'Grade': ['A', 'B', 'A', 'C'],
'Score': [92, 85, 94, 78]
}
students_df = pd.DataFrame(students_dict)
# Method 2: From lists
data_lists = [
['Alice', 23, 'A', 92],
['Bob', 22, 'B', 85],
['Charlie', 24, 'A', 94],
['Diana', 23, 'C', 78]
]
students_df2 = pd.DataFrame(data_lists, columns=['Name', 'Age', 'Grade', 'Score'])
print("From dictionary:")
print(students_df)
print("\nFrom lists:")
print(students_df2)
EXERCISE-1: GETTING BACK A DICTIONARY FROM A DATAFRAME
Make sure you have a DataFrame
Use the to_dict() method to convert the DataFrame back into a dictionary.
Use help to understand the parameters of to_dict() and discuss the differences between the different formats it can return.
df_from_list.to_dict(orient='list')
{'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'city': ['New York', 'Los Angeles', 'Chicago']}
Basic DataFrame Inspection#
It’s important to get a high-level overview of your DataFrame—especially when it’s loaded from an external source. This first look helps you understand the structure, size, and contents of the dataset before diving into detailed analysis.
df_from_dict.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 3 non-null object
1 age 3 non-null int64
2 city 3 non-null object
dtypes: int64(1), object(2)
memory usage: 204.0+ bytes
EXERCISE-2: Try the following:
Use df.shape to view the dimensions of the DataFrame.
Use df.head() to display the first few rows of the DataFrame.
Use df.columns to get the names of the columns.
Use df.info() to get a concise summary of the DataFrame, including data types and non-null counts.
Use df.describe() to get a statistical summary of the numerical columns in the DataFrame.
Understanding Indexing and Columns#
In a DataFrame, columns represent the variables (or features) in your dataset, while the index represents the row labels. The index is used to uniquely identify each row and can be either the default integer labels or a custom label like a name or ID.
We will see later that the index can sometimes be used when merging.
df_from_list.index = df_from_list['name']
students_dict = {
'Name': ['Alice', 'Alice', 'Charlie', 'Diana'],
'Age': [23, 22, 24, 23],
'Grade': ['A', 'B', 'A', 'C'],
'Score': [92, 85, 94, 78]
}
df_students = pd.DataFrame(students_dict)
df_students
|   | Name | Age | Grade | Score |
|---|------|-----|-------|-------|
| 0 | Alice | 23 | A | 92 |
| 1 | Alice | 22 | B | 85 |
| 2 | Charlie | 24 | A | 94 |
| 3 | Diana | 23 | C | 78 |
df_students2 = df_students
df_students2.index = df_students['Name']
EXERCISE-3: Explore Index
Check the current index using the .index attribute.
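A minimal sketch of the kind of index operations this exercise points at, using the df_students DataFrame created above:
# Inspect the current index (set to the Name values above)
print(df_students.index)
# Reset back to the default RangeIndex; drop=True discards the old index values
df_students = df_students.reset_index(drop=True)
# Promote a column to the index explicitly
df_students = df_students.set_index('Name')
print(df_students.index)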
Loading Data from External Sources#
Pandas makes it easy to load data from a wide range of external sources into a common tabular format (DataFrame) for analysis. This includes traditional statistical software formats like SPSS (.sav), Stata (.dta), Excel (.xlsx), and more. Once loaded, all data can be handled consistently using pandas tools, regardless of the original format.
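For instance, assuming files in these formats exist (the file names below are placeholders for illustration), each reader returns a DataFrame:
# Hypothetical file names; each call returns a DataFrame
df_csv = pd.read_csv("survey.csv")
df_xlsx = pd.read_excel("survey.xlsx", sheet_name=0)   # requires openpyxl
df_stata = pd.read_stata("survey.dta")
df_spss = pd.read_spss("survey.sav")                    # requires pyreadstat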
df_employees = pd.read_csv(FILE_EMPLOYEES)
Basic Column Selection#
# Selecting single columns
names = df_employees['name']
print("Single column (Series):")
print(type(names))
print(names.head())
# Selecting multiple columns
basic_info = df_employees[['name', 'department', 'salary']]
print("\nMultiple columns (DataFrame):")
print(type(basic_info))
print(basic_info.head())
# Different ways to select columns
print("\nDifferent selection methods:")
print("Dot notation:", type(df_employees.name)) # Works for valid Python names
print("Bracket notation:", type(df_employees['name']))
df_tmp = df_employees['name']
df_tmp.head()
0 John Smith
1 Jane Doe
2 Mike Johnson
3 Sarah Wilson
4 David Brown
Name: name, dtype: object
Basic Row Selection#
# Row selection using indexing
# df_employees[:10]
# Row selection using loc
df_employees.loc[:9] # Selects rows 0 to 9
|   | employee_id | name | department | salary | years_experience | hire_date |
|---|---|---|---|---|---|---|
| 0 | 101 | John Smith | IT | 75000 | 5 | 2018-01-01 |
| 1 | 102 | Jane Doe | HR | 65000 | 3 | 2018-04-01 |
| 2 | 103 | Mike Johnson | Finance | 70000 | 7 | 2018-06-30 |
| 3 | 104 | Sarah Wilson | IT | 80000 | 6 | 2018-09-28 |
| 4 | 105 | David Brown | Marketing | 60000 | 2 | 2018-12-27 |
| 5 | 106 | Lisa Davis | Finance | 72000 | 8 | 2019-03-27 |
| 6 | 107 | Tom Miller | IT | 78000 | 4 | 2019-06-25 |
| 7 | 108 | Amy Garcia | HR | 63000 | 3 | 2019-09-23 |
| 8 | 109 | Chris Lee | Marketing | 58000 | 1 | 2019-12-22 |
| 9 | 110 | Maria Rodriguez | Finance | 74000 | 9 | 2020-03-21 |
.loc works with labels, not integer positions#
df_employees.loc[5, ['employee_id', 'name']]
employee_id 106
name Lisa Davis
Name: 5, dtype: object
df_employees['employee_id'] = df_employees['employee_id'].astype(str)
df_employees2 = df_employees.set_index('employee_id')
df_employees.head()
|   | employee_id | name | department | salary | years_experience | hire_date |
|---|---|---|---|---|---|---|
| 0 | 101 | John Smith | IT | 75000 | 5 | 2018-01-01 |
| 1 | 102 | Jane Doe | HR | 65000 | 3 | 2018-04-01 |
| 2 | 103 | Mike Johnson | Finance | 70000 | 7 | 2018-06-30 |
| 3 | 104 | Sarah Wilson | IT | 80000 | 6 | 2018-09-28 |
| 4 | 105 | David Brown | Marketing | 60000 | 2 | 2018-12-27 |
df_employees.iloc[1, [1, 2]].head() # Position-based selection: row 1, columns 1 and 2
name Jane Doe
department HR
Name: 1, dtype: object
df_employees2 = df_employees.set_index('employee_id')
df_employees2.head()
| employee_id | name | department | salary | years_experience | hire_date | country |
|---|---|---|---|---|---|---|
| 101 | John Smith | IT | 75000 | 5 | 2018-01-01 | Rwanda |
| 102 | Jane Doe | HR | 65000 | 3 | 2018-04-01 | Rwanda |
| 103 | Mike Johnson | Finance | 70000 | 7 | 2018-06-30 | Malawi |
| 104 | Sarah Wilson | IT | 80000 | 6 | 2018-09-28 | Rwanda |
| 105 | David Brown | Marketing | 60000 | 2 | 2018-12-27 | Rwanda |
# Using .loc[] - label-based selection
print("Using .loc[] - first 3 rows:")
print(df_employees.loc[0:2]) # Includes end index
print("\nUsing .loc[] - specific rows and columns:")
print(df_employees.loc[0:2, ['name', 'salary']])
# Using .iloc[] - position-based selection
print("\nUsing .iloc[] - first 3 rows:")
print(df_employees.iloc[0:3]) # Excludes end index
print("\nUsing .iloc[] - specific positions:")
print(df_employees.iloc[0:3, [1, 3]]) # First 3 rows, columns 1 and 3
EXERCISE-ROW SELECTION: TRY THE FOLLOWING
SET INDEX OF DATAFRAME TO name COLUMN
USE .iloc TO MAKE THIS SELECTION: EMPLOYEE NAME 'John Smith', COLUMNS salary AND department
TRY THE SAME THING USING .loc
WHAT'S HAPPENING?
df_employees2 = df_employees.set_index('name')
df_employees2.loc['John Smith', ['employee_id', 'department']]
employee_id 101
department IT
Name: John Smith, dtype: object
Data Selection and Filtering#
The goal here is to be able to do the following:
Create and apply boolean masks for data filtering
Use comparison operators to filter data based on conditions
Combine multiple conditions using logical operators
Apply string methods for text-based filtering
Filter DataFrames using complex conditional logic
Select specific subsets of data based on multiple criteria
Boolean Indexing Basics#
Boolean filtering in One Line#
df_high_salary_employees = df_employees[df_employees['salary'] > 70000]
df_high_salary_employees.head()
|   | employee_id | name | department | salary | years_experience | hire_date |
|---|---|---|---|---|---|---|
| 0 | 101 | John Smith | IT | 75000 | 5 | 2018-01-01 |
| 3 | 104 | Sarah Wilson | IT | 80000 | 6 | 2018-09-28 |
| 5 | 106 | Lisa Davis | Finance | 72000 | 8 | 2019-03-27 |
| 6 | 107 | Tom Miller | IT | 78000 | 4 | 2019-06-25 |
| 9 | 110 | Maria Rodriguez | Finance | 74000 | 9 | 2020-03-21 |
Boolean filtering in Two Steps#
high_salary_mask = df_employees['salary'] > 70000
high_salary_mask
# APPLY THE MASK TO THE DATAFRAME
df_high_salary_employees = df_employees[high_salary_mask]
df_high_salary_employees.shape
(10, 6)
# Creating boolean masks
high_salary_mask = df_employees['salary'] > 70000
print("Boolean mask for high salary:")
print(high_salary_mask.head(10))
print(f"Mask type: {type(high_salary_mask)}")
# Applying the mask to filter data
high_salary_employees = df_employees[high_salary_mask]
print(f"\nEmployees with salary > $70,000:")
print(high_salary_employees[['name', 'salary', 'department']])
print(f"Count: {len(high_salary_employees)} out of {len(df_employees)}")
Comparison Operators#
SELECT ALL EMPLOYEES WORKING IN IT DEPARTMENT#
df_IT = df_employees[df_employees['department'] == 'IT']
df_IT.department.value_counts()
department
IT 6
Name: count, dtype: int64
# Different comparison operators
print("IT Department employees:")
it_employees = df_employees[df_employees['department'] == 'IT']
print(it_employees[['name', 'department', 'salary']])
print("\nExperienced employees (>= 5 years):")
experienced = df_employees[df_employees['years_experience'] >= 5]
print(experienced[['name', 'years_experience', 'salary']])
print("\nNot in HR department:")
non_hr = df_employees[df_employees['department'] != 'HR']
print(f"Non-HR employees: {len(non_hr)} out of {len(df_employees)}")
# Using .isin() for multiple values
target_departments = ['IT', 'Finance']
it_finance = df_employees[df_employees['department'].isin(target_departments)]
print(f"\nIT and Finance employees: {len(it_finance)}")
print(it_finance[['name', 'department', 'salary']])
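The goals above also mention combining multiple conditions and text-based filtering; a minimal sketch on the employees data (wrap each condition in parentheses and use & / | rather than and / or):
# Experienced IT staff: combine two conditions with &
it_experienced = df_employees[
    (df_employees['department'] == 'IT') & (df_employees['years_experience'] >= 5)
]
# Text-based filtering through the .str accessor
smiths = df_employees[df_employees['name'].str.contains('Smith', case=False, na=False)]
print(len(it_experienced), len(smiths))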
Filtering with the .query Method#
df_employees = pd.read_csv(FILE_EMPLOYEES)
FILTER FOR HIGH SALARY EMPLOYEES#
METHOD-1: BOOLEAN MASK#
df_high_salary_employees = df_employees[df_employees['salary'] > 70000]
METHOD-2: .query#
Single Condition#
df_high_salary_employees = df_employees.query('salary > 70000 ')
df_high_salary_employees = df_employees.query('salary > 70000 and department == "IT"')
# Using query() method for readable conditions
high_performers = df_employees.query('salary > 70000 and years_experience >= 5')
print("High performers using query():")
print(high_performers[['name', 'salary', 'years_experience']])
EXERCISE-4: EXPLORE FILTERING WITH MULTIPLE CONDITIONS
Boolean Mask: create a new DataFrame for employees who work in HR and receive a salary higher than 50,000
.query: do the above using the .query method
Report how many employees are in this new DataFrame
Challenge: add a 'years_experience' condition, such as less than 5 years
# More Filtering Methods
pd.DataFrame.filter?
pd.DataFrame.isin?
department = ['IT', 'HR']
df_employees[df_employees['department'].isin(department)]
|   | employee_id | name | department | salary | years_experience | hire_date |
|---|---|---|---|---|---|---|
| 0 | 101 | John Smith | IT | 75000 | 5 | 2018-01-01 |
| 1 | 102 | Jane Doe | HR | 65000 | 3 | 2018-04-01 |
| 3 | 104 | Sarah Wilson | IT | 80000 | 6 | 2018-09-28 |
| 6 | 107 | Tom Miller | IT | 78000 | 4 | 2019-06-25 |
| 7 | 108 | Amy Garcia | HR | 63000 | 3 | 2019-09-23 |
| 10 | 111 | James Taylor | IT | 82000 | 7 | 2020-06-19 |
| 11 | 112 | Linda Martinez | HR | 64000 | 4 | 2020-09-17 |
| 14 | 115 | Michael Jackson | IT | 79000 | 5 | 2021-06-14 |
| 15 | 116 | Jennifer White | HR | 66000 | 3 | 2021-09-12 |
| 18 | 119 | Richard Lewis | IT | 81000 | 6 | 2022-06-09 |
| 19 | 120 | Susan Robinson | HR | 67000 | 4 | 2022-09-07 |
Data Cleaning Basics with Pandas#
The goal is to be able to do some of the following:
Identify and handle missing data using various strategies
Detect and remove duplicate records from DataFrames
Convert data types appropriately for analysis
Rename columns for better readability and consistency
df_messy = pd.read_csv(FILE_CUSTOMERS)
Identifying and Handling Missing Data#
df_messy.shape
(10, 7)
df_messy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Customer_ID 10 non-null int64
1 customer name 8 non-null object
2 age 9 non-null float64
3 email 9 non-null object
4 purchase_amount 10 non-null object
5 signup_date 9 non-null object
6 status 10 non-null object
dtypes: float64(1), int64(1), object(5)
memory usage: 692.0+ bytes
GET TOTAL NUMBER OF MISSING VALUES IN A COLUMN#
df_messy['customer name'].isnull().sum()
2
new_columns = {'customer name': 'customer_name'}
df_messy.rename(columns=new_columns, inplace=True)
OTHER METHODS FOR DEALING WITH MISSING VALUES:#
DataFrame.isnull : Alias of isna.
DataFrame.notna : Boolean inverse of isna.
DataFrame.dropna : Omit axes labels with missing values.
DROPPING NULL VALUES#
df_messy2 = df_messy.dropna(subset=['customer_name'])
DROP ROWS IF THERE IS A NULL IN ANY COLUMN#
df_messy_not_null = df_messy.dropna()
df_messy_not_null.info()
<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, 0 to 9
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Customer_ID 6 non-null int64
1 customer_name 6 non-null object
2 age 6 non-null float64
3 email 6 non-null object
4 purchase_amount 6 non-null object
5 signup_date 6 non-null object
6 status 6 non-null object
dtypes: float64(1), int64(1), object(5)
memory usage: 384.0+ bytes
df_messy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Customer_ID 10 non-null int64
1 customer_name 8 non-null object
2 age 9 non-null float64
3 email 9 non-null object
4 purchase_amount 10 non-null object
5 signup_date 9 non-null object
6 status 10 non-null object
dtypes: float64(1), int64(1), object(5)
memory usage: 692.0+ bytes
df_messy.dtypes
Customer_ID int64
customer_name object
age float64
email object
purchase_amount object
signup_date object
status object
dtype: object
# Assign the filled column back; chained inplace fillna may not modify df_messy
# and raises warnings in recent pandas versions
df_messy['customer_name'] = df_messy['customer_name'].fillna('Unknown')
print('Missing names after fillna (df_messy):')
print(df_messy.customer_name.isnull().sum(), df_messy.shape[0])
print('Missing names after dropna (df_messy2):')
print(df_messy2.customer_name.isnull().sum(), df_messy2.shape[0])
Missing names after fillna (df_messy):
0 10
Missing names after dropna (df_messy2):
0 8
# Check for missing values
print("Missing values per column:")
print(df_messy.isnull().sum())
print("\nPercentage of missing values:")
missing_percentages = (df_messy.isnull().sum() / len(df_messy) * 100).round(2)
print(missing_percentages)
# Visualize missing data pattern
print("\nMissing data pattern:")
print(df_messy.isnull())
Data Type Conversion#
df_messy['income'] = "90000"
df_messy["income_num"] = df_messy['income'].astype(float)
CONVERT DATE STRINGS TO PANDAS DATETIME#
# Convert signup_date to datetime
df_messy['signup_date_ts'] = pd.to_datetime(df_messy['signup_date'], errors='coerce')
print('Modified Data Type:', type(df_messy['signup_date_ts'].iloc[0]))
print("Original Data Type:", type(df_messy['signup_date'].iloc[0]))
Modified Data Type: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Original Data Type: <class 'str'>
# Check current data types
print("Current data types:")
print(df_messy.dtypes)
# Convert data types
df_typed = df_messy.copy()
# Convert purchase_amount to numeric (handling invalid values)
df_typed['purchase_amount'] = pd.to_numeric(df_typed['purchase_amount'], errors='coerce')
# Convert signup_date to datetime
df_typed['signup_date'] = pd.to_datetime(df_typed['signup_date'], errors='coerce')
# Convert Customer_ID to integer (after handling missing values)
df_typed['Customer_ID'] = df_typed['Customer_ID'].astype('int64')
# Convert status to category for memory efficiency
df_typed['status'] = df_typed['status'].astype('category')
print("\nAfter type conversion:")
print(df_typed.dtypes)
# Show the cleaned numeric column
print("\nCleaned purchase amounts:")
print(df_typed[['Customer_ID', 'purchase_amount']].head(10))
EXERCISES-5: Detecting Duplicates and Renaming Columns
Explore the pandas duplicated() method to identify duplicate rows in a DataFrame.
Use the drop_duplicates() method to remove duplicates.
Next, rename columns to something you like.
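A minimal sketch of the methods this exercise asks you to explore, applied to the messy customer data (column names assume the rename done earlier):
# duplicated() returns a boolean Series flagging repeated rows
print("Duplicate rows:", df_messy.duplicated().sum())
# drop_duplicates() removes them; subset= restricts the comparison to chosen columns
df_dedup = df_messy.drop_duplicates(subset=['customer_name', 'email'])
# rename() takes a {old: new} mapping for columns
df_dedup = df_dedup.rename(columns={'Customer_ID': 'customer_id'})
print(df_dedup.columns.tolist())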
Data Transformation#
Creating New Columns with Calculations#
When creating new columns based on existing data, you’re often performing operations row by row. The .apply()
function in pandas is a powerful and efficient way to do this. It allows you to apply a custom function across rows or columns, making it ideal for generating new variables based on specific logic or transformations.
While it’s possible to loop through DataFrame rows manually using .iterrows()
, this approach is generally slower and less efficient. The recommended method is to use .apply()
in one of the following ways:
Lambda functions – For simple, one-line conditions or calculations, use an in-line lambda function with .apply().
Custom functions – For more complex logic or multi-step conditions, define a separate function and pass it to .apply().
This approach keeps your code cleaner and more readable, and it is usually faster than iterating with .iterrows(); for simple arithmetic, fully vectorized column operations are faster still.
df_sales = pd.read_csv(FILE_SALES)
ADD COLUMN WITH SIMPLE CALCULATION#
df_sales['total'] = df_sales['unit_price'] * df_sales['quantity']
ADDING A COLUMN WITH MORE COMPLICATED LOGIC#
total_complicated = X + 2*unit_price + unit_price*quantity, where X = 10 (custom function below)
total_comp1 = X + 2*unit_price, where X = 10 (lambda version below)
def add_complicated_total(row):
output = 10 + 2 * row['unit_price'] + row['unit_price'] * row['quantity']
return output
df_sales['total_complicated'] = df_sales.apply(add_complicated_total, axis=1)
df_sales['total_comp1'] = df_sales.apply(lambda row: 10+ 2*row['unit_price'], axis=1)
df_sales[['total_comp1', 'unit_price', 'quantity']].head()
|   | total_comp1 | unit_price | quantity |
|---|---|---|---|
| 0 | 821.62 | 405.81 | 7 |
| 1 | 1916.36 | 953.18 | 1 |
| 2 | 1500.78 | 745.39 | 4 |
| 3 | 1247.46 | 618.73 | 4 |
| 4 | 406.44 | 198.22 | 5 |
df_sales[['total', 'unit_price', 'quantity']].head()
|   | total | unit_price | quantity |
|---|---|---|---|
| 0 | 2840.67 | 405.81 | 7 |
| 1 | 953.18 | 953.18 | 1 |
| 2 | 2981.56 | 745.39 | 4 |
| 3 | 2474.92 | 618.73 | 4 |
| 4 | 991.10 | 198.22 | 5 |
# Basic mathematical operations
df_sales['total_amount'] = df_sales['unit_price'] * df_sales['quantity']
df_sales['discount_5_percent'] = df_sales['total_amount'] * 0.05
df_sales['final_amount'] = df_sales['total_amount'] - df_sales['discount_5_percent']
print("New calculated columns:")
print(df_sales[['unit_price', 'quantity', 'total_amount', 'final_amount']].head())
# Conditional calculations
df_sales['order_size'] = df_sales['quantity'].apply(
lambda x: 'Large' if x >= 7 else 'Medium' if x >= 4 else 'Small'
)
# Using np.where for conditional logic
df_sales['price_category'] = np.where(
df_sales['unit_price'] >= 500, 'Premium',
np.where(df_sales['unit_price'] >= 200, 'Standard', 'Budget')
)
print("\nConditional columns:")
print(df_sales[['quantity', 'order_size', 'unit_price', 'price_category']].head(10))
# Multiple conditions with np.select
conditions = [
(df_sales['total_amount'] >= 1000) & (df_sales['quantity'] >= 5),
(df_sales['total_amount'] >= 500) & (df_sales['quantity'] >= 3),
df_sales['total_amount'] >= 200
]
choices = ['VIP Order', 'Standard Order', 'Regular Order']
df_sales['order_type'] = np.select(conditions, choices, default='Small Order')
print("\nOrder type classification:")
print(df_sales[['total_amount', 'quantity', 'order_type']].value_counts('order_type'))
Using Apply() for Custom Functions#
def calculate_shipping_cost(row):
"""Calculate shipping cost based on order value and quantity"""
base_cost = 10
if row['total_amount'] > 500:
return 0 # Free shipping for orders over $500
elif row['quantity'] > 5:
return base_cost * 0.5 # 50% discount for bulk orders
else:
return base_cost
# Simple apply with lambda
df_sales['price_per_letter'] = df_sales.apply(
lambda row: row['unit_price'] / len(row['product_name']), axis=1
)
# Custom function for complex logic
df_sales['shipping_cost'] = df_sales.apply(calculate_shipping_cost, axis=1)
print("Custom calculations with apply:")
print(df_sales[['total_amount', 'quantity', 'shipping_cost']].head())
Grouping and Aggregation#
The goal is to understand groupby basics.
Understanding GroupBy Basics#
# The split-apply-combine concept
print("Understanding GroupBy:")
# Simple grouping by one column
category_groups = df_sales.groupby('category')
print(f"Number of groups: {category_groups.ngroups}")
print(f"Group sizes: {category_groups.size()}")
# Basic aggregation
category_sales = df_sales.groupby('category')['total_amount'].sum()
print("\nTotal sales by category:")
print(category_sales)
# Multiple aggregations
category_stats = df_sales.groupby('category')['total_amount'].agg(['sum', 'mean', 'count', 'std'])
print("\nCategory statistics:")
print(category_stats.round(2))
Single Column Grouping#
# Single column grouping with multiple aggregations
category_stats = df_sales.groupby('category')['total_amount'].agg(['sum', 'mean', 'count', 'std'])
print("Category statistics:")
print(category_stats.round(2))
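Grouping is not limited to a single key; a minimal sketch of grouping by two columns with named aggregations, assuming the category and order_size columns created above:
# Group by two keys and name the output columns explicitly
category_size_stats = df_sales.groupby(['category', 'order_size']).agg(
    total_sales=('total_amount', 'sum'),
    median_sale=('total_amount', 'median'),
    n_orders=('total_amount', 'count'),
)
print(category_size_stats.round(2))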
Combining Datasets in Pandas#
In real-world data analysis, it’s common to work with data spread across multiple tables or files. Pandas provides powerful tools to combine these datasets efficiently:
Concatenation (pd.concat)#
What it does: Stacks DataFrames either vertically (row-wise) or horizontally (column-wise).
When to use: When the datasets have the same structure (e.g., same columns for vertical stacking).
Example: Combining monthly sales reports stored as separate DataFrames into one long DataFrame.
Merge (pd.merge)#
What it does: Combines DataFrames based on one or more common columns (similar to SQL joins).
When to use: When you need to enrich one dataset with columns from another based on shared keys (e.g., customer ID, district).
Types of joins:
inner – Only rows with matching keys in both DataFrames
left – All rows from the left DataFrame, with matches from the right
right – All rows from the right DataFrame, with matches from the left
outer – All rows from both, matching where possible
Concatenating Multiple DataFrames#
# Loading multiple CSV files into a list using glob
csv_files = [f for f in DIR_MULTIPLE_CSV.glob("*.csv")]
# Read the first 5 CSV files
dfs = [pd.read_csv(file) for file in csv_files[:5]]
# Concatenate the DataFrames
df_combined = pd.concat(dfs, ignore_index=True)
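The example above stacks row-wise; a minimal sketch of column-wise concatenation (axis=1), which aligns rows on the index:
# Toy frames with matching indexes
left = pd.DataFrame({'a': [1, 2, 3]})
right = pd.DataFrame({'b': [10, 20, 30]})
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)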
EXERCISES-6: What conditions do you think are important for concatenating multiple DataFrames? Consider aspects like column consistency, data types, and handling of missing values. How would you ensure that the combined DataFrame maintains data integrity and usability for further analysis?
Merging DataFrames#
To quickly grasp how merging works, we’ll use a simple, AI-generated synthetic dataset.
def create_customer_datasets():
"""Create related customer datasets for merge examples"""
# Customer basic info
customers = pd.DataFrame({
'customer_id': [101, 102, 103, 104, 105],
'name': ['Alice Johnson', 'Bob Smith', 'Carol Brown', 'David Wilson', 'Eve Davis'],
'email': ['alice@email.com', 'bob@email.com', 'carol@email.com', 'david@email.com', 'eve@email.com'],
'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
'signup_date': pd.date_range('2023-01-01', periods=5, freq='30D')
})
# Orders data (includes some customers not in customers table)
orders = pd.DataFrame({
'order_id': range(1001, 1013),
'customer_id': [101, 102, 101, 103, 104, 102, 105, 106, 103, 107, 101, 104],
'product_name': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard', 'Mouse',
'Headphones', 'Speaker', 'Webcam', 'Microphone', 'Charger', 'Cable'],
'amount': [999, 699, 399, 299, 49, 25, 149, 199, 79, 89, 39, 15],
'order_date': pd.date_range('2023-02-01', periods=12, freq='5D')
})
# Customer preferences
preferences = pd.DataFrame({
'customer_id': [101, 102, 103, 104, 108], # Note: 108 not in customers
'preferred_category': ['Electronics', 'Electronics', 'Accessories', 'Electronics', 'Books'],
'communication_preference': ['Email', 'SMS', 'Email', 'Phone', 'Email'],
'loyalty_tier': ['Gold', 'Silver', 'Bronze', 'Gold', 'Silver']
})
# Product categories
products = pd.DataFrame({
'product_name': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard', 'Mouse',
'Headphones', 'Speaker', 'Webcam', 'Microphone'],
'category': ['Electronics', 'Electronics', 'Electronics', 'Electronics',
'Accessories', 'Accessories', 'Audio', 'Audio', 'Electronics', 'Audio'],
'unit_cost': [500, 300, 200, 150, 20, 10, 75, 100, 40, 45],
'supplier': ['TechCorp', 'PhoneCo', 'TabletInc', 'DisplayTech', 'KeyMaker',
'ClickCorp', 'AudioPro', 'SoundTech', 'VisionCorp', 'AudioPro']
})
return customers, orders, preferences, products
customers, orders, preferences, products = create_customer_datasets()
Basic Merge Operations#
# Inner join (only matching records)
inner_merge = pd.merge(customers, orders, on='customer_id', how='inner')
print("Inner merge (customers with orders):")
print(inner_merge[['name', 'customer_id', 'order_id', 'product_name', 'amount']].head(10))
print(f"Records: {len(inner_merge)}")
# Left join (all customers, matching orders)
left_merge = pd.merge(customers, orders, on='customer_id', how='left')
print("\nLeft merge (all customers):")
print(left_merge[['name', 'customer_id', 'order_id', 'product_name', 'amount']].head(10))
print(f"Records: {len(left_merge)}")
# Right join (all orders, matching customers)
right_merge = pd.merge(customers, orders, on='customer_id', how='right')
print("\nRight merge (all orders):")
print(right_merge[['name', 'customer_id', 'order_id', 'product_name', 'amount']].head(10))
print(f"Records: {len(right_merge)}")
# Outer join (all records from both tables)
outer_merge = pd.merge(customers, orders, on='customer_id', how='outer')
print("\nOuter merge (all customers and orders):")
print(outer_merge[['name', 'customer_id', 'order_id', 'product_name', 'amount']].head(15))
print(f"Records: {len(outer_merge)}")
EXERCISE-7: Merging Population Density Datasets [rw-pop-density-gridded]
We have multiple CSV files, all with information about population density. Please use the strategy below to merge all of them.
Create a list of CSV files using glob. You can read up briefly on how glob works.
Load each CSV file into a pandas DataFrame.
Create a new column lat_lon to hold a unique identifier.
Check the number of unique observations in each DataFrame.
Merge the DataFrames using the created lat_lon column.
Check that all rows merged.
csv_files = [f for f in DIR_POPULATION_DENSITY.glob("*.csv")]
print(f"Found {len(csv_files)} population density CSV files.")
df = pd.read_csv(csv_files[0])
df.dtypes
df_list = []
for file in csv_files:
df = pd.read_csv(file)
# Convert latitude and longitude to string
df['latitude'] = df['latitude'].astype(str)
df['longitude'] = df['longitude'].astype(str)
df['lat_lon'] = df['latitude'] + '_' + df['longitude']
df_list.append(df)
Let's merge the first two for practice#
# First Dataframe to use as base for merging
df1 = df_list[0]
df2 = df_list[1]
print(df1.shape, df2.shape)
df = df1.merge(df2, on='lat_lon', how='inner', indicator=True)
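To finish the exercise, one possible sketch for checking the merge indicator and combining every file in df_list (this assumes the remaining data columns have distinct names across files):
from functools import reduce
# indicator=True added a _merge column showing whether each row matched
print(df['_merge'].value_counts())
# Drop the repeated latitude/longitude strings, then merge all frames on lat_lon
trimmed = [d.drop(columns=['latitude', 'longitude']) for d in df_list]
df_all = reduce(lambda left, right: left.merge(right, on='lat_lon', how='inner'), trimmed)
print(df_all.shape)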
EXTENDED EXERCISE: Processing an Excel File#
About the Dataset#
The Excel file was generated by combining multiple CSV files, each containing data on different health indicators for Rwanda, such as:
access-to-health-care_subnational_rwa
child-mortality-rates_subnational_rwa
dhs-mobile_subnational_rwa
Dataset Filename: Download the dataset from here.
Task-0: Explore Data in Excel#
Take time to explore the Excel sheets and locate where each variable of interest is stored. If anything is unclear, don’t hesitate to ask the tutors for clarification on the data source.
Ensure you identify the column that holds the indicator values so that you extract only the relevant data.
In most cases, the actual indicator values can be found in the column labeled value.
Additionally, be sure to retain the following important columns: ISO3, Location, and SurveyId.
Task-1: Generate National-Level Summaries#
For each indicator, your goal is to compute a single national-level value. Depending on the nature of the indicator, you may use aggregation functions such as mean, median, or sum.
The final output should be a DataFrame printed in the Jupyter Notebook as well as a saved CSV file with the following columns:
indicator_name: The name of the indicator
value: The aggregated national value. You may name this column based on your chosen aggregation method, e.g., mean_indicator_name or median_indicator_name.
Task 2: Subnational Level Indicator Dataset#
For indicators with subnational (administrative level 2 or 3) data available, let's merge them and create a dataset with all those available indicators. The output dataset should have the following columns: