Introduction to Data Processing with Pandas#

Pandas is one of the most powerful and widely-used Python libraries for working with structured data. It provides fast, flexible, and intuitive tools to load, explore, clean, transform, and analyze data — especially tabular data like spreadsheets or CSV files. Whether you’re handling messy survey results, merging data from multiple sources, or preparing data for machine learning, pandas offers a robust foundation for data wrangling and analysis in Python. Its syntax is readable and expressive, making it ideal for both beginners and professionals.

This notebook introduces the fundamentals of working with tabular data using the pandas library.

Learning Outcomes#

By the end of this tutorial, participants will be able to:

Understand pandas core data structures#

  • Understand DataFrames and Series

  • Create DataFrames and Series from other Python data structures (e.g., lists)

Load and Explore Data#

  • Load data from different sources

  • Understand data structure using .head(), .info(), .describe()

Preprocess Data#

  • Identify and handle missing values using isna(), fillna(), dropna()

  • Convert data types using .astype() and to_datetime()

  • Rename and reorder columns

Transform and Manipulate Data#

  • Create new columns and apply functions with .apply() and vectorized operations

  • Use conditional logic for filtering and assignment

Subset and Filter Data#

  • Select columns and rows using [], .loc[], .iloc[]

  • Use .query() and boolean indexing to filter rows

Group and Aggregate Data#

  • Group data using .groupby() and apply aggregation functions

  • Calculate summaries like mean, median, and counts by groups

Merge and Join Datasets#

  • Combine datasets using merge(), concat(), and join()

  • Understand inner, outer, left, and right joins

import pandas as pd
from pathlib import Path

Input Folders Setup#

To keep the examples fast, we will mostly use synthetic data in this notebook, except in a few cases.

# ===========================================================
# MAKE SURE YOU CHANGE THE PATHS BELOW TO YOUR LOCAL SETUP
# ==========================================================
DIR_DATA = Path.cwd().parents[1] / "data"
FILE_EMPLOYEES = DIR_DATA / "synthetic-data-employees.csv"
FILE_CUSTOMERS = DIR_DATA / "synthetic-messy-customer-data.csv"
FILE_SALES = DIR_DATA / "synthetic-sales-data.csv"

# Multiple CSV files 
DIR_MULTIPLE_CSV = DIR_DATA / "simulated_cdrs"

# Population Density Datasets
DIR_POPULATION_DENSITY = DIR_DATA / "population/rw-pop-density-gridded"

Getting Help with Pandas Functions#

Understanding how to use functions and methods in Python—especially in pandas—requires consulting the documentation. It’s essential to read the documentation to learn what each function does and what its arguments mean.

In pandas, there are several convenient ways to access help:

  1. help(pd.DataFrame) – Use the built-in help() function to view documentation for any pandas class or function.

  2. pd.read_csv? or ?pd.read_csv – In Jupyter Notebook or IPython, placing a ? before or after a function displays its signature and docstring.

  3. df.apply? – If df is your DataFrame, you can inspect the apply method using the same approach.

These tools are invaluable when exploring new methods or troubleshooting unexpected behavior.

For more detailed and up-to-date information, refer to the official pandas API documentation. It provides comprehensive explanations, examples, and parameter details for every function in the library.

EXERCISE-0: Getting Help on Pandas Functions

Use the following tasks to familiarize yourself with how to explore pandas functions and understand their parameters:

  1. Use help(pd.read_csv) to read the documentation for loading CSV files.

  2. Use df.head? to learn how the head() method works.

  3. Try help(pd.merge) and identify what the arguments how and on do.

  4. Use df.groupby? to check how to group your data and what options are available.

Pandas Core Data Structures#

Series (1-dimensional)#

DataFrame (2-dimensional)#

We’ll focus primarily on the DataFrame data structure. However, you’re encouraged to explore how to convert between DataFrames and Series for greater flexibility.
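
As a quick sketch of the two structures and how to move between them (the column names here are just examples):

# A Series is a labeled 1-D array
ages = pd.Series([25, 30, 35], name="age")

# A DataFrame is a 2-D table; selecting a single column returns a Series
df = pd.DataFrame({"name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]})
age_series = df["age"]             # DataFrame column -> Series
age_frame = age_series.to_frame()  # Series -> single-column DataFrame
print(type(age_series), type(age_frame))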

Creating DataFrames from Python Data Structures#

# Create a pandas DataFrame from a dictionary

# Step-1 Create a dictionary with sample data
data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["New York", "Los Angeles", "Chicago"]}

# Step-2 Build the DataFrame from the dictionary
df_from_dict = pd.DataFrame(data)

DataFrame from a List#

list_data = [["Alice", 25, "New York"],
             ["Bob", 30, "Los Angeles"],
             ["Charlie", 35, "Chicago"] ]
df_from_list = pd.DataFrame(list_data, columns=["name", "age", "city"])
# Method 1: From dictionary
students_dict = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [23, 22, 24, 23],
    'Grade': ['A', 'B', 'A', 'C'],
    'Score': [92, 85, 94, 78]
}
students_df = pd.DataFrame(students_dict)

# Method 2: From lists
data_lists = [
    ['Alice', 23, 'A', 92],
    ['Bob', 22, 'B', 85],
    ['Charlie', 24, 'A', 94],
    ['Diana', 23, 'C', 78]
]
students_df2 = pd.DataFrame(data_lists, columns=['Name', 'Age', 'Grade', 'Score'])

print("From dictionary:")
print(students_df)
print("\nFrom lists:")
print(students_df2)

EXERCISE-1: GETTING BACK A DICTIONARY FROM A DATAFRAME

  • Make sure you have a DataFrame

  • Use the to_dict() method to convert the DataFrame back into a dictionary.

  • Use help to understand the parameters of to_dict() and discuss the differences between the different formats it can return.

df_from_list.to_dict(orient='list')
{'name': ['Alice', 'Bob', 'Charlie'],
 'age': [25, 30, 35],
 'city': ['New York', 'Los Angeles', 'Chicago']}
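
For comparison, here is a small sketch of two other orient options and the shapes they return (shown as comments, following from df_from_list above):

# orient='records': one dictionary per row
df_from_list.to_dict(orient='records')
# [{'name': 'Alice', 'age': 25, 'city': 'New York'}, ...]

# orient='dict' (the default): {column -> {index label -> value}}
df_from_list.to_dict()
# {'name': {0: 'Alice', 1: 'Bob', 2: 'Charlie'}, ...}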

Basic DataFrame Inspection#

It’s important to get a high-level overview of your DataFrame—especially when it’s loaded from an external source. This first look helps you understand the structure, size, and contents of the dataset before diving into detailed analysis.

df_from_dict.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    3 non-null      object
 1   age     3 non-null      int64 
 2   city    3 non-null      object
dtypes: int64(1), object(2)
memory usage: 204.0+ bytes

EXERCISE-2: Try the following:

  1. Use df.shape to view the dimensions of the DataFrame.

  2. Use df.head() to display the first few rows of the DataFrame.

  3. Use df.columns to get the names of the columns.

  4. Use df.info() to get a concise summary of the DataFrame, including data types and non-null counts.

  5. Use df.describe() to get a statistical summary of the numerical columns in the DataFrame.

Understanding Indexing and Columns#

In a DataFrame, columns represent the variables (or features) in your dataset, while the index represents the row labels. The index is used to uniquely identify each row and can be either the default integer labels or a custom label like a name or ID.

We will see later that the index can sometimes be used when merging DataFrames.

df_from_list.index = df_from_list['name']
students_dict = {
    'Name': ['Alice', 'Alice', 'Charlie', 'Diana'],
    'Age': [23, 22, 24, 23],
    'Grade': ['A', 'B', 'A', 'C'],
    'Score': [92, 85, 94, 78]
}
df_students = pd.DataFrame(students_dict)
df_students
Name Age Grade Score
0 Alice 23 A 92
1 Alice 22 B 85
2 Charlie 24 A 94
3 Diana 23 C 78
# Note: this creates a reference, not a copy; changing df_students2 also changes df_students (use .copy() for an independent DataFrame)
df_students2 = df_students
df_students2.index = df_students['Name']

EXERCISE-3: Explore Index

  • Check the current index using the .index attribute (see the sketch below)
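
A minimal sketch of exploring and changing the index, assuming the df_students DataFrame created above:

print(df_students.index)                    # current index labels
df_by_name = df_students.set_index('Name')  # use the Name column as the index
df_reset = df_by_name.reset_index()         # move the index back into a column
print(df_by_name.index)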

Loading Data from External Sources#

Pandas makes it easy to load data from a wide range of external sources into a common tabular format (DataFrame) for analysis. This includes traditional statistical software formats like SPSS (.sav), Stata (.dta), Excel (.xlsx), and more. Once loaded, all data can be handled consistently using pandas tools, regardless of the original format.
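
As a sketch, the readers for a few of these formats look like this (the file names below are placeholders, not files shipped with this notebook):

# Each reader returns a DataFrame, so downstream code looks the same
df_xlsx = pd.read_excel("survey.xlsx", sheet_name=0)  # Excel
df_dta = pd.read_stata("survey.dta")                  # Stata
df_sav = pd.read_spss("survey.sav")                   # SPSS (requires the pyreadstat package)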

df_employees = pd.read_csv(FILE_EMPLOYEES)

Basic Column Selection#

# Selecting single columns
names = df_employees['name']
print("Single column (Series):")
print(type(names))
print(names.head())

# Selecting multiple columns
basic_info = df_employees[['name', 'department', 'salary']]
print("\nMultiple columns (DataFrame):")
print(type(basic_info))
print(basic_info.head())

# Different ways to select columns
print("\nDifferent selection methods:")
print("Dot notation:", type(df_employees.name))  # Works for valid Python names
print("Bracket notation:", type(df_employees['name']))
df_tmp = df_employees['name']
df_tmp.head()
0      John Smith
1        Jane Doe
2    Mike Johnson
3    Sarah Wilson
4     David Brown
Name: name, dtype: object

Basic Row Selection#

# Row selection using indexing
# df_employees[:10]
# Row selection using loc
df_employees.loc[:9]  # Selects rows with labels 0 through 9 (inclusive)
employee_id name department salary years_experience hire_date
0 101 John Smith IT 75000 5 2018-01-01
1 102 Jane Doe HR 65000 3 2018-04-01
2 103 Mike Johnson Finance 70000 7 2018-06-30
3 104 Sarah Wilson IT 80000 6 2018-09-28
4 105 David Brown Marketing 60000 2 2018-12-27
5 106 Lisa Davis Finance 72000 8 2019-03-27
6 107 Tom Miller IT 78000 4 2019-06-25
7 108 Amy Garcia HR 63000 3 2019-09-23
8 109 Chris Lee Marketing 58000 1 2019-12-22
9 110 Maria Rodriguez Finance 74000 9 2020-03-21

.loc works with labels, not integer positions#

df_employees.loc[5, ['employee_id', 'name']]
employee_id           106
name           Lisa Davis
Name: 5, dtype: object
df_employees['employee_id'] = df_employees['employee_id'].astype(str)
df_employees2 = df_employees.set_index('employee_id')
df_employees.head()
employee_id name department salary years_experience hire_date
0 101 John Smith IT 75000 5 2018-01-01
1 102 Jane Doe HR 65000 3 2018-04-01
2 103 Mike Johnson Finance 70000 7 2018-06-30
3 104 Sarah Wilson IT 80000 6 2018-09-28
4 105 David Brown Marketing 60000 2 2018-12-27
df_employees.iloc[1, [1, 2]].head()  # Position-based: row at position 1, columns at positions 1 and 2
name          Jane Doe
department          HR
Name: 1, dtype: object
df_employees2 = df_employees.set_index('employee_id')
df_employees2.head()
name department salary years_experience hire_date country
employee_id
101 John Smith IT 75000 5 2018-01-01 Rwanda
102 Jane Doe HR 65000 3 2018-04-01 Rwanda
103 Mike Johnson Finance 70000 7 2018-06-30 Malawi
104 Sarah Wilson IT 80000 6 2018-09-28 Rwanda
105 David Brown Marketing 60000 2 2018-12-27 Rwanda
# Using .loc[] - label-based selection
print("Using .loc[] - first 3 rows:")
print(df_employees.loc[0:2])  # Includes end index

print("\nUsing .loc[] - specific rows and columns:")
print(df_employees.loc[0:2, ['name', 'salary']])

# Using .iloc[] - position-based selection
print("\nUsing .iloc[] - first 3 rows:")
print(df_employees.iloc[0:3])  # Excludes end index

print("\nUsing .iloc[] - specific positions:")
print(df_employees.iloc[0:3, [1, 3]])  # First 3 rows, columns 1 and 3

EXERCISE-ROW SELECTION: TRY THE FOLLOWING

  1. SET THE INDEX OF THE DATAFRAME TO THE name COLUMN

  2. USE .iloc TO SELECT THE salary AND department OF EMPLOYEE 'John Smith'

  3. TRY THE SAME THING USING .loc

  4. WHAT'S HAPPENING?

df_employees2 = df_employees.set_index('name')
df_employees2.loc['John Smith', ['employee_id', 'department']]
employee_id    101
department      IT
Name: John Smith, dtype: object

Data Selection and Filtering#

The goal here is to be able to do the following:

  • Create and apply boolean masks for data filtering

  • Use comparison operators to filter data based on conditions

  • Combine multiple conditions using logical operators

  • Apply string methods for text-based filtering

  • Filter DataFrames using complex conditional logic

  • Select specific subsets of data based on multiple criteria

Boolean Indexing Basics#

Boolean Filtering in One Line#

df_high_salary_employees = df_employees[df_employees['salary'] > 70000]
df_high_salary_employees.head()
employee_id name department salary years_experience hire_date
0 101 John Smith IT 75000 5 2018-01-01
3 104 Sarah Wilson IT 80000 6 2018-09-28
5 106 Lisa Davis Finance 72000 8 2019-03-27
6 107 Tom Miller IT 78000 4 2019-06-25
9 110 Maria Rodriguez Finance 74000 9 2020-03-21

Boolean filtering in Two Steps#

high_salary_mask = df_employees['salary'] > 70000
high_salary_mask
# APPLY THE MASK TO THE DATAFRAME
df_high_salary_employees = df_employees[high_salary_mask]
df_high_salary_employees.shape
(10, 6)
# Creating boolean masks
high_salary_mask = df_employees['salary'] > 70000
print("Boolean mask for high salary:")
print(high_salary_mask.head(10))
print(f"Mask type: {type(high_salary_mask)}")

# Applying the mask to filter data
high_salary_employees = df_employees[high_salary_mask]
print(f"\nEmployees with salary > $70,000:")
print(high_salary_employees[['name', 'salary', 'department']])
print(f"Count: {len(high_salary_employees)} out of {len(df_employees)}")

Comparison Operators#

SELECT ALL EMPLOYEES WORKING IN IT DEPARTMENT#

df_IT = df_employees[df_employees['department'] == 'IT']
df_IT.department.value_counts()
department
IT    6
Name: count, dtype: int64
# Different comparison operators
print("IT Department employees:")
it_employees = df_employees[df_employees['department'] == 'IT']
print(it_employees[['name', 'department', 'salary']])

print("\nExperienced employees (>= 5 years):")
experienced = df_employees[df_employees['years_experience'] >= 5]
print(experienced[['name', 'years_experience', 'salary']])

print("\nNot in HR department:")
non_hr = df_employees[df_employees['department'] != 'HR']
print(f"Non-HR employees: {len(non_hr)} out of {len(df_employees)}")

# Using .isin() for multiple values
target_departments = ['IT', 'Finance']
it_finance = df_employees[df_employees['department'].isin(target_departments)]
print(f"\nIT and Finance employees: {len(it_finance)}")
print(it_finance[['name', 'department', 'salary']])

Filtering with the .query Method#

df_employees = pd.read_csv(FILE_EMPLOYEES)

FILTER FOR HIGH SALARY EMPLOYEES#

METHOD-1: BOOLEAN MASK#

df_high_salary_employees = df_employees[df_employees['salary'] > 70000]

METHOD-2: .query#

Single Condition#

df_high_salary_employees = df_employees.query('salary > 70000')
df_high_salary_employees = df_employees.query('salary > 70000 and department == "IT"')
# Using query() method for readable conditions
high_performers = df_employees.query('salary > 70000 and years_experience >= 5')
print("High performers using query():")
print(high_performers[['name', 'salary', 'years_experience']])

EXERCISE-4: EXPLORE FILTERING WITH MULTIPLE CONDITIONS

  • Boolean mask: create a new DataFrame for employees who work in HR and earn a salary higher than 50,000

  • .query: do the same using the .query() method

  • Report how many employees are in this new DataFrame

  • Challenge: add a 'years_experience' condition, such as less than 5 years

# More Filtering Methods
pd.DataFrame.filter?
pd.DataFrame.isin?
department = ['IT', 'HR']
df_employees[df_employees['department'].isin(department)]
employee_id name department salary years_experience hire_date
0 101 John Smith IT 75000 5 2018-01-01
1 102 Jane Doe HR 65000 3 2018-04-01
3 104 Sarah Wilson IT 80000 6 2018-09-28
6 107 Tom Miller IT 78000 4 2019-06-25
7 108 Amy Garcia HR 63000 3 2019-09-23
10 111 James Taylor IT 82000 7 2020-06-19
11 112 Linda Martinez HR 64000 4 2020-09-17
14 115 Michael Jackson IT 79000 5 2021-06-14
15 116 Jennifer White HR 66000 3 2021-09-12
18 119 Richard Lewis IT 81000 6 2022-06-09
19 120 Susan Robinson HR 67000 4 2022-09-07
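
The goals above also mention string methods for text-based filtering. A minimal sketch using the .str accessor on df_employees (the substrings searched for are just examples):

# Employees whose name contains "son" (case-insensitive)
name_mask = df_employees['name'].str.contains('son', case=False, na=False)
print(df_employees[name_mask][['name', 'department']])

# Employees whose department starts with "F"
print(df_employees[df_employees['department'].str.startswith('F')][['name', 'department']])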

Data Cleaning Basics with Pandas#

The goal is to be able to do some of the following:

  • Identify and handle missing data using various strategies

  • Detect and remove duplicate records from DataFrames

  • Convert data types appropriately for analysis

  • Rename columns for better readability and consistency

df_messy = pd.read_csv("/Users/dmatekenya/My Drive (dmatekenya@gmail.com)/TEACHING/AIMS-DSCBI/data/synthetic-messy-customer-data.csv")

Identifying and Handling Missing Data#

df_messy.shape
(10, 7)
df_messy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Customer_ID      10 non-null     int64  
 1   customer name    8 non-null      object 
 2   age              9 non-null      float64
 3   email            9 non-null      object 
 4   purchase_amount  10 non-null     object 
 5   signup_date      9 non-null      object 
 6   status           10 non-null     object 
dtypes: float64(1), int64(1), object(5)
memory usage: 692.0+ bytes

GET THE TOTAL NUMBER OF MISSING VALUES IN A COLUMN#

df_messy['customer name'].isnull().sum()
2
new_columns = {'customer name': 'customer_name'}
df_messy.rename(columns=new_columns, inplace=True)

OTHER METHODS FOR DEALING WITH MISSING VALUES:#

DataFrame.isnull : Alias of isna.

DataFrame.notna : Boolean inverse of isna.

DataFrame.dropna : Omit axes labels with missing values.
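
DataFrame.fillna is the usual companion to these. A small sketch of common fill strategies on df_messy (the fill values are just examples):

df_filled = df_messy.copy()

# Numeric column: fill with the column median
df_filled['age'] = df_filled['age'].fillna(df_filled['age'].median())

# Text column: fill with a sentinel value
df_filled['customer_name'] = df_filled['customer_name'].fillna('Unknown')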

DROPPING NULL VALUES#

df_messy2 = df_messy.dropna(subset=['customer_name'])

DROP ROWS IF THERE IS NULL IN ANY COLUMN#

df_messy_not_null = df_messy.dropna()
df_messy_not_null.info()
<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, 0 to 9
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Customer_ID      6 non-null      int64  
 1   customer_name    6 non-null      object 
 2   age              6 non-null      float64
 3   email            6 non-null      object 
 4   purchase_amount  6 non-null      object 
 5   signup_date      6 non-null      object 
 6   status           6 non-null      object 
dtypes: float64(1), int64(1), object(5)
memory usage: 384.0+ bytes
df_messy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Customer_ID      10 non-null     int64  
 1   customer_name    8 non-null      object 
 2   age              9 non-null      float64
 3   email            9 non-null      object 
 4   purchase_amount  10 non-null     object 
 5   signup_date      9 non-null      object 
 6   status           10 non-null     object 
dtypes: float64(1), int64(1), object(5)
memory usage: 692.0+ bytes
df_messy.dtypes
Customer_ID          int64
customer_name       object
age                float64
email               object
purchase_amount     object
signup_date         object
status              object
dtype: object
# Assign the result back instead of calling Series.fillna(inplace=True), which is a
# chained assignment and may not update df_messy in newer pandas versions
df_messy['customer_name'] = df_messy['customer_name'].fillna('Unknown')
print('Missing values after fillna (df_messy):')
print(df_messy.customer_name.isnull().sum(), df_messy.shape[0])

print('Missing values after dropna (df_messy2):')
print(df_messy2.customer_name.isnull().sum(), df_messy2.shape[0])
Missing values after fillna (df_messy):
0 10
Missing values after dropna (df_messy2):
0 8
# Check for missing values
print("Missing values per column:")
print(df_messy.isnull().sum())

print("\nPercentage of missing values:")
missing_percentages = (df_messy.isnull().sum() / len(df_messy) * 100).round(2)
print(missing_percentages)

# Visualize missing data pattern
print("\nMissing data pattern:")
print(df_messy.isnull())

Data Type Conversion#

df_messy['income'] = "90000"
df_messy["income_num"]  = df_messy['income'].astype(float)

CONVERT DATE STRINGS TO DATETIME OBJECTS#

# Convert signup_date to datetime
df_messy['signup_date_ts'] = pd.to_datetime(df_messy['signup_date'], errors='coerce')
print('Modified Data Type:', type(df_messy['signup_date_ts'].iloc[0]))
print("Original Data Type:", type(df_messy['signup_date'].iloc[0]))
Modified Data Type: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Original Data Type: <class 'str'>
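
Once a column holds datetime values, the .dt accessor exposes its components. A quick sketch using the converted column:

# Extract date parts from the datetime column
df_messy['signup_year'] = df_messy['signup_date_ts'].dt.year
df_messy['signup_month'] = df_messy['signup_date_ts'].dt.month
print(df_messy[['signup_date_ts', 'signup_year', 'signup_month']].head())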
# Check current data types
print("Current data types:")
print(df_messy.dtypes)

# Convert data types
df_typed = df_messy.copy()

# Convert purchase_amount to numeric (handling invalid values)
df_typed['purchase_amount'] = pd.to_numeric(df_typed['purchase_amount'], errors='coerce')

# Convert signup_date to datetime
df_typed['signup_date'] = pd.to_datetime(df_typed['signup_date'], errors='coerce')

# Convert Customer_ID to integer (after handling missing values)
df_typed['Customer_ID'] = df_typed['Customer_ID'].astype('int64')

# Convert status to category for memory efficiency
df_typed['status'] = df_typed['status'].astype('category')

print("\nAfter type conversion:")
print(df_typed.dtypes)

# Show the cleaned numeric column
print("\nCleaned purchase amounts:")
print(df_typed[['Customer_ID', 'purchase_amount']].head(10))

EXERCISES-5: Detecting Duplicates and Renaming Columns

  • Explore the pandas duplicated() method to identify duplicate rows in a DataFrame.

  • Use the drop_duplicates() method to remove duplicates.

  • Next, rename the columns to names you prefer.
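
If you get stuck, here is a minimal sketch of the methods involved (the renamed column is just an example):

# Count fully duplicated rows (all columns considered by default)
print(df_messy.duplicated().sum())

# Keep the first occurrence of each duplicate and drop the rest
df_deduped = df_messy.drop_duplicates()

# Rename columns with a mapping of old -> new names
df_deduped = df_deduped.rename(columns={'Customer_ID': 'customer_id'})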

Data Transformation#

Creating New Columns with Calculations#

When creating new columns based on existing data, you’re often performing operations row by row. The .apply() function in pandas is a powerful and efficient way to do this. It allows you to apply a custom function across rows or columns, making it ideal for generating new variables based on specific logic or transformations.

While it’s possible to loop through DataFrame rows manually using .iterrows(), this approach is generally slower and less efficient. The recommended method is to use .apply() in one of the following ways:

  • Lambda functions – For simple, one-line conditions or calculations, use an in-line lambda function with .apply().

  • Custom functions – For more complex logic or multi-step conditions, define a separate function and pass it to .apply().

This method not only makes your code cleaner and more readable but also improves performance in most cases.

df_sales = pd.read_csv(FILE_SALES)

ADD COLUMN WITH SIMPLE CALCULATION#

df_sales['total'] = df_sales['unit_price'] * df_sales['quantity']

ADDING A COLUMN WITH COMPLICATED LOGIC#

total_complicated = X + 2*unit_price + unit_price*quantity, where X = 10

total_comp1 = X + 2*unit_price (the simpler lambda example below)

def add_complicated_total(row):
    output = 10 + 2 * row['unit_price'] + row['unit_price'] * row['quantity']
    return output
df_sales['total_complicated'] = df_sales.apply(add_complicated_total, axis=1)
df_sales['total_comp1'] = df_sales.apply(lambda row: 10 + 2*row['unit_price'], axis=1)
df_sales[['total_comp1', 'unit_price', 'quantity']].head()
total_comp1 unit_price quantity
0 821.62 405.81 7
1 1916.36 953.18 1
2 1500.78 745.39 4
3 1247.46 618.73 4
4 406.44 198.22 5
df_sales[['total', 'unit_price', 'quantity']].head()
total unit_price quantity
0 2840.67 405.81 7
1 953.18 953.18 1
2 2981.56 745.39 4
3 2474.92 618.73 4
4 991.10 198.22 5
import numpy as np  # needed for np.where and np.select below

# Basic mathematical operations
df_sales['total_amount'] = df_sales['unit_price'] * df_sales['quantity']
df_sales['discount_5_percent'] = df_sales['total_amount'] * 0.05
df_sales['final_amount'] = df_sales['total_amount'] - df_sales['discount_5_percent']

print("New calculated columns:")
print(df_sales[['unit_price', 'quantity', 'total_amount', 'final_amount']].head())

# Conditional calculations
df_sales['order_size'] = df_sales['quantity'].apply(
    lambda x: 'Large' if x >= 7 else 'Medium' if x >= 4 else 'Small'
)

# Using np.where for conditional logic
df_sales['price_category'] = np.where(
    df_sales['unit_price'] >= 500, 'Premium',
    np.where(df_sales['unit_price'] >= 200, 'Standard', 'Budget')
)

print("\nConditional columns:")
print(df_sales[['quantity', 'order_size', 'unit_price', 'price_category']].head(10))

# Multiple conditions with np.select
conditions = [
    (df_sales['total_amount'] >= 1000) & (df_sales['quantity'] >= 5),
    (df_sales['total_amount'] >= 500) & (df_sales['quantity'] >= 3),
    df_sales['total_amount'] >= 200
]
choices = ['VIP Order', 'Standard Order', 'Regular Order']
df_sales['order_type'] = np.select(conditions, choices, default='Small Order')

print("\nOrder type classification:")
print(df_sales[['total_amount', 'quantity', 'order_type']].value_counts('order_type'))

Using .apply() for Custom Functions#

def calculate_shipping_cost(row):
    """Calculate shipping cost based on order value and quantity"""
    base_cost = 10
    if row['total_amount'] > 500:
        return 0  # Free shipping for orders over $500
    elif row['quantity'] > 5:
        return base_cost * 0.5  # 50% discount for bulk orders
    else:
        return base_cost
# Simple apply with lambda
df_sales['price_per_letter'] = df_sales.apply(
    lambda row: row['unit_price'] / len(row['product_name']), axis=1
)

# Custom function for complex logic
df_sales['shipping_cost'] = df_sales.apply(calculate_shipping_cost, axis=1)

print("Custom calculations with apply:")
print(df_sales[['total_amount', 'quantity', 'shipping_cost']].head())

Grouping and Aggregation#

The goal is to understand groupby basics.

Understanding GroupBy Basics#

# The split-apply-combine concept
print("Understanding GroupBy:")

# Simple grouping by one column
category_groups = df_sales.groupby('category')
print(f"Number of groups: {category_groups.ngroups}")
print(f"Group sizes: {category_groups.size()}")

# Basic aggregation
category_sales = df_sales.groupby('category')['total_amount'].sum()
print("\nTotal sales by category:")
print(category_sales)

# Multiple aggregations
category_stats = df_sales.groupby('category')['total_amount'].agg(['sum', 'mean', 'count', 'std'])
print("\nCategory statistics:")
print(category_stats.round(2))

Single Column Grouping#

# Single column grouping with multiple aggregations
category_stats = df_sales.groupby('category')['total_amount'].agg(['sum', 'mean', 'count', 'std'])
print("Category statistics:")
print(category_stats.round(2))
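
Beyond a single aggregation column, you can group by several keys and use named aggregations to control the output column names. A sketch using the df_sales columns created earlier:

# Group by category and the order_size column created in the transformation section
grouped = df_sales.groupby(['category', 'order_size']).agg(
    total_sales=('total_amount', 'sum'),
    avg_quantity=('quantity', 'mean'),
    n_orders=('total_amount', 'count'),
)
print(grouped.round(2))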

Combining Datasets in Pandas#

In real-world data analysis, it’s common to work with data spread across multiple tables or files. Pandas provides powerful tools to combine these datasets efficiently:

Concatenation (pd.concat)#

  • What it does: Stacks DataFrames either vertically (row-wise) or horizontally (column-wise).

  • When to use: When the datasets have the same structure (e.g., same columns for vertical stacking).

  • Example: Combining monthly sales reports stored as separate DataFrames into one long DataFrame.

Merge (pd.merge)#

  • What it does: Combines DataFrames based on one or more common columns (similar to SQL joins).

  • When to use: When you need to enrich one dataset with columns from another based on shared keys (e.g., customer ID, district).

  • Types of joins:

    • inner – Only rows with matching keys in both DataFrames

    • left – All rows from the left DataFrame, with matches from the right

    • right – All rows from the right DataFrame, with matches from the left

    • outer – All rows from both, matching where possible

Concatenating Multiple DataFrames#

# Loading multiple CSV files into a list using glob
csv_files = [f for f in DIR_MULTIPLE_CSV.glob("*.csv")]

# Read the first 5 CSV files
dfs = [pd.read_csv(file) for file in csv_files[:5]]

# Concatenate the DataFrames
df_combined = pd.concat(dfs, ignore_index=True)

EXERCISES-6: What conditions do you think are important for concatenating multiple DataFrames? Consider aspects like column consistency, data types, and handling of missing values. How would you ensure that the combined DataFrame maintains data integrity and usability for further analysis?

Merging DataFrames#

To quickly grasp how merging works, we’ll use a simple, AI-generated synthetic dataset.

def create_customer_datasets():
    """Create related customer datasets for merge examples"""
    
    # Customer basic info
    customers = pd.DataFrame({
        'customer_id': [101, 102, 103, 104, 105],
        'name': ['Alice Johnson', 'Bob Smith', 'Carol Brown', 'David Wilson', 'Eve Davis'],
        'email': ['alice@email.com', 'bob@email.com', 'carol@email.com', 'david@email.com', 'eve@email.com'],
        'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
        'signup_date': pd.date_range('2023-01-01', periods=5, freq='30D')
    })
    
    # Orders data (includes some customers not in customers table)
    orders = pd.DataFrame({
        'order_id': range(1001, 1013),
        'customer_id': [101, 102, 101, 103, 104, 102, 105, 106, 103, 107, 101, 104],
        'product_name': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard', 'Mouse', 
                        'Headphones', 'Speaker', 'Webcam', 'Microphone', 'Charger', 'Cable'],
        'amount': [999, 699, 399, 299, 49, 25, 149, 199, 79, 89, 39, 15],
        'order_date': pd.date_range('2023-02-01', periods=12, freq='5D')
    })
    
    # Customer preferences
    preferences = pd.DataFrame({
        'customer_id': [101, 102, 103, 104, 108],  # Note: 108 not in customers
        'preferred_category': ['Electronics', 'Electronics', 'Accessories', 'Electronics', 'Books'],
        'communication_preference': ['Email', 'SMS', 'Email', 'Phone', 'Email'],
        'loyalty_tier': ['Gold', 'Silver', 'Bronze', 'Gold', 'Silver']
    })
    
    # Product categories
    products = pd.DataFrame({
        'product_name': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard', 'Mouse', 
                        'Headphones', 'Speaker', 'Webcam', 'Microphone'],
        'category': ['Electronics', 'Electronics', 'Electronics', 'Electronics', 
                    'Accessories', 'Accessories', 'Audio', 'Audio', 'Electronics', 'Audio'],
        'unit_cost': [500, 300, 200, 150, 20, 10, 75, 100, 40, 45],
        'supplier': ['TechCorp', 'PhoneCo', 'TabletInc', 'DisplayTech', 'KeyMaker', 
                    'ClickCorp', 'AudioPro', 'SoundTech', 'VisionCorp', 'AudioPro']
    })
    
    return customers, orders, preferences, products

customers, orders, preferences, products = create_customer_datasets()

Basic Merge Operations#

# Inner join (only matching records)
inner_merge = pd.merge(customers, orders, on='customer_id', how='inner')
print("Inner merge (customers with orders):")
print(inner_merge[['name', 'customer_id', 'order_id', 'product_name', 'amount']].head(10))
print(f"Records: {len(inner_merge)}")

# Left join (all customers, matching orders)
left_merge = pd.merge(customers, orders, on='customer_id', how='left')
print("\nLeft merge (all customers):")
print(left_merge[['name', 'customer_id', 'order_id', 'product_name', 'amount']].head(10))
print(f"Records: {len(left_merge)}")

# Right join (all orders, matching customers)
right_merge = pd.merge(customers, orders, on='customer_id', how='right')
print("\nRight merge (all orders):")
print(right_merge[['name', 'customer_id', 'order_id', 'product_name', 'amount']].head(10))
print(f"Records: {len(right_merge)}")

# Outer join (all records from both tables)
outer_merge = pd.merge(customers, orders, on='customer_id', how='outer')
print("\nOuter merge (all customers and orders):")
print(outer_merge[['name', 'customer_id', 'order_id', 'product_name', 'amount']].head(15))
print(f"Records: {len(outer_merge)}")

EXERCISE-7: Merging Population Density Datasets [rw-pop-density-gridded]

We have multiple CSV files all with information about population density. Please use the strategy below to merge all of them.

  • Create a list of csv files using glob. You can read up briefly on how glob works.

  • Load each CSV file into a pandas DataFrame

  • Create a new column lat_lon to hold a unique identifier

  • Check the number of unique observations in each DataFrame

  • Merge the DataFrames using the created lat_lon column

  • Check that all rows merged

csv_files = [f for f in DIR_POPULATION_DENSITY.glob("*.csv")]
print(f"Found {len(csv_files)} population density CSV files.")
df = pd.read_csv(csv_files[0])
df.dtypes
df_list = []
for file in csv_files:
    df = pd.read_csv(file)

    # Convert latitude and longitude to string
    df['latitude'] = df['latitude'].astype(str)
    df['longitude'] = df['longitude'].astype(str)

    df['lat_lon'] = df['latitude'] + '_' + df['longitude']
    df_list.append(df)

Let's merge the first two for practice#

# First Dataframe to use as base for merging
df1 = df_list[0]
df2 = df_list[1]

print(df1.shape, df2.shape)

df = df1.merge(df2, on='lat_lon', how='inner', indicator=True)
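
With indicator=True, the merged DataFrame gains a _merge column showing where each row came from, which is a quick way to check that all rows matched:

# Counts of 'both', 'left_only', and 'right_only' rows
print(df['_merge'].value_counts())

# Optionally assert that every row matched on both sides
assert (df['_merge'] == 'both').all()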

EXTENDED EXERCISE: Processing Excel File#

About the Dataset#

The Excel file was generated by combining multiple CSV files, each containing data on different health indicators for Rwanda, such as:

  • access-to-health-care_subnational_rwa

  • child-mortality-rates_subnational_rwa

  • dhs-mobile_subnational_rwa

Dataset Filename: Download the dataset from here.

Task-0: Explore Data in Excel#

  • Take time to explore the Excel sheets and locate where each variable of interest is stored. If anything is unclear, don’t hesitate to ask the tutors for clarification on the data source.

  • Ensure you identify the column that holds the indicator values so that you extract only the relevant data.

  • In most cases, the actual indicator values can be found in the column labeled value.

  • Additionally, be sure to retain the following important columns: ISO3, Location, and SurveyId.
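
A minimal sketch for loading all sheets at once (FILE_DHS_EXCEL is a placeholder for wherever you saved the download, and the column names follow the description above):

# sheet_name=None returns a dict of {sheet name: DataFrame}
sheets = pd.read_excel(FILE_DHS_EXCEL, sheet_name=None)
print(list(sheets.keys()))

# Inspect one sheet and keep only the columns of interest
first_sheet = list(sheets.keys())[0]
df_sheet = sheets[first_sheet]
print(df_sheet[['ISO3', 'Location', 'SurveyId', 'value']].head())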

Task-1: Generate National-Level Summaries#

For each indicator, your goal is to compute a single national-level value. Depending on the nature of the indicator, you may use aggregation functions such as mean, median, or sum.

The final output should be a dataframe printed in Jupyter Notebook as well as a saved CSV file with the following columns:

  • indicator_name: The name of the indicator

  • value: The aggregated national value. You may name this column based on your chosen aggregation method, e.g., mean_indicator_name or median_indicator_name.

Task 2: Subnational Level Indicator Dataset#

For indicators with subnational (administrative level 2 or 3) data available, let's merge them and create a dataset containing all available indicators. The output dataset should have the following columns: