Solutions-Programming Assignment#
Estimating Chichewa Speakers#
Background#
This notebook provides statistics about the Chichewa language. More specifically, the goal is to generate an accurate estimate of the total number of people who speak it. Chichewa is also known by the alternate names Chewa, Nyanja, and Chinyanja. To provide context, we first list the countries where the language is spoken.
Malawi. About 70% of the population speak Chichewa. [1]
Zambia. About 20% of the population speak the language. [2]
Mozambique. Less than 1% of the population speak the language. [3]
Zimbabwe. Although Chichewa seems to be one of the official languages of Zimbabwe, I haven't yet found any data showing how many people speak the language.
Tanzania. It borders Malawi in the northern region, where people speak Tumbuka, so it makes sense that there may be no Chichewa-speaking people there. Otherwise, I did not find any data on the proportion of the population who speak the language.
Based on the analysis in this notebook, as of 2023 there are 21,482,292 people who speak Chichewa, distributed across three countries: Malawi (70%), Zambia (18%), and Mozambique (12%).
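As a rough check on these shares, the snippet below splits the 2023 total by the rounded percentages. The per-country figures it prints are approximations implied by the rounded shares, not values reported elsewhere in this notebook.

```python
# Total Chichewa speakers (2023) and rounded country shares, both quoted above
total_speakers = 21_482_292
shares = {"Malawi": 0.70, "Zambia": 0.18, "Mozambique": 0.12}

# The shares should account for all speakers
assert abs(sum(shares.values()) - 1.0) < 1e-9

# Approximate speakers per country implied by the rounded shares
implied = {country: round(total_speakers * share) for country, share in shares.items()}
for country, n in implied.items():
    print(f"{country}: about {n:,}")
```

Because the percentages are rounded, these counts should be read as orders of magnitude rather than exact estimates.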
The Humanitarian data website contains data about languages for some of these countries. The HUMDATA links are provided below:
Since I was not very sure of some of the numbers from HUMDATA, I decided to check the actual sources as follows:
Malawi and Zambia. I could not find any current surveys with data on languages spoken, but I still found something in the DHS, which asks about the survey respondent's native language. Although this data is not included in DHS reports (it seems to be collected as interview metadata), it's still a useful source of data on languages spoken. For Zimbabwe, Tanzania, and Mozambique, the DHS doesn't have this information, as languages other than the major ones are only coded as language-1, language-2, etc.
Assignment Tasks#
Read the notebook carefully and answer the questions as asked.
In some cases, you will complete missing code. Add your own code where it says “ADD YOUR CODE”.
import pandas as pd
import random
from pathlib import Path
pd.set_option('display.float_format', lambda x: '%.3f' % x)
Setup Input Folders#
In this section, make sure to define the folders where your data is stored on your machine.
I find it helpful to set up the working directory and input data folders right at the start of the notebook.
To keep things organized, I use the naming convention FILE_{NAME} for files and DIR_{NAME} for folders.
We'll be using the pathlib library, which is the recommended approach for managing file paths in Python.
DIR_DATA = Path.cwd().parents[1] / "data"
DIR_DHS = DIR_DATA / "DHS"
# Population by enumeration area (EA) for Malawi
FILE_POP_MW = DIR_DATA / "population/malawi-ea-population.csv"
# MW DHS-2015-16 HH members stata data file
FILE_MW_DHS = DIR_DHS / "MWPR7ADT/MWPR7AFL.DTA"
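Before loading anything, it can help to confirm that the input files actually exist, since a wrong data folder otherwise only surfaces later as a FileNotFoundError. A minimal sketch follows; the helper name `missing_inputs` is hypothetical and not part of the original notebook, and the example path assumes a `data` folder under the current working directory.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]


# Example path following the FILE_{NAME} convention; the folder layout is an assumption
example_file = Path.cwd() / "data" / "population" / "malawi-ea-population.csv"

for path in missing_inputs([example_file]):
    print(f"Missing input file: {path}")
```

In this notebook, the list passed to the helper would be `[FILE_POP_MW, FILE_MW_DHS]`.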
Chichewa speakers in Malawi Based on Tribe#
Malawi is the primary country where Chichewa is spoken. In the absence of precise language data, we begin by estimating the number of Chichewa speakers based on tribal affiliation.
def mw_estimate_chichewa_speaking_based_on_tribe():
    """
    Get an estimate of people who speak Chichewa and Chinyanja based
    on tribes
    data source: 2018 census main report:
    http://www.nsomalawi.mw/images/stories/data_on_line/demography/census_2018/
    """
    # from MW 2018 census report
    total_pop_malawi = 17506022
    # =======================
    # POPULATION BY TRIBE
    # =======================
    mw_chewa_trb = 6020945
    mw_tumbuka_trb = 1614955
    mw_lomwe_trb = 3302634
    mw_tonga_trb = 310031
    mw_yao_trb = 2321763
    mw_sena_trb = 670908
    mw_nkhonde_trb = 174430
    mw_lambya_trb = 106769
    mw_sukwa_trb = 93762
    mw_manganja_trb = 559887
    mw_nyanja_trb = 324272
    mw_ngoni_tribe = 1819347
    mw_other_trb = 186319
    # check that totals from tribes match with given total pop
    tot = (mw_chewa_trb + mw_lambya_trb + mw_sukwa_trb + mw_other_trb + mw_manganja_trb + mw_sena_trb
           + mw_lomwe_trb + mw_ngoni_tribe + mw_tumbuka_trb + mw_tonga_trb + mw_nyanja_trb + mw_yao_trb
           + mw_nkhonde_trb)
    assert tot == total_pop_malawi
    # =====================================
    # SUM POP FOR CHICHEWA SPEAKING TRIBES
    # =====================================
    # the following tribes speak chichewa:
    # chewa, ngoni, mang'anja, nyanja and other
    chichewa_speaking_pop = (mw_chewa_trb + mw_other_trb + mw_ngoni_tribe
                             + mw_manganja_trb + mw_nyanja_trb)
    print('Chichewa speaking population: {:,}'.format(chichewa_speaking_pop))

print("="*50)
print(" Malawi, Chichewa Speakers Based on Tribe")
print("="*50)
mw_estimate_chichewa_speaking_based_on_tribe()
==================================================
Malawi, Chichewa Speakers Based on Tribe
==================================================
Chichewa speaking population: 8,910,770
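The same calculation can be written more compactly by holding the census counts in a dictionary and returning the result instead of only printing it. This is a sketch of an alternative layout, not the notebook's original code; the names `TRIBE_POP`, `CHICHEWA_TRIBES` and `chichewa_speakers_by_tribe` are introduced here for illustration.

```python
# Population by tribe, copied from the function above (2018 census)
TRIBE_POP = {
    "chewa": 6_020_945, "tumbuka": 1_614_955, "lomwe": 3_302_634,
    "tonga": 310_031, "yao": 2_321_763, "sena": 670_908,
    "nkhonde": 174_430, "lambya": 106_769, "sukwa": 93_762,
    "manganja": 559_887, "nyanja": 324_272, "ngoni": 1_819_347,
    "other": 186_319,
}
# Tribes assumed above to speak Chichewa
CHICHEWA_TRIBES = ["chewa", "ngoni", "manganja", "nyanja", "other"]


def chichewa_speakers_by_tribe(tribe_pop, tribes):
    """Sum the population of the tribes assumed to speak Chichewa."""
    return sum(tribe_pop[t] for t in tribes)


total = sum(TRIBE_POP.values())
speakers = chichewa_speakers_by_tribe(TRIBE_POP, CHICHEWA_TRIBES)
print(f"Chichewa speaking population: {speakers:,} ({speakers / total:.1%} of {total:,})")
```

Returning the number makes it easy to reuse in later cells, for example in the summary at the end of the notebook.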
Chichewa speakers in Malawi Based on District of Residence#
In this section, we estimate Chichewa speakers based on the district of residence. For example, we can assume that people from the following districts speak Chichewa:
Mzuzu City, Kasungu, Nkhotakota, Ntchisi, Dowa, Salima, Lilongwe,
Mchinji, Dedza, Ntcheu, Lilongwe City,
Zomba, Chiradzulu, Blantyre, Mwanza, Thyolo,
Mulanje, Phalombe, Balaka, Neno,
Zomba City, Blantyre City
def mw_estimate_chichewa_speakers_by_district(dist_pop: pd.DataFrame):
    """
    Get an estimate of people who speak Chichewa and Chinyanja based
    on district of residence, by summing the population of the
    districts assumed to be Chichewa speaking.
    Parameters:
    dist_pop (pd.DataFrame): DataFrame with population by district for MW
    Returns:
    number: Population of Chichewa speakers in MW
    """
    # ===========================
    # CHICHEWA SPEAKING DISTRICTS
    # ===========================
    # List of chichewa speaking districts
    chichewa_speaking_dists = ['Mzuzu City',
                               'Kasungu', 'Nkhotakota', 'Ntchisi', 'Dowa', 'Salima', 'Lilongwe',
                               'Mchinji', 'Dedza', 'Ntcheu', 'Lilongwe City',
                               'Zomba', 'Chiradzulu', 'Blantyre', 'Mwanza', 'Thyolo',
                               'Mulanje', 'Phalombe', 'Balaka', 'Neno',
                               'Zomba City', 'Blantyre City']
    # Use the DataFrame passed in as a parameter, not the global df_dist
    df_chichewa_speaking = dist_pop.query('DIST_NAME in @chichewa_speaking_dists')
    chich_pop = df_chichewa_speaking['TOTAL_POP'].sum()
    return chich_pop
# Get population by district for Malawi
df_pop_ea = pd.read_csv(FILE_POP_MW)
df_dist = df_pop_ea.groupby('DIST_NAME')['TOTAL_POP'].sum().reset_index()
# Chichewa Speaking Districts
print("="*60)
print(" Malawi, Chichewa Speakers Based on District of Residence")
print("="*60)
pop = mw_estimate_chichewa_speakers_by_district(dist_pop=df_dist)
print('Chichewa speaking population: {:,}'.format(pop))
============================================================
Malawi, Chichewa Speakers Based on District of Residence
============================================================
Chichewa speaking population: 12,747,340.0
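To see the filtering step in isolation, here is a small sketch on made-up data. The district populations below are invented for illustration, and the column names DIST_NAME and TOTAL_POP simply mirror those used above.

```python
import pandas as pd

# Toy district table; the population values are invented
toy_dist = pd.DataFrame({
    "DIST_NAME": ["Dowa", "Salima", "Karonga", "Rumphi"],
    "TOTAL_POP": [100.0, 50.0, 30.0, 20.0],
})
toy_chichewa_dists = ["Dowa", "Salima"]

# Keep only rows whose district is in the list, then sum the population
mask = toy_dist["DIST_NAME"].isin(toy_chichewa_dists)
toy_pop = toy_dist.loc[mask, "TOTAL_POP"].sum()
print(f"Toy Chichewa speaking population: {int(toy_pop):,}")
```

Using `isin` with a boolean mask is equivalent to the `query` call in the function above; either works, and the choice is mostly a matter of taste.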
Chichewa speakers in Malawi Based on DHS Data#
Using the Demographic and Health Survey (DHS), we estimate the number of Chichewa speakers from responses to the question on the respondent's native language.
While the DHS does not provide exhaustive linguistic data, the self-reported language question offers a useful proxy for estimating language distribution across the population.
This approach allows us to approximate the number of Chichewa speakers based on individual-level survey data that is nationally representative.
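The core of this estimate is a sample proportion scaled to a population total. The sketch below shows the idea on a made-up sample; the language codes follow the mapping used later in this notebook (2 for Chichewa), while the responses and the population of 1,000 are invented.

```python
import pandas as pd

# Invented native-language codes for 10 respondents (2 = Chichewa, 3 = Tumbuka, 6 = Other)
toy_lang = pd.Series([2, 2, 3, 2, 6, 2, 2, 3, 2, 2])

# Share of respondents per language code
props = toy_lang.value_counts(normalize=True)
toy_chich_prop = props.loc[[2]].sum()

# Scale the sample share to an (invented) total population
toy_total_pop = 1_000
toy_chich_pop = int(round(toy_chich_prop * toy_total_pop))
print(f"Share: {toy_chich_prop:.2f}, estimated speakers: {toy_chich_pop:,}")
```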
def chich_speaking_pop_DHS_based(stata_file, total_pop, chichewa_lan_codes,
                                 dhs_tot_hhs):
    """
    Estimate Chichewa speaking people from DHS survey question
    on native language of respondent.
    Parameters:
    stata_file (str): Path to STATA (.dta) file for household members. Data
    can be accessed here: https://dhsprogram.com/data/dataset/Malawi_Standard-DHS_2015.cfm?flag=0
    total_pop (int): Total population for the country
    chichewa_lan_codes (list): Language codes that represent Chichewa. For example: [2], [2,3]
    dhs_tot_hhs (int): Total number of households in DHS, for verification purposes.
    Returns:
    tuple: (proportion of Chichewa speakers, population of Chichewa speakers)
    """
    # Load the stata file
    df = pd.read_stata(stata_file, convert_categoricals=False)
    # Rename the columns
    # Grab these from STATA-Do file available in same folder as the data
    cols = {'hv045b': 'intv_lan', 'hv045c': 'resp_nativ_lan', 'hv046': 'translator',
            'hv002': 'hh_num', 'hv005': 'weight',
            'hv045a': 'qn_lan', 'hv001': "cluster_number", 'hv004': "area_unit"}
    keep_cols = ['hhid'] + list(cols.keys())
    # Select and rename in one step so we work on a new DataFrame, not a view
    df = df[keep_cols].rename(columns=cols)
    # Build a household id from the cluster number and household number
    df['hh_id'] = df.apply(lambda x:
                           str(x['cluster_number']).zfill(3) +
                           str(x['hh_num']).zfill(3), axis=1)
    # Check that we have all households as expected: 26,361 as indicated
    # in the report. Report a mismatch instead of stopping.
    n_hhs = df.hh_id.nunique()
    if n_hhs != dhs_tot_hhs:
        print('{:,} households from this file compared to {:,} reported number'.format(
            n_hhs, dhs_tot_hhs))
        print()
    # ========================================
    # TABULATE RESPONDENT NATIVE LANGUAGE
    # ========================================
    # Since we are getting national stats, we
    # will not weight
    chich_prop = df.resp_nativ_lan.value_counts(normalize=True).loc[chichewa_lan_codes]
    print(chich_prop)
    chich_prop_total = chich_prop.sum()
    # Get population from the proportion
    chich_pop = int(chich_prop_total*total_pop)
    return chich_prop_total, chich_pop
# ===================
# Malawi
# ===================
language_names = {1: "English", 2:"Chichewa", 3:"Tumbuka", 6: "Other"}
# MW DHS-2015-16 total sample households
mw_dhs_hhs = 26361
# MW 2018 census population projection
# From here: http://www.nsomalawi.mw/images/stories/data_on_line/demography/census_2018/\
# Thematic_Reports/Population%20Projections%202018-2050.pdf
mw_proj_pop_2023 = 19809511
mw_chich_prop, mw_chich_pop = chich_speaking_pop_DHS_based(stata_file=FILE_MW_DHS, chichewa_lan_codes=[2],
total_pop=mw_proj_pop_2023,
dhs_tot_hhs=mw_dhs_hhs)
print('================================================')
print(' Based on the 2015-16 Malawi DHS and 2018 census')
print('================================================')
print('Estimated number of Chichewa speaking people in Malawi is : {:,}'.format(mw_chich_pop ))
================================================
Based on the 2015-16 Malawi DHS and 2018 census
================================================
Estimated number of Chichewa speaking people in Malawi is : 15,050,638
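The estimate above uses unweighted proportions. As a point of comparison, DHS household records carry a sample weight (hv005, stored with six implied decimal places), and a weighted share can be computed as sketched below. The data are invented and the function name `weighted_share` is hypothetical; whether weighting changes the Malawi figure noticeably is not something this notebook tests.

```python
import pandas as pd


def weighted_share(df, lang_col, weight_col, codes):
    """Weighted share of rows whose language code is in `codes`."""
    w = df[weight_col] / 1_000_000  # DHS weights carry six implied decimals
    return w[df[lang_col].isin(codes)].sum() / w.sum()


# Invented example: four respondents with unequal weights
toy = pd.DataFrame({
    "resp_nativ_lan": [2, 2, 3, 6],
    "weight": [500_000, 500_000, 1_500_000, 1_500_000],
})
unweighted = toy["resp_nativ_lan"].isin([2]).mean()
weighted = weighted_share(toy, "resp_nativ_lan", "weight", [2])
print(f"Unweighted: {unweighted:.2f}, weighted: {weighted:.2f}")
```

In this toy case the Chichewa respondents carry smaller weights, so the weighted share is lower than the unweighted one; with real data the difference could go either way.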
Summary: Chichewa Speaking Population in Malawi#
print("==============================================================")
print("Comparison of Chichewa Speaking Population Estimates (Malawi)")
print("==============================================================")
print(f"Based on Tribe: {6_020_945 + 186_319 + 1_819_347 + 559_887 + 324_272:,} people")
print(f"Based on District of Residence: {int(pop):,} people")
print(f"Based on DHS Survey: {mw_chich_pop:,} people")
==============================================================
Comparison of Chichewa Speaking Population Estimates (Malawi)
==============================================================
Based on Tribe: 8,910,770 people
Based on District of Residence: 12,747,340 people
Based on DHS Survey: 15,050,638 people
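These estimates do not share a common base: the tribe counts come from the 2018 census total, while the DHS estimate is scaled to the 2023 projection. Expressing each as a share of its own base makes them easier to compare. The sketch below does this for the two estimates whose base population is stated in this notebook; the district-based estimate is left out because the total of the population file is not shown.

```python
# Estimates and base populations quoted in this notebook
estimates = {
    "tribe (2018 census)": (8_910_770, 17_506_022),
    "DHS (2023 projection)": (15_050_638, 19_809_511),
}

# Share of the base population for each estimate
shares = {name: speakers / base for name, (speakers, base) in estimates.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%} of the base population")
```

The gap between the two shares reflects the difference in method (tribal affiliation versus reported native language), not just the different reference years.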