Solutions-Programming Assignment

Estimating Chichewa Speakers#

Background#

This notebook provides statistics about the Chichewa language. More specifically, the goal is to generate an accurate estimate of the total number of people who speak it. Note that Chichewa is also known by the alternate names Chewa, Nyanja, and Chinyanja. To provide context, we first list the countries where the language is spoken.

Malawi. About 70% of the population speaks Chichewa. ^[1]
Zambia. About 20% of the population speaks the language. ^[2]
Mozambique. Less than 1% of the population speaks the language. ^[3]
Zimbabwe. Although Chichewa seems to be one of the official languages of Zimbabwe, I have not yet found any data showing how many people speak it.
Tanzania. It borders Malawi in the northern region, where people speak Tumbuka, so it is plausible that there are no Chichewa speakers there. I did not find any data on the proportion of the population who speak the language.

Based on the analysis in this notebook, as of 2023, there are 21,482,292 people who speak Chichewa, distributed across three countries: Malawi (70%), Zambia (18%) and Mozambique (12%). These percentages are shares of all Chichewa speakers, not shares of each country's population.
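
As a rough check on how that total splits by country, the rounded shares quoted above can be applied to the headline figure. The sketch below uses only the numbers in the sentence above. Because the shares are rounded to whole percentages, the per-country counts are approximate, and the variable names are illustrative.

```python
# Headline total and rounded country shares quoted in the text above.
total_speakers = 21_482_292
shares = {"Malawi": 0.70, "Zambia": 0.18, "Mozambique": 0.12}

# Approximate speakers per country; the shares are rounded, so counts are indicative only.
approx_speakers = {
    country: round(total_speakers * share) for country, share in shares.items()
}

for country, count in approx_speakers.items():
    print(f"{country}: about {count:,}")
```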

The Humanitarian data website contains data about languages for some of these countries. The HUMDATA links are provided below:

Since I was not sure about some of the numbers from humdata, I decided to check them against the original sources as follows:

Malawi and Zambia. I could not find any current surveys with data on languages spoken, but the DHS offers something useful: it asks about the survey respondent’s native language. Although this is not included in DHS reports (it seems to be collected as interview metadata), it is still a useful source of data on languages spoken. For Zimbabwe, Tanzania and Mozambique, the DHS does not have this information, as languages other than the major ones are recorded only as language-1, language-2, and so on.

1. ^ 2015-16 MDHS and humdata

2. ^ 2018 Zambia DHS and humdata

3. ^ Humdata

Assignment Tasks#

Read the notebook carefully and answer the questions as asked.
In some cases, you will complete missing code. Add your own code where it says “ADD YOUR CODE”.

import pandas as pd
import random
from pathlib import Path
pd.set_option('display.float_format', lambda x: '%.3f' % x)

Setup Input Folders#

In this section, make sure to define the folders where your data is stored on your machine.
I find it helpful to set up the working directory and input data folders right at the start of the notebook.
To keep things organized, I use the naming convention: FILE_{NAME} for files and DIR_{NAME} for folders.

We’ll be using the pathlib library, which is the recommended approach for managing file paths in Python.

DIR_DATA = Path.cwd().parents[1] / "data"
DIR_DHS = DIR_DATA / "DHS"

# Population by enumeration area (EA) for Malawi
FILE_POP_MW = DIR_DATA / "population/malawi-ea-population.csv"

# MW DHS-2015-16 HH members stata data file
FILE_MW_DHS = DIR_DHS / "MWPR7ADT/MWPR7AFL.DTA"
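
Because every later cell depends on these paths, it can help to confirm that they exist before loading anything. The following is a minimal sketch; the helper name `missing_paths` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the subset of paths that do not exist on disk."""
    return [Path(p) for p in paths if not Path(p).exists()]


# Demo with one path that exists (the current folder) and one that should not.
candidates = [Path.cwd(), Path.cwd() / "no-such-folder-for-demo"]
for p in missing_paths(candidates):
    print(f"Missing: {p}")
```

In the notebook this could be called as `missing_paths([FILE_POP_MW, FILE_MW_DHS])` right after the definitions above, so that a wrong folder is reported before `pd.read_csv` or `pd.read_stata` fails.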

Chichewa speakers in Malawi Based on Tribe#

Malawi is the primary country where Chichewa is spoken. In the absence of precise language data, we begin by estimating the number of Chichewa speakers based on tribal affiliation.

def mw_estimate_chichewa_speaking_based_on_tribe():
    """
    Get an estimate of people who speak Chichewa and Chinyanja based
    on tribes
    data source: 2018 census main report: 
    http://www.nsomalawi.mw/images/stories/data_on_line/demography/census_2018/
    """
    # from MW 2018 census report 
    total_pop_malawi = 17506022
    
    # =======================
    # POPULATION BY TRIBE
    # =======================
    mw_chewa_trb = 6020945
    mw_tumbuka_trb = 1614955
    mw_lomwe_trb = 3302634
    mw_tonga_trb = 310031
    mw_yao_trb = 2321763
    mw_sena_trb = 670908
    mw_nkhonde_trb = 174430
    mw_lambya_trb = 106769
    mw_sukwa_trb = 93762
    mw_manganja_trb = 559887
    mw_nyanja_trb = 324272
    mw_ngoni_tribe = 1819347
    mw_other_trb = 186319
    
    # check that totals from tribes match with given total pop
    tot = mw_chewa_trb + mw_lambya_trb + mw_sukwa_trb + mw_other_trb+ mw_manganja_trb+ mw_sena_trb+\
    mw_lomwe_trb + mw_ngoni_tribe + mw_tumbuka_trb + mw_tonga_trb + mw_nyanja_trb + mw_yao_trb + mw_nkhonde_trb
    assert tot == total_pop_malawi
    
    # =====================================
    # SUM POP FOR CHICHEWA SPEAKING TRIBES
    # ====================================
    # the following tribes speak chichewa:
    # chewa, ngoni,mang'anja, nyanja and other
    chichewa_speaking_pop = mw_chewa_trb + mw_other_trb + mw_ngoni_tribe + \
    mw_manganja_trb + mw_nyanja_trb
    print('Chichewa speaking population: {:,}'.format(chichewa_speaking_pop))

print("="*50)
print(" Malawi, Chichewa Speakers Based on Tribe")
print("="*50)
mw_estimate_chichewa_speaking_based_on_tribe()

==================================================
 Malawi, Chichewa Speakers Based on Tribe
==================================================
Chichewa speaking population: 8,910,770
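
To put this count in context, it can be expressed as a share of the census total. The sketch below reuses the tribe counts and the total population from the function above; the dictionary layout is only an illustration.

```python
# Tribe counts assumed to speak Chichewa, copied from the function above.
chichewa_tribes = {
    "Chewa": 6_020_945,
    "Ngoni": 1_819_347,
    "Mang'anja": 559_887,
    "Nyanja": 324_272,
    "Other": 186_319,
}
total_pop_malawi = 17_506_022  # 2018 census total used above

speakers = sum(chichewa_tribes.values())
share = speakers / total_pop_malawi
print(f"Chichewa speaking population: {speakers:,} ({share:.1%} of the 2018 total)")
```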

Chichewa speakers in Malawi Based on District of Residence#

In this section, we estimate Chichewa speakers based on the district of residence. For example, we can assume that people from the following districts speak Chichewa:

Mzuzu City, Kasungu, Nkhotakota, Ntchisi, Dowa, Salima, Lilongwe,
Mchinji, Dedza, Ntcheu, Lilongwe City,
Zomba, Chiradzulu, Blantyre, Mwanza, Thyolo,
Mulanje, Phalombe, Balaka, Neno,
Zomba City, Blantyre City

def mw_estimate_chichewa_speakers_by_district(dist_pop:pd.DataFrame):
    """
    Estimate the number of people who speak Chichewa and Chinyanja based
    on district of residence, by summing the population of the districts
    assumed to be Chichewa speaking.
    Parameters:
    dist_pop (pd.DataFrame): DataFrame with population by district for MW
    (columns DIST_NAME and TOTAL_POP)
    Returns:
    Total population of the Chichewa speaking districts

   """
    # ===========================
    # CHICHEWA SPEAKING DISTRICTS
    # ===========================
    # List of chichewa speaking districts
    chichewa_speaking_dists = ['Mzuzu City',
       'Kasungu', 'Nkhotakota', 'Ntchisi', 'Dowa', 'Salima', 'Lilongwe',
       'Mchinji', 'Dedza', 'Ntcheu', 'Lilongwe City', 
       'Zomba', 'Chiradzulu', 'Blantyre', 'Mwanza', 'Thyolo',
       'Mulanje', 'Phalombe', 'Balaka', 'Neno',
       'Zomba City', 'Blantyre City']
    
    # Use the DataFrame passed in as an argument rather than the global df_dist
    df_chichewa_speaking = dist_pop.query('DIST_NAME in @chichewa_speaking_dists')
    chich_pop = df_chichewa_speaking['TOTAL_POP'].sum()
    return chich_pop

# Get population by district for Malawi
df_pop_ea = pd.read_csv(FILE_POP_MW)
df_dist = df_pop_ea.groupby('DIST_NAME')['TOTAL_POP'].sum().reset_index()

# Chichewa Speaking Districts
print("="*60)
print(" Malawi, Chichewa Speakers Based on District of Residence")
print("="*60)
pop = mw_estimate_chichewa_speakers_by_district(dist_pop=df_dist)
print('Chichewa speaking population: {:,}'.format(pop))

============================================================
 Malawi, Chichewa Speakers Based on District of Residence
============================================================
Chichewa speaking population: 12,747,340.0
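
The filter-and-sum step inside the function can be tried on a small table. In the sketch below, Dowa and Salima come from the district list above, and Karonga stands in for a district outside that list. The population figures are made-up placeholders, not census values.

```python
import pandas as pd

# Toy district table: populations are hypothetical placeholders.
toy_dist = pd.DataFrame({
    "DIST_NAME": ["Dowa", "Salima", "Karonga"],
    "TOTAL_POP": [100, 200, 300],
})
toy_chichewa_dists = ["Dowa", "Salima"]

# isin keeps only rows whose district is in the list, like the query call above.
mask = toy_dist["DIST_NAME"].isin(toy_chichewa_dists)
toy_total = int(toy_dist.loc[mask, "TOTAL_POP"].sum())
print(f"Toy Chichewa speaking population: {toy_total:,}")
```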

Chichewa speakers in Malawi Based on DHS Data#

Here we use the Demographic and Health Survey (DHS) to estimate the number of Chichewa speakers, based on responses to the question on the respondent's native language.
While the DHS does not provide exhaustive linguistic data, the self-reported language question offers a useful proxy for estimating language distribution across the population.
This approach allows us to approximate the number of Chichewa speakers based on individual-level survey data that is nationally representative.

def chich_speaking_pop_DHS_based(stata_file, total_pop, chichewa_lan_codes, 
                                    dhs_tot_hhs):
    """
    Estimate Chichewa speaking people from DHS survey question
    on native language of respondent.
    Parameters:
    stata_file (str or Path): Path to STATA (.dta) file for household members. Data
    can be accessed here: https://dhsprogram.com/data/dataset/Malawi_Standard-DHS_2015.cfm?flag=0
    total_pop (int): Total population for the country
    chichewa_lan_codes (list): Language codes that represent Chichewa. For example: [2], [2,3]
    dhs_tot_hhs (int): Total number of households in DHS, for verification purposes.

    Returns:
    tuple: (proportion of Chichewa speakers, estimated population of Chichewa speakers)

   """
    # Load the stata file
    df = pd.read_stata(stata_file, convert_categoricals=False)
    
    # Rename the columns
    # Grab these from STATA-Do file available in same folder as the data
    cols = {'hv045b': 'intv_lan', 'hv045c':'resp_nativ_lan', 'hv046': 'translator', 
        'hv002':'hh_num','hv005': 'weight',
 'hv045a': 'qn_lan', 'hv001':  "cluster_number", 'hv004':   "area_unit"}
    keep_cols = ['hhid'] + list(cols.keys())
    # Select and rename in one step to avoid renaming a slice in place
    df = df[keep_cols].rename(columns=cols)
    df['hh_id'] = df.apply(lambda x: 
                           str(x['cluster_number']).zfill(3) + 
                           str(x['hh_num']).zfill(3), axis=1)
    
    # Check that we have all households as expected: 26,361 as indicated
    # in the report
    n_hhs = df.hh_id.nunique()
    if n_hhs != dhs_tot_hhs:
        print('{:,} households from this file compared to {:,} reported number'.format(
            n_hhs, dhs_tot_hhs))
        print()
    
    # ========================================
    # TABULATE RESPONDENT NATIVE LANGUAGE
    # ========================================
    # Since we are getting national stats, we
    # will not weight 
    chich_prop = df.resp_nativ_lan.value_counts(normalize=True)[chichewa_lan_codes]
    print(chich_prop)
    chich_prop_total = chich_prop.sum()
    
    # Get population from the proportion
    chich_pop = int(chich_prop_total*total_pop)
    
    
    return chich_prop_total, chich_pop

# ===================
# Malawi
# ===================
language_names = {1: "English", 2:"Chichewa", 3:"Tumbuka", 6: "Other"}

# MW DHS-2015-16 total sample households
mw_dhs_hhs = 26361

# MW 2018 census population projection
# From here: http://www.nsomalawi.mw/images/stories/data_on_line/demography/census_2018/\
# Thematic_Reports/Population%20Projections%202018-2050.pdf
mw_proj_pop_2023 = 19809511
mw_chich_prop, mw_chich_pop = chich_speaking_pop_DHS_based(stata_file=FILE_MW_DHS, chichewa_lan_codes=[2], 
                                    total_pop=mw_proj_pop_2023,  
                                    dhs_tot_hhs=mw_dhs_hhs)
print('================================================')
print(' Based on the 2015-16 Malawi DHS and 2018 census')
print('================================================')
print('Estimated number of Chichewa speaking people in Malawi is : {:,}'.format(mw_chich_pop ))

================================================
Based on the 2015-16 Malawi DHS and 2018 census
================================================
Estimated number of Chichewa speaking people in Malawi is : 15,050,638
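
The function above tabulates unweighted proportions. For comparison, a weighted share could be computed from the DHS sample weight, which is stored with six implied decimal places and is therefore divided by 1,000,000 before use. The sketch below runs on a made-up five-row table, not on the DHS file. The column names follow the renamed columns in the function, and the helper `weighted_share` is hypothetical.

```python
import pandas as pd


def weighted_share(df, codes, lang_col="resp_nativ_lan", weight_col="weight"):
    """Share of total survey weight held by rows whose language code is in codes."""
    w = df[weight_col] / 1_000_000  # DHS weights carry six implied decimals
    return w[df[lang_col].isin(codes)].sum() / w.sum()


# Made-up rows for illustration only (2 = Chichewa, 3 = Tumbuka in the notebook's coding).
toy = pd.DataFrame({
    "resp_nativ_lan": [2, 2, 3, 2, 3],
    "weight": [1_000_000, 2_000_000, 1_000_000, 500_000, 500_000],
})

unweighted = toy["resp_nativ_lan"].isin([2]).mean()
weighted = weighted_share(toy, [2])
print(f"Unweighted share: {unweighted:.3f}, weighted share: {weighted:.3f}")
```

Whether weighting changes the national estimate noticeably depends on the survey design, so this is a comparison worth running rather than a correction to the result above.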

Summary: Chichewa Speaking Population in Malawi#

print("==============================================================")
print("Comparison of Chichewa Speaking Population Estimates (Malawi)")
print("==============================================================")
print(f"Based on Tribe: {6_020_945 + 186_319 + 1_819_347 + 559_887 + 324_272:,} people")
print(f"Based on District of Residence: {int(pop):,} people")
print(f"Based on DHS Survey: {mw_chich_pop:,} people")

==============================================================
Comparison of Chichewa Speaking Population Estimates (Malawi)
==============================================================
Based on Tribe: 8,910,770 people
Based on District of Residence: 12,747,340 people
Based on DHS Survey: 15,050,638 people
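
To see how far apart the three estimates are, they can be placed in one table and compared with the smallest. The sketch below hard-codes figures printed in this notebook, using the tribe-based value computed by the function earlier (8,910,770). Note that the DHS-based figure is scaled to the 2023 projected population, whereas the tribe-based figure uses the 2018 census total, so part of the gap reflects different base populations.

```python
import pandas as pd

# Estimates printed earlier in this notebook.
estimates = pd.Series({
    "Tribe": 8_910_770,
    "District of residence": 12_747_340,
    "DHS survey": 15_050_638,
})

comparison = pd.DataFrame({
    "estimate": estimates,
    "ratio_to_smallest": estimates / estimates.min(),
})
print(comparison)
```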