Benchmarking Google Translation for Chichewa#


Author: Dunstan Matekenya | Date: June 3, 2023

In this notebook, the goal is to investigate the accuracy of machine translation (MT) from English to Chinyanja. Note that I will often use Chichewa instead of Chinyanja, but officially the language is known as Chinyanja (ISO 639-1 code ny). Of all the NLP capabilities for this language, MT is probably the best developed and the most commonly used. However, to the best of my knowledge, there is no publicly available documentation of the quality (accuracy) of Google translations for Chichewa. I will focus on the task of translating from English to Chichewa. In particular, I will address two questions in this notebook.

  1. What is the quality of Google's English to Chichewa translations?

  2. How does this quality compare to that for other high-resource and low-resource languages?

I believe answering these questions is important. For example, if one intends to use Google Translate for official documents or for tasks beyond daily personal use, it is important to understand the quality of the translations. Having this information in the public domain is useful, as Google does not usually publish these metrics.

For measuring the quality of translations, I elect to use the BLEU metric, which is also used by Google and is perhaps the most commonly used metric in MT. In the rest of the notebook, I provide the following:

  • Brief description of the BLEU metric

  • English to Chichewa parallel corpus used for tests.

  • Computation of BLEU score over Chichewa and other datasets.

  • Discussion of the results in the summary section.

import warnings
warnings.filterwarnings('ignore')
from pathlib import Path
import nltk
# nltk.download('all')
from google.cloud import translate_v2 as translate
from nltk.translate.bleu_score import sentence_bleu
from nltk.tokenize import word_tokenize, sent_tokenize
from googletrans import Translator, LANGUAGES
import pandas as pd
import numpy as np
from IPython.display import Image, display
pd.set_option('display.float_format', lambda x: '%.3f' % x)

Setup input data files#

DIR_DATA = Path.cwd().parents[0].joinpath('data', 'machine-translation')

# English to Chichewa Parallel corpus
DIR_CHICH_DATA = DIR_DATA.joinpath('chich2english')
FILE_EN2CH_ONLINE = DIR_CHICH_DATA.joinpath('english2chich-online-texts.csv')
FILE_EN2CH_PROF = DIR_CHICH_DATA.joinpath('english2chich-prof-translator.csv')
FILE_EN2CH_WEBSITE = DIR_CHICH_DATA.joinpath('chich2english-website-sourced.csv')

# English to other languages
FILE_EN2SP = DIR_DATA.joinpath('spa.txt')
FILE_EN2LUGANDA = DIR_DATA.joinpath('Luganda.csv')
FILE_EN2DE_EN = DIR_DATA.joinpath('train.en.txt')
FILE_EN2DE_DE = DIR_DATA.joinpath('train.de.txt')

Define utility functions#

def compute_bleu_score(df_sent: pd.DataFrame, chich_cols: list, en_col: str, target_lan: str):
    """
    Computes BLEU score over all sentences in a DataFrame.
    df_sent: DataFrame with English sentences and their human translations
    chich_cols: Columns containing human translated (reference) sentences
    en_col: Column containing the original English sentence to be translated
    target_lan: ISO 639-1 code of the target language (e.g., 'ny')
    
    Returns the following:
        - Mean BLEU score over all sentences successfully translated
        - Total number of sentences translated
        - Total number of English tokens translated
    """
    scores = []
    sent_cnts = 0
    char_cnts = 0
    for idx, row in df_sent.iterrows():
        try:
            eng_sent = row[en_col]
            eng_sent_tokens = word_tokenize(eng_sent)
            chich_google = translate_with_google_api(dest=target_lan, src='en', text=eng_sent)
            chich_google_tokens = [tok for tok in word_tokenize(chich_google) if tok.isalpha()]
            sent_cnts += 1
            char_cnts += len(eng_sent_tokens)
        except Exception as e:
            print(e)
            # Skip this row so a failed translation doesn't reuse the previous hypothesis
            continue
        
        try:
            human_chich_tokens = []
            for col in chich_cols:
                chich_tokens = word_tokenize(row[col])
                chich_tokens_clean = [token for token in chich_tokens if token.isalpha()]
                human_chich_tokens.append(chich_tokens_clean)
            score = sentence_bleu(references=human_chich_tokens, hypothesis=chich_google_tokens)
            scores.append(score)
        except Exception as e:
            print('BLEU CALCULATION FAILED ...')
            print(e)
            
    mean_score = np.mean(scores)
    
    return mean_score, sent_cnts, char_cnts
def translate_with_google_api(dest: str, src: str, text: str, use_paid=True) -> str:
    """Translates text into the target language with Google, using either 
    the paid Cloud Translation API or the free googletrans Python package. 

    Target must be an ISO 639-1 language code.
    See https://g.co/cloud/translate/v2/translate-reference#supported_languages
    """
    # Google commercial API
    translate_client = translate.Client()
    
    # Google free API through Python
    translator = Translator()
    
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    
    if use_paid:
        # Text can also be a sequence of strings, in which case this method
        # will return a sequence of results for each text.
        result = translate_client.translate(text, target_language=dest, source_language=src)
        return result["translatedText"]
    else:
        result = translator.translate(text, dest=dest, src=src)
        return result.text

BLEU score#

In this experiment, we use the BLEU (BiLingual Evaluation Understudy) score as the evaluation metric, simply because it is the most commonly used metric in MT. In summary, BLEU measures how close a candidate translation is to one or more reference sentences (the ground truth). A BLEU score ranges from 0 to 1, and it can be read as a percentage for easier interpretation.
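
To make the metric concrete, here is a minimal pure-Python sketch of what BLEU computes: clipped n-gram precisions combined by a geometric mean, scaled by a brevity penalty. This is an illustration only; the experiments below use NLTK's sentence_bleu, which additionally supports custom weights and smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        # Clip each candidate n-gram count by its maximum count in any reference
        max_ref_counts = Counter()
        for ref in references:
            for ng, cnt in Counter(ngrams(ref, n)).items():
                max_ref_counts[ng] = max(max_ref_counts[ng], cnt)
        clipped = sum(min(cnt, max_ref_counts[ng]) for ng, cnt in cand_counts.items())
        total = sum(cand_counts.values())
        precisions.append(clipped / total if total else 0.0)
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision drives BLEU to 0
    # Brevity penalty: penalize candidates shorter than the closest reference
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Note that for a single short sentence without smoothing, one missing 4-gram drives the score to zero; this is why NLTK offers a SmoothingFunction for sentence-level use.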

For context, this is how Google relates BLEU score with MT system reliability.

display(Image(filename='../docs/images/BLEU-score-interpreration.png'))

Datasets for evaluation#

In this experiment, we use three independent datasets as follows.

Dataset-1: Translations based on online sources#

In this dataset, two non-professional native speakers of Chichewa translated from English to Chichewa. The source English sentences came from several places online: Malawi news articles, WhatsApp posts, and other online translation datasets.

df_online = pd.read_csv(FILE_EN2CH_ONLINE)
df_online1 = df_online[['sentence', 'chich_shadreck']].dropna()
df_online2 = df_online[['sentence', 'chich_gloria']].dropna()
df_online12 =  df_online[['sentence', 'chich_gloria', 'chich_shadreck']].dropna()
print('='*40)
print(' Summary Stats for Dataset-1')
print('='*40)
print('1. Number of English sentences: {}'.format(len(df_online)))
print('2. Translator-1: {}'.format(len(df_online1)))
print('3. Translator-2: {}'.format(len(df_online2)))
print('4. Translator-1 and 2: {}'.format(len(df_online12)))
print('-'*40)
========================================
 Summary Stats for Dataset-1
========================================
1. Number of English sentences: 686
2. Translator-1: 645
3. Translator-2: 275
4. Translator-1 and 2: 245
----------------------------------------

Dataset-2: Professional translation of a political speech#

A colleague, Cresencia Masautso, donated this dataset to me. It is based on her translation into Chichewa of a speech that a Malawian political figure gave in English.

The dataset was created by chunking the documents into short paragraphs in order to have matching English and Chichewa texts. As such, there is only one translation, from a person who can be considered a professional translator.
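
The chunking idea can be sketched as follows. This is purely illustrative (the actual alignment was done by the translator, and the function name and blank-line convention here are my assumptions): split each document on blank lines and pair paragraphs in order.

```python
def paragraph_pairs(english_doc: str, chichewa_doc: str):
    """Pair up paragraphs of two parallel documents.

    Assumes both documents keep their paragraphs in the same order,
    separated by blank lines; unmatched trailing paragraphs are dropped.
    """
    en_pars = [p.strip() for p in english_doc.split("\n\n") if p.strip()]
    ny_pars = [p.strip() for p in chichewa_doc.split("\n\n") if p.strip()]
    return list(zip(en_pars, ny_pars))
```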

df_prof = pd.read_csv(FILE_EN2CH_PROF)
sentences = []
for idx, row in df_prof.iterrows():
    sent = sent_tokenize(row['english'])
    sentences += sent
print('='*40)
print(' Summary Stats for Dataset-2')
print('='*40)
print(' 1. Number of English sentences: {}'.format(len(sentences)))
print('-'*40)
========================================
 Summary Stats for Dataset-2
========================================
 1. Number of English sentences: 259
----------------------------------------

Evaluation setup#

BLEU score computation#

For each dataset, we will compute a sentence-level BLEU score and then average over all sentences in the dataset. For dataset-1, which has two translators, we will generate three scores: one for each translator alone, and one based on the two reference translations together. We use the sentence_bleu function from the NLTK package to calculate sentence-level BLEU.

Google translation API#

The Python package googletrans used here makes calls to the Google Translate Ajax API (https://translate.google.com/), so we can assume it uses the same model behind regular Google Translate on the web and mobile.

Calculate BLEU score for English to Chichewa datasets#

BLEU score for dataset-1: Two translators#

score, sent_count, character_cnt = compute_bleu_score(df_sent=df_online12, en_col='sentence', 
                                                      chich_cols=['chich_shadreck', 'chich_gloria'], target_lan='ny')
print("="*65)
print(' AVERAGE BLEU SCORE FOR DATASET-1, TRANSLATOR-1 AND TRANSLATOR-2')
print("="*65)
print('1. Average score: {:.2f}%'.format(np.mean(score)*100))
print('2. Number of sentences translated: {:,}'.format(sent_count))
print('3. Number of characters translated: {:,}'.format(character_cnt))
=================================================================
 AVERAGE BLEU SCORE FOR DATASET-1, TRANSLATOR-1 AND TRANSLATOR-2
=================================================================
1. Average score: 8.52%
2. Number of sentences translated: 245
3. Number of characters translated: 4,490

BLEU score for dataset-1: Translator-1#

score, sent_count, character_cnt = compute_bleu_score(df_sent=df_online1, en_col='sentence', 
                                                      chich_cols=['chich_shadreck'],  target_lan='ny')
print("="*65)
print(' AVERAGE BLEU SCORE FOR DATASET-1, TRANSLATOR-1')
print("="*65)
print('1. Average score: {:.2f}%'.format(np.mean(score)*100))
print('2. Number of sentences translated: {:,}'.format(sent_count))
print('3. Number of characters translated: {:,}'.format(character_cnt))
=================================================================
 AVERAGE BLEU SCORE FOR DATASET-1, TRANSLATOR-1
=================================================================
1. Average score: 4.99%
2. Number of sentences translated: 645
3. Number of characters translated: 12,801

BLEU score for dataset-1: Translator-2#

score, sent_count, character_cnt = compute_bleu_score(df_sent=df_online2, en_col='sentence', 
                                                      chich_cols=['chich_gloria'],  target_lan='ny')
print("="*65)
print(' AVERAGE BLEU SCORE FOR DATASET-1, TRANSLATOR-2')
print("="*65)
print('1. Average score: {:.2f}%'.format(np.mean(score)*100))
print('2. Number of sentences translated: {:,}'.format(sent_count))
print('3. Number of characters translated: {:,}'.format(character_cnt))
=================================================================
 AVERAGE BLEU SCORE FOR DATASET-1, TRANSLATOR-2
=================================================================
1. Average score: 3.34%
2. Number of sentences translated: 275
3. Number of characters translated: 5,198

BLEU score for dataset-2#

score, sent_count, character_cnt = compute_bleu_score(df_sent=df_prof, en_col='english', 
                                                      chich_cols=['chichewa'],  target_lan='ny')
print("="*65)
print(' AVERAGE BLEU SCORE FOR DATASET-2')
print("="*65)
print('1. Average score: {:.2f}%'.format(score*100))
print('2. Number of sentences translated: {:,}'.format(sent_count))
print('3. Number of characters translated: {:,}'.format(character_cnt))
=================================================================
 AVERAGE BLEU SCORE FOR DATASET-2
=================================================================
1. Average score: 15.75%
2. Number of sentences translated: 61
3. Number of characters translated: 5,114

BLEU score for dataset-3#

df_wb = pd.read_csv(FILE_EN2CH_WEBSITE)
score, sent_count, character_cnt = compute_bleu_score(df_sent=df_wb, en_col='english', 
                                                      chich_cols=['chichewa'],  target_lan='ny')
print("="*65)
print(' AVERAGE BLEU SCORE FOR DATASET-3')
print("="*65)
print('1. Average score: {:.2f}%'.format(score*100))
print('2. Number of sentences translated: {:,}'.format(sent_count))
print('3. Number of characters translated: {:,}'.format(character_cnt))
=================================================================
 AVERAGE BLEU SCORE FOR DATASET-3
=================================================================
1. Average score: 3.64%
2. Number of sentences translated: 102
3. Number of characters translated: 1,546

BLEU scores for other languages#

For reference, I computed BLEU scores for the following language pairs, chosen purely on the basis of dataset availability:

  1. English to Spanish. The dataset came from Tatoeba.

  2. English to Luganda. Luganda is a language spoken in Uganda. The dataset is available on Zenodo.

  3. English to German. This is the WMT’14 English-German dataset.

BLEU score for English to Spanish#

df_sp = pd.read_csv(FILE_EN2SP, sep="\t", header=None, names=['en', 'sp'])
keep = []
for idx, row in df_sp.iterrows():
    sent = row['en']
    # Keep only longer sentences (more than 90 characters)
    if len(sent) > 90:
        keep.append(idx)
df_sp = df_sp.loc[keep]
score, sent_count, character_cnt = compute_bleu_score(df_sent=df_sp, en_col='en', 
                                                      chich_cols=['sp'],  target_lan='es')
print("="*65)
print(' AVERAGE BLEU SCORE FOR ENGLISH TO SPANISH')
print("="*65)
print('1. Average score: {:.2f}%'.format(score*100))
print('2. Number of sentences translated: {:,}'.format(sent_count))
print('3. Number of characters translated: {:,}'.format(character_cnt))
=================================================================
 AVERAGE BLEU SCORE FOR ENGLISH TO SPANISH
=================================================================
1. Average score: 47.73%
2. Number of sentences translated: 330
3. Number of characters translated: 7,530

BLEU score for English to Luganda#

df_lug = pd.read_csv(FILE_EN2LUGANDA,  encoding = "ISO-8859-1")
df_lug = df_lug[['English', 'Luganda']].dropna()
score, sent_count, character_cnt = compute_bleu_score(df_sent=df_lug.sample(2000), en_col='English', 
                                                      chich_cols=['Luganda'],  target_lan='lg')
print("="*65)
print(' AVERAGE BLEU SCORE FOR ENGLISH TO LUGANDA')
print("="*65)
print('1. Average score: {:.2f}%'.format(score*100))
print('2. Number of sentences translated: {:,}'.format(sent_count))
print('3. Number of characters translated: {:,}'.format(character_cnt))
=================================================================
 AVERAGE BLEU SCORE FOR ENGLISH TO LUGANDA
=================================================================
1. Average score: 9.49%
2. Number of sentences translated: 2,000
3. Number of characters translated: 19,993

BLEU score for English to German#

def load_eng2german_wmt14dataset():
    
    data_en = []
    data_de = []
    with open(FILE_EN2DE_EN, 'r') as reader:
        # Read the file line by line
        for line in reader:
            data_en.append(line.strip('\n'))

    with open(FILE_EN2DE_DE, 'r') as reader:
        # Read the file line by line
        for line in reader:
            data_de.append(line.strip('\n'))

    df = pd.DataFrame({'en': data_en, 'de': data_de})
    
    return df
df_en2de = load_eng2german_wmt14dataset()
score, sent_count, character_cnt = compute_bleu_score(df_sent=df_en2de.sample(2000), en_col='en', 
                                                      chich_cols=['de'],  target_lan='de')
HTTPSConnectionPool(host='oauth2.googleapis.com', port=443): Max retries exceeded with url: /token (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x31bd03130>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
print("="*65)
print(' AVERAGE BLEU SCORE FOR ENGLISH TO GERMAN')
print("="*65)
print('1. Average score: {:.2f}%'.format(score*100))
print('2. Number of sentences translated: {:,}'.format(sent_count))
print('3. Number of characters translated: {:,}'.format(character_cnt))
=================================================================
 AVERAGE BLEU SCORE FOR ENGLISH TO GERMAN
=================================================================
1. Average score: 18.01%
2. Number of sentences translated: 1,999
3. Number of characters translated: 56,842

Summary#

Google Neural Machine Translation (NMT) for English to Chichewa translation#

An average BLEU score was computed for 5 scenarios as follows:

  1. Dataset-1: English to Chichewa translations with two reference translators. BLEU score = 8.63%

  2. Dataset-1: English to Chichewa translations with translator-1 as the single human reference. BLEU score = 4.92%

  3. Dataset-1: English to Chichewa translations with translator-2 as the single human reference. BLEU score = 3.43%

  4. Dataset-2: English to Chichewa translations using a single professional human translator. BLEU score = 15.88%

  5. Dataset-3: English to Chichewa translations sourced from websites. BLEU score = 3.64%

In all cases, the BLEU score is below 20%, which according to Google’s guidelines means the translations range from “almost useless” to “hard to get the gist of”.

Google Neural Machine Translation (NMT) on other languages#

To put the performance of Google NMT on Chichewa into context relative to other languages, the scores were:

  1. Tatoeba English to Spanish. BLEU score = 47.77%

  2. English-Luganda. BLEU score = 9.68%

  3. English-German. BLEU score = 17.85%

As expected, performance on the high-resource languages (Spanish and German) is much higher.

Notes on evaluating Machine Translation systems#

Although the BLEU score is perhaps the most commonly used metric for evaluating MT models, it is not the only one. There are other metrics, such as word error rate (WER), as well as human evaluation, where native speakers read the translations and rate them on some scale. Furthermore, the BLEU score also depends on a lot of other factors:

  • The number of test sentences

  • Whether reference translations are from one or multiple humans

  • The length of the reference sentences

  • and more
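
As an aside, the WER mentioned above can be sketched in a few lines as a word-level Levenshtein (edit) distance normalized by the reference length. This is a minimal illustration, not the exact implementation of any particular toolkit:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[-1][-1] / max(len(ref), 1)
```

Unlike BLEU, lower WER is better, and a score above 1 is possible when the hypothesis needs more edits than the reference has words.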