Not lost in translation: training AI to speak African languages

The African continent is home to more than 2,000 languages, yet a 2025 study of large language models – advanced artificial intelligence systems designed to understand, generate and interact with human language – published in the Proceedings of Machine Learning Research found only limited support for African tongues.

The paper, which comparatively analysed African language coverage across six large language models, eight small language models and six specialised small language models (SSLMs), found support for 41 African languages and 23 available public data sets. But it found “a big gap” with only four languages – Amharic, Swahili, Afrikaans and Malagasy – always handled, while over 98% of African languages went unsupported.

With so many African languages neglected, there are fears that entire groups of speakers could be cut out of the AI revolution. So one team of computer scientists from the University of Cape Town is looking to close the gap that has left millions underserved by mainstream AI tools.

Earlier this year, researchers Anri Lombard, Jan Buys, Francois Meyer and their team unveiled MzansiLM, a language model specifically built to include data from all 11 of South Africa’s official written languages. Alongside it, they released MzansiText, the curated multilingual dataset on which MzansiLM was trained.

Compared to the behemoth large language models (LLMs) developed by global tech companies, MzansiLM is a small player. Yet the team’s stringent testing showed it to be more accurate than larger, well-funded global LLMs. When writing in isiXhosa, it bettered systems more than ten times its size on both accuracy and fluency.

“MzansiLM was meant to provide a small decoder-only baseline that future work can compare against and build on,” Lombard explains. “It is not a chatbot. It is a foundation; something developers and researchers can adapt for specific purposes, such as summarising documents or annotating data in a language most global AI cannot handle at all.”

Digital footprints

The inability of AI to handle most African languages comes down to the gap between the number of real-world speakers and a language’s digital footprint. IsiZulu is spoken by over 12m South Africans, and Hausa by more than 70m people across West Africa, yet both are considered “low-resource” languages in AI terms, because they leave a small “digital footprint” on websites and in books, so there is little data for AI developers to “scrape”.

“In language modelling, languages are considered low-resource primarily because there are much fewer and smaller textual datasets available in these languages for training language models,” says Buys, a senior lecturer in the University of Cape Town (UCT) department of computer science. “The internet has always skewed heavily towards English.”

“The data pipelines that feed large language models skew even more heavily. The result is a compounding inequality: the languages that already dominate online spaces get better AI tools, which makes them more useful digitally, which generates more data, which makes the AI even better. Everyone else falls further behind.”

Nine of South Africa’s 11 official written languages fall into this low-resource category. Languages such as isiZulu and isiXhosa have attracted some research attention, but others, including isiNdebele and Sepedi, have been largely overlooked even within African language AI. The UCT team is on a mission to change that.

Post-colonial data neglect

In a 2025 policy brief for the Centre for International Governance Innovation, AI and Language Data Flaring in Africa: Addressing the Low-Resource Challenge, Ife Adebara describes this phenomenon as “language data flaring”, likening it to the gas flaring common in oil extraction, in which a valuable resource is wasted through neglect.

Adebara argues that a multitude of factors – including under-investment in local languages and foreign language-dominant colonial and post-colonial policies – have meant that African language data goes under-collected, is poorly stored and remains largely unused in AI development.

“If AI continues to evolve without African linguistic inclusion, the continent will not only be a consumer of foreign technologies but will also have little say in shaping its development,” Mpho Primus, co-director of the Institute for Intelligent Systems at the University of Johannesburg, wrote in a February 2025 opinion piece for Independent Online (IOL).

“The digital divide will become a cultural and intellectual divide.”

This is more than a mere inconvenience. If AI cannot understand or process a language, the consequences reach well beyond the odd mistranslation. Language accessibility effectively decides which countries and cultures get to participate in the digital economy, and which do not.

Many public services in South Africa are now piloting and adopting AI, from healthcare to banking and education.

Help me, but only in English

In 2025, for example, South Africa’s National Department of Health endorsed “Self-Cav”, a digital health chatbot piloted to talk via WhatsApp to young South Africans about sensitive topics such as HIV prevention medication, sex and mental health in a judgement-free way.

It is available only in English. For life-saving tools like this to reach their full potential, creators need to “hyperlocalise” them, training them in languages such as isiZulu and Sepedi and in slang. Without this, they will not be able to reach the people who need them most.

Bridging that language gap is what drove Lombard, Buys and Meyer to build MzansiLM, laying the groundwork for reclaiming digital sovereignty – and they are part of a rapidly growing movement to do so across the African continent.

In late 2025, a team backed by a $2.2m Gates Foundation grant released African Next Voices, described by the BBC as “the largest AI-ready speech dataset for African languages ever assembled,” covering 9,000 hours across 18 languages including Kikuyu, Dholuo, Hausa and Yoruba.

In February 2026 Google launched WAXAL, an open dataset spanning 21 African languages. The Masakhane research community, a pan-African natural language processing (NLP) network whose name means “we build together” in isiZulu, has published translation tools for over 48 African languages.

Outside of the continent, Cohere, a Canadian AI company, has partnered with HausaNLP to bring African language data into its Aya multilingual model.

These collaborative efforts are all steps in the right direction, but they are just the tip of the iceberg. Africa is home to over a billion people and more than 2,000 languages, yet global AI systems require a staggering mountain of data to learn how to speak a language fluently.

The UCT team is confident about where its own work fits into the wider solution.

Open research

“A lot of the progress we were able to make depends on earlier open research from the African Natural Language Processing research community, so continuing that openness is essential,” Lombard told UCT News.

“We still need better and broader data sources, stronger benchmarks and the kind of shared datasets, models, code and results that make it possible for others to reproduce and extend the work.”

Meyer agrees: “The research community plays an important role here by working openly, sharing datasets, models and findings so others can build on them.

“That kind of openness is often what leads to progress, especially compared to proprietary systems where much of the data and methodology isn’t accessible.”

MzansiLM is small compared to the behemoths of Silicon Valley, but its accuracy shows what localised research can achieve, compared with relying solely on corporate funding.

The mountain of data left to build is still staggering, but momentum is already building.

Emily Allen
