Mimi sisema Kiswahili, lakini ninapenda data

The title says it all: "I don't speak Swahili, but I love data".

In preparation for Wikimania 2025 in Nairobi, Kenya [1], I looked into the Swahili language. I can now greet people, say "thank you", and say who I am and where I'm from, but that's about it.

Swahili is a part of the Bantu language family. A series of migrations [2] that happened between about 6,000 and 1,500 years ago transformed the African continent linguistically and resulted in Bantu languages being spoken in West, Central, East, and South Africa. While Indo-European languages dominate in global reach and speaker numbers, the Bantu language family with its over 500 members is one of the richest and most diverse on Earth in terms of internal complexity and number of distinct languages.

Swahili, spoken as a lingua franca all over East Africa, is one of the official languages of Kenya and Tanzania. It is the native language of the Swahili people, who live along a roughly 1,500-kilometer stretch of coastline from southern Somalia to northern Mozambique, and of a steadily increasing number of East Africans who grow up with the language. About 100 million people speak Swahili, making it the most widely spoken Bantu language in the world – although only 5 to 10 million of them are native speakers.

Rich grammar in tables

As a Bantu language, Swahili has a rich grammar. Instead of boring grammatical genders, Swahili has noun classes [3] – nine of them, actually. Its verbs are conjugated by attaching affixes to the verb stem – just like Turkish and Korean, Swahili is an agglutinative language [4]. Take the verb stem "penda" (love) for example:

| Person | Past tense | Present tense | Perfect tense | Future tense |
|---|---|---|---|---|
| 1st person singular (mimi) | nilipenda (I loved) | ninapenda (I love) | nimependa (I have loved) | nitapenda (I will love) |
| 2nd person singular (wewe) | ulipenda (You loved) | unapenda (You love) | umependa (You have loved) | utapenda (You will love) |
| 3rd person singular (yeye) | alipenda (She/he loved) | anapenda (She/he loves) | amependa (She/he has loved) | atapenda (She/he will love) |

If you're like me – nerdy, that is – your first thought upon seeing this table might be "aha, that's a pandas.DataFrame [5]!".
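The regularity of that table can be sketched in a few lines of plain Python: each form is a subject prefix, a tense marker, and the verb stem glued together. This is only a minimal sketch of the affirmative forms shown above. The function and dictionary names are made up for illustration, and it is an assumption that other verbs follow the same pattern (monosyllabic stems, for one, behave differently).

```python
# Subject prefixes and tense markers read off the "penda" table above.
SUBJECT_PREFIXES = {"mimi": "ni", "wewe": "u", "yeye": "a"}
TENSE_MARKERS = {"past": "li", "present": "na", "perfect": "me", "future": "ta"}


def conjugate(stem, pronoun, tense):
    """Glue subject prefix + tense marker + verb stem into one word."""
    return SUBJECT_PREFIXES[pronoun] + TENSE_MARKERS[tense] + stem


def conjugation_rows(stem):
    """Return one dict per pronoun, with one key per tense (hypothetical helper)."""
    return [
        {"pronoun": pronoun,
         **{tense: conjugate(stem, pronoun, tense) for tense in TENSE_MARKERS}}
        for pronoun in SUBJECT_PREFIXES
    ]


rows = conjugation_rows("penda")
for row in rows:
    print(row)
```

A list of dicts like `rows` is one of the inputs the `pandas.DataFrame` constructor accepts, so `pandas.DataFrame(rows)` would turn it into the table above, minus the English glosses.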

Data-driven Swahili

In my data-centric approach to Swahili, I found a paper ("Unveiling Swahili Verb Conjugations: A Comprehensive Dataset for Low-Resource NLP [6]") and the corresponding dataset [7] from last year. I imported the dataset into a DataFrame.

My initial idea was to merge it with lexicographical data from Wikidata. Unfortunately, Swahili verbs on Wikidata are far from numerous. To say it in the words of the authors of the aforementioned paper:

> Despite its prominence, Swahili remains underrepresented in the field of Natural Language Processing, categorizing it as a low-resource language. Low-resource languages like Swahili face significant challenges due to the lack of structured and labelled data for complex linguistic tasks. The lack of sufficient linguistic datasets and digital resources has significantly hindered the development of robust models capable of processing Swahili with high accuracy; this has also hindered research progress in the field. Efforts to overcome this limitation have been attempted for other low-resource languages, yet Swahili-specific resources remain limited.

For Wikidata, this lack of resources means that only a handful of lexemes for Swahili verbs are available at the time of publishing this blog post. You can try it out yourself with this SPARQL query for Wikidata [8].
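For readers who prefer a script to the query editor, here is a hedged sketch of how such a lookup might be done from Python using only the standard library. The query is not necessarily the one behind the short link [8]. It assumes that Q7838 is the Wikidata item for Swahili and Q24905 the item for "verb", and the function names and the User-Agent string are placeholders.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed Wikidata item IDs: Q7838 for Swahili, Q24905 for "verb".
SWAHILI_QID = "Q7838"
VERB_QID = "Q24905"


def build_verb_lexeme_query(language_qid=SWAHILI_QID, category_qid=VERB_QID, limit=100):
    """Return a SPARQL query listing lexemes of one language and lexical category."""
    return f"""
SELECT ?lexeme ?lemma WHERE {{
  ?lexeme dct:language wd:{language_qid} ;
          wikibase:lexicalCategory wd:{category_qid} ;
          wikibase:lemma ?lemma .
}}
LIMIT {limit}
""".strip()


def fetch_lemmas(query):
    """Send the query to the Wikidata Query Service and return the lemma strings."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        WIKIDATA_SPARQL_ENDPOINT + "?" + params,
        headers={"User-Agent": "swahili-lexeme-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [binding["lemma"]["value"] for binding in data["results"]["bindings"]]


# Building the query needs no network access; fetch_lemmas(query) does.
query = build_verb_lexeme_query(limit=20)
print(query)
```

Calling `fetch_lemmas(query)` performs the actual network request, so it is left out of the snippet's top level on purpose.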

Enter Wikifunctions

If we had more Swahili verbs and nouns in Wikidata, we might bring Swahili towards one of the goals of Abstract Wikipedia [9] using Wikifunctions [10].

Imagine functions that generate an opening sentence for a Wikipedia article like "Nairobi is a city". This is currently being worked on! Have a look at the current state of Natural Language Generation for fragments [11] on Wikifunctions.

Can we do that for Swahili? Of course. But right now, these kinds of functions are probably still a few datasets away: the data needs to be uploaded to Wikidata first. Nitakipenda – I will love it.

---

Written by Jens Ohlig on 2025-07-27.

References

[1] Wikimania 2025 in Nairobi, Kenya (https://wikimania.wikimedia.org/wiki/2025:About)

[2] series of migrations (https://en.wikipedia.org/wiki/Bantu_expansion)

[3] noun classes (https://en.wiktionary.org/wiki/Appendix:Swahili_noun_classes)

[4] agglutinative language (https://en.wikipedia.org/wiki/Agglutinative_language)

[5] pandas.DataFrame (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)

[6] Unveiling Swahili Verb Conjugations: A Comprehensive Dataset for Low-Resource NLP (https://dl.acm.org/doi/10.1145/3711542.3711596)

[7] dataset (https://data.mendeley.com/datasets/rvt89578g5/1)

[8] SPARQL query for Wikidata (https://w.wiki/Eryb)

[9] Abstract Wikipedia (https://en.wikipedia.org/wiki/Abstract_Wikipedia)

[10] Wikifunctions (https://www.wikifunctions.org/wiki/Wikifunctions:Main_Page)

[11] Natural Language Generation for fragments (https://www.wikifunctions.org/wiki/Wikifunctions:Abstract_Wikipedia/2025_fragment_experiments)
