Last updated: 2 years ago
The MADAR corpus is a collection of parallel sentences covering the Arabic dialects of 25 Arab cities, in addition to English, French, and MSA. In this article I will show how this database can be used for learners of Arabic dialects.
In Arabic, مَدار means axis or center, but also orbit. MADAR, however, is an acronym that stands for Multi-Arabic Dialect Applications and Resources. The goal of MADAR is to create a unified framework for Arabic dialects which could also be used for machine translation.
There are two datasets:
- Corpus-26: a set of 2,000 sentences which were translated to 25 city dialects (each of these sentences has 25 corresponding parallel translations), in addition to MSA.
- Corpus-6: a set of 12,000 sentences translated to the dialects of five selected cities: Doha, Beirut, Cairo, Tunis, and Rabat, in addition to MSA.
Unfortunately, the English and French translations are not publicly available. The authors cite copyright restrictions.
How can you access the database?
You can access the database on MADAR’s website:
https://camel.abudhabi.nyu.edu/madar
If you are interested in the genesis and aim of the project, you can download a paper by the project members.
How can this database be useful for Arabic learners?
In principle, just typing in a word and looking at the results is enough; you can learn a lot from that alone. To demonstrate how valuable this database is, I would like to walk through two examples.
Example 1: How do people in Alexandria, Beirut, Mosul, Tunis, and Rabat express “to want”?
Let’s assume you want to compare how the verb “to want” is expressed in these five cities.
We use the MADAR lexicon viewer for this. You choose the cities you want to compare and enter the word in question. The viewer then shows the results for each selected city.
In my opinion, this is an outstanding resource for anyone interested in Arabic dialects. It is usually the most important verbs and nouns that differ between dialects, and these are crucial for understanding. Less common words (except those for nature and food) are usually much the same across many dialects.
Example 2: How is “to want” used in Egyptian and Levantine Arabic in full sentences?
We use the MADAR Corpus Viewer for this. We choose the cities Alexandria and Beirut. In the option field “English”, we write “want”. The results are stunning! We see how the verb is used in colloquial Arabic and what the equivalent would be in Modern Standard Arabic.
Which 25 cities are covered?
The MADAR project aims to produce a large parallel corpus of 25 Arabic city dialects, in addition to a preexisting parallel set for English, French and Modern Standard Arabic (MSA). The 25 cities are:
- Morocco: Rabat, Fes
- Algeria: Algiers
- Tunisia: Tunis, Sfax
- Libya: Tripoli, Benghazi
- Egypt: Cairo, Alexandria, Aswan
- Sudan: Khartoum
- Palestine: Jerusalem
- Jordan: Amman, Salt
- Lebanon: Beirut
- Syria: Damascus, Aleppo
- Iraq: Mosul, Baghdad, Basra
- Qatar: Doha
- Oman: Muscat
- Saudi Arabia: Riyadh, Jeddah
- Yemen: Sana’a
How to download the corpus data
If you would like to download the corpus data, use the download button and fill out the form. You will then receive a link to download a ZIP file.
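Once you have the data on your computer, you can run the same kind of comparison as in the viewer examples above with a few lines of Python. The following is only a minimal sketch. I am assuming a tab-separated layout with the columns sentence_id, city, and text, which is my own invention for illustration; the files in the real ZIP archive may be organized differently. The sample sentences are also my own and are not taken from the corpus. Because the English translations are not part of the public data, the sketch searches by the MSA word instead.

```python
import csv
import io
from collections import defaultdict

# ASSUMPTION: hypothetical layout (sentence_id, city, text), tab-separated.
# The sentences are illustrative examples, not taken from MADAR.
SAMPLE_TSV = (
    "sentence_id\tcity\ttext\n"
    "1\tMSA\tأريد قهوة\n"
    "1\tCairo\tعايز قهوة\n"
    "1\tBeirut\tبدي قهوة\n"
    "2\tMSA\tأين الفندق؟\n"
    "2\tCairo\tفين الفندق؟\n"
    "2\tBeirut\tوين الفندق؟\n"
)


def load_parallel(handle):
    """Group rows by sentence id: {sentence_id: {city: text}}."""
    sentences = defaultdict(dict)
    for row in csv.DictReader(handle, delimiter="\t"):
        sentences[row["sentence_id"]][row["city"]] = row["text"]
    return dict(sentences)


def find_parallel(sentences, keyword, search_in, show):
    """Return the `show` versions of each sentence whose `search_in`
    version contains `keyword`."""
    hits = []
    for versions in sentences.values():
        if keyword in versions.get(search_in, ""):
            hits.append({city: versions.get(city) for city in show})
    return hits


corpus = load_parallel(io.StringIO(SAMPLE_TSV))
results = find_parallel(corpus, "أريد", search_in="MSA",
                        show=["MSA", "Cairo", "Beirut"])
for hit in results:
    for city, text in hit.items():
        print(f"{city}: {text}")
```

To use a real file instead of the sample string, you would open it with `open(path, encoding="utf-8", newline="")` and pass the file handle to `load_parallel`, after adjusting the column names to whatever the downloaded files actually contain.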
Other corpus data for Arabic
I must confess that I have fairly little experience with corpus systems, that is, databases that collect Arabic texts and make them analyzable. However, I am convinced that these databases are an important tool for anyone who wants to develop a better feeling for Arabic.
Tunisian Arabic
Tunisiya.org is a project, led by Karen McNeil and Miled Faiza, seeking to build a four-million-word corpus of Tunisian Spoken Arabic. There are currently 2,006 texts in the corpus, comprising 881,964 words. It is free.
Collection for Modern Standard Arabic including Egyptian Arabic
The website https://arabicorpus.byu.edu is a fascinating tool if you want to analyze words in context. It offers a variety of corpora, including many newspapers. It is a great tool to see how words are used in Modern Standard Arabic.
You need to register first, but it is completely free.
The OPUS project
The OPUS project covers many languages. OPUS is a growing collection of translated texts from the web. OPUS provides the community with a publicly available parallel corpus. For example, you get side-by-side translations of TED talks, etc. You need some time to digest all the data and to know what you are looking for.
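If you want to work with such side-by-side translations offline, here is a small sketch of how that could look. As far as I know, many OPUS corpora can be downloaded as two plain-text files that are aligned line by line (line 1 of the English file corresponds to line 1 of the Arabic file, and so on). The file names mentioned in the comments and the two sample sentences are my own placeholders, not actual OPUS data.

```python
import io
from itertools import zip_longest


def read_aligned(src_lines, tgt_lines):
    """Yield (source, target) pairs from two line-aligned iterables."""
    for src, tgt in zip_longest(src_lines, tgt_lines):
        if src is None or tgt is None:
            raise ValueError("files are not line-aligned: different lengths")
        yield src.rstrip("\n"), tgt.rstrip("\n")


def search_pairs(pairs, keyword):
    """Keep pairs whose source side contains `keyword` (case-insensitive)."""
    needle = keyword.lower()
    return [(s, t) for s, t in pairs if needle in s.lower()]


# Stand-ins for two downloaded files, e.g. a hypothetical corpus.en
# and corpus.ar. The sentences are illustrative placeholders.
english = io.StringIO("I want coffee.\nWhere is the hotel?\n")
arabic = io.StringIO("أريد قهوة.\nأين الفندق؟\n")

matches = search_pairs(read_aligned(english, arabic), "want")
for en, ar in matches:
    print(en, "|", ar)
```

With real files, you would replace the two `io.StringIO` objects with file handles opened using `encoding="utf-8"`. The length check is there because a single missing line would silently shift every following pair.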
This will, surely, settle the everlasting argument about which dialect is most like Fus7a…
Bahrain, Muscat? :)
Oh, that’s wrong, of course. It must have been a typical copy-and-paste error from the wrong line. I have corrected it. Thanks for telling me!