Arabic

Updated 31 July, 2019 • tags arabic, scriptnotes

This page provides basic information about the Arabic script and its use for the Arabic language. The information was gathered from a variety of sources, and I have tried to make the summary as accurate as possible, but it has not been reviewed. For character-specific details follow the links to the Arabic character notes.

See also the Arabic picker, the All Arabic picker, and the notes on Hausa ajami, Kashmiri, Urdu and Uighur.

For similar information related to this and other scripts, see the script links pages.

Clicking on red text examples, or highlighting part of the sample text shows a list of characters, with links to more details. Click on the vertical blue bar (bottom right) to change font settings for the sample text. Colours and annotations on panels listing characters are relevant to their use for the Arabic language.

Sample (Arabic)

المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق. وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضًا بروح الإخاء.

المادة 2 لكل إنسان حق التمتع بكافة الحقوق والحريات الواردة في هذا الإعلان، دون أي تمييز، كالتمييز بسبب العنصر أو اللون أو الجنس أو اللغة أو الدين أو الرأي السياسي أو أي رأي آخر، أو الأصل الوطني أو الإجتماعي أو الثروة أو الميلاد أو أي وضع آخر، دون أية تفرقة بين الرجال والنساء. وفضلاً عما تقدم فلن يكون هناك أي تمييز أساسه الوضع السياسي أو القانوني أو الدولي لبلد أو البقعة التي ينتمي إليها الفرد سواء كان هذا البلد أو تلك البقعة مستقلاً أو تحت الوصاية أو غير متمتع بالحكم الذاتي أو كانت سيادته خاضعة لأي قيد من القيود.

Usage & history

From Scriptsource:

Arabic writing is the second most broadly-used script in the world, after the Latin alphabet. It descended from the Nabataean abjad, itself a descendant of the Phoenician script, and has been used since the 4th century for writing the Arabic language. Since the words of the Prophet Muhammed can only be written in Arabic, the Arabic script has traveled far and wide with the spread of Islam and came to be used for a number of languages throughout Asia, Africa and the Middle East. Many of these are non-Semitic languages, so employ very different sound systems from spoken Arabic, and as a result the script has had to be adapted and is used slightly differently by speakers of different languages. Many African languages use an Arabic-based transcription system called Ajami, which is different from the original Arabic script. Romance languages such as Mozarabic or Ladino are also sometimes written in a modified Arabic script, called Aljamiado.

Many variations on the script have developed over time and space, but these can be broadly classified into two groups; an angular kufic style which was originally used for stone inscriptions and which commonly employs no diacritics, and the naskh style which is more commonly used, more rounded in form, and governed by a set of principles regulating the proportions between the letters. There are a number of variant styles included in this group, including those used in Arabic calligraphy.

From Wikipedia:

The Arabic script is the writing system used for writing Arabic and several other languages of Asia and Africa, such as Persian, Urdu, Azerbaijani, Pashto, Central Kurdish, Luri, dialects of Mandinka, and others. Until the 16th century, it was also used to write some texts in Spanish. It is the second-most widely used writing system in the world by the number of countries using it and the third by the number of users, after Latin and Chinese characters. ...

The script was first used to write texts in Arabic, most notably the Qurʼān, the holy book of Islam. With the spread of Islam, it came to be used to write languages of many language families, leading to the addition of new letters and other symbols, with some versions, such as Kurdish, Uyghur, and old Bosnian being abugidas or true alphabets. It is also the basis for the tradition of Arabic calligraphy.

Key features

The Arabic script is an abjad. This means that in normal use the script represents only consonant and long vowel sounds. This approach is helped by the strong emphasis on consonant patterns in Semitic languages; however, the Arabic script is also used for other kinds of language (such as Urdu and Uighur). See the table to the right for a brief overview of features, taken from the Script Comparison Table.

Arabic script is written horizontally, right-to-left, but numbers and embedded Latin text are read left-to-right. Words are separated by spaces, and contain a mixture of consonants and long vowels. Diacritics can be used to indicate short vowel sounds or other phonetic information, where needed.

The script is cursive, and some basic letter shapes change radically, depending on what they join to. It is also very common for adjacent characters to ligate and to stretch to fill available space. Many of the characters share a common base form, and are distinguished by the number and location of dots or other small diacritics, called i'jam. For example, س ‎ش ‎ݜ ‎ ݰ ‎ݽ ‎ݾ ‎ڛ ‎ښ ‎ڜ ‎ۺ.

Character lists

Version 12.0 of the Unicode Standard has the following blocks dedicated to the Arabic script:

  1. Arabic 153 letters, 52 marks, 20 numbers, 12 punctuation, 10 symbols, 8 other : total 255
  2. Arabic Supplement 48 letters : total 48
  3. Arabic Extended-A 29 letters, 43 marks, 1 other : total 73

There are two additional blocks for presentation forms, but (with the exception of a handful of code points) these characters are only for compatibility with legacy encodings, and should not be used. Sometimes they are used by people to get around problems with Arabic support in applications, but this is a bad idea since it corrupts the underlying data, making it difficult to search, spellcheck, or do many other things that rely on the use of standard characters and their properties.
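As a rough illustration of the point above, the following Python sketch detects presentation-form characters and maps them back to standard characters using NFKC normalization. The function names and the range table are hypothetical helpers written for this example. Note that NFKC also changes unrelated compatibility characters (for example fullwidth Latin letters), so this is a sketch under that assumption rather than a recommended clean-up procedure.

```python
import unicodedata

# Arabic Presentation Forms-A and -B. U+FEFF (the byte order mark) sits at
# the end of the second block and is deliberately left out of the range.
PRESENTATION_RANGES = ((0xFB50, 0xFDFF), (0xFE70, 0xFEFC))


def has_presentation_forms(text: str) -> bool:
    """Return True if any character lies in a presentation-forms range."""
    return any(lo <= ord(ch) <= hi
               for ch in text
               for lo, hi in PRESENTATION_RANGES)


def to_standard_characters(text: str) -> str:
    """Apply NFKC, which replaces compatibility forms with standard letters."""
    return unicodedata.normalize("NFKC", text)


legacy = "\uFEFB"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
print(has_presentation_forms(legacy))
print([f"U+{ord(c):04X}" for c in to_standard_characters(legacy)])
```

The handful of presentation-form code points that are not compatibility characters have no decomposition, so NFKC leaves them unchanged.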

Apart from ASCII characters, the Arabic orthography described here uses 71 characters (and 9 more, used infrequently) from the following Unicode blocks:

  1. Arabic 36 letters, 11 marks, 10 numbers, 7 punctuation : total 64   (+7 infrequent)
  2. Arabic Supplement (1 infrequent)
  3. Arabic Extended-A (1 infrequent)
  4. General Punctuation 5 punctuation : total 5
  5. Latin-1 Supplement 2 punctuation : total 2

Character Usage has information about the following orthographies associated with this script: Arabic, Standard Arabic, Azerbaijani, Central Kurdish, Hausa, Kashmiri, Luri, Mazanderani, Punjabi, Northern Pashto, Persian, Western Panjabi, Dari, Sindhi, Saraiki, Uyghur, Urdu, Northern Uzbek, Malay.

For character-specific details see Arabic character notes.

Text direction

Arabic script is written horizontally and right-to-left in the main but, as in most right-to-left scripts, numbers and embedded left-to-right script text are written left-to-right (producing 'bidirectional' text).

9 يناير/ كانون الثاني 2018

Arabic words are read RTL, starting on the right, but numbers are read left-to-right.

Expressions & sequences

A sequence of numbers, for example a range separated by hyphens, runs right to left in the Arabic language (and Thaana or Syriac scripts), whereas for Persian language text (and in Hebrew, N’Ko or Adlam scripts) it runs left to right.

In the following Arabic text, which is right-to-left overall, the numeric range is also ordered RTL, ie. it starts with 10 and ends with 12:

في 10-12 آدار

A numeric range in Arabic language text.

In Persian, however, the expression would run LTR, so this would be:

في ‎10-12 آدار

A numeric range in Persian language text.

The Unicode Bidirectional Algorithm automatically produces the Arabic ordering when a sequence or expression follows Arabic text. However, a sequence that appears alone on a line doesn't benefit from this, so to make the text appear correctly for Arabic you should add U+061C ARABIC LETTER MARK (ALM) at the start of the line. This is effectively an invisible Arabic script character.

؜10-01-2018

A numeric date in Arabic language text.

If you are writing in Persian, on the other hand, you don't need to add anything in this case.

10-01-2018

The same date in Persian language text.

However, if you are writing in Persian and the sequence or expression follows text, you need to either isolate the sequence directionally or precede it with U+200E LEFT-TO-RIGHT MARK (LRM) to make it look correct.

Similar special ordering is applied to numbers in equations, such as 1 + 2 = 3, for Arabic language text.
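The two plain-text fixes described in this section can be sketched in Python as follows. The helper names are hypothetical and exist only for this example. The code builds the strings; how they are displayed still depends on the bidi support of whatever renders them.

```python
import unicodedata

ALM = "\u061C"  # ARABIC LETTER MARK
LRM = "\u200E"  # LEFT-TO-RIGHT MARK


def arabic_standalone_sequence(seq: str) -> str:
    """Prefix a sequence that sits alone on a line with ALM (Arabic ordering)."""
    return ALM + seq


def persian_sequence_after_text(text: str, seq: str) -> str:
    """Put LRM before a sequence that follows Persian text (LTR ordering)."""
    return f"{text} {LRM}{seq}"


date = arabic_standalone_sequence("10-01-2018")

# ALM is invisible but carries the same bidi class as an Arabic letter.
print(unicodedata.name(date[0]), unicodedata.bidirectional(date[0]))
print(unicodedata.name(LRM), unicodedata.bidirectional(LRM))
```

The bidi classes printed here (AL for ALM, L for LRM) are what make these invisible characters act like an Arabic letter or a Latin letter for ordering purposes.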

Vowels

Matres lectionis

In the spelling of Arabic and some other Semitic languages, matres lectionis refers to the use of certain consonants to indicate a vowel.

In Arabic, the following may indicate the location of a long vowel, eg. قلوب qlwb quluːb hearts, تاريخ tɑryx tɑːriːx history. They are always visible, whether or not the text shows vowel diacritics.

ا␣و␣ي

These characters, especially ا [U+0627 ARABIC LETTER ALEF], may also be used with a number of other small marks, such as hamza, for particular effects. Read further for more details.

(At the risk of being pedantic, alef doesn't actually represent a consonant on its own (unlike the other two). It is really only a support for a vowel and/or diacritic.)

Alef maksura

ى

ى [U+0649 ARABIC LETTER ALEF MAKSURA] represents the long a-vowel at the end of many words when it is written with yeh instead of an alef. In this case the yeh is typically printed without dots, to avoid confusion, although this is not universal. This spelling only occurs with certain words, and only when the final sound is aː, eg. معنى mæʕnaː. If any suffix is added, the spelling reverts to the normal alef, eg. معناهم mæʕnaː-hum.

Short vowels

Short vowels can be expressed using diacritics, eg. العَرَبِيَّة‎ ɑlʕarabiyaᵚẗ‎ (al-ʻarabīyah) Arabic, however for languages such as Arabic, Persian and Urdu they are typically not used, unless there is a particular need to help the reader understand the pronunciation. The previous example would therefore usually be written العربية‎ ɑlʕrbyẗ‎ (al-ʻarabīyah). On the other hand, when the script is used for Uighur, all vowels are shown, as a matter of course. These diacritics are also used in the Quran (though not originally), to reduce ambiguity.
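To make the relationship between the vowelled and unvowelled spellings concrete, here is a minimal Python sketch that removes the diacritics from the example word. The function name and the choice of range (U+064B to U+0652, which covers the tanwin signs, the three short vowel signs, shadda and sukun) are assumptions made for this illustration. Other marks, such as superscript alef or hamza above, are deliberately left alone.

```python
# U+064B..U+0652: fathatan, dammatan, kasratan, fatha, damma, kasra,
# shadda, sukun. Mapping a code point to None deletes it in str.translate.
HARAKAT = {cp: None for cp in range(0x064B, 0x0653)}


def strip_harakat(text: str) -> str:
    """Remove vowel signs, tanwin, shadda and sukun from Arabic text."""
    return text.translate(HARAKAT)


# The fully vowelled word from the paragraph above, written with escapes:
# alef, lam, ain+fatha, reh+fatha, beh+kasra, yeh+shadda+fatha, teh marbuta.
vowelled = ("\u0627\u0644\u0639\u064E\u0631\u064E\u0628\u0650"
            "\u064A\u0651\u064E\u0629")

plain = strip_harakat(vowelled)
print(len(vowelled), len(plain))  # 12 code points reduce to 7
```

Going the other way (adding the vowels back) cannot be done mechanically, which is why vowelled text is used where ambiguity has to be avoided.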

The Arabic language uses the following vowel diacritics:

َ␣ُ␣ِ

There is a secondary set of vowel diacritics with origins in Classical Arabic, where indefinite nouns and adjectives were marked by a final n-sound, called تنوين tænwiːn or, in English, 'nunation'. This is normally indicated by visually doubling the vowel diacritic, but there are Unicode characters for each combination.

ً␣ٌ␣ٍ

On a word ending with an a-vowel (though not with a feminine ending or some other suffixes) an extra alef was also added at the end of the word. In modern Arabic printing the fathatan is usually dropped, but the alef is retained. The pronunciation of the ending æn is also retained in many words, eg. كِتَابًا kitaɑbaⁿɑ kɪtæːbæn, فَرَسًا farasaⁿɑ færæsæn.

Superscript alef

ٰ  [U+0670 ARABIC LETTER SUPERSCRIPT ALEF] is used in certain Arabic words such as هٰذَا this or ذٰلِكَ that, and not forgetting اللّٰه Allah.

Vowel absence

When text is vowelled, ْ   [U+0652 ARABIC SUKUN] can be used over a consonant to indicate that it is not followed by a vowel sound, eg. مَكْتَب maktab.

Letters

The main Unicode Arabic block contains 153 letters, with 77 more in the extended blocks. As shown in the previous section, only a small subset of those are used to write a given language. The others represent special characters added to the repertoire for one or other of the many languages for which the Arabic script is used.

The vast majority of letters represent consonants. A few represent long vowels.

The following letters are those generally recognised as constituting the alphabet for the Standard Arabic language.

ا␣ب␣ت␣ث␣ج␣ح␣خ␣د␣ذ␣ر␣ز␣س␣ش␣ص␣ض␣ط␣ظ␣ع␣غ␣ف␣ق␣ك␣ل␣م␣ن␣ه␣و␣ي

Of those, as mentioned earlier, some letters represent long vowel locations or combinations of consonant plus vowel.

Other Unicode letters commonly found in Arabic include:

ء␣آ␣أ␣إ␣ؤ␣ئ␣ى␣ة

Most of the above letters with diacritics decompose in Unicode Normalization Form D (NFD), however ة [U+0629 ARABIC LETTER TEH MARBUTA] does not.
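The decomposition behaviour can be checked with Python's standard unicodedata module. This is a small sketch: the helper name is made up for the example, and the letters are given as escapes in the same order as the list above.

```python
import unicodedata

# hamza, alef madda, alef hamza above, alef hamza below,
# waw hamza, yeh hamza, alef maksura, teh marbuta
LETTERS = "\u0621\u0622\u0623\u0625\u0624\u0626\u0649\u0629"


def nfd_code_points(ch: str) -> list:
    """Return the code points of a character after NFD normalization."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]


for ch in LETTERS:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->",
          " ".join(nfd_code_points(ch)))
```

Letters that decompose print two code points (a base letter plus a combining mark). Hamza, alef maksura and teh marbuta print only themselves.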

Show all letters in the Unicode Arabic blocks.
ؠ␣ء␣آ␣أ␣ؤ␣إ␣ئ␣ا␣ب␣ة␣ت␣ث␣ج␣ح␣خ␣د␣ذ␣ر␣ز␣س␣ش␣ص␣ض␣ط␣ظ␣ع␣غ␣ػ␣ؼ␣ؽ␣ؾ␣ؿ␣ـ␣ف␣ق␣ك␣ل␣م␣ن␣ه␣و␣ى␣ي␣ٮ␣ٯ␣ٱ␣ٲ␣ٳ␣ٴ␣ٵ␣ٶ␣ٷ␣ٸ␣ٹ␣ٺ␣ٻ␣ټ␣ٽ␣پ␣ٿ␣ڀ␣ځ␣ڂ␣ڃ␣ڄ␣څ␣چ␣ڇ␣ڈ␣ډ␣ڊ␣ڋ␣ڌ␣ڍ␣ڎ␣ڏ␣ڐ␣ڑ␣ڒ␣ړ␣ڔ␣ڕ␣ږ␣ڗ␣ژ␣ڙ␣ښ␣ڛ␣ڜ␣ڝ␣ڞ␣ڟ␣ڠ␣ڡ␣ڢ␣ڣ␣ڤ␣ڥ␣ڦ␣ڧ␣ڨ␣ک␣ڪ␣ګ␣ڬ␣ڭ␣ڮ␣گ␣ڰ␣ڱ␣ڲ␣ڳ␣ڴ␣ڵ␣ڶ␣ڷ␣ڸ␣ڹ␣ں␣ڻ␣ڼ␣ڽ␣ھ␣ڿ␣ۀ␣ہ␣ۂ␣ۃ␣ۄ␣ۅ␣ۆ␣ۇ␣ۈ␣ۉ␣ۊ␣ۋ␣ی␣ۍ␣ێ␣ۏ␣ې␣ۑ␣ے␣ۓ␣ە␣ۥ␣ۦ␣ۮ␣ۯ␣ۺ␣ۻ␣ۼ␣ۿ␣ݐ␣ݑ␣ݒ␣ݓ␣ݔ␣ݕ␣ݖ␣ݗ␣ݘ␣ݙ␣ݚ␣ݛ␣ݜ␣ݝ␣ݞ␣ݟ␣ݠ␣ݡ␣ݢ␣ݣ␣ݤ␣ݥ␣ݦ␣ݧ␣ݨ␣ݩ␣ݪ␣ݫ␣ݬ␣ݭ␣ݮ␣ݯ␣ݰ␣ݱ␣ݲ␣ݳ␣ݴ␣ݵ␣ݶ␣ݷ␣ݸ␣ݹ␣ݺ␣ݻ␣ݼ␣ݽ␣ݾ␣ݿ␣ࢠ␣ࢡ␣ࢢ␣ࢣ␣ࢤ␣ࢥ␣ࢦ␣ࢧ␣ࢨ␣ࢩ␣ࢪ␣ࢫ␣ࢬ␣ࢭ␣ࢮ␣ࢯ␣ࢰ␣ࢱ␣ࢲ␣ࢳ␣ࢴ␣ࢶ␣ࢷ␣ࢸ␣ࢹ␣ࢺ␣ࢻ␣ࢼ␣ࢽ

Hamza

ء [U+0621 ARABIC LETTER HAMZA] represents the glottal stop sound. For historical reasons, it is treated as an orthographic sign rather than as a letter of the alphabet. It sometimes stands alone, but usually appears with a 'carrier' letter (alef, waw, or yeh), for which separate precomposed characters are available in Unicode ( أ إ ؤ ئ ). Examples of use include إكرام ɑ̜krɑm ikrɑːm Ikram, نائم nɑy͑m nɑːʔim sleeping, and بناء bnɑʔ binɑːʔ building.

In modern printed Arabic, the hamza is rarely shown when it occurs at the beginning of a word, but may appear in conjunction with another character. When the hamza is above another character you should typically use ٔ [U+0654 ARABIC HAMZA ABOVE] with the appropriate base character, although there are a number of exceptions. For more details, see the character description.

Classical Arabic distinguishes between 'cutting' and 'joining' hamza. 'Cutting' means always pronounced, 'joining' means frequently elided.

The joining hamza is of little practical importance in modern Arabic pronounced without the old case endings. When it does appear in modern Arabic, ٱ [U+0671 ARABIC LETTER ALEF WASLA] is used to indicate a joining hamza.

Alef madda

آ [U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE] is used when either of the two following combinations of glottal stop and a vowel appear in a word:

Normal pronunciation in both cases is ʔaː.

The madda sign is still very often shown in print.

Teh marbuta

ة [U+0629 ARABIC LETTER TEH MARBUTA] usually has no sound, eg. مدرسة mdrsẗ mædræsæ school, but is sometimes pronounced t in specific grammatical contexts.

It is used for historical reasons to write the feminine ending, æ – the dots are borrowed from teh (ت) – and is only used in final position. If any suffix is added, the ending is spelled with ت [U+062A ARABIC LETTER TEH], eg. مدرستنا mdrstnɑ mædræsæt-naː our school.

In modern Arabic it is not uncommon to find the two dots omitted, particularly on masculine proper names that have the feminine ending, eg. طلبة t̴lbẗ tulbæ.

Vowelled text may omit the short æ diacritic before the teh marbuta, because the sound is always the same.

Other letters

The following characters also have the general property of Letter, but are less commonly used for modern Arabic language text.

ڢ␣ڧ␣ࢲ␣ـ␣ﷲ␣ٱ

ڢ [U+06A2 ARABIC LETTER FEH WITH DOT MOVED BELOW] and ڧ [U+06A7 ARABIC LETTER QAF WITH DOT ABOVE] are alternative forms that are used in Northwest Africa. ࢲ [U+08B2 ARABIC LETTER ZAIN WITH INVERTED V ABOVE] is used for Berber.

ٱ [U+0671 ARABIC LETTER ALEF WASLA] is described in the section hamza. Whereas many of the above letters with diacritics decompose in Unicode Normalization Form D (NFD), this letter does not.

ﷲ [U+FDF2 ARABIC LIGATURE ALLAH ISOLATED FORM] is a letter from the Arabic precomposed block used to write the name of Allah. The composition of this character differs from font to font in terms of glyph forms. With some fonts it is necessary to add diacritics, whereas with others it is not.

ـ [U+0640 ARABIC TATWEEL] is used to stretch words for simple justification, or to make a word or phrase a particular width, or as a form of emphasis. For more information see justify.

Consonant clusters & gemination

The diacritic ّ  [U+0651 ARABIC SHADDA] doubles the value of the consonant it is attached to, which is phonemically significant in Arabic, eg. تاجر، تجّار tɑʒr, tʒᵚɑr (tajir, tujjar) trader, traders. It, too, is not often used, although sometimes it appears when vowel signs don't.

A common, though not universal, practice is to display any combining kasra below the shadda, rather than below the base consonant, eg. قَبِّل qæbːɪl. Some fonts, such as Amiri, don't do this.

Combining characters

The main Arabic block contains 52 combining characters, with 43 more in the Arabic Extended-A block. However, only a small number are typically used for normal, written Arabic, Persian, etc.

The standard diacritics in the Arabic language repertoire include the following:

َ␣ُ␣ِ␣ً␣ٌ␣ٍ␣ّ␣ْ␣ٰ␣ٔ␣ٕ

All of these diacritics are discussed in earlier sections. Follow the links for more information.

Multiple combining characters may be used for a single base character, such as when both a shadda and a vowel diacritic are used together.
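One practical consequence of stacking marks, not spelled out above, is that Unicode normalization puts them into a canonical order based on their combining class. The Python sketch below is offered as an illustration of that Unicode behaviour rather than something taken from this page. It shows that a shadda typed before a fatha is reordered so that the fatha comes first in the stored text, because fatha has the lower combining class.

```python
import unicodedata

MEEM, FATHA, SHADDA = "\u0645", "\u064E", "\u0651"

# Canonical combining classes decide the order of marks after normalization.
print(unicodedata.combining(FATHA), unicodedata.combining(SHADDA))

typed = MEEM + SHADDA + FATHA            # shadda entered first
normalized = unicodedata.normalize("NFC", typed)

print([unicodedata.name(c) for c in typed])
print([unicodedata.name(c) for c in normalized])
```

Both sequences are canonically equivalent, so a comparison of Arabic strings that may contain several marks per base character is more reliable after normalizing both sides.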

Show all combining characters in the Unicode Arabic blocks.
ؐ␣ؑ␣ؒ␣ؓ␣ؔ␣ؕ␣ؖ␣ؗ␣ؘ␣ؙ␣ؚ␣ً␣ٌ␣ٍ␣َ␣ُ␣ِ␣ّ␣ْ␣ٓ␣ٔ␣ٕ␣ٖ␣ٗ␣٘␣ٙ␣ٚ␣ٛ␣ٜ␣ٝ␣ٞ␣ٟ␣ٰ␣ۖ␣ۗ␣ۘ␣ۙ␣ۚ␣ۛ␣ۜ␣۟␣۠␣ۡ␣ۢ␣ۣ␣ۤ␣ۧ␣ۨ␣۪␣۫␣۬␣ۭ␣ࣔ␣ࣕ␣ࣖ␣ࣗ␣ࣘ␣ࣙ␣ࣚ␣ࣛ␣ࣜ␣ࣝ␣ࣞ␣ࣟ␣࣠␣࣡␣ࣣ␣ࣤ␣ࣥ␣ࣦ␣ࣧ␣ࣨ␣ࣩ␣࣪␣࣫␣࣬␣࣭␣࣮␣࣯␣ࣰ␣ࣱ␣ࣲ␣ࣳ␣ࣴ␣ࣵ␣ࣶ␣ࣷ␣ࣸ␣ࣹ␣ࣺ␣ࣻ␣ࣼ␣ࣽ␣ࣾ␣ࣿ

Punctuation

Modern Arabic text typically uses the following punctuation characters from the Unicode Arabic block.

٫␣٬␣٪␣؉␣،␣؛␣؟

The Arabic language also uses western punctuation, including the following non-ASCII characters from other Unicode blocks.

‰␣‐␣–␣—␣…␣«␣»

Other punctuation in the Unicode Arabic block, infrequently used for the Arabic language.

؍␣٬␣٭

For information about how these and punctuation marks from other blocks are used for the Arabic language, see the phrase and numbers sections below.

There are only 3 more characters with the general category of punctuation in the Unicode Arabic blocks.

؞␣۔␣؊

Symbols

Only the main Arabic Unicode block contains symbols, none of which are widely used in Arabic language text.

؆␣؇␣؈␣؋␣؎␣؏␣۞␣۩␣۽␣۾

Characters in the Arabic Presentation Forms blocks should not normally be used, but they contain just a few symbols that are not just for compatibility use, including the following.

﴾␣﴿␣ﷲ␣ﷺ␣ﷻ␣﷽␣﷼

For more information about how they are used, click on them and follow the links to the character notes page.

Formatting characters

The Arabic script uses a large number of Unicode characters that affect the way that other characters are rendered. Many of those have no visible form of their own.

The following set does have a visual representation. All these characters are found in Unicode's Arabic block, but none are commonly used for modern Arabic language text.

࣢␣؀␣؁␣؂␣؃␣؄␣؅␣۝

Modern Arabic text makes use of a relatively large set of invisible formatting characters, especially in plain text, many of which are used to manage text direction.

Managing text direction

RLE [U+202B RIGHT-TO-LEFT EMBEDDING], LRE [U+202A LEFT-TO-RIGHT EMBEDDING], and PDF [U+202C POP DIRECTIONAL FORMATTING] are in widespread use to set the base direction of a range of characters. RLE/LRE come at the start, and PDF at the end of a range of characters for which the base direction is to be set.

More recently, the Unicode Standard added a set of characters which do the same thing but also isolate the content from surrounding characters, in order to avoid spillover effects. They are RLI [U+2067 RIGHT-TO-LEFT ISOLATE], LRI [U+2066 LEFT-TO-RIGHT ISOLATE], and PDI [U+2069 POP DIRECTIONAL ISOLATE]. The Unicode Standard recommends that these be used instead, however some applications don't yet recognise them.

There is also FSI [U+2068 FIRST STRONG ISOLATE], used initially to set the base direction according to the first recognised strongly-directional character.

ALM [U+061C ARABIC LETTER MARK] is used to produce correct sequencing of numeric data. Follow the link for details. 

RLM [U+200F RIGHT-TO-LEFT MARK] and LRM [U+200E LEFT-TO-RIGHT MARK] are invisible characters with strong directional properties that are also sometimes used to produce the correct ordering of text.

For more information about how to use these formatting characters see How to use Unicode controls for bidi text. Note, however, that when writing HTML you should generally use markup rather than these control codes. For information about that, see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.
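For plain text, where markup is not available, wrapping a run in the isolate characters described above can be sketched in Python like this. The `isolate` function and its `direction` parameter are hypothetical names chosen for the example.

```python
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text: str, direction: str = "auto") -> str:
    """Wrap text in an isolate: 'rtl', 'ltr', or 'auto' (first strong)."""
    opener = {"rtl": RLI, "ltr": LRI, "auto": FSI}[direction]
    return f"{opener}{text}{PDI}"


# An embedded Latin-script product name inside an Arabic sentence would be
# isolated so that it does not interact with neighbouring numbers.
wrapped = isolate("HTML 5", "ltr")
print([f"U+{ord(c):04X}" for c in wrapped])
```

Using "auto" is a reasonable default when the direction of inserted text is not known in advance, for example with user-supplied names.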

Managing glyph shaping

ZWJ [U+200D ZERO WIDTH JOINER] and ZWNJ [U+200C ZERO WIDTH NON-JOINER] are used to control the joining behaviour of cursive glyphs. They are particularly useful in educational contexts, but also have real world applications.

ZWJ permits a letter to form a cursive connection without a visible neighbour. For example, the marker for hijri dates is an initial form of heh, even though it doesn't join to the left, ie. ه‍. For this, use ZWJ immediately after the heh, eg. الاثنين 10 رجب 1415 ه‍..

ZWNJ prevents two adjacent letters forming a cursive connection with each other when rendered. For example, it is used in Persian for plural suffixes, some proper names, and Ottoman Turkish vowels. Ignoring or removing the ZWNJ will result in text with a different meaning or meaningless text, eg, تن‌ها is the plural of body, whereas تنها is the adjective alone.2 The only difference is the presence or absence of ZWNJ after noon.
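The following Python sketch restates the two examples above in terms of code points. The variable and function names are invented for this illustration. It shows why a clean-up step that strips "invisible" characters is unsafe for this script: the two Persian words differ only by ZWNJ.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

TEH, NOON, HEH, ALEF = "\u062A", "\u0646", "\u0647", "\u0627"

plural_of_body = TEH + NOON + ZWNJ + HEH + ALEF  # ZWNJ after noon
alone = TEH + NOON + HEH + ALEF                  # no ZWNJ
hijri_marker = HEH + ZWJ                         # heh shown in initial form


def strip_joiners(text: str) -> str:
    """Naive clean-up that drops ZWJ and ZWNJ (shown as what NOT to do)."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


print(plural_of_body == alone)                 # distinct words
print(strip_joiners(plural_of_body) == alone)  # the distinction is lost
```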

CGJ [U+034F COMBINING GRAPHEME JOINER] is used in Arabic to produce special ordering of diacritics. The name is a misnomer, as it is generally used to break the normal sequence of diacritics. 

Numbers, dates, currency, etc.

٠␣١␣٢␣٣␣٤␣٥␣٦␣٧␣٨␣٩

A set of arabic-indic digits are typically used in Middle Eastern and Gulf countries, whereas North African countries tend to use European digits. In neither area, however, is one digit style used exclusively.

Still in the basic Unicode Arabic block, there is a second set of digits in Unicode for use in languages such as Persian and Urdu.

۰␣۱␣۲␣۳␣۴␣۵␣۶␣۷␣۸␣۹

The glyph shapes are typically different for 4 of the digits (although there can also be differences between Persian and Urdu shapes).

٠١٢٣٤٥٦٧٨٩

Arabic-indic numerals, used in Arabic language text.

۰۱۲۳۴۵۶۷۸۹

Extended-arabic numerals, used in Persian and Urdu language text.

See also the information about handling expressions or sequences of numbers, above.

Arabic script has its own number separators, which are used in Arabic language text when the non-European digits are used. They are ٫ [U+066B ARABIC DECIMAL SEPARATOR] and ٬ [U+066C ARABIC THOUSANDS SEPARATOR].

Arabic also has its own characters for ٪ [U+066A ARABIC PERCENT SIGN] and ؉ [U+0609 ARABIC-INDIC PER MILLE SIGN].
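As an illustration of the digit sets and separators described in this section, the Python sketch below converts European digits to either set and formats a number with the Arabic separators. The function names are hypothetical, and the fixed one-decimal format is an assumption made to keep the example short. Real applications would normally rely on a locale-aware formatting library.

```python
EUROPEAN = "0123456789"
ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10))  # U+0660..U+0669
EXTENDED = "".join(chr(0x06F0 + i) for i in range(10))      # U+06F0..U+06F9

DECIMAL_SEP = "\u066B"    # ARABIC DECIMAL SEPARATOR
THOUSANDS_SEP = "\u066C"  # ARABIC THOUSANDS SEPARATOR


def convert_digits(text: str, target: str) -> str:
    """Replace European digits with the corresponding digits in target."""
    return text.translate(str.maketrans(EUROPEAN, target))


def format_arabic_number(value: float) -> str:
    """Format with one decimal place, Arabic-Indic digits and separators."""
    western = f"{value:,.1f}"  # e.g. 1,234.5
    table = str.maketrans(EUROPEAN + ",.",
                          ARABIC_INDIC + THOUSANDS_SEP + DECIMAL_SEP)
    return western.translate(table)


year_arabic = convert_digits("2018", ARABIC_INDIC)
year_persian = convert_digits("2018", EXTENDED)

# Python's int() accepts any Unicode decimal digits, so both parse back.
print(int(year_arabic), int(year_persian))
print([f"U+{ord(c):04X}" for c in format_arabic_number(1234.5)])
```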

Glyph shaping & positioning

You can experiment with examples using the Arabic picker.

Cursive script

Is this script cursive? Is the basic shape of a letter radically changed? Is it sometimes not cursive? Are there any special features to note? Are Unicode joiner and non-joiner characters needed to override default joining behaviours?

Arabic script joins letters together. This results in four different shapes for most letters (including an isolated shape). The highlights in the example below show the same letter, ع [U+0639 ARABIC LETTER AIN], with three different joining forms.

على  •  متعددة  •  وسيجمع

The letter ain in 3 different joining contexts.

A few Arabic script letters only join on the right-hand side.

Context-based shaping

Are special glyph forms needed, depending on the context in which a character is used? Do glyphs interact in some circumstances?

Ligated glyph forms are common in Arabic. Some, such as لا are mandatory. Most of the remainder depend on the font style. Traditional fonts tend to have more ligated forms than modern styles.

The same word with ligatures (right) and no ligatures (left).

In more traditional fonts, you will also often see the join between certain adjacent characters made above the baseline, rather than on the baseline.

But actually a good font will constantly change the shape of glyphs slightly so as to create a more aesthetically pleasing, and in some cases more easily readable, flow.

ـدد   تتـ   سسـ  

Three examples where the same letter is repeated, but the glyph shapes differ.

Context-based positioning

Are there requirements to position diacritics or other items specially, depending on context? Does the script have multiple diacritics competing for the same location relative to the base?

When vowel or shadda diacritics are used they can be placed in different positions, according to the context.

يتكلّم  •  تسجّل

The position of the shadda diacritic depends on the height of the base character in many fonts.

When both shadda and vowel signs are present, a more complicated set of rules may be applied, depending on the font style, to determine the relevant positions. Vowel diacritics are placed above and below the shadda, rather than above and below the base character.

مَمِمّمَّمِّ

The kasra diacritic may appear above the base character when combined with shadda.

Font styles

Are italicisation, bolding, oblique, etc relevant? Do italic fonts lean in the right direction? Is synthesised italicisation problematic? Are there other problems relating to bolding or italicisation - perhaps relating to generalised assumptions of applicability?

Transforming characters

If the script is bicameral, are there special rules about case conversion? Are there other correspondences between glyphs, such as half- vs fullwidth presentation forms?

Structural boundaries & markers

Grapheme boundaries

A grapheme is a user-perceived unit of text. The Unicode Standard uses generalised rules to define 'grapheme clusters', which approximate the likely grapheme boundaries in a writing system.

Do Unicode grapheme clusters appropriately segment character units for the script? Are there special requirements when double-clicking on the text, or moving through the text with the cursor, or backspace, etc.?

Word boundaries

The concept of 'word' is difficult to define in any language (see What is a word?). Here, a word is a vaguely-defined semantic unit that is typically smaller than a phrase and may comprise one or more syllables.

Are words separated by spaces, or other characters? Are there special requirements when double-clicking on the text? Are words hyphenated?

Words are separated by spaces.

In Arabic, small words like 'and' (و) are written alongside the following word with no intervening space (eg. الجامعات والكليات means 'universities and colleges', but there is only one space). Such small words are handled typographically as part of the word they are attached to.

Phrase & section boundaries

What characters are used to indicate the boundaries of phrases, sentences, and sections?

Arabic language uses a mixture of western and arabic punctuation. Other languages using the Arabic script may use different punctuation, such as the full stop in Urdu.

For separators at the sentence level and below, the following are used in Arabic language text, where the right column indicates approximate equivalences to Latin script.

comma ، [U+060C ARABIC COMMA]
semi-colon ؛ [U+061B ARABIC SEMICOLON]
colon : [U+003A COLON]
sentence . [U+002E FULL STOP]
question mark ؟ [U+061F ARABIC QUESTION MARK] 

آخر، … والنساء.

Arabic language text using an arabic comma, but an ASCII full stop.

Arabic language text uses [U+2010 HYPHEN], [U+2013 EN DASH], and [U+2014 EM DASH].

Bracketing & parentheses

What characters are used as parentheses, or to bracket information?

Quotations

What characters are used to indicate quotations? Do quotations within quotations use different characters? What characters are used to indicate dialogue?

Arabic language text typically uses « [U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK] and » [U+00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK] for quotation marks.

It is important to note that the Unicode names for these marks should be ignored. 'Left-pointing' should be read 'Start', and 'Right-pointing' as 'End'. The direction in which the glyphs point will be automatically determined according to the base direction of the text.

Abbreviation, ellipsis & repetition

What characters are used to indicate abbreviation, ellipsis & repetition?

Emphasis & text decoration

How are emphasis and highlighting achieved? If lines are drawn alongside, over or through the text, do they need to be a special distance from the text itself? Is it important to skip characters when underlining, etc? How do things change for vertically set text?

Emphasis can sometimes be expressed by stretching the baseline of one or more words. See the section on justification below for more information about baseline stretching.

Inline notes & annotations

What mechanisms, if any, are used to create inline notes and annotations? (For referent-type notes such as footnotes, see below.)

Line & paragraph layout

This section focuses mainly on Arabic language text, however attention is sometimes drawn to differences when the Arabic script is used for other languages.

Line breaking & hyphenation

Are there special rules about the way text wraps when it hits the end of a line? Does line-breaking wrap whole 'words' at a time, or characters, or something else (such as syllables in Tibetan and Javanese)? What characters should not appear at the end or start of a line, and what should be done to prevent that?

Hyphenation

Is hyphenation used, or something else?

Text alignment & justification

Does text in a paragraph need to have flush lines down both sides? Does the script need assistance to conform to a grid pattern? Does the script allow punctuation to hang outside the text box at the start or end of a line? Where adjustments are needed to make a line flush, how is that done? Does the script shrink/stretch space between words and/or letters? Are word baselines stretched, as in Arabic? What about paragraph indents?

Arabic script justification can use a number of different techniques. These include stretching the baseline and the glyphs of the text, expanding inter-word spaces, application of ligatures or swash forms, etc. Typically these will be applied in combination. Where baseline stretching is applied, the rules for what can be stretched, and how much, are complicated, and differ across writing systems. (Elongation is not normally used at all for the ruq'a style.) It is not a question of simply adding equal-length extensions across the line.

An example from a newspaper column of text justified using tatweel.

The baseline extension character ـ [U+0640 ARABIC TATWEEL] is sometimes suggested as a way of producing justification by extending the baseline, however when a browser window is resized, or when new text is added near the start of a paragraph, lines wrap differently and all the places where tatweel would be needed have to be recalibrated. Thus tatweels only work for static text with fixed dimensions.

Better quality justification systems stretch glyphs, rather than adding baseline extensions. This dynamic stretching of glyphs is often called 'kashida'. In some typesetting systems, such as InDesign, the tatweel character serves more to indicate opportunities for stretching, and the glyph for the character itself is not shown.

It is very common to see baseline stretching in modern Arabic text where a word or phrase is stretched to fill a particular space, eg. the Arabic tag line (الابداع المتجدد Creativity renewed) below the word Lexus in the following image is stretched to be the same width.

Arabic text being stretched to fit the width of text alongside it.

Use the control below to see how your browser justifies the text sample here.

المادة 7 كل الناس سواسية أمام القانون ولهم الحق في التمتع بحماية متكافئة عنه دون أية تفرقة، كما أن لهم جميعاً الحق في حماية متساوية ضد أي تمييز يُخل بهذا الإعلان وضد أي تحريض على تمييز كهذا.

Letter spacing

Does the script create emphasis or other effects by spacing out the words, letters or syllables in a word? (For justification related spacing, see above.)

Counters, lists, etc.

Are there list or other counter styles in use? If so, what is the format used? Do counters need to be upright in vertical text? Are there other aspects related to counters and lists that need to be addressed?

Styling initials

Does the script use special styling of the initial letter of a line or paragraph, such as for drop caps or similar? How about the size relationship between the large letter and the lines alongside? Where does the large letter anchor relative to the lines alongside? Is it normal to include initial quote marks in the large letter? Is the large letter really a syllable? etc.

Baselines & inline alignment

Does the script have special requirements for baseline alignment between mixed scripts and in general?

The alphabetic baseline is a strong feature of Arabic script on the whole, since characters tend to join there. This is not always the case: for example, some adjacent pairs or ligatures have joins above the baseline, and initial letters in some fonts may start slightly above the baseline, but for most cases it remains a strong feature.

The nastaliq style of the script, on the other hand, uses arrangements of joined glyphs that cascade downwards from right to left, and resemble a strongly sloping baseline.

مستحق  •  شخص  •  کیفیت

Sloping baselines in Urdu nastaliq text.

Page & book layout

General page layout & progression

How are the main text area and ancillary areas positioned and defined? Are there any special requirements here, such as dimensions in characters for the Japanese kihon hanmen? The book cover for scripts that are read right-to-left is on the right of the spine, rather than the left. When content can flow vertically and to the left or right, how to specify the location of objects, text, etc. relative to the flow? Do tables and grid layouts work as expected? How do columns work in vertical text? Can you mix blocks of vertical and horizontal text? Does text scroll in a different direction?

Grids & tables

Does the script have special requirements for character grids or tables?

Notes, footnotes, etc

Does the script have special requirements for notes, footnotes, endnotes or other necessary annotations of this kind? (There is a section above for purely inline annotations, such as ruby or warichu. This section is more about annotation systems that separate the reference marks and the content of the notes.)

Forms & user interaction

Are vertical form controls needed? Are scroll bars in an unusual position? Other special requirements for user interaction?

Page numbering, running headers, etc

Are there special conventions for page numbering, or the way that running headers and the like are handled?

Languages using the Arabic script

According to ScriptSource, the Arabic script is used for the following languages:

References

  1. [U] The Unicode Standard v10.0, Arabic, pp371-393.
  2. [D] Peter T. Daniels and William Bright, The World's Writing Systems, Oxford University Press, ISBN 0-19-507993-0, pp559-563.
  3. [W] Wikipedia, Arabic script.
  4. [S] Scriptsource, Arabic
Last changed 2019-07-31 10:52 GMT.  •  Licence CC-By © r12a.