Updated 14 July, 2019 • tags arabic, scriptnotes
This page provides basic information about the Arabic script and its use for the Arabic language. It is not authoritative, peer-reviewed information – these are just notes I have gathered or copied from various places as I learned. For character-specific details follow the links to the Arabic character notes.
For similar information related to this and other scripts, see the script links pages.
المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق. وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضًا بروح الإخاء.
المادة 2 لكل إنسان حق التمتع بكافة الحقوق والحريات الواردة في هذا الإعلان، دون أي تمييز، كالتمييز بسبب العنصر أو اللون أو الجنس أو اللغة أو الدين أو الرأي السياسي أو أي رأي آخر، أو الأصل الوطني أو الإجتماعي أو الثروة أو الميلاد أو أي وضع آخر، دون أية تفرقة بين الرجال والنساء. وفضلاً عما تقدم فلن يكون هناك أي تمييز أساسه الوضع السياسي أو القانوني أو الدولي لبلد أو البقعة التي ينتمي إليها الفرد سواء كان هذا البلد أو تلك البقعة مستقلاً أو تحت الوصاية أو غير متمتع بالحكم الذاتي أو كانت سيادته خاضعة لأي قيد من القيود.
Arabic writing is the second most broadly-used script in the world, after the Latin alphabet. It descended from the Nabataean abjad, itself a descendant of the Phoenician script, and has been used since the 4th century for writing the Arabic language. Since the words of the Prophet Muhammed can only be written in Arabic, the Arabic script has traveled far and wide with the spread of Islam and came to be used for a number of languages throughout Asia, Africa and the Middle East. Many of these are non-Semitic languages, so employ very different sound systems from spoken Arabic, and as a result the script has had to be adapted and is used slightly differently by speakers of different languages. Many African languages use an Arabic-based transcription system called Ajami, which is different from the original Arabic script. Romance languages such as Mozarabic or Ladino are also sometimes written in a modified Arabic script, called Aljamiado.
Many variations on the script have developed over time and space, but these can be broadly classified into two groups: an angular kufic style, which was originally used for stone inscriptions and which commonly employs no diacritics, and the naskh style, which is more commonly used, more rounded in form, and governed by a set of principles regulating the proportions between the letters. There are a number of variant styles included in this group, including those used in Arabic calligraphy.
The Arabic script is the writing system used for writing Arabic and several other languages of Asia and Africa, such as Persian, Urdu, Azerbaijani, Pashto, Central Kurdish, Luri, dialects of Mandinka, and others. Until the 16th century, it was also used to write some texts in Spanish. It is the second-most widely used writing system in the world by the number of countries using it and the third by the number of users, after Latin and Chinese characters. ...The script was first used to write texts in Arabic, most notably the Qurʼān, the holy book of Islam. With the spread of Islam, it came to be used to write languages of many language families, leading to the addition of new letters and other symbols, with some versions, such as Kurdish, Uyghur, and old Bosnian being abugidas or true alphabets. It is also the basis for the tradition of Arabic calligraphy.
The Arabic script is an abjad. This means that in normal use the script represents only consonant and long vowel sounds. This approach is helped by the strong emphasis on consonant patterns in Semitic languages, however the Arabic script is also used for other kinds of language (such as Urdu and Uighur). See the table to the right for a brief overview of features, taken from the Script Comparison Table.
Arabic script is written horizontally, right-to-left, but numbers and embedded Latin text are read left-to-right. Words are separated by spaces, and contain a mixture of consonants and long vowels. Diacritics can be used to indicate short vowel sounds or other phonetic information, where needed.
The script is cursive, and some basic letter shapes change radically, depending on what they join to. It is also very common for adjacent characters to ligate and to stretch to fill available space. Many of the characters share a common base form, and are distinguished by the number and location of dots or other small diacritics, called i'jam. For example, س ش ݜ ݰ ݽ ݾ ڛ ښ ڜ ۺ.
The Arabic script characters in Unicode 10.0 are spread across 3 blocks:
There are two additional blocks for presentation forms, but (with the exception of a handful of code points) these characters are only for compatibility with legacy encodings, and should not be used. Sometimes they are used by people to get around problems with Arabic support in applications, but this is a bad idea since it corrupts the underlying data, making it difficult to search, spellcheck, or do many other things that rely on the use of standard characters and their properties.
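As a rough illustration of the point above, the following Python sketch flags presentation-form code points in a string and folds them back to standard letters. The helper names are hypothetical, and using NFKC for the clean-up is my assumption rather than a recommendation from these notes: it is a blunt tool that also changes other compatibility characters, including the handful of presentation-block code points that are legitimately used.

```python
import unicodedata

# Arabic Presentation Forms-A and Forms-B block ranges (inclusive).
PRESENTATION_RANGES = ((0xFB50, 0xFDFF), (0xFE70, 0xFEFF))

def find_presentation_forms(text):
    """Return (index, character, Unicode name) for each presentation-form code point."""
    hits = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in PRESENTATION_RANGES):
            hits.append((i, ch, unicodedata.name(ch, "<unnamed>")))
    return hits

def to_standard_letters(text):
    """Fold compatibility characters to standard ones using NFKC (lossy)."""
    return unicodedata.normalize("NFKC", text)

legacy = "\uFEFB"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
print(find_presentation_forms(legacy))
print([f"U+{ord(c):04X}" for c in to_standard_letters(legacy)])
```

Review the result before saving it back, since NFKC is not limited to Arabic characters.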
The following links give information about characters used for languages associated with this script. The numbers in parentheses are for non-ASCII characters.
There are separate pages about the Urdu Writing System and the Uighur Writing System, which map the Arabic script characters to sounds in a slightly different way from that of the Arabic language and treat the characters differently.
For character-specific details see Arabic character notes.
Arabic script is written horizontally and right-to-left in the main but, as in most right-to-left scripts, numbers and embedded left-to-right script text are written left-to-right (producing 'bidirectional' text).
A sequence of numbers, for example a range separated by hyphens, runs right to left in the Arabic language (and Thaana or Syriac scripts), whereas for Persian language text (and in Hebrew, N’Ko or Adlam scripts) it runs left to right.
In the following Arabic text, which is right-to-left overall, the numeric range is also ordered RTL, ie. it starts with 10 and ends with 12:
In Persian, however, the expression would run LTR, so this would be:
The Unicode Bidirectional Algorithm automatically produces the Arabic ordering when a sequence or expression follows Arabic text. However, a sequence that appears alone on a line doesn't benefit from this, so to make the text appear correctly for Arabic you should add U+061C ARABIC LETTER MARK (ALM) at the start of the line. This is effectively an invisible Arabic script character.
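A minimal Python sketch of that advice, assuming the text is handled one line at a time: it prepends ALM only when the first directionally significant character on the line is a digit. The function names and the exact test are my own simplification, not part of the Unicode Bidirectional Algorithm.

```python
import unicodedata

ALM = "\u061C"  # ARABIC LETTER MARK

def needs_alm(line):
    """True if a digit appears before any strongly directional character."""
    for ch in line:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("L", "R", "AL"):   # strong character comes first
            return False
        if bidi in ("EN", "AN"):       # European or Arabic-Indic digit comes first
            return True
    return False

def with_alm(line):
    """Prepend ALM to lines that start with a bare numeric sequence."""
    return ALM + line if needs_alm(line) else line

print(repr(with_alm("10-12")))
```

Because ALM itself has the strong AL class, running the function twice does not add a second mark.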
If you are writing in Persian, on the other hand, you don't need to add anything in this case.
However, if you are writing in Persian and the sequence or expression follows text you need to either isolate the sequence directionally or precede it with U+200E LEFT-TO-RIGHT MARK (LRM) to make it look correct.
Similar special ordering is applied to numbers in equations, such as 1 + 2 = 3, for Arabic language text.
In Arabic, the following may indicate the location of a long vowel, eg. قلوب qlwb quluːb hearts, تاريخ tɑryx tɑːriːx history. They are always visible, whether or not the text shows vowel diacritics.
These characters, especially ا [U+0627 ARABIC LETTER ALEF], may also be used with a number of other small marks, such as hamza, for particular effects. Read further for more details.
(At the risk of being pedantic, alef doesn't actually represent a consonant on its own (unlike the other two). It is really only a support for a vowel and/or diacritic.)
ى [U+0649 ARABIC LETTER ALEF MAKSURA] represents the long a-vowel at the end of many words when it is written with yeh instead of an alef. In this case the yeh is typically printed without dots, to avoid confusion, although this is not universal. This spelling only occurs with certain words, and only when the final sound is aː, eg. معنى mæʕnaː. If any suffix is added, the spelling reverts to the normal alef, eg. معناهم mæʕnaː-hum.
Short vowels can be expressed using diacritics, eg. العَرَبِيَّة ɑlʕarabiyaᵚẗ (al-ʻarabīyah) Arabic, however for languages such as Arabic, Persian and Urdu they are typically not used, unless there is a particular need to help the reader understand the pronunciation. The previous example would therefore usually be written العربية ɑlʕrbyẗ (al-ʻarabīyah). On the other hand, when the script is used for Uighur, all vowels are shown, as a matter of course. These diacritics are also used in the Quran (though not originally), to reduce ambiguity.
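To make the relationship between vowelled and unvowelled text concrete, here is a small Python sketch that removes the diacritics from the example word. The function name is hypothetical, and limiting the removal to U+064B..U+0652 (the tanwin signs, short vowels, shadda and sukun) is my assumption; other marks, such as superscript alef, are left alone.

```python
# Code points U+064B (fathatan) through U+0652 (sukun) are mapped to None,
# which str.translate treats as "delete this character".
HARAKAT = {cp: None for cp in range(0x064B, 0x0653)}

def strip_harakat(text):
    """Remove short-vowel and related diacritics, leaving the base letters."""
    return text.translate(HARAKAT)

# al-ʻarabīyah with fatha, kasra and shadda, written with escapes for clarity.
vowelled = "\u0627\u0644\u0639\u064E\u0631\u064E\u0628\u0650\u064A\u0651\u064E\u0629"
print(strip_harakat(vowelled))
```

This kind of folding is sometimes useful for search, where a query may or may not carry diacritics.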
The Arabic language uses the following vowel diacritics:
There is a secondary set of vowel diacritics with origins in Classical Arabic, where indefinite nouns and adjectives were marked by a final n-sound, called تنوين tænwiːn or, in English, 'nunation'. This is normally indicated by visually doubling the vowel diacritic, but there are Unicode characters for each combination.
On a word ending with an a-vowel (though not with a feminine ending or some other suffixes) an extra alef was also added at the end of the word. In modern Arabic printing the fathatan is usually dropped, but the alef is retained. The pronunciation of the ending æn is also retained in many words, eg. كِتَابًا kitaɑbaⁿɑ kɪtæːbæn, فَرَسًا farasaⁿɑ færæsæn.
ٰ [U+0670 ARABIC LETTER SUPERSCRIPT ALEF] is used in certain Arabic words such as هٰذَا this or ذٰلِكَ that, and not forgetting اللّٰه Allah.
When text is vowelled, ْ [U+0652 ARABIC SUKUN] can be used over a consonant to indicate that it is not followed by a vowel sound, eg. مَكْتَب maktab.
The main Unicode Arabic block contains 153 letters, with 77 more in the extended blocks. As shown in the previous section, only a small subset of those are used to write a given language. The others represent special characters added to the repertoire for one or other of the many languages for which the Arabic script is used.
The vast majority of letters represent consonants. A few represent long vowels.
The following letters are those generally recognised as constituting the alphabet for the Standard Arabic language.
Of those, as mentioned earlier, some letters represent long vowel locations or combinations of consonant plus vowel.
Other Unicode letters commonly found in Arabic include:
Most of the above letters with diacritics decompose in Unicode Normalization Form D (NFD), however ة [U+0629 ARABIC LETTER TEH MARBUTA] does not.
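The decomposition behaviour can be checked with Python's standard unicodedata module. This is only a quick sketch; the choice of letters and the function name are mine.

```python
import unicodedata

def nfd_report(letters):
    """Map each letter to the list of code points it becomes under NFD."""
    report = {}
    for ch in letters:
        decomposed = unicodedata.normalize("NFD", ch)
        report[ch] = [f"U+{ord(c):04X}" for c in decomposed]
    return report

# Alef with madda, the four hamza carriers, teh marbuta, and alef wasla.
LETTERS = "\u0622\u0623\u0624\u0625\u0626\u0629\u0671"
for ch, parts in nfd_report(LETTERS).items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {' '.join(parts)}")
```

Letters that map to a single code point identical to the input have no canonical decomposition.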
ء [U+0621 ARABIC LETTER HAMZA] represents the glottal stop sound. For historical reasons, it is treated as an orthographic sign rather than as a letter of the alphabet. It sometimes stands alone, but usually appears with a 'carrier' letter (alef, waw, or yeh), for which separate precomposed characters are available in Unicode ( أ إ ؤ ئ ). Examples of use include إكرام ɑ̜krɑm ikrɑːm Ikram, نائم nɑy͑m nɑːʔim sleeping, and بناء bnɑʔ binɑːʔ building.
In modern printed Arabic, the hamza is rarely shown when it occurs at the beginning of a word, but may appear in conjunction with another character. When the hamza is above another character you should typically use ٔ [U+0654 ARABIC HAMZA ABOVE] with the appropriate base character, although there are a number of exceptions. For more details, see the character description.
Classical Arabic distinguishes between 'cutting' and 'joining' hamza. 'Cutting' means always pronounced, 'joining' means frequently elided.
The joining hamza is of little practical importance in modern Arabic pronounced without the old case endings. When it does appear in modern Arabic, ٱ [U+0671 ARABIC LETTER ALEF WASLA] is used to indicate a joining hamza.
آ [U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE] is used when either of the two following combinations of glottal stop and a vowel appear in a word:
ʔaʔ (hamza, short a, hamza) eg. آثار ɑ̄θɑr ʔaːθaːr effects
ʔaː (hamza, long a) eg. القرآن ɑlqrɑ̄n alqur'ʔaːn qur'ʔaːn
Normal pronunciation in both cases is ʔaː.
The madda sign is still very often shown in print.
ة [U+0629 ARABIC LETTER TEH MARBUTA] usually has no sound, eg. مدرسة mdrsẗ mædræsæ school, but is sometimes pronounced t in specific grammatical contexts.
It is used for historical reasons to write the feminine ending, æ – the dots are borrowed from teh (ت) – and is only used in final position. If any suffix is added, the ending is spelled with ت [U+062A ARABIC LETTER TEH], eg. مدرستنا mdrstnɑ mædræsæt-naː our school.
In modern Arabic it is not uncommon to find the two dots omitted, particularly on masculine proper names that have the feminine ending, eg. طلبة t̴lbẗ tulbæ.
Vowelled text may omit the short æ diacritic before the teh marbuta, because the sound is always the same.
The following characters also have the general property of Letter, but are less commonly used for modern Arabic language text.
ڢ [U+06A2 ARABIC LETTER FEH WITH DOT MOVED BELOW] and ڧ [U+06A7 ARABIC LETTER QAF WITH DOT ABOVE] are alternative forms that are used in Northwest Africa. ࢲ [U+08B2 ARABIC LETTER ZAIN WITH INVERTED V ABOVE] is used for Berber.
ٱ [U+0671 ARABIC LETTER ALEF WASLA] is described in the section hamza. Whereas many of the above letters with diacritics decompose in Unicode Normalization Form D (NFD), this letter does not.
ﷲ [U+FDF2 ARABIC LIGATURE ALLAH ISOLATED FORM] is a letter from the Arabic precomposed block used to write the name of Allah. The composition of this character differs from font to font in terms of glyph forms. With some fonts it is necessary to add diacritics, whereas with others it is not.
ـ [U+0640 ARABIC TATWEEL] is used to stretch words for simple justification, or to make a word or phrase a particular width, or as a form of emphasis. For more information see justify.
The diacritic ّ [U+0651 ARABIC SHADDA] doubles the value of the consonant it is attached to, which is phonemically significant in Arabic, eg. تاجر، تجّار tɑʒr, tʒᵚɑr (tajir, tujjar) trader, traders. It, too, is not often used, although sometimes it appears when vowel signs don't.
A common, though not universal, practice is to display any combining kasra below the shadda, rather than below the base consonant, eg. قَبِّل qæbːɪl. Some fonts, such as Amiri, don't do this.
The main Arabic block contains 52 combining characters, with 43 more in the Arabic Extended-A block. However, only a small number are typically used for normal, written Arabic, Persian, etc.
The standard diacritics in the Arabic language repertoire include the following:
All of these diacritics are discussed in earlier sections. Follow the links for more information.
Multiple combining characters may be used for a single base character, such as when both a shadda and a vowel diacritic are used together.
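One practical consequence of stacking marks is that the same visible text can be typed in more than one order. The Python sketch below, which is my own illustration and not something described in these notes, shows that shadda followed by fatha and fatha followed by shadda are canonically equivalent, so they compare equal after normalization. The helper name is hypothetical.

```python
import unicodedata

BEH, SHADDA, FATHA = "\u0628", "\u0651", "\u064E"

typed_shadda_first = BEH + SHADDA + FATHA
typed_fatha_first = BEH + FATHA + SHADDA

def same_text(a, b):
    """Compare two strings after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The canonical combining classes decide the normalized order of the marks.
print(unicodedata.combining(SHADDA), unicodedata.combining(FATHA))
print(typed_shadda_first == typed_fatha_first)
print(same_text(typed_shadda_first, typed_fatha_first))
```

Normalizing before comparing or indexing avoids treating the two typing orders as different words.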
Modern Arabic text typically uses the following punctuation characters from the Unicode Arabic block.
The Arabic language also uses western punctuation, including the following non-ASCII characters from other Unicode blocks.
Other punctuation in the Unicode Arabic block, infrequently used for the Arabic language.
For information about how these and punctuation marks from other blocks are used for the Arabic language, see the phrase and numbers sections below.
There are only 3 more characters with the general category of punctuation in the Unicode Arabic blocks.
Only the main Arabic Unicode block contains the symbols, none of which are widely used by Arabic language text.
Characters in the Arabic Presentation Forms blocks should not normally be used, but they contain just a few symbols that are not just for compatibility use, including the following.
For more information about how they are used, click on them and follow the links to the character notes page.
The Arabic script uses a large number of Unicode characters that affect the way that other characters are rendered. Many of those have no visible form of their own.
The following set does have a visual representation. All these characters are found in Unicode's Arabic block, but none are commonly used for modern Arabic language text.
Modern Arabic text makes use of a relatively large set of invisible formatting characters, especially in plain text, many of which are used to manage text direction.
RLE [U+202B RIGHT-TO-LEFT EMBEDDING], LRE [U+202A LEFT-TO-RIGHT EMBEDDING], and PDF [U+202C POP DIRECTIONAL FORMATTING] are in widespread use to set the base direction of a range of characters. RLE/LRE come at the start, and PDF at the end of a range of characters for which the base direction is to be set.
More recently, the Unicode Standard added a set of characters which do the same thing but also isolate the content from surrounding characters, in order to avoid spillover effects. They are RLI [U+2067 RIGHT-TO-LEFT ISOLATE], LRI [U+2066 LEFT-TO-RIGHT ISOLATE], and PDI [U+2069 POP DIRECTIONAL ISOLATE]. The Unicode Standard recommends that these be used instead, however some applications don't yet recognise them.
There is also FSI [U+2068 FIRST STRONG ISOLATE], used initially to set the base direction according to the first recognised strongly-directional character.
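In plain text, the isolate controls are simply wrapped around the range in question. A minimal Python sketch, with a hypothetical helper name and direction keywords of my own choosing:

```python
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def isolate(text, direction="auto"):
    """Wrap text in an isolate; 'auto' lets the first strong character decide."""
    opener = {"rtl": RLI, "ltr": LRI, "auto": FSI}[direction]
    return opener + text + PDI

# Embed a Latin product name in a right-to-left sentence without spillover.
sentence = "\u0627\u0633\u0645 " + isolate("W3C (World Wide Web Consortium)", "ltr")
print([f"U+{ord(c):04X}" for c in isolate("abc", "rtl")])
```

An unknown direction keyword raises KeyError, which seemed preferable to silently guessing.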
ALM [U+061C ARABIC LETTER MARK] is used to produce correct sequencing of numeric data. Follow the link for details.
RLM [U+200F RIGHT-TO-LEFT MARK] and LRM [U+200E LEFT-TO-RIGHT MARK] are invisible characters with strong directional properties that are also sometimes used to produce the correct ordering of text.
For more information about how to use these formatting characters see How to use Unicode controls for bidi text. Note, however, that when writing HTML you should generally use markup rather than these control codes. For information about that, see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.
ZWJ [U+200D ZERO WIDTH JOINER] and ZWNJ [U+200C ZERO WIDTH NON-JOINER] are used to control the joining behaviour of cursive glyphs. They are particularly useful in educational contexts, but also have real world applications.
ZWJ permits a letter to form a cursive connection without a visible neighbour. For example, the marker for hijri dates is an initial form of heh, even though it doesn't join to the left, ie. ه‍. For this, use ZWJ immediately after the heh, eg. الاثنين 10 رجب 1415 ه‍.
ZWNJ prevents two adjacent letters forming a cursive connection with each other when rendered. For example, it is used in Persian for plural suffixes, some proper names, and Ottoman Turkish vowels. Ignoring or removing the ZWNJ will result in text with a different meaning or meaningless text, eg. تن‌ها is the plural of body, whereas تنها is the adjective alone. The only difference is the presence or absence of ZWNJ after noon.
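Since these controls are invisible, it can help to build the strings explicitly. The Python sketch below constructs the Persian pair and the hijri marker with escapes, and shows why a clean-up step that strips "invisible" characters is dangerous. The function name is hypothetical.

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER

# teh, noon, ZWNJ, heh, alef: the plural of 'body'.
plural_of_body = "\u062A\u0646" + ZWNJ + "\u0647\u0627"
# teh, noon, heh, alef: the adjective 'alone'.
alone = "\u062A\u0646\u0647\u0627"
# heh followed by ZWJ: requests the initial form for the hijri marker.
hijri_marker = "\u0647" + ZWJ

def strip_join_controls(text):
    """What an over-eager clean-up step might do; shown here as a warning."""
    return text.replace(ZWNJ, "").replace(ZWJ, "")

print(plural_of_body == alone)
print(strip_join_controls(plural_of_body) == alone)
```

After stripping, the two words can no longer be told apart, which is the loss of meaning described above.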
CGJ [U+034F COMBINING GRAPHEME JOINER] is used in Arabic to produce special ordering of diacritics. The name is a misnomer, as it is generally used to break the normal sequence of diacritics.
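The blocking effect can be seen with normalization. In this Python sketch (my own illustration; the variable names are arbitrary), a shadda followed by a fatha is reordered by NFC, whereas the same sequence with CGJ between the two marks keeps the order in which it was typed.

```python
import unicodedata

BEH, SHADDA, FATHA = "\u0628", "\u0651", "\u064E"
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER

plain = BEH + SHADDA + FATHA
guarded = BEH + SHADDA + CGJ + FATHA

def codepoints(text):
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in text]

# Without CGJ, canonical reordering puts the fatha before the shadda.
print(codepoints(unicodedata.normalize("NFC", plain)))
# With CGJ, the typed order survives normalization.
print(codepoints(unicodedata.normalize("NFC", guarded)))
```

This works because CGJ has a combining class of zero, so marks are not reordered across it.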
A set of Arabic-Indic digits is typically used in Middle Eastern and Gulf countries, whereas North African countries tend to use European digits. In neither area, however, is one digit style used exclusively.
Still in the basic Unicode Arabic block, there is a second set of digits in Unicode for use in languages such as Persian and Urdu.
The glyph shapes are typically different for 4 of the digits (although there can also be differences between Persian and Urdu shapes).
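The two digit sets sit at U+0660..U+0669 and U+06F0..U+06F9, so converting between them and European digits is a simple table lookup. A small Python sketch, with hypothetical names:

```python
def digit_table(zero_codepoint):
    """Build a translate table from ASCII digits to the set starting at zero_codepoint."""
    return {ord(str(d)): chr(zero_codepoint + d) for d in range(10)}

TO_ARABIC_INDIC = digit_table(0x0660)   # typical for Arabic
TO_EXTENDED = digit_table(0x06F0)       # typical for Persian and Urdu

def convert_digits(text, table):
    """Replace ASCII digits in text using the given table."""
    return text.translate(table)

print(convert_digits("2019", TO_ARABIC_INDIC))
print(convert_digits("2019", TO_EXTENDED))
# Python's int() accepts either set directly, because both are decimal digits.
print(int(convert_digits("42", TO_EXTENDED)))
```

Which set to output is a locale decision, so the table is passed in rather than guessed.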
See also the information about handling expressions or sequences of numbers, below.
Arabic script has its own number separators, which are used in Arabic language text when the non-European digits are used. They are ٫ [U+066B ARABIC DECIMAL SEPARATOR] and ٬ [U+066C ARABIC THOUSANDS SEPARATOR].
Arabic also has its own characters for ٪ [U+066A ARABIC PERCENT SIGN] and ؉ [U+0609 ARABIC-INDIC PER MILLE SIGN].
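As a rough sketch of how the separators fit together with Arabic-Indic digits, the Python below formats a number in ASCII first and then swaps in the Arabic characters. The function name and the three-digit grouping are my assumptions; a real application would more likely rely on a locale library.

```python
# ASCII digits map to U+0660..U+0669; ',' and '.' map to the Arabic separators.
ARABIC_NUMBER_TABLE = {ord(str(d)): chr(0x0660 + d) for d in range(10)}
ARABIC_NUMBER_TABLE[ord(",")] = "\u066C"  # ARABIC THOUSANDS SEPARATOR
ARABIC_NUMBER_TABLE[ord(".")] = "\u066B"  # ARABIC DECIMAL SEPARATOR

def format_arabic_number(number, decimals=2):
    """Format with grouping in ASCII, then translate digits and separators."""
    ascii_form = format(number, f",.{decimals}f")
    return ascii_form.translate(ARABIC_NUMBER_TABLE)

print(format_arabic_number(1234567.89))
print(format_arabic_number(50, decimals=0) + "\u066A")  # with ARABIC PERCENT SIGN
```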
Arabic script joins letters together. This results in four different shapes for most letters (including an isolated shape). The highlights in the example below show the same letter, ع [U+0639 ARABIC LETTER AIN], with three different joining forms.
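For teaching material it is sometimes useful to show each joining form on its own. One way to do that without resorting to presentation forms is to place ZWJ next to the letter, as in this Python sketch. The helper is hypothetical, and the result depends on the font and renderer honouring ZWJ.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

def joining_forms(letter):
    """Return strings that request each joining form of a dual-joining letter."""
    return {
        "isolated": letter,
        "initial": letter + ZWJ,        # joins only to the following letter
        "medial": ZWJ + letter + ZWJ,   # joins on both sides
        "final": ZWJ + letter,          # joins only to the preceding letter
    }

for name, text in joining_forms("\u0639").items():  # ARABIC LETTER AIN
    print(name, text)
```

The underlying letter stays U+0639 in every case, so the text remains searchable.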
A few Arabic script letters only join on the right-hand side.
Ligated glyph forms are common in Arabic. Some, such as لا are mandatory. Most of the remainder depend on the font style. Traditional fonts tend to have more ligated forms than modern styles.
In more traditional fonts, you will also often see the join between certain characters, when adjacent, above the baseline, like this:
rather than on the baseline, like this:
But actually a good font will constantly change the shape of glyphs slightly so as to create a more aesthetically pleasing, and in some cases more easily readable, flow.
When vowel or shadda diacritics are used they can be placed in different positions, according to the context.
When both shadda and vowel signs are present, a more complicated set of rules may be applied, depending on the font style, to determine the relevant positions. Vowel diacritics are placed above and below the shadda, rather than above and below the base character.
Words are separated by spaces.
In Arabic, small words like 'and' (و) are written alongside the following word with no intervening space (eg. الجامعات والكليات means 'universities and colleges', but there is only one space). Such small words are handled typographically as part of the word they are attached to.
The Arabic language uses a mixture of Western and Arabic punctuation. Other languages using the Arabic script may use different punctuation, such as the full stop in Urdu.
For separators at the sentence level and below, the following are used in Arabic language text, where the right column indicates approximate equivalences to Latin script.
|comma||، [U+060C ARABIC COMMA]|
|semi-colon||؛ [U+061B ARABIC SEMICOLON]|
|colon||: [U+003A COLON]|
|sentence||. [U+002E FULL STOP]|
|question mark||؟ [U+061F ARABIC QUESTION MARK]|
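The table can be read as a mapping: comma, semicolon and question mark have Arabic-specific characters, while colon and full stop are shared with Latin script. A Python sketch of that mapping follows; the function is hypothetical, and applying it blindly would also change punctuation inside embedded Latin text or numbers, so treat it as an illustration only.

```python
# Only the three marks with Arabic-specific characters are mapped;
# colon and full stop stay as U+003A and U+002E.
TO_ARABIC_PUNCTUATION = str.maketrans({
    ",": "\u060C",  # ARABIC COMMA
    ";": "\u061B",  # ARABIC SEMICOLON
    "?": "\u061F",  # ARABIC QUESTION MARK
})

def arabize_punctuation(text):
    """Swap Latin comma, semicolon and question mark for the Arabic ones."""
    return text.translate(TO_ARABIC_PUNCTUATION)

print(arabize_punctuation("\u0645\u0646, \u0623\u064A\u0646?"))
```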
Arabic language text uses ‐ [U+2010 HYPHEN], – [U+2013 EN DASH], and — [U+2014 EM DASH].
Emphasis can sometimes be expressed by stretching the baseline of one or more words. See the section on justification below for more information about baseline stretching.
Arabic language text typically uses « [U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK] and » [U+00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK] for quotation marks.
It is important to note that the Unicode names for these marks should be ignored. 'Left-pointing' should be read 'Start', and 'Right-pointing' as 'End'. The direction in which the glyphs point will be automatically determined according to the base direction of the text.
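In code this means the opening mark is always U+00AB and the closing mark always U+00BB in the stored text, whatever the direction. The Python sketch below uses a hypothetical helper and also checks the Unicode 'mirrored' property, which is what allows a renderer to flip the glyphs in right-to-left text.

```python
import unicodedata

OPEN_QUOTE = "\u00AB"   # named LEFT-POINTING, but treat it as 'start'
CLOSE_QUOTE = "\u00BB"  # named RIGHT-POINTING, but treat it as 'end'

def quote(text):
    """Wrap text in quotation marks in logical (start, end) order."""
    return OPEN_QUOTE + text + CLOSE_QUOTE

# Both marks carry the Bidi_Mirrored property.
print(unicodedata.mirrored(OPEN_QUOTE), unicodedata.mirrored(CLOSE_QUOTE))
print(quote("\u0645\u0631\u062D\u0628\u0627"))
```

Swapping the two characters by hand to make right-to-left text 'look right' would be undone by the renderer's own mirroring.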
This section focuses mainly on Arabic language text, however attention is sometimes drawn to differences when the Arabic script is used for other languages.
The alphabetic baseline is a strong feature of Arabic script on the whole, since characters tend to join there. This is not always the case: for example, some adjacent pairs or ligatures have joins above the baseline, and initial letters in some fonts may start slightly above the baseline, but for most cases it remains a strong feature.
The nastaliq style of the script, on the other hand, uses arrangements of joined glyphs that cascade downwards from right to left, and resemble a strongly sloping baseline.
Arabic script justification can use a number of different techniques. These include stretching the baseline and the glyphs of the text, expanding inter-word spaces, application of ligatures or swash forms, etc. Typically these will be applied in combination. Where baseline stretching is applied, the rules for what can be stretched, and how much, are complicated, and differ across writing systems. (Elongation is not normally used at all for the ruq'a style.) It is not a question of simply adding equal-length extensions across the line.
The baseline extension character ـ [U+0640 ARABIC TATWEEL] is sometimes suggested as a way of producing justification by extending the baseline, however when a browser window is resized, or when new text is added near the start of a paragraph, lines wrap differently and all the places where tatweel would be needed have to be recalibrated. Thus tatweels only work for static text with fixed dimensions.
Better quality justification systems stretch glyphs, rather than adding baseline extensions. This dynamic stretching of glyphs is often called 'kashida'. In some typesetting systems, such as InDesign, the tatweel character serves more to indicate opportunities for stretching, and the glyph for the character itself is not shown.
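A related practical point, which is my own addition rather than something from the sources above: when tatweel characters are stored in the text, a stretched word no longer matches its unstretched spelling. A small Python sketch of removing them before comparison, using a hypothetical helper name:

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL

def strip_tatweel(text):
    """Remove baseline extension characters, eg. before searching or comparing."""
    return text.replace(TATWEEL, "")

plain = "\u0627\u0644\u0627\u0628\u062F\u0627\u0639"
stretched = "\u0627\u0644\u0627\u0628" + TATWEEL * 3 + "\u062F\u0627\u0639"

print(stretched == plain)
print(strip_tatweel(stretched) == plain)
```

Stretching done by the renderer at display time does not have this side effect, since the stored text is unchanged.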
It is very common to see baseline stretching in modern Arabic text where a word or phrase is stretched to fill a particular space, eg. the Arabic tag line (الابداع المتجدد Creativity renewed) below the word Lexus in the following image is stretched to be the same width.
Other features to be investigated in this section include:
Glyph shaping & positioning: Cursive text; Context-based shaping; Multiple combining characters; Context-based positioning; Transforming characters
Structural boundaries & markers: Grapheme, word & phrase boundaries; Hyphens & dashes; Bracketing information; Quotations; Abbreviations, ellipsis, & repetition; Emphasis & highlights; Inline notes & annotations
Inline layout: Inline text spacing; Bidirectional text
Line & paragraph layout: Text direction; Line breaking; Hyphenation; Text alignment & justification; Counters, lists, etc.; Styling initials; Baselines & inline alignment
Page & book layout: General page layout & progression; Directional layout features; Grids & tables; Notes, footnotes, etc.; Forms & user interaction; Page numbering, running headers, etc.