Updated 11 February, 2018 • tags arabic, scriptnotes
This page provides basic information about the Arabic script and its use for the Arabic language. (See also the pages Urdu Writing System and Uighur Writing System.) It is not authoritative, peer-reviewed information – these are just notes I have gathered or copied from various places as I learned. For character-specific details follow the links to the Arabic character notes.
For similar information related to other scripts, see the Script comparison table.
المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق. وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضًا بروح الإخاء.
المادة 2 لكل إنسان حق التمتع بكافة الحقوق والحريات الواردة في هذا الإعلان، دون أي تمييز، كالتمييز بسبب العنصر أو اللون أو الجنس أو اللغة أو الدين أو الرأي السياسي أو أي رأي آخر، أو الأصل الوطني أو الإجتماعي أو الثروة أو الميلاد أو أي وضع آخر، دون أية تفرقة بين الرجال والنساء. وفضلاً عما تقدم فلن يكون هناك أي تمييز أساسه الوضع السياسي أو القانوني أو الدولي لبلد أو البقعة التي ينتمي إليها الفرد سواء كان هذا البلد أو تلك البقعة مستقلاً أو تحت الوصاية أو غير متمتع بالحكم الذاتي أو كانت سيادته خاضعة لأي قيد من القيود.
Arabic writing is the second most broadly-used script in the world, after the Latin alphabet. It descended from the Nabataean abjad, itself a descendant of the Phoenician script, and has been used since the 4th century for writing the Arabic language. Since the words of the Prophet Muhammed can only be written in Arabic, the Arabic script has traveled far and wide with the spread of Islam and came to be used for a number of languages throughout Asia, Africa and the Middle East. Many of these are non-Semitic languages, so employ very different sound systems from spoken Arabic, and as a result the script has had to be adapted and is used slightly differently by speakers of different languages. Many African languages use an Arabic-based transcription system called Ajami, which is different from the original Arabic script. Romance languages such as Mozarabic or Ladino are also sometimes written in a modified Arabic script, called Aljamiado.
Many variations on the script have developed over time and space, but these can be broadly classified into two groups; an angular kufic style which was originally used for stone inscriptions and which commonly employs no diacritics, and the naskh style which is more commonly used, more rounded in form, and governed by a set of principles regulating the proportions between the letters. There are a number of variant styles included in this group, including those used in Arabic calligraphy.
The Arabic script is the writing system used for writing Arabic and several other languages of Asia and Africa, such as Persian, Urdu, Azerbaijani, Pashto, Central Kurdish, Luri, dialects of Mandinka, and others. Until the 16th century, it was also used to write some texts in Spanish. It is the second-most widely used writing system in the world by the number of countries using it and the third by the number of users, after Latin and Chinese characters. ...The script was first used to write texts in Arabic, most notably the Qurʼān, the holy book of Islam. With the spread of Islam, it came to be used to write languages of many language families, leading to the addition of new letters and other symbols, with some versions, such as Kurdish, Uyghur, and old Bosnian being abugidas or true alphabets. It is also the basis for the tradition of Arabic calligraphy.
The Arabic script is an abjad. This means that in normal use the script represents only consonant and long vowel sounds. This approach is helped by the strong emphasis on consonant patterns in Semitic languages, but the Arabic script is also used for other kinds of language (such as Urdu and Uighur). See the Script Comparison Table for a brief overview of features.
Arabic script is written horizontally, right-to-left, but numbers and embedded Latin text are read left-to-right. Words are separated by spaces, and contain a mixture of consonants and long vowels. Diacritics can be used to indicate short vowel sounds or other phonetic information, where needed.
The script is cursive, and some basic letter shapes change radically, depending on what they join to. It is also very common for adjacent characters to ligate and to stretch to fill available space. Many of the characters share a common base form, and are distinguished by the number and location of dots or other small diacritics, called i'jam. For example, س ش ݜ ݰ ݽ ݾ ڛ ښ ڜ ۺ.
The Arabic script characters in Unicode 10.0 are spread across 3 blocks:
There are two additional blocks for presentation forms, but (with the exception of a handful of code points) these characters are only for compatibility with legacy encodings, and should not be used. Sometimes they are used by people to get around problems with Arabic support in applications, but this is a bad idea since it corrupts the underlying data, making it difficult to search, spellcheck, or do many other things that rely on the use of standard characters and their properties.
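As a rough illustration of the point above, here is a minimal Python sketch that flags presentation-form characters in a string and maps them back to standard characters. The function names are my own, and using NFKC normalization for the mapping is an assumption about what is acceptable for your data: NFKC also changes compatibility characters from other scripts, and the range check flags the handful of code points in these blocks that are legitimate to use.

```python
import unicodedata

# Code point ranges of the two Arabic Presentation Forms blocks (A and B).
PRESENTATION_RANGES = ((0xFB50, 0xFDFF), (0xFE70, 0xFEFF))

def find_presentation_forms(text):
    """Return (index, character name) pairs for characters in those blocks."""
    hits = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in PRESENTATION_RANGES):
            hits.append((i, unicodedata.name(ch, "<unnamed>")))
    return hits

def to_standard_characters(text):
    """Replace compatibility characters with standard ones via NFKC."""
    return unicodedata.normalize("NFKC", text)

legacy = "\uFEFB"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
print(find_presentation_forms(legacy))
print([f"U+{ord(c):04X}" for c in to_standard_characters(legacy)])
```

The second line of output lists the code points of lam followed by alef, which is how the same text would normally be stored.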
The following links give information about characters used for languages associated with this script. The numbers in parentheses are for non-ASCII characters.
There are separate pages about the Urdu Writing System and the Uighur Writing System, which map the Arabic script characters to sounds in a slightly different way from that of the Arabic language and treat the characters differently.
For character-specific details see Arabic character notes.
The main Unicode Arabic block contains 153 letters, with 77 more in the extended blocks. As shown in the previous section, only a small subset of those are used to write a given language. The others represent special characters added to the repertoire for one or other of the many languages for which the Arabic script is used.
The vast majority of letters represent consonants. A few represent long vowels.
The following letters are those generally recognised as constituting the Arabic alphabet.
Of those, three letters can represent either consonants or long vowels.
Other letters commonly found in Arabic include:
ء [U+0621 ARABIC LETTER HAMZA] represents the glottal stop sound. For historical reasons, it is treated as an orthographic sign rather than as a letter of the alphabet. It sometimes stands alone, but usually appears with a 'carrier' letter (alef, waw, or yeh), for which separate precomposed characters are available in Unicode (أ إ ؤ ئ). Examples of use include إكرام 'ikrām, نائم nā'im, and بناء binā'.
In modern printed Arabic, the hamza is rarely shown when it occurs at the beginning of a word, but may appear in conjunction with another character. When the hamza is above another character you should typically use ٔ [U+0654 ARABIC HAMZA ABOVE] with the appropriate base character, although there are a number of exceptions. For more details, see the character description.
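One practical consequence of having both precomposed characters and a combining hamza is that the same visible text can be stored in two ways. The sketch below is only an illustration, with a helper name I made up. It shows that the precomposed alef with hamza and the sequence alef plus combining hamza are canonically equivalent, so comparisons are safer after normalizing both sides. It says nothing about which spelling is preferred for a given letter.

```python
import unicodedata

ALEF_WITH_HAMZA_ABOVE = "\u0623"  # precomposed character
ALEF = "\u0627"
HAMZA_ABOVE = "\u0654"            # combining mark

# NFD splits the precomposed character; NFC puts it back together.
nfd = unicodedata.normalize("NFD", ALEF_WITH_HAMZA_ABOVE)
nfc = unicodedata.normalize("NFC", ALEF + HAMZA_ABOVE)
print([f"U+{ord(c):04X}" for c in nfd])
print([f"U+{ord(c):04X}" for c in nfc])

def equal_ignoring_form(a, b):
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The word 'ikrām, spelled with U+0625 and with alef + U+0655 HAMZA BELOW.
precomposed = "\u0625\u0643\u0631\u0627\u0645"
decomposed = "\u0627\u0655\u0643\u0631\u0627\u0645"
print(precomposed == decomposed, equal_ignoring_form(precomposed, decomposed))
```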
Classical Arabic distinguishes between 'cutting' and 'joining' hamza. 'Cutting' means always pronounced, 'joining' means frequently elided.
The joining hamza is of little practical importance in modern Arabic pronounced without the old case endings. When it does appear in modern Arabic, ٱ [U+0671 ARABIC LETTER ALEF WASLA] is used to indicate a joining hamza.
آ [U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE] is used when either of the two following combinations of glottal stop and a vowel appear in a word:
ʔaʔ (hamza, short a, hamza) eg. آثار ʔaːθaːr
ʔaː (hamza, long a) eg. قرآن qurʔaːn
Normal pronunciation in both cases is ʔaː.
The madda sign is still very often shown in print.
ى [U+0649 ARABIC LETTER ALEF MAKSURA] represents the long a-vowel at the end of many words when it is written with yeh instead of an alef. In this case the yeh is typically printed without dots, to avoid confusion, although this is not universal. This spelling only occurs with certain words, and only when the final sound is aː, eg. معنى mæʕnaː. If any suffix is added, the spelling reverts to the normal alef, eg. معناهم mæʕnaː-hum.
ة [U+0629 ARABIC LETTER TEH MARBUTA] usually has no sound, but is sometimes pronounced t in specific grammatical contexts, eg. مدرسة mædræsæ.
It is used for historical reasons to write the feminine ending, æ – the dots are borrowed from teh (ت) – and is only used in final position. If any suffix is added, the ending is spelled with ت [U+062A ARABIC LETTER TEH], eg. مدرستنا mædræsæt-naː.
In modern Arabic it is not uncommon to find the two dots omitted, particularly on masculine proper names that have the feminine ending, eg. طلبة tulbæ.
Vowelled text may omit the short æ diacritic before the teh marbuta, because the sound is always the same.
The main Arabic block contains 52 combining characters, with 43 more in the Arabic Extended-A block. However, only a small number are typically used for normal, written Arabic, Persian, etc.
The standard diacritics in the Arabic language repertoire include the following:
The hamza diacritics were discussed above.
Multiple combining characters may be used for a single base character, such as when both a shadda and a vowel diacritic are used together.
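When two combining characters sit on one base, Unicode normalization may reorder them according to their canonical combining classes. The following sketch, using only Python's standard unicodedata module, shows that a shadda typed before a fatha comes out after it once the text is normalized. Code that looks for a specific sequence of diacritics should therefore normalize first, or accept either order.

```python
import unicodedata

BEH = "\u0628"
SHADDA = "\u0651"
FATHA = "\u064E"

# Canonical combining classes: marks on the same base are sorted by this value.
print(unicodedata.combining(SHADDA), unicodedata.combining(FATHA))  # 33 30

typed = BEH + SHADDA + FATHA               # shadda entered first
normalized = unicodedata.normalize("NFC", typed)
print([unicodedata.name(c) for c in normalized])
```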
As we saw above, long vowel locations are usually identified by letters.
Short vowels can be expressed using diacritics, however for languages such as Arabic, Persian and Urdu they are typically not used, unless there is a particular need to help the reader understand the pronunciation. On the other hand, when the script is used for Uighur, the vowel diacritics are used, as a matter of course. These diacritics are also used in the Koran.
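Because the same word may appear with or without vowel diacritics, it can be useful to compare text with the diacritics removed. This is a minimal sketch under my own assumptions: the function name is hypothetical, and the set of marks removed (U+064B to U+0652 plus the superscript alef U+0670) is a choice made for the example rather than a standard. It would not be appropriate for Uighur, where the source notes the vowels are written as a matter of course.

```python
import re

# Assumed set of marks to drop: tanween, short vowels, shadda, sukun
# (U+064B..U+0652) and the superscript alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

def strip_diacritics(text):
    """Return the text with the assumed set of diacritics removed."""
    return DIACRITICS.sub("", text)

vowelled = "\u0643\u0650\u062A\u064E\u0627\u0628\u064B\u0627"  # the word kitāban, fully vowelled
print(strip_diacritics(vowelled))
```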
For the Arabic language the vowel diacritics are:
There is a secondary set of vowel diacritics with origins in classical Arabic, where indefinite nouns and adjectives were marked by a final n-sound, called تنوين tænwiːn or, in English, 'nunation'. This is normally indicated by doubling the vowel diacritic.
On a word ending with an a-vowel (though not with a feminine ending or some other suffixes) an extra alef was also added at the end of the word. In modern Arabic printing the fathatan is usually dropped, but the alef is retained. The pronunciation of the ending æn is also retained in many words, eg. كِتَابًا kɪtæːbæn, فَرَسًا færæsæn.
When text is vowelled, ـْ [U+0652 ARABIC SUKUN] is used over a consonant to indicate that it is not followed by a vowel sound, eg. مَكْتَب maktab.
The diacritic ـّ [U+0651 ARABIC SHADDA] doubles the value of the consonant it is attached to, which is phonemically significant in Arabic, eg. رتّب rætːæbæ. It, too, is not often used, although sometimes it appears when vowel signs don't.
A common, though not universal, practice is to display any combining kasra below the shadda, rather than below the base consonant, eg. قَبِّل qæbːɪl. Some fonts, such as Amiri, don't do this.
ـٰ [U+0670 ARABIC LETTER SUPERSCRIPT ALEF] is used in certain Arabic words such as هٰذَا this or ذٰلِكَ that, and not forgetting اللّٰه Allah.
Only the main Arabic Unicode block contains punctuation. There are 12 items.
The Arabic language typically uses the following. For information about how these and punctuation marks from other blocks are used for the Arabic language, see the Text layout and Numbers sections below.
Only the main Arabic Unicode block contains symbols, none of which are widely used in Arabic language text.
Characters in the Arabic Presentation Forms blocks should not normally be used, but they contain just a few symbols that are not just for compatibility use, including the following.
For more information about how they are used, click on them and follow the links to the character notes page.
A set of Arabic-Indic digits is typically used in Middle Eastern and Gulf countries, whereas North African countries tend to use European digits. In neither area, however, is one digit style used exclusively.
There is a second set of digits in Unicode for use in languages such as Persian and Urdu. The glyph shapes are typically different for 4 of the digits (although there can also be differences between Persian and Urdu shapes).
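The two digit sets occupy separate ranges in Unicode: U+0660 to U+0669 for the Arabic-Indic digits and U+06F0 to U+06F9 for the extended set used for Persian and Urdu. As a small sketch (the table and function names are mine), converting between European digits and either set is a straight character-for-character mapping, and Python's int() already accepts both sets when parsing.

```python
import unicodedata

ASCII_DIGITS = "0123456789"
ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10))
EXTENDED_ARABIC_INDIC = "".join(chr(0x06F0 + i) for i in range(10))

TO_ARABIC_INDIC = str.maketrans(ASCII_DIGITS, ARABIC_INDIC)
TO_EXTENDED = str.maketrans(ASCII_DIGITS, EXTENDED_ARABIC_INDIC)

def localize_digits(text, table):
    """Replace European digits using one of the translation tables above."""
    return text.translate(table)

print(localize_digits("2018", TO_ARABIC_INDIC))
print(localize_digits("2018", TO_EXTENDED))

# Both digit sets have decimal digit values, so int() parses them directly.
print(int(localize_digits("2018", TO_EXTENDED)))
```

Whether to convert at all depends on the audience, since the source notes that neither digit style is used exclusively in any region.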
See also the information about handling expressions or sequences of numbers, below.
Arabic script has its own number separators, which are used in Arabic language text when the non-European digits are used. They are ٫ [U+066B ARABIC DECIMAL SEPARATOR] and ٬ [U+066C ARABIC THOUSANDS SEPARATOR].
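To make the use of these separators concrete, here is a hedged sketch that formats a number with Arabic-Indic digits and the two Arabic separators. The function name and the fixed two decimal places are assumptions for the example; a real application would more likely rely on a locale-aware formatting library.

```python
ARABIC_DECIMAL_SEPARATOR = "\u066B"
ARABIC_THOUSANDS_SEPARATOR = "\u066C"

# Map European digits and separators to their Arabic counterparts.
_TABLE = str.maketrans({
    ",": ARABIC_THOUSANDS_SEPARATOR,
    ".": ARABIC_DECIMAL_SEPARATOR,
    **{str(i): chr(0x0660 + i) for i in range(10)},
})

def format_arabic_number(value, decimals=2):
    """Format with grouping, then swap in Arabic-Indic digits and separators."""
    western = f"{value:,.{decimals}f}"  # e.g. 1,234.50
    return western.translate(_TABLE)

print(format_arabic_number(1234.5))
```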
Arabic also has its own versions of ٪ [U+066A ARABIC PERCENT SIGN] and ؉ [U+0609 ARABIC-INDIC PER MILLE SIGN].
Arabic script joins letters together. This results in four different shapes for most letters (including an isolated shape). The highlights in the example below show the same letter, ع [U+0639 ARABIC LETTER AIN], with three different joining forms.
A few Arabic script letters only join on the right-hand side.
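The compatibility characters in the Arabic Presentation Forms-B block happen to record which contextual shapes each letter has, so they can be used to inspect joining behaviour, even though they should not be used in text. The sketch below (the function name is my own) reads the decomposition data in Python's unicodedata module. It shows four forms for ain and only two for alef, which is one of the letters that joins on the right-hand side only.

```python
import unicodedata

def contextual_forms(letter):
    """Map form name -> presentation-form character for a single Arabic letter."""
    target = f"{ord(letter):04X}"
    forms = {}
    for cp in range(0xFE70, 0xFEFD):  # Arabic Presentation Forms-B
        parts = unicodedata.decomposition(chr(cp)).split()
        # Entries look like '<final> 0639'; skip ligatures and unassigned points.
        if len(parts) == 2 and parts[1] == target:
            forms[parts[0].strip("<>")] = chr(cp)
    return forms

AIN = "\u0639"
ALEF = "\u0627"
print(sorted(contextual_forms(AIN)))
print(sorted(contextual_forms(ALEF)))
```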
Ligated glyph forms are common in Arabic. Some, such as لا are mandatory. Most of the remainder depend on the font style. Traditional fonts tend to have more ligated forms than modern styles.
In more traditional fonts, you will also often see the join between certain adjacent characters above the baseline, like this:
rather than on the baseline, like this:
But actually a good font will constantly change the shape of glyphs slightly so as to create a more aesthetically pleasing, and in some cases more easily readable, flow.
When vowel or shadda diacritics are used they can be placed in different positions, according to the context.
When both shadda and vowel signs are present, a more complicated set of rules may be applied, depending on the font style, to determine the relevant positions. Vowel diacritics are placed above and below the shadda, rather than above and below the base character.
This section focuses mainly on Arabic language text, however attention is sometimes drawn to differences when the Arabic script is used for other languages.
Arabic script is written horizontally and right-to-left in the main, but as with most RTL scripts, numbers and embedded LTR script text are written left-to-right (producing 'bidirectional' text).
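The direction each character takes in bidirectional text is driven by its Unicode bidirectional class. This sketch does not implement the Unicode Bidirectional Algorithm; it only looks up the classes with Python's unicodedata module, and the helper name is mine. It shows that Arabic letters are class AL, Arabic-Indic digits are AN, and both European digits and the extended digits used for Persian are EN.

```python
import unicodedata

def bidi_classes(text):
    """Return the bidirectional class of each character in the text."""
    return [unicodedata.bidirectional(ch) for ch in text]

print(bidi_classes("\u0639"))          # an Arabic letter (ain)
print(bidi_classes("\u0661\u0662"))    # Arabic-Indic digits
print(bidi_classes("\u06F1\u06F2"))    # extended Arabic-Indic digits
print(bidi_classes("10-12"))           # European digits and a hyphen-minus
```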
A sequence of numbers, for example a range separated by hyphens, runs right to left in the Arabic language (and Thaana or Syriac scripts), whereas for Persian language text (and in Hebrew, N’Ko or Adlam scripts) it runs left to right.
In the following Arabic text, which is RTL overall, the numeric range is also ordered RTL, ie. it starts with 10 and ends with 12:
In Persian, however, the expression would run LTR, so this would be:
The Unicode Bidirectional Algorithm automatically produces the Arabic ordering when a sequence or expression follows Arabic text. However, a sequence that appears alone on a line doesn't benefit from this, so to make the text appear correctly for Arabic you should add U+061C ARABIC LETTER MARK (ALM) at the start of the line. This is effectively an invisible Arabic script character.
If you are writing in Persian, on the other hand, you don't need to add anything in this case.
However, if you are writing in Persian and the sequence or expression follows text, you need to either isolate the sequence directionally or precede it with U+200E LEFT-TO-RIGHT MARK (LRM) to make it look correct.
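The fixes described above amount to adding invisible characters around the sequence. Here is a minimal sketch for plain text, with helper names of my own. Using U+2066 LEFT-TO-RIGHT ISOLATE and U+2069 POP DIRECTIONAL ISOLATE for the isolation is my assumption about one way to do it; in HTML you would normally use markup for isolation instead.

```python
ALM = "\u061C"  # ARABIC LETTER MARK
LRM = "\u200E"  # LEFT-TO-RIGHT MARK
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def arabic_standalone_sequence(sequence):
    """Arabic: a sequence alone on a line gets a leading ALM."""
    return sequence if sequence.startswith(ALM) else ALM + sequence

def persian_sequence_after_text(sequence, isolate=False):
    """Persian: a sequence following text is isolated or preceded by LRM."""
    return LRI + sequence + PDI if isolate else LRM + sequence

print(repr(arabic_standalone_sequence("10-12")))
print(repr(persian_sequence_after_text("10-12")))
print(repr(persian_sequence_after_text("10-12", isolate=True)))
```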
Similar special ordering is applied to numbers in equations, such as 1 + 2 = 3, for Arabic language text.
The alphabetic baseline is a strong feature of Arabic script on the whole, since characters tend to join there. This is not always the case: for example, some adjacent pairs or ligatures have joins above the baseline, and initial letters in some fonts may start slightly above the baseline, but for most cases it remains a strong feature.
The nastaliq style of the script, on the other hand, uses arrangements of joined glyphs that cascade downwards from right to left, and resemble a strongly sloping baseline.
Words are separated by spaces.
Arabic language uses a mixture of Western and Arabic punctuation. Other languages using the Arabic script may use different punctuation, such as the full stop in Urdu.
For separators at the sentence level and below, the following are used in Arabic language text, where the left column indicates approximate equivalences to Latin script.
|comma||، [U+060C ARABIC COMMA]|
|semi-colon||؛ [U+061B ARABIC SEMICOLON]|
|colon||: [U+003A COLON]|
|sentence||. [U+002E FULL STOP]|
|question mark||؟ [U+061F ARABIC QUESTION MARK]|
Arabic language text uses ‐ [U+2010 HYPHEN], – [U+2013 EN DASH], and — [U+2014 EM DASH].
Emphasis can sometimes be expressed by stretching the baseline of one or more words. See the section on justification below for more information about baseline stretching.
Arabic language text typically uses « [U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK] and » [U+00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK] for quotation marks.
Usage tip It is important to note that the Unicode names for these marks should be ignored. 'Left-pointing' should be read 'Start', and 'Right-pointing' as 'End'. The direction in which the glyphs point will be automatically determined according to the base direction of the text.
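The reason the glyph direction takes care of itself is that both characters have the Unicode Bidi_Mirrored property. A small check with Python's unicodedata module is sketched below, along with a hypothetical quote() helper that always puts U+00AB first in memory, whatever the direction of the text.

```python
import unicodedata

START_QUOTE = "\u00AB"  # named LEFT-POINTING, used as the opening mark
END_QUOTE = "\u00BB"    # named RIGHT-POINTING, used as the closing mark

def quote(text):
    """Wrap text in guillemets in logical (memory) order."""
    return START_QUOTE + text + END_QUOTE

# 1 means the glyph is mirrored when the character is displayed right-to-left.
print(unicodedata.mirrored(START_QUOTE), unicodedata.mirrored(END_QUOTE))
print(quote("\u0645\u0631\u062D\u0628\u0627"))
```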
Arabic script justification can use a number of different techniques. These include stretching the baseline and the glyphs of the text, expanding inter-word spaces, application of ligatures or swash forms, etc. Typically these will be applied in combination. Where baseline stretching is applied, the rules for what can be stretched, and how much, are complicated, and differ across writing systems. (Elongation is not normally used at all for the ruq'a style.) It is not a question of simply adding equal-length extensions across the line.
The baseline extension character ـ [U+0640 ARABIC TATWEEL] is sometimes suggested as a way of producing justification by extending the baseline. However, when a browser window is resized, or when new text is added near the start of a paragraph, lines wrap differently and all the places where tatweel would be needed have to be recalibrated. Thus tatweels only work for static text with fixed dimensions.
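Since tatweels inserted by hand only suit one particular layout, text that is going to be re-flowed, searched or compared may be easier to handle with them removed. The following is just a sketch of that idea, not a recommendation from any specification, and the function name is hypothetical.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL

def strip_tatweel(text):
    """Remove baseline extension characters, leaving the letters untouched."""
    return text.replace(TATWEEL, "")

stretched = "\u0627\u0644\u0640\u0640\u0640\u062D\u0645\u062F"  # a word with three tatweels
plain = strip_tatweel(stretched)
print(len(stretched), len(plain))
```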
Better quality justification systems stretch glyphs, rather than adding baseline extensions. This dynamic stretching of glyphs is often called 'kashida'. In some typesetting systems, such as InDesign, the tatweel character serves more to indicate opportunities for stretching, and the glyph for the character itself is not shown.
It is very common to see baseline stretching in modern Arabic text where a word or phrase is stretched to fill a particular space, eg. the Arabic tag line (الابداع المتجدد Creativity renewed) below the word Lexus in the following image is stretched to be the same width.
Use the control below to see how your browser justifies the text sample here.
المادة 7 كل الناس سواسية أمام القانون ولهم الحق في التمتع بحماية متكافئة عنه دون أية تفرقة، كما أن لهم جميعاً الحق في حماية متساوية ضد أي تمييز يُخل بهذا الإعلان وضد أي تحريض على تمييز كهذا.
Other features to be investigated in this section include: text decoration, abbreviations & ellipsis, glyph controls, line breaking, hyphenation, first-letter styling, notes & footnotes, page layout.