
Modern Standard Arabic

orthography notes

Updated 13 December, 2024

This page brings together basic information about the Arabic script and its use for the Modern Standard Arabic language. It doesn't cover Quranic usage. It aims to provide a brief, descriptive summary of the modern, printed orthography and typographic features, and to advise how to write Arabic using Unicode.

Referencing this document

Richard Ishida, Modern Standard Arabic Orthography Notes, 13-Dec-2024, https://r12a.github.io/scripts/arab/arb

 


Phonological transcriptions should be treated as a guide, only. They are taken from the sources consulted, and may be narrow or broad, phonemic or phonetic, depending on what is available. They mostly represent pronunciation of words in isolation. For more detailed information about allophones, alternations, sandhi, dialectal differences, and so on, follow the links to cited references.

This is an interactive document. Click/tap on the following to reveal detailed information and examples for each character: (a) coloured characters in examples and lists; (b) link text on character names. If your browser supports it, your cursor will change as you hover over these items.

More about using this page

Character names. The names of characters in codepoint markup drop the initial ARABIC label (purely to reduce the length of the examples). In other places the full name can be found.

Navigation. The Toggle images icon opens the table of contents in a popup window. Dismiss it by clicking on the X alongside it, or by hitting the ESC key.

Detailed character notes. Clicking on coloured characters in lists or on character names opens panels that give detailed information about each character. This information is taken from the companion document, Arabic Character Notes. (Those panels can be dismissed by pressing on the ESC key.)

Transcriptions & transliterations. Phonological transcriptions are surrounded by ⌈corner brackets⌋, to indicate that they vary between narrow, [phonetic] and broad, /phonemic/ transcriptions.
Latin transcriptions between <angle brackets> represent the letters as commonly written in the Latin script.
A transliteration has also been developed especially for this orthography, and is generally based on the sound of a letter where possible, but where a letter has multiple pronunciations, the transliteration represents only one.
Transliterations provide perfect round-trip conversion between the native script and Latin, whereas Latin transcriptions rarely do.
When you click on an example to see its composition, the top of the panel that opens contains a transliteration, followed by the native text, then (if available) an IPA transcription.


Sample


المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق. وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضًا بروح الإخاء.

المادة 2 لكل إنسان حق التمتع بكافة الحقوق والحريات الواردة في هذا الإعلان، دون أي تمييز، كالتمييز بسبب العنصر أو اللون أو الجنس أو اللغة أو الدين أو الرأي السياسي أو أي رأي آخر، أو الأصل الوطني أو الإجتماعي أو الثروة أو الميلاد أو أي وضع آخر، دون أية تفرقة بين الرجال والنساء. وفضلاً عما تقدم فلن يكون هناك أي تمييز أساسه الوضع السياسي أو القانوني أو الدولي لبلد أو البقعة التي ينتمي إليها الفرد سواء كان هذا البلد أو تلك البقعة مستقلاً أو تحت الوصاية أو غير متمتع بالحكم الذاتي أو كانت سيادته خاضعة لأي قيد من القيود.

Source: Unicode UDHR, clauses 1 & 2

Usage & history

Origins of the Arabic script, 6thC – today.

Phoenician

└ Aramaic

└ Nabataean

└ Arabic

The Arabic script is the 2nd most widely used script after Latin by number of countries, and 3rd by number of speakers (after Latin and Chinese). It is used for writing the Arabic language and several other languages of Asia and Africa, such as Persian, Urdu, Azerbaijani, Pashto, Uighur, etc. Historically, it was used far more widely, as its spread followed that of Islam into many countries of not only West and Central Asia, and North Africa, but also Southern and Eastern Europe, South Asia, Malaysia, East Africa, etc.

ألأبجدية ٱلعربية‎ ʔalʔabd͡ʒadiːjaʰ lʕarabiːjaʰ‎ Arabic alphabet

The script was first used to write texts in Arabic, most notably the Qurʼān, the holy book of Islam. It descended from the Nabataean abjad, itself a descendant of the Phoenician script, and has been used since the 4th century for writing the Arabic language.

Many of the languages written in Arabic script are non-Semitic, and so employ very different sound systems from spoken Arabic. As a result the script has had to be adapted and is used slightly differently by speakers of different languages.

More information: Scriptsource, Wikipedia

Script code: arab
Language code: arb
Script type: abjad
Origin: wasia
Native speakers: 273,989,700
  
Total characters: 146
Letters: 48
Combining marks: 15
Symbols: 38
Punctuation: 23
Numbers: 10
Other: 12
Possible other: 0
Unicode blocks: 7

Character counts above are for this orthography but exclude ASCII.
  
Text direction: rtl
Post-consonant vowels: letters, marks, hides vowels, composite vowels
Standalone vowels: carrier Aleph ا
Case distinction: no
Cursive script: yes
Combining marks: >1 per base
Clusters marked: yes
Consonant clusters: diacritics
Other ligatures: yes
Word separator: space
Wraps at: word
Hyphenation: false
G Clusters OK?: yes
Justification: spaces, baseline stretching, swashes
Baseline: romn

Basic features

The Arabic script is an abjad. This means that in normal use the script represents only consonant and long vowel sounds. This approach is helped by the strong emphasis on consonant patterns in Semitic languages (however, the Arabic script is also adapted for use with other kinds of language, such as Urdu, Uighur, and African ajami, not all of which are abjads). See the table to the right for a brief overview of the features of Standard Arabic.

Arabic text runs right-to-left in horizontal lines, but numbers and embedded Latin text are read left-to-right.
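The mixed directionality described above follows from the bidirectional class that Unicode assigns to each character. The following is a minimal sketch, using only Python's standard `unicodedata` module, that prints the class for an Arabic letter, an Arabic-Indic digit, a European digit and a Latin letter. The sample selection and the helper name `bidi_classes` are illustrative choices, not part of the original notes.

```python
import unicodedata

# A few sample characters, keyed by their Unicode names.
samples = {
    "ARABIC LETTER ALEF": "\u0627",
    "ARABIC-INDIC DIGIT ONE": "\u0661",
    "DIGIT ONE": "1",
    "LATIN SMALL LETTER A": "a",
}

def bidi_classes(chars):
    """Return {name: bidirectional class} for a mapping of names to characters."""
    return {name: unicodedata.bidirectional(ch) for name, ch in chars.items()}

classes = bidi_classes(samples)
for name, cls in classes.items():
    print(f"{name}: {cls}")
```

Arabic letters carry the class AL (right-to-left), while the two kinds of digit carry AN and EN, which is why numbers keep their left-to-right order inside right-to-left text.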

The script is cursive, and some basic letter shapes change radically, depending on what they join to. It is also very common for adjacent characters to ligate and to stretch to fill available space. Many of the characters share a common base form, and are distinguished by the number and location of dots or other small diacritics, called i'jam. For example, س ‎ش ‎ݜ ‎ ݰ ‎ݽ ‎ݾ ‎ڛ ‎ښ ‎ڜ ‎ۺ.

There is no case distinction. Words are separated by spaces (except some very short, usually 1-letter conjunctions and prepositions, which attach to the following word).

❯ Consonant summary

Modern Standard Arabic has 28 letters in its alphabet, but regularly uses 8 more. Most of those involve representations of the hamza, for which the usage is complicated. This page also lists 3 letters for foreign sounds, and 6 others which are used infrequently.

A mandatory ligature has to be used for combinations of lam + alif.

The diacritic ◌ّU+0651 SHADDA indicates gemination in vowelled text.

❯ Vowel summary table

The orthography for the Arabic language is an abjad, and so vowels are written using a mixture of combining marks and letters in vocalised text, but normally the diacritics are not used (and so it is difficult to accurately read the text unless you recognise the consonant patterns). However these diacritics and other phonetic information can be written where needed, and are regularly used for Qur'anic texts, dictionaries, educational materials, and where the pronunciation needs to be made clear.

In vowelled text, the Arabic language uses 3 basic vowel diacritics, but 4 more and 1 letter are occasionally also used. Long vowel locations are marked by matres lectionis (consonants indicating vowel locations).

In vowelled text, ◌ْU+0652 SUKUN is used to indicate vowel absence in consonant clusters.

Arabic uses both European and native digits, and has local forms for several of the more common punctuation marks.

Joining forms

Because the Arabic script is 'cursive' (ie. joined-up) writing, letters tend to have different shapes depending on whether they join with adjacent letters or not (see Cursive script). In addition, vowels can be represented using different characters, depending on where in a word they appear.

In scripts such as Arabic, several characters have no left-joining form. In what follows we'll use the characters يU+064A LETTER YEH and دU+062F LETTER DAL to illustrate shapes. The former can join on both sides, but the latter can only join on the right.

Left-joining glyphs are commonly called initial; dual-joining are called medial; and right-joining are called final. Glyphs that don't join on either side are called isolated. However, these glyph shapes can be found in various places within a single word.

Word-initial characters usually have initial glyph shapes (eg. ي‍ ). However, characters that only join to the right will use an isolated glyph shape (eg. د ). Furthermore, words beginning with a vowel are always preceded by a vowel carrier, which is normally اU+0627 LETTER ALEF (eg. ای‍ or اَ ).

Word-medial characters will typically join on both sides (eg. ‍ي‍ ) but those that only join to the right will use a final glyph (eg. ‍د ). However, if either of those is preceded by another character that only joins to the right, the glyph shapes rendered will be initial (eg. ي‍ ) and isolated (eg. د ), respectively.

Word-final characters will typically use a final glyph shape (eg. ‍ي and ‍د ). However, if the previous character joins only to the right, they will use isolated glyph shapes (eg. ي and د ).

In all this contextual glyph shaping the basic shapes used for a character can vary significantly in a script like Arabic. This also includes some characters that only have ijam dots in certain contexts.
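The joining rules in this section can be sketched as a small function that assigns a form name to each letter of a word. This is only an illustration under stated assumptions: the set `RIGHT_JOINING` is a hand-written, incomplete list of letters that join only to the right, the input is assumed to contain letters only (no combining marks), and real text rendering is done by the font and shaping engine rather than by code like this.

```python
# Assumed, incomplete set of letters that only join to the preceding letter:
# alef and its hamza/madda forms, waw with hamza, dal, thal, reh, zain, waw, teh marbuta.
RIGHT_JOINING = set("\u0627\u0622\u0623\u0625\u0624\u062F\u0630\u0631\u0632\u0648\u0629")
# Hamza on the line joins on neither side.
NON_JOINING = {"\u0621"}

def joins_to_next(ch):
    """True if the letter can connect to the letter that follows it."""
    return ch not in RIGHT_JOINING and ch not in NON_JOINING

def joins_to_previous(ch):
    """True if the letter can connect to the letter that precedes it."""
    return ch not in NON_JOINING

def joining_forms(word):
    """Return 'isolated', 'initial', 'medial' or 'final' for each letter."""
    forms = []
    for i, ch in enumerate(word):
        prev_joins = i > 0 and joins_to_next(word[i - 1]) and joins_to_previous(ch)
        next_joins = (i < len(word) - 1 and joins_to_next(ch)
                      and joins_to_previous(word[i + 1]))
        if prev_joins and next_joins:
            forms.append("medial")
        elif next_joins:
            forms.append("initial")
        elif prev_joins:
            forms.append("final")
        else:
            forms.append("isolated")
    return forms

YEH, DAL = "\u064A", "\u062F"
print(joining_forms(YEH + DAL))        # yeh then dal
print(joining_forms(DAL + YEH))        # dal then yeh
print(joining_forms(DAL + YEH + DAL))  # dal, yeh, dal
```

With YEH and DAL as in the examples above, a word-final YEH after DAL comes out as isolated, because DAL cannot join to the letter that follows it.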

Character index

The index points to locations where a character is mentioned in this page, and indicates whether it is used by the Arabic orthography described here.


Letters


Basic consonants

ا U+0627 ARABIC LETTER ALEF | consonant/mater lectionis ∅ (aː) - ā
و U+0648 ARABIC LETTER WAW | consonant/mater lectionis w (uː) w ū aw
ي U+064A ARABIC LETTER YEH | consonant/mater lectionis j (iː) y ī ay
ج U+062C ARABIC LETTER JEEM | consonant d͡ʒ ʒ j
ظ U+0638 ARABIC LETTER ZAH | pharyngealised consonant ðˤ zˤ
ب U+0628 ARABIC LETTER BEH | consonant b b
ت U+062A ARABIC LETTER TEH | consonant t t
د U+062F ARABIC LETTER DAL | consonant d d
ط U+0637 ARABIC LETTER TAH | pharyngealised consonant
ض U+0636 ARABIC LETTER DAD | pharyngealised consonant
ك U+0643 ARABIC LETTER KAF | consonant k k
ق U+0642 ARABIC LETTER QAF | consonant q q
ف U+0641 ARABIC LETTER FEH | consonant f f
ث U+062B ARABIC LETTER THEH | consonant θ th
ذ U+0630 ARABIC LETTER THAL | consonant ð dh
س U+0633 ARABIC LETTER SEEN | consonant s s
ص U+0635 ARABIC LETTER SAD | pharyngealised consonant
ز U+0632 ARABIC LETTER ZAIN | consonant z z
ش U+0634 ARABIC LETTER SHEEN | consonant ʃ sh
خ U+062E ARABIC LETTER KHAH | consonant x kh
غ U+063A ARABIC LETTER GHAIN | consonant ɣ gh
ه U+0647 ARABIC LETTER HEH | consonant h h
ح U+062D ARABIC LETTER HAH | consonant ħ
ع U+0639 ARABIC LETTER AIN | consonant ʕ ʿ
م U+0645 ARABIC LETTER MEEM | consonant m m
ن U+0646 ARABIC LETTER NOON | consonant n n
ر U+0631 ARABIC LETTER REH | consonant r r
ل U+0644 ARABIC LETTER LAM | consonant l l

Extended consonants

ڤ U+06A4 ARABIC LETTER VEH (loan) | consonant v v
پ U+067E ARABIC LETTER PEH (loan) | consonant p p
چ U+0686 ARABIC LETTER TCHEH (loan) | consonant t͡ʃ ch

Vowels

ى U+0649 ARABIC LETTER ALEF MAKSURA | vowel á

Other

ء U+0621 ARABIC LETTER HAMZA | glottal stop ʔ
آ U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE | glottal stop ʔaː ā ’ā
أ U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE | glottal stop ʔ a u
إ U+0625 ARABIC LETTER ALEF WITH HAMZA BELOW | glottal stop ʔ i
ؤ U+0624 ARABIC LETTER WAW WITH HAMZA ABOVE | glottal stop ʔ
ئ U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE | glottal stop ʔ
ة U+0629 ARABIC LETTER TEH MARBUTA | feminine indicator - ʰ h t

ٱ U+0671 ARABIC LETTER ALEF WASLA (infrequent) | vowel a -
ڢ U+06A2 ARABIC LETTER FEH WITH DOT MOVED BELOW (infrequent) | consonant; Maghrebi form, used in North Africa. f f
ڧ U+06A7 ARABIC LETTER QAF WITH DOT ABOVE (infrequent) | consonant; Maghrebi form, used in North Africa. q q
U+08B2 ARABIC LETTER ZAIN WITH INVERTED V ABOVE (loan) | pharyngealized consonant; sometimes used for Berber sounds.
ـ U+0640 ARABIC TATWEEL (infrequent) | baseline extender

U+FDF2 ARABIC LIGATURE ALLAH ISOLATED FORM (infrequent) | ligature ʔaɫˈɫaːh Allāh
U+FDF4 ARABIC LIGATURE MOHAMMAD ISOLATED FORM | word ligature honorific
U+FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM (infrequent) | ligature
U+FDFB ARABIC LIGATURE JALLAJALALOUHOU (infrequent) | ligature

Combining marks


Vowels

◌َ U+064E ARABIC FATHA (infrequent) | vowel a a
◌ُ U+064F ARABIC DAMMA (infrequent) | vowel ʊ u
◌ِ U+0650 ARABIC KASRA (infrequent) | vowel ɪ i
◌ً U+064B ARABIC FATHATAN (infrequent) | vowel an an
◌ٌ U+064C ARABIC DAMMATAN (infrequent) | vowel ʊn un
◌ٍ U+064D ARABIC KASRATAN (infrequent) | vowel ɪn in
◌ٰ U+0670 ARABIC LETTER SUPERSCRIPT ALEF (infrequent) | vowel

Honorifics

◌ؐ U+0610 ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM | honorific diacritic honorific
◌ؑ U+0611 ARABIC SIGN ALAYHE ASSALLAM | honorific diacritic honorific
◌ؒ U+0612 ARABIC SIGN RAHMATULLAH ALAYHE | honorific diacritic honorific
◌ؓ U+0613 ARABIC SIGN RADI ALLAHOU ANHU | honorific diacritic honorific
◌ؔ U+0614 ARABIC SIGN TAKHALLUS | author name marker honorific

Other

◌ْ U+0652 ARABIC SUKUN (infrequent) | vowel absence marker
◌ّ U+0651 ARABIC SHADDA (infrequent) | gemination mark

◌ٓ U+0653 ARABIC MADDAH ABOVE (rare) | maddah diacritic; found in decomposed text only; used only with ا. ʔ
◌ٔ U+0654 ARABIC HAMZA ABOVE (rare) | hamza; found in decomposed text only. ʔ
◌ٕ U+0655 ARABIC HAMZA BELOW (rare) | hamza; found in decomposed text only. ʔ

These items only occur in decomposed text.
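The relationship between the precomposed hamza letters and the combining marks listed just above can be checked with Unicode normalisation. The sketch below uses Python's standard `unicodedata` module; the helper name `codepoints` is an illustrative choice.

```python
import unicodedata

ALEF_WITH_HAMZA_ABOVE = "\u0623"

def codepoints(text):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# NFD splits the precomposed letter into ALEF + combining HAMZA ABOVE.
decomposed = unicodedata.normalize("NFD", ALEF_WITH_HAMZA_ABOVE)
# NFC puts the two code points back together again.
recomposed = unicodedata.normalize("NFC", decomposed)

print(codepoints(decomposed))  # U+0627 U+0654
print(codepoints(recomposed))  # U+0623
```

Text that has been normalised to NFC will therefore not normally contain these three combining marks after the letters they compose with.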

Numbers

٠ U+0660 ARABIC-INDIC DIGIT ZERO | digit 0
١ U+0661 ARABIC-INDIC DIGIT ONE | digit 1
٢ U+0662 ARABIC-INDIC DIGIT TWO | digit 2
٣ U+0663 ARABIC-INDIC DIGIT THREE | digit 3
٤ U+0664 ARABIC-INDIC DIGIT FOUR | digit 4
٥ U+0665 ARABIC-INDIC DIGIT FIVE | digit 5
٦ U+0666 ARABIC-INDIC DIGIT SIX | digit 6
٧ U+0667 ARABIC-INDIC DIGIT SEVEN | digit 7
٨ U+0668 ARABIC-INDIC DIGIT EIGHT | digit 8
٩ U+0669 ARABIC-INDIC DIGIT NINE | digit 9

ASCII

1 U+0031 DIGIT ONE | digit 1
2 U+0032 DIGIT TWO | digit 2
3 U+0033 DIGIT THREE | digit 3
4 U+0034 DIGIT FOUR | digit 4
5 U+0035 DIGIT FIVE | digit 5
6 U+0036 DIGIT SIX | digit 6
7 U+0037 DIGIT SEVEN | digit 7
8 U+0038 DIGIT EIGHT | digit 8
9 U+0039 DIGIT NINE | digit 9
0 U+0030 DIGIT ZERO | digit 0
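Since both digit sets are in use, software often needs to convert between them. The following is a minimal sketch using `str.translate`; the function names are illustrative, and only the ten Arabic-Indic digits listed above are handled (not the separators or the Extended Arabic-Indic digits used for other languages).

```python
ARABIC_INDIC = "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669"
ASCII_DIGITS = "0123456789"

TO_ASCII = str.maketrans(ARABIC_INDIC, ASCII_DIGITS)
TO_ARABIC_INDIC = str.maketrans(ASCII_DIGITS, ARABIC_INDIC)

def to_ascii_digits(text):
    """Replace Arabic-Indic digits with European (ASCII) digits."""
    return text.translate(TO_ASCII)

def to_arabic_indic_digits(text):
    """Replace European (ASCII) digits with Arabic-Indic digits."""
    return text.translate(TO_ARABIC_INDIC)

native = "\u0661\u0662\u0663"          # Arabic-Indic digits one, two, three
print(to_ascii_digits(native))         # 123
print(int(native))                     # 123
print(to_arabic_indic_digits("2024") == "\u0662\u0660\u0662\u0664")  # True
```

Python's built-in `int()` already accepts Arabic-Indic digits, so the explicit mapping is mainly useful for display or for passing values to systems that expect ASCII.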

Punctuation

٫ U+066B ARABIC DECIMAL SEPARATOR | decimal separator
٬ U+066C ARABIC THOUSANDS SEPARATOR | thousands separator
٪ U+066A ARABIC PERCENT SIGN | percent sign
؉ U+0609 ARABIC-INDIC PER MILLE SIGN | per mille sign
U+2030 PER MILLE SIGN | per mille sign
U+2013 EN DASH | en dash
، U+060C ARABIC COMMA | comma ,
؛ U+061B ARABIC SEMICOLON | semicolon
U+2014 EM DASH | em dash
U+201D RIGHT DOUBLE QUOTATION MARK | quotation mark
U+201C LEFT DOUBLE QUOTATION MARK | quotation mark
U+2019 RIGHT SINGLE QUOTATION MARK | quotation mark
U+2018 LEFT SINGLE QUOTATION MARK | quotation mark
« U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK | quotation mark
» U+00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK | quotation mark
U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK | quotation mark
U+203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK | quotation mark
U+2026 HORIZONTAL ELLIPSIS | ellipsis

U+FD3E ORNATE LEFT PARENTHESIS (infrequent) | ornate parenthesis
﴿ U+FD3F ORNATE RIGHT PARENTHESIS (infrequent) | ornate parenthesis
؍ U+060D ARABIC DATE SEPARATOR (infrequent) | date separator
٭ U+066D ARABIC FIVE POINTED STAR (infrequent) | punctuation

ASCII

: U+003A COLON | colon :
. U+002E FULL STOP | full stop .
؟ U+061F ARABIC QUESTION MARK | question mark ?
! U+0021 EXCLAMATION MARK | exclamation mark !
( U+0028 LEFT PARENTHESIS | parenthesis (
) U+0029 RIGHT PARENTHESIS | parenthesis )
% U+0025 PERCENT SIGN | percentage mark

Symbols


Honorifics

U+FD40 ARABIC LIGATURE RAHIMAHU ALLAAH | word ligature honorific
U+FD41 ARABIC LIGATURE RADI ALLAAHU ANH | word ligature honorific
U+FD42 ARABIC LIGATURE RADI ALLAAHU ANHAA | word ligature honorific
U+FD43 ARABIC LIGATURE RADI ALLAAHU ANHUM | word ligature honorific
U+FD44 ARABIC LIGATURE RADI ALLAAHU ANHUMAA | word ligature honorific
U+FD45 ARABIC LIGATURE RADI ALLAAHU ANHUNNA | word ligature honorific
U+FD46 ARABIC LIGATURE SALLALLAAHU ALAYHI WA-AALIH | word ligature honorific
U+FD47 ARABIC LIGATURE ALAYHI AS-SALAAM | word ligature honorific
U+FD48 ARABIC LIGATURE ALAYHIM AS-SALAAM | word ligature honorific
U+FD49 ARABIC LIGATURE ALAYHIMAA AS-SALAAM | word ligature honorific
U+FD4A ARABIC LIGATURE ALAYHI AS-SALAATU WAS-SALAAM | word ligature honorific
U+FD4B ARABIC LIGATURE QUDDISA SIRRAH | word ligature honorific
U+FD4C ARABIC LIGATURE SALLALLAHU ALAYHI WAAALIHEE WA-SALLAM | word ligature honorific
U+FD4D ARABIC LIGATURE ALAYHAA AS-SALAAM | word ligature honorific
U+FD4E ARABIC LIGATURE TABAARAKA WA-TAAALAA | word ligature honorific
U+FD4F ARABIC LIGATURE RAHIMAHUM ALLAAH | word ligature honorific
U+FDCF ARABIC LIGATURE SALAAMUHU ALAYNAA | word ligature honorific
U+FDFD ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM (infrequent) | ligature
U+FDFE ARABIC LIGATURE SUBHAANAHU WA TAAALAA | word ligature honorific
U+FDFF ARABIC LIGATURE AZZA WA JALL | word ligature honorific

Ijam

U+FBB2 ARABIC SYMBOL DOT ABOVE (rare) | independent ijam symbol; pedagogical use only
U+FBB3 ARABIC SYMBOL DOT BELOW (rare) | independent ijam symbol; pedagogical use only
U+FBB4 ARABIC SYMBOL TWO DOTS ABOVE (rare) | independent ijam symbol; pedagogical use only
U+FBB5 ARABIC SYMBOL TWO DOTS BELOW (rare) | independent ijam symbol; pedagogical use only
U+FBB6 ARABIC SYMBOL THREE DOTS ABOVE (rare) | independent ijam symbol; pedagogical use only
U+FBB7 ARABIC SYMBOL THREE DOTS BELOW (rare) | independent ijam symbol; pedagogical use only
U+FBB8 ARABIC SYMBOL THREE DOTS POINTING DOWNWARDS ABOVE (rare) | independent ijam symbol; pedagogical use only
U+FBB9 ARABIC SYMBOL THREE DOTS POINTING DOWNWARDS BELOW (rare) | independent ijam symbol; pedagogical use only
U+FBBA ARABIC SYMBOL FOUR DOTS ABOVE (rare) | independent ijam symbol; pedagogical use only
U+FBBB ARABIC SYMBOL FOUR DOTS BELOW (rare) | independent ijam symbol; pedagogical use only
U+FBBC ARABIC SYMBOL DOUBLE VERTICAL BAR BELOW (rare) | independent ijam symbol; pedagogical use only
U+FBBD ARABIC SYMBOL TWO DOTS VERTICALLY ABOVE (rare) | independent ijam symbol; pedagogical use only
U+FBBE ARABIC SYMBOL TWO DOTS VERTICALLY BELOW (rare) | independent ijam symbol; pedagogical use only
U+FBBF ARABIC SYMBOL RING (rare) | independent ijam symbol; pedagogical use only
U+FBC0 ARABIC SYMBOL SMALL TAH ABOVE (rare) | independent ijam symbol; pedagogical use only
U+FBC1 ARABIC SYMBOL SMALL TAH BELOW (rare) | independent ijam symbol; pedagogical use only
U+FBC2 ARABIC SYMBOL WASLA ABOVE (rare) | independent ijam symbol; pedagogical use only

Currency

U+FDFC RIAL SIGN (infrequent) | currency symbol ri.jaːl

Other

ZWNJ U+200C ZERO WIDTH NON-JOINER | zero-width non-joiner
ZWJ U+200D ZERO WIDTH JOINER | zero-width joiner
RLI U+2067 RIGHT-TO-LEFT ISOLATE | rtl isolate
RLE U+202B RIGHT-TO-LEFT EMBEDDING | rtl embed
LRI U+2066 LEFT-TO-RIGHT ISOLATE | ltr isolate
LRE U+202A LEFT-TO-RIGHT EMBEDDING | ltr embed
FSI U+2068 FIRST STRONG ISOLATE | first-strong isolate
PDI U+2069 POP DIRECTIONAL ISOLATE | pop direction isolate
PDF U+202C POP DIRECTIONAL FORMATTING | pop direction
RLM U+200F RIGHT-TO-LEFT MARK | rtl mark
LRM U+200E LEFT-TO-RIGHT MARK | ltr mark
ALM U+061C ARABIC LETTER MARK | arabic letter mark
CGJ U+034F COMBINING GRAPHEME JOINER | combining grapheme joiner
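As an illustration of how the isolate controls in this list pair up, the sketch below wraps a run of text in an opening isolate character and the closing PDI. The function name `isolate` and its `direction` parameter are hypothetical, and in HTML or other markup the usual advice is to use the markup's own direction features rather than these control characters.

```python
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def isolate(text, direction=None):
    """Wrap text in a directional isolate.

    direction is "rtl", "ltr", or None (let the first strong character decide).
    """
    opener = {"rtl": RLI, "ltr": LRI, None: FSI}[direction]
    return f"{opener}{text}{PDI}"

# Insert a Latin-script name into plain text without it affecting
# the ordering of the surrounding right-to-left content.
wrapped = isolate("Unicode 16", "ltr")
print([f"U+{ord(ch):04X}" for ch in (wrapped[0], wrapped[-1])])
```

Every opening isolate needs its own PDI, which is why the helper always adds both ends together.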

Phonology

Click on the sounds to reveal locations in this document where they are mentioned.

Phones in a lighter colour are non-native or allophones. Source Wikipedia.

Vowel sounds

Plain vowels

ɪ ʊ o ɛ æ a ɑ ɑ

The above chart is for 'Standard Arabic'. Even so, many regional variants of the standard pronunciation exist, not to mention local dialects.16

o and e are sometimes used for foreign words, and are sometimes introduced into speech as allophones due to regional dialects.

In addition, the adjacent consonants can also affect the vowel sounds. In particular, the sound a is retracted to ɑ around a neighboring r, q or emphatic consonants. æ is also a common allophone of a.16

Most of the phonetic transcriptions for examples in this page therefore just use basic phonemic representations when it comes to vowels.

For more details, see Wikipedia.

Diphthongs

aj aw

Source: 16.

The diphthong aj is colloquially pronounced more like ej, however in this page we will continue to transcribe it phonetically per the official pronunciation.

Consonant sounds

labial dental alveolar post-alveolar palatal velar uvular pharyngeal glottal
stops p b t d       k ɡ q   ʔ
ejective                
affricates       d͡ʒ          
fricatives f v θ ð s z ʃ ʒ   x ɣ χ ʁ ħ ʕ h
ejective   ðˤ            
nasals m   n          
approximants w   l ɫ   j      
trills/flaps     r    

Modern Standard Arabic covers many territories, most of which have their own dialects or languages, and these tend to influence the local pronunciation of Standard Arabic. In the chart above, we remain conservative, only mentioning variants that tend to apply to the standard pronunciation. For a slightly more detailed set of notes, see Wikipedia.

p and v are sometimes pronounced by some speakers for foreign words, such as باكستان paː.ki.ˈstaːn Pakistan, and فيروس vi.rus virus.

Sometimes alternative letters are used for such words (see foreign).16

Although most dialects include it as a phoneme, ɡ is only used in Modern Standard Arabic as a marginal phoneme to pronounce some dialectal and loan words.16

The sound ɫ occurs as a phoneme in a handful of loanwords, though not in all pronunciations. It also occurs in the name اللّٰه ʔaɫ.ˈɫaːh Allah

The sound d͡ʒ is used in Algerian, Hejazi, Najdi, Iraqi, and Gulf regions, whereas ʒ is used in Moroccan, Tunisian, Egyptian, Levantine, and Israeli regions. In both cases, the sound is written using جU+062C LETTER JEEM.16

ظU+0638 LETTER ZAH is pronounced zˤ in some regions, rather than ðˤ.

Tone

Arabic is not a tonal language.

Structure

The following notes on structure are taken from Wikipedia.16

[C1] [S1] V [S2] [C2 [C3]]
Legend
C
Consonant.
V
Vowel.
S
Semi-vowel.

Arabic syllable structure consists of an optional syllable onset, normally a single consonant (onset clusters occur only in loanwords); an obligatory syllable nucleus, consisting of a vowel optionally preceded by and/or followed by a semivowel; and an optional syllable coda, consisting of one or two consonants.

The following restrictions apply:

Onset
C1 can be any consonant, including a liquid (l r). The onset is composed only of one consonant; consonant clusters are only found in loanwords. Sometimes an epenthetic a is inserted between consonants.
Nucleus
Includes S1 V S2.
Coda
C2 and C3 can be any consonant.
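The template above can be expressed as a regular expression over segment classes. This is a minimal sketch under stated assumptions: the syllable is supplied as a list of already-separated segments, and the `VOWELS` and `SEMIVOWELS` sets are small illustrative inventories rather than a full phoneme list. Anything not in those sets is treated as a consonant.

```python
import re

# Template from the notes above: [C1] [S1] V [S2] [C2 [C3]]
TEMPLATE = re.compile(r"C?S?VS?(?:CC?)?")

# Assumed, illustrative segment classes.
VOWELS = {"a", "i", "u", "aː", "iː", "uː"}
SEMIVOWELS = {"j", "w"}

def shape(segments):
    """Map each segment to V (vowel), S (semivowel) or C (consonant)."""
    classes = []
    for seg in segments:
        if seg in VOWELS:
            classes.append("V")
        elif seg in SEMIVOWELS:
            classes.append("S")
        else:
            classes.append("C")
    return "".join(classes)

def fits_template(segments):
    """True if the whole syllable matches [C1] [S1] V [S2] [C2 [C3]]."""
    return TEMPLATE.fullmatch(shape(segments)) is not None

print(shape(["ʕ", "a", "j", "n"]), fits_template(["ʕ", "a", "j", "n"]))
print(shape(["l", "uː", "b"]), fits_template(["l", "uː", "b"]))
print(shape(["s", "t", "a"]), fits_template(["s", "t", "a"]))
```

The third call shows an onset cluster, which the template rejects, in line with the restriction that such clusters are only found in loanwords.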

Vowels

The orthography for the Arabic language is an abjad, and so vowels are written using a mixture of combining marks and letters in vocalised text, but normally the diacritics are not used (and so it is difficult to accurately read the text unless you recognise the consonant patterns). However these diacritics and other phonetic information can be written where needed, and are regularly used for Qur'anic texts, dictionaries, educational materials, and where the pronunciation needs to be made clear.

In vowelled text, the Arabic language uses 3 basic vowel diacritics, but 4 more and 1 letter are occasionally also used. Long vowel locations are marked by matres lectionis (consonants indicating vowel locations).

In vowelled text, ◌ْU+0652 SUKUN is used to indicate vowel absence in consonant clusters.

Vowel summary table

The following table summarises the main vowel to character assignments, in both vowelled and unvowelled forms.

Each table cell shows word-initial, word-medial, and word-final forms from right to left. The glyphs shown are illustrative; alternative shapes may occur (see Joining forms).

Unvowelled:
i إ    
إي‍ ‍ي‍ ‍ي
u أ    
أو ‍و ‍و
a أ    
آ ‍ا ‍ا / ‍ى
Vowelled:
i إِ ◌ِ ◌ِ
إِي‍ ◌ِ‍ي‍ ◌ِ‍ي
u أُ ◌ُ ◌ُ
أُو ◌‍ُو ◌‍ُو
a أَ ◌َ ◌َ
آ ◌َ‍ا ◌َ‍ا / ◌َ‍ى

For additional details see Vowel sounds to characters.

Because the sukun is also dropped in non-vocalised text, where a mater lectionis remains it only implies a vowel location, since it may either represent a consonant or a glide.

In word-initial position vowels are attached to اU+0627 LETTER ALEF. However, more often than not, a hamza is also attached to the alef to indicate the glottal stop. Although this actually constitutes a consonant plus vowel, in unvocalised text the alef (plus any hamza) signals the location of a vowel. The table therefore shows these maximal combinations.

The sounds , o, and are only used for transliterations of foreign words, and are spelled identically to u, and , respectively.

The letter ىU+0649 LETTER ALEF MAKSURA, used as an alternative for a final aː, is a dedicated vowel character (see Alef maksura).

Ijam and tashkil

The Unicode Standard makes an important distinction between ijam and tashkil diacritics, which is particularly relevant for this section about vowels. For more information, see Ijam, tashkil, hamza.

Post-consonant vowels

Vowels are written using a mixture of combining marks and letters in vocalised text, but normally the diacritics are not used (and so it is difficult to accurately read the text unless you recognise the consonant patterns). However these diacritics and other phonetic information can be written where needed, and are regularly used for Qur'anic texts, dictionaries, educational materials, and where the pronunciation needs to be made clear.

In vowelled text, the Arabic language uses 3 basic vowel diacritics, but 4 more and 1 letter are occasionally also used. Long vowel locations are marked by matres lectionis (consonants indicating vowel locations).

Alef maksura


ى U+0649 | á | à

ىU+0649 LETTER ALEF MAKSURA represents aː at the end of many words when it is written with YEH instead of an ALEF. In this case, YEH has no dots below, and this code point produces the requisite shape. It is the only letter used for vowels alone.

حتى  ≡  حَتَّى ħat.taː until

It also produces aː in informal speech for some words that end with nunation (though formally the ending is pronounced -an).

معنى  ≡  مَعْنًى maʕ.naː meaning, concept (formal)

If any suffix is added, the spelling reverts to the normal alef, eg.

معناهم mæʕnaː-hum

Matres lectionis

In the spelling of Arabic, Hebrew, and other Semitic languages, mater lectionis refers to a consonant that may also indicate the location of a vowel. (Source: Wikipedia.)


ا U+0627 | - ā | ɑ
و U+0648 | w ū aw | w
ي U+064A | y ī ay | y

In Arabic, the consonants listed just above may indicate the location of a long vowel, eg. قلوب qu.luːb hearts, and تاريخ tɑː.riːx history. They are always visible, whether or not the text shows vowel diacritics.

اU+0627 LETTER ALEF in word final position commonly either represents a short a, eg.

أنا ˈʔa.na I

or is silent, eg.

رَسْمِيًا ras.miː.jan officially

كَتَبُوا kæ.tæ.buː they wrote

Combining marks used for vowels

In situations where it is necessary to unambiguously indicate the underlying vowel sounds, short vowels can be expressed using diacritics called harakat, eg. العَرَبِيَّة al.ʕa.ra.bij.ja Arabic

However for languages such as Arabic, Persian and Urdu they are typically not used unless there is a particular need to help the reader understand the pronunciation. The previous example would therefore usually be written العربية al.ʕa.ra.bij.ja Arabic

On the other hand, when the script is used for some other languages (such as Uighur, Kashmiri, or Hausa), all vowels are shown, as a matter of course. These diacritics are also used in the Qur'an (though not originally), to reduce ambiguity.
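The relationship between vowelled and unvowelled spellings can be illustrated by stripping the combining marks from a vocalised word. The sketch below assumes that the marks to remove are the tanwin and harakat, shadda, sukun and superscript alef (U+064B to U+0652, plus U+0670); the function name `strip_tashkil` is an illustrative choice, and the matres lectionis and hamza letters are deliberately left untouched.

```python
import re

# Assumed set of marks dropped in unvowelled text:
# U+064B..U+0650 tanwin and harakat, U+0651 shadda, U+0652 sukun,
# U+0670 superscript alef.
TASHKIL = re.compile("[\u064B-\u0652\u0670]")

def strip_tashkil(text):
    """Remove vowel and related diacritics, leaving the letters in place."""
    return TASHKIL.sub("", text)

# The vowelled and unvowelled spellings of the example word 'Arabic'.
vowelled = "\u0627\u0644\u0639\u064E\u0631\u064E\u0628\u0650\u064A\u0651\u064E\u0629"
unvowelled = "\u0627\u0644\u0639\u0631\u0628\u064A\u0629"

print(strip_tashkil(vowelled) == unvowelled)  # True
```

The reverse operation is not possible mechanically, which is the practical meaning of the statement that readers must recognise the consonant patterns.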

Basic harakat

The basic short vowel marks in the Arabic language repertoire are:


◌َ U+064E (infreq.) | a | a | a
◌ُ U+064F (infreq.) | ʊ | u | u
◌ِ U+0650 (infreq.) | ɪ | i | i

Although the phonemic distinctions for Arabic involve only 3 vowel sounds, the phonetic realisation often varies with context. For example, Vowel sounds to characters includes e and o sounds, which can be found in a few foreign loan words.

Tanwīn

Tanwin refers to a secondary set of vowel diacritics with origins in classical Arabic, where indefinite nouns and adjectives were marked by a final n-sound, called تنوين tænwiːn or, in English, 'nunation'. This is indicated by visually doubling the vowel diacritic, but there are precomposed Unicode characters for each combination.


◌ً U+064B (infreq.) | an | an | aⁿ
◌ٌ U+064C (infreq.) | ʊn | un | uⁿ
◌ٍ U+064D (infreq.) | ɪn | in | iⁿ

In modern text this is particularly common for adverbs.751

◌ًU+064B FATHATAN is often used in the combination ◌ًاU+064B FATHATAN + U+0627 LETTER ALEF, where the ALEF is silent and the ending is pronounced -an, eg. فَوْرًا  ≡  فورا faw.ran immediately

The same applies before ALEF MAKSURA in formal pronunciation (but see also Alef maksura).

أَفْعًى  ≡  أفعى ʔaf.ʕan snake

If it appears as ◌َةًU+064E FATHA + U+0629 LETTER TEH MARBUTA + U+064B FATHATAN the pronunciation is -atan, eg. عَادَةً  ≡  عادة ʕaː.da.tan usually

After a final YEH, the pronunciation has an extra j sound,751 ie. -iːjan, eg. رَسْمِيًا  ≡  رسميا ras.miː.jan officially

In modern Arabic printing the fathatan may be dropped, but the alef is retained.

The other two diacritics are much less common.751

Superscript alef


ٰinfreq. ̍0670

◌ٰU+0670 LETTER SUPERSCRIPT ALEF is used in only a few Arabic words; however, they tend to be commonly used words. It represents the sound aː.

هٰذَا haː.ðaː this

اللّٰه ʔaɫ.ˈɫaːh Allah

Diphthongs & glides

The 2 diphthongs aj and aw are written using a combination of short a with the semivowels يU+064A LETTER YEH and وU+0648 LETTER WAW.16 In vocalised text this usage can be detected by the presence of sukun, but in non-vocalised text it is not so obvious.

عين  ≡  عَيْن ʕajn eye

عود  ≡  عَوْد ʕawd return

Vowel length

Long vowels are generally distinguished from short vowels by the use of matres lectionis (see Matres lectionis).

Composite vowels

A composite vowel sign is a single vowel sound or diphthong that is represented by more than one code point from the set of vowel signs, repurposed consonants, and diacritics available. It is the opposite of a circumgraph.

The 5 composite vowels listed here only appear in vocalised text. Three represent long vowels, where the vowel diacritic is followed by a letter. Two more represent standalone vowels, with alef used as a carrier for the vowel diacritics. Diphthongs and glides are not included here.

Click on the letters for examples.


◌ِي U+0650 U+064A | ī | iy
◌ُو U+064F U+0648 | ū | uw
◌َا U+064E U+0627 | ā
اِ U+0627 U+0650 (infreq.) | i | i | ɑi
اَ U+0627 U+064E (infreq.) | a | a | ɑa
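The three long-vowel composites can be located in vocalised text by searching for a vowel diacritic followed by its matching letter. The sketch below is illustrative only: the names are hypothetical, it assumes fully vowelled input with the diacritic immediately before the letter, and it skips a letter that itself carries a diacritic (such as sukun or shadda), since in that case the letter is acting as a consonant or glide rather than marking length.

```python
import re

FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
ALEF, WAW, YEH = "\u0627", "\u0648", "\u064A"

# Diacritic + letter pairs for the three long vowels.
LONG_VOWELS = {KASRA + YEH: "iː", DAMMA + WAW: "uː", FATHA + ALEF: "aː"}

# Match a pair only when the letter is not followed by a mark of its own.
PATTERN = re.compile(
    "(?:" + "|".join(map(re.escape, LONG_VOWELS)) + ")(?![\u064B-\u0652])"
)

def find_long_vowels(text):
    """Return (index, vowel) for each long-vowel composite found in text."""
    return [(m.start(), LONG_VOWELS[m.group()]) for m in PATTERN.finditer(text)]

history = "\u062A\u064E\u0627\u0631\u0650\u064A\u062E"          # vowelled 'history'
they_wrote = "\u0643\u064E\u062A\u064E\u0628\u064F\u0648\u0627"  # vowelled 'they wrote'

print(find_long_vowels(history))
print(find_long_vowels(they_wrote))
```

The diphthongs aj and aw are not reported, because they are written with fatha followed by YEH or WAW, which is not one of the three pairs searched for.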

Standalone vowels


ا U+0627 | ∅ (aː) | - ā | ɑ

Standalone vowels at the beginning of a word usually incorporate اU+0627 LETTER ALEF.

In principle, Arabic has very few true standalone vowels, since vowels are nearly always preceded by a glottal stop or other consonant, and in Arabic, unlike many other languages, the glottal stop is phonemic and a distinct letter of the alphabet.

The definite article

How does the orthography handle vowels that are not preceded by a consonant?

When اU+0627 LETTER ALEF appears without any hamza diacritic at the beginning of a sentence, it may be pronounced a, for the definite article, or i in various other circumstances (ie. there is no glottal stop).

When preceded by another word these sounds are elided, although the spelling remains unchanged.

انتباه  ≡  اِنْتِبَاه in.ti.bah caution

المدير  ≡  اَلْمُدِير al.mu.diːr the manager

Word-initial characters


ا U+0627 | - ā | ɑ
أ U+0623 | a u | ɑ͑
إ U+0625 | i
ٱ U+0671 (infreq.) | - | ɑ̄
آ U+0622 | ā ’ā | ɑ̃

As just mentioned, many Arabic words begin with a glottal stop followed by a vowel, which can be indicated using the characters listed above. Strictly speaking, they represent consonants, although, like the matres lectionis, they strongly imply the presence of a vowel.

Alef alone or with wasla. When اU+0627 LETTER ALEF appears on its own it represents a vowel that will usually be elided if it doesn't appear at the beginning of a sentence (but the spelling doesn't change). Before لU+0644 LETTER LAM it is usually part of the definite article and is pronounced a (see also Arabic definite article), and otherwise it is usually pronounced i. Note that Arabic words do not begin with onset clusters, so borrowed words will often add this vowel to the start of the word, but it is also used for various grammatical forms of words.

المدير  ≡  اَلْمُدِير al.mu.diːr the manager

اسم  ≡  اِسْم ɪsm name

In classical Arabic, this behaviour was indicated using the wasla diacritic, ie. as ٱU+0671 LETTER ALEF WASLA. In modern text, however, this is rarely seen.

أنت المدير  ≡  أَنْتَ اَلْمُدِير ʔanta l.mu.diːr you are the manager

ما ٱسمك  ≡  مَا ٱسْمُكَ maː smuka What's your name?

Alef with hamza. In the majority of cases, the alef also carries a hamza that indicates that the glottal stop and vowel are always pronounced. The list of characters above shows 4 variant forms of this. أU+0623 LETTER ALEF WITH HAMZA ABOVE can be followed by a(ː) or u(ː), whereas i(ː) (and only that vowel) follows إU+0625 LETTER ALEF WITH HAMZA BELOW.

آU+0622 LETTER ALEF WITH MADDA ABOVE represents the sound ʔaː (see Alef madda).

Vowel sounds to characters

This section maps Modern Standard Arabic vowel sounds to common graphemes in the Arabic orthography.

The entries show typical word-initial, word-medial, and word-final usage. The joining forms shown are illustrative; alternative shapes may occur (see Joining forms). They are also fully-vowelled, although the examples show normal unvowelled usage as well as vowelled.

Sounds listed as 'infrequent' are allophones, or sounds used for foreign words, etc. Light coloured characters occur infrequently.

Plain vowels

initial إِي‍ U+0625 LETTER ALEF WITH HAMZA BELOW + U+0650 KASRA + U+064A LETTER YEH eg. إيقاف  ≡  إِيقَاف iː.qaːf parking

medial ◌ِ‍يU+0650 KASRA + U+064A LETTER YEH eg. حشيش  ≡  حَشِيش ħa.ʃiːʃ grass

final ◌ِ‍يU+0650 KASRA + U+064A LETTER YEH eg. في  ≡  فِي fiː in

ɪ

initial اِU+0627 LETTER ALEF + U+0650 KASRA eg. انتباه  ≡  اِنْتِبَاه in.ti.bah caution

  إِU+0625 LETTER ALEF WITH HAMZA BELOW + U+0650 KASRA eg. إنسان  ≡  إِنْسَان ʔin.saːn man (human being)

medial ◌ِU+0650 KASRA eg. باكستان  ≡  بَاكِسْتَان paː.ki.ˈstaːn Pakistan

final ◌ِU+0650 KASRA eg. بسبب  ≡  بِسَبَبِ bi.sa.ba.bi because

ʊ

initial أُU+0623 LETTER ALEF WITH HAMZA ABOVE + U+064F DAMMA eg. أذن  ≡  أُذُن#أُذْن ʔʊ.ðʊn ear

medial ◌ُU+064F DAMMA eg. كنة  ≡  كُنّة kun.na wing

final ◌ُU+064F DAMMA eg. منذ  ≡  مُنْذُ mʊn.ðʊ since

initial أُو‍ U+0623 LETTER ALEF WITH HAMZA ABOVE + U+064F DAMMA + U+0648 LETTER WAW eg. أوروبا  ≡  أُورُوبَّا ʔuː.rub.baː Europe

medial ◌ُ‍وU+064F DAMMA + U+0648 LETTER WAW eg. دودة  ≡  دُودَة duː.da worm

final ◌ُ‍وU+064F DAMMA + U+0648 LETTER WAW eg. كتبوا  ≡  كَتَبُوا kæ.tæ.buː they wrote

medial ◌ِ‍يU+0650 KASRA + U+064A LETTER YEH eg. سكرتير  ≡  سِكْرِتِير sɪ.krɪ.teːr secretary

o

initial أُU+0623 LETTER ALEF WITH HAMZA ABOVE + U+064F DAMMA eg. أكتوبر  ≡  أُكتُوبَر ok.toːbɪr October

medial ◌ُU+064F DAMMA

final ◌ُU+064F DAMMA

initial أُو‍ U+0623 LETTER ALEF WITH HAMZA ABOVE + U+064F DAMMA + U+0648 LETTER WAW eg. أوتيل  ≡  أُوتِيل oː.teːl hotel

medial ◌ُ‍وU+064F DAMMA + U+0648 LETTER WAW eg. بنطلون  ≡  بَنْطَلُون ban.t̴a.loːn trousers

final ◌ُ‍وU+064F DAMMA + U+0648 LETTER WAW

a

initial اَ‍ U+0627 LETTER ALEF + U+064E FATHA eg. الآن  ≡  اَلْآنَ al.ʔaː.na now

  أَ‍ U+0623 LETTER ALEF WITH HAMZA ABOVE + U+064E FATHA eg. أخضر  ≡  أَخْضَر ʔax.dˤar green

medial ◌َU+064E FATHA eg. حجر  ≡  حَجَر ħa.d͡ʒar stone

final ◌َU+064E FATHA eg. شرب  ≡  شَرِبَ ʃa.ri.ba to drink

ɑː

initial آ‍ U+0622 LETTER ALEF WITH MADDA ABOVE eg. آنسة  ≡  آنِسَة ʔaː.ni.sa young woman

medial ◌َ‍اU+064E FATHA + U+0627 LETTER ALEF eg. اثنان  ≡  اِثْنَان ʔiθ.naːn two

  ◌ٰU+0670 LETTER SUPERSCRIPT ALEF Only in a few common words, eg. هذا  ≡  هٰذَا haː.ðaː this

final ◌َ‍اU+064E FATHA + U+0627 LETTER ALEF eg. إذا  ≡  إِذَا ʔi.ðaː if

  ◌َ‍ىU+064E FATHA + U+0649 LETTER ALEF MAKSURA Only in certain words, eg. متى  ≡  مَتَى ma.taː when

Diphthongs

aj

  ◌َ‍يU+064E FATHA + U+064A LETTER YEH eg. عين  ≡  عَيْن ʕajn eye

aw

  ◌َ‍وU+064E FATHA + U+0648 LETTER WAW eg. عود  ≡  عَوْد ʕawd return

Consonants

Modern Standard Arabic has 28 letters in its alphabet, but regularly uses 8 more. Most of those involve representations of the hamza, for which the usage is complicated. This page also lists 3 letters for foreign sounds, and 6 others which are used infrequently.

A mandatory ligature has to be used for combinations of lam + alif.

The diacritic ◌ّU+0651 SHADDA indicates gemination in vowelled text.

Consonant summary

The following table summarises the main consonant to character assignments.

Stops

15
pپloanpp067E
bب bb0628
tت tt062A
dد dd062F
ط 0637
ض 0636
kك kk0643
qق qq0642
qڧinfreq.q06A7
  unused  
ʔء  ʔ0621
ʔaːآ ā ’āɑ̃0622
ʔأ a uɑ͑0623
ʔإ i0625
ʔؤ  0624
ʔئ  0626
Affricates

both
t͡ʃچloanchʧ0686
d͡ʒ ʒج jʤ062C
Fricatives

18
fف ff0641
fڢinfreq.f06A2
vڤloanvv06A4
θث thθ062B
ðذ dhð0630
sس ss0633
ص 0635
zز zz0632
loan 08B2
ðˤ zˤظ ð̴0638
ʃش shʃ0634
d͡ʒ ʒج jʤ062C
xخ khx062E
ɣغ ghɣ063A
ħح ħ062D
ʕع ʿʕ0639
hه hh0647
- ʰة h tä0629
Nasals

both
mمmm0645
nنnn0646
Other

4
wوw ū aww0648
rرrr0631
lلll0644
jيy ī ayy064A
Special

ʔaɫˈɫaːhinfreq.Allāh{allāh}FDF2

For additional details see Consonant sounds to characters.

Basic consonant letters

The main Unicode Arabic block contains 153 letters, with 77 more in the extended blocks. As shown in the previous section, only a small subset of those are used to write a given language. The others represent special characters added to the repertoire for one or other of the many languages for which the Arabic script is used.

The vast majority of letters represent consonants. A few represent long vowels.

'Alphabetic' consonants

The following consonant letters are those generally recognised as constituting the list representing what is called the 'alphabet' for the Standard Arabic language.


27
ب bbb0628
ت ttt062A
د ddd062F
ط 0637
ض 0636
ك kkk0643
ق qqq0642
unused   
ف fff0641
ث θthθ062B
ذ ðdhð0630
س sss0633
ص 0635
ز zzz0632
ظ ðˤ zˤð̴0638
ش ʃshʃ0634
ج d͡ʒ ʒjʤ062C
خ xkhx062E
غ ɣghɣ063A
ه hhh0647
ح ħħ062D
ع ʕʿʕ0639
unused   
م mmm0645
ن nnn0646
unused   
و w (uː)w ū aww0648
ر rrr0631
ل lll0644
ي j (iː)y ī ayy064A

The recognised alphabet also includes اU+0627 LETTER ALEF, although that is generally used in the context of vowels (see Alef). وU+0648 LETTER WAW and يU+064A LETTER YEH can also represent long vowel locations or combinations of consonant plus vowel (see Matres lectionis).

Additional letters

Besides those that are listed as part of the alphabet, other Unicode letters regularly used in Arabic include:


8
ءʔ ʔ0621
آʔaːā ’āɑ̃0622
أʔa uɑ͑0623
إʔi0625
ؤʔ 0624
ئʔ 0626
ىáà0649
ة- ʰh tä0629

Most of the above letters with diacritics decompose in Unicode Normalization Form D (NFD), however ةU+0629 LETTER TEH MARBUTA does not.
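The decomposition behaviour described above can be checked with Python's standard `unicodedata` module. This is only a minimal sketch: the helper names `nfd_codepoints` and `decomposes` are made up for illustration, and the results depend on the version of the Unicode data bundled with the Python interpreter.

```python
import unicodedata

# Letters from the 'Additional letters' list above, plus ALEF WASLA (U+0671)
# for comparison.
LETTERS = ["\u0621", "\u0622", "\u0623", "\u0625", "\u0624",
           "\u0626", "\u0649", "\u0629", "\u0671"]

def nfd_codepoints(ch: str) -> list[str]:
    """Return the NFD form of one character as a list of U+XXXX strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]

def decomposes(ch: str) -> bool:
    """True if NFD turns the character into more than one code point."""
    return len(unicodedata.normalize("NFD", ch)) > 1

for ch in LETTERS:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {' '.join(nfd_codepoints(ch))}")
```

Only the letters that carry a hamza or madda split into a base letter plus a combining mark; the others come back unchanged.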

Special letters

The following describe basic letters that require some more lengthy description.

Alef


ا∅ (aː)- āɑ0627

Formally speaking, اU+0627 LETTER ALEF has no sound of its own. It is really a vowel lengthener and carrier. Its main uses in Arabic orthography are:

That said, its presence usually indicates the location of a vowel.

It also has one or two minor functions, such as in conjunction with tanwin (nunation) (see ًU+064B FATHATAN).

Certain parts of the Arabic verb end in a long u-vowel that is conventionally written with a following alef that has no effect on pronunciation, eg. كتبوا ktbwɑ kætæbuː. The alef is omitted if a suffix is added, eg. كتبوها ktbwhɑ kætæbuː-haa

Hamza


6
ءʔ ʔ0621
أʔa uɑ͑0623
إʔi0625
ؤʔ 0624
ئʔ 0626
آʔaːā ’āɑ̃0622

both
ٔrareʔ ʿ0654
ٕrareʔ ˓0655

ءU+0621 LETTER HAMZA represents the glottal stop sound. For historical reasons, it is treated as an orthographic sign rather than as a letter of the alphabet. It sometimes stands alone, but usually appears with a 'carrier' letter - ALEF, WAW, or YEH for which separate precomposed characters are available in Unicode ( أ إ ؤ ئ ). Examples of use include أنكر ʔan.ka.ra denial نائم naː.ʔɪm asleep بناء ban.naːʔ builder

In modern printed Arabic, the hamza is rarely shown when it occurs at the beginning of a word, but may appear in conjunction with another character. When the hamza is above another character you should typically use ◌ٔU+0654 HAMZA ABOVE with the appropriate base character, although there are a number of exceptions, and for the Arabic language all the needed combinations are available as precomposed characters. For more details, see the character description.

Classical Arabic distinguishes between 'cutting' and 'joining' hamza. 'Cutting' means always pronounced, 'joining' means frequently elided. The joining hamza is of little practical importance in modern Arabic pronounced without the old case endings. When it does appear in modern Arabic, ٱU+0671 LETTER ALEF WASLA is used to indicate a joining hamza.

Alef madda

آU+0622 LETTER ALEF WITH MADDA ABOVE is used when either of the two following combinations of glottal stop and a vowel appear in a word:

  • ʔaʔ (hamza, short a, hamza) eg. آثار ʔaː.θaːr effects

  • ʔaː (hamza, long a) eg. القرآن al.qur.ˈʔaːn the Qurʼan

Normal pronunciation in both cases is ʔaː.

The madda sign is still very often shown in print.

Teh marbuta

ةU+0629 LETTER TEH MARBUTA usually has no sound, eg. مَدْرَسَة ma.dra.sa school

However, it is sometimes pronounced t in specific grammatical contexts.

It is used for historical reasons to indicate the feminine ending, a, and is only used in final position. The dots are borrowed from تU+062A LETTER TEH. If any suffix is added, the ending is spelled with that letter, eg.

مَدْرَسَتْنَا ma.dra.sat-naː our school

In modern Arabic it is not uncommon to find the two dots omitted, particularly on masculine proper names that have the feminine ending, eg.

طلبه t̴ul.bæ Tulba

Vowelled text may omit the short a diacritic before the TEH MARBUTA, because the sound is always the same.
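The suffix rule described above (a final TEH MARBUTA is respelled as TEH when a suffix is added) can be sketched as follows. The function name `attach_suffix` is hypothetical, and the sketch assumes unvowelled text, where the TEH MARBUTA is literally the last code point of the word.

```python
TEH_MARBUTA = "\u0629"  # ة
TEH = "\u062A"          # ت

def attach_suffix(word: str, suffix: str) -> str:
    """Attach a suffix to an unvowelled word.

    A word-final TEH MARBUTA is respelled as TEH, since TEH MARBUTA
    only occurs in final position.
    """
    if word.endswith(TEH_MARBUTA):
        word = word[:-1] + TEH
    return word + suffix

# مدرسة (school) + نا (our)
print(attach_suffix("\u0645\u062F\u0631\u0633\u0629", "\u0646\u0627"))
```

Vowelled input would need the trailing diacritics handled first, which this sketch deliberately leaves out.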

Repertoire extensions

Letters for foreign sounds

The following characters are not part of the standard Arabic language set but are occasionally used to represent foreign sounds.


3
ڤloanvvv06A4
پloanppp067E
چloant͡ʃchʧ0686

Two of the above are borrowed from Persian/Urdu.

Other letters

The following characters also have the general property of Letter, but are less commonly used for modern Arabic language text.


5
ڢinfreq.ff06A2
ڧinfreq.qq06A7
loan 08B2
ـinfreq.  _0640
ٱinfreq.a-ɑ̄0671

ڢU+06A2 LETTER FEH WITH DOT MOVED BELOW and ڧU+06A7 LETTER QAF WITH DOT ABOVE are alternative forms that are used in Northwest Africa. U+08B2 LETTER ZAIN WITH INVERTED V ABOVE is used for Berber.

ٱU+0671 LETTER ALEF WASLA is described in the section Hamza. Whereas many of the above letters with diacritics decompose in Unicode Normalization Form D (NFD), this letter does not.

ـU+0640 TATWEEL is used to stretch words for simple justification, or to make a word or phrase a particular width, or as a form of emphasis. For more information see Text alignment & justification.

Word ligatures in the Presentation Forms block

Characters in the Arabic Presentation Forms blocks should not normally be used, but the blocks contain a few characters that are not just for compatibility use. These include the following, which have compatibility decompositions but are sometimes used as regular characters. See also Presentation Forms.


4
infreq.FDF2
 FDF4
infreq.FDFA
infreq.FDFB

U+FDF2 LIGATURE ALLAH ISOLATED FORM is used to write the name of Allah. The composition of this character differs from font to font in terms of glyph forms. With some fonts it is necessary to add diacritics, whereas with others it is not.

The other characters represent honorifics or common phrases. Click on the character glyphs in the list above for descriptions.

Arabic definite article

The pronunciation of ال (alif followed by lām) varies when it represents the Arabic definite article.

The lām is not pronounced if it precedes one of the following characters, but instead the following sound is doubled, eg. السلام علیکم as.sa.lɑːm ʕa.laj.kum greetings


14
تttt062A
ثθthθ062B
دddd062F
ذðdhð0630
رrrr0631
زzzz0632
سsss0633
شʃshʃ0634
ص0635
ض0636
ط0637
ظðˤ zˤð̴0638
لlll0644
نnnn0646

These are called 'sun letters' in Arabic. The other letters are 'moon letters'.932

The alif is also not pronounced if the preceding word ends with a vowel or h. It is, however, written.932
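As a rough illustration of the sun-letter rule, the following sketch tests whether the lām of a leading ال would be assimilated. It is a naive check on unvowelled text: the name `lam_is_assimilated` is hypothetical, and the code simply assumes that a leading alef + lam is the definite article, which is not true of every word.

```python
# The 14 'sun letters' listed above.
SUN_LETTERS = set(
    "\u062A\u062B\u062F\u0630\u0631\u0632\u0633"
    "\u0634\u0635\u0636\u0637\u0638\u0644\u0646"
)
ALEF, LAM = "\u0627", "\u0644"

def lam_is_assimilated(word: str) -> bool:
    """Guess whether the lam of a leading definite article is assimilated.

    Assumes unvowelled text and that a leading alef + lam is the article.
    """
    return (
        len(word) > 2
        and word[0] == ALEF
        and word[1] == LAM
        and word[2] in SUN_LETTERS
    )

print(lam_is_assimilated("\u0627\u0644\u0633\u0644\u0627\u0645"))  # السلام
print(lam_is_assimilated("\u0627\u0644\u0645\u062F\u064A\u0631"))  # المدير
```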

Onsets

No special features are used for syllable onsets.

Finals

Final consonants in Arabic are simply written using ordinary consonant letters. No special features are used, other than the sukun in vowelled text (see Consonant clusters).

Consonant clusters

Consonant clusters in Arabic are simply written using a sequence of consonant letters.

When text is vowelled, ◌ْU+0652 SUKUN can be used over a consonant to indicate that it is not followed by a vowel sound. Like other vowel diacritics, this is typically not used in modern text, unless it is necessary to clarify pronunciation.

Consonant length

The diacritic ◌ّU+0651 SHADDA doubles the value of the consonant it is attached to, which is phonemically significant in Arabic, eg. تجّار tud͡ʒ.d͡ʒaːr traders

Like the short vowels, it, too, is not often used, although sometimes it appears when vowel signs don't.

When both shadda and kasra are attached to the same base consonant, a common, though not universal, practice is to display the kasra below the shadda, rather than below the base consonant, eg. مُمَثِّلْ mu.maθ.θil representative. Some fonts, such as Amiri, don't do this. (See also Context-based positioning.)
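Because shadda, sukun and the vowel diacritics are usually omitted in modern text, it can be useful to reduce vowelled text to its unvowelled form, for example before comparing or searching. The sketch below is one possible approach. The name `strip_harakat` is hypothetical, and the character range is an assumption covering only the common marks (U+064B to U+0652, plus U+0670 SUPERSCRIPT ALEF).

```python
import re

# FATHATAN..SUKUN (U+064B-U+0652, which includes SHADDA) plus
# SUPERSCRIPT ALEF (U+0670). Other, rarer marks are not covered.
HARAKAT = re.compile("[\u064B-\u0652\u0670]")

def strip_harakat(text: str) -> str:
    """Remove vowel diacritics, shadda and sukun from Arabic text."""
    return HARAKAT.sub("", text)

vowelled = "\u0645\u064F\u0645\u064E\u062B\u0651\u0650\u0644\u0652"  # vowelled form of ممثل
print(strip_harakat(vowelled))
```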

Consonant sounds to characters

This section maps Modern Standard Arabic consonant sounds to common graphemes in the Arabic orthography. Sounds listed as 'infrequent' are allophones, or sounds used for foreign words, etc.

The right-hand column shows various joining forms.

Light coloured characters occur infrequently.

p

پ‍پ‍پ‍ پ‍ consonant پU+067E LETTER PEH Only for foreign words. (From Persian/Urdu).

ب‍ب‍ب‍ ب‍ consonant بU+0628 LETTER BEH

b

ب‍ب‍ب‍ ب‍ consonant بU+0628 LETTER BEH

t

ت‍ت‍ت‍ ت‍ consonant تU+062A LETTER TEH

t͡ʃ

چ‍چ‍چ‍ چ‍ consonant چU+0686 LETTER TCHEH Only for foreign words. (From Persian/Urdu).

ط‍ط‍ط‍ ط‍ pharyngealised consonant طU+0637 LETTER TAH

d

د‍ ‍د consonant دU+062F LETTER DAL

d͡ʒ

ج‍ج‍ج‍ ج‍ consonant جU+062C LETTER JEEM

چ‍چ‍چ‍ چ‍ consonant چU+0686 LETTER TCHEH Used in Egypt for foreign names. (From Persian/Urdu).

ض‍ض‍ض‍ ض‍ pharyngealised consonant ضU+0636 LETTER DAD

k

ك‍ك‍ك‍ ك‍ consonant كU+0643 LETTER KAF

q

ق‍ق‍ق‍ ق‍ consonant قU+0642 LETTER QAF

ڧ‍ڧ‍ڧ‍ ڧ‍ consonant ڧU+06A7 LETTER QAF WITH DOT ABOVE Maghrebi form, used in North Africa.

ʔ

ء glottal stop ءU+0621 LETTER HAMZA

أ ـأ glottal stop أU+0623 LETTER ALEF WITH HAMZA ABOVE

إ ـإ glottal stop إU+0625 LETTER ALEF WITH HAMZA BELOW

ؤ ـؤ glottal stop ؤU+0624 LETTER WAW WITH HAMZA ABOVE

ئ ئئئ glottal stop ئU+0626 LETTER YEH WITH HAMZA ABOVE

آ ـآ glottal stop آU+0622 LETTER ALEF WITH MADDA ABOVE Represents the sound ʔaː.

f

ف‍ف‍ف‍ ف‍ consonant فU+0641 LETTER FEH

 

ڢ‍ڢ‍ڢ‍ ڢ‍ consonant ڢU+06A2 LETTER FEH WITH DOT MOVED BELOW Maghrebi form, used in North Africa.

v

ف‍ف‍ف‍ ف‍ consonant فU+0641 LETTER FEH

ڤ‍ڤ‍ڤ‍ ڤ‍ consonant ڤU+06A4 LETTER VEH

θ

ث‍ث‍ث‍ ث‍ consonant ثU+062B LETTER THEH

ð

ذ‍ ‍ذ consonant ذU+0630 LETTER THAL

ðˤ

ظ‍ظ‍ظ‍ ظ‍ pharyngealised consonant ظU+0638 LETTER ZAH

s

س‍س‍س‍ س‍ consonant سU+0633 LETTER SEEN

ص‍ص‍ص‍ ص‍ pharyngealised consonant صU+0635 LETTER SAD

z

ز‍ ‍ز consonant زU+0632 LETTER ZAIN

ظ‍ظ‍ظ‍ ظ‍ pharyngealised consonant ظU+0638 LETTER ZAH

ࢲ‍ ‍ࢲ pharyngealized consonant U+08B2 LETTER ZAIN WITH INVERTED V ABOVE Sometimes used for Berber sounds.

ʃ

ش‍ش‍ش‍ ش‍ consonant شU+0634 LETTER SHEEN

ʒ

ج‍ج‍ج‍ ج‍ consonant جU+062C LETTER JEEM This is a regional variant for d͡ʒ.

x

خ‍خ‍خ‍ خ‍ consonant خU+062E LETTER KHAH

ɣ

غ‍غ‍غ‍ غ‍ consonant غU+063A LETTER GHAIN

ħ

ح‍ح‍ح‍ ح‍ consonant حU+062D LETTER HAH

ʕ

ع‍ع‍ع‍ ع‍ consonant عU+0639 LETTER AIN

h

ه‍ه‍ه‍ ه‍ consonant هU+0647 LETTER HEH

m

م‍م‍م‍ م‍ consonant مU+0645 LETTER MEEM

n

ن‍ن‍ن‍ ن‍ consonant نU+0646 LETTER NOON

w

و‍ ‍و consonant/mater lectionis وU+0648 LETTER WAW

r

ر‍ ‍ر consonant رU+0631 LETTER REH

l

ل‍ل‍ل‍ ل‍ consonant لU+0644 LETTER LAM

j

ي‍ي‍ي‍ ي‍ consonant/mater lectionis يU+064A LETTER YEH

Symbols

Honorifics

Characters in the Arabic Presentation Forms blocks that are not just for compatibility use include the following. Click on the characters in the list for more information.


20
  {RAHIMAHU ALLAAH}FD40
  {RADI ALLAAHU ANH}FD41
  {RADI ALLAAHU ANHAA}FD42
  {RADI ALLAAHU ANHUM}FD43
  {RADI ALLAAHU ANHUMAA}FD44
  {RADI ALLAAHU ANHUNNA}FD45
  {SALLALLAAHU ALAYHI WA-AALIH}FD46
  {ALAYHI AS-SALAAM}FD47
  {ALAYHIM AS-SALAAM}FD48
  {ALAYHIMAA AS-SALAAM}FD49
  {ALAYHI AS-SALAATU WAS-SALAAM}FD4A
  {QUDDISA SIRRAH}FD4B
  {SALLALLAHU ALAYHI WAAALIHEE WA-SALLAM}FD4C
  {ALAYHAA AS-SALAAM}FD4D
  {TABAARAKA WA-TAAALAA}FD4E
  {RAHIMAHUM ALLAAH}FD4F
  {SALAAMUHU ALAYNAA}FDCF
infreq. {In the name of God, the Most Gracious, the Most Merciful}FDFD
  {SUBHAANAHU WA TAAALAA}FDFE
﷿  {AZZA WA JALL}FDFF

See also Presentation Forms.

Ijam symbols

Other characters in the Arabic Presentation Forms blocks that are not just for compatibility use include the following symbols that can be used for pedagogical purposes. In educational materials there is sometimes a need to show pictures of the dots and marks used to distinguish Arabic characters, particularly the ijam. These code points provide for that use case. They are never used as combining marks, nor in composition with Arabic letter forms, but are simply symbols.


17
rare {DOT ABOVE}FBB2
rare {DOT BELOW}FBB3
rare {TWO DOTS ABOVE}FBB4
rare {TWO DOTS BELOW}FBB5
rare {THREE DOTS ABOVE}FBB6
rare {THREE DOTS BELOW}FBB7
rare {THREE DOTS POINTING DOWNWARDS ABOVE}FBB8
rare {THREE DOTS POINTING DOWNWARDS BELOW}FBB9
rare {FOUR DOTS ABOVE}FBBA
rare {FOUR DOTS BELOW}FBBB
rare {DOUBLE VERTICAL BAR BELOW}FBBC
rare {TWO DOTS VERTICALLY ABOVE}FBBD
rare {TWO DOTS VERTICALLY BELOW}FBBE
﮿rare {RING}FBBF
rare {SMALL TAH ABOVE}FBC0
rare {SMALL TAH BELOW}FBC1
rare {WASLA ABOVE}FBC2

See also Presentation Forms.

Other features

Ligatures

The combination ل + اU+0644 LETTER LAM + U+0627 LETTER ALEF is always written as a ligature. The underlying code points are, however, preserved. The form of this ligature that joins to the right is ‍لاand unjoined it is لا

Observation: When diacritics are used with this ligature, they sometimes appear to be over the ALEF, rather than over the LAM, eg. قليلاً This would require a typing order that is different from the spoken sequence.

Other combinations of characters are likely to also ligate (see Context-based shaping). The number of ligatures in text typically depends on the font used, but ligation can also be used as a device to manage justification, in which case it needs some degree of manual control.

Formatting characters

Modern Arabic text makes use of a relatively large set of invisible formatting characters, especially in plain text, many of which are used to manage text direction. Descriptions of these characters can be found in the following sections:

Presentation Forms

The code points in the Unicode blocks Arabic Presentation Forms-A and Arabic Presentation Forms-B provide positional forms of Arabic letters and ligatures. They should not be used for ordinary text. Those code points are provided for compatibility with legacy code pages, and have (compatibility) character decomposition mappings. Normally, Arabic text should be written with code points from the main Arabic block and its extensions; positional forms are dealt with by the font and rendering algorithms.
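Since the positional forms have compatibility decomposition mappings, legacy text that contains them can be mapped back to ordinary Arabic letters with Unicode Normalisation Form KC. The following is a minimal sketch using Python's `unicodedata` module. The function names are hypothetical, and note that NFKC is a blunt tool: it also folds unrelated compatibility characters, and it decomposes word ligatures such as U+FDF2 that some authors use deliberately.

```python
import unicodedata

def has_presentation_forms(text: str) -> bool:
    """True if any character lies in Arabic Presentation Forms-A or -B."""
    return any(
        "\uFB50" <= ch <= "\uFDFF" or "\uFE70" <= ch <= "\uFEFF"
        for ch in text
    )

def fold_presentation_forms(text: str) -> str:
    """Replace compatibility characters with their NFKC equivalents.

    This affects all compatibility characters, not only Arabic ones.
    """
    return unicodedata.normalize("NFKC", text)

legacy = "\uFEFB"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
folded = fold_presentation_forms(legacy)
print([f"U+{ord(c):04X}" for c in folded])
```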

However, there are some exceptions to this rule, which are listed here. These characters are not included in the Unicode repertoire for compatibility but may be used in Arabic texts, in their own right.12398-400 They normally don't have character decomposition mappings. (See also Arabic ‘presentation form’ exceptions.)

The useful code points include the following:

Honorific combining marks

In addition to the honorifics described earlier, the basic Arabic block has a small number of corresponding combining marks. Click on the characters in the list below for more details.


5
ؐ0610
ؑ0611
ؒ0612
ؓ0613
ؔ0614

Encoding choices

In the Arabic orthography different sequences of Unicode characters may produce the same visual result. Here we look at those, and make notes on usage.

Hamza & precomposed characters

Unicode support for the various uses of the hamza is complicated.12384 In general, the Unicode Standard recommends using ◌ٔU+0654 HAMZA ABOVE in combination with a base character. However, there are a few exceptions to consider.

Canonically-equivalent alternatives

A number of combinations with the hamza diacritic can be represented as either an atomic character or a decomposed sequence, where the parts are separated in Unicode Normalisation Form D (NFD) and recomposed in Unicode Normalisation Form C (NFC), so both approaches are canonically equivalent. These include the following:

Atomic Decomposed
أ [U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE] أ [U+0627 ARABIC LETTER ALEF + U+0654 ARABIC HAMZA ABOVE]
آ [U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE] آ [U+0627 ARABIC LETTER ALEF + U+0653 ARABIC MADDAH ABOVE]
ؤ [U+0624 ARABIC LETTER WAW WITH HAMZA ABOVE] ؤ [U+0648 ARABIC LETTER WAW + U+0654 ARABIC HAMZA ABOVE]
ئ [U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE] ئ [U+064A ARABIC LETTER YEH + U+0654 ARABIC HAMZA ABOVE]

The single code point (atomic) form is the one preferred by the Unicode Standard and the one in common use for Arabic language text, but either could be found.
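Because either form may be found, code that compares or searches Arabic text usually needs to normalise both sides first. A minimal sketch, assuming Python's standard `unicodedata` module (the helper name `canonically_equal` is made up for illustration):

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

atomic = "\u0623"            # ALEF WITH HAMZA ABOVE
decomposed = "\u0627\u0654"  # ALEF + HAMZA ABOVE

print(atomic == decomposed)                   # raw code point comparison
print(canonically_equal(atomic, decomposed))  # comparison after NFC
```

The raw comparison fails even though the two strings are canonically equivalent, which is why normalising before comparison matters.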

Codepoint sequences

When typing and in storage, combining marks always follow the base character they are associated with.

Special rendering rules

In principle, if more than one combining mark appears on the same side of the base character, Unicode expects applications to render the marks such that those marks closer to the base character in memory appear closer to the base character when rendered. (This is called the inside-out rule.) However, due to the reordering applied by the Unicode normalisation forms, some of the Arabic script diacritics end up in an inappropriate order on display.

For example, if a user types the sequence of characters in Figure 1, the order of the marks will be changed such that applying the inside-out rule would render the shadda above the vowel (which is incorrect). (In fact, most application renderers have special rules to correct this.)

Unicode formally addresses this anomaly in Unicode Technical Report #53, Unicode® Arabic Mark Rendering, which defines the Arabic Mark Transient Reordering Algorithm (AMTRA): a set of rules for how to render sequences of Arabic combining marks. The rules generally move shadda, hamza, round dots, etc. so that they are close to the base character.

User input (first sequence below) and post-normalisation output (second sequence below):

بُّ

بU+0628 LETTER BEH

ّU+0651 SHADDA

ُU+064F DAMMA

بُ͏ّ

بU+0628 LETTER BEH

ُU+064F DAMMA

ّU+0651 SHADDA

A sequence of shadda and damma as the user is likely to input it (left), and how it could potentially be arranged after normalisation (right).

In the rare exceptions where the AMTRA rules should not change the rendering, this can be achieved by placing an invisible ͏U+034F COMBINING GRAPHEME JOINER character between the combining marks. (In fact, this is what was done to simulate the incorrect appearance in Figure 1, because otherwise the browser rendering engine would have automatically produced the same output as in the first column. Clicking on the example will show the sequence used.)
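The reordering shown in Figure 1 can be reproduced with Python's `unicodedata` module. This sketch only demonstrates what normalisation does to the stored code points; it says nothing about how a particular renderer applies AMTRA. It also shows that U+034F COMBINING GRAPHEME JOINER has combining class 0, so normalisation does not reorder marks across it.

```python
import unicodedata

BEH, SHADDA, DAMMA, CGJ = "\u0628", "\u0651", "\u064F", "\u034F"

# Canonical combining classes decide the order after normalisation:
# lower classes sort first.
print(unicodedata.combining(SHADDA), unicodedata.combining(DAMMA))

typed = BEH + SHADDA + DAMMA                 # the order a user is likely to type
normalised = unicodedata.normalize("NFC", typed)
print([f"U+{ord(c):04X}" for c in normalised])

# With a CGJ between the marks, normalisation leaves the sequence alone.
with_cgj = BEH + SHADDA + CGJ + DAMMA
print(unicodedata.normalize("NFC", with_cgj) == with_cgj)
```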

Numbers, dates, currency, etc.

This section describes typographic features related to digits, dates, currencies, etc.

Digits

See type samples.


10
٠00660
١10661
٢20662
٣30663
٤40664
٥50665
٦60666
٧70667
٨80668
٩90669

10
10031
20032
30033
40034
50035
60036
70037
80038
90039
00030

Arabic-Indic digits are typically used in Middle Eastern and Gulf countries, whereas North African countries tend to use European digits. In neither area, however, is one digit style used exclusively.

The Unicode bidi_class property for these native digits is Arabic_Number, which makes them behave differently from ASCII digits, and differently from the set of extended digits used for Persian, Urdu, etc. For more information, see Expressions & sequences.
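The difference in bidi_class between the three digit sets can be inspected with Python's `unicodedata.bidirectional()`, which returns the short property value aliases (`EN` for European_Number, `AN` for Arabic_Number). The function name below is hypothetical.

```python
import unicodedata

def digit_bidi_classes() -> dict[str, str]:
    """Return the bidi class of the digit one in each of the three digit sets."""
    samples = {
        "ASCII": "1",
        "Arabic-Indic": "\u0661",
        "Extended Arabic-Indic": "\u06F1",
    }
    return {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}

for label, bidi_class in digit_bidi_classes().items():
    print(f"{label}: {bidi_class}")
```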

Arabic script has its own number separators, which are used in Arabic language text when the non-European digits are used. They are ٫U+066B DECIMAL SEPARATOR and ٬U+066C THOUSANDS SEPARATOR.
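As an illustration, a number formatted with ASCII digits and separators can be converted to native digits and the Arabic separators with a simple translation table. This is only a sketch: the name `to_native_digits` is hypothetical, it assumes the input uses `.` as the decimal separator and `,` as the grouping separator, and production code would normally rely on a locale-aware library instead.

```python
# Map ASCII digits to Arabic-Indic digits, '.' to U+066B and ',' to U+066C.
ASCII_TO_NATIVE = str.maketrans(
    "0123456789.,",
    "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669\u066B\u066C",
)

def to_native_digits(formatted: str) -> str:
    """Convert an ASCII-formatted number such as '1,234.5' to native digits."""
    return formatted.translate(ASCII_TO_NATIVE)

print(to_native_digits(f"{1234567.5:,.1f}"))
```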

Arabic also has its own ٪U+066A PERCENT SIGN and ؉U+0609 ARABIC-INDIC PER MILLE SIGN. The ASCII %U+0025 PERCENT SIGN and U+2030 PER MILLE SIGN are also used.

The CLDR standard-decimal pattern is #,##0.###. The standard-percent pattern is #,##0% or #,##0٪.11

See also Expressions & sequences about directional implications for handling expressions or sequences of numbers.

Extended-Arabic digits. Still in the basic Unicode Arabic block, as mentioned, there is a second set of digits in Unicode for use in languages such as Persian and Urdu.


10
۰06F0
۱06F1
۲06F2
۳06F3
۴06F4
۵06F5
۶06F6
۷06F7
۸06F8
۹06F9

The glyph shapes are typically different for 3 of the digits (although not always the same 3 digits) in Persian, Urdu and Sindhi.

Arabic: ٠١٢٣٤٥٦٧٨٩
Persian: ۰۱۲۳۴۵۶۷۸۹
Urdu: ۰۱۲۳۴۵۶۷۸۹
Sindhi: ۰۱۲۳۴۵۶۷۸۹
Arabic-indic numerals, as used in Arabic, Persian, Urdu and Sindhi language text.

Currency

Unicode has a character for the rial: U+FDFC RIAL SIGN.

Text direction

Arabic script text is written horizontally and right-to-left in the main but, as in most right-to-left scripts, numbers and embedded text in other scripts are written left-to-right (producing 'bidirectional' text).

العاشر ليونيكود (Unicode Conference)،الذي سيعقد في 10-12 آذار 1997 مبدينة
Arabic words are read right-to-left, starting from the right of this line, but numbers and Latin text (highlighted) are read left-to-right.

The Unicode Bidirectional Algorithm automatically takes care of the ordering for all the text in Figure 3, as long as the 'base direction' (ie. the surrounding directional context) is set to right-to-left (RTL).

Characters are all stored in the order in which they are spoken (and typed). This so-called 'logical' order is then rendered as bidirectional flows by the application at run time, as the text is displayed or printed. The relative placement of characters within a single directional flow is based on strong directional properties (RTL or LTR) assigned to each Unicode character by the Unicode Standard. There is, however, a set of neutral direction property values, mostly for punctuation, where the placement of characters depends on the base direction.

Show default bidi_class properties for characters in this orthography.

If the base direction is not set appropriately, the directional runs will be ordered incorrectly as shown in Figure 4, making it very difficult to get the meaning.

لحجز مواعيد اللقاح ضد COVID-19 انتقل إلى، www.nhs.uk/vaccination أو اتصل برقم 119 الذي.
The exact same sequence of characters with the base direction set to RTL (top), and with no base direction set on this LTR page (bottom). The arrows show how items are relocated.

In some circumstances the Unicode Bidirectional Algorithm requires additional assistance to correctly render the directionality of bidirectional text. For such cases the Unicode Standard provides invisible formatting characters for use in plain text. See Managing text direction.

In HTML the base direction and higher level controls can be set using the dir attribute or the bdi element. CSS should not be used to control direction. Unicode formatting codes should also not be used where markup is available.

For more information about how directionality and base direction work, see Unicode Bidirectional Algorithm basics. For information about plain text formatting characters see How to use Unicode controls for bidi text. And for working with markup in HTML, see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.

For authoring HTML pages, one of the most important things to remember is to use <html dir="rtl" … > at the top of a right-to-left page, and then use the dir attribute or bdi tag for ranges within the page, but only when you need to change the base direction. Also, use markup to manage direction, and do not use CSS styling.

For other aspects of dealing with right-to-left writing systems see the following sections:

Managing text direction

Unicode provides a set of 10 formatting characters that can be used to control the direction of text when displayed. These characters have no visual form in the rendered text, however text editing applications may have a way to show their location.

‫U+202B RIGHT-TO-LEFT EMBEDDING (RLE), ‪U+202A LEFT-TO-RIGHT EMBEDDING (LRE), and ‬U+202C POP DIRECTIONAL FORMATTING (PDF) are in widespread use to set the base direction of a range of characters. RLE/LRE comes at the start, and PDF at the end of a range of characters for which the base direction is to be set.

In Unicode 6.3, the Unicode Standard added a set of characters which do the same thing but also isolate the content from surrounding characters, in order to avoid spillover effects. They are U+2067 RIGHT-TO-LEFT ISOLATE (RLI), U+2066 LEFT-TO-RIGHT ISOLATE (LRI), and U+2069 POP DIRECTIONAL ISOLATE (PDI). The Unicode Standard recommends that these be used instead.

There is also ⁨U+2068 FIRST STRONG ISOLATE (FSI), used initially to set the base direction according to the first recognised strongly-directional character.
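In plain text, where markup is not available, the isolate characters are used in pairs around the range whose base direction is to be set. A minimal sketch (the function name `isolate` and its `direction` parameter are hypothetical):

```python
RLI, LRI, FSI, PDI = "\u2067", "\u2066", "\u2068", "\u2069"

def isolate(text: str, direction: str = "auto") -> str:
    """Wrap text in a directional isolate.

    direction is 'rtl', 'ltr', or 'auto' (first strong character decides).
    """
    opener = {"rtl": RLI, "ltr": LRI, "auto": FSI}[direction]
    return opener + text + PDI

# Insert a string of unknown direction into a message without
# letting it affect the ordering of the surrounding text.
message = "User: " + isolate("\u0623\u062D\u0645\u062F") + " (online)"
print([f"U+{ord(c):04X}" for c in isolate("abc", "ltr")])
```

Using FSI as the default suits inserted text whose direction is not known in advance, such as user names.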

؜U+061C LETTER MARK (ALM) is used to produce correct sequencing of numeric data. Click on the character name, and see also Expressions & sequences for details.

‏U+200F RIGHT-TO-LEFT MARK (RLM) and ‎U+200E LEFT-TO-RIGHT MARK (LRM) are invisible characters with strong directional properties that are also sometimes used to produce the correct ordering of text.

For more information about how to use these formatting characters see How to use Unicode controls for bidi text. Note, however, that when writing HTML you should generally use markup rather than these control codes. For information about that, see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.

Expressions & sequences

This section is about sequences of numbers, rather than a sequence of digits. Sequences of numbers are sets of numbers separated by punctuation or spaces, such as 10–12–2022. Sequences of digits, such as 123, in Arabic text run LTR automatically.

A sequence of numbers used to express a range of values generally runs right to left in the Arabic language (and languages using the Thaana or Syriac scripts), whereas for Persian language text (and in Hebrew, N’Ko or Adlam scripts) it runs left to right.

This also tends to apply to expressions such as 1 + 2 = 3.

Figure 5 shows Arabic text which is right-to-left overall, containing an ASCII-digit numeric range that is also ordered RTL, ie. it starts with 10 on the right and ends with 12 on the left.

في 10–12 آدار
A numeric range in Arabic language text.

In Persian, however, the sequence would generally run LTR, so 10 would be on the left, and 12 on the right. The underlying order of the characters that make up the expression, and the order in which they are typed, remain the same. (Click on each figure to see the underlying character sequences.)

در ‎10–12 آذار
A numeric range in Persian language text.

However, the preferred order for a sequence of numbers may also depend on the context. For ISBN numbers, telephone numbers, and so forth, a left-to-right sequencing is likely to be preferred.

The default direction for a sequence in an application that implements Unicode fully will depend on:

  1. the digits used (ASCII, Arabic or Extended Arabic),
  2. whether or not the sequence is preceded by Arabic script text, and
  3. the separators used.

Contextual factors for Arabic

The table below shows default sequence orders for Arabic text, with separators drawn from 4 different Unicode bidi_class properties. The base direction in all cases is RTL. The coloured items are LTR sequences; the black sequences run RTL.

The ASCII digits have the bidi property European_number, and the Arabic digits have the property Arabic_number.

If you add spaces after any separator (such as the solidus on the right), the order will be RTL, per the left-hand column.

bidi_class White_Space Other_Neutral European_Separator Common_Separator
Includes: ASCII space, and 15 others Hyphen (U+2010), en-dash, and 5,500+ other code points Hyphen-minus (U+002D), minus sign, plus sign, +9 more Solidus, Arabic comma, comma, full stop, colon, nbsp, +9 more
Bare ASCII 12 34 56 12‐34‐56 12-34-56 12/34/56
Bare native ١٢ ٣٤ ٥٦ ١٢‐٣٤‐٥٦ ١٢-٣٤-٥٦ ١٢/٣٤/٥٦
ASCII after Arabic ن 12 34 56 ن 12‐34‐56 ن 12-34-56 ن 12/34/56
Native after Arabic ن ١٢ ٣٤ ٥٦ ن ١٢‐٣٤‐٥٦ ن ١٢-٣٤-٥٦ ن ١٢/٣٤/٥٦
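
The bidi classes named in the table above can be inspected with Python's standard unicodedata module. This is a minimal sketch rather than part of the original notes: the helper name bidi_classes is hypothetical, and the values reported depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

def bidi_classes(text):
    """Return the Bidi_Class short name (eg. 'EN', 'AN') of each code point."""
    return [unicodedata.bidirectional(ch) for ch in text]

# ASCII digits are European_Number; Arabic-Indic digits are Arabic_Number.
ascii_digit = bidi_classes("1")
arabic_digit = bidi_classes("\u0661")  # ARABIC-INDIC DIGIT ONE

# One separator from each column of the table, in column order:
# space, U+2010 HYPHEN, hyphen-minus, solidus.
separators = bidi_classes(" \u2010-/")
```

The short names returned are those used in the Unicode character database: WS for White_Space, ON for Other_Neutral, ES for European_Separator, and CS for Common_Separator.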

Controlling the direction for Arabic

Changing the direction of the bare ASCII digits with ASCII hyphen. If you have a line that only contains digits the direction for the sequences varies, depending on whether the digits are ASCII (European_Number) or Arabic (Arabic_number).

If you want the ASCII digit sequence to run RTL (eg. for a range) you need to start the line with the formatting character ؜U+061C LETTER MARK (ALM). This is effectively an invisible Arabic script character. The required order cannot be achieved by simply setting the base direction, nor by using ‏U+200F RIGHT-TO-LEFT MARK. It has to be ALM.

An alternative would be to use U+2010 HYPHEN or U+2013 EN DASH instead, since they have a different bidi class.
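
The two approaches just described can be sketched as string construction in Python. The function names are hypothetical. The code only builds the character sequence; the visual order still depends on a renderer that implements the Unicode bidirectional algorithm, which the Python standard library does not provide.

```python
import unicodedata

ALM = "\u061C"  # ARABIC LETTER MARK

def rtl_range(start, end, separator="-"):
    """Prefix an ASCII-digit range with ALM so that the digits are
    treated as if they followed an Arabic letter."""
    return f"{ALM}{start}{separator}{end}"

def with_en_dash(start, end):
    """Alternative: separate the numbers with U+2013 EN DASH, which has
    a different bidi class from the ASCII hyphen-minus."""
    return f"{start}\u2013{end}"

marked = rtl_range("10", "12")
dashed = with_en_dash("10", "12")
```

ALM has the bidi class Arabic_Letter, whereas hyphen-minus is a European_Separator and the en dash is Other_Neutral.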

Making other sequences run LTR. Sequences using most other separators, such as the non-ASCII hyphen, run RTL by default in RTL text. This is appropriate for ranges in Arabic, but not for ISBN numbers, telephone numbers, etc. To make these run LTR, you can either precede the sequence with a ‎U+200E LEFT-TO-RIGHT MARK (LRM), or set the base direction of the sequence to LTR using markup or formatting characters.

Making Common_separator sequences run RTL. Sequences separated by commas (ASCII and Arabic), full stops, colons, and no-break spaces run LTR and are resistant to change: the direction cannot be changed using RLM or by changing the base direction. This means that if, for example, you want the components of numeric dates to be ordered RTL, you should avoid using these separators. (However, surrounding the separators with spaces would produce the RTL direction, eg. compare 12/34/56 and 12 / 34 / 56, where the only difference is the addition of spaces.)

Alphanumeric sequences. Some sequences, such as MAC addresses, contain a mixture of numbers and letters. The strong directionality of the letters influences the resulting order, and so these sequences are best managed by explicitly setting the base direction.
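
For sequences that should run LTR, such as telephone numbers or MAC addresses, the following sketch shows two ways of adding formatting characters in Python. The function names are hypothetical. The choice of the isolate controls LRI and PDI as the 'formatting characters' that set the base direction is an assumption made here for illustration. As before, the code only builds the string; a bidi-aware renderer determines the displayed order.

```python
import unicodedata

LRM = "\u200E"  # LEFT-TO-RIGHT MARK
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def ltr_with_mark(sequence):
    """Precede the sequence with LRM."""
    return LRM + sequence

def ltr_isolated(sequence):
    """Give the sequence its own LTR base direction using isolate controls."""
    return LRI + sequence + PDI

phone = ltr_with_mark("12\u201034\u201056")  # separated by U+2010 HYPHEN
mac = ltr_isolated("00:1A:2B:3C:4D:5E")
```

In HTML the same effect is better achieved with markup, as noted earlier on this page.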

Contextual factors for Persian

Although we are describing Arabic here, it may also be useful to include data for Persian to allow for comparison.

This table is the same as the Arabic table, except for the cell that is the junction of European_separator and native Arabic digits. This is because the native digits are from the Extended Arabic-indic range, and have a bidi_class property of European_number, like the ASCII digits.

bidi_class White_Space Other_Neutral European_Separator Common_Separator
Includes: ASCII space, and 15 others Hyphen (U+2010), en-dash, and 5,500+ other code points Hyphen-minus (U+002D), minus sign, plus sign, +9 more Solidus, Arabic comma, comma, full stop, colon, nbsp, +9 more
Bare ASCII 12 34 56 12‐34‐56 12-34-56 12/34/56
Bare native ۱۲ ۳۴ ۵۶ ۱۲‐۳۴‐۵۶ ۱۲-۳۴-۵۶ ۱۲/۳۴/۵۶
ASCII after Arabic ن 12 34 56 ن 12‐34‐56 ن 12-34-56 ن 12/34/56
Native after Arabic ن ۱۲ ۳۴ ۵۶ ن ۱۲‐۳۴‐۵۶ ن ۱۲-۳۴-۵۶ ن ۱۲/۳۴/۵۶
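
The difference between the two sets of native digits can be checked with Python's unicodedata module. This is a small sketch with a hypothetical helper name.

```python
import unicodedata

ARABIC_INDIC = "\u0661\u0662"  # digits one and two from U+0660..U+0669
EXTENDED = "\u06F1\u06F2"      # digits one and two from U+06F0..U+06F9

def digit_classes(digits):
    """Return the set of Bidi_Class values found in a string of digits."""
    return {unicodedata.bidirectional(d) for d in digits}

arabic_classes = digit_classes(ARABIC_INDIC)
persian_classes = digit_classes(EXTENDED)
ascii_classes = digit_classes("12")
```

The extended digits used for Persian share the European_Number class with the ASCII digits, which is why the two behave alike next to a European_Separator.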

Glyph shaping & positioning

This section describes typographic features related to font/writing styles, cursive text, context-based shaping, context-based positioning, letterform slopes, weights & italics, and case & other character transforms.

You can experiment with examples using the Arabic character app.

Writing styles

Arabic orthographies can be grouped into a number of writing styles, some of which are more common for particular languages, while others can be used interchangeably for the same language. Sometimes the variations are adapted to usage, for example book text vs. inscriptions; sometimes the variants reflect regional, cultural or stylistic calligraphic preferences.

The different styles include Naskh, Nasta'liq, Ruq'a, Thuluth, Taʻlīq, Kufi, Diwani, Maghribi, Kano. The examples in this page use a naskh writing style. For a brief introduction to font styles, with examples, see Text layout requirements for the Arabic script.

The naskh writing style is the most prominent style for the Arabic language, and has become the default form of Arabic language content in most contexts. It has clearly distinguished letters, which make it easy to read, and can be written in small sizes.

يحق لكل فرد أن يغادر أية بلاد بما في ذلك بلده كما يحق له العودة إليه.
Arabic is commonly written in the naskh writing style.

The ruq’ah writing style was designed for use in education, in official documents, and for every-day writing. It is known for its clipped letters composed of short, straight lines and simple curves, as well as its straight and even lines of text. It is a functional style of writing that is quick to write and easy to read. It also doesn’t extend baselines, like a naskh font does. In 2010's rebranding of Amman a ruq'ah font family was created to serve as an italic face. Monotype has an interesting article on the development of ruq'ah.

يحق لكل فرد أن يغادر أية بلاد بما في ذلك بلده كما يحق له العودة إليه.
The Waseem font released with Mojave OS is based on the ruq'ah style.

The nasta’liq writing style is the standard way of writing Urdu and Kashmiri, and is also often a preferred style for Persian text. Key features include a sloping baseline for joined letters, and overall complex shaping and positioning for base letters and diacritics alike. There are also distinctive shapes for many glyphs and ligatures.

يحق لكل فرد أن يغادر أية بلاد بما في ذلك بلده كما يحق له العودة إليه.
The same Arabic language text rendered with the Awami Nastaliq font.

The kano writing style is a common way of writing Hausa in Nigeria in the ajami script, and like other West African writing it is based on Warsh (Warš) forms, which incorporate Maghribi characteristics. Some sources describe an alternative Hafs (Ḥafṣ) orthography, used with hand-written adaptations for the newspaper Al-Fijir.

يحق لكل فرد أن يغادر أية بلاد بما في ذلك بلده كما يحق له العودة إليه.
The same Arabic language text rendered with the Alkalami font.

The kufi writing style is the original style used for the Koran, but is not used for newspapers or official content today. However, it is used in modern content for logos and other stylised applications.

يحق لكل فرد أن يغادر أية بلاد بما في ذلك بلده كما يحق له العودة إليه.
The same Arabic language text rendered with the KufiStandardGK font.

Cursive script

Do letters in this script join with each other by default? Is the basic shape of a letter radically changed? Is it sometimes not cursive? Are there any special features to note? Are Unicode joiner and non-joiner characters needed to override default joining behaviours?

See type samples.

Arabic script is always cursive, ie. letters in a word are joined up. Fonts need to produce the appropriate joining form for a letter, according to its visual context, but the code point used doesn't change. This results in four different shapes for most letters (including an isolated shape). Ligated forms also join with characters alongside them.

The highlights in the example below show the same letter, عU+0639 LETTER AIN, with three different joining forms.

على • متعددة • وسيجمع

The letter ع (ain) in 3 different joining contexts.

Most Arabic script letters join on both sides. A few only join on the right-hand side: this involves 4 basic shapes for Modern Standard Arabic.

ءU+0621 LETTER HAMZA doesn't join on either side.
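
One rough way to see which joining forms exist for a letter is to look at the compatibility characters in the Arabic Presentation Forms-B block, whose decomposition tags name the positional form. The sketch below uses that block purely as a data source. Those characters should not normally be used in text, and the authoritative information is the Unicode Joining_Type property, which Python's unicodedata module does not expose. The function name is hypothetical.

```python
import unicodedata

def joining_forms(letter):
    """Collect the positional forms encoded for a single Arabic letter.

    Scans U+FE70..U+FEFF and reads the decomposition tag (<isolated>,
    <final>, <initial>, <medial>) of each presentation form that
    decomposes to exactly this letter.
    """
    target = f"{ord(letter):04X}"
    forms = set()
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) == 2 and parts[1] == target:
            forms.add(parts[0].strip("<>"))
    return forms

ain = joining_forms("\u0639")    # joins on both sides
dal = joining_forms("\u062F")    # joins on the right only
hamza = joining_forms("\u0621")  # doesn't join
```

In the Unicode tags, 'final' is the form that joins only on the right, 'initial' the form that joins only on the left, and 'medial' the form that joins on both sides.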

Cursive joining forms

Most dual-joining characters add or become a swash when they don't join to the left. A number of characters, however, undergo additional shape changes across the joining forms. Figure 13 and Figure 14 show the basic shapes in Modern Standard Arabic and what their joining forms look like. Significant variations are highlighted.

isolated right-joined dual-join left-joined MSA letters
ب ـب ـبـ بـ letters: ب U+0628, ت U+062A, ث U+062B, پ U+067E (loan)
ن ـن ـنـ نـ letters: ن U+0646
ق ـق ـقـ قـ letters: ق U+0642
ف ـف ـفـ فـ letters: ف U+0641, ڤ U+06A4 (loan)
س ـس ـسـ سـ letters: س U+0633, ش U+0634
ص ـص ـصـ صـ letters: ص U+0635, ض U+0636
ط ـط ـطـ طـ letters: ط U+0637, ظ U+0638
ك ـك ـكـ كـ letters: ك U+0643
ل ـل ـلـ لـ letters: ل U+0644
ه ـه ـهـ هـ letters: ه U+0647, ة U+0629
م ـم ـمـ مـ letters: م U+0645
ع ـع ـعـ عـ letters: ع U+0639, غ U+063A
ح ـح ـحـ حـ letters: ح U+062D, خ U+062E, ج U+062C, چ U+0686 (loan)
ي ـي ـيـ يـ letters: ي U+064A, ئ U+0626, ى U+0649
Joining forms for shapes that join on both sides.
isolated right-joined MSA letters
ا ـا letters: ا U+0627, أ U+0623, إ U+0625, آ U+0622, ٱ U+0671 (infreq.)
ر ـر letters: ر U+0631, ز U+0632
د ـد letters: د U+062F, ذ U+0630
و ـو letters: و U+0648, ؤ U+0624
Joining forms for shapes that join on the right only.

Managing glyph shaping

‍U+200D ZERO WIDTH JOINER (ZWJ) and ‌U+200C ZERO WIDTH NON-JOINER (ZWNJ) are used to control the joining behaviour of cursive glyphs. They are particularly useful in educational contexts, but also have real world applications.

ZWJ permits a letter to form a cursive connection without a visible neighbour. For example, the marker for hijri dates is an initial form of heh, even though it doesn't join to the left, ie. ه‍. For this, use ZWJ immediately after the heh, eg. الاثنين 10 رجب 1415 ه‍..

ZWNJ prevents two adjacent letters forming a cursive connection with each other when rendered. For example, it is used in Persian for plural suffixes, some proper names, and Ottoman Turkish vowels. Ignoring or removing the ZWNJ will result in text with a different meaning or meaningless text, eg, تن‌ها is the plural of body, whereas تنها is the adjective alone.2 The only difference is the presence or absence of ZWNJ after noon.
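
The two examples above can be written out as code point sequences. This is a minimal Python sketch with hypothetical variable and function names. It shows that the Persian pair differs only by the ZWNJ, so a process that strips the character silently changes one word into the other.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

# Hijri date marker: heh followed by ZWJ, which requests the initial form.
hijri_marker = "\u0647" + ZWJ

# Persian: teh, noon, ZWNJ, heh, alef (plural of 'body') vs. the
# same letters without ZWNJ ('alone').
plural_bodies = "\u062A\u0646" + ZWNJ + "\u0647\u0627"
alone = "\u062A\u0646\u0647\u0627"

def strip_zwnj(text):
    """Remove every ZWNJ. Shown only to illustrate the loss of meaning."""
    return text.replace(ZWNJ, "")
```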

͏U+034F COMBINING GRAPHEME JOINER is used in Arabic to produce special ordering of diacritics. The name is a misnomer, as it is generally used to break the normal sequence of diacritics.

Context-based shaping & positioning

Are special glyph forms needed, depending on the context in which a character is used? Do glyphs interact in some circumstances? Are there requirements to position diacritics or other items specially, depending on context? Does the script have multiple diacritics competing for the same location relative to the base?

Context-based shaping

See just above for shaping related to cursive joining.

In all but the most basic fonts, glyph shapes are highly variable for Arabic letters. For example, Figure 15 shows a wide variety of shapes produced by default in the Mishafi font for كU+0643 LETTER KAF when followed by various letters.

كا  كع كغ كح كخ كق كف كط كه كم كر كو كؤ كد كي كب كن كص كس
Glyph variation in the Mishafi font.

Teeth letters

A good font will constantly change the shape of glyphs slightly so as to create a more aesthetically pleasing, and in some cases a more easily readable, flow.

ـدد تتـ سسـ
Three examples where the same letter is repeated, but the glyph shapes differ.

Figure 17 shows an example where the same word is displayed using different fonts.2 The font on the left applies rules to distinguish the letter bases clearly. Note, in particular, that although there are 3 letters which are repeated, none of those letters uses the same shape twice.

بتثبيتين

The same word in two different fonts (Mishafi and Scheherazade).

Special joining forms

In more traditional fonts, you will also often see the join between certain characters above the baseline. Compare the highlighted character joins in Figure 18, showing the same sequence of letters but with joins above vs. along the baseline. (The first font is Mishafi, and the second Scheherazade New.)

نين خبراء
نين خبراء
Font-based differences in joining.

But actually a good font will typically have a range of shapes and placements for a given letter, depending on the adjoining letter. This is illustrated in Figure 19.2→

نم نمل نجر نسيم نبات

Various different forms for the initial letter noon.

Characters within a word may also combine vertically in certain groupings. See the example in Figure 20.

خجمم

Vertically arranged letters in a word.

Ligatures

Ligated glyph forms are common in Arabic. Some, such as لا, are mandatory. Most of the remainder depend on the font style. The lam-alif ligature also affects other characters that are based on the alif, such as لإ لأ لآ.

Traditional fonts tend to have more optional ligated forms than modern styles.

المؤتمر  vs.  المؤتمر

The same word with ligatures (right) and no ligatures (left).

Ligatures are often used to manage justification. Since they generally reduce the horizontal width of a word, they can be used to fit more text on the end of a line, or balance baseline stretching.

Context-based positioning

When vowel or shadda diacritics are used they can be placed in different positions, according to the context.

يتكلّم تسجّل
The position of the shadda diacritic depends on the height of the base character in many fonts.

When both shadda and vowel signs are combined with a base character, a more complicated set of rules may be applied. Depending on the font used, some vowel diacritics may be placed relative to the shadda diacritic, rather than relative to the base character.

مَمِمّمَّمِّ

When kasra and shadda diacritics appear together, the kasra may be below the base character (right), or below the shadda (left), depending on the font.

Letterform slopes, weights, & italics

See type samples.

Italics & oblique

Arabic text does use slanting letters. In some cases the letters may be slanted to the left as in Figure 24.

Left-leaning italics for جريدة العجب الدولية

The text just below this newspaper title leans to the left.

Case & other character transforms

Is the orthography bicameral? Are there other character pairings, especially when transforms are needed to convert between the two?

Arabic has no case distinction.

However, as mentioned in Numbers, dates, currency, etc., Arabic sometimes uses ASCII digit glyphs and other times uses local digit glyphs. Some fonts and authoring applications allow you to choose which glyph shapes to use for the same underlying characters.

Arabic fonts may also have alternative shapes for glyphs, which can be turned on in certain circumstances. For example, some fonts have a set of swash forms for certain characters, which can be used for justification, or just for effect.

Jalt table
The jalt table in the Arabic Typesetting font contains alternative elongated forms. (source)

Typographic units

Word boundaries

Are words separated by spaces, or other characters? Are there special requirements when double-clicking on the text? Are words hyphenated?

The concept of 'word' is difficult to define in any language (see What is a word?). Here, a word is a vaguely-defined, but recognisable semantic unit that is typically smaller than a phrase and may comprise one or more syllables.

Words are separated by spaces.

In Arabic, small words like 'and' (و) are written alongside the following word with no intervening space (eg. الجامعات والكليات means 'universities and colleges', but there is only one space). Such small words are handled typographically as part of the word they are attached to.
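
As a small illustration in Python, splitting the example phrase on spaces yields two units, and the conjunction stays attached to the second one. The variable names are hypothetical.

```python
# 'universities and colleges': two space-separated units.
phrase = (
    "\u0627\u0644\u062C\u0627\u0645\u0639\u0627\u062A"    # al-jami'at
    " "
    "\u0648\u0627\u0644\u0643\u0644\u064A\u0627\u062A"    # wa + al-kulliyat
)

words = phrase.split()
WAW = "\u0648"  # the conjunction 'and'
```

A process that needs to find the conjunction as a separate item therefore cannot rely on spaces alone.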

Graphemes

A grapheme is a user-perceived unit of text. Text operations that use graphemes as a unit of text include line-breaking, forwards deletion, cursor movement & selection, character counts, text spacing, text insertion, justification, case conversions, and sorting. The Unicode Standard uses generalised rules to define 'grapheme clusters', which approximate the likely grapheme boundaries in a writing system; however, they don't work well with many complex scripts.

The term orthographic syllable is not clearly defined in the Unicode Standard. In the orthography notes on this site we define it to mean a typographic unit that includes more than one grapheme cluster. This is commonly the case for Brahmi-derived scripts, such as for Devanagari conjuncts, or Balinese stacks. Orthographic syllables do not correspond to phonetic syllables.

Grapheme clusters

In most cases Arabic text uses precomposed characters and omits vowels. Therefore grapheme boundaries are consistent with individual letters. Where this is not the case, the additions are combining marks, and the Unicode grapheme cluster is designed to span combinations of base character plus any number of following combining marks.
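
The 'base character plus following combining marks' idea can be sketched in a few lines of Python. This is a deliberate simplification and not the full Unicode segmentation algorithm: it ignores joiner controls and other special cases. The function name is hypothetical.

```python
import unicodedata

def simple_clusters(text):
    """Group each base character with the combining marks that follow it.

    A rough approximation of grapheme clusters: any code point whose
    general category starts with 'M' is attached to the preceding unit.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# The same word with and without vowel and shadda diacritics.
vowelled = "\u0623\u064E\u0646\u0652\u062A\u064F\u0646\u0651\u064E"
unvowelled = "\u0623\u0646\u062A\u0646"

vowelled_units = simple_clusters(vowelled)
unvowelled_units = simple_clusters(unvowelled)
```

Both spellings produce the same number of units, even though the vowelled form contains more than twice as many code points.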

Larger typographic units

One potential complication is that fonts often render sequences of characters as ligated forms. The ligated forms are a font-specific feature, whereas grapheme clusters are based on code point sequences: some fonts may display the same sequence of characters without a ligated form. Most applications tend to move character by character through the text, producing situations like the cursor position in Figure 26.

لم بلده كما
The cursor positioned between k and m in the ligated form for كما kmɑ. (source)

This approach allows for easy deletion or insertion of any component of a ligated form.

Many Brahmi-derived scripts are segmented by units that incorporate more than one grapheme cluster, for operations such as in-word line-breaking, justification, letter-spacing, and initial letter highlighting. It is not clear whether such typographic units are needed for Arabic language text, since there is usually no hyphenation, and no initial letter highlighting, and letter-spacing and justification follow quite different rules, extending the baseline. It may be worth looking at the very rare examples of vertically-set lines with upright Arabic letters to check whether ligatures like lam-alif or others are kept together. (My expectation is that lam-alif is not split, but the others may be.)

Apart from lam-alif, where one can expect a rule to apply consistently, another issue is that an application that wants to keep ligatures together as a single unit would have to be aware of the rendering behaviour of the particular font in use, since some fonts have ligatures for a given code point sequence and others don't. There is no way of deriving this information from the code point sequence itself, since that is always exactly the same.

Browser behaviour

Test in your browser. The words test units that equate to grapheme clusters only, and others that include conjuncts. First, the text is displayed in a contenteditable paragraph, then in a textarea. Results are reported for Gecko (Firefox), Blink (Chrome), and WebKit (Safari) on a Mac.

أنتن أَنْتُنَّ الإسلام أسو

The last word on each line (only) has a decomposed sequence for the lam+hamza.

Cursor movement. Move the cursor through the text.
Gecko, Blink, and WebKit browsers step through the text using grapheme clusters. This means that it takes 2 steps to get past the lam-alif ligature. The decomposed sequence in the last word is treated like any other grapheme cluster.

Selection. Place the cursor next to a character and hold down shift while pressing an arrow key.
The behaviour is the same as for cursor movement.

Deletion. Forward deletion works in the same way as cursor movement. The backspace key deletes code point by code point, for all browsers.

Punctuation & inline features

This section describes typographic features related to word boundaries, phrase & section boundaries, bracketed text, quotations & citations, emphasis, abbreviation, ellipsis & repetition, inline notes & annotations, other punctuation, and other inline text decoration.

Phrase & section boundaries

What characters are used to indicate the boundaries of phrases, sentences, and sections?


6
،060C
؛061B
:003A
.002E
؟061F
!0021

Arabic language uses a mixture of ASCII and Arabic punctuation. Other languages using the Arabic script may use different punctuation, such as the full stop in Urdu.

phrase

،U+060C COMMA

؛U+061B SEMICOLON

:U+003A COLON

sentence

.U+002E FULL STOP

؟U+061F QUESTION MARK

!U+0021 EXCLAMATION MARK

آخر، … والنساء.

Arabic language text using an Arabic comma, but an ASCII full stop.

Arabic language text also uses U+2010 HYPHEN, U+2013 EN DASH, and U+2014 EM DASH.
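
The mixture of ASCII and Arabic code points in the phrase and sentence marks listed above can be confirmed from their full Unicode names. This is a small Python sketch with hypothetical names. Note that the full names include the ARABIC label that this page drops.

```python
import unicodedata

PHRASE_MARKS = "\u060C\u061B:"    # comma, semicolon, colon
SENTENCE_MARKS = ".\u061F!"       # full stop, question mark, exclamation mark

def describe(chars):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in chars]

phrase_info = describe(PHRASE_MARKS)
sentence_info = describe(SENTENCE_MARKS)
```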

Bracketed text


both
(0028
)0029

Arabic commonly uses ASCII parentheses to insert parenthetical information into text.

  start end
standard (U+0028 LEFT PARENTHESIS )U+0029 RIGHT PARENTHESIS

خصائصها الفيزيائية (الإشعاعية والحرارية) له أهمية خاصة في أبحاث المناخ

translation

Its physical properties (radiative and thermal) are of particular interest in climate research.

In this text sample, the parenthesis on the right is U+0028 LEFT PARENTHESIS, and the one on the left is U+0029 RIGHT PARENTHESIS (see Mirrored characters).

Mirrored characters

The words 'left' and 'right' in the Unicode names for parentheses, brackets, and other paired characters should be ignored. LEFT should be read as if it said START, and RIGHT as END. The direction in which the glyphs point will be automatically determined according to the base direction of the text.

a > b > c
ا > ب > ج
Both of these lines use > U+003E GREATER-THAN SIGN, but the direction it faces depends on the base direction at the point of display.

The number of characters that are mirrored in this way is around 550, most of which are mathematical symbols. Some are single characters, rather than pairs. The following are some of the more common ones.


12
( 0028
) 0029
< 003C
> 003E
[unused005B
]unused005D
{ 007B
} 007D
« 00AB
» 00BB
 2039
 203A

Presentation forms

Although characters in the Arabic Presentation Forms blocks should not normally be used, the following are sometimes used for Arabic text.


both
infreq.FD3E
﴿infreq.FD3F

Unlike other parentheses, for legacy reasons these are not automatically mirrored when used in text, so you need to choose the right code point based on the expected glyph shape.
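
Whether a character is mirrored can be checked with the mirrored function in Python's unicodedata module, which reports the Unicode Bidi_Mirrored property. The helper names below are hypothetical, and the total returned by the counting function depends on the Unicode version shipped with the interpreter.

```python
import sys
import unicodedata

def is_mirrored(ch):
    """True if the character has the Bidi_Mirrored property."""
    return unicodedata.mirrored(ch) == 1

def count_mirrored():
    """Count all code points with Bidi_Mirrored in this Unicode version."""
    return sum(unicodedata.mirrored(chr(cp)) for cp in range(sys.maxunicode + 1))

ascii_paren = is_mirrored("(")
guillemet = is_mirrored("\u00AB")
ornate_left = is_mirrored("\uFD3E")   # ORNATE LEFT PARENTHESIS
ornate_right = is_mirrored("\uFD3F")  # ORNATE RIGHT PARENTHESIS
```

The ASCII parenthesis and the guillemet report as mirrored, while the two ornate parentheses do not, which is why the code point has to be chosen by glyph shape for the latter.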

Quotations & citations

What characters are used to indicate quotations? Do quotations within quotations use different characters? What characters are used to indicate dialogue? Are the same mechanisms used to cite words, or for scare quotes, etc? What about citing book or article names?

See type samples.


8
201D
201C
2019
2018
«00AB
»00BB
2039
203A

Two different styles of quotation mark can be found in Arabic language texts. When quoted text appears within quoted text different characters are used, though usually of the same type. (Of course, depending on ease of input, quotations may also be surrounded by ASCII double and single quote marks.) Spacing inside the marks is optional.19

  start end
primary «U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK »U+00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
nested ‹U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK ›U+203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK

Because they are mirrored, when using the guillemets, LEFT should be read as if it said START, and RIGHT as END. The guillemet shapes are typically rounded, as shown in Figure 30.

  start end
primary ”U+201D RIGHT DOUBLE QUOTATION MARK “U+201C LEFT DOUBLE QUOTATION MARK
nested ’U+2019 RIGHT SINGLE QUOTATION MARK ‘U+2018 LEFT SINGLE QUOTATION MARK

Unlike the guillemets, these quote marks are not mirrored during display. As a result, LEFT means use on the left, and RIGHT means use on the right.
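
The practical consequence for anyone generating quoted text is that the code point stored first differs in kind between the two styles. The following Python sketch wraps text in primary quotation marks in logical (typing) order. The dictionary and function names are hypothetical.

```python
import unicodedata

# (start, end) in logical order for right-to-left text.
QUOTE_STYLES = {
    # Mirrored: LEFT-POINTING is read as START.
    "guillemets": ("\u00AB", "\u00BB"),
    # Not mirrored: the RIGHT mark is the one displayed on the right,
    # which is where RTL text starts.
    "curly": ("\u201D", "\u201C"),
}

def quote(text, style="guillemets"):
    """Wrap text in the primary quotation marks of the given style."""
    start, end = QUOTE_STYLES[style]
    return f"{start}{text}{end}"

sample = "\u0633\u0644\u0627\u0645"  # an arbitrary Arabic word
with_guillemets = quote(sample)
with_curly = quote(sample, "curly")
```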

Sometimes these styles are mixed in the same text. The example in Figure 30 uses both in a single sentence.

 يعني اسم ”سيسيميوت“ «المستعمرة القريبة من أرض بها ثعالب».

A sentence containing 2 types of quotation mark.

Emphasis

How are emphasis and highlighting achieved? If lines are drawn alongside, over or through the text, do they need to be a special distance from the text itself? Is it important to skip characters when underlining, etc? How do things change for vertically set text?

Emphasis can sometimes be expressed by stretching the baseline of one or more words. See the section on justification below for more information about baseline stretching.

Abbreviation, ellipsis & repetition

What characters are used to indicate abbreviation, ellipsis & repetition?

Arabic uses U+2026 HORIZONTAL ELLIPSIS.

Text decoration & other inline features

Any other form of highlighting or marking of text, such as underlining, numeric overbars, etc. What characters or methods (eg. text decoration) are used to convey information about a range of text? If lines are drawn alongside, over or through the text, do they need to be a special distance from the text itself? Is it important to skip characters when underlining, etc? How do things change for vertically set text?

Underlines & overlines

Underlines and overlines in Arabic text are usually further from the baseline than they are in Latin text. This is because the Arabic letters extend further from the baseline, and because there are also sometimes diacritics beyond those long extensions. Typically, the line will be drawn so that it is further from the baseline than any other glyphs reach.

Underlining of Arabic usually clears the long descenders and their diacritics.

In some cases, however, while still keeping the line further from the baseline than in Latin text, typographers don't clear the glyphs. In this case, the lines usually skip the ink of the other glyphs.

Underlining of Arabic that skips the ink of some long descenders.

When skipping ink it is important to avoid leaving very short remnants of the line between glyphs, since these may look like dots or diacritics.

Ink-skipping during underline that can create confusing marks.

The Qur'an tends to use overlines, rather than underlines.

An example of the use of an overline in the Qur'an.

Other punctuation

Punctuation not already mentioned, such as dashes, connectors, separators, scare quotes, etc.

Other punctuation marks used in Arabic include the following.


3
   2010
 2013
 2014

both
؍infreq.  /060D
٭infreq.  *066D

Line & paragraph layout

This section describes typographic features related to line breaking & hyphenation, text alignment & justification, text spacing, baselines, line height, counters, lists, and styling initials.

This section focuses mainly on Arabic language text, however attention is sometimes drawn to differences when the Arabic script is used for other languages.

Line breaking & hyphenation

Are there special rules about the way text wraps when it hits the end of a line? Does line-breaking wrap whole 'words' at a time, or characters, or something else (such as syllables in Tibetan and Javanese)? What characters should not appear at the end or start of a line, and what should be done to prevent that? Is hyphenation used, or something else? What rules are used? What difficulties exist?

Lines are normally broken at word boundaries.

They are not broken at the small gaps that appear where a character doesn't join on the left.

In-word line-breaking

Hyphenation isn't used for the Arabic language; however, other languages using the Arabic script may hyphenate (such as Uighur).2

Line-edge rules

As in almost all writing systems, certain punctuation characters should not appear at the end or the start of a line. The Unicode line-break properties help applications decide whether a character should appear at the start or end of a line.

Show default line-breaking properties for characters in this orthography.

The following list gives examples of typical behaviours for characters affected by these rules. Context may affect the behaviour of some of these and other characters.

  • « “ ‘ (   should not be the last character on a line
  • » ” ’ ) . ، ؛ ؟ !   should not begin a new line
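
The two rules in the list above can be expressed as a simple check on the characters either side of a proposed break. This Python sketch encodes only those two bullet points and is not an implementation of the Unicode line-breaking algorithm. The names are hypothetical.

```python
# Characters that should not be the last character on a line.
NO_BREAK_AFTER = set("\u00AB\u201C\u2018(")
# Characters that should not begin a new line.
NO_BREAK_BEFORE = set("\u00BB\u201D\u2019).\u060C\u061B\u061F!")

def violates_edge_rule(before, after):
    """True if breaking between these two characters would leave a
    prohibited character at the end or the start of a line."""
    return before in NO_BREAK_AFTER or after in NO_BREAK_BEFORE

ALEF = "\u0627"
after_open_paren = violates_edge_rule("(", ALEF)
before_arabic_comma = violates_edge_rule(ALEF, "\u060C")
after_space = violates_edge_rule(" ", ALEF)
```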

Breaking between Latin words

When a line break occurs in the middle of an embedded left-to-right sequence, the items in that sequence need to be rearranged visually so that it isn't necessary to read lines upwards.

Figure 35 shows how two Latin words are apparently reordered in the flow of text to accommodate this rule. Of course, the rearrangement is only that of the visual glyphs: nothing affects the order of the characters in memory.

Text with no line break in Latin text.

Text with line break in Latin text.

In this Arabic language text, the lower of these two images shows the result of decreasing the line width, so that text wraps between a sequence of Latin words.

Text alignment & justification

Does text in a paragraph need to have flush lines down both sides? Does the script allow punctuation to hang outside the text box at the start or end of a line? Where adjustments are needed to make a line flush, how is that done? Does the script shrink/stretch space between words and/or letters? Are word baselines stretched, as in Arabic? What about paragraph indents?

See type samples.

Arabic script justification can be implemented using a number of different techniques, which ideally are applied in combination. These include:

(In hand-written manuscripts it is also possible to find instances where the letters that would appear at the end of the line are squeezed above the last word in the line, or hang into the margin.)

The application of the various techniques is generally subject to rules governing the frequency and location of use of particular methods. Rules can differ by writing style – for example, elongation is not normally used at all for ruq'a fonts. Where baseline stretching is applied, the rules for what can be stretched, and how much, are complicated. Unlike space-based width adjustments, baseline extension is not a question of simply adding equal-length extensions across the line. The rules tend to differ across orthographies, and eminent typographers of the past also had their own preferred or idiosyncratic rules.

Justified Arabic text.

An example from a newspaper column of justified text.

The baseline extension character ـU+0640 TATWEEL is sometimes suggested as a way of producing justification by extending the baseline. However, when a browser window is resized, or when new text is added near the start of a paragraph, lines wrap differently and all the places where tatweel would be needed have to be recalibrated. Thus tatweels only work for static text with fixed dimensions.
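
Because hard-coded tatweels become stale as soon as text reflows, one possible preprocessing step is to remove them before passing text to a layout system that does its own justification. That workflow is an assumption made here for illustration and is not something this page prescribes. The function name is hypothetical.

```python
import unicodedata

TATWEEL = "\u0640"  # ARABIC TATWEEL

def strip_tatweel(text):
    """Remove every tatweel character, leaving the letters unchanged."""
    return text.replace(TATWEEL, "")

stretched = "\u0645\u0640\u0640\u062D\u0645\u062F"  # meem, 2 tatweels, hah, meem, dal
plain = strip_tatweel(stretched)
```

Since tatweel is an ordinary character in the text, the same function could also serve when comparing strings that differ only in manual stretching.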

Better quality justification systems stretch glyphs, rather than adding baseline extensions. This dynamic stretching of glyphs is often called 'kashida'. In some typesetting systems, such as InDesign, the stretching can be produced automatically without the need for tatweel characters. InDesign has controls to vary the preferred length of the extensions.

بعد زبع قرن من الغياب والشوق لرؤية الاهل توفيت الراكبة جورجيت بشير (69 عاما) وهي كندية من اصل مصري، داخل صالة يرانزيت مطار القاهرة بعد اول زيارة لها الى مصر متئثرة بإصابتها بهبوط حاد في الدورة الدموية قبل دقائق من صعودها الى الطائرة المصرية المتجهة الى الولايات المتحدة الاميركية.

The same text, but produced automatically by InDesign, without the use of tatweel.

Note that the result of the automatic justification in Figure 37 is different from that in the newspaper clipping. For one thing, the kashida effect is only applied once per word (but is applied to most words). The rules determining which combinations of characters receive baseline stretching, and the extent of that stretching also differ.

InDesign also allows fonts to substitute long swash variants for certain characters, which soak up some more of the horizontal space.

بعد زبع قرن من الغياب والشوق لرؤية الاهل توفيت الراكبة جورجيت بشير (69 عاما) وهي كندية من اصل مصري، داخل صالة يرانزيت مطار القاهرة بعد اول زيارة لها الى مصر متئثرة بإصابتها بهبوط حاد في الدورة الدموية قبل دقائق من صعودها الى الطائرة المصرية المتجهة الى الولايات المتحدة الاميركية.

The last 2 lines of the previous example, showing swash forms.

Well justified text would apply a mixture of swash characters, space stretching, kashidas and ligatures to achieve a visually appealing and effective justification. Also, the baseline stretching in Figure 37 is flat. A more advanced system would instead produce elegantly curved kashidas more like handwritten text.

محمد

Curvilinear kashidas. (source)

It is very common to see baseline stretching in modern Arabic text where a word or phrase is stretched to fill a particular space. For example, the Arabic tag line (الابداع المتجدد Creativity renewed) below the word Lexus in Figure 40 is stretched to be the same width.

Arabic text stretched to fit the width of the word Lexus.
Arabic text being stretched to fit the width of text alongside it.

Observation: Text that is stretched in this manner very often has multiple kashidas per word. This is perhaps understandable, given that usually only a small number of words are involved.

Text spacing

This section looks at ways in which spacing is applied between characters over and above that which is introduced during justification. For example, does the orthography create emphasis or other effects by spacing out the words, letters or syllables in a word? (For justification-related spacing, see Text alignment & justification, above.)

See type samples.

Spaces are not added between characters, with the exception of micro-spacing during justification, which is applied to word-medial letters that don't join to the left. On the other hand, the baseline within words is often stretched.

It is quite common to see Arabic text stretched to fit a given width, as shown in Figure 40, but that type of stretching is more akin to justification than the typical letter-spacing that is applied to other scripts. The amount of stretch is determined by the area that needs to be filled.

In some cases, it may be that elongation of words is driven by stretching the distance between letters rather than matching an external template, for example to express emphasis or prolonged sound. However, as for justification, this is not normally based on an even amount of stretch between all letters, as letter-spacing tends to be in other scripts.

Baselines, line height, etc.

Does the script have special requirements for baseline alignment between mixed scripts and in general? Is line height special for this script? Are there other aspects that affect line spacing, or positioning of items vertically within a line?

The alphabetic baseline is a strong feature of Arabic script on the whole, since characters tend to join there. This is not always the case: for example, some adjacent pairs or ligatures have joins above the baseline, and initial letters in some fonts may start slightly above the baseline, but for most cases it remains a strong feature.

The nastaliq writing style, on the other hand, uses arrangements of joined glyphs that cascade downwards from right to left, and resemble a strongly sloping baseline.

مستحق • شخص • کیفیت

Sloping baselines in Urdu nastaliq text.

However, even writing styles with an ostensibly flat baseline may, in good quality fonts, draw words on a slightly slanted baseline, or multiple baselines, as shown in Figure 42 [2].

يستبشر • يستمع
Words with a gradually sloping baseline (left) and multiple baselines (right).

Characters within a word may also combine vertically in certain groupings, as mentioned in the previous section.

Line height and multi-script positioning. Even without the deviations from the baseline described above, the ascenders and descenders of Arabic letters tend to travel further from the baseline than is usual in Latin script text. Allowances for this need to be made in line height settings on a page, and it can also be problematic when combining Latin and Arabic text on the same line using different fonts for each.

If the Arabic font supports the needed Latin letters, the font design will already take into account the relative sizes of the letters, and their placement relative to the baselines of each script. If different fonts are used, though, it's important to match the baselines and harmonise the font sizes used.

Arabic letters have ascenders and descenders that tend to be longer than the Latin ones. Figure 43 shows ascenders and descenders for Arabic letters in the Scheherazade New font. With the addition of diacritics above and below the letters, the line height needs to be significantly higher than for Latin script text.

qhxإِودكٍّخٍ
Font metrics for text in the Scheherazade New font.

Counters, lists, etc.

Are there list or other counter styles in use? If so, what is the format used? Do counters need to be upright in vertical text? Are there other aspects related to counters and lists that need to be addressed?

You can experiment with counter styles using the Counter styles converter. Patterns for using these styles in CSS can be found in Ready-made Counter Styles, and we use the names of those patterns here to refer to the various styles.

The Arabic language uses 1 numeric and 2 fixed styles. Wikipedia lists 2 more styles: an old maghrebi sequence and the hijai sequence.

Numeric

The arabic-indic numeric style is decimal-based and uses these digits [4].


digit value codepoint
٠ 0 U+0660
١ 1 U+0661
٢ 2 U+0662
٣ 3 U+0663
٤ 4 U+0664
٥ 5 U+0665
٦ 6 U+0666
٧ 7 U+0667
٨ 8 U+0668
٩ 9 U+0669

Examples:


counter value codepoints
١ 1 U+0661
٢ 2 U+0662
٣ 3 U+0663
٤ 4 U+0664
١١ 11 U+0661 U+0661
٢٢ 22 U+0662 U+0662
٣٣ 33 U+0663 U+0663
٤٤ 44 U+0664 U+0664
١١١ 111 U+0661 U+0661 U+0661
٢٢٢ 222 U+0662 U+0662 U+0662
٣٣٣ 333 U+0663 U+0663 U+0663
٤٤٤ 444 U+0664 U+0664 U+0664
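As a rough illustration of how the arabic-indic numeric style works, the following Python sketch maps each ASCII digit of a decimal number to the corresponding digit in the range U+0660 to U+0669. The function name `to_arabic_indic` is an assumption made for this example, and negative numbers are left out for simplicity.

```python
# Translation table from ASCII digits 0-9 to U+0660..U+0669.
_TO_ARABIC_INDIC = {ord("0") + i: chr(0x0660 + i) for i in range(10)}


def to_arabic_indic(n: int) -> str:
    """Render a non-negative integer using Arabic-Indic digits."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    return str(n).translate(_TO_ARABIC_INDIC)


for value in (1, 11, 444):
    print(value, to_arabic_indic(value))
```

Because the style is decimal-based, the digits stay in the same logical order as in the ASCII form; only the glyphs change.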

Fixed

The arabic-abjad fixed style uses these letters. It is only able to count to 28 [4].


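letter value codepoints
ا 1 U+0627
ب 2 U+0628
ج 3 U+062C
د 4 U+062F
ه‍ 5 U+0647 U+200D
و 6 U+0648
ز 7 U+0632
ح 8 U+062D
ط 9 U+0637
ي 10 U+064A
ك 11 U+0643
ل 12 U+0644
م 13 U+0645
ن 14 U+0646
س 15 U+0633
ع 16 U+0639
ف 17 U+0641
ص 18 U+0635
ق 19 U+0642
ر 20 U+0631
ش 21 U+0634
ت 22 U+062A
ث 23 U+062B
خ 24 U+062E
ذ 25 U+0630
ض 26 U+0636
ظ 27 U+0638
غ 28 U+063A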

Note that the 5th counter includes a zero-width joiner formatting character. This makes the shape distinguishable from ٥U+0665 -INDIC DIGIT FIVE.

Examples:


counter value codepoint
ا 1 U+0627
ب 2 U+0628
ج 3 U+062C
د 4 U+062F
ك 11 U+0643
ش 21 U+0634
خ 24 U+062E
غ 28 U+063A
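A fixed style is essentially a lookup in a finite list of symbols. The following Python sketch shows that for the arabic-abjad sequence above, including the zero-width joiner on the 5th counter. The names `ARABIC_ABJAD` and `arabic_abjad` are hypothetical, and raising an error beyond 28 is a choice made for this example, not something the style itself prescribes.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

# Code points of the 28 counters, in order, taken from the list above.
_CODEPOINTS = (
    "0627 0628 062C 062F 0647 0648 0632 062D 0637 064A 0643 0644 0645 0646 "
    "0633 0639 0641 0635 0642 0631 0634 062A 062B 062E 0630 0636 0638 063A"
).split()

ARABIC_ABJAD = [chr(int(cp, 16)) for cp in _CODEPOINTS]
# The 5th counter carries a ZWJ so that it is not confused with the digit five.
ARABIC_ABJAD[4] += ZWJ


def arabic_abjad(n: int) -> str:
    """Return the arabic-abjad counter for n, where 1 <= n <= 28."""
    if not 1 <= n <= len(ARABIC_ABJAD):
        raise ValueError("arabic-abjad can only count from 1 to 28")
    return ARABIC_ABJAD[n - 1]


for value in (1, 5, 11, 28):
    print(value, arabic_abjad(value))
```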

The maghrebi-abjad fixed style uses these letters. It is also only able to count to 28. The letters are the same as those used for the arabic-abjad style, but 6 occur in different positions [4].


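letter value codepoints
ا 1 U+0627
ب 2 U+0628
ج 3 U+062C
د 4 U+062F
ه‍ 5 U+0647 U+200D
و 6 U+0648
ز 7 U+0632
ح 8 U+062D
ط 9 U+0637
ي 10 U+064A
ك 11 U+0643
ل 12 U+0644
م 13 U+0645
ن 14 U+0646
ص 15 U+0635
ع 16 U+0639
ف 17 U+0641
ض 18 U+0636
ق 19 U+0642
ر 20 U+0631
س 21 U+0633
ت 22 U+062A
ث 23 U+062B
خ 24 U+062E
ذ 25 U+0630
ظ 26 U+0638
غ 27 U+063A
ش 28 U+0634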

The 5th counter also includes a zero-width joiner formatting character.

Examples:


counter value codepoint
ا 1 U+0627
ب 2 U+0628
ج 3 U+062C
د 4 U+062F
ك 11 U+0643
س 21 U+0633
خ 24 U+062E
ش 28 U+0634
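The relationship between the two abjad sequences can be checked mechanically. The following Python sketch, with variable names chosen only for this example, compares the arabic-abjad and maghrebi-abjad code points listed in this section and reports the positions at which they differ (the zero-width joiner on the 5th counter is omitted, since it is the same in both).

```python
def _letters(codepoints: str) -> list:
    """Turn a space-separated list of hex code points into characters."""
    return [chr(int(cp, 16)) for cp in codepoints.split()]


arabic = _letters(
    "0627 0628 062C 062F 0647 0648 0632 062D 0637 064A 0643 0644 0645 0646 "
    "0633 0639 0641 0635 0642 0631 0634 062A 062B 062E 0630 0636 0638 063A"
)
maghrebi = _letters(
    "0627 0628 062C 062F 0647 0648 0632 062D 0637 064A 0643 0644 0645 0646 "
    "0635 0639 0641 0636 0642 0631 0633 062A 062B 062E 0630 0638 063A 0634"
)

# 1-based positions where the two sequences use a different letter.
differing = [i for i, (a, m) in enumerate(zip(arabic, maghrebi), start=1) if a != m]
print(differing)  # [15, 18, 21, 26, 27, 28]

# Both styles draw on exactly the same set of 28 letters.
print(sorted(arabic) == sorted(maghrebi))  # True
```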

Prefixes and suffixes

Arabic lists generally use a full stop suffix as a separator.

Comparison of lists

The table below shows the differences between fixed counter styles for the Arabic and Persian languages. In addition to the styles described above are two other sequences that are mentioned in Wikipedia – an old maghrebi sequence and the hijai sequence.

A cell marked - uses the same letter as the nearest filled cell above it in the same column. Rows that stop at position 28 have no counters for positions 29 to 32.

position 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
persian-abjad ا ب ج د ه‍ و ز ح ط ی ک ل م ن س ع ف ص ق ر ش ت ث خ ذ ض ظ غ
arabic-abjad - - - - - - - - - ي ك - - - - - - - - - - - - - - - - -
maghrebi-abjad - - - - - - - - - - - - - - ص - - ض - - س - - - - ظ غ ش
WP old maghrebi - - ت ث ج ح خ د ذ ر ز ط ظ ك ل م ن ص ض ع غ ف ق س ش ه و ي
WP hijai - - - - - - - - - - - س ش ص ض ط ظ ع غ ف ق ك ل م ن - - -
persian-alphabetic - - پ ت ث ج چ ح خ د ذ ر ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ه‍ ی
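The rule that an empty cell inherits the letter from the nearest filled cell above it can be sketched in Python as follows. The function name `fill_down` is hypothetical, empty cells are represented here by `None`, and the sample data is only a three-column excerpt (positions 9 to 11 of the persian-abjad and arabic-abjad rows).

```python
from typing import List, Optional


def fill_down(rows: List[List[Optional[str]]]) -> List[List[str]]:
    """Replace each empty (None) cell with the value from the row above."""
    filled: List[List[str]] = []
    for row in rows:
        new_row: List[str] = []
        for col, cell in enumerate(row):
            if cell is None:
                if not filled:
                    raise ValueError("the first row cannot contain empty cells")
                # The row above has already been filled, so it holds the
                # nearest filled value for this column.
                cell = filled[-1][col]
            new_row.append(cell)
        filled.append(new_row)
    return filled


excerpt = [
    ["\u0637", "\u06CC", "\u06A9"],  # persian-abjad, positions 9 to 11
    [None, "\u064A", "\u0643"],      # arabic-abjad: position 9 is empty
]
expanded = fill_down(excerpt)
print(expanded[1])
```

In the expanded arabic-abjad row, position 9 takes its letter from the persian-abjad row, while positions 10 and 11 keep the Arabic forms of the letters.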

Styling initials

Does the script use special styling of the initial letter of a line or paragraph, such as for drop caps or similar? How about the size relationship between the large letter and the lines alongside? Where does the large letter anchor relative to the lines alongside? Is it normal to include initial quote marks in the large letter? Is the large letter really a syllable? Are dropped, sunken, and raised types found?

It is possible to find cases where Arabic enlarges and styles the first character at the beginning of a paragraph, but it is quite rare.

Observation: It is not clear whether it is appropriate to maintain the joining forms of the initial letter and the following letter. A good proportion of the examples seen have the initial letter in a box, in which case it appears to be in isolated form. For further discussion see this thread and this one.

Page & book layout

This section describes typographic features related to general page layout & progression; grids & tables, notes, footnotes, etc, forms & user interaction, and page numbering, running headers, etc.

General page layout & progression

How are the main text area and ancillary areas positioned and defined? Are there any special requirements here, such as dimensions in characters for the Japanese kihon hanmen? The book cover for scripts that are read right-to-left is on the right of the spine, rather than the left. When content can flow vertically and to the left or right, how to specify the location of objects, text, etc. relative to the flow? Do tables and grid layouts work as expected? How do columns work in vertical text? Can you mix blocks of vertical and horizontal text? Does text scroll in a different direction?

Arabic books, magazines, etc., are bound on the right-hand side, and pages progress from right to left.

عنوان كتاب

Binding configuration for Arabic books, magazines, etc.

Columns are vertical but run right-to-left across the page.

Grids & tables

Does the script have special requirements for character grids or tables?

Tables, grids, and other 2-dimensional arrangements progress from right to left across a page.

Forms & user interaction

Are vertical form controls needed? Are scroll bars in an unusual position? Other special requirements for user interaction?

Form controls should display Arabic text from right to left, starting at the right side of the input field. Form controls should also usually be arranged from right to left.

Figure 45 shows some form fields from an Arabic language web page. Note the position of the labels relative to the input fields and the checkbox, mirror-imaging a similar page in English. Note also that the input text in the first field appears to the right of the box.

A set of form fields on an Arabic web page

The position of a scrollbar should depend on the user's environment, not on the content of a page. A non-Arab user viewing a web page in Arabic shouldn't have to look for the scroll bar on the left side of the window. In a system that is set up for an Arab user, however, the scrollbar can appear on the left.

References & sources

[1] Peter T. Daniels and William Bright (1996), The World's Writing Systems, Oxford University Press, ISBN 0-19-507993-0

[2] Behnam Esfahbod, Mostafa Hajizadeh, Najib Tounsi, Richard Ishida, Shervin Afshar, Titus Nemeth, Text Layout Requirements for the Arabic Script (alreq)

[3] GitHub, Shortcomings of Characters Table

[4] Richard Ishida, Ready-made Counter Styles

[5] Jonathan Kew, Proposal to add Arabic-script honorifics and other marks

[6] Library of Congress, Arabic romanization table

[7] John Mace, Beginner's Arabic Script, Hodder & Stoughton Ltd, ISBN 0-340-86016-2

[8] Roozbeh Pournader, Initial and medial forms of Arabic Letter Noon Ghunna, l2/12-381, 2012-11-03

[9] Jack Smart and Francis Altorfer, Teach Yourself Arabic, Oxford University Press, ISBN 978-0-340-86996-3

[10] ScriptSource, Arabic

[11] Unicode Consortium, CLDR, Arabic

[12] Unicode Consortium, The Unicode Standard, Version 13.0, Chapter 9.2: Middle East-I, Arabic, 365-389, ISBN 978-1-936213-16-0

[13] Unicode Consortium, The Unicode Standard, Version 15.0, Chapter 9.2: Middle East-I, Arabic, 365-389, ISBN 978-1-936213-16-0

[14] Unicode Consortium, The Unicode Standard, Version 16.0, Chapter 9.2: Middle East-I, Arabic, 365-389, ISBN 978-1-936213-34-4

[15] Wikipedia, Arabic alphabet

[16] Wikipedia, Arabic phonology

[17] Wikipedia, Arabic script

[18] Wikipedia, Basmala

[19] Wikipedia, Quotation mark

[20] Roozbeh Pournader, Bob Hallissy, Lorna Evans (2024), Unicode® Arabic Mark Rendering

Licence CC-By © r12a.