Updated 7 July, 2024
This page brings together basic information about the Arabic script and its use for the Iranian Persian language. It aims to provide a brief, descriptive summary of the modern, printed orthography and typographic features, and to advise how to write Persian using Unicode.
Richard Ishida, Persian Orthography Notes, 07-Jul-2024, https://r12a.github.io/scripts/arab/pes
تمام افراد بشر آزاد بدنیا میایند و از لحاظ حیثیت و حقوق با هم برابرند. همه دارای عقل و وجدان میباشند و باید نسبت بیکدیگر با روح برادری رفتار کنند.
هر کس میتواند بدون هیچگونه تمایز مخصوصا از حیث نژاد، رنگ، جنس، زبان، مذهب، عقیدهٔ سیاسی یا هر عقیده دیگر و همچنین ملیت، وضع اجتماعی، ثروت، ولادت یا هر موقعیت دیگر، از تمام حقوق و کلیهٔ آزادیهائیکه در اعلامیه ذکر حاضر شده است، بهرهمند گردد.
Source: Unicode UDHR, clauses 1 & 2
The Persian alphabet is a variant of the Arabic script, adapted to meet the needs of the different sounds in the Persian language. It is used to write the Persian language in Iran and Afghanistan.
الفبای فارسی ælefbɒːje fɒːrsiː Farsi alphabet
The former Persian Pahlavi scripts were replaced by the Arabic script, but not until some time after the Arab conquest of Persia. It was introduced by the Saffarid and Samanid dynasties in 9th-century Greater Khorasan.
Persian uses the Arabic script, with extensions to covers its wider repertoire of sounds. The Arabic script is an abjad. This means that in normal use the script represents only consonant and long vowel sounds. See the table to the right for a brief overview of features for the modern Persian orthography.
Persian text runs right-to-left in horizontal lines, but numbers and embedded Latin text are read left-to-right.
It is sometimes written using the nasta'liq style of Arabic writing. Glyphs are more drawn out, and the baseline tends to be sloping from word to word.
The script is cursive, and some basic letter shapes change radically, depending on what they join to.
There is no case distinction. Words are separated by spaces.
Modern Persian uses 36 basic letters to write consonants, some of which are hangovers used to spell words loaned Arabic. For example, there are 2 ways to write t, 3 ways to write s, and 4 ways to write z. Although it is not always easy to guess the vowel sounds in a word, the consonants are largely reliable phonetically. There is mostly a one-to-one correspondance between letters and sounds.
Diacritics are used to indicate vowel absence in consonant clusters, and gemination in vowelled text.
❯ basicV
This orthography is an abjad where vowel sounds are written using a mixture of combining marks and letters in vocalised text, but normally the diacritics are not used (and so it is not possible to accurately read the text unless you recognise the consonant patterns). However these diacritics and other phonetic information can be written where needed, and are regularly used for Qur'anic texts, dictionaries, educational materials, and where the pronunciation needs to be made clear.
Although the vowel diacritics are not normally used, Persian has 3 diacritic code points that can be used to indicate the 'short' vowels, if needed. Long vowels, and word initial and final short vowels are represented using one of 4 consonant letters, although there can occasionally be some ambiguity as to which sounds they represent.
To indicate vowel cluster boundaries and the ezafe conjunction, Persian uses a combining hamza above carrier letters. (A standalone hamza is only used occasionally.) The choice between precomposed and decomposed realisations of characters used for these features has a few complications.
Word-initial standalone vowels are always attached to or preceded by 0627. Word-medial standalone vowels are preceded by a hamza on a carrier.
In vowelled text, 0652 is used to indicate vowel absence in consonant clusters.
A mandatory ligature is used for combinations of lam + alif.
Persian uses native digits, though the code points are different from those used for the Arabic language, and Arabic code points are used for several of the more common punctuation marks.
Because the Arabic script is 'cursive' (ie. joined-up) writing, letters tend to have different shapes depending on whether they join with adjacent letters or not (see cursive). In addition, vowels can be represented using different characters, depending on where in a word they appear.
In scripts such as Arabic, several characters have no left-joining form. In what follows we'll use the characters ي and د to illustrate shapes. The former can join on both sides, but the latter can only join on the right.
Left-joining glyphs are commonly called initial; dual-joining are called medial; and right-joining are called final. Glyphs that don't join on either side are called isolated. However, these glyph shapes can be found in various places within a single word.
Word-initial characters usually have initial glyph shapes (eg. 064A ). However, characters that only join to the right will use an isolated glyph shape (eg. 062F ). Furthermore, words beginning with a vowel are always preceded by a vowel carrier, which is normally ا (eg. 0627 06CC or 0627 064E ).
Word-medial characters will typically join on both sides (eg. 064A ) but those that only join to the right will use a final glyph (eg. 062F ). However, if either of those is preceded by another character that only joins to the right, the glyph shapes rendered will be initial (eg. 064A ) and isolated (eg. 062F ), respectively.
In Persian, final glyph forms can also be found in word-medial even though the character would normally join to the left, usually at morphological boundaries, eg. گرگها . The left joining behaviour is prevented using 200C.
Word-final characters will typically use a final glyph shape (eg. 064A and 062F ). However, if the previous character joins only to the right, they will use isolated glyph shapes (eg.064A and 062F ).
In all this contextual glyph shaping the basic shapes used for a character can vary significantly in a script like Arabic. This also includes some characters that only have ijam dots in certain contexts.
These are sounds of the Iranian Persian language, and more particularly tend to reflect the Tehran dialect.
Click on the sounds to reveal locations in this document where they are mentioned.
Phones in a lighter colour are non-native or allophones. Source Wikipedia.
There are 6 vowel sounds, though there are also allophonic variants. The basic phonemes are as follows:
i, u, and ɒ are regarded as long vowels. e, o, and æ are short.
labial | dental | alveolar | post- alveolar |
palatal | velar | uvular | glottal | |
---|---|---|---|---|---|---|---|---|
stop | p b | t d | k ɡ | ɢ | ʔ | |||
affricate | t͡ʃ d͡ʒ | |||||||
fricative | f v | s z | ʃ ʒ | x ɣ | h | |||
nasal | m | n | ||||||
approximant | l | j | ||||||
trill/flap | r | |||||||
In Iranian Persian ɣ and q have merged into ɣ~ɢ, as a voiced velar fricative ɣ when positioned intervocalically and unstressed, and as a voiced uvular stop ɢ otherwise.wpl,#Phonology
Persian is not a tonal language.
tbd
The following table summarises the main vowel to character assignments, in both vowelled and unvowelled forms.
Each table cell shows word-initial, word-medial, and word-final forms from right to left. The glyphs shown are illustrative; alternative shapes may occur (see joining_forms).
Normal: |
|
||||||
---|---|---|---|---|---|---|---|
Vocalised: |
|
For additional details see vowel_mappings.
In normal (non-vocalised) text a vowel location is simply not indicated where only a diacritic is shown in the table. Because the sukun is also dropped in non-vocalised text, where a mater lectionis remains it only implies a vowel location, since it may alternatively represent a consonant or a glide.
In word-initial position vowels are attached to ا.
Initial u is rare, as are final æ and o.
Final 0647 and 0648 are only treated as vowels if they follow a consonant sound. If they follow a vowel sound, they revert to their normal consonant value. Unfortunately, since short vowels are only rarely shown, it can sometimes be difficult to tell from the written text whether to pronounce these letters as vowel or consonant.
This is the full set of characters needed to represent the Persian language vowels.
In situations where it is necessary to unambiguously indicate the underlying vowel sounds, the following diacritics can be added to base letters.
The doubled vowel diacritic, ◌ً [U+064B ARABIC FATHATAN] is used at the ends of certain Arabic-derived adverbs in vowelled text. It is usually written over an alif, although the vowel sound is short. Examples, لزوماً اصلاً
Other doubled vowel diacritics, ◌ٌ [U+064C ARABIC DAMMATAN] and ◌ٍ [U+064D ARABIC KASRATAN] are not used in Persian, but are taught to support education in the Qur'anwpa,#Tanvin_(nunation).
ٓ [U+0653 ARABIC MADDAH ABOVE] is only found in decomposed text, and is associated only with alef. See آ [U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE].
Persian is normally written without vowel diacritics. As a rule, only long vowels are represented by matres lectionis, except that all vowels in word-initial position are written with or preceded by ا [U+0627 ARABIC LETTER ALEF]. Another exception is a final e, which is written with ه [U+0647 ARABIC LETTER HEH]. (Final a and o are also represented by a consonant, but are rarely found.)
The table shows what sounds each letter may represent in non-vocalised text. Where there are gaps, the sound is just not written.
initial | medial | final | |||
---|---|---|---|---|---|
ایـ | iː | ـیـ | iː | ـی | iː |
او | o uː | و | uː | و | o uː ow |
ـه | e | ||||
ا | o e æ | ا | ɑː | ا | ɑː |
آ | ɑː |
The vowels e oand æ are not marked in medial position, and with the exception of e generally do not occur in word-final position. However, one common word in the Tehran dialect of Persian that does end with æ is نه næ no
آ [U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE] is included in the table because it is normally represented by a single, precomposed character.
See also vowel_mappings.
Ezāfe is a grammatical particle used to link words together. It is used between adjectives and nouns. Between a sequence of nouns it is similar in use to the word 'de' in French. It is pronounced ɛ~e or (after a vowel) jɛ.
Ezafe can be written in three ways in vowelled text, however in normal Persian text only the third is visible, because the diacritics are omitted.pm,41
ِ [U+0650 ARABIC KASRA] | After a word-final consonant. | گازِ طبیعی |
ٔ [U+0654 ARABIC HAMZA ABOVE] | After a word that ends with a short vowel (usually in the combination هٔ [U+0647 ARABIC LETTER HEH + U+0654 ARABIC HAMZA ABOVE]). | نشانهِٔ سَجاوَندی |
یِ [U+06CC ARABIC LETTER FARSI YEH + U+0650 ARABIC KASRA] | After a word that ends with a long vowel. | نشانههایِ سجاوندی |
Word-medial Persian vowels are sometimes pronounced without an intervening consonant, but they are normally written as if they are separated by a glottal stop. The glottal stop is written using a hamza and its carrier (see hamza).
زئوس
سؤال
Word-initial standalone vowels are always attached to or preceded by 0627, eg.
اومدن
ایرانی
Long vowels are generally distinguished from short vowels by the use of matres lectionis (see otherV).
When text is vowelled, 0652 can be used over a consonant to indicate that it is not followed by a vowel sound, however, like short vowel diacritics, it is only used when it is necessary to disambiguate pronunciation.
اسم
نستعلیق
This section maps Modern Persian vowel sounds to common graphemes in the Arabic orthography. Persian follows Arabic in providing diacritics to express short vowel sounds, but also rarely uses them in normal text.
The columns run right to left and indicate typical word-initial, word-medial, and word-final usage. The joining forms shown are illustrative; alternative shapes may occur (see joining_forms). The examples show normal unvowelled usage as well as vowelled where a difference exists.
Click on a grapheme to find other mentions on this page (links appear at the bottom of the page). Click on the character name to see examples and for detailed descriptions of the character(s) shown.
06CC
گیتی
06CC
تیر
0627 06CC
این
0648
زانو
0648
نور
0627 0648
اورمزد
0650 0647
کهنه
0650
کرم
0627 0650
اسم
064F 0648
کادو
064F
شتر
0627 064F
امید
064E 0647
064E
تبر
0627 064E
اسب
0627
گرگها
0627
مار
0622
آسمان
و [U+0648 ARABIC LETTER WAW] and ی [U+06CC ARABIC LETTER FARSI YEH] represent both consonants and vowels. See mapToVowels.
Due to the influence of Arabic spelling in loan words, Persian has 2 letters for t, 3 letters for s, 4 letters for z, and 2 letters for h. The most common letter for s is س [U+0633 ARABIC LETTER SEEN], and for z is ز [U+0632 ARABIC LETTER ZAIN].
In Persian, a glottal stop is commonly written using ع [U+0639 ARABIC LETTER AIN], eg. معنوی عربی شیعه
In other places, Persian uses a hamza (see hamza).
فائده
مؤثر
The hamza (called hamze in Persian) is used to represent a glottal stop, however rather than being written as a single letter it is normally written as a combination of a diacritic and a base letter.
In most cases, that is the combination ـیٔـ [U+06CC ARABIC LETTER FARSI YEH + U+0654 ARABIC HAMZA ABOVE], eg. مسئول
However, if the glottal stop is preceded by the sound o, the combination is ـؤـ [U+0624 ARABIC LETTER WAW WITH HAMZA ABOVE], eg. مؤمن
Although it may not be pronounced, the hamza and its carrier also appears between vowels. For example, فائده زئوس مؤثر سؤال سوئیس
In the examples above you can see that the hamza+carrier provides a place to put a vowel diacritic for short vowels in vowelled text. Between two long vowels it represents a nominal glottal stop.
In the i sound of the indefinite ending, the hamza may also be used, or alternatively the ending may double the YEH after a long vowel, eg. compare: پایٔی پایی مویٔی مویی
On more rare occasions the hamza may appear over ا [U+0627 ARABIC LETTER ALEF]. This makes it clear that the alef represents a glottal stop, rather than a long vowel.
رأی
تأکید
It may also appear as an isolated form, ء [U+0621 ARABIC LETTER HAMZA], eg. علاءالدین امضاء
The hamza is also used over short, word-final vowels for ezafe.
A number of precomposed combinations of base letter and hamza are encoded in Unicode. Many of these decompose and recompose under normalisation as canonical alternatives, but a few do not and need to be treated with care. For information about which precomposed characters are used or not used here see hamza_choices.
Consonant clusters in syllable onsets are simply written using a sequence of consonant letters.
No special mechanisms are used to write syllable or word final consonants.
Consonant clusters in Persian are simply written using a sequence of consonant letters. In vowelled text the consonant(s) without a following vowel may carry a sukun (see novowel).
In vowelled text, which is rare, geminated consonants are shown using the diacritic 0651 (called tašdid in Persian).
تپه
اولی
When both tašdid and zir are attached to the same base consonant, a common, though not universal, practice is to display the zir below the tašdid, rather than below the base consonant (see the examples just above). Some fonts, such as Amiri, don't do this. (See also context.)
This section maps Persian consonant sounds to common graphemes in the Arabic orthography.
The right-hand column shows the various joining forms for each character.
Click on a grapheme to find other mentions on this page (links appear at the bottom of the page). Click on the character name to see examples and for detailed descriptions of the character(s) shown.
067E
پنج
067E067E067E⏴
0628
بچه
062806280628⏴
062A
تیر
062A062A062A⏴
0637
طبقه
063706370637⏴
0629 at the end of some Arabic words.
06290629⏴
062F
دود
062F062F⏴
06A9
کوچک
06A906A906A9⏴
06AF
گرگ
06AF06AF06AF⏴
063A between vowels.
063A063A063A⏴
063A
غژگاو
063A063A063A⏴
0642
چقدر
064206420642⏴
0621
امضاء
0621⏴
06CC 0654
مسئول
06CC 065406CC 065406CC 0654⏴
0623 (decomposes)
تأکید
06230623⏴
0624 (decomposes)
مؤثر
06240624⏴
0639
علف
063906390639⏴
0686
چپ
068606860686⏴
062C
جگر
062C062C062C⏴
0641
فراخ
064106410641⏴
0648
وقت
06480648⏴
0633
سیلو
063306330633⏴
0635 in words of Arabic origin.
صاف
063506350635⏴
062B in words of Arabic or Persian origin.
ثبت
062B062B062B⏴
0632
زمین
063206320632⏴
0630
ذهن
063006300630⏴
0636 in words of Arabic origin.
ضخیم
063606360636⏴
0638 in words of Arabic origin.
ظهر
063806380638⏴
0634
شیک
063406340634⏴
0698
ژنو
06980698⏴
062E
خیس
062E062E062E⏴
063A
مرغ
063A063A063A⏴
0647
هلو
064706470647⏴
062D
حیوان
062D062D062D⏴
0645
مینو
064506450645⏴
0646
نیو
064606460646⏴
0648 as a glide after a vowel.
اُوج
06480648⏴
0631
ریشه
06310631⏴
0644
لیمو
064406440644⏴
06CC
یازده
06CC06CC06CC⏴
Modern Arabic text makes use of a relatively large set of invisible formatting characters, especially in plain text, many of which are used to manage text direction. Descriptions of these characters can be found in the following sections:
In the Persian orthography different sequences of Unicode characters may produce the same visual result. Here we look at those, and make notes on usage.
Unicode support for the various uses of the hamza is complicated.u,384 For notes on the usage of the hamza in Persian, see hamza and ezafe. In general, the Unicode Standard recommends to use 0654 in combination with a base character. However, there are a few exceptions to consider.
A number of combinations with the hamza diacritic can be represented as either an atomic character or a decomposed sequence, where the parts are separated in Unicode Normalisation Form D (NFD) and recomposed in Unicode Normalisation Form C (NFC), so both approaches are canonically equivalent. These include the following:
The single code point per vowel-sign is the form preferred by the Unicode Standard and the form in common use for Persian, but either could be found.
This item is a special case. Yeh with a hamza is used for the glottal stop, and also for word medial standalone vowels.
Atomic | Decomposed |
---|---|
ئ [U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE] | ئ [U+064A ARABIC LETTER YEH + U+0654 ARABIC HAMZA ABOVE] |
Persian uses ی and doesn't use ي because the latter produces dots below in all positions, whereas yeh in Persian only has dots below in initial and medial forms. However, the canonical decomposition of 0626 maps to the Arabic yeh and a combining hamza.
Nevertheless, the atomic character is widely used in Persian text. To mitigate the issues, the Unicode Standard recommends that any time ي is combined with a hamza the font should drop the dot glyphs. This ensures that the text looks correct in decomposed form, but applications need to be aware that decomposed text will contain an Arabic yeh which is not otherwise used for Persian.
These cases involve precomposed characters that look identical to the sequences used in Persian, however there is a catch because the precomposed characters have canonical decompositions to letters that are not used in Persian.
Recommended | Not recommended | |
---|---|---|
0647 0654 |
06C0 06C2 |
The Unicode Standard explicitly recommends use of the decomposed sequence when combining a hamza with HEH (for ezafe)u,395. The two atomic characters on the right are problematic because they decompose to sequences containing 06D5 and 06C1, respectively, neither of which are appropriate for Persian. |
06CC 0654 |
0626 0649 0654 08A8 |
For 0626 see hamza_yeh. The Unicode Standard explicitly recommends not to use 0649 0654. Persian doesn't use the alef maksurau,395. 08A8 is a character used for an African orthography which retains the dots for all joining forms. It is therefore also an inappropriate character for Persianu,395. |
Content authors should use a separate hamza for these sequences in Persian, even though the precomposed characters look the same visually, because they don't represent the same semantics, and may introduce problems if text is decomposed. However, because approaches may yield exactly the same result when displayed, applications will need to recognise the precomposed characters and treat them or map them to the appropriate sequence. Input mechanisms, on the other hand, can produce one rather than the other, and that choice should be made with advisement.
The following lists some common errors found in Persian text due to the similarity of Unicode characters, or perhaps sometimes due to problems inputting the correct character. See also the items listed for combinations with hamza just above.
Correct | Incorrect | |
---|---|---|
06CC |
064A 0649 |
064A doesn't drop the dots below the letter in isolate and final shapes, and 0649 doesn't add them for initial and medial shapes. |
0647 |
06C1 06D5 |
Although they look the same, these are all different characters. The 06C1 is used for languages that include Urdu and Kashmiri, whereas 06D5 is used for Uighur and central Asian languages. |
0629 |
06C3 |
Again, characters that a font may render exactly the same, but that are based on different base letters. |
06A9 |
0643 |
Differences in shape of final and isolate forms. |
Although these characters look the same (at least in certain joining forms), they are all different characters, and should not be used interchangeably. If they are used, the ability to search and compare text is impaired unless the application is aware of and takes counter-measures against this substitution.
When typing and in storage, combining marks always follow the base character they are associated with.
In principle, if more than one combining mark appears on the same side of the base character, Unicode expects applications to render the marks such that those marks closer to the base character in memory appear closer to the base character when rendered. (This is called the inside-out rule
.) However, due to the reordering applied by the Unicode normalisation forms, some of the Arabic script diacritics end up in an inappropriate order on display.
For example, if a user types the sequence of characters in fig_amtra, the order of the marks will be changed such that applying the inside-out rule would render the shadda above the vowel (which is incorrect). (In fact, most application renderers have special rules to correct this.)
The Unicode Standard formally addresses this anomaly in the Technical Annex Unicode® Arabic Mark Rendering (AMTRA), with a set of rules for how to render sequences of Arabic characters. The rules generally move shadda, hamza, round dots, etc. so that they are close to the base character.
User input | Post-normalisation output |
---|---|
بُّ ب ّ ُ |
بُ͏ّ ب ُ ّ |
In the rare exceptions where the AMTRA rules should not change the rendering, this can be achieved by placing an invisible 034F character between the combining marks. (In fact, this is what was done to simulate the incorrect appearance in fig_amtra, because otherwise the browser rendering engine would have automatically produced the same output as in the first column. Clicking on the example will show the sequence used.)
Persian uses the extended arabic-indic digits in the Arabic block.
This is a separate set of characters from those used for Arabic, to accommodate different shaping and directional behaviour. Shapes differ from those of Arabic for the digits 4, 5, and 6.
Arabic | |
---|---|
Persian | |
Urdu | |
Sindi |
See expressions for a discussion of how to handle numeric ranges.
Persian may use the Arabic percent sign, ٪ [U+066A ARABIC PERCENT SIGN].
The percent sign is typed and stored after the numbers. Like the numeric sequences using the ASCII hyphen (mentioned in expressions), it will appear to the left of a number if that number is preceded by Persian characters. However, if the digits of the percentage appear alone or at the beginning of a line it is necessary to use an ALM formatting character just before them to prevent the sign appearing on the right.
Observation: Wikipedia uses an ASCII percent sign with ASCII digits
TBD
The name of the currency is usually spelled out:
ریال
The Unicode Standard does have a symbol code point, FDFC, but it is only a compatibility character for use when converting from Iranian standards, and should not be used in normal Unicode textu,379.
The Arabic script uses a number of Unicode characters that affect the way that other characters are rendered.
Persian text makes use of a relatively large set of invisible formatting characters, especially in plain text, many of which are used to manage text direction (see directioncontrols), and others are used to control cursive shaping behaviour (see shapingcontrols).
Persian is written horizontally and right-to-left in the main, but (as with most RTL scripts) numbers and embedded LTR script text are written left-to-right (producing 'bidirectional' text).
The Unicode Bidirectional Algorithm automatically takes care of the ordering for all the text in fig_bidirectional, as long as the 'base direction' (ie. the surrounding directional context) is set to right-to-left (RTL).
Characters are all stored in the order in which they are spoken (and typed). This so-called 'logical' order is then rendered as bidirectional flows by the application at run time, as the text is displayed or printed. The relative placement of characters within a single directional flow is based on strong directional properties (RTL or LTR) assigned to each Unicode character by the Unicode Standard. There exist, however a set of neutral direction property values, mostly for punctuation, where the placement of characters depends on the base direction.
Show default bidi_class
properties for characters in this orthography.
If the base direction is not set appropriately, the directional runs will be ordered incorrectly as shown in fig_base_direction, making it very difficult to get the meaning.
In some circumstances the Unicode Bidirectional Algorithm requires additional assistance to correctly render the directionality of bidirectional text. For such cases the Unicode Standard provides invisible formatting characters for use in plain text. See directioncontrols.
In HTML the base direction and higher level controls can be set using the dir
or bdi
attributes. CSS should not be used to control direction. Unicode formatting codes should also not be used where markup is available.
For more information about how directionality and base direction work, see Unicode Bidirectional Algorithm basics. For information about plain text formatting characters see How to use Unicode controls for bidi text. And for working with markup in HTML, see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.
For authoring HTML pages, one of the most important things to remember is to use <html dir="rtl" … >
at the top of a right-to-left page, and then use the dir
attribute or bdi
tag for ranges within the page, but only when you need to change the base direction. Also, use markup to manage direction, and do not use CSS styling.
For other aspects of dealing with right-to-left writing systems see the following sections:
Unicode provides a set of 10 formatting characters that can be used to control the direction of text when displayed. These characters have no visual form in the rendered text, however text editing applications may have a way to show their location.
U+202B RIGHT-TO-LEFT EMBEDDING] ( [RLE), U+202A LEFT-TO-RIGHT EMBEDDING] ( [LRE), and U+202C POP DIRECTIONAL FORMATTING] ( [PDF) are in widespread use to set the base direction of a range of characters. RLE/LRE come at the start, and PDF at the end of a range of characters for which the base direction is to be set.
More recently, the Unicode Standard added a set of characters which do the same thing but also isolate the content from surrounding characters, in order to avoid spillover effects. They are U+2067 RIGHT-TO-LEFT ISOLATE] ( [RLI), U+2066 LEFT-TO-RIGHT ISOLATE] ( [LRI), and U+2069 POP DIRECTIONAL ISOLATE] ( [PDI). The Unicode Standard recommends that these be used instead.
There is also U+2068 FIRST STRONG ISOLATE] ( [FSI), used at the start of a range to set the base direction according to the first recognised strongly-directional character.
U+200F RIGHT-TO-LEFT MARK] ( [RLM) and U+200E LEFT-TO-RIGHT MARK] ( [LRM) are invisible characters with strong directional properties that are also sometimes used to produce the correct ordering of text.
For more information about how to use these formatting characters see How to use Unicode controls for bidi text. Note, however, that when writing HTML you should generally use markup rather than these control codes. For information about that, see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.
A sequence of numbers separated by hyphens (for example a range) runs from left to right in Persian (unlike Arabic language text).
fig_bidi_range shows some Persian text, which is right-to-left overall, containing a numeric range that is ordered LTR, ie. it starts with ۱۱۶۹ (1169) and ends with ۱۱۷۰ (1170).
Certain types of hyphen or other characters can affect the way expressions run, and you need to be aware of that when writing Persian text. For more details see the Expressions & sequences section in the description of the Arabic language orthography.
This section brings together information about the following topics: writing styles; cursive text; context-based shaping; context-based positioning; baselines, line height, etc.; font styles; case & other character transforms.
You can experiment with examples using the Persian character app.
The orthography has no case distinction, and no special transforms are needed to convert between characters.
Persian may be written in a nasta'liq writing style. Key features include a sloping baseline for joined letters, and overall complex shaping and positioning for base letters and diacritics alike. There are also distinctive shapes for many glyphs and ligatures.
This is achieved in Unicode by applying the correct font – the underlying characters used are not different for nasta'liq vs. other styles.
Persian may also be written in several other styles, especially in artistic and historical writing.
Arabic script joins letters together. Fonts need to produce the appropriate joining form for a code point, according to its visual context. This results in four different shapes for most letters (including an isolated shape). The highlights in fig_cursive below show the same letter, ع [U+0639 ARABIC LETTER AIN], with two different joining forms.
A few Arabic script letters only join on the right-hand side.
There are 2 Unicode blocks containing Arabic presentation forms: these contain individual characters corresponding to the various joining forms and ligatures. With only a handful of exceptions, characters in those blocks should not be used for text content; they are only for managing legacy encodings. Instead, characters in the main Arabic block should be used, and the font will manage the necessary cursive shaping.
Most dual-joining characters add or become a swash when they don't join to the left. A number of characters, however, undergo additional shape changes across the joining forms. fig_joining_forms and fig_right_joining_forms show the basic shapes in Persian and what their joining forms look like.
Two pairs of characters in the first table have base shapes that are identical, but they manage the dots differently in different joining forms. These have been put onto separate rows.
isolated | right-joined | dual-join | left-joined | Persian letters |
---|---|---|---|---|
ب | ـب | ـبـ | بـ | |
ن | ـن | ـنـ | نـ | |
ق | ـق | ـقـ | قـ | |
ف | ـف | ـفـ | فـ | |
س | ـس | ـسـ | سـ | |
ص | ـص | ـصـ | صـ | |
ط | ـط | ـطـ | طـ | |
ک | ـک | ـکـ | کـ | |
ل | ـل | ـلـ | لـ | |
ه | ـه | ـهـ | هـ | |
م | ـم | ـمـ | مـ | |
ع | ـع | ـعـ | عـ | |
ح | ـح | ـحـ | حـ | |
ی | ـی | ـیـ | یـ |
isolated | right-joined | Persian letters |
---|---|---|
ا | ـا | |
ر | ـر | |
د | ـد | |
و | ـو |
U+200D ZERO WIDTH JOINER] ( [ZWJ) and U+200C ZERO WIDTH NON-JOINER] ( [ZWNJ) are used to control the joining behaviour of cursive glyphs. They are particularly useful in educational contexts, but also have real world applications.
ZWJ permits a letter to form a cursive connection without a visible neighbour. It can be used for illustrating cursive joining forms, eg. ان س ان Characters from the Presentation Forms blocks in Unicode should not be used in such cases.
ZWNJ prevents two adjacent letters forming a cursive connection with each other when rendered, eg. انسان
This is particularly useful for Persian, since certain Persian suffixes don't join with word-final letters when they appear finally in a morpheme, eg. خانهها تکیهگاه
Click on the words above to see the composition.
To achieve this, you need to use U+200C ZERO WIDTH NON-JOINER]. It's also possible to sometimes see text where the suffix is written after a space, or simply joined to the end of the word. However, those alternatives are not available when the word ends with [ه [U+0647 ARABIC LETTER HEH].§
U+034F COMBINING GRAPHEME JOINER] is used in Arabic to produce special ordering of diacritics. The name is a misnomer, as it is generally used to break the normal sequence of diacritics. [
Context-based shaping is everwhere in Persian due to the combination of the cursive behaviour of the script plus the strong tendency to arrange joined characters in cascades or vertical arrangements.
As in Arabic, lam followed by alef ligates, and there are other such commonly ligated forms.
لاغر
علاوه
Depending on the font, Arabic letters often have special rules for joining between certain characters, and diacritic marks generally vary in height and horizontal position depending on the size of the base character.
Another example of contextual positioning rules is the placement of ِ [U+0650 ARABIC KASRA] (zir) in vowelled text when it appears on the same letter as ّ [U+0651 ARABIC SHADDA] (tašdid). Usually, zir appears below the base letter, and this is how it can be distinguished from َ [U+064E ARABIC FATHA] (zebar). However, when combined, zir may be placed relative to the shadda diacritic, rather than relative to the base character, as seen in fig_kasra_placement.
Positioning of cursive joining forms is particularly complicated in the nastaliq style. See more details in the Urdu page.
Words are separated by spaces.
tbd
Persian uses a mixture of ASCII and Arabic punctuation.
phrase |
: [U+003A COLON] |
---|---|
sentence | . [U+002E FULL STOP] |
Persian commonly uses ASCII parentheses to insert parenthetical information into text.
start | end | |
---|---|---|
standard | ( |
) |
The words 'left' and 'right' in the Unicode names for parentheses, brackets, and other paired characters should be ignored. LEFT should be read as if it said START, and RIGHT as END. The direction in which the glyphs point will be automatically determined according to the base direction of the text.
The number of characters that are mirrored in this way is around 550, most of which are mathematical symbols. Some are single characters, rather than pairs. The following are some more common ones.
The following quotation marks can be found in Persian texts. (Depending on ease of input, quotations may alternatively be surrounded by ASCII double and single quote marks.)
start | end | |
---|---|---|
primary | » [U+00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK] |
Because they are mirrored, when using these quotation marks, LEFT should be read as if it said START, and RIGHT as END. These quotation marks typically have rounded glyph shapes, rather than sharp angles.
Basic line-break opportunities occur between the space-separated words.
They are not broken at the small gaps that appear where a character doesn't join on the left.
Hyphenation isn't used for the Persian language, however other languages using the Arabic script may hyphenate (such as Uighur).a
As in almost all writing systems, certain punctuation characters should not appear at the end or the start of a line. The Unicode line-break properties help applications decide whether a character should appear at the start or end of a line.
Show default line-breaking properties for characters in this orthography.
The following list gives examples of typical behaviours for characters affected by these rules. Context may affect the behaviour of some of these and other characters.
When a line break occurs in the middle of an embedded left-to-right sequence, the items in that sequence need to be rearranged visually so that it isn't necessary to read lines from bottom to top.
latin-line-breaks shows how two Latin words are apparently reordered in the flow of text to accommodate this rule. Of course, the rearragement is only that of the visual glyphs: nothing affects the order of the characters in memory.
See the section on justification for the Arabic language.
See the section on text spacing for the Arabic language.
The alphabetic baseline is a strong feature of Arabic script on the whole, since characters tend to join there. The nastaliq style of the script, on the other hand, uses arrangements of joined glyphs that cascade downwards from right to left, and ressemble a strongly sloping baseline.
You can experiment with counter styles using the Counter styles converter. Patterns for using these styles in CSS can be found in Ready-made Counter Styles, and we use the names of those patterns here to refer to the various styles.
The Persian language uses 1 numeric and 2 fixed styles.
The persian numeric style is decimal-based and uses these digits.rmcs,#arabic-styles
Examples:
The arabic-abjad fixed style uses these letters. It is only able to count to 28.rmcs,#arabic-styles
Note that the 5th counter includes a zero-width joiner formatting character. This makes the shape distinguishable from ٥ [U+0665 ARABIC-INDIC DIGIT FIVE].
The persian-alphabetic fixed style uses these letters. It is able to count to 32. The letters are arranged by shape.rmcs,#arabic-styles
The 31st counter also includes a zero-width joiner formatting character.
Persian lists generally use a full stop suffix as a separator.
Persian books, magazines, etc., are bound on the right-hand side, and pages progress from right to left.
Columns are vertical but run right-to-left across the page.
Tables, grids, and other 2-dimensional arrangements progress from right to left across a page.
Form controls should display Persian text from right to left, starting at the right side of the input field. Form controls should also usually be arranged from right to left.
fig_form shows some form fields from an Arabic language web page. Note the position of the labels relative to the input fields and the checkbox, mirror-imaging a similar page in English. Note also that the input text in the first field appears to the right of the box.
The position of a scrollbar should depend on the user's environment, not on the content of a page. A non-Persian user viewing a web page in Persian shouldn't have to look for the scroll bar on the left side of the window. In a system that is set up for an Persian user, however, the scrollbar may appear on the left.