Persian orthography notes v32

Basic features

The Arabic script is an abjad, ie. short vowels are not normally written. Persian uses the Arabic script, with extensions to covers its wider repertoire of sounds.

❯ basicV

Vowels Vowels are written using a mixture of combining marks and letters in vocalised text but, because the orthography for the Arabic language is an abjad, the combining mark diacritics are not normally used (and so it is difficult to accurately read the text unless you recognise the consonant patterns). However these diacritics and other phonetic information can be written where needed, and are regularly used for Qur'anic texts, dictionaries, educational materials, and where the pronunciation needs to be made clear.

Post-consonant vowels: Although the vowel diacritics are not normally used, Persian has 3 diacritic code points that can be used to indicate the 'short' vowels, if needed. Long vowels, and word initial and final short vowels are represented using one of 4 consonant letters, although there can occasionally be some ambiguity as to which sounds they represent.

Standalone vowel sounds in word-initial position are always attached to or preceded by 0627. Word-medial standalone vowels are preceded by a hamza on a carrier.

To indicate vowel cluster boundaries and the ezafe conjunction, Persian uses a combining hamza above carrier letters. (A standalone hamza is only used occasionally.) The choice between precomposed and decomposed realisations of characters used for these features has a few complications.

❯ consonantSummary

ConsonantsModern Persian uses 32 basic letters to write consonants, some of which are hangovers used to spell words loaned from Arabic. For example, there are 2 ways to write t, 3 ways to write s, and 4 ways to write z. Although it is not always easy to guess the vowel sounds in a word, the consonants are largely reliable phonetically. There is mostly a one-to-one correspondance between letters and sounds.

A mandatory ligature has to be used for combinations of lam + alif.

The diacritic 0651 indicates gemination in vowelled text.

Vowel absence Vowel absence is indicated in normal Persian text by simply using a sequence of consonant letters, and nothing indicates that there is no vowel sound after a word-final coda.

When text is vowelled, 0652 can be used over a consonant to indicate that it is not followed by a vowel sound. Like other vowel diacritics, this is typically not used in modern text, unless it is necessary to clarify pronunciation.

NumbersPersian uses native digits, though the code points are different from those used for the Arabic language, and Arabic code points are used for several of the more common punctuation marks.

Layout Persian text runs right-to-left in horizontal lines, but numbers and embedded Latin text are read left-to-right. Words are separated by spaces. There is no case distinction.

Persian is sometimes written using the nasta'liq style of Arabic writing. Glyphs are more drawn out, and the baseline tends to be sloping from word to word.

The script is cursive, and some basic letter shapes change radically, depending on what they join to. It is also very common for adjacent characters to ligate and to stretch to fill available space.

Punctuation is a mixture of ASCII and local forms.

Joining forms

Because the Arabic script is 'cursive' (ie. joined-up) writing, letters tend to have different shapes depending on whether they join with adjacent letters or not (see cursive). In addition, vowels can be represented using different characters, depending on where in a word they appear.

In scripts such as Arabic, several characters have no left-joining form. In what follows we'll use the characters ي and د to illustrate shapes. The former can join on both sides, but the latter can only join on the right.

Left-joining glyphs are commonly called initial; dual-joining are called medial; and right-joining are called final. Glyphs that don't join on either side are called isolated. However, these glyph shapes can be found in various places within a single word.

Word-initial characters usually have initial glyph shapes (eg. 064A ). However, characters that only join to the right will use an isolated glyph shape (eg. 062F ). Furthermore, words beginning with a vowel are always preceded by a vowel carrier, which is normally ا (eg. 0627 06CC or 0627 064E ).

Word-medial characters will typically join on both sides (eg. 064A ) but those that only join to the right will use a final glyph (eg. 062F ). However, if either of those is preceded by another character that only joins to the right, the glyph shapes rendered will be initial (eg. 064A ) and isolated (eg. 062F ), respectively.

In Persian, final glyph forms can also be found in word-medial even though the character would normally join to the left, usually at morphological boundaries, eg. گرگ‌ها . The left joining behaviour is prevented using 200C.

Word-final characters will typically use a final glyph shape (eg. 064A and 062F ). However, if the previous character joins only to the right, they will use isolated glyph shapes (eg.064A and 062F ).

In all this contextual glyph shaping the basic shapes used for a character can vary significantly in a script like Arabic. This also includes some characters that only have ijam dots in certain contexts.

Ijam & tashkil

Many Arabic characters share a common base form, and are distinguished by the number and location of dots or other small diacritics, called i'jam.

س ش ݜ ݰ ݽ ݾ ڛ ښ ڜ ۺ — A variety of characters, from various orthographies, that differ only by the ijam associated with the basic shape.

An ijam is a diacritic in the Arabic script that is considered to be an integral part of a basic letter form. Unicode encodes letter+ijam combinations as atomic characters, which are never given equivalent decompositions in the standard.

Other diacritics in the Arabic script mark indicate vocalization of text or other types of phonetic guide that indicate pronunciation. These are referred to as tashkil, and a basic Arabic letter plus any of these types of marks is never encoded as an atomic, precomposed character, but must always be represented as a sequence of letter plus a separate combining mark.

For more information, see Arabic script homographs.

Phonology

These are sounds of the Iranian Persian language, and more particularly tend to reflect the Tehran dialect.

Click on the sounds to reveal locations in this document where they are mentioned.

Phones in a lighter colour are non-native or allophones. Source Wikipedia.

Vowel sounds

There are 6 vowel sounds, though there are also allophonic variants. The basic phonemes are as follows:

i, u, and ɒ are regarded as long vowels. e, o, and æ are short.

Consonant sounds

	labial	dental	alveolar	post- alveolar	palatal	velar	uvular	glottal
stop	p b	t d				k ɡ	ɢ	ʔ
affricate				t͡ʃ d͡ʒ
fricative	f v		s z	ʃ ʒ		x ɣ		h
nasal	m		n
approximant			l		j
trill/flap			r

In Iranian Persian ɣ and q have merged into ɣ~ɢ, as a voiced velar fricative ɣ when positioned intervocalically and unstressed, and as a voiced uvular stop ɢ otherwise.wpl§#Phonology

Tone

Persian is not a tonal language.

Structure

The following information is from Wikipedia.Wikipedia§https://en.wikipedia.org/wiki/Persian_phonology#Phonotactics

The basic syllable structure is:

(C) (S)V(S) (C)(C)

Syllable onsets are optional and normally consist of a single consonant, except for some loan words which may insert an epenthetic æ inserted between consonants.

The nucleus is a vowel, which may be preceded or followed by a semivowel.

The coda is optional, and may consist of one or two consonants. The first can be any consonant, but the second is usually one of d k s t z.

Vowels

Normal:

iː ای‍ ‍ي‍ ‍ی	uː او ‍و ‍و
e ا ‍ه	o ا ‍و
æ ا ‍ه	ɒː آ ‍ا ‍ا

Vocalised:

iː ای‍ ‍ي‍ ‍ی	uː او ‍و ‍و
e اِ ◌ِ ◌ِ‍ه	o اُ ◌ُ ◌ُ‍و
æ اَ ◌َ ◌َ‍ه	ɒː آ ‍ا ‍ا

Each table cell shows word-initial, word-medial, and word-final forms from right to left. The glyphs shown are illustrative; alternative shapes may occur (see joining_forms).

In normal (non-vocalised) text a vowel location is simply not indicated where only a diacritic is shown in the table. Because the sukun is also dropped in non-vocalised text, where a mater lectionis remains it only implies a vowel location, since it may alternatively represent a consonant or a glide.

In word-initial position vowels are attached to ا.

Initial u is rare, as are final æ and o.

Final 0647 and 0648 are only treated as vowels if they follow a consonant sound. If they follow a vowel sound, they revert to their normal consonant value. Unfortunately, since short vowels are only rarely shown, it can sometimes be difficult to tell from the written text whether to pronounce these letters as vowel or consonant.

Post-consonant vowels

This orthography is an abjad where vowel sounds are written using a mixture of combining marks and letters in vocalised text, but normally the diacritics are not used (and so it is not possible to accurately read the text unless you recognise the consonant patterns). However these diacritics and other phonetic information can be written where needed, and are regularly used for Qur'anic texts, dictionaries, educational materials, and where the pronunciation needs to be made clear.

Although the vowel diacritics are not normally used, Persian has 3 diacritic code points that can be used to indicate the 'short' vowels, if needed. Long vowels, and word initial and final short vowels are represented using one of 4 consonant letters, although there can occasionally be some ambiguity as to which sounds they represent.

Simple vowels

The simple vowels that are written in normal Persian text are expressed using matres lectionis (see matres). In vowelled text, additional precision can be added using diacritics (see combiningV).

Matres lectionis

ا,آ,ی,و,ه

Persian is normally written without vowel diacritics. As a rule, only long vowels are represented by matres lectionis, except that all vowels in word-initial position are written with or preceded by ا [U+0627 ARABIC LETTER ALEF]. Another exception is a final e, which is written with ه [U+0647 ARABIC LETTER HEH]. (Final a and o are also represented by a consonant, but are rarely found.)

The table shows what sounds each letter may represent in non-vocalised text. Where there are gaps, the sound is just not written.

initial		medial		final
ایـ	iː	ـیـ	iː	ـی	iː
او	o uː	‍و	uː	‍و	o uː ow
				ـه	e
ا	o e æ	‍ا	ɑː	‍ا	ɑː
آ	ɑː	‍ا	ɑː	‍ا	ɑː

Letters mapped to vowel sounds. (The table should be read right-to-left.)

The vowels e oand æ are not marked in medial position, and with the exception of e generally do not occur in word-final position. However, one common word in the Tehran dialect of Persian that does end with æ is نه næ no

آ [U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE] is included in the table because it is normally represented by a single, precomposed character.

Vowel diacritics

In situations where it is necessary to unambiguously indicate the underlying vowel sounds, the following diacritics can be added to base letters.

◌ِ,◌ُ,◌َ

As mentioned earlier, these diacritics are rarely used in Persian content. They may be used on their own to represent short vowel sounds, or combined with the vowel letters listed above to create other sounds. See how to write the basic vowels in the basicV section.

Other diacritics

◌ً,◌ٌ,◌ٍ,◌ٓ

The doubled vowel diacritic, ◌ً [U+064B ARABIC FATHATAN] is used at the ends of certain Arabic-derived adverbs in vowelled text. It is usually written over an alif, although the vowel sound is short.

eg.

لزوماً

اصلاً

Other doubled vowel diacritics, ◌ٌ [U+064C ARABIC DAMMATAN] and ◌ٍ [U+064D ARABIC KASRATAN] are not used in Persian, but are taught to support education in the Qur'anwpa§#Tanvin_(nunation).

ٓ [U+0653 ARABIC MADDAH ABOVE] is only found in decomposed text, and is associated only with alef. See آ [U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE].

Vowel length

Long vowels are generally distinguished from short vowels by the use of matres lectionis (see combiningV).

Ezafe

ِ,ٔ,ی

Ezāfe is a grammatical particle used to link words together. It is used between adjectives and nouns. Between a sequence of nouns it is similar in use to the word 'de' in French. It is pronounced ɛ~e or (after a vowel) jɛ.

Ezafe can be written in three ways in vowelled text, however in normal Persian text only the third is visible, because the diacritics are omitted.pm§41

ِ [U+0650 ARABIC KASRA]	After a word-final consonant.	گازِ طبیعی
ٔ [U+0654 ARABIC HAMZA ABOVE]	After a word that ends with a short vowel (usually in the combination ‍هٔ [U+0647 ARABIC LETTER HEH + U+0654 ARABIC HAMZA ABOVE]).	نشانهِٔ سَجاوَندی
یِ [U+06CC ARABIC LETTER FARSI YEH + U+0650 ARABIC KASRA]	After a word that ends with a long vowel.	نشانه‌هایِ سجاوندی

Standalone vowels

Word-initial standalone vowels are always attached to or preceded by 0627.

eg.

اومدن

ایرانی

Word-medial Persian vowels are sometimes pronounced without an intervening consonant, but they are normally written as if they are separated by a glottal stop. The glottal stop is written using a hamza and its carrier (see hamza).

eg.

زئوس

ز,ئو,س

سؤال

س,ؤا,ل

Vowel sounds to characters

This section maps Modern Persian vowel sounds to common graphemes in the Arabic orthography. Persian follows Arabic in providing diacritics to express short vowel sounds, but also rarely uses them in normal text.

The joining forms shown are illustrative; alternative shapes may occur (see joining_forms). The examples show normal unvowelled usage as well as vowelled where a difference exists.

Sounds listed as 'infrequent' are allophones, or sounds used for foreign words, etc. Light coloured characters occur infrequently.

iː

initial 0627 06CC eg. این

medial 06CC eg. تیر

final 06CC eg. گیتی

uː

initial 0627 0648 eg. اورمزد

medial 0648 eg. نور

final 0648 eg. زانو

initial 0627 0650 eg. اسم

medial 0650 eg. کرم

final 0650 0647 eg. کهنه

initial 0627 064F eg. امید

medial 064F eg. شتر

final 064F 0648 eg. کادو

initial 0627 064E eg. اسب

medial 064E eg. تبر

final 064E 0647

ɒː

initial 0622 eg. آسمان

medial 0627 eg. مار

final 0627 گرگ‌ها

Consonants

	پ,ب,ت,ط,د,ک,گ,ق,غ
	ء,یٔ,ٔ,ؤ,أ,ع
	چ,ج
	ف,و,س,ث,ص,ز,ذ,ض,ظ,ش,ژ,خ,غ,ه,ح
	م,ن
	و,ر,ل,ی

Basic letters

The basic consonant sounds in Persian are written using the following letters.

Click on each letter for more details and for examples of usage, especially where more than one sound is indicated.

پ,ب,ت,ط,د,ک,گ,ق,ء,چ,ج,ف,و,س,ث,ص,ز,ذ,ض,ظ,ش,ژ,خ,غ,ه,ح,ع,م,ن,ر,ل,ی

و [U+0648 ARABIC LETTER WAW] and ی [U+06CC ARABIC LETTER FARSI YEH] represent both consonants and vowels. See vowel_mappings. It may not always be clear which they represent, eg. compare the final 2 letters of the following words:

دنیوی

دن,ی,و,ی

موی

م,و,ی

Due to the influence of Arabic spelling in loan words, Persian has 2 letters for t, 3 letters for s, 4 letters for z, and 2 letters for h. The most common letter for s is س [U+0633 ARABIC LETTER SEEN], and for z is ز [U+0632 ARABIC LETTER ZAIN].

Glottal stop

In Persian, a glottal stop is commonly written using ع [U+0639 ARABIC LETTER AIN].

eg.

معنوی

عربی

شیعه

In other places, Persian uses a hamza (see hamza).

eg.

فائده

مؤثر

Hamza

أ,ء,ٔ,ؤ

The hamza (called hamze in Persian) is used to represent a glottal stop, however rather than being written as a single letter it is normally written as a combination of a diacritic and a base letter.

In most cases, that is the combination ـیٔـ [U+06CC ARABIC LETTER FARSI YEH + U+0654 ARABIC HAMZA ABOVE].

eg.

مسئول

However, if the glottal stop is preceded by the sound o, the combination is ـؤـ [U+0624 ARABIC LETTER WAW WITH HAMZA ABOVE].

eg.

مؤمن

Although it may not be pronounced, the hamza and its carrier also appears between vowels.

eg.

فائده زئوس مؤثر سؤال سوئیس

In the examples above you can see that the hamza+carrier provides a place to put a vowel diacritic for short vowels in vowelled text. Between two long vowels it represents a nominal glottal stop.

In the i sound of the indefinite ending, the hamza may also be used, or alternatively the ending may double the YEH after a long vowel, eg. compare: پایٔی پایی مویٔی مویی

On more rare occasions the hamza may appear over ا [U+0627 ARABIC LETTER ALEF]. This makes it clear that the alef represents a glottal stop, rather than a long vowel.

eg.

رأی

تأکید

It may also appear as an isolated form, ء [U+0621 ARABIC LETTER HAMZA].

eg.

علاءالدین امضاء

The hamza is also used over short, word-final vowels for ezafe.

A number of precomposed combinations of base letter and hamza are encoded in Unicode. Many of these decompose and recompose under normalisation as canonical alternatives, but a few do not and need to be treated with care. For information about which precomposed characters are used or not used here see hamza_choices.

Onsets

Consonant clusters don't normally occur in Persian syllable onsets. If they do, for example in loan words or as a semivowel before the vowel, they are simply written using a sequence of consonant letters.

eg.

تکیه‌گاه

Codas

No special mechanisms are used to write syllable or word final consonants.

eg.

اسم

راست

عیب

اوج

Geminated consonants

In vowelled text, which is rare, geminated consonants are shown using the diacritic 0651 (called tašdid in Persian).

eg.

تپه

اولی

When both tašdid and zir are attached to the same base consonant, a common, though not universal, practice is to display the zir below the tašdid, rather than below the base consonant (see the examples just above). Some fonts, such as Amiri, don't do this. (See also context.)

Consonant sounds to characters

This section maps Persian consonant sounds to common graphemes in the Arabic orthography.

The right-hand side shows the various joining forms for each character.

Sounds listed as 'infrequent' are allophones, or sounds used for foreign words, etc. Light coloured characters occur infrequently.

067E067E067E067E stop پ

0628062806280628 stop ب

062A062A062A062A stop ت

0637063706370637 stop ط

06290629 stop ة Used at the end of some Arabic words.

t͡ʃ

0686068606860686 affricate چ

062F062F stop د

d͡ʒ

062C062C062C062C affricate ج

06A906A906A906A9 stop ک

06AF06AF06AF06AF stop گ

063A063A063A063A stop غ

0642064206420642 stop ق

063A063A063A063A consonant غ Between vowels.

glottal stop ء

06CC06CC06CC06CC glottal stop یٔ

06240624 glottal stop ؤ

06230623 glottal stop أ

0639063906390639 fricative ع

0641064106410641 fricative ف

06480648 fricative/mater lectionis و

0633063306330633 fricative س

062B062B062B062B fricative ث

0635063506350635 fricative ص In words of Arabic origin.

06320632 fricative ز

06300630 fricative ذ

0636063606360636 fricative ض In words of Arabic origin.

0638063806380638 consonant ظ In words of Arabic origin.

0634063406340634 fricative ش

06980698 fricative ژ

062E062E062E062E fricative خ

063A063A063A063A consonant غ

0647064706470647 fricative ه

062D062D062D062D fricative ح

0645064506450645 nasal م

0646064606460646 nasal ن

-w

06480648 و As a glide after a vowel.

06310631 trill ر

0644064406440644 lateral ل

06CC06CC06CC06CC approximant/mater lectionis ی

Encoding choices

In the Persian orthography different sequences of Unicode characters may produce the same visual result. Here we look at those, and make notes on usage.

Hamza & precomposed characters

Unicode support for the various uses of the hamza is complicated.u§384 For notes on the usage of the hamza in Persian, see hamza and ezafe. In general, the Unicode Standard recommends to use 0654 in combination with a base character. However, there are a few exceptions to consider.

Canonically-equivalent alternatives

A number of combinations with the hamza diacritic can be represented as either an atomic character or a decomposed sequence, where the parts are separated in Unicode Normalisation Form D (NFD) and recomposed in Unicode Normalisation Form C (NFC), so both approaches are canonically equivalent. These include the following:

Atomic	Decomposed
أ [U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE]	أ [U+0627 ARABIC LETTER ALEF + U+0654 ARABIC HAMZA ABOVE]
آ [U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE]	آ [U+0627 ARABIC LETTER ALEF + U+0653 ARABIC MADDAH ABOVE]
ؤ [U+0624 ARABIC LETTER WAW WITH HAMZA ABOVE]	ؤ [U+0648 ARABIC LETTER WAW + U+0654 ARABIC HAMZA ABOVE]

The single code point per vowel-sign is the form preferred by the Unicode Standard and the form in common use for Persian, but either could be found.

Yeh with hamza

This item is a special case. Yeh with a hamza is used for the glottal stop, and also for word medial standalone vowels.

Atomic	Decomposed
ئ [U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE]	ئ [U+064A ARABIC LETTER YEH + U+0654 ARABIC HAMZA ABOVE]

Persian uses ی and doesn't use ي because the latter produces dots below in all positions, whereas yeh in Persian only has dots below in initial and medial forms. However, the canonical decomposition of 0626 maps to the Arabic yeh and a combining hamza.

Nevertheless, the atomic character is widely used in Persian text. To mitigate the issues, the Unicode Standard recommends that any time ي is combined with a hamza the font should drop the dot glyphs. This ensures that the text looks correct in decomposed form, but applications need to be aware that decomposed text will contain an Arabic yeh which is not otherwise used for Persian.

Visually similar but different characters

These cases involve precomposed characters that look identical to the sequences used in Persian, however there is a catch because the precomposed characters have canonical decompositions to letters that are not used in Persian.

Recommended Not recommended

Recommended	Not recommended
0647 0654	06C0 06C2	The Unicode Standard explicitly recommends use of the decomposed sequence when combining a hamza with HEH (for ezafe)`u§395`. The two atomic characters on the right are problematic because they decompose to sequences containing 06D5 and 06C1, respectively, neither of which are appropriate for Persian.
06CC 0654	0626 0649 0654 08A8	For 0626 see hamza_yeh. The Unicode Standard explicitly recommends not to use 0649 0654. Persian doesn't use the alef maksura`u§395`. 08A8 is a character used for an African orthography which retains the dots for all joining forms. It is therefore also an inappropriate character for Persian`u§395`.

0647 0654

06C0

06C2

The Unicode Standard explicitly recommends use of the decomposed sequence when combining a hamza with HEH (for ezafe)u§395.

The two atomic characters on the right are problematic because they decompose to sequences containing 06D5 and 06C1, respectively, neither of which are appropriate for Persian.

06CC 0654

0626

0649 0654

08A8

For 0626 see hamza_yeh.

The Unicode Standard explicitly recommends not to use 0649 0654. Persian doesn't use the alef maksurau§395.

08A8 is a character used for an African orthography which retains the dots for all joining forms. It is therefore also an inappropriate character for Persianu§395.

Content authors should use a separate hamza for these sequences in Persian, even though the precomposed characters look the same visually, because they don't represent the same semantics, and may introduce problems if text is decomposed. However, because approaches may yield exactly the same result when displayed, applications will need to recognise the precomposed characters and treat them or map them to the appropriate sequence. Input mechanisms, on the other hand, can produce one rather than the other, and that choice should be made with advisement.

Confusables & spelling errors

The following lists some common errors found in Persian text due to the similarity of Unicode characters, or perhaps sometimes due to problems inputting the correct character. See also the items listed for combinations with hamza just above.

Correct	Incorrect
06CC	064A 0649	064A doesn't drop the dots below the letter in isolate and final shapes, and 0649 doesn't add them for initial and medial shapes.
0647	06C1 06D5	Although they look the same, these are all different characters. The 06C1 is used for languages that include Urdu and Kashmiri, whereas 06D5 is used for Uighur and central Asian languages.
0629	06C3	Again, characters that a font may render exactly the same, but that are based on different base letters.
06A9	0643	Differences in shape of final and isolate forms.

Although these characters look the same (at least in certain joining forms), they are all different characters, and should not be used interchangeably. If they are used, the ability to search and compare text is impaired unless the application is aware of and takes counter-measures against this substitution.

Codepoint sequences

When typing and in storage, combining marks always follow the base character they are associated with.

Special rendering rules

In principle, if more than one combining mark appears on the same side of the base character, Unicode expects applications to render the marks such that those marks closer to the base character in memory appear closer to the base character when rendered. (This is called the inside-out rule.) However, due to the reordering applied by the Unicode normalisation forms, some of the Arabic script diacritics end up in an inappropriate order on display.

For example, if a user types the sequence of characters in fig_amtra, the order of the marks will be changed such that applying the inside-out rule would render the shadda above the vowel (which is incorrect). (In fact, most application renderers have special rules to correct this.)

The Unicode Standard formally addresses this anomaly in the Technical Annex Unicode® Arabic Mark Rendering (AMTRA), with a set of rules for how to render sequences of Arabic characters. The rules generally move shadda, hamza, round dots, etc. so that they are close to the base character.

User input	Post-normalisation output
بُّ ب ّ ُ	بُ͏ّ ب ُ ّ

User input

Post-normalisation output

بُّ

بُ͏ّ

A sequence of shadda and damma as the user is likely to input it (left), and how it could potentially be arranged after normalisation (right).

In the rare exceptions where the AMTRA rules should not change the rendering, this can be achieved by placing an invisible 034F character between the combining marks. (In fact, this is what was done to simulate the incorrect appearance in fig_amtra, because otherwise the browser rendering engine would have automatically produced the same output as in the first column. Clicking on the example will show the sequence used.)

Arabic
Persian
Urdu
Sindi

Text direction

Persian is written horizontally and right-to-left in the main, but (as with most RTL scripts) numbers and embedded LTR script text are written left-to-right (producing 'bidirectional' text).

کنسرسیوم یونیکد برای اولین بار Unicode Standard را در سال 1991 منتشر کرد (نسخه 1.0) — Persian words are read right-to-left, starting from the right of the line, but numbers and Latin text (highlighted) are read left-to-right.

The Unicode Bidirectional Algorithm automatically takes care of the ordering for all the text in fig_bidirectional, as long as the 'base direction' (ie. the surrounding directional context) is set to right-to-left (RTL).

Characters are all stored in the order in which they are spoken (and typed). This so-called 'logical' order is then rendered as bidirectional flows by the application at run time, as the text is displayed or printed. The relative placement of characters within a single directional flow is based on strong directional properties (RTL or LTR) assigned to each Unicode character by the Unicode Standard. There exist, however a set of neutral direction property values, mostly for punctuation, where the placement of characters depends on the base direction.

Show default bidi_class properties for characters in this orthography.

If the base direction is not set appropriately, the directional runs will be ordered incorrectly as shown in fig_base_direction, making it very difficult to get the meaning.

In some circumstances the Unicode Bidirectional Algorithm requires additional assistance to correctly render the directionality of bidirectional text. For such cases the Unicode Standard provides invisible formatting characters for use in plain text. See directioncontrols.

In HTML the base direction and higher level controls can be set using the dir or bdi attributes. CSS should not be used to control direction. Unicode formatting codes should also not be used where markup is available.

For more information about how directionality and base direction work, see Unicode Bidirectional Algorithm basics. For information about plain text formatting characters see How to use Unicode controls for bidi text. And for working with markup in HTML, see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.

For authoring HTML pages, one of the most important things to remember is to use <html dir="rtl" … > at the top of a right-to-left page, and then use the dir attribute or bdi tag for ranges within the page, but only when you need to change the base direction. Also, use markup to manage direction, and do not use CSS styling.

For other aspects of dealing with right-to-left writing systems see the following sections:

directioncontrols
expressions
breaking_latin
mirrored_characters
page

Managing text direction

Unicode provides a set of 10 formatting characters that can be used to control the direction of text when displayed. These characters have no visual form in the rendered text, however text editing applications may have a way to show their location.

‫ RLE [U+202B RIGHT-TO-LEFT EMBEDDING] (RLE), ‪ LRE [U+202A LEFT-TO-RIGHT EMBEDDING] (LRE), and ‬ PDF [U+202C POP DIRECTIONAL FORMATTING] (PDF) are in widespread use to set the base direction of a range of characters. RLE/LRE come at the start, and PDF at the end of a range of characters for which the base direction is to be set.

More recently, the Unicode Standard added a set of characters which do the same thing but also isolate the content from surrounding characters, in order to avoid spillover effects. They are ⁧ RLI [U+2067 RIGHT-TO-LEFT ISOLATE] (RLI), ⁦ LRI [U+2066 LEFT-TO-RIGHT ISOLATE] (LRI), and ⁩ PDI [U+2069 POP DIRECTIONAL ISOLATE] (PDI). The Unicode Standard recommends that these be used instead.

There is also ⁨ PDI [U+2068 FIRST STRONG ISOLATE] (FSI), used at the start of a range to set the base direction according to the first recognised strongly-directional character.

‏ RLM [U+200F RIGHT-TO-LEFT MARK] (RLM) and ‎ LRM [U+200E LEFT-TO-RIGHT MARK] (LRM) are invisible characters with strong directional properties that are also sometimes used to produce the correct ordering of text.

For more information about how to use these formatting characters see How to use Unicode controls for bidi text. Note, however, that when writing HTML you should generally use markup rather than these control codes. For information about that, see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.

Expressions & sequences

A sequence of numbers separated by hyphens (for example a range) runs from left to right in Persian (unlike Arabic language text).

fig_bidi_range shows some Persian text, which is right-to-left overall, containing a numeric range that is ordered LTR, ie. it starts with ۱۱۶۹ (1169) and ends with ۱۱۷۰ (1170).

۱۱۶۹-۱۱۷۰ آغاز نگارش هیستوریا به دستور امالریک یکم — A numeric range in Persian language text.

Certain types of hyphen or other characters can affect the way expressions run, and you need to be aware of that when writing Persian text. For more details see the Expressions & sequences section in the description of the Arabic language orthography.

Glyph shaping & positioning

This section brings together information about the following topics: writing styles; cursive text; context-based shaping; context-based positioning; baselines, line height, etc.; font styles; case & other character transforms.

You can experiment with examples using the Persian workbench.

The orthography has no case distinction, and no special transforms are needed to convert between characters.

Writing styles

Persian may be written in a nasta'liq writing style. Key features include a sloping baseline for joined letters, and overall complex shaping and positioning for base letters and diacritics alike. There are also distinctive shapes for many glyphs and ligatures.

مستحق • شخص • کیفیت — Sloping baselines and complex joining behaviours in Persian nastaliq text.

This is achieved in Unicode by applying the correct font – the underlying characters used are not different for nasta'liq vs. other styles.

The Wikipedia Persian home page, showing the title in nasta'liq script.

Persian may also be written in several other styles, especially in artistic and historical writing.

Cursive script

Arabic script joins letters together. Fonts need to produce the appropriate joining form for a code point, according to its visual context. This results in four different shapes for most letters (including an isolated shape). The highlights in fig_cursive below show the same letter, ع [U+0639 ARABIC LETTER AIN], with two different joining forms.

عقل ودیعت — The letter ع [U+0639 ARABIC LETTER AIN] in 2 different joining contexts.

A few Arabic script letters only join on the right-hand side.

There are 2 Unicode blocks containing Arabic presentation forms: these contain individual characters corresponding to the various joining forms and ligatures. With only a handful of exceptions, characters in those blocks should not be used for text content; they are only for managing legacy encodings. Instead, characters in the main Arabic block should be used, and the font will manage the necessary cursive shaping.

Cursive joining forms

Most dual-joining characters add or become a swash when they don't join to the left. A number of characters, however, undergo additional shape changes across the joining forms. fig_joining_forms and fig_right_joining_forms show the basic shapes in Persian and what their joining forms look like.

Two pairs of characters in the first table have base shapes that are identical, but they manage the dots differently in different joining forms. These have been put onto separate rows.

isolated	right-joined	dual-join	left-joined	Persian letters
ب	ـب	ـبـ	بـ	ب,ت,ث,پ
ن	ـن	ـنـ	نـ	ن
ق	ـق	ـقـ	قـ	ق
ف	ـف	ـفـ	فـ	ف
س	ـس	ـسـ	سـ	س,ش
ص	ـص	ـصـ	صـ	ص,ض
ط	ـط	ـطـ	طـ	ط,ظ
ک	ـک	ـکـ	کـ	ک,گ
ل	ـل	ـلـ	لـ	ل
ه	ـه	ـهـ	هـ	ه
م	ـم	ـمـ	مـ	م
ع	ـع	ـعـ	عـ	ع,غ
ح	ـح	ـحـ	حـ	ح,خ,ج,چ
ی	ـی	ـیـ	یـ	ی

Joining forms for shapes that join on both sides.

isolated	right-joined	Persian letters
ا	ـا	ا,آ,أ
ر	ـر	ر,ز,ژ
د	ـد	د,ذ
و	ـو	و,ؤ

Joining forms for shapes that join on the right only.

Managing glyph shaping

200D (ZWJ) and 200C (ZWNJ) are used to control the joining behaviour of cursive glyphs. They are particularly useful in educational contexts, but also have real world applications.

ZWJ permits a letter to form a cursive connection without a visible neighbour. It can be used for illustrating cursive joining forms, eg. ان‍‍ ‍س‍‍ ‍ان Characters from the Presentation Forms blocks in Unicode should not be used in such cases.

ZWNJ prevents two adjacent letters forming a cursive connection with each other when rendered, eg. ان‌س‌ان

This is particularly useful for Persian, since certain Persian suffixes don't join with word-final letters when they appear finally in a morpheme, eg. خانه‌ها تکیه‌گاه

Click on the words above to see the composition.

To achieve this, you need to use 200C. It's also possible to sometimes see text where the suffix is written after a space, or simply joined to the end of the word. However, those alternatives are not available when the word ends with 0647.Wiktionary§https://en.wiktionary.org/wiki/ها#Persian

034F is used in Arabic to produce special ordering of diacritics. The name is a misnomer, as it is generally used to break the normal sequence of diacritics.

Context-based shaping & positioning

Context-based shaping is everywhere in Persian due to the combination of the cursive behaviour of the script plus the strong tendency to arrange joined characters in cascades or vertical arrangements.

As in Arabic, lam followed by alef ligates, and there are other such commonly ligated forms.

eg.

لاغر

علاوه

Depending on the font, Arabic letters often have special rules for joining between certain characters, and diacritic marks generally vary in height and horizontal position depending on the size of the base character.

Another example of contextual positioning rules is the placement of 0650 (zir) in vowelled text when it appears on the same letter as 0651 (tašdid). Usually, zir appears below the base letter, and this is how it can be distinguished from 064E (zebar). However, when combined, zir may be placed relative to the shadda diacritic, rather than relative to the base character, as seen in fig_kasra_placement.

مَمِمّمَّمِّ — The word تپه with vowel diacritics, showing the zir below the tašdid, rather than below the base letter.

Positioning of cursive joining forms is particularly complicated in the nastaliq style. See more details in the Urdu page.

Punctuation & inline features

Phrase & section boundaries

Persian uses a mixture of ASCII and Arabic punctuation.

phrase	، ؛ :
sentence	. ؟ !

phrase

sentence

Bracketed text

Persian commonly uses ASCII parentheses to insert parenthetical information into text.

	start	end
standard	(	)

Mirrored characters

The words 'left' and 'right' in the Unicode names for parentheses, brackets, and other paired characters should be ignored. LEFT should be read as if it said START, and RIGHT as END. The direction in which the glyphs point will be automatically determined according to the base direction of the text.

a > b > c — Both of these lines use > [U+003E GREATER-THAN SIGN], but the direction it faces depends on the base direction at the point of display.

ا > ب > ج — Both of these lines use > [U+003E GREATER-THAN SIGN], but the direction it faces depends on the base direction at the point of display.

The number of characters that are mirrored in this way is around 550, most of which are mathematical symbols. Some are single characters, rather than pairs. The following are some more common ones.

(,),<,>,[,],{,},«,»,‹,›

Quotations & citations

The following quotation marks can be found in Persian texts. (Depending on ease of input, quotations may alternatively be surrounded by ASCII double and single quote marks.)

	start	end
primary	«	»

Because they are mirrored, when using these quotation marks, LEFT should be read as if it said START, and RIGHT as END. These quotation marks typically have rounded glyph shapes, rather than sharp angles.

Line & paragraph layout

Line breaking & hyphenation

Basic line-break opportunities occur between the space-separated words.

They are not broken at the small gaps that appear where a character doesn't join on the left.

In-word line-breaking

Hyphenation isn't used for the Persian language, however other languages using the Arabic script may hyphenate (such as Uighur).a

Line-edge rules

As in almost all writing systems, certain punctuation characters should not appear at the end or the start of a line. The Unicode line-break properties help applications decide whether a character should appear at the start or end of a line.

Show default line-breaking properties for characters in this orthography.

The following list gives examples of typical behaviours for characters affected by these rules. Context may affect the behaviour of some of these and other characters.

« “ ‘ ( should not be the last character on a line
» ” ’ ) . ، ؛ ؟ ! should not begin a new line

Breaking between Latin words

When a line break occurs in the middle of an embedded left-to-right sequence, the items in that sequence need to be rearranged visually so that it isn't necessary to read lines from bottom to top.

latin-line-breaks shows how two Latin words are apparently reordered in the flow of text to accommodate this rule. Of course, the rearragement is only that of the visual glyphs: nothing affects the order of the characters in memory.

Text alignment & justification

See the section on justification for the Arabic language.

Text spacing

See the section on text spacing for the Arabic language.

Baselines, line height, etc.

The alphabetic baseline is a strong feature of Arabic script on the whole, since characters tend to join there. The nastaliq style of the script, on the other hand, uses arrangements of joined glyphs that cascade downwards from right to left, and ressemble a strongly sloping baseline.

Counters, lists, etc.

You can experiment with counter styles using the Counter styles converter. Patterns for using these styles in CSS can be found in Ready-made Counter Styles, and we use the names of those patterns here to refer to the various styles.

The Persian language uses 1 numeric and 2 fixed styles.

Numeric

The persian numeric style is decimal-based and uses these digits.rmcs§#arabic-styles

۰,۱,۲,۳,۴,۵,۶,۷,۸,۹

eg.

۱,۲,۳,۴,۱۱,۲۲,۳۳,۴۴,۱۱۱,۲۲۲,۳۳۳,۴۴۴

Fixed

The arabic-abjad fixed style uses these letters. It is only able to count to 28.rmcs§#arabic-styles

ا,ب,ج,د,ه‍,و,ز,ح,ط,ی,ک,ل,م,ن,س,ع,ف,ص,ق,ر,ش,ت,ث,خ,ذ,ض,ظ,غ

Note that the 5th counter includes a zero-width joiner formatting character. This makes the shape distinguishable from ٥ [U+0665 ARABIC-INDIC DIGIT FIVE].

The persian-alphabetic fixed style uses these letters. It is able to count to 32. The letters are arranged by shape.rmcs§#arabic-styles

ا,ب,پ,ت,ث,ج,چ,ح,خ,د,ذ,ر,ز,ژ,س,ش,ص,ض,ط,ظ,ع,غ,ف,ق,ک,گ,ل,م,ن,و,ه‍,ی

The 31st counter also includes a zero-width joiner formatting character.

Prefixes and suffixes

Persian lists generally use a full stop suffix as a separator.

Notes, footnotes, etc

See inlinenotes for purely inline annotations, such as ruby or warichu. This section is about annotation systems that separate the reference marks and the content of the notes.

Arabic, Persian

Sample

Usage & history

Basic features

Joining forms

Ijam & tashkil

Character index

Letters

Consonants & matres lectionis

Modifier letter

Not used for Persian

Combining marks

Vowel marks

Other

Numbers

Punctuation

General punctuation

ASCII

Quotes

Opening/closing

Symbols