Urdu (draft)
Nastaliq Arabic

Updated 13 November, 2022

This page brings together basic information about the Arabic script and its use for the Urdu language. It aims to provide a brief, descriptive summary of the modern, printed orthography and typographic features, and to advise how to write Urdu using Unicode.

Sample


دفعہ ۱۔ تمام انسان آزاد اور حقوق و عزت کے اعتبار سے برابر پیدا ہوئے ہیں۔ انہیں ضمیر اور عقل ودیعت ہوئی ہے۔ اس لئے انہیں ایک دوسرے کے ساتھ بھائی چارے کا سلوک کرنا چاہیئے۔

دفعہ ۲۔ ہر شخص ان تمام آزادیوں اور حقوق کا مستحق ہے جو اس اعلان میں بیان کئے گئے ہیں، اور اس حق پر نسل، رنگ، جنس، زبان، مذہب اور سیاسی تفریق کا یا کسی قسم کے عقیدے، قوم، معاشرے، دولت یا خاندانی حیثیت وغیرہ کا کوئی اثر نہ پڑے گا۔ اس کے علاوہ جس علاقے یا ملک سے جو شخص تعلق رکھتا ہے اس کی سیاسی کیفیت دائرہ اختیار یا بین الاقوامی حیثیت کی بنا پر اس سے کوئی امتیازی سلوک نہیں کیا جائے گا۔ چاہے وہ ملک یا علاقہ آزاد ہو یا تولیتی ہو یا غیر مختار ہو یا سیاسی اقتدار کے لحاظ سے کسی دوسری بندش کا پابند ہو۔

Usage & history

The Urdu alphabet, in the nastaliq style, is used to write the Urdu language, spoken in Pakistan and India.

اُردُو حُرُوفِ تَہَجِّی

The orthography is a modification of Perso-Arabic, which derives from the Arabic alphabet with additions for Indo-European pronunciation. After the Mughal conquest, Nasta'liq became the preferred writing style for Urdu. It is the dominant style in Pakistan, and many Urdu writers elsewhere in the world use it.

Basic features

Urdu uses the Arabic script, with extensions to cover its much wider repertoire of sounds. A number of the extensions are based on those developed for Persian (Farsi). The Arabic script is an abjad. This means that in normal use the script represents only consonant and long vowel sounds. See the table to the right for a brief overview of features for the modern Urdu orthography.

Urdu text runs right-to-left in horizontal lines, but numbers and embedded Latin text are read left-to-right.

It is principally written using the nasta'liq style of Arabic writing. Glyphs are more drawn out, and the baseline tends to be sloping from word to word.

The script is cursive, and some basic letter shapes change radically, depending on what they join to. The nastaliq styling creates diagonal baselines between joined characters, and tends to reduce clarity about where one letter ends and the next starts. (The dots and other diacritics associated with letters become particularly useful for the reader.)

There is no case distinction.

Words are separated by spaces.

Modern Urdu has 39 basic consonant letters and 18 aspirated digraphs in its alphabet to represent native sounds, but tends to spell words loaned from Persian and Arabic using additional characters. Although it is not always easy to guess the vowel sounds in a word, the consonants are largely reliable phonetically. There is mostly a one-to-one correspondence between letters and sounds. Vowels, however, are a different story.

The script draws on combinations of 5 code points in order to write 10 vowel sounds in unvowelled text, but uses an additional 5-10 diacritics when precision is needed. Nasalisation is indicated by a special letter in word-final position, but by a normal n-letter word-medially, although sometimes this has an additional diacritic.

The way Urdu indicates vowels that follow another vowel without an intervening consonant, and the way it represents the izafat conjunction, use a hamza diacritic and other diacritics and letters in a somewhat complicated pattern. The choice between precomposed and decomposed realisations of characters used for these features is also complicated.

A mandatory ligature is used for combinations of lam + alif.

Additional diacritics indicate the absence of a vowel in consonant clusters, and gemination.

Urdu uses native digits, though the code points are different from those used for the Arabic language, and Arabic code points are used for several of the more common punctuation marks.

Joining forms

Because the Arabic script is 'cursive' (ie. joined-up) writing, letters tend to have different shapes depending on whether they join with adjacent letters or not (see cursive). Here we clarify some of the terminology used in this page to refer to these different forms.

Several characters have no left-joining form. This has an effect on the following letter shape.

When we say 'initial' forms, we generally refer to glyphs that only join to the left. Consonants that don't have a left-joining form use the unjoined glyph at the beginning of a word. Initial forms occur in word-medial position if they follow a glyph that doesn't join to the left.

Where we illustrate 'initial' forms of a vowel we typically show the word-initial form, which is always attached to or preceded by an aleph, eg. اَ or ای‍ـ. If an initial form is immediately preceded by a consonant, the consonant takes the place of the aleph, eg. رَ‍ـ or ری‍ـ.

In illustrations of shaping forms we normally show the 'isolated' form of a vowel as preceded by aleph, as it would be if written alone, eg. ای. In use following another letter, however, the aleph is dropped.

Word-final vowel forms come in two types. A vowel that can join with the preceding character uses the right-joining glyph, eg. بی. One that follows a letter that doesn't join to the left uses the isolated form, eg. ری. When we refer to the 'final' form, we are usually referring to the former, ie. the right-joined form.
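The effect of letters that have no left-joining form can be sketched in code. The following Python example splits a word into runs of letters that connect to each other. The set of right-joining letters is hand-compiled for this illustration (an assumption, not an exhaustive list), and the function name `joined_runs` is hypothetical.

```python
import unicodedata

# Assumption: letters used for Urdu that join only to the preceding letter.
# alef, alef madda, alef hamza, dal, ddal, thal, reh, rreh, zain, jeh,
# waw, waw hamza, yeh barree, yeh barree hamza, teh marbuta goal
RIGHT_JOINING = set(
    "\u0627\u0622\u0623\u062F\u0688\u0630\u0631\u0691\u0632\u0698"
    "\u0648\u0624\u06D2\u06D3\u06C3"
)
NON_JOINING = {"\u0621"}  # isolated hamza joins to neither side


def joined_runs(word):
    """Split a word into runs of letters that are written joined together."""
    runs = []
    open_run = False  # True while the last run can still take a joined letter
    for ch in word:
        if unicodedata.combining(ch) and runs:
            runs[-1] += ch  # a diacritic stays with its base letter
        elif ch in NON_JOINING:
            runs.append(ch)
            open_run = False
        elif open_run:
            runs[-1] += ch
            open_run = ch not in RIGHT_JOINING
        else:
            runs.append(ch)
            open_run = ch not in RIGHT_JOINING
    return runs
```

The first letter of each run takes an initial or isolated form, which is why initial forms can appear in word-medial position.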

Character index

Letters


Consonants

پ␣ب␣ت␣ط␣د␣ٹ␣ڈ␣ک␣گ␣ق␣ء␣چ␣ج␣ف␣و␣س␣ث␣ص␣ز␣ذ␣ض␣ظ␣ش␣ژ␣خ␣غ␣ہ␣ح␣ھ␣ع␣م␣ن␣ں␣ر␣ڑ␣ل␣ی␣آ␣ؤ␣ئ␣ۂ␣ۓ
ۃ␣ي

Vowels

ے␣ا

Not used for Urdu

ځ␣ݬ␣ࢡ␣ه␣ة␣ك

Combining marks


Vowels

َ␣ً␣ُ␣ٌ␣ِ␣ٍ␣ْ␣ٗ␣ٓ␣ٰ␣ٖ␣٘

Other

ؔ␣ؓ␣ؒ␣ؑ␣ؐ␣ّ␣ٔ

Numbers

۰␣۱␣۲␣۳␣۴␣۵␣۶␣۷␣۸␣۹

Punctuation

٪␣٫␣٬␣؍␣،␣؛␣؟␣٪␣٫␣٬␣۔␣“␣”␣‐␣–

ASCII

!␣(␣)␣.

Symbols

؎␣؏␣﷽

Other

؀␣؁␣؂␣؃␣؄␣۝

Formatting

‌␣‍␣⁧␣‫␣⁦␣‪␣⁨␣⁩␣‬␣‏␣‎

Phonology

These are sounds of the Urdu language.


Phones in a lighter colour are non-native or allophones. Source Wikipedia.

Vowel sounds

There are 10 vowel sounds, though there are also allophonic variants. They are usually grouped into pairs of 'short' and 'long' sounds - although the difference is qualitative, rather than just length. The basic phonemes are as follows:

ɪ ɪ ʊ ʊ ə ə ɛː ɛː ɔː ɔː æ æ ɑː ɑː

The phoneme ə is often written a in phonemic transcriptions. Its pronunciation may also be slightly lower as far down as ɐ, so it is shown slightly lower than normal on the chart.

iː and uː in word-final position are typically shortened to i and u, eg. شَکتی وَستُو

Where ɦ has inherent vowels on either side, those vowels may become ɛ, eg. کَہنا. A similar process occurs for word-final ɦ, eg. کَہہ

For more details, see Wikipedia.

Consonant sounds

labial dental alveolar post-alveolar retroflex palatal velar uvular glottal
stops p b t d     ʈ ɖ   k ɡ q ʔ
aspirated     ʈʰ ɖʱ   ɡʱ    
affricates       t͡ʃ d͡ʒ          
aspirated       t͡ʃʰ d͡ʒʱ          
fricatives f v   s z ʃ ʂ   x ɣ   h ɦ
nasals m   n   ɳ ɲ ŋ  
approximants ʋ w   l     j    
trills/flaps     r ɾ   ɽ  
aspirated         ɽʱ  

Urdu, like other Indic languages, has four forms of plosives, illustrated here with the bilabial stop: unvoiced p, voiced b, aspirated pʰ, and murmured bʱ. It also has a set of retroflex consonants.

v and w are allophones of ʋ in Urdu. w typically occurs between a consonant and vowel, eg. compare پکوان ورت

For more details, see Wikipedia.

Vowels

Vowel sounds to characters

This section maps Urdu vowel sounds to common graphemes in the Arabic orthography, grouped by word-initial ( i ), medial ( m ), and final ( f ) types.

Sounds listed as 'infrequent' are allophones, or sounds used for foreign words, etc.

Urdu follows Arabic in using diacritics to express short vowel sounds and, like Arabic, rarely uses them in normal text. Because Urdu has more vowel sounds than Arabic, the way characters are used to express vowels is much more complicated.

The three short vowels are not typically found in final position.

Vowel diacritics are shown here, but are not normally shown in Urdu text.

 
m

◌ِـیـ [U+0650 ARABIC KASRA + U+06CC ARABIC LETTER FARSI YEH], eg. تِین.

 
f

◌ِـی   [U+06CC ARABIC LETTER FARSI YEH], eg. گاری.

ɪ
i

اِ [U+0627 ARABIC LETTER ALEF + U+0650 ARABIC KASRA], eg. اِنسَان.

 
m

◌ِ [U+0650 ARABIC KASRA], eg. دِن.

‍ئ‍    [U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE]  after another vowel with no intervening consonant, eg. کوئلہ.

ʊ
i

اُ [U+0627 ARABIC LETTER ALEF + U+064F ARABIC DAMMA], eg. اُڑنَا.

 
m

◌ُ [U+064F ARABIC DAMMA], eg. سُست.

و [U+0648 ARABIC LETTER WAW] in two very common words: خود, and خوش.

 
m

◌ُو [U+064F ARABIC DAMMA + U+0648 ARABIC LETTER WAW], eg. پُورا.

وٗ [U+0648 ARABIC LETTER WAW + U+0657 ARABIC INVERTED DAMMA], eg. پوٗرا

 
f

◌ُو [U+064F ARABIC DAMMA + U+0648 ARABIC LETTER WAW], eg. ہندُو.

وٗ [U+0648 ARABIC LETTER WAW + U+0657 ARABIC INVERTED DAMMA], eg. ہندوٗ

 
m

ـیـ    [U+06CC ARABIC LETTER FARSI YEH], eg. بیٹا.

 
f

ـے    [U+06D2 ARABIC LETTER YEH BARREE], eg. بجے.

 
m

و [U+0648 ARABIC LETTER WAW], eg. ٹوپی.

 
f

و [U+0648 ARABIC LETTER WAW], eg. کو.

ɛ
-

◌ِ [U+0650 ARABIC KASRA], when used as izafat, eg. شیرِ پنجاب.

ۂ [U+06C2 ARABIC LETTER HEH GOAL WITH HAMZA ABOVE] as izafat when the preceding word ends in a silent ہ [U+06C1 ARABIC LETTER HEH GOAL], eg. درجۂ حرارت.

ٔ [U+0654 ARABIC HAMZA ABOVE] as izafat when the preceding word ends with ی [U+06CC ARABIC LETTER FARSI YEH] or ۓ [U+06D3 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE], eg. آزادئ مذہب.

 ئے [U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE + U+06D2 ARABIC LETTER YEH BARREE] as izafat when the preceding word ends in ا [U+0627 ARABIC LETTER ALEF] or و [U+0648 ARABIC LETTER WAW], eg. روئے زمین.

ɛː
i

اَیـ [U+0627 ARABIC LETTER ALEF + U+064E ARABIC FATHA + U+06CC ARABIC LETTER FARSI YEH], eg. اَیسا.

May also replace inherent vowels alongside ɦ, per the description above.

 
m

◌ـَیـ [U+064E ARABIC FATHA + U+06CC ARABIC LETTER FARSI YEH], eg. کَیسَا.

 
f

◌ـَے [U+064E ARABIC FATHA + U+06D2 ARABIC LETTER YEH BARREE], eg.ہَے.

 
m

◌َو [U+064E ARABIC FATHA + U+0648 ARABIC LETTER WAW], eg. شَوق.

 
f

◌َو [U+064E ARABIC FATHA + U+0648 ARABIC LETTER WAW], eg.نَو.

ə
i

اَ [U+0627 ARABIC LETTER ALEF + U+064E ARABIC FATHA], eg. اَب.

 
m

◌َ [U+064E ARABIC FATHA], eg. سَر.

ـئـ [U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE]  after another vowel with no intervening consonant, eg. ہیئت.

 
m

◌َـا [U+064E ARABIC FATHA + U+0627 ARABIC LETTER ALEF], eg. بَاغ.

 
f

◌َـا [U+064E ARABIC FATHA + U+0627 ARABIC LETTER ALEF], eg. لِکھنَا.

ـہ [U+06C1 ARABIC LETTER HEH GOAL] at the end of many words derived from Arabic or Persian, eg. مَکّہ

ـیٰ [U+06CC ARABIC LETTER FARSI YEH + U+0670 ARABIC LETTER SUPERSCRIPT ALEF] at the end of a few Arabic words, eg. اعلیٰ

◌̃

ں [U+06BA ARABIC LETTER NOON GHUNNA] when word final, eg. نہیں.

ن [U+0646 ARABIC LETTER NOON] elsewhere, eg. دانت, اونچا.

ن٘ [U+0646 ARABIC LETTER NOON + U+0658 ARABIC MARK NOON GHUNNA] if the author wishes to emphasise that this is nasalisation. 

Vowels without diacritics

ا␣آ␣ی␣ے␣و

When text is unvowelled (as it usually is), there are only a few ways of writing vowels, and a good deal of ambiguity for the novice reader about which sound is represented by a given letter.

This table shows the characters and their basic mappings to sounds. (The table should be read right-to-left.)

initial   medial   final  
ا ə ɪ ʊ ا  ɑː ا  ɑː
آ ɑː
ایـ ɛː ـیـ  ɛː ـی 
ے  ɛː
او ɔː و  ɔː و  ɔː

The vowels ə ɪ ʊ are not marked in medial position, and generally do not occur in final position.

See also vowel_mappings.

Vowel diacritics

In situations where it is necessary to unambiguously indicate the underlying vowel sounds, the following diacritics can be added to base letters.

ِ␣ُ␣ٗ␣َ␣ٓ

The following table summarises the main vowel to character assignments. Note that some sounds are distinguished in vowelled text by an absence of diacritics. More information can be found in the section vowel_mappings.

Each table cell shows word-initial, word-medial, and word-final forms from right to left.

ɪ ʊ
اِ◌ِ◌ِ اِی‍◌ِ‍ی‍◌ِ‍ی اُ◌ُ◌ُ اوٗ‍وٗ‍وٗ/اُو◌ُو◌ُو
ای‍‍ی‍‍ے او‍و‍و
ɛː ə ɔː
اَی‍◌‍َی‍◌َ‍ے اَ◌َ◌َ اَو◌َو◌َو
ɑː
آ◌َ‍ا◌َ‍ا

The three short vowels are not typically found in final position.

◌ٗ [U+0657 ARABIC INVERTED DAMMA] is used to indicate that the vowel is uː or ʊ rather than ɔ. It is not usually needed, and serves only to emphasise that this is a vowel, eg. ہندوٗ

ٓ [U+0653 ARABIC MADDAH ABOVE] is only found in decomposed text, and is associated only with alef. See آ [U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE].
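Because these diacritics are usually omitted, comparing a vowelled spelling with an unvowelled one generally means removing them first. The following is a minimal Python sketch of that idea. The choice of which marks count as optional is an assumption made for this example (zabar, peʃ, zer, tashdid, sukun, inverted damma, and the noon ghunna mark), and the name `unvowel` is hypothetical. Hamza above and maddah are deliberately kept, since they are part of the spelling.

```python
import unicodedata

# Assumption: marks treated as optional in ordinary Urdu text.
OPTIONAL_MARKS = {
    "\u064E",  # ARABIC FATHA (zabar)
    "\u064F",  # ARABIC DAMMA (pesh)
    "\u0650",  # ARABIC KASRA (zer)
    "\u0651",  # ARABIC SHADDA
    "\u0652",  # ARABIC SUKUN
    "\u0657",  # ARABIC INVERTED DAMMA
    "\u0658",  # ARABIC MARK NOON GHUNNA
}


def unvowel(text):
    """Return text in NFC with the optional vowel marks removed."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in composed if ch not in OPTIONAL_MARKS)
```

Normalising to NFC first means that a decomposed alef + maddah is stored as the single code point U+0622 before the marks are filtered.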

Other diacritics

ً␣ٌ␣ٍ␣ٰ␣ٖ

The doubled vowel diacritics, ◌ً [U+064B ARABIC FATHATAN​], ◌ٌ [U+064C ARABIC DAMMATAN​], and ◌ٍ [U+064D ARABIC KASRATAN​] are used at the ends of certain Arabic adverbs in vowelled text. The doubled zabar (fathatan) is the most common of the three marks of this type, and is usually written over an alif, although the vowel sound is short. Examples, یقیناً مثلاً

◌ٰ [U+0670 ARABIC LETTER SUPERSCRIPT ALEF] is used in a few Arabic words over the final form of ی [U+06CC ARABIC LETTER FARSI YEH] to produce the sound ɑː, eg. اعلیٰ دعویٰ

The similar diacritic ◌ٖ [U+0656 ARABIC SUBSCRIPT ALEF] is (rarely) used to indicate that a vowel is iː or i rather than e, eg. نُحْیٖ nuh͓yᵢ

AIN as a vowel carrier

ع

ع [U+0639 ARABIC LETTER AIN] is used in words of Arabic origin. In these words it is typically not pronounced but can support vowels. In this way, at the beginning of a word it can fulfill the same function as the alif, but the spelling can distinguish homophones, eg. compare عَرب اَرَب

Note, in particular, that the equivalent of آ [U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE] ɑː is عا, as in عادت

A following ع may also affect a short vowel diacritic to produce a long vowel sound as follows:

  1. ɑː from zabar followed by 'ain, eg. بَعد

  2. e from zer followed by 'ain, eg. شِعر

  3. o from peʃ followed by 'ain, eg. شُعلہ

Sound changes before HE

ہ␣ح

ہ [U+06C1 ARABIC LETTER HEH GOAL] and ح [U+062D ARABIC LETTER HAH] can also modify preceding short vowels as follows:

  1. ɛ from zabar followed by he, eg. اَحمد رَہنا

  2. ɛ from zer followed by he, eg. مِہربانی واضِح

  3. o from peʃ followed by he, eg. شُہرت توجُّہ

The so-called 'silent' he that appears at the end of many words of Arabic or Persian derivation is pronounced ɑː, eg. مکَہ

Nasalisation

ن␣ں␣٘

Vowels may be nasalised, like at the end of the French word élan.

Word-medially, this is written using the normal ن [U+0646 ARABIC LETTER NOON], eg. سانپ انگریزی

Word-finally, this is indicated in Urdu by nun ghunna, which looks like the letter nun except that it has no dot. For this, use ں [U+06BA ARABIC LETTER NOON GHUNNA], eg. ماں کروں

The diacritic ◌٘ [U+0658 ARABIC MARK NOON GHUNNA] is used when people want to make it clear that a noon character represents nasalisation rather than the sound n, eg. ٹان٘گ

It is not used in a standard way, just when the user prefers, and is fairly uncommon. 

Hamza

ء␣ٔ␣ؤ␣ۂ␣ۓ

A hamzā plays more than one role in Urdu, related to vowels. It is used within a word to separate standalone vowel sounds from a preceding vowel (see standalone). It is also used at the end of a word to express a short ɛ sound between 2 words, which is typically translated 'of' (see izafat).

An isolated form of hamza, ء [U+0621 ARABIC LETTER HAMZA], is occasionally used, but generally hamza is written above a preceding base letter using ٔ [U+0654 ARABIC HAMZA ABOVE] or a precomposed character with a hamza.

A number of precomposed combinations of base letter and hamza are encoded in Unicode. Many of these decompose and recompose under normalisation as canonical alternatives, but a few do not and need to be treated with care.

For information about which precomposed characters are used or not used here see hamza_choices.

When represented by a combining character, hamza can also have two different shapes, one like the initial form of 'ain and the other more like an italic 's'.

ئ
Two alternative shapes of hamza.

Standalone vowels

A vowel that follows another vowel, with no intervening consonant, is commonly marked with a hamzā diacritic. This generally applies to words where the second vowel is one of the following: iː e ɪ uː oː, and the graphemes used are:

ئ␣ؤ

See hamza_choices for notes on the use of precomposed characters, especially ئ [U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE].

Yeh. When the second vowel is an iː or e represented by ی [U+06CC ARABIC LETTER FARSI YEH] or ے [U+06D2 ARABIC LETTER YEH BARREE], the hamzā 'sits on a chair' before it. The hamza on its chair is written using ئ [U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE], eg. کئی تیئیس کوئی گئے گائے

The short vowel ɪ as a second vowel is also represented by hamzā 'on its chair' alone, eg. کوئلہ لائن

Waw. When the second vowel is a uː or oː represented by و [U+0648 ARABIC LETTER WAW], the hamzā typically sits directly on top of the و. To represent this in Unicode use ؤ [U+0624 ARABIC LETTER WAW WITH HAMZA ABOVE], eg. آؤ جاؤں

Unmarked. Often the hamzā is omitted in this situation. Many words have the vowel combinations iːɑ̃ iːe iːo, where hamzā is not typically used, eg. لڑکیاں چلیے لڑکیوں کا

Izāfat

ِ␣ٔ␣ۂ␣ۓ

Izāfat ɪzɑːfat is the name given to the short vowel ɛ used to describe a relationship between two words. It may be translated of, eg. as in the Lion of Punjab, and appears at the end of the initial word in a 2-word sequence.

See hamza_choices for notes on the use of precomposed characters, especially ئ [U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE].

Summary of how to write izafat:
Word ending ی: use ئ [U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE]
Word ending ے: use ۓ [U+06D3 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE]
Word ending ہ: use ۂ [U+06C2 ARABIC LETTER HEH GOAL WITH HAMZA ABOVE], unless it produces h, in which case use ہِ [U+06C1 ARABIC LETTER HEH GOAL + U+0650 ARABIC KASRA] in vowelled text
Word ending ا: use ائے [U+0627 ARABIC LETTER ALEF + U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE + U+06D2 ARABIC LETTER YEH BARREE]
Word ending و: use وئے [U+0648 ARABIC LETTER WAW + U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE + U+06D2 ARABIC LETTER YEH BARREE]
Otherwise: add ◌ِ [U+0650 ARABIC KASRA] (or nothing at all, in unvowelled text)

Zer. This is mostly represented using zer, although in unvowelled text the combining mark is commonly not shown, eg. شیرِ پنجاب طالبِ علم

Heh. If ہ [U+06C1 ARABIC LETTER HEH GOAL] is pronounced as h at the end of a word, then zer is used, as for any other consonant sound, eg. براہِ راست

However, when it represents a vowel sound or is silent, izafat is represented by a combining hamza, eg. درجۂ حرارت قطرۂ آب

Yeh. When the preceding word ends in ی [U+06CC ARABIC LETTER FARSI YEH] or ے [U+06D2 ARABIC LETTER YEH BARREE], izafat is represented by the respective letter with a hamza, eg. آزادئ مذہب

Alef or waw. When the preceding word ends in a vowel written with ا or و, izafat is represented using hamza 'on its chair' followed by baɽiː je, ie. ئے [U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE + U+06D2 ARABIC LETTER YEH BARREE], eg. صدائے بلند روئے زمین
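The izafat rules in this section can be read as a small decision procedure. The Python sketch below follows the summary given here. The function name `add_izafat` and its parameters are hypothetical, and it assumes a non-empty word with no trailing diacritics. Whether a final ہ is pronounced cannot be derived from the spelling, so the caller has to supply it, and a final و is assumed to represent a vowel.

```python
ZER = "\u0650"
FARSI_YEH, YEH_HAMZA = "\u06CC", "\u0626"
YEH_BARREE, YEH_BARREE_HAMZA = "\u06D2", "\u06D3"
HEH_GOAL, HEH_GOAL_HAMZA = "\u06C1", "\u06C2"
ALEF, WAW = "\u0627", "\u0648"


def add_izafat(word, heh_pronounced=False, vowelled=True):
    """Return the first word of an izafat pair with the izafat marking added."""
    last = word[-1]
    if last == FARSI_YEH:
        return word[:-1] + YEH_HAMZA
    if last == YEH_BARREE:
        return word[:-1] + YEH_BARREE_HAMZA
    if last == HEH_GOAL and not heh_pronounced:
        return word[:-1] + HEH_GOAL_HAMZA
    if last in (ALEF, WAW):
        # Assumption: the final alef or waw represents a vowel.
        return word + YEH_HAMZA + YEH_BARREE
    # Any other ending, including a pronounced heh: zer, or nothing if unvowelled.
    return word + ZER if vowelled else word
```

This version emits the precomposed characters; a caller that prefers decomposed sequences could apply Unicode normalisation afterwards.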

Consonants

Consonant sounds to characters

This section maps Urdu consonant sounds to common graphemes in the Arabic orthography.

Sounds listed as 'infrequent' are allophones, or sounds used for foreign words, etc.

Stops

p

پ [U+067E ARABIC LETTER PEH], eg. پانی.

b

ب [U+0628 ARABIC LETTER BEH], eg. بہت

t

ت [U+062A ARABIC LETTER TEH], eg. تین.

ط [U+0637 ARABIC LETTER TAH], in words of Arabic origin, eg. خُطُوط

d

د [U+062F ARABIC LETTER DAL], eg. دو.

ʈ

ٹ [U+0679 ARABIC LETTER TTEH], eg. ٹانگ.

ɖ

ڈ [U+0688 ARABIC LETTER DDAL], eg. انڈا‎.

k

ک [U+06A9 ARABIC LETTER KEHEH], eg. کتا‎.

ɡ

گ [U+06AF ARABIC LETTER GAF], eg. گردن‎.

q

ق [U+0642 ARABIC LETTER QAF], eg. قلم.

Affricates

t͡ʃ

چ [U+0686 ARABIC LETTER TCHEH], eg. چار‎.

d͡ʒ

ج [U+062C ARABIC LETTER JEEM], eg. جانور‎.

Fricatives

f

ف [U+0641 ARABIC LETTER FEH], eg. سفید‎.

v

و [U+0648 ARABIC LETTER WAW], as an allophone of ʋ, eg. ورت.

s

س [U+0633 ARABIC LETTER SEEN], eg. سورج‎.

ص [U+0635 ARABIC LETTER SAD] in words of Arabic origin, eg. صابُن.

ث [U+062B ARABIC LETTER THEH] in words of Arabic or Persian origin, eg. ثابت.

z

ز [U+0632 ARABIC LETTER ZAIN], eg. نزدیک

ذ [U+0630 ARABIC LETTER THAL], eg. جذبہ.

ض [U+0636 ARABIC LETTER DAD], in words of Arabic origin, eg. ضِد.

ظ [U+0638 ARABIC LETTER ZAH], in words of Arabic origin, eg. ظَاہِر.

ʃ

ش [U+0634 ARABIC LETTER SHEEN], eg. بارش‎.

x

خ [U+062E ARABIC LETTER KHAH], eg. خون‎.

ɣ

غ [U+063A ARABIC LETTER GHAIN], eg. غُلام.

ɦ

ہ [U+06C1 ARABIC LETTER HEH GOAL], eg. ہڈی‎.

ح [U+062D ARABIC LETTER HAH] in words of Arabic origin, eg. حَاکِم

Nasals

Other

ʋ

و [U+0648 ARABIC LETTER WAW], eg. توچا‎.

w

و [U+0648 ARABIC LETTER WAW] as an allophone of ʋ commonly occurring between a consonant and vowel, eg. پکوان.

r

ر [U+0631 ARABIC LETTER REH], eg. اردو.

ɾ

ر [U+0631 ARABIC LETTER REH], eg. آرام. Allophone of r that tends to occur between vowels.

ɽ

ڑ [U+0691 ARABIC LETTER RREH], eg. بڑا‎.

l

ل [U+0644 ARABIC LETTER LAM], eg. لال‎.

j

ی [U+06CC ARABIC LETTER FARSI YEH], eg. نیا‎.

Sources: Wikipedia, and Google Translate.

Basic letters

The alphabet standardised in 2004 by the National Language Authority in Pakistan counts 39 letters, and 18 digraphs representing aspirated consonants. Follow the links to the character notes for the letters described below to find examples and detailed information.

پ␣ب␣ت␣ط␣د␣ٹ␣ڈ␣ک␣گ␣ق␣ء
چ␣ج
ف␣و␣س␣ث␣ص␣ز␣ذ␣ض␣ظ␣ش␣ژ␣خ␣غ␣ہ␣ح␣ھ␣ع
م␣ن␣ں
ر␣ڑ␣ل␣ی
ے␣ا

و [U+0648 ARABIC LETTER WAW] and ی [U+06CC ARABIC LETTER FARSI YEH] represent both consonants and vowels. See vowel_mappings.

ہ [U+06C1 ARABIC LETTER HEH GOAL] normally represents the sound ɦ in Urdu, but it is also pronounced ɑː or is silent in certain contexts.  ح [U+062D ARABIC LETTER HAH] is used for words of Arabic origin.

There are 3 letters for s, and 4 for z, due to the retention of Arabic spelling for words of Arabic origin. The most common letter for s is س [U+0633 ARABIC LETTER SEEN], and for z is ز [U+0632 ARABIC LETTER ZAIN].

Aspirated consonants

پھ␣بھ␣تھ␣دھ␣ٹھ␣ڈھ␣کھ␣گھ
چھ␣جھ
وھ␣هھ
مھ␣نھ
رھ␣ڑھ␣لھ␣یھ
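Since each aspirated consonant is written as a base letter followed by ھ [U+06BE ARABIC LETTER HEH DOACHASHMEE], code that counts or sorts letters may want to treat the pair as one unit. A minimal Python sketch follows. The function name `split_letters` is hypothetical, and it assumes unvowelled input with no diacritic between the two code points.

```python
DO_CHASHMI_HE = "\u06BE"


def split_letters(word):
    """Split a word into letters, keeping consonant + U+06BE as one unit."""
    units = []
    for ch in word:
        if ch == DO_CHASHMI_HE and units:
            units[-1] += ch  # attach to the preceding consonant
        else:
            units.append(ch)
    return units
```

A lone ھ, as in the date abbreviation, is returned as a unit of its own.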

Other letters

ۃ

ۃ [U+06C3 ARABIC LETTER TEH MARBUTA GOAL] is rarely used except in certain loan words from Arabic. It is not pronounced. When replaced with an Urdu letter in naturalised loan words ہ [U+06C1 ARABIC LETTER HEH GOAL] is normally used.

Consonant clusters

ْ

The absence of a vowel sound can be indicated with the diacritic  ْ [U+0652 ARABIC SUKUN], called sukūn or jazm, although this diacritic is not normally shown in text, eg. سَخْت

It has various possible forms, including a small round circle, something that looks like peʃ, and something like a circumflex, see fig_sukun.

سَخْت
Three alternative shapes of sukun.

This diacritic is never written above the final character in a word, mainly because as a rule a short vowel is not pronounced in this position.

Consonant lengthening & gemination

ّ

Most native consonants may be lengthened, but not bʱ, ɽ, ɽʱ, or ɦ. Geminate consonants are always medial and preceded by one of ə, ɪ, or ʊ.

In vowelled text, which is very rare, this is shown using the diacritic  ّ [U+0651 ARABIC SHADDA], called taʃdiːd, eg. ستّر. More often than not, this is not written.

Arabic definite article

The pronunciation of ال (alif followed by lām) varies when it represents the Arabic definite article. This affects many words in Urdu that have come from Arabic, in particular names and adverbial expressions.

The lām is not pronounced if it precedes one of the following characters:

ت␣ث␣د␣ذ␣ر␣ز␣س␣ش␣ص␣ض␣ط␣ظ␣ل␣ن

Instead, the following sound is doubled. A tašdīd may sometimes be used to indicate this. Example: السلام علیکم

Often the alif is not pronounced after a short preceding word that ends in a vowel. If the preceding vowel was long, it is shortened in this process. Examples: بالکل فی الحال

Often the vowel is pronounced ʊ, eg. دارالحکومت
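The rule about the unpronounced lām can be expressed as a simple lookup. The Python sketch below checks whether the letter after a leading ال is in the list given above. The function name `lam_is_silent` is hypothetical, and the check assumes the caller already knows that the leading ال is the Arabic definite article, which the spelling alone does not guarantee.

```python
# The letters listed above, before which the lam of the article is not pronounced.
ASSIMILATING = set(
    "\u062A\u062B\u062F\u0630\u0631\u0632\u0633"
    "\u0634\u0635\u0636\u0637\u0638\u0644\u0646"
)
ARTICLE = "\u0627\u0644"  # alif followed by lam


def lam_is_silent(word):
    """True if word starts with the article and the next letter is in the list."""
    return (
        word.startswith(ARTICLE)
        and len(word) > len(ARTICLE)
        and word[len(ARTICLE)] in ASSIMILATING
    )
```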

Encoding choices

In the Urdu orthography different sequences of Unicode characters may produce the same visual result. Here we look at those, and make notes on usage.

Hamza & precomposed characters

Unicode support for the various uses of the hamza is complicated. For notes on the usage of the hamza in Urdu, see standalone and izafat.

Canonically equivalent alternatives

A number of combinations with the hamza diacritic can be represented as either a precomposed character or a decomposed sequence, where the parts are separated in Unicode Normalisation Form D (NFD) and recomposed in Unicode Normalisation Form C (NFC), so both approaches are canonically equivalent. These include the following:

Precomposed Decomposed
أ [U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE] أ [U+0627 ARABIC LETTER ALEF + U+0654 ARABIC HAMZA ABOVE]
آ [U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE] آ [U+0627 ARABIC LETTER ALEF + U+0653 ARABIC MADDAH ABOVE]
ؤ [U+0624 ARABIC LETTER WAW WITH HAMZA ABOVE] ؤ [U+0648 ARABIC LETTER WAW + U+0654 ARABIC HAMZA ABOVE]
ۂ [U+06C2 ARABIC LETTER HEH GOAL WITH HAMZA ABOVE] ۂ [U+06C1 ARABIC LETTER HEH GOAL + U+0654 ARABIC HAMZA ABOVE]
ۓ [U+06D3 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE]  ۓ [U+06D2 ARABIC LETTER YEH BARREE + U+0654 ARABIC HAMZA ABOVE] 
ئ [U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE] ئ [U+064A ARABIC LETTER YEH + U+0654 ARABIC HAMZA ABOVE]

The single code point per vowel-sign is the form preferred by the Unicode Standard and the form in common use for Urdu, but either could be used.

The last item is a special case. The precomposed form has a canonical decomposition, but it is to hamza over ي [U+064A ARABIC LETTER YEH] rather than ی [U+06CC ARABIC LETTER FARSI YEH]. This is used in particular for 'hamza on its chair', but also for word medial standalone vowels, and it is usually only when those are decomposed that the ي [U+064A ARABIC LETTER YEH] is found in Urdu.
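The equivalences in the table can be checked with a normalisation library. The following Python sketch uses the standard `unicodedata` module. The names `PAIRS` and `round_trips` are introduced here for illustration only.

```python
import unicodedata

# Precomposed character -> decomposed sequence, as listed in the table above.
PAIRS = {
    "\u0623": "\u0627\u0654",  # alef with hamza above
    "\u0622": "\u0627\u0653",  # alef with madda above
    "\u0624": "\u0648\u0654",  # waw with hamza above
    "\u06C2": "\u06C1\u0654",  # heh goal with hamza above
    "\u06D3": "\u06D2\u0654",  # yeh barree with hamza above
    "\u0626": "\u064A\u0654",  # yeh with hamza above (decomposes to U+064A)
}


def round_trips(precomposed, decomposed):
    """True if NFD and NFC convert between the two forms."""
    return (
        unicodedata.normalize("NFD", precomposed) == decomposed
        and unicodedata.normalize("NFC", decomposed) == precomposed
    )


# U+06CC FARSI YEH followed by hamza above has no precomposed equivalent,
# so NFC leaves the sequence unchanged.
farsi_yeh_hamza = "\u06CC\u0654"
unchanged = unicodedata.normalize("NFC", farsi_yeh_hamza) == farsi_yeh_hamza
```

The last check illustrates the special case: normalisation never turns ی plus hamza into ئ, so text comparison that relies on NFC alone will treat the two as different.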

Glyphs that are not canonically equivalent

The following alternatives are not converted to each other during normalisation. The precomposed characters represent letters in languages such as Pashto, Ormuri, and Adamawa Fulfulde where the hamza is an ijam (ie. part of the letter) rather than a combining diacritic. These precomposed characters are therefore not appropriate for use with Urdu.

Decomposed Precomposed
حٔ [U+062D ARABIC LETTER HAH + U+0654 ARABIC HAMZA ABOVE] ځ [U+0681 ARABIC LETTER HAH WITH HAMZA ABOVE]
d͡z in Pashto
رٔ [U+0631 ARABIC LETTER REH + U+0654 ARABIC HAMZA ABOVE] ݬ [U+076C ARABIC LETTER REH WITH HAMZA ABOVE]
voiced alveolo-palatal laminal fricative in Ormuri
بٔ [U+0628 ARABIC LETTER BEH + U+0654 ARABIC HAMZA ABOVE] ࢡ [U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE]
implosive bilabial stop in Adamawa Fulfulde

The decomposed forms are recommended for use with Urdu. However, if the font supports them, both approaches may yield exactly the same result when displayed, so applications will need to recognise both precomposed and decomposed alternatives as the same grapheme in case users use the precomposed character. Input mechanisms, on the other hand, can produce one rather than the other, and that choice should be made with advisement.
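One way an application might recognise both alternatives as the same grapheme is to fold the precomposed letters to the recommended decomposed sequences before comparing Urdu text. This is only a sketch under that assumption: the names `FOLD` and `fold_hamza_letters` are hypothetical, and the fold would be wrong for Pashto, Ormuri, or Fulfulde text, where the precomposed letters are correct.

```python
import unicodedata

# Precomposed letter -> base letter + U+0654 HAMZA ABOVE.
FOLD = str.maketrans({
    "\u0681": "\u062D\u0654",  # hah with hamza above
    "\u076C": "\u0631\u0654",  # reh with hamza above
    "\u08A1": "\u0628\u0654",  # beh with hamza above
})


def fold_hamza_letters(text):
    """Replace the precomposed letters, then normalise the result to NFC."""
    return unicodedata.normalize("NFC", text.translate(FOLD))
```

Normalisation alone does not help here: NFC leaves the decomposed sequence as it is, and NFD leaves the precomposed letter as it is, which is why an explicit mapping is used.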

Confusables & spelling errors

The following lists some common errors found in Urdu text due to the similarity of Unicode characters, or perhaps sometimes due to problems inputting the correct character. Wikipedia is a rich source of such.

Correct Incorrect
ی [U+06CC ARABIC LETTER FARSI YEH] ي [U+064A ARABIC LETTER YEH] ①
ہ [U+06C1 ARABIC LETTER HEH GOAL] ه [U+0647 ARABIC LETTER HEH] 
ۂ [U+06C2 ARABIC LETTER HEH GOAL WITH HAMZA ABOVE] ۀ [U+06C0 ARABIC LETTER HEH WITH YEH ABOVE]
ۃ [U+06C3 ARABIC LETTER TEH MARBUTA GOAL]  ة [U+0629 ARABIC LETTER TEH MARBUTA] 
ک [U+06A9 ARABIC LETTER KEHEH] ك [U+0643 ARABIC LETTER KAF] ②
ْ   [U+0652 ARABIC SUKUN]  ٛ   [U+065B ARABIC VOWEL SIGN INVERTED SMALL V ABOVE]

① The Arabic YEH doesn't drop the dots below in isolate and final positions. As mentioned above, ي [U+064A ARABIC LETTER YEH] is only found in decomposed text representing yeh with a hamza; in those circumstances the font should not display the dots below.

② Common fonts tend not to show the difference between these two characters, but the ability to search and compare text is impaired unless the application is aware of and takes counter-measures against this substitution.

The function of this glyph is that of the sukun, so the correct semantic character should be used. Although ٛ [U+065B ARABIC VOWEL SIGN INVERTED SMALL V ABOVE] looks like the Urdu jazm, as described in the name of the character, it was introduced to Unicode to serve as a vowel sign for African languages.
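A counter-measure for search and comparison could map each character in the 'Incorrect' column to its 'Correct' counterpart. The Python sketch below does this with a translation table. The names `CONFUSABLES` and `fix_confusables` are hypothetical, and the approach assumes the input is entirely Urdu; embedded Arabic text, where these characters are correct, would be altered.

```python
import unicodedata

# Incorrect code point -> correct Urdu code point, from the table above.
CONFUSABLES = str.maketrans({
    "\u064A": "\u06CC",  # yeh -> farsi yeh
    "\u0647": "\u06C1",  # heh -> heh goal
    "\u06C0": "\u06C2",  # heh with yeh above -> heh goal with hamza above
    "\u0629": "\u06C3",  # teh marbuta -> teh marbuta goal
    "\u0643": "\u06A9",  # kaf -> keheh
    "\u065B": "\u0652",  # inverted small v above -> sukun
})


def fix_confusables(text):
    """Map look-alike characters to the ones expected in Urdu text."""
    # NFC first, so a legitimate U+064A + U+0654 pair is composed to U+0626
    # before any remaining U+064A is replaced.
    composed = unicodedata.normalize("NFC", text)
    # NFC again, so that eg. heh goal + hamza above ends up precomposed.
    return unicodedata.normalize("NFC", composed.translate(CONFUSABLES))
```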

Observation: In the Noto Nastaliq Urdu and SIL Awami Nastaliq fonts the sukun is automatically displayed with the inverted-v shape if the language of the content is declared to be Urdu (ur). It is therefore important to ensure that the language of content is correctly declared for web pages if you expect to see this shape.

Honorifics

A number of combining marks are used with names as honorifics, eg. قاضی نور محمّدؒ qɑẑy nvr mhmᵚdؒ (kaziː nur mamed rahmatulla alayhe, 'Qazi Nur Muhammad, may God have mercy upon him!'). They are combining characters that appear over the name at a point chosen by the author.

ؔ␣ؓ␣ؒ␣ؑ␣ؐ

Numbers

Urdu may use ASCII digits, or may use the Extended Arabic-Indic digits in the Arabic block.

۰␣۱␣۲␣۳␣۴␣۵␣۶␣۷␣۸␣۹

This is a separate set of characters from those used for Arabic, to accommodate different shaping and directional behaviour. Shapes differ from those of Arabic for the digits 4, 5, and 7.
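The difference in directional behaviour between the two digit sets shows up in their Unicode bidirectional classes, which can be inspected with Python's standard `unicodedata` module. In this short sketch the helper name `describe` is hypothetical.

```python
import unicodedata

URDU_DIGITS = "".join(chr(0x06F0 + d) for d in range(10))    # U+06F0..U+06F9
ARABIC_DIGITS = "".join(chr(0x0660 + d) for d in range(10))  # U+0660..U+0669


def describe(ch):
    """Return (numeric value, bidirectional class) for a digit character."""
    return unicodedata.digit(ch), unicodedata.bidirectional(ch)


# Python's int() accepts any Unicode decimal digits, including these.
value = int(URDU_DIGITS[1] + URDU_DIGITS[2] + URDU_DIGITS[3])
```

The digits used for Urdu are classed as European numbers (EN), whereas the digits used for the Arabic language are classed as Arabic numbers (AN), so the bidirectional algorithm treats them differently next to separators and signs.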

Persian also uses the same characters for digits, but there are some systematic shape differences between Persian and Urdu for the digits 4, 6, and 7.

Urdu: ۰۱۲۳۴۵۶۷۸۹
Persian: ۰۱۲۳۴۵۶۷۸۹
Arabic: ٠١٢٣٤٥٦٧٨٩
Comparison of digit shapes in Urdu, Persian and Arabic.

Urdu also has special characters for the thousands and decimal separators: ٬ [U+066C ARABIC THOUSANDS SEPARATOR] and ٫ [U+066B ARABIC DECIMAL SEPARATOR] (see fig_percent_sign), although the ASCII full stop and comma may also be used.
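As an illustration of these characters in use, the following Python sketch formats a number with ASCII digits and separators and then maps each character to its Urdu counterpart. The function name `to_urdu_number` and its `decimals` parameter are hypothetical.

```python
# ASCII digits, comma, and full stop -> Urdu digits and Arabic separators.
TO_URDU = str.maketrans(
    "0123456789,.",
    "\u06F0\u06F1\u06F2\u06F3\u06F4\u06F5\u06F6\u06F7\u06F8\u06F9"
    "\u066C"  # ARABIC THOUSANDS SEPARATOR
    "\u066B",  # ARABIC DECIMAL SEPARATOR
)


def to_urdu_number(value, decimals=1):
    """Format value with grouping, then swap in the Urdu characters."""
    return f"{value:,.{decimals}f}".translate(TO_URDU)
```

The characters stay in the same logical order as in the ASCII string; only the code points change.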

See expressions for a discussion of how to handle numeric ranges.

Percentages

Urdu may use the Arabic percent sign, ٪ [U+066A ARABIC PERCENT SIGN].

؜۵٬۴۳۲٫۱٪

The figure 5,432.1% using Urdu characters.

The percent sign is typed and stored after the numbers. Like the numeric sequences using the ASCII hyphen (mentioned in expressions), it will appear to the left of a number if that number is preceded by Urdu characters. However, if the percentage appears alone or at the beginning of a line it is necessary to use an ALM formatting character just before it to prevent the sign appearing on the right.
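The storage order described here can be sketched as follows: the percent sign goes after the digits in memory, and an ALM (U+061C ARABIC LETTER MARK) is added in front only when the percentage stands alone or starts a line. The function name `urdu_percent` and the `starts_line` flag are hypothetical.

```python
ALM = "\u061C"      # ARABIC LETTER MARK, an invisible formatting character
PERCENT = "\u066A"  # ARABIC PERCENT SIGN


def urdu_percent(number_text, starts_line=False):
    """Append the percent sign; prefix ALM when nothing Urdu precedes the number."""
    prefix = ALM if starts_line else ""
    return prefix + number_text + PERCENT
```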

Observation: Wikipedia uses an ASCII percent sign with ASCII digits

Number sign

Urdu has a sign ؀ [U+0600 ARABIC NUMBER SIGN] which can be used to indicate a number. As shown in fig_number_sign, its length varies with the number of digits in the number.

؀۱۲۳
The Arabic number sign runs below the numbers it is used with.

To use this sign, type it before the digits. Even though it displays beneath the digits, it is a formatting character, and not a combining mark.

Dates

؍␣ء␣ھ␣؁␣؄

Dates in Urdu may be based on the Gregorian calendar or the Hijri calendar. Dates in the Gregorian calendar are followed by this word (usually represented by the abbreviation ء [U+0621 ARABIC LETTER HAMZA]): عیسوی ʿysvy iːsviː Christian Era

Dates using the Muslim calendar are followed by this word (abbreviated as ھ [U+06BE ARABIC LETTER HEH DOACHASHMEE]): ہجری ḫʤry hɪʤriː

یکم جمادی الاول 1423 ھ

An Urdu date (12 July 2002) in the Hijri calendar.

The word hijri in Arabic is written with ه [U+0647 ARABIC LETTER HEH] rather than ہ [U+06C1 ARABIC LETTER HEH GOAL] (see the Urdu spelling just above), and the abbreviation in Arabic is ه‍ [U+0647 ARABIC LETTER HEH + U+200D ZERO WIDTH JOINER], whereas in Urdu it is ھ [U+06BE ARABIC LETTER HEH DOACHASHMEE]. Here is the Arabic spelling: هجري

Dates may also be indicated by placing the long sweep of ؁ [U+0601 ARABIC SIGN SANAH] below the year digits.

؁۲۰۱۴ء
An Urdu date (2014), with a SANAH sign running below it, and a hamza to indicate the Gregorian calendar.

Like the number sign, SANAH is typed before the digits (see fig_sanah). It is not a combining character, even though it displays beneath the digits. The length of the symbol may vary according to the number of digits. It is terminated by a non-digit character.
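The typing order for a year with a SANAH sign can be sketched like this. The helper name `urdu_year` and its `calendar_mark` parameter are hypothetical; the sketch assumes a non-negative year and Extended Arabic-Indic digits.

```python
SANAH = "\u0601"  # ARABIC SIGN SANAH, stored before the digits
HAMZA = "\u0621"  # abbreviation that marks a Gregorian date


def urdu_year(year: int, calendar_mark: str = HAMZA) -> str:
    """Return SANAH + year digits + an optional calendar abbreviation.

    The non-digit character after the digits ends the extent of the sign.
    """
    digits = "".join(chr(0x06F0 + int(d)) for d in str(year))
    return SANAH + digits + calendar_mark


print([f"U+{ord(c):04X}" for c in urdu_year(2014)])
```

Whether the sign is drawn correctly under the digits depends on the font and the base direction in effect.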

؄ [U+0604 ARABIC SIGN SAMVAT] is another subtending mark, intended to indicate a year in the Śaka calendar.  

؍ [U+060D ARABIC DATE SEPARATOR] is used in Urdu between the date and the month name [u14,379].

27؍اگست2021ء
The date in a newspaper masthead, showing the date separator between date and month name in two calendars.

Symbol

This is one of the few characters in the presentation forms blocks that is valid for use in normal content.

﷽ [U+FDFD ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM] is used by Muslims in various contexts including the constitutions of countries where Islam has a significant presence. The shape varies significantly from font to font and usage to usage.

Formatting characters

The Arabic script uses a number of Unicode characters that affect the way that other characters are rendered. Many of those have no visible form of their own. The following set of characters used in Urdu text does have a visual representation.

؀␣؁␣؂␣؃␣؄␣۝

Follow the links to learn more about each of these characters.

Observation: The subtending character display is broken in the Noto Nastaliq Urdu font. That font only produces the expected display if (a) a RTL override is applied to the characters, or (b) the SANAH is typed after the digits (in a RTL normal base direction, but not an override). The Awami Nastaliq font handles them as expected, as long as the sign precedes the digits and the base direction is set to RTL (but not if a directional override is applied).

Urdu text also makes use of a relatively large set of invisible formatting characters, especially in plain text, many of which are used to manage text direction (see directioncontrols), and others are used to control cursive shaping behaviour (see shapingcontrols).

Text direction

Urdu is written horizontally and right-to-left in the main, but (as with most RTL scripts) numbers and embedded LTR script text are written left-to-right (producing 'bidirectional' text).

رکھتا ہے اور 2009ء میں UEFA کپ کے

Urdu words are read right-to-left, starting from the right of this line, but numbers and Latin text are read left-to-right.

The Unicode Bidirectional Algorithm automatically takes care of the ordering for all the text in fig_uefa, as long as the 'base direction' is set to RTL. In HTML this can be set using the dir attribute, or in plain text using formatting controls.
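The automatic ordering relies on the bidi class assigned to each character. As a small illustration, Python's standard `unicodedata` module can report those classes; the helper name `bidi_classes` is hypothetical.

```python
import unicodedata


def bidi_classes(text: str) -> list[tuple[str, str]]:
    """Return (code point, bidi class) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]


# An Urdu letter, an ASCII digit, a Latin letter, an Extended Arabic-Indic
# digit, and an Arabic-Indic digit.
for code_point, bidi_class in bidi_classes("\u0631" "2" "U" "\u06F1" "\u0661"):
    print(code_point, bidi_class)
```

Arabic-script letters report AL, Latin letters L, and ASCII and Extended Arabic-Indic digits EN, whereas the Arabic-Indic digits used for Arabic report AN.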

If the base direction is not set appropriately, the directional runs will be ordered incorrectly as shown in fig_bidi_no_base_direction.

رکھتا ہے اور 2009ء میں UEFA کپ کے

رکھتا ہے اور 2009ء میں UEFA کپ کے

The exact same sequence of characters with the base direction set to RTL (top), and with no base direction set on this LTR page (bottom).


For more information about how directionality and base direction work, see Unicode Bidirectional Algorithm basics. For information about plain text formatting characters see How to use Unicode controls for bidi text. And for working with markup in HTML, see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.

On this page, see also expressions and breaking_latin for additional features related to direction.

Managing text direction

Unicode provides a set of 10 formatting characters that can be used to control the direction of text when displayed. These characters have no visual form in the rendered text; however, text editing applications may have a way to show their location.

RLE [U+202B RIGHT-TO-LEFT EMBEDDING] (RLE), LRE [U+202A LEFT-TO-RIGHT EMBEDDING] (LRE), and PDF [U+202C POP DIRECTIONAL FORMATTING] (PDF) are in widespread use to set the base direction of a range of characters. RLE/LRE come at the start, and PDF at the end of a range of characters for which the base direction is to be set.

More recently, the Unicode Standard added a set of characters which do the same thing but also isolate the content from surrounding characters, in order to avoid spillover effects. They are RLI [U+2067 RIGHT-TO-LEFT ISOLATE] (RLI), LRI [U+2066 LEFT-TO-RIGHT ISOLATE] (LRI), and PDI [U+2069 POP DIRECTIONAL ISOLATE] (PDI). The Unicode Standard recommends that these be used instead.

There is also FSI [U+2068 FIRST STRONG ISOLATE] (FSI), used initially to set the base direction according to the first recognised strongly-directional character.
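For plain text, wrapping a range of characters in the isolate controls might look like the sketch below. The function name `isolate` and its `direction` values are hypothetical; in HTML, markup should generally be used instead.

```python
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text: str, direction: str = "auto") -> str:
    """Wrap text in an isolate opener and the closing PDI.

    direction is 'rtl', 'ltr', or 'auto' (first strong character decides).
    """
    opener = {"rtl": RLI, "ltr": LRI, "auto": FSI}[direction]
    return opener + text + PDI


wrapped = isolate("UEFA", "ltr")
print([f"U+{ord(c):04X}" for c in wrapped])
```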

؜ALM [U+061C ARABIC LETTER MARK] (ALM) is used to produce correct sequencing of numeric data. Follow the link and see expressions for details. 

RLM [U+200F RIGHT-TO-LEFT MARK] (RLM) and LRM [U+200E LEFT-TO-RIGHT MARK] (LRM) are invisible characters with strong directional properties that are also sometimes used to produce the correct ordering of text.

For more information about how to use these formatting characters see How to use Unicode controls for bidi text. Note, however, that when writing HTML you should generally use markup rather than these control codes. For information about that, see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.

Expressions & sequences

A sequence of numbers separated by hyphens (for example a range) runs from right to left in Urdu.

fig_range shows some Urdu text, which is right-to-left overall, containing a numeric range that is also ordered RTL, ie. it starts with 100 and ends with 999.

100–999 تصدیق شدہ کیس

A numeric range in Urdu language text.

When a list uses the ASCII hyphen as a separator, the Unicode Bidirectional Algorithm automatically produces the expected ordering only when a sequence or expression follows Urdu characters. However, a sequence that appears alone on a line will be ordered left-to-right. To make the sequence read right-to-left you should, in this case, add the formatting character ؜ALM [U+061C ARABIC LETTER MARK] (ALM) at the start of the line (see fig_ALM).

؜10-01-2018

10-01-2018

A numeric date alone on a line of RTL text, with ALM before it (top), and without (bottom).

Note that the required order cannot be achieved by simply setting the base direction, nor by using RLM [U+200F RIGHT-TO-LEFT MARK].
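One way to apply the ALM rule programmatically is sketched below. It is an assumption-laden illustration: the helper `add_alm_if_needed` is hypothetical, and it only recognises lines that begin with digits joined by ASCII hyphens.

```python
import re

ALM = "\u061C"  # ARABIC LETTER MARK

# Digits (ASCII or Extended Arabic-Indic) joined by ASCII hyphens at the
# start of a line, eg. 10-01-2018.
_LEADING_SEQUENCE = re.compile(r"^[0-9\u06F0-\u06F9]+(?:-[0-9\u06F0-\u06F9]+)+")


def add_alm_if_needed(line: str) -> str:
    """Prefix ALM when a hyphen-separated numeric sequence starts the line."""
    if _LEADING_SEQUENCE.match(line):
        return ALM + line
    return line


print([f"U+{ord(c):04X}" for c in add_alm_if_needed("10-01-2018")][:3])
```

A line that already starts with ALM, or with Urdu letters, is returned unchanged.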

Alternatively, you could use a different separator, such as – [U+2013 EN DASH] (as in fig_range) or ‐ [U+2010 HYPHEN]. No special arrangements are then necessary.

Similar RTL ordering is applied to numbers in equations, such as 1 + 2 = 3, for Urdu language text.

See also percent_sign.

Glyph shaping & positioning

This section brings together information about the following topics: writing styles; cursive text; context-based shaping; context-based positioning; baselines, line height, etc.; font styles; case & other character transforms.

You can experiment with examples using the Urdu character app.

The orthography has no case distinction, and no special transforms are needed to convert between characters.

Font styles

Urdu is normally written in a nasta'liq writing style. Key features include a sloping baseline for joined letters, and overall complex shaping and positioning for base letters and diacritics alike. There are also distinctive shapes for many glyphs and ligatures.

مستحق  •  شخص  •  کیفیت

Sloping baselines and complex joining behaviours in Urdu nastaliq text.

This is achieved in Unicode by applying the correct font – the underlying characters used are not different for nasta'liq vs. other styles.

کوئی شخص محض حاکم کی مرضی پر اپنی قومیت سے محروم نہیں کیا جائے گا اور اس کو قومیت تبدیل کرنے کا حق دینے سے انکار نہ کیا جائے گا۔

Urdu is normally written in the nasta'liq writing style.

کوئی شخص محض حاکم کی مرضی پر اپنی قومیت سے محروم نہیں کیا جائے گا اور اس کو قومیت تبدیل کرنے کا حق دینے سے انکار نہ کیا جائے گا۔

The same text, written in a standard naskh writing style.

Not only does the baseline slope for connected glyphs in a word, but the sloping sequences can overlap, as shown in fig_overlap, which uses the Awami Nastaliq font.

391 میں تھیوفلس اعظم
Sloping baselines and complex joining behaviours in Urdu nastaliq text.

Cursive script

Arabic script joins letters together. Fonts need to produce the appropriate joining form for a code point, according to its visual context. This results in four different shapes for most letters (including an isolated shape). The highlights in fig_cursive below show the same letter, ع [U+0639 ARABIC LETTER AIN], with two different joining forms. 

عقل ودیعت
The letter ع [U+0639 ARABIC LETTER AIN] in 2 different joining contexts.

A few Arabic script letters only join on the right-hand side.

There are 2 Unicode blocks containing Arabic presentation forms: these contain individual characters corresponding to the various joining forms and ligatures. With only a handful of exceptions, characters in those blocks should not be used for text content; they are only for managing legacy encodings. Instead, characters in the main Arabic block should be used, and the font will manage the necessary cursive shaping.
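A rough sketch of how legacy presentation-form characters could be detected and mapped back to the main Arabic block is shown below. The helper names are hypothetical, the allow-list contains only the one exception discussed on this page, and NFKC normalization is used here as a convenient approximation rather than a recommended conversion for every character in those blocks.

```python
import unicodedata

# Characters in the presentation forms blocks that are kept as they are.
# Only the bismillah ligature is listed; other exceptions are not covered.
ALLOWED = {"\uFDFD"}


def is_presentation_form(ch: str) -> bool:
    """True for characters in Arabic Presentation Forms-A or -B."""
    cp = ord(ch)
    in_blocks = 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFF
    return in_blocks and ch not in ALLOWED


def replace_presentation_forms(text: str) -> str:
    """Replace presentation forms with their compatibility decompositions."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if is_presentation_form(ch) else ch
        for ch in text
    )


# U+FEFB is the isolated lam-alef ligature; it becomes U+0644 U+0627.
result = replace_presentation_forms("\uFEFB")
print([f"U+{ord(c):04X}" for c in result])
```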

Cursive joining forms

Most dual-joining characters add or become a swash when they don't join to the left. A number of characters, however, undergo additional shape changes across the joining forms. fig_joining_forms and fig_right_joining_forms show the basic shapes in Urdu and what their joining forms look like.

Two pairs of characters in the first table have base shapes that are identical, but they manage the dots differently in different joining forms. These have been put onto separate rows.

isolated  right-joined  dual-join  left-joined  Urdu letters
ب ـب ـبـ بـ
ب␣ت␣ث␣پ␣ٹ
ن ـن ـنـ نـ
ن
ں ـں ـںـ ںـ
ں
ق ـق ـقـ قـ
ق
ف ـف ـفـ فـ
ف
س ـس ـسـ سـ
س␣ش
ص ـص ـصـ صـ
ص␣ض
ط ـط ـطـ طـ
ط␣ظ
ک ـک ـکـ کـ
ک␣گ
ل ـل ـلـ لـ
ل
ہ ـہ ـہـ ہـ
ہ␣ۂ
ھ ـھ ـھـ ھـ
ھ
م ـم ـمـ مـ
م
ع ـع ـعـ عـ
ع␣غ
ح ـح ـحـ حـ
ح␣خ␣ج␣چ
ی ـی ـیـ یـ
ی
ئ ـئ ـئـ ئـ
ئ
Joining forms for shapes that join on both sides.
isolated  right-joined  Urdu letters
ا ـا
ا␣آ
ر ـر
ر␣ڑ␣ز␣ژ
د ـد
د␣ڈ␣ذ
و ـو
و␣ؤ
ے ـے
ے
Joining forms for shapes that join on the right only.
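The two tables above can be turned into a small illustration of how a joining form is chosen for each letter. This is a simplified sketch, not a description of how fonts or shaping engines work: the function name is hypothetical, the letter sets are taken from the tables, any other character is treated as non-joining, and diacritics are not handled.

```python
# Letters that join on the right only (second table).
RIGHT_JOINING = set(
    "\u0627\u0622\u0631\u0691\u0632\u0698\u062F\u0688\u0630\u0648\u0624\u06D2"
)
# Letters that join on both sides (first table).
DUAL_JOINING = set(
    "\u0628\u062A\u062B\u067E\u0679\u0646\u06BA\u0642\u0641\u0633\u0634\u0635"
    "\u0636\u0637\u0638\u06A9\u06AF\u0644\u06C1\u06C2\u06BE\u0645\u0639\u063A"
    "\u062D\u062E\u062C\u0686\u06CC\u0626"
)


def joining_forms(word: str) -> list[str]:
    """Return the joining form of each letter, in logical order."""
    forms = []
    for i, ch in enumerate(word):
        prev = word[i - 1] if i > 0 else ""
        nxt = word[i + 1] if i + 1 < len(word) else ""
        can_join = ch in DUAL_JOINING or ch in RIGHT_JOINING
        # A join to the previous letter needs that letter to join leftwards.
        joins_prev = can_join and prev in DUAL_JOINING
        # A join to the next letter needs this letter to join leftwards.
        joins_next = ch in DUAL_JOINING and (
            nxt in DUAL_JOINING or nxt in RIGHT_JOINING
        )
        if joins_prev and joins_next:
            forms.append("dual-join")
        elif joins_prev:
            forms.append("right-joined")
        elif joins_next:
            forms.append("left-joined")
        else:
            forms.append("isolated")
    return forms


# The two words used in the AIN example earlier in this section.
print(joining_forms("\u0639\u0642\u0644"))
print(joining_forms("\u0648\u062F\u06CC\u0639\u062A"))
```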

Managing glyph shaping

ZWJ [U+200D ZERO WIDTH JOINER] (ZWJ) and ZWNJ [U+200C ZERO WIDTH NON-JOINER] (ZWNJ) are used to control the joining behaviour of cursive glyphs. They are particularly useful in educational contexts, but also have real world applications.

ZWJ permits a letter to form a cursive connection without a visible neighbour. It can be used for illustrating cursive joining forms, eg. ان‍‍   ‍س‍‍   ‍ان Characters from the Presentation Forms blocks in Unicode should not be used in such cases.

ZWNJ prevents two adjacent letters forming a cursive connection with each other when rendered, eg. ان‌س‌ان
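The use of these two controls can be sketched as follows. Both helper names are hypothetical, and the form labels follow the column headings used in the joining-form tables on this page.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER


def show_joining_form(letter: str, form: str) -> str:
    """Add ZWJ on the side(s) where a join should be displayed.

    Text is in logical order, so a ZWJ before the letter requests a join to
    a preceding neighbour and a ZWJ after it a join to a following one.
    """
    before = ZWJ if form in ("right-joined", "dual-join") else ""
    after = ZWJ if form in ("left-joined", "dual-join") else ""
    return before + letter + after


def prevent_joining(word: str) -> str:
    """Insert ZWNJ between every pair of adjacent characters."""
    return ZWNJ.join(word)


medial_seen = show_joining_form("\u0633", "dual-join")
print([f"U+{ord(c):04X}" for c in medial_seen])
```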

͏CGJ [U+034F COMBINING GRAPHEME JOINER] (CGJ) is used in Arabic to produce special ordering of diacritics. The name is a misnomer, as it is generally used to break the normal sequence of diacritics.

Context-based shaping & positioning

Context-based shaping is everywhere in Urdu due to the combination of the cursive behaviour of the script plus the strong tendency to arrange joined characters in cascades or vertical arrangements.

As in Arabic, lam followed by alef ligates, eg. اسلام and there are other such commonly ligated forms. There are also common rules about special joining arrangements when certain characters appear side by side, for example a KA followed by an ALEF takes the special shape کا.

Positioning of cursive joining forms is already complicated in the nastaliq style because of the vertical placement; adding dots and hamzas then complicates matters in that they need to be aligned with the appropriate base character without overlapping adjacent character glyphs or other dots, etc. Positioning vowel diacritics, shadda, etc. then adds to the complexity.

The table in fig_gpos selects just a handful of situations to illustrate the kinds of positioning that take place.

 nastaliq  naskh  notes
A حیثیت حیثیت A relatively straightforward arrangement, except for the positioning (and context-based shaping) required to achieve the sloping baseline.
B ویکیپیڈیا ویکیپیڈیا Here, the dots have been arranged vertically so that they don't crash into each other. More radical arrangements of this kind will be seen in the following examples.
C پیٹی اؔبِیجیل پیٹی اؔبِیجیل A similar situation, where additional horizontal and vertical spacing has been applied in order to allow room for the dots and other diacritics to appear without crashing into other glyphs or dots, etc.
D چاہیئے چاہیئے It is common for diacritics of characters preceding YEH BARREE to be rendered below the latter character's glyph. Here we see part of both an initial HEH and the 2 dots of a YEH separated from the other glyphs that make up those characters.
E تصدیق تصدیق In this word, the 2 dots below the YEH create most of the horizontal space between the preceding DAL and following QAF. In the Nafees Nastaleeq font, the 2 dots are moved below and slightly under the QAF, reducing the overall horizontal width of the word.
F اسلام اسلام Note the convention that the word-final MEEM here starts above the baseline, even though nothing follows it.
G دلچسپی دلچسپی A highly vertical arrangement using the Nafees Nastaleeq font, where dots are stacked together. In the Awami and Noto nastaliq fonts this looks less vertical, ie. دلچسپی
Examples of glyph positioning in the nastaliq style.

Font styling & weight

tbd

Graphemes

Grapheme clusters

tbd

Punctuation & inline features

Word boundaries

Words are separated by spaces.

Phrase & section boundaries

،␣؛␣:␣۔␣.␣؟␣!␣؎␣؏

Urdu uses a mixture of ASCII and Arabic punctuation.

phrase

، [U+060C ARABIC COMMA]

؛ [U+061B ARABIC SEMICOLON]

: [U+003A COLON]

sentence

۔ [U+06D4 ARABIC FULL STOP] 

. [U+002E FULL STOP]

؟ [U+061F ARABIC QUESTION MARK] 

! [U+0021 EXCLAMATION MARK]

poetry

؎ [U+060E ARABIC POETIC VERSE SIGN]

؏ [U+060F ARABIC SIGN MISRA]

معاشرے، … پڑے گا۔

Urdu text using an Arabic comma, and an Arabic full stop.

Poetry

In poetry, ؎ [U+060E ARABIC POETIC VERSE SIGN] is used to mark the beginning of poetic verse, and ؏ [U+060F ARABIC SIGN MISRA] is used to indicate a single line (misra) of a couplet (shayr) from an Urdu poem, when quoted in text. It is used at the beginning of the line, and is followed by the line of verse. For more information and examples, follow the links on the character names.

Bracketed text

(␣)

Urdu commonly uses ASCII parentheses to insert parenthetical information into text.

  start end
standard

( [U+0028 LEFT PARENTHESIS]

) [U+0029 RIGHT PARENTHESIS]

Quotations & citations

”␣“

Urdu texts use quotation marks around quotations. Of course, due to keyboard design, quotations may also be surrounded by ASCII double and single quote marks. Note, however, that the order of use is different from that in LTR text, because they are not automatically mirrored.

  start end
initial

” [U+201D RIGHT DOUBLE QUOTATION MARK]

“ [U+201C LEFT DOUBLE QUOTATION MARK]
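Because the marks are not mirrored, code that adds quotation marks to Urdu text has to choose the characters explicitly. A minimal sketch, using the start and end marks from the table above (the function name `quote_urdu` is hypothetical):

```python
QUOTE_START = "\u201D"  # RIGHT DOUBLE QUOTATION MARK opens the quotation
QUOTE_END = "\u201C"    # LEFT DOUBLE QUOTATION MARK closes it


def quote_urdu(text: str) -> str:
    """Wrap Urdu text in quotation marks, in logical (typing) order."""
    return QUOTE_START + text + QUOTE_END


quoted = quote_urdu("\u0627\u0631\u062F\u0648")
print(f"U+{ord(quoted[0]):04X}", f"U+{ord(quoted[-1]):04X}")
```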

Emphasis

tbd

Abbreviation, ellipsis & repetition

tbd

Inline notes & annotations

tbd

Other punctuation

tbd

Other inline text decoration

tbd

Line & paragraph layout

Line breaking & hyphenation

Basic line-break opportunities occur between the space-separated words.

They are not broken at the small gaps that appear where a character doesn't join on the left.

Breaking between Latin words

When a line break occurs in the middle of an embedded left-to-right sequence, the items in that sequence are rearranged visually so that the reading direction remains top-to-bottom. latin_line_breaks shows how two Latin words are apparently reordered in the flow of text to accommodate this rule.

391 میں تھیوفلس اعظم (Patriarch Theophilus)کے حکم سے عیسائیوں نے اس کی کتابوں کو نیست و نابود کر دیا۔ کیونکہ ان کے خیال میں اس سے کفرپھیلنے کا اندیشہ تھا۔

Text with line break in Latin text.

Urdu with embedded Latin text. The lower of these two images shows the result of decreasing the line width, so that text wraps between a sequence of Latin words.

In digital text the rearrangement is automatic. Only the positions of the font glyphs are changed: nothing affects the order of the characters in memory.


Text alignment & justification

Calligraphic justification

It is difficult to find information in English about justification of Urdu text in a nastaliq font. The following information is from Asad et al. [ma], and is based on studies of calligraphy. It's not clear that the results described can currently be achieved in web pages.

Interword spacing is only used as a last resort for Urdu justification. It is also noteworthy that, unlike its use in Arabic language text, ـ [U+0640 ARABIC TATWEEL] is not used, and moreover is not even functional in some fonts. For example, it is completely ignored by Noto Nastaliq Urdu, and while it actually produces a glyph for Awami Nastaliq, it doesn't join with adjacent characters.

According to Asad et al. there are 2 main ways to deal with justification: by stretching certain letter shapes (to increase line width), or by positioning some letters above the word they appear in (to decrease line width). Some of the examples they use, such as fig_justification, include both.

An example of a justified Urdu line from Asad et al.

The rules about which letters can be stretched or repositioned, and when, and how, are somewhat complex. For some additional detail, see Asad et al., page 594ff (page 4 in the PDF). Some letters are never stretched, and others are only stretched in certain positions within a word. Given those constraints, it is then necessary to apply rules about which of the set of available letters to stretch within a word and across a line in order to achieve the desired line length.

Other rules or judgement calls are also involved.

  1. Variations in stroke thickness between adjacent letters contribute to decisions about how to stretch letters.
  2. In some contexts, such as poetry, all lines may be stretched at the same location in the line.
  3. Given that there is usually only one stretched letter per word, certain letters are prioritised over others for stretching, based on how commonly they are stretched.

The last line in a paragraph of ordinary text is not normally stretched; however, the final line of a poem is likely to be stretched.

Newspaper justification

fig_justification_newspaper shows part of a column from a newspaper. The majority of columns in the newspaper are fully justified, but don't employ the stretching and positioning techniques described just above. Instead, they appear to use inter-word spacing. Note that very little spacing tends to be needed, given that Urdu words are usually short and the diagonal baseline and glyph shaping tend to further reduce the amount of horizontal space taken by a word. This means that it is relatively easy to fit approximately the right number of words on a line before applying the additional spacing needed.

An example of a fully justified column of Urdu newspaper content.

Text spacing

tbd

This section looks at ways in which spacing is applied between characters over and above that which is introduced during justification.

Complex, two-dimensional arrangements of letters in words are common in newspaper titles. See fig_newspaper_titles. They are normally created by hand.

Complex arrangements of characters in a newspaper heading.

Baselines, line height, etc.

The alphabetic baseline is a strong feature of Arabic script on the whole, since characters tend to join there. The nastaliq style of the script, on the other hand, uses arrangements of joined glyphs that cascade downwards from right to left, and resemble a strongly sloping baseline. See the examples in fig_baseline and fig_gpos.

fig_overlap shows overlapping baselines in the Nafees Nastaliq font. (In the Awami and Noto fonts, there is no overlap for that text.)

This cascading effect can lead to a need for quite large line height settings, compared to many other orthographies.

ڈاؤن لوڈکیجیے
An example of a cascade that requires a large line height.

fig_baselines shows Urdu text glyphs from the Noto Serif and Noto Nastaliq Urdu fonts compared to the basic metrics of Latin text. The figure clearly shows the potential differences in line height requirements for the two scripts.

qhx کلم ڈاؤن لوڈکیجیے
Font metrics of Latin text in the Noto Serif font compared with text in the Noto Nastaliq Urdu font. Both fonts have the same font size.

Counters, lists, etc.

tbd

Styling initials

tbd

Page & book layout

This section is for any features that are specific to Urdu and that relate to the following topics: general page layout & progression; grids & tables; notes, footnotes, etc; forms & user interaction; page numbering, running headers, etc.

Notes, footnotes, etc

؂ [U+0602 ARABIC FOOTNOTE MARKER] is used to indicate that a number is a reference to a footnote. The number sits above the symbol, although this is not a combining character. The marker should come before the number in logical order, eg. ؂۵.

(Note that, although it looks very similar, this is not the same character as ؎ [U+060E ARABIC POETIC VERSE SIGN].)
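The logical order for a footnote reference can be sketched in the same way as for the other subtending marks. The helper name `footnote_reference` is hypothetical, and the sketch assumes the number is written with Extended Arabic-Indic digits.

```python
FOOTNOTE_MARKER = "\u0602"  # ARABIC FOOTNOTE MARKER, not U+060E


def footnote_reference(number: int) -> str:
    """Return the footnote marker followed by the number's digits."""
    digits = "".join(chr(0x06F0 + int(d)) for d in str(number))
    return FOOTNOTE_MARKER + digits


print([f"U+{ord(c):04X}" for c in footnote_reference(5)])
```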

Online resources

  1. Universal Declaration of Human Rights - Urdu
  2. Jang News (images of printed text & links to web pages)

Acknowledgements

Thanks to Usmaan (عثمان ‬/‬ ਉਸਮਾਨ) for information about YEH+HAMZA.

References