Updated 9 February, 2019 • tags arabic, urdu, scriptnotes
This page provides basic information about the Urdu writing system, a variant of the Arabic script, which builds on the more general information in the Arabic script summary. For character-specific details follow the links to the Arabic character notes.
For similar information related to other scripts, see the Script comparison table.
دفعہ ۱۔ تمام انسان آزاد اور حقوق و عزت کے اعتبار سے برابر پیدا ہوئے ہیں۔ انہیں ضمیر اور عقل ودیعت ہوئی ہے۔ اس لئے انہیں ایک دوسرے کے ساتھ بھائی چارے کا سلوک کرنا چاہیئے۔
دفعہ ۲۔ ہر شخص ان تمام آزادیوں اور حقوق کا مستحق ہے جو اس اعلان میں بیان کئے گئے ہیں، اور اس حق پر نسل، رنگ، جنس، زبان، مذہب اور سیاسی تفریق کا یا کسی قسم کے عقیدے، قوم، معاشرے، دولت یا خاندانی حیثیت وغیرہ کا کوئی اثر نہ پڑے گا۔ اس کے علاوہ جس علاقے یا ملک سے جو شخص تعلق رکھتا ہے اس کی سیاسی کیفیت دائرہ اختیار یا بین الاقوامی حیثیت کی بنا پر اس سے کوئی امتیازی سلوک نہیں کیا جائے گا۔ چاہے وہ ملک یا علاقہ آزاد ہو یا تولیتی ہو یا غیر مختار ہو یا سیاسی اقتدار کے لحاظ سے کسی دوسری بندش کا پابند ہو۔
The Urdu alphabet is the right-to-left alphabet used for the Urdu language. It is a modification of the Persian alphabet known as Perso-Arabic, which is itself a derivative of the Arabic alphabet. The Urdu alphabet has up to 58 letters. With 39 basic letters and no distinct letter cases, the Urdu alphabet is typically written in the calligraphic Nastaʿlīq script, whereas Arabic is more commonly in the Naskh style. ...
The Nastaʿlīq calligraphic writing style began as a Persian mixture of scripts Naskh and Ta'liq. After the Mughal conquest, Nasta'liq became the preferred writing style for Urdu. It is the dominant style in Pakistan, and many Urdu writers elsewhere in the world use it. Nastaʿlīq is more cursive and flowing than its Naskh counterpart.
Urdu uses the Arabic script with extensions to cover its much wider repertoire of sounds. A number of the extensions are based on those developed for Persian (Farsi). See the table to the right for a brief overview of features for the Arabic script, taken from the Script Comparison Table.
The script type is abjad, ie. the script is largely consonantal and short vowel sounds are typically not shown. Some of the consonant characters double as long vowels (eg. ی and و). The vowels are not usually clearly defined, but when necessary, vowel information can be represented by combining marks appearing above or below the base consonant. The absence of a vowel and doubling of consonants can be indicated in the same way.
The alphabet includes aspirated letters that have to be composed with two Unicode characters and a je letter that uses different Unicode characters depending on the context.
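As a minimal sketch of the first point, the Python snippet below lists the code points behind one aspirated letter. It assumes that an aspirate is encoded as a base consonant followed by U+06BE ARABIC LETTER HEH DOACHASHMEE; the helper names `describe` and `is_aspirate` are hypothetical and only serve this illustration.

```python
import unicodedata

# Assumption: aspiration is written with U+06BE (do chashmi he) after the base consonant.
DO_CHASHMI_HE = "\u06BE"

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

def is_aspirate(pair):
    """True if pair is a two-character sequence ending in do chashmi he."""
    return len(pair) == 2 and pair[1] == DO_CHASHMI_HE

bh = "\u0628\u06BE"  # بھ, one visual letter stored as two characters
for code, name in describe(bh):
    print(code, name)
print(is_aspirate(bh))
```

Code that counts letters, or matches them in a search, may need to treat such a pair as a single unit.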
Although it is not always easy to guess the vowel sounds in a word, the consonants are largely reliable phonetically. There is mostly a one-to-one correspondence between letters and sounds.
Follow this link for information about characters used for the Urdu language. The numbers in parentheses are for non-ASCII characters.
For character-specific details see the Arabic character notes.
The Urdu alphabet includes the following characters over and above those listed for Arabic.
These characters from the Arabic alphabet are not used in Urdu:
There are a good number of other characters in use for Urdu text that are not used for Arabic. Most of them are described in this page. More detailed descriptions may be available by following the links from the text in red.
Arabic script is written horizontally and right-to-left in the main, but as with most RTL scripts, numbers and embedded LTR script text are written left-to-right (producing 'bidirectional' text).
Unicode provides a set of 10 formatting characters that can be used to control the direction of text when displayed. These are listed in the article How to use Unicode controls for bidi text.
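The directional behaviour described above comes from the bidirectional class assigned to each character. The following Python sketch inspects those classes with the standard `unicodedata` module and wraps a run of embedded LTR text in isolate controls. The helper names `bidi_classes` and `isolate_ltr` are hypothetical, and whether isolates are the right control for a given piece of text is a judgement call outside this example.

```python
import unicodedata

LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def bidi_classes(text):
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

def isolate_ltr(text):
    """Wrap text so that it is treated as an isolated left-to-right run."""
    return f"{LRI}{text}{PDI}"

print(bidi_classes("\u0627\u0628"))      # alef, beh: Arabic letters
print(bidi_classes("ab"))                # Latin letters
print(bidi_classes(isolate_ltr("W3C")))  # controls around mixed letters and a digit
```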
There are 10 vowel sounds, though there are also allophonic variants. They are usually grouped into pairs of 'short' and 'long' sounds - although the difference is qualitative, rather than just length. The basic phonemes are as follows:
The phoneme ə is sometimes written a in phonemic transcriptions in this material. (This is the letter usually used in other sources too.)
Urdu follows Arabic in using diacritics to express short vowel sounds, but also rarely uses them in normal text. The basic set of diacritics used for vowels is as follows.
Given the extra phonetic sounds in Urdu, compared to Arabic, the way characters are used to express vowels is much more complicated. The following table shows the standard ways of indicating vowel sounds, and shows what diacritics would be used if they were shown. Note however, that context can change the value of a vowel diacritic (such as a following 'ain or he) – these are detailed below the table. Three short vowels are not typically found in final position. The examples only show diacritics for the sound currently being discussed.
|Sound||Written with||Final||Medial||Initial|
|ə||zabar|| ||بَب bəb||اَب əb|
|ɪ||zer|| ||دِن dɪn||اِن ɪn|
|ʊ||peʃ|| ||سُست sʊst||اُس ʊs|
|e||je||بجے baʤe||بیٹا beʈɑː||ایک ek|
|iː||zer+je / je||گاری gɑːriː||تِین tiːn||اِینٹ iːnʈ|
|ɛ||zabar+je||ہَے hɛ||کَیسا kɛsɑː||اَیسا ɛsɑː|
|o||vɑːuː||کو ko||ٹوپی ʈopiː||اوس os|
|ɔ||zabar+vɑːuː||نَو nɔ||شَوق ʃɔq||اَور ɔr|
The letter ع [U+0639 ARABIC LETTER AIN] is used in words of Arabic origin. In these words it is typically not pronounced but can support vowels. In this way, at the beginning of a word it can fulfill the same function as the alif, eg. عَرب ʿarb arab Arab. The Urdu word اَرَب ɑarab arab necessity, though pronounced the same, becomes a completely different word by its spelling. Note, in particular, that the equivalent of آ [U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE] ɑː is عا, as in عادت ʿɑdt ɑːdat habit.
A following ع may also affect a short vowel diacritic to produce a long vowel sound as follows:
ɑː from zabar followed by 'ain, eg. بَعد baʿd bɑːd after
e from zer followed by 'ain, eg. شِعر ʃiʿr ʃeːr verse
o from peʃ followed by 'ain, eg. شُعلہ ʃuʿlḫ ʃolɑː flame
The letters ہ [U+06C1 ARABIC LETTER HEH GOAL] and ح [U+062D ARABIC LETTER HAH] can also modify preceding short vowels as follows:
ɛ from zabar followed by he, eg. اَحمد ɑahmd ɛhmad Ahmed, and رَہنا raḫnɑ rɛhnɑː to remain
ɛ from zer followed by he, eg. مِہربانی miḫrbɑny mɛhrbɑːniː kindness, and واضِح vɑẑih vɑːzɛh clear
o from peʃ followed by he, eg. شُہرت ʃuḫrt ʃohrat fame, and توجُّہ tvʤuᵚḫ tavajːoh attention
The so-called 'silent' he that appears at the end of many words of Arabic or Persian derivation is pronounced ɑː, eg. مکَہ mkaḫ makːɑː Mecca.
The diacritic ◌ٰ [U+0670 ARABIC LETTER SUPERSCRIPT ALEF] is used in a few Arabic words over the final form of ی [U+06CC ARABIC LETTER FARSI YEH] to produce the sound ɑː, eg. اعلیٰ ɑʿlyɑ̇ alɑː paramount, highest; دعویٰ dʿvyɑ̇ davɑː law suit, claim.
The similar diacritic ◌ٖ [U+0656 ARABIC SUBSCRIPT ALEF] is used to indicate that a vowel is iː or i rather than e, eg. نُحْیٖ nuh͓yᵢ. This diacritic is not usually needed, and serves only to emphasise that the vowel is long.
◌ٗ [U+0657 ARABIC INVERTED DAMMA] is used to indicate that the vowel is uː or ʊ rather than ɔ, eg. حبل حلالہٗ hbl hlɑlḫᵘ. It is not usually needed, and serves only to emphasise that the vowel is long.
The doubled vowel diacritics, ◌ً [U+064B ARABIC FATHATAN], ◌ٌ [U+064C ARABIC DAMMATAN], and ◌ٍ [U+064D ARABIC KASRATAN] are used at the ends of certain Arabic adverbs. The doubled zabar (fathatan) is the most common of the three marks of this type. Although the mark appears over an alif the vowel sound is short. Examples, یقیناً yqynɑaⁿ yakiːnan certainly; مثلاً mṡlɑaⁿ masalan for example.
Vowels may be nasalised, like at the end of the French word élan. This is indicated in Urdu by a glyph called nun ghunna that looks like the letter nun except that in word final position it has no dot, eg. ماں mɑñ mãː mother, ٹاںگ ʈɑñg tãːg leg, and کروں krvñ karũː I may do. In Unicode there are different characters for each of these uses.
The diacritic ◌٘ [U+0658 ARABIC MARK NOON GHUNNA] is used when people want to make it clear that a noon character represents nasalisation rather than the sound n, eg. ٹاںگ ʈɑñg tãːg leg. It is not used in a standard way, just when the user prefers, and is fairly uncommon.
A hamzā plays more than one role in Urdu. One such role is to indicate the boundaries between vowel sounds when there is no intervening consonant. Depending on the vowels concerned, it is used in a number of different ways. It can also have two different shapes, one like the initial form of 'ain and the other more like an italic 's'.
In this example we see hamza in its isolated form, انشاءﷲ ɪnʃalːaː God willing.
When the second vowel is an iː or e represented by ی [U+06CC ARABIC LETTER FARSI YEH] or ے [U+06D2 ARABIC LETTER YEH BARREE], the hamzā 'sits on a chair' before the letter representing the second vowel.
The hamza on its chair should be written using ئ [U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE], eg. کئی kɪ͑y kaiː several; تیئیس tyɪ͑ys teiːs twenty-three; کوئی kvɪ͑y koiː someone; گئے gɪ͑ɛ gae they went; گائے gɑɪ͑ɛ gɑːe they sang. Note that ئ [U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE] is ي + ◌ٔ [U+064A ARABIC LETTER YEH + U+0654 ARABIC HAMZA ABOVE] when decomposed. The 'chair' doesn't use ی [U+06CC ARABIC LETTER FARSI YEH].
The short vowel ɪ as a second vowel is also represented by hamzā 'on its chair' alone, eg. کوئلہ kvɪ͑lḫ koɪlɑː coal; لائن lɑɪ͑n lɑːɪn queue.
When the second vowel is an uː or o represented by و [U+0648 ARABIC LETTER WAW], the hamzā typically sits directly on top of the و, eg. آؤ ɑ̄u͑ ɑːo come; جاؤں ʤɑu͑ñ ʤɑːũː I may go. Note that often the hamzā is omitted in this situation. To represent this in Unicode use ؤ [U+0624 ARABIC LETTER WAW WITH HAMZA ABOVE].
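The decompositions of the two hamza carriers can be checked with Unicode normalization. This is a small Python sketch using the standard `unicodedata` module; the helper name `decompose` is hypothetical.

```python
import unicodedata

def decompose(ch):
    """Return the code points of the canonical decomposition (NFD) of ch."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]

# U+0626 YEH WITH HAMZA ABOVE: the 'chair' is U+064A, not U+06CC.
print(decompose("\u0626"))
# U+0624 WAW WITH HAMZA ABOVE: waw plus combining hamza above.
print(decompose("\u0624"))
```

Text that has passed through NFD normalization will therefore contain U+064A even though that letter is not otherwise used for Urdu.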
Many words have the vowel combinations iːɑ̃ iːe iːo, where hamzā is not typically used, eg. لڑکیاں lɽkyɑñ laɽkiːɑ̃ː girls; چلیے člyɛ ʧaliːe come on; لڑکیوں کا lɽkyvñ kɑ laɽkiːõ kɑː of the girls.
Hamzā is also used to represent izāfat when the preceding word ends in either choṭī he or ye (see below).
Izāfat ɪzɑːfat is the name given to the short vowel ɛ used to describe a relationship between two words. It may be translated as 'of', eg. as in the Lion of Punjab.
This sound occurs at the end of a word and is mostly represented using zer. Sometimes, however, the combining mark is not shown, even though pronounced. Examples: شیرِ پنجاب ʃyri pnʤɑb ʃer ɛ panʤɑːb Lion of the Punjab; طالبِ علم t̂ɑlbi ʿlm tɑːlɪb ɛ ɪlm seeker of knowledge (student).
When the preceding word ends in a silent choṭī he ہ [U+06C1 ARABIC LETTER HEH GOAL], izafat is represented by a combining hamza, eg. قطرۂ آب qt̂re͑ ɑ̄b qatra ɛ ɑːb drop of water. Note that if the choṭī he is pronounced, then zer is used, eg. آہِ گرم ɑ̄ḫi grm āh-e garm hot sigh.
When the preceding word ends in ye ی [U+06CC ARABIC LETTER FARSI YEH], sources differ on the approach to take. Some sources say that you should just add zer, as described before. Others say that izafat is represented by a combining hamza, eg. ولیٔ کامل vly‘ kɑml valiː ɛ kɑːmɪl perfect saint. Should you use ئ [U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE] or ی [U+06CC ARABIC LETTER FARSI YEH] + combining hamza? Most of the sources proposing this approach seem to use the former. With Google fonts the result looks the same either way, but with Nafees Nastaleeq only the latter works. The latter also seems more logical with respect to searching, semantics, etc.
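One practical consequence for searching is that the two spellings of ye with hamza are not canonically equivalent, whereas choṭī he plus combining hamza is equivalent to a precomposed character. The Python sketch below checks this with NFC normalization; the function name `canonically_equal` and the constant names are hypothetical.

```python
import unicodedata

def canonically_equal(a, b):
    """True if a and b are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

YEH_WITH_HAMZA = "\u0626"             # ئ, decomposes to U+064A + U+0654
FARSI_YEH_PLUS_HAMZA = "\u06CC\u0654"  # ی followed by combining hamza above
HEH_GOAL_WITH_HAMZA = "\u06C2"         # precomposed choṭī he with hamza
HEH_GOAL_PLUS_HAMZA = "\u06C1\u0654"   # ہ followed by combining hamza above

print(canonically_equal(YEH_WITH_HAMZA, FARSI_YEH_PLUS_HAMZA))
print(canonically_equal(HEH_GOAL_WITH_HAMZA, HEH_GOAL_PLUS_HAMZA))
```

A search that relies on normalization alone would therefore match the two choṭī he spellings but not the two ye spellings.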
When the preceding word ends in a long a or u vowel, izafat is represented using hamza 'on its chair', ie. ئ [U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE], plus ے [U+06D2 ARABIC LETTER YEH BARREE], eg. صدائے بلند ŝdɑɪ͑ɛ blnd sadɑː ɛ buland a high voice; روئے زمین rvɪ͑ɛ zmyn ruː ɛ zamiːn the surface of the ground. Sometimes, however, the hamza is not shown.[2 p99]
The alphabet standardised by the National Language Authority in Pakistan counts 59 letters, of which 18 are digraphs representing aspirated consonants.
The basic letters are:
The aspirated consonants are:
Other characters found in Urdu text include the following. These are introduced further down this page, but you can, as usual, find out more by clicking on them.
ي [U+064A ARABIC LETTER YEH] is only found in decomposed forms of ئ [U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE].
The absence of a vowel sound can be indicated with the diacritic ْ [U+0652 ARABIC SUKUN], called sukūn or jazm, although this diacritic is not normally shown in text, eg. سَخْت sax͓t saxt hard.
It has various possible forms, including a small round circle, something that looks like peʃ, and something like a circumflex.
This diacritic is never written above the final character in a word, because as a rule a short vowel is not pronounced in this position.
Consonant sounds can be lengthened. In vowelled text, which is very rare, this is shown using the diacritic ّ [U+0651 ARABIC SHADDA], called taʃdiːd, eg. ستّر stᵚr sattar seventy. More often than not, this is not written.
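Because the vowel signs, sukūn and taʃdiːd are usually omitted, software that compares vowelled and unvowelled text may want to remove them first. The following Python sketch does so for a hand-picked set of marks. The set `OPTIONAL_MARKS` and the function `strip_optional_marks` are assumptions made for this illustration, not a standard list; the combining hamza U+0654 is deliberately kept because it is part of the spelling.

```python
# Assumed set of optional marks: U+064B..U+0652 (tanwin, short vowels,
# shadda, sukun), plus subscript alef, inverted damma, noon ghunna mark,
# and superscript alef.
OPTIONAL_MARKS = set(range(0x064B, 0x0653)) | {0x0656, 0x0657, 0x0658, 0x0670}

def strip_optional_marks(text):
    """Remove the optional diacritics listed in OPTIONAL_MARKS from text."""
    return "".join(ch for ch in text if ord(ch) not in OPTIONAL_MARKS)

vowelled = "\u0633\u064E\u062E\u0652\u062A"  # سَخْت with zabar and sukun
print(strip_optional_marks(vowelled) == "\u0633\u062E\u062A")  # plain سخت
```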
The pronunciation of ال (alif followed by lām) varies when it represents the Arabic definite article. This affects many words in Urdu that have come from Arabic, in particular names and adverbial expressions.
The lām is not pronounced if it precedes one of the following characters:
Instead, the following sound is doubled. A tašdīd may sometimes be used to indicate this. Example: السلام علیکم ɑlslɑm ʿlykm asːalɑːm alaikum greetings.
Often the alif is not pronounced after a short preceding word that ends in a vowel. If the preceding vowel was long, it is shortened in this process. Examples: بالکل bɑlkl bɪlkul absolutely; فی الحال fy ɑlhɑl filhɑːl at present.
Often the vowel is pronounced ʊ, eg. دارالحکومت dɑrɑlhkvmt dɑːrʊlhʊkuːmat capital.
Urdu uses the Extended Arabic-Indic digits in the Arabic block.
This is a separate set of characters from those used for Arabic, to accommodate different shaping and directional behaviour. Shapes differ from those of Arabic for the digits 4, 5, and 7.
Persian also uses the same characters for digits, but there are some systematic shape differences between Persian and Urdu for the digits 4, 6, and 7.
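The difference in directional behaviour between the two digit sets shows up in their Unicode properties. This Python sketch prints the name, numeric value and bidirectional class of the digit four from each set, using the standard `unicodedata` module; the helper name `digit_info` is hypothetical.

```python
import unicodedata

def digit_info(ch):
    """Return (name, numeric value, bidi class) for a digit character."""
    return (unicodedata.name(ch), unicodedata.digit(ch), unicodedata.bidirectional(ch))

print(digit_info("\u06F4"))  # digit used for Urdu and Persian
print(digit_info("\u0664"))  # digit used for Arabic

# Python's int() accepts either set, so a year written in Urdu digits parses directly.
print(int("\u06F2\u06F0\u06F0\u06F4"))
```

The shape differences between Urdu and Persian are not visible at this level: both languages use the same code points, and the shapes come from the font.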
Urdu has special characters for the thousands and decimal separators: ٬ [U+066C ARABIC THOUSANDS SEPARATOR] and ٫ [U+066B ARABIC DECIMAL SEPARATOR]. It also uses ٪ [U+066A ARABIC PERCENT SIGN]. [Need to clarify whether the percent sign appears to the right or left of the number. When typed after, it appears to the right.]
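As an illustration of how these separators combine with the digits, the Python sketch below formats a number using the Extended Arabic-Indic digits, U+066C and U+066B. The function name `format_urdu_number` is hypothetical, and grouping in threes is an assumption carried over from Python's default formatting, not a statement about Urdu convention.

```python
# Extended Arabic-Indic digits U+06F0..U+06F9, in order 0-9.
URDU_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))

# Map ASCII digits, the comma and the full stop to their Urdu counterparts.
TABLE = str.maketrans("0123456789,.", URDU_DIGITS + "\u066C\u066B")

def format_urdu_number(value, decimals=0):
    """Format value with groups of three digits, then swap in Urdu characters."""
    return f"{value:,.{decimals}f}".translate(TABLE)

print(format_urdu_number(1234.5, 1))
print(format_urdu_number(1234567))
```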
That said, Urdu Wikipedia currently uses European digits and ، [U+060C ARABIC COMMA] and . [U+002E FULL STOP], respectively, for thousands and decimal separators. (English) Wikipedia says that
In Pakistan, Western Arabic numerals are more extensively used as a considerable majority of the population is anglophone. Eastern numerals still continue to see use in Urdu publications and newspapers, as well as sign boards.
Urdu also has a sign [U+0600 ARABIC NUMBER SIGN] which can be used to indicate a number, eg. ۱۲۳. [The Noto Nastaliq Urdu webfont doesn't seem to extend the sign below the number, whereas the same font on the system does. Both that font and Nafees Nastaleeq require this sign to be added after the number, and appear to treat it like a fixed width combining mark, rather than a subtending mark that grows with the number.]
؍ [U+060D ARABIC DATE SEPARATOR] is used in Urdu. [Find out how and how often it is used.]
Dates are indicated by placing the long sweep of [U+0601 ARABIC SIGN SANAH] below the year digits. For the Gregorian calendar this is followed with the word عیسوی ʿysvy iːsviː Christian Era, usually abbreviated as a hamza ء. Dates using the Muslim calendar are followed by the word ہجری ḫʤry hɪʤriː, abbreviated with the symbol ھ.
The sanh sign is typed before the digits (in a rtl context): eg. ۲۰۰۴ء (2004). It is not a combining character, even though it displays beneath the digits. The length of the symbol may vary according to the number of digits. It is terminated by a non-digit character.
[U+0604 ARABIC SIGN SAMVAT] is another subtending mark, intended to indicate a year in the Śaka calendar.
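The ordering described above for the year sign can be sketched in Python as follows. The snippet builds a Gregorian year in logical order (sign, digits, then the hamza abbreviation) and confirms that the sign is a format character, not a combining mark. The function name `gregorian_year` is hypothetical.

```python
import unicodedata

SANAH = "\u0601"  # ARABIC SIGN SANAH, typed before the digits
HAMZA = "\u0621"  # abbreviation for the Christian Era
URDU_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))

def gregorian_year(year):
    """Return the year in Urdu digits, preceded by sanah and followed by hamza."""
    digits = str(year).translate(str.maketrans("0123456789", URDU_DIGITS))
    return SANAH + digits + HAMZA

print([f"U+{ord(ch):04X}" for ch in gregorian_year(2004)])
# General category Cf (format) and combining class 0: not a combining mark.
print(unicodedata.category(SANAH), unicodedata.combining(SANAH))
```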
The following combining characters are used with names as honorifics, eg. قاضی نور محمّدؒ qɑẑy nvr mhmᵚdؒ kaziː nur mamed rahmatulla alayhe Qazi Nur Muhammad, may God have mercy upon him! They are combining characters that appear over the name at a point chosen by the author.
﷽ [U+FDFD ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM] is used by Muslims in various contexts including the constitutions of countries where Islam has a significant presence. The shape varies significantly from font to font and usage to usage.
Arabic script joins letters together. This results in four different shapes for most letters (including an isolated shape).
A few Arabic script letters only join on the right-hand side.
As in Arabic, lam followed by alef ligate, eg. اسلام ɑslɑm islam Islam.
The following invisible Unicode formatting characters can be used to control cursive joining.
Since the script is cursive (ie. letters are typically joined) the letter forms can vary considerably according to position.
Urdu is typically written in a nasta'liq style, ie. the connected letters in a word tend to follow a sloping baseline. This is achieved in Unicode by applying the correct font – the underlying characters used are not different for nasta'liq vs. other styles.
Words are separated by spaces.
Urdu uses a mixture of Western and Arabic punctuation.
For separators at the sentence level and below, the following are used in Urdu text, where the first column gives the approximate equivalent in Latin script.
|comma||، [U+060C ARABIC COMMA]|
|semi-colon||؛ [U+061B ARABIC SEMICOLON]|
|colon||: [U+003A COLON]|
|sentence||۔ [U+06D4 ARABIC FULL STOP]|
|question mark||؟ [U+061F ARABIC QUESTION MARK]|
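The table above can be expressed as a simple character mapping. The Python sketch below replaces Latin punctuation with the Urdu equivalents listed, leaving the colon unchanged. The name `LATIN_TO_URDU_PUNCT` and the function `to_urdu_punctuation` are hypothetical, and the mapping is deliberately naive: it would also convert a full stop used as a decimal point or inside a URL.

```python
# Latin punctuation mapped to the Urdu characters from the table above.
LATIN_TO_URDU_PUNCT = str.maketrans({
    ",": "\u060C",  # ARABIC COMMA
    ";": "\u061B",  # ARABIC SEMICOLON
    "?": "\u061F",  # ARABIC QUESTION MARK
    ".": "\u06D4",  # ARABIC FULL STOP
})

def to_urdu_punctuation(text):
    """Swap Latin sentence-level punctuation for the Urdu equivalents."""
    return text.translate(LATIN_TO_URDU_PUNCT)

print(to_urdu_punctuation("a, b; c? d: e."))
```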
In poetry, ؎ [U+060E ARABIC POETIC VERSE SIGN] is used to mark the beginning of poetic verse, and ؏ [U+060F ARABIC SIGN MISRA] is used to indicate a single line (misra) of a couplet (shayr) from an Urdu poem, when quoted in text. It is used at the beginning of the line, and is followed by the line of verse. For more information and examples, follow the links on the character names.
The alphabetic baseline is a strong feature of Arabic script on the whole, since characters tend to join there. The nastaliq style of the script, on the other hand, uses arrangements of joined glyphs that cascade downwards from right to left, and resemble a strongly sloping baseline. See the example in Fig. fig_baseline.
[U+0602 ARABIC FOOTNOTE MARKER] is used to indicate that a number is a reference to a footnote. The number sits above the symbol, although this is not a combining character. The marker should come before the number in logical order, eg. ؎۵.
(Note that, although it looks very similar, this is not the same character as ؎ [U+060E ARABIC POETIC VERSE SIGN].)