Sindhi

Arabic script orthography notes

Updated 13 July, 2024

This page brings together basic information about the Arabic script and its use for the Sindhi language. It aims to provide a brief, descriptive summary of the modern, printed orthography and typographic features, and to advise how to write Sindhi using Unicode.

Referencing this document

Richard Ishida, Sindhi (Arabic) Orthography Notes, 13-Jul-2024, https://r12a.github.io/scripts/arab/sd

Sample


آرٽيڪل 1. سمورا ينسان آزاد ۽ عزت ۽ حقن جي حوالي کان برابر پيدا ٿيا آهن. انهن کي عقل ۽ ضمير حاصل ٿيو آهي⹁ ان ڪري انهن کي هڪ ٻئي سان ڀائيچاري وارو سلو ڪ اختيار ڪرڻ گهرجي.

آرٽيڪل 2. هر فرد انهن سمورين انساني آزادين ۽ حقن جو حقدار آهي⹁ جيڪي هن اعلان ۾ بيان ڪيل آهن ۽ ان حق تي رنگ⹁ نسل⹁ جنس⹁ زبان⹁ مذهب ۽ سياسي متڀپد جو يا ڪنهن نه قسم جي عقيدي⹁ قوم⹁ سماج⹁ دولت يا خانداني حيثيت جو ڪوئي فرق نه پوندو⹁ ان کان سو اءِ جنهن ماڪ يا علائقي سان اهو فرد تعلق رکي ٿو⹁ ان جي سياسي ڪيفيت يا اختيار جو دائرو بين القوامي حيثيت جي بنياد تي ان سان ڪوئي فرق وارو سلوڪ اختيار نه ڪيو ويندو⹁ ڀلي اهو ملڪ آزاد هجي يا ڪاڻيارو يا غير مختيار هجي يا سياسي اختيار جي حوالي سان ڪنھن ٻين پابنديءَ جو شڪار هجي.

Source: Universal Declaration of Human Rights - Sindhi, articles 1 & 2

Usage & history

Origins of the Arabic script, 6thC – today.

Phoenician

└ Aramaic

└ Nabataean

└ Arabic

Sindhi is an Indo-Aryan language spoken by approximately 30 million people in the Pakistani province of Sindh, where it holds official status. Additionally, around 1.7 million people in India speak Sindhi, although it lacks state-level official status there. The primary writing system for Sindhi is the Perso-Arabic script, which is predominantly used in Pakistan. In India, both the Perso-Arabic script and Devanagari are employed for writing Sindhi.

سنڌي

During the Arab conquest of Sindh in the 8th century, the Arabic script was introduced to the region. Over time, it became the primary writing system for Sindhi. The Sindhi-Arabic script was standardized in 1853 by British colonial authorities and has been in general use since then.

More information: Wikipedia

Basic features

The Arabic script is an abjad, ie. short vowels are not normally written. See the table to the right for a brief overview of features for the Sindhi language.

The Sindhi Arabic orthography is derived from the Arabic/Persian abjads, where in normal use the script represents long vowel sounds using matres lectionis. However, the script has been adapted in this orthography in order to cope with the many more vowel sounds in Sindhi; there are many unique letters, and the use of letters for vowels is a distinguisher of vowel quality, rather than length.

Sindhi text runs right to left in horizontal lines, but numbers and embedded Latin text are read left-to-right. There is no case distinction. Words are separated by spaces.


Sindhi represents consonant sounds using 49 basic letters and 7 more digraphs for aspirated sounds. A number of consonant sounds can be written with alternative consonant letters since the original spelling is retained for many words. But there are also 15-20 letters that are only used for Sindhi.

Dedicated letters are available for some aspirated consonant sounds, but others are represented using the aforementioned digraphs.

Sindhi uses 3 code points for sounds related to h, but they are used very inconsistently in the wild, creating difficulties for searching and other operations. Unicode experts recommend specific roles for each of the letters, but in some joining contexts there is no difference in appearance, which makes mistakes possible.


The Sindhi abjad indicates the location of 7 vowel sounds using 4 letters. Three more sounds are not normally written.

When needed, all vowels can be unambiguously represented using the letters and 3 combining marks. Post-consonant vowel sounds are written using the same code points, regardless of the position within a word.

On the other hand, standalone vowels are preceded by or attached to a letter that varies according to whether it occurs at the beginning of a word or word-medially, and in some cases word-finally. These carrier letters are 0627, 0626, and 0621, respectively.

Nasalisation is indicated using ن. Vowel absence is not normally marked. Even in vowelled text, the sukun is infrequently used.

Sindhi uses native digits and a mixture of ASCII and Arabic code points for punctuation, but with reversed comma and semicolon marks in place of the Arabic ones.

Character index

Letters


Basic consonants

ء␣آ␣ئ␣ا␣ب␣ت␣ث␣ج␣ح␣خ␣د␣ذ␣ر␣ز␣س␣ش␣ص␣ض␣ط␣ظ␣ع␣غ␣ف␣ق␣ل␣م␣ن␣ه␣و␣ي␣ٺ␣ٻ␣ٽ␣پ␣ٿ␣ڀ␣ڃ␣ڄ␣چ␣ڇ␣ڊ␣ڌ␣ڍ␣ڏ␣ڙ␣ڦ␣ک␣ڪ␣گ␣ڱ␣ڳ␣ڻ␣ھ␣ہ

Combining marks


Vowels

َ␣ُ␣ِ␣ْ

Other

ٓ␣ٔ

Numbers

۰␣۱␣۲␣۳␣۴␣۵␣۶␣۷␣۸␣۹

Punctuation

؟␣⁏␣⹁

ASCII

!␣(␣)␣,␣-␣.␣:␣;␣?

Symbols

۽␣۾

Other

‌␣‍␣⁧␣‫␣⁦␣‪␣⁨␣⁩␣‬␣‏␣‎␣؜␣͏

To be investigated

,␣-␣«␣»␣ـ␣‑␣–␣—␣‘␣’␣“␣”␣…␣‹␣›␣﴾␣﴿

Phonology

The following represents the repertoire of the Sindhi language.


Phones in a lighter colour are non-native or allophones. Source Wikipedia.

Vowel sounds

Plain vowels

i u ɪ ʊ e o ə ə ɔ æ ɑ ɑ

Consonant sounds

labial labio-dental alveolar post-alveolar retroflex palatal velar uvular glottal
stop p b   t d     t͡ɕ d͡ʑ k ɡ q ʔ
      ʈʰ ɖʰ t͡ɕʰ d͡ʑʰ ɡʰ    
implosive ɓ   ɗ     ʄ ɠ    
fricative   f s z   ʂ   x ɣ   h ɦ
nasal m   n   ɳ ɲ ŋ  
      ɳʰ      
approximant, trill, flap   ʋ r l   ɽ j    
        ɽʰ      
  

Tone

Sindhi is not a tonal language.

Structure

tbd

Vowels

Vowel summary table

The following table summarises the main vowel to character assignments.

Vowel diacritics are shown in this table. In normal text these diacritics do not appear. Where I have not yet seen an example, a question mark appears. From right to left, the columns indicate word-initial standalone forms, word-medial standalone forms, and post-consonant forms.

      word-final word-medial word-initial
Post-consonant
◌ِي␣◌ُو
Standalone
ئِي␣ئو
اي␣اُو
◌ِ␣◌ُ
 
ءِ␣ءُ
ئِ␣ئُ
اِ␣اُ
ي␣و
 
ي␣ئو
اي␣او
◌َ
 
ءَ
ئَ
اَ
ع␣ا
 
ع␣ا
ع␣آ

For additional details see vowel_mappings.

Post-consonant vowels

The Sindhi abjad indicates the location of 7 vowel sounds using 4 letters. Three more sounds are not normally written. When needed, all vowels can be unambiguously represented using the letters and 3 combining marks. Post-consonant vowel sounds are written using the same code points, regardless of the position within a word.

Vowel letters

Normally speaking, after a consonant, Sindhi represents certain vowel sounds using the following consonant letters, and other vowel sounds are not written at all.

ي␣و␣ع␣ا

See examples of these letters below. The words include an unwritten vowel sound (one of ɪ ʊ ə) and a vowel sound indicated by one of the above letters (i u æ o ɑ).

سنڌي

لوڻ

قلعو

برابر

Text without vowel diacritics can sometimes have ambiguous readings. For example, take the following word:

شڪر

When vowel diacritics are applied, the following 3 different pronunciations and meanings are possible for this sequence of letters.

شَڪَرِ

شُڪْرُ

شُڪُرُ
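The ambiguity can be illustrated by removing the diacritics programmatically. The following is a minimal Python sketch, not part of the original notes: the helper name strip_vowel_marks and the choice of marks to strip (the three vowel diacritics plus sukun) are assumptions made for illustration.

```python
# Vowel diacritics listed in this document: fatha, damma, kasra, plus sukun.
VOWEL_MARKS = {"\u064E", "\u064F", "\u0650", "\u0652"}


def strip_vowel_marks(text: str) -> str:
    """Return text with the vowel diacritics removed (hypothetical helper)."""
    return "".join(ch for ch in text if ch not in VOWEL_MARKS)


# The three vowelled spellings shown above, written with escapes so that
# the combining marks are visible in the source.
vowelled = [
    "\u0634\u064E\u06AA\u064E\u0631\u0650",  # sheen+fatha, kaf+fatha, reh+kasra
    "\u0634\u064F\u06AA\u0652\u0631\u064F",  # sheen+damma, kaf+sukun, reh+damma
    "\u0634\u064F\u06AA\u064F\u0631\u064F",  # sheen+damma, kaf+damma, reh+damma
]

# All three collapse to the same unvowelled letter sequence.
unvowelled = {strip_vowel_marks(word) for word in vowelled}
print(unvowelled)
```

Because the three spellings reduce to one string, a search over normal (unvowelled) text cannot tell the readings apart without context.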

Combining marks used for vowels

Where needed, vowel sounds can be clarified using diacritics, as shown in basicV.

As just mentioned, some vowels in post-consonant position (ɪ, ʊ, and ə) are normally not written (or distinguished, one from another) at all, because the diacritics are not used in normal text.

The combining marks are listed just below, but it is the combination of these diacritics with other letters that determines the intended pronunciation. In other cases, such as for e and o, in vowelled text the absence of a diacritic can distinguish these sounds from i and u. See basicV for details.

ِ␣ُ␣َ

The following 2 additional combining marks can be found in decomposed text (only).

ٓ␣ٔ

Vowel length

Vowel length appears to be somewhat inconsistent. There are no special diacritics or conventions for separating long from short vowels.

Nasalisation

ن

Nasalisation is common in Sindhi, and is indicated using the letter ن.

متان

Observation: It's not always clear from transcriptions whether a given ن represents nasalisation or a nasal coda, eg.

شينهن

Standalone vowels

Standalone vowels are written in 3 different ways in Sindhi. See basicV for the various forms. The following characters are used as vowel carriers.

ا␣ئ␣ء

Vowels that are only distinguished by diacritics are not distinguished in normal text, and the sound represented by the standalone vowel carrier is ambiguous.

Word-initial standalone vowels are written using ا as a vowel carrier.

اتر

اسين

اوڪڻ

اونڪارڻ

Word-medial standalone vowels use ئ as the vowel carrier.

ڏئڻ

سائو

ڳئون

آئون

Word-final standalone vowels ɪ, ʊ, and ə use ء as the vowel carrier.

جوء

جونء

ڀاء

Note how the vowel carrier is used after ن when that represents a nasalisation of the vowel.

ڪانئر

پنئن

Vowel absence

Sindhi doesn't normally use any mark to indicate a consonant cluster or consonant without a following vowel. A cluster is simply written as a sequence of consonants.

ْ

Vowelled text may use ْ, but it is rare.

ورسپت

Vowel sounds to characters

This section maps Sindhi vowel sounds to common graphemes in the Arabic orthography.

The left column shows dependent vowels, and the right column independent vowel letters.


Standard text

 
 
Post-consonant
Standalone
i
 

ي ي ي

زمين

اي as a word-initial standalone vowel.

ايران

ئي ئي as a word-medial/final standalone vowel.

سئي

ɪ
 

Not written.

تٿ

ا as a word-initial standalone vowel.

ارادو

ئ as a word-medial standalone vowel.

پائڻ

ء as a word-final standalone vowel.

جوء

ʊ
 

Not written.

زبان

ا as a word-initial standalone vowel.

اتر

ئ as a word-medial standalone vowel.

پائلو

ء as a word-final standalone vowel.

ڀاء

u
 

و

لوڻ

او as a word-initial standalone vowel.

اوناڙڻ

ئو as a word-medial/final standalone vowel.

ڳئون

e
 

ي ي ي

تیز

ئي ئي as a word-medial/final standalone vowel.

ائين

o
 

و

آنو

او as a word-initial standalone vowel.

اوڀر

ئو as a word-medial/final standalone vowel.

سائو

ə
 

Not written.

برف

ا as a word-initial standalone vowel.

افسوس

ئ as a word-medial standalone vowel.

پئڻ

ء as a word-final standalone vowel.

جونء

æ
 

ع

قلعو

ع

علاقو

a ɑ
 

ا

برابر

انار

آ as a word-initial standalone vowel

آسمان

ا as a word-medial standalone vowel

انار

Vowelled text

 
 
Post-consonant
Standalone
i
 

ِي

زمين

اِي as a word-initial standalone vowel.

ايران

ئِي ئِي as a word-medial/final standalone vowel.

سئي

ɪ
 

ِ

تٿ

اِ as a word-initial standalone vowel.

ارادو

ئِ as a word-medial standalone vowel.

پائڻ

ءِ as a word-final standalone vowel.

جوء

ʊ
 

ُ

زبان

اُ as a word-initial standalone vowel.

اتر

ئُ as a word-medial standalone vowel.

ارادو

ءُ as a word-final standalone vowel.

ڀاء

u
 

ُو

لوڻ

اُو as a word-initial standalone vowel.

اوناڙڻ

ئُو as a word-medial/final standalone vowel.

ڳئون

e
 

ي

تیز

ئي ئي as a word-medial/final standalone vowel.

ائين

o
 

و

آنو

او as a word-initial standalone vowel.

اوڀر

ئو as a word-medial/final standalone vowel.

سائو

ə
 

َ

برف

اَ as a word-initial standalone vowel.

افسوس

ئَ as a word-medial standalone vowel.

پئڻ

ءَ as a word-final standalone vowel.

جونء

æ
 

ع ع ع

قلعو

ع ع

علاقو

a ɑ
 

ا

برابر

انار

آ

آسمان

ا

انار

Consonants

Consonant summary table

The following table summarises the main consonant to character assignments.


Onsets
پ␣ب␣ت␣ط␣د␣ٽ␣ڊ␣چ␣ج␣ڪ␣گ␣ق
ڦ␣ڀ␣ٿ␣ڌ␣ٺ␣ڍ␣ڇ␣جھ␣ک␣گھ
ٻ␣ڏ␣ڄ␣ڳ
ف␣س␣ص␣ث␣ذ␣ظ␣ض␣ز␣ش␣خ␣غ␣ح␣ه
م␣ن␣ڻ␣ڃ␣ڱ
مھ␣نھ␣ڻھ
و␣ر␣ڙ␣ل␣ي
ڙھ␣لھ
Finals
ہ

For additional details see consonant_mappings.

Basic consonants

The following list shows the basic set of consonant letters used for native Sindhi.

پ␣ڦ␣ب␣ڀ␣ت␣ٿ␣ڇ␣د␣ڌ␣ٽ␣ٺ␣ڊ␣ڍ␣ڪ␣ک␣گ␣ق␣ٻ␣ڏ␣ڄ␣ڳ␣چ␣ج␣ف␣س␣ز␣ش␣خ␣غ␣ح␣ه␣م␣ن␣ڻ␣ڃ␣ڱ␣و␣ر␣ل␣ڙ␣ي

Six more consonant letters are hangovers from the original spellings of loan words.

ط␣ث␣ص␣ذ␣ض␣ظ

The final and isolated forms of م have a short tail in Sindhi, rather than the long downwards tail in many other orthographies, such as Arabic, Persian, & Urdu.

Aspiration

A number of Sindhi phones are accompanied by aspiration. This is indicated in the orthography in 2 different ways, depending on the base consonant.

The following consonants indicate aspiration by a dedicated glyph:

ڦ␣ڀ␣ٿ␣ڌ␣ٺ␣ڍ␣ڇ␣ک

The other aspirated consonants, listed below, use a digraph with ھ.

جھ␣گھ␣مھ␣نھ␣ڻھ␣ڙھ␣لھ

Variant forms of heh

Recent discussions in Unicode committees have highlighted how historical limitations of technology have led to an inconsistent use of code points and glyphs to represent the various forms of heh in Sindhi. An attempt was made by Evanslesd and Mansourkmsd to provide guidelines for correct usage which would promote consistency going forward.

The scenarios that involve a form of heh are the following:

  1. The phoneme h as used for a syllable onset.

    In this case use ه.

    لاهور

  2. An aspiration marker following several consonants.

    For this, use ھ.

    پڙھڻ

  3. The so-called 'silent heh', which only occurs word-finally, and which generally is either silent or represents a waning breath after a short vowel.

    For this, use ہ.

    بيکہ
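The three roles above could be checked mechanically. The following Python sketch is an illustration only, not the method proposed in the Unicode discussions: the function name check_heh_usage is hypothetical, the set of base letters is taken from the aspirated digraphs listed in this document, and the check assumes unvowelled text (a diacritic between a base letter and ھ would be reported as a problem).

```python
import unicodedata

HEH = "\u0647"              # ARABIC LETTER HEH: the phoneme h in a syllable onset
HEH_DOACHASHMEE = "\u06BE"  # ARABIC LETTER HEH DOACHASHMEE: aspiration marker
HEH_GOAL = "\u06C1"         # ARABIC LETTER HEH GOAL: word-final 'silent heh'

# Base letters of the aspirated digraphs listed in this document.
ASPIRATION_BASES = {
    "\u062C", "\u06AF", "\u0645", "\u0646", "\u06BB", "\u0699", "\u0644",
}


def check_heh_usage(text: str) -> list[str]:
    """Report uses of U+06BE and U+06C1 that fall outside the roles above."""
    issues = []
    for i, ch in enumerate(text):
        if ch == HEH_DOACHASHMEE:
            previous = text[i - 1] if i > 0 else ""
            if previous not in ASPIRATION_BASES:
                issues.append(f"U+06BE at index {i} does not follow a digraph base")
        elif ch == HEH_GOAL:
            # Skip any combining marks, then make sure no letter follows.
            j = i + 1
            while j < len(text) and unicodedata.category(text[j]) == "Mn":
                j += 1
            if j < len(text) and unicodedata.category(text[j]).startswith("L"):
                issues.append(f"U+06C1 at index {i} is not word-final")
    return issues


print(check_heh_usage("\u0644\u0627\u06BE\u0648\u0631"))  # U+06BE used as an onset
```

A check like this only flags candidates for review; it cannot decide which spelling the author intended.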

The confusion around which code point to use is understandable where the glyph shapes look the same. However, it is also easy to find text that departs from the advice of the Unicode experts even where the code points used produce a different glyph from the one you would expect. The following table shows the expected forms when rendered for each code point.

character right-joining medial left-joining
ه ه ه ه
ھ ھ ھ ھ
ہ ہ ہ (n/a) ہ (n/a)

Consonant clusters

No special mechanisms are used to indicate consonant clusters. These are simply written as a sequence of characters.

Consonant sounds to characters

This section maps Sindhi consonant sounds to common graphemes in the Arabic orthography. Sounds listed as 'infrequent' are allophones, or sounds used for foreign words, etc.

The right-hand column shows the various joining forms.


 
 
 
 
Joining forms
p
 

پ

پئڻ

067E067E067E

 

ڦ

ڦار

06A606A606A6

b
 

ب

برابر

062806280628

 

ڀ

ڀارت

068006800680

ɓ
 

ٻ

ٻج

067B067B067B

t
 

ت

تنگ

062A062A062A

 
 

ط allograph, retained in loan words.

غلطي

063706370637

 

ٿ

ٿورو

067F067F067F

t͡ɕ
 

چ

چيرڻ

068606860686

t͡ɕʰ
 

ڇ

ڇٽڻ

068706870687

d
 

د

دڪان

062F062F

 

ڌ

ڌرتي

068C068C

d͡ʑ
 

ج

جبل

062C062C062C

d͡ʑʰ
 

جھ

جھنگ

062C 06BE062C 06BE062C 06BE

ɗ
 

ڏ

ڏينهن

068F068F

ʈ
 

ٽ

ٽنگ

067D067D067D

ʈʰ
 

ٺ

ٺڪر

067A067A067A

ɖ
 

ڊ

ڊڄڻ

068A068A

ɖʰ
 

ڍ

ڍڳو

068D068D

ʄ
 

ڄ

ڄمڻ

068406840684

k
 

ڪ

ڪاغذ

06AA06AA06AA

 

ک

کاند

06A906A906A9

ɡ
 

گ

گسڻ

06AF06AF06AF

ɡʰ
 

گھ

اگھڻ

06AF 06BE06AF 06BE06AF 06BE

ɠ
 

ڳ

ڳئون

06B306B306B3

q
 

ق

قهوو

064206420642

f
 

ف

افسوس

064106410641

s
 

س

سنڌي

063306330633

 
 

ص allograph, retained in loan words.

صحيح

063506350635

 
 

ث allograph, retained in loan words.

ثواب

062B062B062B

z
 

ز

زبان

063206320632

 
 

ذ allograph, retained in loan words.

ڪاغذ

06300630

 
 

ظ allograph, retained in loan words.

مظفر ڳڙھ

063806380638

 
 

ض allograph, retained in loan words.

رمضان

063606360636

ʂ
 

ش

شڪر

063406340634

x
 

خ

خواب

062E062E062E

ɣ
 

غ

غلطي

063A063A063A

h
 

ه for syllable onsets.

لاهور

064706470647

 
 

ح allograph, retained in loan words.

حجاب

062D062D062D

ʰ
 

ھ used only after another consonant to create aspiration.

06BE06BE06BE

 
 

ہ, like a waning breath when pronounced, but this is more often silent. Used in word-final position only.

باهہ

06C1

m
 

م

مڻيار

064506450645

 

مھ

0645 06BE0645 06BE0645 06BE

n
 

ن

نئون

064606460646

 

نھ

0646 06BE0646 06BE0646 06BE

ɳ
 

ڻ

لوڻ

06BB06BB06BB

ɳʰ
 

ڻھ

06BB 06BE06BB 06BE06BB 06BE

ɲ
 

ڃ

ڀڃڻ

068306830683

ŋ
 

ڱ

اڱارو

06B106B106B1

ʋ
 

و

واءُ

06480648

r
 

ر

راجا

06310631

ɽ
 

ڙ

ڪيوڙو

06990699

ɽʰ
 

ڙھ

پڙھڻ

0699 06BE0699 06BE0699 06BE

l
 

ل

لوڻ

064406440644

 

لھ

0644 06BE0644 06BE0644 06BE

j
 

ي

ڏياري

064A064A064A

Symbols

Sindhi uses 2 symbols.

۾␣۽

۽ is equated with an ampersand.

۾ is a locative case marker, pronounced mẽ.

Encoding choices

This section offers advice about characters or character sequences to avoid, and what to use instead. It takes into account the relevance of Unicode Normalisation Form D (NFD) and Unicode Normalisation Form C (NFC).

Although usage is recommended here, content authors may well be unaware of such recommendations. Therefore, applications should look out for the non-recommended approach and treat it the same as the recommended approach wherever possible.

Writing heh

Unicode has a range of code points dedicated to the sound referred to as heh. Some of the code points are separate because they produce different glyphs in positional forms than others; sometimes the difference is semantic. For want of guidelines, and due to historical technological complications, Sindhi users tend to make use of code points in creative ways, typically employing any code point that produces the glyph form they expect to see in a given context.

The inconsistencies produced by this approach hamper search and other machine-based algorithms that deal with the text.

Recently Unicode committees have been considering recommendations for consistent usage which are described in the section heh.

Canonically equivalent encodings

Two letters can be represented as an atomic character (the norm), or as a sequence of base letter plus combining mark. The parts are separated in Unicode Normalisation Form D (NFD), and recomposed in Unicode Normalisation Form C (NFC), so both approaches should be treated as canonically equivalent.

Atomic (recommended) Decomposed (NOT recommended)
آ 0627 0653
ئ 064A 0654

Normally, text will use the atomic form, and this is generally recommended by the Unicode Standard.
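The equivalence can be verified with Python's standard unicodedata module. This is a small sketch for illustration; the helper name canonically_equal is an assumption, not something defined by this document.

```python
import unicodedata

# Decomposed sequences from the table above.
decomposed_alef_madda = "\u0627\u0653"  # alef + maddah above
decomposed_yeh_hamza = "\u064A\u0654"   # yeh + hamza above

# NFC recomposes each sequence into the single atomic character.
nfc_alef_madda = unicodedata.normalize("NFC", decomposed_alef_madda)
nfc_yeh_hamza = unicodedata.normalize("NFC", decomposed_yeh_hamza)


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print([f"U+{ord(ch):04X}" for ch in nfc_alef_madda + nfc_yeh_hamza])
```

Normalising both sides before comparing lets an application treat the recommended and non-recommended encodings as the same text.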

Confusables & spelling errors

This table lists characters that are often mistakenly used because they look the same as or similar to the code points used for Sindhi, or perhaps because the correct character is not available on the user's keyboard.

Incorrect Correct Notes
06CC 064A The Farsi YEH drops the dots below in isolated and final positions.
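An application that wants to match text regardless of which yeh was typed could fold the confusable onto the expected code point before comparing. The sketch below is an assumption about how that might be done (the name fold_yeh is hypothetical). It should only be applied to text known to be Sindhi, since 06CC is the correct letter in other orthographies such as Persian and Urdu.

```python
# Map the confusable FARSI YEH (U+06CC) to ARABIC LETTER YEH (U+064A).
YEH_FOLD = str.maketrans({"\u06CC": "\u064A"})


def fold_yeh(text: str) -> str:
    """Return a matching key with U+06CC replaced by U+064A (hypothetical helper)."""
    return text.translate(YEH_FOLD)


typed = "\u0633\u0646\u068C\u06CC"     # the word for Sindhi typed with U+06CC
expected = "\u0633\u0646\u068C\u064A"  # the same word with U+064A
print(fold_yeh(typed) == expected)
```

Folding for comparison leaves the stored text untouched, which avoids silently rewriting an author's content.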

Codepoint sequences

Combining marks always follow the base character.
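As a rough illustration, the ordering rule can be tested with the general categories exposed by Python's unicodedata module. The function name marks_follow_bases is hypothetical, and treating any letter or preceding mark as a valid base is a simplifying assumption.

```python
import unicodedata


def marks_follow_bases(text: str) -> bool:
    """Return True if every combining mark is preceded by a letter or another mark."""
    previous = ""
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            # A mark at the start of the text, or after a non-letter, has no base.
            if not previous or unicodedata.category(previous)[0] not in ("L", "M"):
                return False
        previous = ch
    return True


print(marks_follow_bases("\u0634\u064E"))  # sheen followed by fatha
print(marks_follow_bases("\u064E\u0634"))  # fatha typed before its base letter
```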

Numbers

Digits

Sindhi uses the set of native digits in the Unicode Arabic block known as Eastern Arabic-Indic digits.

۰␣۱␣۲␣۳␣۴␣۵␣۶␣۷␣۸␣۹

The glyph shapes are typically different for 3 of the digits (although not always the same 3 digits) in Persian, Urdu and Sindhi.

Arabic٠١٢٣٤٥٦٧٨٩
Persian۰۱۲۳۴۵۶۷۸۹
Urdu۰۱۲۳۴۵۶۷۸۹
Sindhi۰۱۲۳۴۵۶۷۸۹
Arabic-Indic numerals, as used in Arabic, Persian, Urdu and Sindhi language text.
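Converting between ASCII digits and the native digits is a simple character mapping. The following Python sketch is illustrative; the helper names to_native_digits and from_native_digits are assumptions. It relies on the fact that Python's int() accepts any Unicode decimal digit.

```python
import unicodedata

# ASCII digits mapped to the digits shown above (U+06F0 to U+06F9).
NATIVE_DIGITS = "\u06F0\u06F1\u06F2\u06F3\u06F4\u06F5\u06F6\u06F7\u06F8\u06F9"
TO_NATIVE = str.maketrans("0123456789", NATIVE_DIGITS)


def to_native_digits(number: int) -> str:
    """Format an integer using the native digits (hypothetical helper)."""
    return str(number).translate(TO_NATIVE)


def from_native_digits(text: str) -> int:
    """Parse a string of native digits; int() handles Unicode decimal digits."""
    return int(text)


print(to_native_digits(2024))
print(unicodedata.name("\u06F5"))
```

Only the code points are converted here. The digit shapes that differ between Persian, Urdu and Sindhi are selected by the font and language settings, not by the character code.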

Text direction

Arabic script text is written horizontally and right-to-left in the main but, as in most right-to-left scripts, numbers and embedded text in other scripts are written left-to-right (producing 'bidirectional' text).

العاشر ليونيكود (Unicode Conference)،الذي سيعقد في 10-12 آذار 1997 مبدينة
Arabic words are read right-to-left, starting from the right of this line, but numbers and Latin text (highlighted) are read left-to-right.

The Unicode Bidirectional Algorithm automatically takes care of the ordering for all the text in fig_bidi, as long as the 'base direction' is set to RTL. In HTML this can be set using the dir attribute, or in plain text using formatting controls.

If the base direction is not set appropriately, the directional runs will be ordered incorrectly as shown in fig_bidi_no_base_direction, making it very difficult to get the meaning.

في XHMTL 1.0 يتم تحقيق ذلك بإضافة العنصر المضمن bdo.
في XHMTL 1.0 يتم تحقيق ذلك بإضافة العنصر المضمن bdo.
The exact same sequence of characters with the base direction set to RTL (top), and with no base direction set on this LTR page (bottom). Certain items are highlighted to help track their position.
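The directional behaviour described here is driven by the bidi class of each character, which can be inspected with Python's unicodedata module. The sketch below also includes a simplified first-strong heuristic for guessing a base direction; the function name first_strong_direction is hypothetical, and the heuristic ignores details of the full algorithm such as isolates.

```python
import unicodedata


def first_strong_direction(text: str):
    """Guess a base direction from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            return "rtl"
        if bidi_class == "L":
            return "ltr"
    return None  # no strong character found


# Bidi classes for a Sindhi letter, a Latin letter, and two kinds of digit.
for ch in ("\u0633", "a", "1", "\u06F1"):
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))

print(first_strong_direction("\u06F1\u06F2 \u0633\u0646\u068C\u064A"))
```

Digits are not strongly directional, so text that begins with a number still resolves to the direction of the first letter that follows.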

Show default bidi_class properties for characters in the Sindhi language.

For other aspects of dealing with right-to-left writing systems see the following sections:

For more information about how directionality and base direction work, see Unicode Bidirectional Algorithm basics. For information about plain text formatting characters see How to use Unicode controls for bidi text. And for working with markup in HTML, see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.

For authoring HTML pages, one of the most important things to remember is to use <html dir="rtl" … > at the top of the page. Also, use markup to manage direction, and do not use CSS styling.

Managing text direction

Unicode provides a set of 10 formatting characters that can be used to control the direction of text when displayed. These characters have no visual form in the rendered text, however text editing applications may have a way to show their location.

202B (RLE), 202A (LRE), and 202C (PDF) are in widespread use to set the base direction of a range of characters. RLE/LRE comes at the start, and PDF at the end of a range of characters for which the base direction is to be set.

In Unicode 6.3, the Unicode Standard added a set of characters which do the same thing but also isolate the content from surrounding characters, in order to avoid spillover effects. They are 2067 (RLI), 2066 (LRI), and 2069 (PDI). The Unicode Standard recommends that these be used instead.

There is also 2068 (FSI), used initially to set the base direction according to the first recognised strongly-directional character.

061C (ALM) is used to produce correct sequencing of numeric data. See expressions for details.

200F (RLM) and 200E (LRM) are invisible characters with strong directional properties that are also sometimes used to produce the correct ordering of text.
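In plain text, the isolating controls are simply placed around the range of characters concerned. The following Python sketch shows one way this might be done; the function name isolate and its direction parameter are assumptions made for illustration.

```python
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text: str, direction: str = "auto") -> str:
    """Wrap text in an isolate so its direction does not affect its surroundings."""
    opener = {"rtl": RLI, "ltr": LRI, "auto": FSI}[direction]
    return f"{opener}{text}{PDI}"


# Insert a Sindhi word into a left-to-right message without spillover effects.
word = "\u0633\u0646\u068C\u064A"
message = "Language name: " + isolate(word, "rtl") + " (2 results)"
print([f"U+{ord(ch):04X}" for ch in isolate(word, "rtl")])
```

Using FSI (the "auto" option here) is convenient when the direction of inserted text, such as user-supplied data, is not known in advance.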

For more information about how to use these formatting characters see How to use Unicode controls for bidi text. Note, however, that when writing HTML you should generally use markup rather than these control codes. For information about that, see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.

Expressions & sequences

Sequences of numbers are sets of numbers separated by punctuation or spaces, such as 10–12–2022. Sequences of digits, such as 123, in Arabic script text run LTR automatically. Expressions and sequences of numbers follow somewhat complicated rules, which are described in the Arabic language orthography notes.

Glyph shaping & positioning

Experiment with examples using the Sindhi character app.

Cursive script

See type samples.

Arabic script is always cursive, ie. letters in a word are joined up. Fonts need to produce the appropriate joining form for a letter, according to its visual context, but the code point used doesn't change. This results in four different shapes for most letters (including an isolated shape). Ligated forms also join with characters alongside them.

The highlights in the example below show the same letter, ع, with three different joining forms.

على • متعددة • وسيجمع

The letter ع (ain) in 3 different joining contexts.

Most Arabic script letters join on both sides. A few only join on the right-hand side: this involves 4 basic shapes for Modern Standard Arabic.

ء doesn't join on either side.

Cursive joining forms

Most dual-joining characters add or become a swash when they don't join to the left. A number of characters, however, undergo additional shape changes across the joining forms. fig_joining_forms and fig_right_joining_forms show the basic shapes in Modern Standard Arabic and what their joining forms look like. Significant variations are highlighted.

isolated right-joined dual-join left-joined Sindhi letters
ب ـب ـبـ بـ
پ␣ت␣ٽ␣ب␣ٿ␣ٺ␣ڀ␣ٻ␣ث
ن ـن ـنـ نـ
ن␣ڻ
ق ـق ـقـ قـ
ق
ف ـف ـفـ فـ
ڦ␣ف
س ـس ـسـ سـ
س␣ش
ص ـص ـصـ صـ
ص␣ض
ط ـط ـطـ طـ
ط␣ظ
ک ـک ـکـ کـ
گ␣ک␣ڳ␣ڱ
ڪ ـڪ ـڪـ ڪـ
ڪ
ل ـل ـلـ لـ
ل
ه ـه ـهـ هـ
ه
ھ ـھ ـھـ ھـ
ھ
م ـم ـمـ مـ
م
ع ـع ـعـ عـ
ع␣غ
ح ـح ـحـ حـ
چ␣ج␣ڇ␣ڄ␣خ␣ح␣ڃ
ي ـي ـيـ يـ
ي␣ئ
Joining forms for shapes that join on both sides.
isolated right-joined Sindhi letters
ا ـا
ا
ر ـر
ز␣ر␣ڙ
د ـد
د␣ڊ␣ڌ␣ڍ␣ڏ␣ذ
و ـو
و
Joining forms for shapes that join on the right only.
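The effect of right-joining letters can be shown by splitting a word into its visually connected runs. This Python sketch is an illustration under stated assumptions: the function name joined_runs is hypothetical, the letter set is taken from the right-joining table above (with آ added on the assumption that it behaves like ا), and combining marks are not handled, so it applies to unvowelled words only.

```python
# Letters that join only to the letter on their right (from the table above).
RIGHT_JOINING = set(
    "\u0627\u0622"                          # alef, alef with madda
    "\u0631\u0632\u0699"                    # reh shapes
    "\u062F\u068A\u068C\u068D\u068F\u0630"  # dal shapes
    "\u0648"                                # waw
)
NON_JOINING = {"\u0621"}  # hamza joins on neither side


def joined_runs(word: str) -> list[str]:
    """Split a word into runs of letters that connect to each other."""
    runs, current = [], ""
    for ch in word:
        if ch in NON_JOINING:
            if current:
                runs.append(current)
            runs.append(ch)
            current = ""
            continue
        current += ch
        if ch in RIGHT_JOINING:
            # A right-joining letter never connects to what follows it.
            runs.append(current)
            current = ""
    if current:
        runs.append(current)
    return runs


print(joined_runs("\u0628\u0631\u0627\u0628\u0631"))
```

The breaks between runs are the small gaps inside a word; they are a product of glyph shaping, not of spaces in the text.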

Managing glyph shaping

200D (ZWJ) and 200C (ZWNJ) are used to control the joining behaviour of cursive glyphs. They are particularly useful in educational contexts, but also have real world applications.

ZWJ permits a letter to form a cursive connection without a visible neighbour. For example, the marker for hijri dates in Arabic is an initial form of heh, even though it doesn't join to the left, ie. ه‍. For this, use ZWJ immediately after the heh, eg. الاثنين 10 رجب 1415 ه‍..

ZWNJ prevents two adjacent letters forming a cursive connection with each other when rendered. For example, it is used in Persian for plural suffixes, some proper names, and Ottoman Turkish vowels. Ignoring or removing the ZWNJ will result in text with a different meaning or meaningless text, eg. تن‌ها is the plural of body, whereas تنها is the adjective alone. The only difference is the presence or absence of ZWNJ after noon.
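The two examples above can be written out code point by code point. The Python sketch below is illustrative only; the helper strip_join_controls is hypothetical and is included to show why removing these characters is unsafe, not as a recommendation.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

# Heh followed by ZWJ takes an initial (left-joining) shape with no neighbour.
hijri_marker = "\u0647" + ZWJ

# The Persian pair from the text: identical letters, with and without ZWNJ.
plural_of_body = "\u062A\u0646" + ZWNJ + "\u0647\u0627"
adjective_alone = "\u062A\u0646\u0647\u0627"


def strip_join_controls(text: str) -> str:
    """Remove ZWJ and ZWNJ (hypothetical helper, shown to illustrate the risk)."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


print(plural_of_body == adjective_alone)
print(strip_join_controls(plural_of_body) == adjective_alone)
```

The first comparison is false and the second is true: stripping the controls merges two different words, which is why text-cleaning steps should leave them in place.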

034F (COMBINING GRAPHEME JOINER) is used in Arabic to produce special ordering of diacritics. The name is a misnomer, as it is generally used to break the normal sequence of diacritics.

Context-based shaping & positioning

In addition to the cursive shaping, Arabic script glyphs also require context-dependent shaping and positioning. For more information, see the Arabic language orthography notes.

The usual mandatory ligature applies for لا.

لاسي

اخلاق

Typographic units

Word boundaries

Words are separated by spaces.

Graphemes

tbd

Phrase, sentence, and section delimiters are described in phrase.

Punctuation & inline features

Phrase & section boundaries

⹁␣:␣⁏␣.␣؟␣!␣—

Sindhi uses a mixture of ASCII, Arabic, and other punctuation.

phrase

:

sentence

.

۔

؟

!

The Unicode Standard says that Sindhi uses the reversed comma and semicolon, rather than the Arabic punctuation marks ، and ؛ [u16,#G3681]. However, it is easy to find text in Sindhi that uses the Arabic punctuation.

Some Sindhi texts use . as a full stop, whereas others use ۔.
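To see which convention a particular text follows, the relevant marks can be counted by Unicode name. This is a minimal Python sketch; the function name punctuation_profile and the selection of tracked characters are assumptions based on the marks discussed in this section.

```python
from collections import Counter
import unicodedata

# Reversed comma and semicolon, Arabic comma and semicolon, and both full stops.
TRACKED = "\u2E41\u204F\u060C\u061B\u06D4."


def punctuation_profile(text: str) -> Counter:
    """Count occurrences of the tracked punctuation marks, keyed by Unicode name."""
    return Counter(unicodedata.name(ch) for ch in text if ch in TRACKED)


sample = "\u0622\u0647\u064A\u2E41 \u0627\u0646\u060C \u0627\u0646."
for name, count in punctuation_profile(sample).items():
    print(count, name)
```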

Bracketed text

See type samples.

(␣)

Sindhi commonly uses ASCII parentheses to insert parenthetical information into text.

  start end
standard ( )

Mirrored characters

The words 'left' and 'right' in the Unicode names for parentheses, brackets, and other paired characters should be ignored. LEFT should be read as if it said START, and RIGHT as END. The direction in which the glyphs point will be automatically determined according to the base direction of the text.

a > b > c
ا > ب > ج
Both of these lines use > U+003E GREATER-THAN SIGN, but the direction it faces depends on the base direction at the point of display.

The number of characters that are mirrored in this way is around 550, most of which are mathematical symbols. Some are single characters, rather than pairs. The following are some of the more common ones.

(␣)␣<␣>␣[␣]␣{␣}␣«␣»␣‹␣›
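Whether a character is mirrored is recorded in the Unicode Character Database, and can be queried with Python's unicodedata module. The helper name is_mirrored below is an assumption made for this sketch.

```python
import unicodedata


def is_mirrored(ch: str) -> bool:
    """Return True if the character has the Bidi_Mirrored property."""
    return unicodedata.mirrored(ch) == 1


# Brackets, comparison signs and guillemets are mirrored;
# curly quotation marks are not.
for ch in "()<>[]{}\u00AB\u00BB\u2039\u203A\u201C\u201D\u2018\u2019":
    print(f"U+{ord(ch):04X}", is_mirrored(ch))
```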

Quotations & citations

See type samples.

”␣“␣’␣‘

The following type of quotation mark can be found in Sindhi texts. When quoted text appears within quoted text different characters are used, though usually of the same type. (Of course, depending on ease of input, quotations may also be surrounded by ASCII double and single quote marks.)

  start end
primary

nested

Unlike brackets, these quote marks are not mirrored during display. As a result, LEFT means use on the left, and RIGHT means use on the right.

Line & paragraph layout

Line breaking & hyphenation

Lines are generally broken between words. They are not broken at the small gaps that appear where a character doesn't join on the left.

Line-edge rules

As in almost all writing systems, certain punctuation characters should not appear at the end or the start of a line. The Unicode line-break properties help applications decide whether a character should appear at the start or end of a line.

Show default line-breaking properties for characters in this orthography.

The following list gives examples of typical behaviours for characters affected by these rules. Context may affect the behaviour of some of these and other characters.

  • « “ ‘ (   should not be the last character on a line
  • » ” ’ ) . ⹁ ⁏ ؟ !   should not begin a new line
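The two rules in the list above can be expressed as a simple check on the characters either side of a candidate break. This Python sketch is a simplification for illustration: the function name edge_rules_permit_break is hypothetical, and real implementations rely on the Unicode line-break properties, which cover many more characters and contexts.

```python
# Characters that should not be the last character on a line.
NO_LINE_END = set("\u00AB\u201C\u2018(")
# Characters that should not begin a new line.
NO_LINE_START = set("\u00BB\u201D\u2019).\u2E41\u204F\u061F!")


def edge_rules_permit_break(before: str, after: str) -> bool:
    """Check only the line-edge rules for a break between two characters.

    This does not decide whether the position is a word boundary.
    """
    return before not in NO_LINE_END and after not in NO_LINE_START


print(edge_rules_permit_break("(", "\u0633"))       # break after an opening bracket
print(edge_rules_permit_break("\u0633", "\u2E41"))  # break before a reversed comma
print(edge_rules_permit_break(" ", "\u0633"))       # break after a space
```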

Breaking between Latin words

When a line break occurs in the middle of an embedded left-to-right sequence, the items in that sequence need to be rearranged visually so that it isn't necessary to read lines upwards.

latin-line-breaks shows how two Latin words are apparently reordered in the flow of text to accommodate this rule. Of course, the rearrangement is only that of the visual glyphs: nothing affects the order of the characters in memory.

Text with no line break in Latin text.

Text with line break in Latin text.

In this Arabic language text, the lower of these two images shows the result of decreasing the line width, so that text wraps between a sequence of Latin words.

Page & book layout

General page layout & progression

Sindhi books, magazines, etc., are bound on the right-hand side, and pages progress from right to left.

عنوان كتاب

Binding configuration for Sindhi books, magazines, etc.

Columns are vertical but run right-to-left across the page.
