Kashmiri (arab) orthography notes v32

There are few nastaliq fonts around, and only a few of those support Kashmiri. The Noto Nastaliq Urdu font incorporated adaptations to support Kashmiri as of version 3.002. This page uses a webfont based on recent versions of that font. Note that the language of the text needs to be set to 'ks' for the correct shapes to be applied.
(At the time of writing, a macOS bug appears to prevent use of the latest version of the font when installed on your system – only the pre-installed version is available, but the webfont should work. And note also that Safari web browser, by policy, will only use pre-installed fonts. However, the latest version of the font works fine when installed on Windows.)
An alternative is to use SIL's Awami Nastaliq font, but this is a Graphite font and so only works fully in the Firefox web browser, and font settings are needed to produce the rounded hamza diacritics, rather than the s-shaped ones produced by default by the Awami font.

Basic features

The Kashmiri Arabic orthography is an alphabet, ie. all vowels are written explicitly, alongside consonants; there is no inherent vowel in a consonant (abugidas), certain vowels are not systematically dropped (abjads), and consonant and vowel are not combined in the same character (syllabaries).

The Kashmiri Arabic orthography is derived from the Arabic/Persian abjads, where in normal use the script represents only consonant and long vowel sounds. However, the script has been adapted in this orthography to cope with the many more vowel sounds in Kashmiri, and this is one of the Arabic orthographies that regularly indicates all vowel sounds, making it an alphabet.

❯ basicV

Vowels Vowels are written using a mixture of 10 combining marks and 10 letters. Unlike Arabic, Persian, and Urdu, all vowel diacritics are always visible in Kashmiri texts. Representation of 3 vowel sounds is complicated by the use of different code points in medial vs. final position.

Standalone vowels in word-initial position are preceded by or attached to either 0627 or 0639.

❯ consonantSummary

Consonants Modern Kashmiri has 21 basic consonant letters and 6 aspirated digraphs, but can use 13 more consonants to spell words loaned from Persian, Arabic and Urdu.

A special letter is used to indicate palatalisation, which is common in Kashmiri. Similarly to the other yeh used in Kashmiri, it has a circle below when used in syllable onsets, and a swash with no circle after a syllable coda.

The distinction between ijam vs. tashkil has a bearing on several Kashmiri graphemes, and the choice between precomposed and decomposed realisations of a vowel letter can be complicated (see encoding).

Vowel absence Vowel absence is generally not indicated, except for nasalisation and medial consonants.

The jazm is used over a word-medial 0646 to indicate nasalisation of a preceding vowel sound.

Unlike other Arabic orthographies, 0652 (jazm), normally used to show vowel absence, is placed over the second consonant in an onset cluster (such as tr). That letter may therefore carry both the jazm diacritic and a vowel diacritic, which is quite unusual.

Numbers Kashmiri uses native digits.

Layout Kashmiri is written right to left in horizontal lines, but numbers and embedded Latin text are read left-to-right. Words are separated by spaces. There is no case distinction.

Kashmiri is principally written using the nasta'liq style of Arabic writing. Glyphs are more drawn out, and the baseline tends to be sloping from word to word. The script is cursive, and some basic letter shapes change radically, depending on what they join to. The nastaliq styling creates diagonal baselines between joined characters, and tends to reduce clarity about where one letter ends and the next starts. (The dots and other diacritics associated with letters become particularly useful for the reader.)

Letters are joined (cursive) as is usual for the Arabic script.

Kashmiri uses Arabic code points for several of the more common punctuation marks.

Joining forms

Because the Arabic script is 'cursive' (ie. joined-up) writing, letters tend to have different shapes depending on whether they join with adjacent letters or not (see cursive). In addition, vowels can be represented using different characters, depending on where in a word they appear.

In scripts such as Arabic, several characters have no left-joining form. In what follows we'll use the characters ي and د to illustrate shapes. The former can join on both sides, but the latter can only join on the right.

Left-joining glyphs are commonly called initial; dual-joining are called medial; and right-joining are called final. Glyphs that don't join on either side are called isolated. However, these glyph shapes can be found in various places within a single word.

Word-initial characters usually have initial glyph shapes (eg. 064A ). However, characters that only join to the right will use an isolated glyph shape (eg. 062F ). Furthermore, words beginning with a vowel are always preceded by a vowel carrier, which is normally ا (eg. 0627 06CC or 0627 064E ).

Word-medial characters will typically join on both sides (eg. 064A ) but those that only join to the right will use a final glyph (eg. 062F ). However, if either of those is preceded by another character that only joins to the right, the glyph shapes rendered will be initial (eg. 064A ) and isolated (eg. 062F ), respectively.

Word-final characters will typically use a final glyph shape (eg. 064A and 062F ). However, if the previous character joins only to the right, they will use isolated glyph shapes (eg. 064A and 062F ).

In all this contextual glyph shaping the basic shapes used for a character can vary significantly in a script like Arabic. This also includes some characters that only have ijam dots in certain contexts.
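
The positional rules described above can be sketched in a few lines of Python. This is only an illustration, not a shaping engine: the set `RIGHT_JOINING_ONLY` is a small, hand-picked assumption (a real implementation would read the Joining_Type property from the Unicode Character Database), and the function name `joining_forms` is hypothetical.

```python
import unicodedata

# Letters that join only to the preceding letter, ie. they have no
# initial or medial glyph. Hand-picked and incomplete: an assumption
# for this sketch, not data read from the Unicode Character Database.
RIGHT_JOINING_ONLY = set("\u0627\u062F\u0630\u0631\u0632\u0648\u06D2")


def joining_forms(word):
    """Return (character, form) pairs for the letters of a single word."""
    # Combining marks are transparent to joining, so drop them first.
    letters = [ch for ch in word if unicodedata.category(ch) != "Mn"]
    forms = []
    for i, ch in enumerate(letters):
        # A letter joins to the previous one only if that letter can
        # join onward, ie. it is not a right-joining-only letter.
        joins_prev = i > 0 and letters[i - 1] not in RIGHT_JOINING_ONLY
        # A letter joins to the next one only if it is dual-joining.
        joins_next = i < len(letters) - 1 and ch not in RIGHT_JOINING_ONLY
        if joins_prev and joins_next:
            form = "medial"
        elif joins_prev:
            form = "final"
        elif joins_next:
            form = "initial"
        else:
            form = "isolated"
        forms.append((ch, form))
    return forms


# 064A joins on both sides; 062F joins only to the preceding letter.
print(joining_forms("\u064A\u062F"))
print(joining_forms("\u062F\u064A\u062F"))
```

In the second call the middle 064A is reported as initial rather than medial, because the letter before it cannot join onward.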

	Front	Central	Back
High	i iː	ɨ ɨː	u uː
Mid	e eː	ə əː	o oː
Low		a aː	ɔ

		Bilabial	Dental	Alveolar	Retroflex	Alveolo-palatal	Velar	Glottal
Stop / Affricate	plain	p b	t d	ts	ʈ ɖ	tʃ dʒ	k ɡ
	aspirated	pʰ	tʰ	tsʰ	ʈʰ	tʃʰ	kʰ
Fricative				s z		ʃ		h
Nasal		m	n
Approximant			l			j	w
Trill			r

Vowels

i اِ‍ ◌ِ ◌ِ iː ایٖ‍ ‍یٖ‍ ‍ی	ɨ إ‍ ◌ٕ ◌ٕ ɨː اٟ ◌ٟ ◌ٟ	u اُ ◌ُ ◌ُ uː اوٗ ‍وٗ ‍وٗ
e ایٚ‍ ‍یٚ‍ ےٚ eː ای‍ ‍ی‍ ‍ے		o اوٚ ‍وٚ ‍وٚ oː او ‍و ‍و
	ə أ ◌ٔ ◌ٔ əː ٲ ‍ٲ ‍ٲ	ɔ اۄ ‍ۄ ‍ۄ ɔː اۄا ۄا ۄا
	a اَ ◌َ ◌َ aː آ ‍ا ‍ا

Each table cell shows word-initial, word-medial, and word-final forms from right to left. The glyphs shown are illustrative; alternative shapes may occur (see joining_forms). Click/tap on items to see a list of the components for that cell.

Observation: Several items in the Kashmiri dictionary end with a vowel followed by h. Is this the standard way to write word-final short vowels - and some long ones?

For a question about the ordering of characters in final e, see final_e. For questions about whether to use precomposed or decomposed letters, see encoding.

Post-consonant vowels

Kashmiri is an alphabet where 16 vowel sounds (far more than in Arabic or Persian) are written using a mixture of 10 combining marks and 10 letters. Unlike Arabic, Persian, and Urdu, all vowel diacritics are always visible in Kashmiri texts. Representation of 3 vowel sounds is complicated by the use of different code points in medial vs. final position.

The distinction between ijam vs. tashkil has a bearing on several Kashmiri graphemes, and the choice between precomposed and decomposed realisations of a vowel letter can be complicated.

Vowel components

Unlike the Modern Standard Arabic orthography, vowel sounds are always spelled out in Kashmiri.

Kashmiri vowels are written using the following combining marks and letters, either alone or in combination. The basicV section shows how the various vowel components are combined to represent particular vowel sounds.

َ,ُ,ِ,ٔ,ٕ,ٖ,ٗ,ٚ,ٟ

ا,و,ٲ,ۄ,ی,ے

In post-consonant position, two of the vowel sounds are written differently, depending on whether they occur medially or finally in a word.

The sound eː is written using ی when word-medial, and ے when word-final. The short vowel works in the same way, but with the addition of a diacritic. For example, compare:

بیمہٕ

باضے

The sound iː doesn't use a different letter for medial vs. final representations, but the word-final spelling drops the diacritic 0656. For example, compare the two spellings of the iː vowel in the following word.

اَنْگریٖزی

اَنْگر,یٖ,ز,ی
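
The position-dependent spellings just described can be captured in a small lookup table. The following Python is a minimal sketch: the code point sequences come from the description above and the vowel mapping section, while the dictionary layout and the names `POSITIONAL_SPELLINGS`, `spell_vowel` and `code_points` are illustrative assumptions.

```python
# Post-consonant spellings that differ between word-medial and
# word-final position. The layout is an assumption for illustration.
POSITIONAL_SPELLINGS = {
    "eː": {"medial": "\u06CC", "final": "\u06D2"},
    "e": {"medial": "\u06CC\u065A", "final": "\u06D2\u065A"},
    "iː": {"medial": "\u06CC\u0656", "final": "\u06CC"},
}


def spell_vowel(sound, position):
    """Return the character sequence for a vowel in 'medial' or 'final' position."""
    return POSITIONAL_SPELLINGS[sound][position]


def code_points(text):
    """Format a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


for sound in POSITIONAL_SPELLINGS:
    print(sound,
          "medial:", code_points(spell_vowel(sound, "medial")),
          "final:", code_points(spell_vowel(sound, "final")))
```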

Precomposed vs. decomposed characters

Kashmiri uses 0654 to represent the vowel ə. Since it represents a vowel, this should normally be typed and encoded separately from the base letter in the encoding. See ijam vs. tashkil.

However, NFC-normalisation will produce atomic characters for the combinations of hamza with a base letter shown in the box below. (NFD-normalisation produces a code point sequence.) In most orthographies, precomposed forms are preferred and more common, but since the hamza is a harakat in Kashmiri it is more logical to encode and type it separately from the base. On the other hand, since the Unicode Standard regards both alternatives as canonically equivalent in this case, it is less important whether they are encoded atomically or as a sequence. The following atomic characters are therefore included in the Kashmiri repertoire for representing vowel sounds.

إ,أ,ؤ,ۂ

There are, however, other visual combinations of base letter with hamza whose alternative encodings are not considered canonically equivalent, and these should be encoded in Kashmiri with a separate hamza (see deprecated_vowel). For example, حٔ is correct for Kashmiri, but ځ is not (it is used for the consonant d͡z in Pashto). For more information, see Arabic script homographs.

Nasalisation

نْ,ں

Vowels are commonly nasalised in Kashmiri. Word-internally, a nasalised vowel is followed by 0646 0652.

eg.

اَنْگریٖزی

This makes a nasalised vowel indistinguishable from a normal n.

eg.

پَنہٕ پونْپُر

At the end of a word, ں is used^§, like in Urdu, although this doesn't appear to be very common.

eg.

اٟں

Vowel length

Vowel length is indicated by use of different characters or character sequences. See fig_vowelgrid.

Standalone vowels

Word-initial standalone vowels have one of the following before the normal vowel indicators.

ا,آ,ع

Word-initial aː is written using آ. This is canonically equivalent with the decomposed sequence 0627 0653, but the atomic character is the one that is normally used.

Other word-initial standalone vowels always begin with ا or (for loan words) ع, either as a carrier for a diacritic, or before the other characters that represent the vowel.

eg.

اوٚنْگٕج

اِنسان

آتھوار

عَکٕس

عٲقٟل

Characters to avoid

ٳ,ێ,ۆ,ځ,ݬ,ࢡ

The Unicode character ٳ is explicitly deprecated by the Standard in favour of the decomposed sequence 0627 065F. There is no normalisation equivalence.

The list above contains several other single Unicode code points that look like combinations of Kashmiri letters and vowel diacritics, but they neither decompose nor recompose during normalisation. The Unicode Standard descriptions for these characters indicate that they are intended for use with specific languages, and Kashmiri is not listed amongst those. The hamza in these characters is an ijam, rather than a vowel diacritic, ie. it is an integral part of the letter. See Ijam, tashkil, hamza.

Nevertheless, they may appear in Kashmiri text – for example, ۆ is the default encoding for the vowel o in Wiktionary's list of words.

Content authors should use the decomposed forms, but because that can't be guaranteed, applications need to apply special rules to recognise both precomposed and decomposed forms as equivalent. See non_canonical for more details.
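
As a rough illustration of such special rules, the sketch below builds a comparison key for searching and matching. It is an assumption about how an application might do this, not an established algorithm: the mapping covers only the six characters listed in this section, paired with the decomposed sequences this document recommends, and the name `fold_for_matching` is hypothetical.

```python
import unicodedata

# Precomposed characters that normalisation leaves untouched, mapped to
# the decomposed sequences recommended for Kashmiri in this document.
KASHMIRI_FOLDS = {
    "\u0673": "\u0627\u065F",
    "\u06CE": "\u06CC\u065A",
    "\u06C6": "\u0648\u065A",
    "\u0681": "\u062D\u0654",
    "\u076C": "\u0631\u0654",
    "\u08A1": "\u0628\u0654",
}
_FOLD_TABLE = str.maketrans(KASHMIRI_FOLDS)


def fold_for_matching(text):
    """Return a key under which both encodings of a grapheme compare equal."""
    # Replace the non-canonical precomposed characters first, then let
    # NFD decompose the canonical ones and put the marks in canonical order.
    return unicodedata.normalize("NFD", text.translate(_FOLD_TABLE))


# Normalisation alone does not relate 06C6 to 0648 065A ...
print(unicodedata.normalize("NFD", "\u06C6") == "\u0648\u065A")
# ... but the folded keys match.
print(fold_for_matching("\u06C6") == fold_for_matching("\u0648\u065A"))
```

The key is intended for comparison only. Applying the fold to stored text would be wrong for languages such as Pashto or Kurdish, where the precomposed characters are the correct encoding.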

Vowel sounds to characters

This section maps Kashmiri vowel sounds to common graphemes in the Arabic orthography. The allocation of characters to vowel sounds is somewhat complicated. The complexity arises from the number of vowels in Kashmiri compared to the Arabic language, and the need to represent them all, but also because different sequences are needed for different positional forms. In addition, often more than one character sequence can achieve the same result.

The joining forms shown are illustrative; alternatives may occur (see joining_forms). Vowels in word-initial position or written alone are written with a preceding ا, or sometimes ع (we use the former for this table).

Sounds listed as 'infrequent' are allophones, or sounds used for foreign words, etc. Light coloured characters occur infrequently.

i

initial 0627 0650 eg. اِنسان

medial 0650 eg. صِفَر

final 0650 eg. زٲمِیہِ

iː

initial 0627 06CC 0656 eg. ایٖمان

medial 06CC 0656 eg. شيٖتھ

final 06CC eg. زٲمی

ɨ

initial 0625 Canonically equivalent with 0627 0655.

medial ٕ eg. گَگٕر

final ٕ eg. چھِرٕ

ɨː

initial 0627 065F The precomposed character ٳ is not canonically equivalent, and is strongly deprecated by the Unicode Standard.

medial ٟ eg. تٟر

final ٟ

u

initial 0627 064F eg. اُجرَتھ

medial ُ eg. سَرُف

final ُ eg. دُ ہٲٹھ

uː

initial 0627 0648 0657 eg. اوٗترٕ

medial 0648 0657 eg. نوٗل

final 0648 0657 eg. قوبوٗ

e

initial 0627 06CC 065A

medial 06CC 065A eg. بیٚنہِ

final 06D2 065A eg. شےٚ

eː

initial 0627 06CC

medial 06CC eg. شیر

final 06D2 eg. باضے

o

initial 0627 0648 065A eg. اوٚنْجوٗر

medial 0648 065A eg. توٚت

final 0648 065A

oː

initial 0627 0648 eg. اوش

medial 0648 eg. پوش

final 0648 eg. ہیٖرو

ə

initial 0623 eg. أنْز

medial ٔ eg. ژٔر
Several canonically equivalent, precomposed characters are available for use with hamza above. These include:
أ
ؤ
ۂ

final ٔ

əː

initial 0672 eg. ٲس

medial 0672 eg. کٲشُر

final 0672

ɔ

initial 0627 06C4

medial 06C4 eg. کۄہ

final 06C4 eg. سۄ

ɔː

initial 0627 06C4 0627

medial 06C4 0627 eg. سۄاد

final 06C4 0627

a

initial 0627 064E eg. اَرَب

medial َ eg. ہَرُد

final َ

aː

initial 0622 eg. آزاد.

medial 0627 eg. آتھوار

final 0627 eg. آرادَنا

◌̃

Nasalisation

medial 0646 0652 نْ eg. کَنْگُو

final 06BA ں eg. اٟں

Observation: Final ɨ seems to also be commonly spelled using ہٕ [U+06C1 ARABIC LETTER HEH GOAL + U+0655 ARABIC HAMZA BELOW], eg. طوطہٕ بہٕ

Vowel absence

Vowel absence principally occurs either when a consonant is a syllable coda, or when a consonant is part of a consonant cluster.

This is an alphabetic script, so there is no inherent vowel to suppress, and consonant clusters in Kashmiri are typically not marked in any way, nor are word-final consonants.

eg.

لٔڑکہٕ

اکشر

ہوٚست

Exceptions

There are, however, 2 exceptions: medial consonants, and syllable-final nasals.

Medials

The medial letters representing -r and -j in syllable onsets are marked with 0652 (jazm). In an orthography such as Urdu the jazm is attached to the consonant that is not followed by a vowel. In Kashmiri, however, the jazm goes above the medial consonant, not the initial consonant. These medial letters can therefore be associated with both a jazm and a vowel diacritic. See onsets for more details.

eg.

برَْگ

Syllable-final nasals

The jazm diacritic is also used with a syllable-final ن when it is immediately followed by a consonant sound. In this case, the jazm sits above the letter representing the nasal sound (unlike the medial case just described).

eg.

وانْدُر

کانْتُر

Choice of code point

Note that Kashmiri uses an inverted-v shape for the jazm, rather than the small round circle used for the sukun in Arabic language orthographies. However, the semantics are the same, and so is the code point.

کرْال

Note that this is NOT 065B. That character is used as a vowel diacritic, eg. to write the letter o in Fulfulde. The ARABIC SUKUN code point has the semantic meaning intended here, and is also used for this function in Standard Arabic, Persian, Urdu, etc. For Kashmiri you should use a font that produces the expected glyph shape. Using a different character that has the same shape but not the same semantics will cause problems for interoperable use of your text, and some fonts may fail to display it correctly (see confusables).

Consonant clusters

In general, consonant clusters in Kashmiri are just represented by a sequence of unmarked consonants.

eg.

اکشر

بابتھٕر

However, see also onsets.

Consonants

Consonant summary table

This table summarises only basic consonant to character assignments. Click on the phonetic transcriptions for more detail.

The consonants in the right column map mostly to the same phonemes, but are generally for loan words and to preserve the original spellings in the language of origin.

	Basic letters	Unassimilated spellings in loan words
Stops	پ,ب,ت,د,ٹ,ڈ,ک,گ	ط,ق,غ,ع
Aspirated stops	پھ,تھ,ٹھ,کھ	ف,خ
Affricates	ژ,چ,ج, ,ژھ,چھ
Fricatives	و,س,ز,ش,ھ,ہ	ف,ث,ص,ذ,ض,ظ,ح
Nasals	م,ن
Approximants trills & taps	و,ر,ل,ی,ؠ	ڑ

For additional details see consonant_mappings.

Basic consonants

The following constitute a basic set of consonants used for Kashmiri, that all represent standard phonemes for the Kashmiri language.

Click on each letter for more details and for examples of usage.

پ,ب,ت,د,ٹ,ڈ,ک,گ,ژ,چ,ج,س,ز,ش,ہ,م,ن,و,ر,ل,ی

Aspirated consonants

Six additional letters of the alphabet represent aspirated sounds. These are all written by combining a standard character with a following ھ.

پھ,تھ,ٹھ,کھ, ,ژھ,چھ

Additional consonants

The following set of consonants map mostly to the same phonemes, but are generally for loan words and preserve the original spellings in the language of origin.

ط,ق,خ,غ,ع,ف,ص,ث,ذ,ض,ظ,ح,ڑ

Palatalisation

Palatalisation is a frequent feature of Kashmiri words. It is represented using ؠ after the consonant to be palatalised. Initial and medial forms have a small circle beneath. This follows the same pattern as 06CC, which has 2 dots below initial and medial forms, but no dots below final and isolate forms.

ؠؠؠ ؠ — The 4 joining forms of KASHMIRI YEH.

The form with a circle below occurs as part of a syllable onset. The final/isolate form appears with a syllable coda.

eg.

بؠنتھٕر

وَہؠکھ

پونؠ

کٲڈؠ

It is common for single lexical items to be split in Kashmiri. When palatalisation is applied to the coda of a syllable within a lexical item, the swash form is always used. To produce this, it can be followed by a space or 200C.

eg.

ۂسؠ تِنؠ

ۂسؠ‌تِنؠ

کھٔرؠ پھٕ

کھٔرؠ‌پھٕ

Onsets

The letters representing medial -r and -j in syllable onsets are marked with 0652 (jazm). In an orthography such as Urdu the jazm is attached to the vowelless consonant before the medial. In Kashmiri, however, the jazm goes above the medial consonant, not the initial consonant. These medial letters can therefore be associated with both a jazm and a vowel diacritic, which is very unusual for Arabic script orthographies. This behaviour is explicitly described in Rainamkr§p11-12 and occurs in Wiktionary lemmas.

eg.

کرْٕم

ک,ر,ْ,ٕ,م

Medial r is written using ر.

eg.

برَْگ

کرُْہُن

Medial -j is written using ی.

eg.

بیْٲر

کیْوٚم

There are words where ر and ی follow a consonant but do not have a jazm above. Presumably this is because they form the onset of a new syllable after the coda of the previous syllable.

eg.

نَزرانٕہ

There is a question about the ordering of the jazm and the vowel diacritic (see jazm_placement).

Codas

With one exception, there are no special letters for syllable-final consonants. They are not marked using the sukun mark.

eg.

بادَم

The exception is the use of ں (rather than ن) to indicate word-final nasalisation. See nasalisation for more details.

eg.

اٟں

Gemination

The diacritic 0651 doubles the value of the consonant it is attached to.

Observation: It's not clear that this is used for Kashmiri.

Consonant sounds to characters

This section maps Kashmiri consonant sounds to common graphemes in the Arabic orthography.

The right-hand side shows various joining forms.

Sounds listed as 'infrequent' are allophones, or sounds used for foreign words, etc. Light coloured characters occur infrequently.

067E067E067E067E consonant پ

pʰ

067E 06BE067E 06BE067E 06BE067E 06BE aspirated consonant پھ

0641064106410641 consonant ف In words of Arabic and/or Persian origin.

0628062806280628 consonant ب

062A062A062A062A consonant ت

0637063706370637 consonant ط In words of Arabic and/or Persian origin.

tʰ

062A 06BE062A 06BE062A 06BE062A 06BE aspirated consonant تھ

t͡s

06980698 consonant ژ

t͡sʰ

0698 06BE0698 06BE0698 06BE0698 06BE aspirated consonant ژھ

t͡ʃ

0686068606860686 consonant چ

t͡ʃʰ

0686 06BE0686 06BE0686 06BE0686 06BE aspirated consonant چھ

062F062F consonant د

d͡ʒ

062C062C062C062C consonant ج

0679067906790679 consonant ٹ

ʈʰ

0679 06BE0679 06BE0679 06BE0679 06BE aspirated consonant ٹھ

06880688 consonant ڈ

06A906A906A906A9 consonant ک

0642064206420642 consonant ق In words of Arabic and/or Persian origin.

kʰ

06A9 06BE06A9 06BE06A9 06BE06A9 06BE aspirated consonant کھ

062E062E062E062E aspirated consonant خ In words of Arabic and/or Persian origin.

06AF06AF06AF06AF consonant گ

063A063A063A063A consonant غ In words of Arabic and/or Persian origin.

0639063906390639 consonant/vowel carrier ع Vowel carrier in words of Arabic and/or Persian origin.

0641064106410641 consonant ف In words of Arabic and/or Persian origin.

0633063306330633 consonant س

062B062B062B062B consonant ث In words of Arabic and/or Persian origin.

0635063506350635 consonant ص In words of Arabic and/or Persian origin.

06320632 consonant ز

06300630 consonant ذ In words of Arabic and/or Persian origin.

0636063606360636 consonant ض In words of Arabic and/or Persian origin.

0638063806380638 consonant ظ In words of Arabic and/or Persian origin.

0634063406340634 consonant ش

062E062E062E062E aspirated consonant خ Sometimes, in words of Arabic and/or Persian origin.

06C106C106C106C1 consonant ہ

062D062D062D062D consonant ح In words of Arabic and/or Persian origin.

0645064506450645 consonant م

0646064606460646 consonant/nasalisation marker ن

06480648 consonant/vowel و

06310631 consonant ر

06910691 consonant ڑ In words of foreign origin.

0644064406440644 consonant ل

06CC06CC06CC06CC consonant/vowel ی

0620062006200620 palatalisation marker ؠ Expresses palatalisation of the preceding consonant.

Encoding choices

In the Kashmiri orthography different sequences of Unicode characters may produce the same visual result. Here we look at those, and raise questions where clarifications are needed.

Canonically equivalent alternatives

Normalisation converts the following precomposed to decomposed alternatives, and vice versa.

Precomposed	Decomposed
0625	0627 0655
0623	0627 0654
0622	0627 0653
0624	0648 0654
06C2	06C1 0654

The single code point per vowel-sign is the form preferred by the Unicode Standard and the form in common use for Kashmiri. The parts are separated in Unicode Normalisation Form D (NFD), and recomposed in Unicode Normalisation Form C (NFC), so both approaches are canonically equivalent.
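
The equivalences in the table can be checked with Python's standard `unicodedata` module. This is a small verification sketch. The list `CANONICAL_PAIRS` simply restates the table, and the helper name `hexes` is an illustrative assumption.

```python
import unicodedata

# (precomposed, decomposed) pairs from the table above.
CANONICAL_PAIRS = [
    ("\u0625", "\u0627\u0655"),
    ("\u0623", "\u0627\u0654"),
    ("\u0622", "\u0627\u0653"),
    ("\u0624", "\u0648\u0654"),
    ("\u06C2", "\u06C1\u0654"),
]


def hexes(text):
    """Format a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


for precomposed, decomposed in CANONICAL_PAIRS:
    nfd = unicodedata.normalize("NFD", precomposed)
    nfc = unicodedata.normalize("NFC", decomposed)
    # NFD splits the atomic character; NFC recomposes the sequence.
    print(hexes(precomposed), "NFD:", hexes(nfd), "| NFC of sequence:", hexes(nfc))
```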

Alternatives that are not canonically equivalent

The following alternatives are not converted to each other during normalisation. The diacritics in the precomposed characters are ijam, whereas those in the decomposed sequences are tashkil.

Precomposed	Decomposed	Notes
0673 (deprecated!)	0627 065F	The Unicode Standard indicates that the first precomposed item in the list above is strongly deprecated. There are no such indications, however, for the others.
06CE	06CC 065A	Neither alternative on this line was supported by older versions of the Noto Nastaliq Urdu font, which caused a major problem for writing the sound e in Kashmiri. However, versions 3.002 and above of that font provide support, as does the Awami Nastaliq font.
06C6	0648 065A	The diacritics in atomic characters without decompositions, like those in this table, are generally intended to represent ijam rather than vowel sounds. (See Ijam, tashkil, hamza.) In a search on a sample that included various Wikipedia pages and 369 Wiktionary lemmas the decomposed sequences on the right side of this table typically scored most hits, and there were zero to 3 of each of the precomposed variants. The exception was this vowel o: there were 30 instances of the precomposed character and only 2 of the decomposed. The Unicode Standard says that this precomposed character is for use with Uighur, Kurdish, Kazakh, Azerbaijani, and Bosnian, but doesn't indicate that it should be used for Kashmiri. The precomposed characters listed are associated with particular languages by the annotations in the Unicode Standard. (See list of homographs in Ijam, tashkil, hamza.) The decomposed forms are therefore recommended for use with Kashmiri, with the possible exception of 06C6 (ARABIC LETTER OE). However, both versions have been seen in digital text in Kashmiri, so applications will need to recognise both precomposed and decomposed alternatives as the same grapheme. Input mechanisms, on the other hand, can produce one rather than the other, and that choice should be made with advisement.
0681	062D 0654
076C	0631 0654
08A1	0628 0654

Confusables & spelling errors

The following lists some common errors found in Kashmiri text due to the similarity of Unicode characters, or perhaps sometimes due to problems inputting the correct character. Wikipedia is a rich source of such.

Incorrect	Correct	Notes
064A	06CC	The Arabic YEH doesn't drop the dots below in isolate and final positions.
0626	06CC 0654	This precomposed form becomes 064A 0654 when the text is decomposed during normalisation, ie. the base character is replaced by U+064A instead of U+06CC.
0643	06A9	Common fonts tend not to show the difference between these two characters, but the ability to search and compare text is impaired unless the application is aware of and takes counter-measures against this substitution.
ۍ	ؠ	The letter ۍ is used in Pashto to represent the diphthong əi, but it sometimes appears in Kashmiri texts instead of a word-final KASHMIRI YEH. This usage is incorrect and should be avoided.
ی࣑ U+06CC LETTER FARSI YEH + U+08D1 LARGE CIRCLE BELOW	ؠ	Sometimes ی࣑ U+06CC LETTER FARSI YEH + U+08D1 LARGE CIRCLE BELOW is used for isolated and final forms of KASHMIRI YEH in naskh style text. Version 15.1 of the Unicode Standard says that it is the normal form for the naskh style of Kashmiri, but this usage is incorrect and should be avoided. The text in the Unicode Standard was updated for version 16.0, and font vendors will be contacted to modify their glyphs.
066E 06EA	0620	This occurs when the KASHMIRI YEH is right-joining or dual-joining, in which case it has the ring below. This usage is also incorrect and should be avoided, but arose from a time when the FARSI YEH character was not available and people were trying to show palatalisation. The incorrect solution doesn't work well with common fonts, as well as corrupting the semantics of the text stream.
065B	0652	The function of this glyph is that of the sukun, so the correct semantic character should be used. Although 065B looks like the Kashmiri jazm, it was introduced to Unicode to serve as a vowel sign for African languages^§. In order to produce the correct glyph using a font such as Noto it is essential to indicate that the language of the text is Kashmiri. (In HTML this can be done using the attribute `lang="ks"`.) Otherwise, the shape is likely to be a small circle.
ۅ U+06C5 LETTER KIRGHIZ OE	06C4	The incorrect letter is intended for use with Kirghiz. Some fonts add a loop to the tail, similar to that of the recommended character, but other fonts render it with a bar through the tail.
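
The table above lends itself to a simple checker. The Python below is a minimal sketch under stated assumptions: the mapping restates the table rows, the function name `find_confusables` is hypothetical, and the result is only a list of suspects, since characters such as 064A are perfectly correct in embedded Arabic-language text.

```python
# Substitutions from the table above: incorrect sequence -> correct one.
CONFUSABLES = {
    "\u06CC\u08D1": "\u0620",
    "\u066E\u06EA": "\u0620",
    "\u0626": "\u06CC\u0654",
    "\u064A": "\u06CC",
    "\u0643": "\u06A9",
    "\u06CD": "\u0620",
    "\u065B": "\u0652",
    "\u06C5": "\u06C4",
}

# Try longer patterns first so that two-character sequences win.
_PATTERNS = sorted(CONFUSABLES, key=len, reverse=True)


def find_confusables(text):
    """Return (index, incorrect, suggested) for each suspect sequence."""
    findings = []
    i = 0
    while i < len(text):
        for bad in _PATTERNS:
            if text.startswith(bad, i):
                findings.append((i, bad, CONFUSABLES[bad]))
                i += len(bad)
                break
        else:
            # No pattern starts here; move on to the next character.
            i += 1
    return findings


# ARABIC KAF and ARABIC YEH used where KEHEH and FARSI YEH are expected.
for index, bad, good in find_confusables("\u0643\u0627\u064A"):
    print(index, [f"{ord(c):04X}" for c in bad], "->", [f"{ord(c):04X}" for c in good])
```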

Code point sequences

When typing and in storage, combining marks always follow the base character they are associated with.

Special rendering rules

In principle, if more than one combining mark appears on the same side of the base character, Unicode expects applications to render the marks such that those marks closer to the base character in memory appear closer to the base character when rendered. (This is called the inside-out rule.) However, due to the reordering applied by the Unicode normalisation forms, some of the Arabic script diacritics end up in an inappropriate order on display.

For example, if a user types the sequence of characters in fig_amtra, the order of the marks will be changed such that applying the inside-out rule would render the shadda above the vowel (which is incorrect). (In fact, most application renderers have special rules to correct this.)

The Unicode Standard formally addresses this anomaly in Unicode Technical Report #53, Unicode Arabic Mark Rendering (AMTRA), with a set of rules for how to render sequences of Arabic characters. The rules generally move shadda, hamza, round dots, etc. so that they are close to the base character.

User input	Post-normalisation output
بُّ ب ّ ُ	بُ͏ّ ب ُ ّ

A sequence of shadda and damma as the user is likely to input it (left), and how it could potentially be arranged after normalisation (right).

In the rare exceptions where the AMTRA rules should not change the rendering, this can be achieved by placing an invisible 034F character between the combining marks. (In fact, this is what was done to simulate the incorrect appearance in fig_amtra, because otherwise the browser rendering engine would have automatically produced the same output as in the first column. Clicking on the example will show the sequence used.)
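
The reordering that motivates AMTRA can be observed directly with Python's `unicodedata` module. The sketch below covers only the normalisation step, not rendering. The variable and function names are illustrative assumptions.

```python
import unicodedata

BEH, SHADDA, DAMMA, CGJ = "\u0628", "\u0651", "\u064F", "\u034F"


def describe(text):
    """Return (code point, canonical combining class) for each character."""
    return [(f"{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]


# Order in which a user is likely to type: base, shadda, then damma.
typed = BEH + SHADDA + DAMMA
normalised = unicodedata.normalize("NFC", typed)

print(describe(typed))
# Damma has a lower combining class than shadda, so normalisation
# moves it in front of the shadda.
print(describe(normalised))

# 034F has combining class 0, so normalisation does not reorder
# marks across it.
with_cgj = BEH + SHADDA + CGJ + DAMMA
print(unicodedata.normalize("NFC", with_cgj) == with_cgj)
```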

Final e

In the online term list at Wiktionary there are items containing a final e where the order of code points is 065A 06D2, and others where the order is reversed to give 06D2 065A.

Click on the following examples to see their composition:

شےٚ

It's unclear whether this is simply driven by user preference, or by orthographic rules, or the words are wrongly encoded. If the inverted-v occurs after the consonant in the word for 'six', it would look like this:

شٚے

Jazm placement for medials

There is a significant difference in the way jazm is used in Kashmiri, compared to other Arabic orthographies. It appears above and is stored after the second character in a consonant cluster when that is a medial -r or -j (see novowel and onsets). This behaviour is explicitly described in Rainamkr§p11-12 and occurs in Wiktionary lemmas.

In these cases, a base letter may support both the jazm and a vowel diacritic.

eg.

واریُْل

It is not clear whether the jazm should precede the vowel diacritic in the code point sequence, or vice versa. The font in use for this page supports either. For example, compare the following alternatives for krɨm 'sea turtle'.

کرْٕم

ک,ر,◌ْ,◌ٕ,م

کرْٕم

ک,ر,◌ٕ,◌ْ,م
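
One data point that may help here is how Unicode normalisation treats the two orders. The sketch below uses Python's `unicodedata` module to show that, for this word, the two sequences are canonically equivalent and that normalisation puts the jazm first. It does not settle which order authors should type; the variable names are illustrative assumptions.

```python
import unicodedata

JAZM = "\u0652"         # ARABIC SUKUN, shown as an inverted v in Kashmiri
HAMZA_BELOW = "\u0655"  # vowel diacritic used for ɨ

jazm_first = "\u06A9\u0631" + JAZM + HAMZA_BELOW + "\u0645"
vowel_first = "\u06A9\u0631" + HAMZA_BELOW + JAZM + "\u0645"

# The two marks have different, non-zero combining classes, so
# normalisation sorts them into a single canonical order.
print(unicodedata.combining(JAZM), unicodedata.combining(HAMZA_BELOW))

same_after_nfc = (unicodedata.normalize("NFC", jazm_first)
                  == unicodedata.normalize("NFC", vowel_first))
print(same_after_nfc)
print(unicodedata.normalize("NFC", vowel_first) == jazm_first)
```

The normalised order depends on the vowel diacritic involved. A mark with a lower combining class than the jazm, such as 064F (class 31), sorts before it rather than after.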

Text direction

Kashmiri text is written horizontally and right-to-left in the main but, as in most right-to-left scripts, numbers and embedded text in other scripts are written left-to-right (producing 'bidirectional' text).

ديٖنا ناتھ نٲدِم (١٩۱٦–۱۹٨٨) (کٲشُر : /diːnaːnaːth nəːdim/ ) اوس کٲشِر زَبانُک مَشہوٗر شٲعِر۔ — Kashmiri words are read right-to-left, starting from the right of this line, but numbers and Latin text (highlighted) are read left-to-right.

The Unicode Bidirectional Algorithm automatically takes care of the ordering for all the text in fig_bidi, as long as the 'base direction' is set to RTL. In HTML this can be set using the dir attribute, or in plain text using formatting controls.

If the base direction is not set appropriately, the directional runs will be ordered incorrectly as shown in fig_bidi_no_base_direction, making it very difficult to get the meaning.
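
The directional behaviour described above follows from the Bidi_Class property of each character, which can be inspected with Python's `unicodedata` module. This is a minimal sketch; the function name `bidi_classes` and the choice of sample characters are assumptions made for illustration.

```python
import unicodedata


def bidi_classes(text):
    """Return (code point, Bidi_Class) for each character in text."""
    return [(f"{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]


# An Arabic-script letter, a combining vowel mark, an Arabic-Indic digit,
# an extended Arabic-Indic digit, a Latin letter and a space.
sample = "\u06A9\u064F\u0661\u06F1a "
for code_point, bidi_class in bidi_classes(sample):
    print(code_point, bidi_class)
```

The letter is a strong right-to-left character (AL) and the Latin letter is strong left-to-right (L), while the two kinds of digit have different numeric classes (AN and EN).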

For other aspects of dealing with right-to-left writing systems see the following sections:

directioncontrols
expressions
breaking_latin
mirrored_characters
page

For more information about how directionality and base direction work, see Unicode Bidirectional Algorithm basics. For information about plain text formatting characters see How to use Unicode controls for bidi text. And for working with markup in HTML, see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.

For authoring HTML pages, one of the most important things to remember is to use <html dir="rtl" … > at the top of the page. Also, use markup to manage direction, and do not use CSS styling.

Managing text direction

Unicode provides a set of 10 formatting characters that can be used to control the direction of text when displayed. These characters have no visual form in the rendered text, although text editing applications may have a way to show their location.

202B (RLE), 202A (LRE), and 202C (PDF) are in widespread use to set the base direction of a range of characters. RLE/LRE comes at the start, and PDF at the end of a range of characters for which the base direction is to be set.

In Unicode 6.3, the Unicode Standard added a set of characters which do the same thing but also isolate the content from surrounding characters, in order to avoid spillover effects. They are 2067 (RLI), 2066 (LRI), and 2069 (PDI). The Unicode Standard recommends that these be used instead.

There is also 2068 (FSI), used initially to set the base direction according to the first recognised strongly-directional character.

061C (ALM) is used to produce correct sequencing of numeric data. See also expressions for details.

200F (RLM) and 200E (LRM) are invisible characters with strong directional properties that are also sometimes used to produce the correct ordering of text.

For more information about how to use these formatting characters see How to use Unicode controls for bidi text. Note, however, that when writing HTML you should generally use markup rather than these control codes. For information about that, see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.
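For plain text, where markup is not available, the isolate controls described above can be applied with a small helper. This is a minimal Python sketch; the function name isolate and its direction parameter are hypothetical, not part of any library.

```python
import unicodedata

LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def isolate(text, direction="auto"):
    """Wrap text in an isolate so it cannot affect surrounding text.

    direction is "ltr", "rtl", or "auto" (use the first strong character).
    """
    opener = {"ltr": LRI, "rtl": RLI, "auto": FSI}[direction]
    return opener + text + PDI

# Wrap a Kashmiri word (kaf, alef with wavy hamza above, sheen, damma, reh).
wrapped = isolate("\u06A9\u0672\u0634\u064F\u0631", "rtl")
print([unicodedata.name(ch) for ch in (wrapped[0], wrapped[-1])])
```

The opening and closing controls must always be paired, which is why the helper adds both in one step rather than leaving the caller to close the range.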

Expressions & sequences

A sequence of numbers used to express a range of values generally runs right to left in the Arabic language (and languages using the Thaana or Syriac scripts), whereas for Persian language text (and in Hebrew, N’Ko or Adlam scripts) it runs left to right.

For more information, see the section Expressions & sequences in the Arabic script notes.

Glyph shaping & positioning

You can experiment with examples using the Kashmiri workbench.

Kashmiri written in the Arabic script is cursive, and there are combining characters and special joining behaviours.

The orthography has no case distinction, and no special transforms are needed to convert between characters.

See the Arabic overview for details.

Cursive script

Arabic script is always cursive, ie. letters in a word are joined up. Fonts need to produce the appropriate glyph for a letter, according to its visual context, but the code point used doesn't change. This results in four different shapes for most letters, although some letters never join to the left and so have only two. Ligated forms also join with characters alongside them.

سٲری اِنسان چھِ آزاد زامٕتؠ۔ وؠقار تہٕ حۆقوٗق چھِ ہِوی۔ تِمَن چھُ سوچ سَمَج عَطا کَرنہٕ آمُت تہٕ تِمَن پَزِ بٲے بَرادٔری ہٕنٛدِس جَذباتَس تَحَت اکھ أکِس اکار بَکار یُن ۔ — Highlighted characters in this text do not join to the left.

In the lists below 30 Kashmiri letters are dual-joining, whereas 17 join only to the right. However, the high frequency of the latter and short word lengths produce text that doesn't usually have long joined sequences (see fig_unjoined).

Cursive joining forms

Most dual-joining characters add or become a swash when they don't join to the left. A number of characters, however, undergo additional shape changes across the joining forms. fig_joining_forms and fig_right_joining_forms show the basic shapes in Kashmiri and what their joining forms look like.

isolated	right-joined	dual-join	left-joined	Kashmiri letters
ب	ـب	ـبـ	بـ	ب,ت,ث,پ,ٹ
ن	ـن	ـنـ	نـ	ن
ق	ـق	ـقـ	قـ	ق
ف	ـف	ـفـ	فـ	ف,ڤ
س	ـس	ـسـ	سـ	س,ش
ص	ـص	ـصـ	صـ	ص,ض
ط	ـط	ـطـ	طـ	ط,ظ
ک	ـک	ـکـ	کـ	ک,گ
ل	ـل	ـلـ	لـ	ل
ہ	ـہ	ـہـ	ہـ	ہ,ۂ
ھ	ـھ	ـھـ	ھـ	ھ
م	ـم	ـمـ	مـ	م
ع	ـع	ـعـ	عـ	ع,غ
ح	ـح	ـحـ	حـ	ح,خ,ج,چ
ی	ـی	ـیـ	یـ	ی
ؠ	ـؠ	ـؠـ	ؠـ	ؠ

Joining forms for shapes that join on both sides. Those showing notable shape change are highlighted.

isolated	right-joined	Kashmiri letters
ا	ـا	ا,أ,إ,آ,ٲ
ر	ـر	ر,ز,ژ,ڑ
د	ـد	د,ذ,ڈ
و	ـو	و,ؤ,ۄ,ۆ,ؤ
ے	ـے	ے

Joining forms for shapes that join on the right only.
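The way a renderer chooses among these forms can be sketched in a few lines of Python. This is a simplified illustration, not how a font or shaping engine actually does it. It takes the right-joining set from the second table above, treats every other letter as dual-joining, skips combining marks, and ignores non-joining characters and joining controls. The names RIGHT_JOINING and joining_forms are hypothetical.

```python
import unicodedata

# Right-joining letters, taken from the second table above (as escapes).
RIGHT_JOINING = set(
    "\u0627\u0623\u0625\u0622\u0672"  # alef forms
    "\u0631\u0632\u0698\u0691"        # reh forms
    "\u062F\u0630\u0688"              # dal forms
    "\u0648\u0624\u06C4\u06C6"        # waw forms
    "\u06D2"                          # yeh barree
)

def joining_forms(word):
    """Return (letter, form) pairs for the base letters of a single word."""
    # Combining marks are transparent to joining, so leave them out.
    letters = [ch for ch in word if unicodedata.category(ch) != "Mn"]
    forms = []
    for i, ch in enumerate(letters):
        # Joins to the previous letter only if that letter is dual-joining.
        joins_prev = i > 0 and letters[i - 1] not in RIGHT_JOINING
        # Joins to the next letter only if this letter is dual-joining.
        joins_next = i < len(letters) - 1 and ch not in RIGHT_JOINING
        if joins_prev and joins_next:
            form = "dual-join"
        elif joins_prev:
            form = "right-joined"
        elif joins_next:
            form = "left-joined"
        else:
            form = "isolated"
        forms.append((ch, form))
    return forms

# kaf, alef with wavy hamza above, sheen, damma, reh
for letter, form in joining_forms("\u06A9\u0672\u0634\u064F\u0631"):
    print(f"U+{ord(letter):04X}", form)
```

For this word the forms alternate between left-joined and right-joined, because each dual-joining letter is followed by a right-joining one. This is the pattern that keeps joined sequences short in running text.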

Managing glyph shaping

200D (ZWJ) and 200C (ZWNJ) are used to control the joining behaviour of cursive glyphs. They are particularly useful in educational contexts, but also have real world applications.

ZWJ permits a letter to form a cursive connection without a visible neighbour. For example, the marker for hijri dates in Arabic language text is an initial form of heh, even though it doesn't join to the left, ie. ه‍. For this, use ZWJ immediately after the heh, eg. الاثنين 10 رجب 1415 ه‍..

ZWNJ prevents two adjacent letters forming a cursive connection with each other when rendered. For example, it is used in Persian for plural suffixes, some proper names, and Ottoman Turkish vowels. Ignoring or removing the ZWNJ will result in text with a different meaning or meaningless text, eg, تن‌ها is the plural of body, whereas تنها is the adjective alone.² The only difference is the presence or absence of ZWNJ after noon.
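Because both controls are invisible, it is tempting to strip them during text clean-up. The short Python sketch below shows why that loses information, using the Persian pair just mentioned. The helper strip_join_controls is a hypothetical name for such a clean-up step.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

# teh, noon, ZWNJ, heh, alef: the plural of "body"
plural_of_body = "\u062A\u0646" + ZWNJ + "\u0647\u0627"
# teh, noon, heh, alef: the adjective "alone"
alone = "\u062A\u0646\u0647\u0627"

# Hijri date marker: heh followed by ZWJ, so that heh takes its initial form.
hijri_marker = "\u0647" + ZWJ

def strip_join_controls(text):
    """Remove ZWJ and ZWNJ; shown only to demonstrate that this is lossy."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")

print(plural_of_body == alone)                       # distinct words
print(strip_join_controls(plural_of_body) == alone)  # distinction lost
```

The same applies to searching and comparison: code that ignores these characters will treat the two words as identical.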

034F (CGJ, COMBINING GRAPHEME JOINER) is used in Arabic-script text to produce special ordering of diacritics. The name is a misnomer, as it is generally used to break the normal sequence of diacritics.

phrase	، ؛ :
sentence	۔ ؟ !

	start	end
standard	(	)

	start	end
primary	”	“

Line & paragraph layout

Line breaking & hyphenation

Lines are normally broken at word boundaries. They are not broken at the small gaps that appear where a character doesn't join on the left.

As in most writing systems, certain characters are expected not to start or end a line. For example, periods and commas shouldn't start a line, and opening parentheses shouldn't end a line.

Line-edge rules

As in almost all writing systems, certain punctuation characters should not appear at the end or the start of a line. The Unicode line-break properties help applications decide whether a character should appear at the start or end of a line.

The following list gives examples of typical behaviours for characters affected by these rules. Context may affect the behaviour of some of these and other characters.

« “ ‘ ( should not be the last character on a line
» ” ’ ) ۔ ، ؛ ؟ ! should not begin a new line
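The two rules listed above can be expressed as a simple check on the characters either side of a candidate break point. The Python sketch below covers only the characters in that list. It is not an implementation of the Unicode line-breaking algorithm, which real applications should rely on instead, and the names used are hypothetical.

```python
# Characters that should not be the last character on a line.
NO_LINE_END = set("\u00AB\u201C\u2018(")
# Characters that should not begin a new line.
NO_LINE_START = set("\u00BB\u201D\u2019)\u06D4\u060C\u061B\u061F!")

def break_allowed(before, after):
    """Return True if a line may break between the two given characters."""
    return before not in NO_LINE_END and after not in NO_LINE_START

print(break_allowed("\u0631", "\u06D4"))  # reh, then Arabic full stop
print(break_allowed("(", "\u06A9"))       # opening parenthesis, then kaf
print(break_allowed(" ", "\u06A9"))       # space, then kaf
```

A full implementation would also need to check that the position is a word boundary in the first place, since lines are not broken inside words.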

Breaking between Latin words

When a line break occurs in the middle of an embedded left-to-right sequence, the items in that sequence need to be rearranged visually so that it isn't necessary to read lines upwards.

latin-line-breaks shows how two Latin words are apparently reordered in the flow of text to accommodate this rule. Of course, the rearrangement is only that of the visual glyphs: nothing affects the order of the characters in memory.

Text with no line break in the Latin text (upper image) and with a line break in the Latin text (lower image). The lower of these two images shows the result of decreasing the line width, so that text wraps between a sequence of Latin words.

Baselines, line height, etc.

The nastaliq writing style uses arrangements of joined glyphs that cascade downwards from right to left, and resemble a strongly sloping baseline.

مستحق • شخص • کیفیت — Sloping baselines in Urdu nastaliq text.

An obvious consequence is that the glyphs of inline text in Kashmiri travel much further from the baseline than is usual in Latin script text. Allowances for this need to be made in the line height settings on a page, and it can also be problematic when combining Latin and Arabic text on the same line using different fonts for each.

If the Arabic font supports the needed Latin letters, the font design will already take into account the relative sizes of the letters, and their placement relative to the baselines of each script. If different fonts are used, though, it's important to match the baselines and harmonise the font sizes used.

Notes, footnotes, etc

See inlinenotes for purely inline annotations, such as ruby or warichu. This section is about annotation systems that separate the reference marks and the content of the notes.

	labial	dental	alveolar	post- alveolar	retroflex	palatal	velar	glottal
stops	p b	t d			ʈ ɖ		k ɡ
aspirated	pʰ	tʰ			ʈʰ		kʰ
affricates		t͡s		t͡ʃ d͡ʒ
aspirated		t͡sʰ		t͡ʃʰ
fricatives			s z	ʃ				h
nasals	m		n
approximants	w		l			j
trills/flaps			r
