Updated 4 August, 2024
This page brings together basic information about the Arabic script and its use for the Kashmiri language, using the latest orthographic changes. It aims to provide a brief, descriptive summary of the modern, printed orthography and typographic features, and to advise how to write Kashmiri using Unicode.
Richard Ishida, Kashmiri (Nastaliq Arabic) Orthography Notes, 04-Aug-2024, https://r12a.github.io/scripts/arab/ks
سٲری اِنسان چھِ آزاد زامٕتؠ۔ وؠقار تہٕ حۆقوٗق چھِ ہِوی۔ تِمَن چھُ سوچ سَمَج عَطا کَرنہٕ آمُت تہٕ تِمَن پَزِ بٲے بَرادٔری ہٕنٛدِس جَذباتَس تَحَت اکھ أکِس اکار بَکار یُن ۔
Source: Omniglot, article 1 of the UDHR.
Origins of the Arabic script, 6thC – today.
Phoenician
└ Aramaic
└ Nabataean
└ Arabic
For information about the script in general, see the Arabic overview. The Perso-Arabic script is recognised as the official script of the Kashmiri language by the Jammu and Kashmir government and the Jammu and Kashmir Academy of Art, Culture and Languages.wkl
کٲشُر
Kashmiri is written in the Devanagari script by Hindus. Muslims use the Arabic script.
The Kashmiri Arabic orthography is derived from the Arabic/Persian abjads, where in normal use the script represents only consonant and long vowel sounds. However, the script has been adapted in this orthography in order to cope with the many more vowel sounds in Kashmiri, and this is one of the Arabic orthographies that regularly indicates all vowel sounds, making it an alphabet.wkl,#Writing_system See the table to the right for a brief overview of features for the modern Kashmiri orthography using the Arabic script.
Kashmiri text runs right-to-left in horizontal lines, but numbers and embedded Latin text are read left-to-right.
Kashmiri is principally written using the nasta'liq style of Arabic writing. Glyphs are more drawn out, and the baseline tends to be sloping from word to word. The script is cursive, and some basic letter shapes change radically, depending on what they join to. The nastaliq styling creates diagonal baselines between joined characters, and tends to reduce clarity about where one letter ends and the next starts. (The dots and other diacritics associated with letters become particularly useful for the reader.)
A mandatory ligature is used for combinations of lam + alif.
There is no case distinction. Words are separated by spaces.
Modern Kashmiri represents native sounds using 19 basic consonant letters and 6 aspirated digraphs, but can use 13 more consonants to spell words loaned from Persian, Arabic and Urdu.
A special letter is used to indicate palatalisation, which is common in Kashmiri. Similarly to the other yeh used in Kashmiri, it has a circle below when used in syllable onsets, and a swash with no circle after a syllable coda.
Unlike other Arabic orthographies, 0652 (jazm), normally used to show vowel absence, is placed over the second consonant in an onset cluster (such as tr). That letter may therefore carry both the jazm diacritic and a vowel diacritic, which is quite unusual.
❯ basicV
Kashmiri is an alphabet where 16 vowel sounds (far more than in Arabic or Persian) are written using a mixture of 10 combining marks and 10 letters. Unlike Arabic, Persian, and Urdu, all vowel diacritics are always visible in Kashmiri texts. Representation of 3 vowel sounds is complicated by the use of different code points in medial vs. final position.
The distinction between ijam vs. tashkil has a bearing on several Kashmiri graphemes, and the choice between precomposed and decomposed realisations of a vowel letter can be complicated (see encoding).
Word-initial standalone vowels are preceded by or attached to either 0627 or 0639.
The jazm is used over a word-medial 0646 to indicate nasalisation of a preceding vowel sound. Apart from this use and that for medial consonants, it is not typically used to indicate vowel absence.
Kashmiri uses native digits, and Arabic code points for several of the more common punctuation marks.
Because the Arabic script is 'cursive' (ie. joined-up) writing, letters tend to have different shapes depending on whether they join with adjacent letters or not (see cursive). In addition, vowels can be represented using different characters, depending on where in a word they appear.
In scripts such as Arabic, several characters have no left-joining form. In what follows we'll use the characters ي and د to illustrate shapes. The former can join on both sides, but the latter can only join on the right.
Left-joining glyphs are commonly called initial; dual-joining are called medial; and right-joining are called final. Glyphs that don't join on either side are called isolated. However, these glyph shapes can be found in various places within a single word.
Word-initial characters usually have initial glyph shapes (eg. 064A ). However, characters that only join to the right will use an isolated glyph shape (eg. 062F ). Furthermore, words beginning with a vowel are always preceded by a vowel carrier, which is normally ا (eg. 0627 06CC or 0627 064E ).
Word-medial characters will typically join on both sides (eg. 064A ) but those that only join to the right will use a final glyph (eg. 062F ). However, if either of those is preceded by another character that only joins to the right, the glyph shapes rendered will be initial (eg. 064A ) and isolated (eg. 062F ), respectively.
Word-final characters will typically use a final glyph shape (eg. 064A and 062F ). However, if the previous character joins only to the right, they will use isolated glyph shapes (eg. 064A and 062F ).
In all this contextual glyph shaping the basic shapes used for a character can vary significantly in a script like Arabic. This also includes some characters that only have ijam dots in certain contexts.
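The positional rules described above can be sketched in Python. This is only an illustration under stated assumptions: `RIGHT_JOINING` and `joining_forms` are hypothetical names, the set lists only some of the letters that join to the right only, every other letter is assumed to be dual-joining, and combining marks are treated as transparent. Actual shaping is done by the font and the rendering engine.

```python
import unicodedata

# Assumption: a partial set of letters that join only to the right
# (alef forms, dal, ddal, thal, reh, rreh, zain, jeh, waw forms, yeh barree).
RIGHT_JOINING = set(
    "\u0627\u0622\u0623\u0625\u0672\u062F\u0688\u0630"
    "\u0631\u0691\u0632\u0698\u0648\u0624\u06C4\u06D2"
)

def joining_forms(word):
    """Return the glyph form name for each letter of a single word."""
    # Combining marks do not affect joining, so drop them first.
    letters = [c for c in word if unicodedata.combining(c) == 0]
    forms = []
    for i, c in enumerate(letters):
        # The previous letter can reach this one only if it is dual-joining.
        joins_prev = i > 0 and letters[i - 1] not in RIGHT_JOINING
        # This letter can reach the next one only if it is dual-joining.
        joins_next = i + 1 < len(letters) and c not in RIGHT_JOINING
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_next:
            forms.append("initial")
        elif joins_prev:
            forms.append("final")
        else:
            forms.append("isolated")
    return forms

# dal, yeh, dal: the yeh follows a right-joining letter
print(joining_forms("\u062F\u064A\u062F"))
```

For the sequence dal, yeh, dal this reports isolated, initial and final, which matches the case where a dual-joining letter follows a letter that only joins to the right.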
These are sounds for the Kashmiri language.
Plain vowels
Complex sounds
Kashmiri is not a tonal language.
The most common syllable structure for Kashmiri is:
C (M) V (C)
C
V
M
Vowels can be short or long, and may be nasalised. Syllable-initial and syllable-final consonants can be palatalised.
A spoken syllable can also begin with a vowel, but word-initial vowels are preceded in writing by 0627 (or sometimes 0639). It is relatively common for a syllable to end with a diphthong, terminated by -j.
Medial consonants are -r and, less commonly, -j.
The following table summarises the main vowel to character assignments.
Each table cell shows word-initial, word-medial, and word-final forms from right to left. The glyphs shown are illustrative; alternative shapes may occur (see joining_forms). Click/tap on items to see a list of the components for that cell.
Short | Forms | Long | Forms |
---|---|---|---|
i | اِ ◌ِ ◌ِ | iː | ایٖ یٖ ی |
ɨ | إ ◌ٕ ◌ٕ | ɨː | اٟ ◌ٟ ◌ٟ |
u | اُ ◌ُ ◌ُ | uː | اوٗ وٗ وٗ |
e | ایٚ یٚ ےٚ | eː | ای ی ے |
o | اوٚ وٚ وٚ | oː | او و و |
ə | أ ◌ٔ ◌ٔ | əː | ٲ ٲ ٲ |
ɔ | اۄ ۄ ۄ | ɔː | اۄا ۄا ۄا |
a | اَ ◌َ ◌َ | aː | آ ا ا |
Observation: Several items in the Kashmiri dictionary end with a vowel followed by h. Is this the standard way to write word-final short vowels - and some long ones?
For a question about the ordering of characters in final e, see final_e. For questions about whether to use precomposed or decomposed letters, see encoding.
Kashmiri uses the following for basic mappings between vowel sounds and code points.
Six of the above vowel sounds are represented by a combination of letter and diacritic or by a combination of letters.
The sound eː uses different shapes for the medial and final representations, which are encoded in Unicode using separate characters: ی for word-medial vowels, and ے for word-final vowels. The short vowel works in the same way.
The sound iː doesn't introduce a new code point for medial vs. final representations, but the word-final spelling drops the diacritic 0656.
Kashmiri uses 0654 to represent the vowel ə. Since it represents a vowel, this should normally be typed and encoded separately from the base letter in the encoding. See ijam vs. tashkil.
However, NFC-normalisation will produce atomic characters for the combinations of hamza with a base letter shown in the box below. (NFD-normalisation produces a code point sequence.) In most orthographies, precomposed forms are preferred and more common, but since the hamza is a harakat in Kashmiri it is more logical to encode and type it separately from the base. On the other hand, since the Unicode Standard regards both alternatives as canonically equivalent in this case, it is less important whether they are encoded atomically or as a sequence. The following atomic characters are therefore included in the Kashmiri repertoire for representing vowel sounds.
There are, however, other combinations of base letter with hamza (such as حٔ and ځ) that are not considered canonically equivalent, and these should be encoded with a separate hamza (see deprecated_vowel).
Vowels are commonly nasalised in Kashmiri. Word-internally, a nasalised vowel is followed by 0646 0652.
اَنْگریٖزی
This makes a nasalised vowel indistinguishable from a normal n.
پَنہٕ پونْپُر
At the end of a word, ں is used§, like in Urdu, although this doesn't appear to be very common.
اٟں
Vowel length is indicated by use of different characters or character sequences. See fig_vowelgrid.
Writing word-initial standalone vowels involves placing one of the following before the normal vowel indicators.
Word-initial aː is written using آ. This is canonically equivalent with the decomposed sequence 0627 0653, but the atomic character is the one that is normally used.
Other word-initial standalone vowels always begin with ا or (for loan words) ع, either as a carrier for a diacritic, or before the other characters that represent the vowel.
Examples:
اوٚنْگٕج
اِنسان
آتھوار
عَکٕس
عٲقٟل
The Unicode character ٳ is explicitly deprecated by the Standard in favour of the decomposed sequence 0627 065F. There is no normalisation equivalence.
The list above contains several other single Unicode code points that look like combinations of Kashmiri letters and vowel diacritics, but they neither decompose nor recompose during normalisation. The Unicode Standard descriptions for these characters indicate that they are intended for use with specific languages, and Kashmiri is not listed amongst those. The hamza in these characters is an ijam, rather than a vowel diacritic, ie. it is an integral part of the letter. See Ijam, tashkil, hamza.
Nevertheless, they may appear in Kashmiri text – for example, ۆ is the default encoding for the vowel o in Wiktionary's list of words.
Content authors should use the decomposed forms, but because that can't be guaranteed, applications need to apply special rules to recognise both precomposed and decomposed forms as equivalent. See non_canonical for more details.
This is an alphabetic script, so there is no inherent vowel to suppress, and consonant clusters in Kashmiri are typically not marked in any way, nor are word-final consonants.
There are, however, 2 exceptions: medial consonants, and syllable-final nasals.
Medials. The letters representing -r and -j in syllable onsets are marked with 0652 (jazm). In an orthography such as Urdu, the jazm is attached to the consonant that is not followed by a vowel. In Kashmiri, however, the jazm goes above the medial consonant, not the initial consonant. These medial letters can therefore be associated with both a jazm and a vowel diacritic. This behaviour is explicitly described in Rainamkr,p11-12 and occurs in Wiktionary lemmas.
کرْٕم
See onsets for more details.
Syllable-final nasals. The jazm diacritic is also used with a syllable-final ن when it is immediately followed by a consonant sound. In this case, the jazm sits above the letter representing the nasal sound (unlike the medial case just described).
وانْدُر
کانْتُر
Choice of code point. Note that Kashmiri uses an inverted-v shape for the jazm, rather than the small round circle used for the sukun in Arabic language orthographies. However, the semantics are the same, and so is the code point.
کرْال
Note that this is NOT 065B. That character is used as a vowel diacritic, eg. to write the letter o in Fulfulde. The ARABIC SUKUN code point has the semantic meaning intended here, and is also used for this function in Standard Arabic, Persian, Urdu, etc. For Kashmiri you should use a font that produces the expected glyph shape. Using a different character that has the same shape but not the same semantics will cause problems for interoperable use of your text, and some fonts may fail to display it correctly (see confusables).
This section maps Kashmiri vowel sounds to common graphemes in the Arabic orthography. The allocation of characters to vowel sounds is somewhat complicated. The complexity arises from the number of vowels in Kashmiri compared to the Arabic language, and the need to represent them all, but also because different sequences are needed for different positional forms. In addition, often more than one character sequence can achieve the same result.
Vowels in word-initial position or written alone are written with a preceding ا, or sometimes ع (we use the former for this table).
The columns run right to left and indicate typical word-initial, word-medial, and word-final usage. The joining forms shown are illustrative; alternatives may occur (see joining_forms).
0650
زٲمِیہِ
0650
صِفَر
0627 0650
اِنسان
06CC
زٲمی
06CC 0656
شيٖتھ
0627 06CC 0656
ایٖمان
ٕ
چھِرٕ
ٕ
گَگٕر
0625 Canonically equivalent with 0627 0655.
ٟ
ٟ
تٟر
0627 065F The precomposed character ٳ is not canonically equivalent, and is strongly deprecated by the Unicode Standard.
ُ
دُ ہٲٹھ
ُ
سَرُف
0627 064F
اُجرَتھ
0648 0657
قوبوٗ
0648 0657
نوٗل
0627 0648 0657
اوٗترٕ
06D2 065A
شےٚ
06CC 065A
بیٚنہِ
0627 06CC 065A
06D2
باضے
06CC
شیر
0627 06CC
0648 065A
0648 065A
توٚت
0627 0648 065A
اوٚنْجوٗر
0648
ہیٖرو
0648
پوش
0627 0648
اوش
ٔ
ٔ Several canonically equivalent, precomposed characters are available for use with hamza above. These include:
أ
ؤ
ۂ
ژٔر
0623
أنْز
0672
0672
کٲشُر
0672
ٲس
06C4
سۄ
06C4
کۄہ
0627 06C4
06C4 0627
06C4 0627
سۄاد
0627 06C4 0627
َ
َ
ہَرُد
0627 064E
اَرَب
0627
آرادَنا
0627 when not word-initial.
آتھوار
0622 Canonically equivalent with 0627 0653
آزاد
06BA
اٟں
0646 0652
کَنْگُو
Observation: Final ɨ seems to also be commonly spelled using ہٕ [U+06C1 ARABIC LETTER HEH GOAL + U+0655 ARABIC HAMZA BELOW], eg. طوطہٕ بہٕ
The following table summarises the main consonant to character assignments.
The consonants in the right column map mostly to the same phonemes, but are generally for loan words and to preserve the original spellings in the language of origin.
Basic letters | Unassimilated spellings in loan words | |
---|---|---|
Stops | ||
Affricates | ||
Fricatives | ||
Nasals | ||
Approximants trills & taps |
For additional details see consonant_mappings.
The following constitute a basic set of consonants used for Kashmiri, that cover all standard phonemes for the Kashmiri language.
Whereas the table just above takes you from sounds to letters, the following simply lists the basic consonant letters (however, since the orthography is highly phonetic there is little difference in ordering).
Six additional letters of the alphabet represent aspirated sounds. These are all written by combining a standard character with a following ھ.
The following set of consonants map mostly to the same phonemes, but are generally for loan words and preserve the original spellings in the language of origin.
Palatalisation is a frequent feature of Kashmiri words. It is represented using ؠ after the consonant to be palatalised. Medial forms have a small circle beneath them. These changes follow the same pattern as 06CC, which has 2 dots below initial and medial forms, but no dots below final and isolate forms.
The form with a circle below occurs as part of a syllable onset. The final/isolate form appears with a syllable coda.
بؠنتھٕر
وَہؠکھ
پونؠ
کٲڈؠ
It is common for single lexical items to be split in Kashmiri. When palatalisation is applied to the coda of a syllable within a lexical item, the swash form is always used. To produce this, it can be followed by a space or 200C.
ۂسؠ تِنؠ
ۂسؠتِنؠ
کھٔرؠ پھٕ
کھٔرؠپھٕ
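As a minimal sketch of the second option, the parts of a lexical item can be joined with U+200C ZERO WIDTH NON-JOINER, so that ؠ keeps its swash form without a visible space. The function name `join_with_zwnj` is hypothetical, and the example spells the first letter with a separate hamza (06C1 0654), which is an assumption about the encoding of this word.

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

def join_with_zwnj(parts):
    """Join word parts so that each part ends with a non-joined glyph."""
    return ZWNJ.join(parts)

# heh goal + hamza above, seen, Kashmiri yeh
first = "\u06C1\u0654\u0633\u0620"
# teh, kasra, noon, Kashmiri yeh
second = "\u062A\u0650\u0646\u0620"

word = join_with_zwnj([first, second])
print([f"{ord(c):04X}" for c in word])
```

The printed code point list makes the invisible 200C easy to spot between the two parts.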
When ر is used as a medial consonant, Kashmiri puts the jazm over the -r rather than over the initial consonant. This leads to situations where the base character carries both a sukun and a vowel diacritic, which is very unusual for Arabic script orthographies.
برَْگ
کرُْہُن
The same applies for the medial -j, written using ی.
بیْٲر
کیْوٚم
There are words where ر and ی follow a consonant but do not have a jazm above. Presumably this is because they form the onset of a new syllable after the coda of the previous syllable. For example:
نَزرانٕہ
With one exception, there are no special letters for syllable-final consonants. They are not marked using the sukun mark.
بادَم
The exception is the use of ں (rather than ن) to indicate word-final nasalisation. See nasalisation for more details.
اٟں
Kashmiri uses 0652 (jazm) to indicate some consonant clusters, as described in novowel. In general, however, clusters are just represented by a sequence of unmarked consonants.
بابتھٕر
The diacritic 0651 doubles the value of the consonant it is attached to.
Observation: It's not clear that this is used for Kashmiri.
This section maps Kashmiri consonant sounds to common graphemes in the Arabic orthography.
Sounds listed as 'infrequent' are allophones, or sounds used for foreign words, etc.
067E
پَرُن
067E067E067Eـ
067E 06BE
پھَش
067E 06BE067E 06BE067E 06BEـ
0628
بُڈؠ بَب
062806280628ـ
062A
سَتہٕ تُت
062A062A062Aـ
0637 in loan words.
طوطہٕ
06370637ـ
062A 06BE
تھُج
062A 06BE062A 06BE062A 06BEـ
062F
دۄد
062F062Fـ
0679
ؤٹِل
067906790679ـ
0679 06BE
کَٹھ
0679 06BE0679 06BE0679 06BEـ
0688
ژھانْڈُن
06880688ـ
06A9
کُکِل
06A906A906A9ـ
0642 in loan words.
خلق
064206420642ـ
06A9 06BE
کھۄر
06A9 06BE06A9 06BE06A9 06BEـ
062E in loan words.
خٔرِنؠ
062E062E062Eـ
06AF
زَنْگ
06AF06AF06AFـ
063A in loan words.
مَغرِب
063A063A063Aـ
0698
ژٔر
06980698ـ
0698 06BE
ژھاوٕج
0698 06BE0698 06BE0698 06BEـ
0686
چھِرٕ
068606860686ـ
0686 06BE
چھَلُن
0686 06BE0686 06BE0686 06BEـ
062C
تھُج
062C062C062Cـ
0641 sometimes in loan words.
صِفَر
064106410641ـ
0648
وَدُن
06480648ـ
0633
ساس
063306330633ـ
0635 in loan words.
صِفَر
063506350635ـ
062B in loan words.
062B062B062Bـ
0632
زٲمؠ زٕ
06320632ـ
0630 in loan words.
ذاتھ
06300630ـ
0636 in loan words.
ضٔمیٖر
063606360636ـ
0638 in loan words.
ظٲلِم
063806380638ـ
0634
شیر
063406340634ـ
06C1
ۂہَر
06C106C106C1ـ
062D in loan words.
حاجَتھ
062D062D062Dـ
0645
مول
064506450645ـ
0646
نَنُن
064606460646ـ
0648
گَروول
06480648ـ
0631
ریش
06310631ـ
0691 in loan words.
لٔڑکی
06910691ـ
0644
لالٕپھوٚل
064406440644ـ
06CC
زٲمِیہِ
06CC06CC06CCـ
0620 to express palatalisation.
ۂسؠ تِنؠ
062006200620ـ
Arabic-script text makes use of a relatively large set of invisible formatting characters, especially in plain text, many of which are used to manage text direction. Descriptions of these characters can be found in the following sections:
0614 is a sign placed over the name or nom-de-plume of a poet, or in some writings used to mark all proper names.
The mark is really associated with a word, rather than a character, but the placement is left to the user. The mark is often added somewhere in the middle of a name, but commonly appears towards the end. This depends to some extent on the letter shapes present and the calligraphic style in use, eg.
عطاشادؔ ataː ʃaː Ata Shad (author's name)
In the Kashmiri orthography different sequences of Unicode characters may produce the same visual result. Here we look at those, and raise questions where clarifications are needed.
Normalisation converts the following precomposed to decomposed alternatives, and vice versa.
Precomposed | Decomposed |
---|---|
إ | 0627 0655 |
0623 | 0627 0654 |
آ | 0627 0653 |
0624 | 0648 0654 |
06C2 | 06C1 0654 |
The single code point per vowel-sign is the form preferred by the Unicode Standard and the form in common use for Kashmiri. The parts are separated in Unicode Normalisation Form D (NFD), and recomposed in Unicode Normalisation Form C (NFC), so both approaches are canonically equivalent.
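The equivalences in the table can be checked with Python's standard `unicodedata` module. This is a small sketch rather than a recommended workflow; `PAIRS` and `canonically_equal` are names introduced here for illustration.

```python
import unicodedata

# Precomposed characters from the table above and their decomposed sequences.
PAIRS = {
    "\u0625": "\u0627\u0655",
    "\u0623": "\u0627\u0654",
    "\u0622": "\u0627\u0653",
    "\u0624": "\u0648\u0654",
    "\u06C2": "\u06C1\u0654",
}

def canonically_equal(a, b):
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

for precomposed, decomposed in PAIRS.items():
    # NFD splits the atomic character; NFC puts it back together.
    assert unicodedata.normalize("NFD", precomposed) == decomposed
    assert unicodedata.normalize("NFC", decomposed) == precomposed
    assert canonically_equal(precomposed, decomposed)
```

Comparing text only after normalising both sides to the same form means it does not matter which of the two encodings an author typed.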
The following alternatives are not converted to each other during normalisation. The diacritics in the precomposed characters are ijam, whereas those in the decomposed sequences are tashkil.
Precomposed | Decomposed | Notes |
---|---|---|
0673 (deprecated!) | 0627 065F | The Unicode Standard indicates that the first precomposed item in the list above is strongly deprecated. There are no such indications, however, for the others. |
06CE | 06CC 065A | Neither alternative on this line was supported by older versions of the Noto Nastaliq Urdu font, causing a major problem for writing the sound e in Kashmiri. But it is supported by versions 3.002 and above of that font, and by the Awami Nastaliq font. |
06C6 | 0648 065A | The diacritics in atomic characters without decompositions, like those in this table, are generally intended to represent ijam rather than vowel sounds. (See Ijam, tashkil, hamza.) In a search on a sample that included various Wikipedia pages and 369 Wiktionary lemmas, the decomposed sequences on the right side of this table typically scored most hits, and there were zero to 3 of each of the precomposed variants. The exception is this vowel o: there were 30 instances of the precomposed character and only 2 of the decomposed sequence. The Unicode Standard says that this precomposed character is for use with Uighur, Kurdish, Kazakh, Azerbaijani, and Bosnian, but doesn't indicate that it should be used for Kashmiri. The precomposed characters listed are associated with particular languages by the annotations in the Unicode Standard. (See list of homographs in Ijam, tashkil, hamza.) The decomposed forms are therefore recommended for use with Kashmiri, with the possible exception of OE. However, both versions have been seen in digital text in Kashmiri, so applications will need to recognise both precomposed and decomposed alternatives as the same grapheme. Input mechanisms, on the other hand, can produce one rather than the other, and that choice should be made with care. |
0681 | 062D 0654 | |
076C | 0631 0654 | |
08A1 | 0628 0654 |
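Because normalisation does not unify the pairs in this table, an application that searches or compares Kashmiri text needs its own folding step. The following is a minimal sketch of one possible approach, not an established algorithm: `NON_CANONICAL` and `match_key` are hypothetical names, and the key is meant for comparison only, not for rewriting stored text.

```python
import unicodedata

# Precomposed -> decomposed pairs from the table above.
# These are NOT canonically equivalent, so NFC/NFD leave them alone.
NON_CANONICAL = {
    "\u0673": "\u0627\u065F",
    "\u06CE": "\u06CC\u065A",
    "\u06C6": "\u0648\u065A",
    "\u0681": "\u062D\u0654",
    "\u076C": "\u0631\u0654",
    "\u08A1": "\u0628\u0654",
}
_TABLE = str.maketrans(NON_CANONICAL)

def match_key(text):
    """Fold text to a comparison key.

    The non-canonical pairs are expanded first, then NFD is applied so that
    any neighbouring combining marks end up in the same canonical order.
    """
    return unicodedata.normalize("NFD", text.translate(_TABLE))

# The two spellings of the vowel o compare as equal through the key.
print(match_key("\u06C6") == match_key("\u0648\u065A"))
```

Expanding before normalising matters: if a further diacritic follows the letter, NFD then orders all the marks consistently for both spellings.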
The following lists some common errors found in Kashmiri text due to the similarity of Unicode characters, or perhaps sometimes due to problems inputting the correct character. Wikipedia is a rich source of such.
Incorrect | Correct | Notes |
---|---|---|
064A | 06CC | The Arabic YEH doesn't drop the dots below in isolate and final positions. |
0626 | 06CC 0654 | This precomposed form becomes 064A 0654 when the text is decomposed during normalisation, ie. the base character is replaced by U+064A instead of U+06CC. |
0643 | 06A9 | Common fonts tend not to show the difference between these two characters, but the ability to search and compare text is impaired unless the application is aware of and takes counter-measures against this substitution. |
ۍ | ؠ | The letter ۍ is used in Pashto to represent the diphthong əi, but it sometimes appears in Kashmiri texts instead of a word-final KASHMIRI YEH. This usage is incorrect and should be avoided. |
ی࣑ U+06CC LETTER FARSI YEH + U+08D1 LARGE CIRCLE BELOW | ؠ | Sometimes ی࣑ U+06CC LETTER FARSI YEH + U+08D1 LARGE CIRCLE BELOW is used for isolated and final forms of KASHMIRI YEH in naskh style text. Version 15.1 of the Unicode Standard says that it is the normal form for the naskh style of Kashmiri, but this usage is incorrect and should be avoided. The text in the Unicode Standard was updated for version 16.0, and font vendors will be contacted to modify their glyphs. |
066E 06EA | 0620 | This occurs when the KASHMIRI YEH is right-joining or dual-joining, in which case it has the ring below. This usage is also incorrect and should be avoided, but arose from a time when the FARSI YEH character was not available and people were trying to show palatalisation. The incorrect solution doesn't work well with common fonts, as well as corrupting the semantics of the text stream. |
065B | 0652 | The function of this glyph is that of the sukun, so the correct semantic character should be used. Although 065B looks like the Kashmiri jazm, it was introduced to Unicode to serve as a vowel sign for African languages§. In order to produce the correct glyph using a font such as Noto it is essential to indicate that the language of the text is Kashmiri. (In HTML this can be done using the attribute lang="ks" .) Otherwise, the shape is likely to be a small circle. |
ۅU+06C5 LETTER KIRGHIZ OE | 06C4 | The incorrect letter is intended for use with Kirghiz. Some fonts add a loop to the tail, similar to that of the recommended character, but other fonts render it with a bar through the tail. |
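The substitutions in the table lend themselves to a simple lint pass. The sketch below is an assumption about how such a check might look, with the hypothetical names `CONFUSABLES` and `find_confusables`. It only reports candidates: some of these characters are correct in other languages, so a match in mixed-language text needs human review.

```python
# (incorrect sequence, suggested replacement), taken from the table above.
# Two-character sequences come first so that they are matched as a unit.
CONFUSABLES = [
    ("\u06CC\u08D1", "\u0620"),
    ("\u066E\u06EA", "\u0620"),
    ("\u064A", "\u06CC"),
    ("\u0626", "\u06CC\u0654"),
    ("\u0643", "\u06A9"),
    ("\u06CD", "\u0620"),
    ("\u065B", "\u0652"),
    ("\u06C5", "\u06C4"),
]

def hexes(s):
    """Format a string as space-separated 4-digit hex code points."""
    return " ".join(f"{ord(c):04X}" for c in s)

def find_confusables(text):
    """Return (index, found, suggested) for each suspicious sequence."""
    findings = []
    i = 0
    while i < len(text):
        for wrong, right in CONFUSABLES:
            if text.startswith(wrong, i):
                findings.append((i, hexes(wrong), hexes(right)))
                i += len(wrong)
                break
        else:
            i += 1  # nothing matched at this position
    return findings

print(find_confusables("\u0643\u064A"))
```

Each finding gives the offset in the string, the code points found, and the code points the table recommends instead.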
In the online term list at Wiktionary there are items containing a final e where the order of code points is 065A 06D2, and others where the order is reversed to give 06D2 065A.
Click on the following examples to see their composition:
شےٚ
It's unclear whether this is simply driven by user preference, or by orthographic rules, or the words are wrongly encoded. If the inverted-v occurs after the consonant in the word for 'six', it would look like this:
شٚے
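To investigate the question, a word list could be surveyed for the two orderings. The sketch below is hypothetical (the function name `final_e_order` and the labels it returns are made up for this illustration) and only looks at the last two code points of a word.

```python
YEH_BARREE = "\u06D2"     # ARABIC LETTER YEH BARREE
SMALL_V_ABOVE = "\u065A"  # ARABIC VOWEL SIGN SMALL V ABOVE

def final_e_order(word):
    """Classify how a word-final short e is encoded, or return None."""
    if word.endswith(YEH_BARREE + SMALL_V_ABOVE):
        return "mark-after-yeh"    # 06D2 065A
    if word.endswith(SMALL_V_ABOVE + YEH_BARREE):
        return "mark-before-yeh"   # 065A 06D2
    return None

# sheen + yeh barree + small v above, then the alternative order
print(final_e_order("\u0634\u06D2\u065A"))
print(final_e_order("\u0634\u065A\u06D2"))
```

Counting the two labels over a term list would show how common each ordering is, without deciding which one is correct.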
There is a significant difference in the way jazm is used, compared to other Arabic orthographies, in that it commonly appears above and is stored after the second character in the consonant cluster.
The jazm diacritic is only used in consonant clusters over the letters r and j, when they appear immediately after a consonant (ie. in 'medial' position), and n (including nasalisation) when it occurs immediately before another consonant§. When used with r and j, the base character may be associated with both a vowel diacritic and the jazm. Examples: واریُْل وَنْدٕ
Other consonant clusters occur without the use of the jazm, eg.
ہوٚست
This behaviour is explicitly described in Rainamkr,p11-12 and occurs in Wiktionary lemmas. For more details, see novowel and onsets.
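The rule about which letters carry the jazm can be turned into a rough consistency check. This sketch assumes the rule exactly as stated above (jazm only over r, j and n); `EXPECTED_BASES` and `unexpected_jazm` are hypothetical names, and a hit is a prompt for review rather than proof of an error.

```python
import unicodedata

JAZM = "\u0652"  # ARABIC SUKUN
EXPECTED_BASES = {"\u0631", "\u06CC", "\u0646"}  # reh, Farsi yeh, noon

def unexpected_jazm(text):
    """Return (index, base code point) for each jazm on an unexpected base."""
    findings = []
    base = None
    for i, c in enumerate(text):
        if unicodedata.combining(c) == 0:
            base = c  # a new base character starts here
        elif c == JAZM and base not in EXPECTED_BASES:
            findings.append((i, f"{ord(base):04X}" if base else None))
    return findings

# kaf, reh + jazm + hamza below, meem: the jazm sits on the medial r
print(unexpected_jazm("\u06A9\u0631\u0652\u0655\u0645"))
```

Tracking the most recent base character, rather than the immediately preceding code point, lets the check cope with a base that carries both a vowel diacritic and the jazm.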
When typing and in storage, combining marks always follow the base character they are associated with.
In principle, if more than one combining mark appears on the same side of the base character, Unicode expects applications to render the marks such that those marks closer to the base character in memory appear closer to the base character when rendered. (This is called the inside-out rule.) However, due to the reordering applied by the Unicode normalisation forms, some of the Arabic script diacritics end up in an inappropriate order on display.
For example, if a user types the sequence of characters in fig_amtra, the order of the marks will be changed such that applying the inside-out rule would render the shadda above the vowel (which is incorrect). (In fact, most application renderers have special rules to correct this.)
The Unicode Standard formally addresses this anomaly in the Technical Annex Unicode® Arabic Mark Rendering (AMTRA), with a set of rules for how to render sequences of Arabic characters. The rules generally move shadda, hamza, round dots, etc. so that they are close to the base character.
User input | Post-normalisation output |
---|---|
بُّ ب ّ ُ |
بُ͏ّ ب ُ ّ |
In the rare exceptions where the AMTRA rules should not change the rendering, this can be achieved by placing an invisible 034F character between the combining marks. (In fact, this is what was done to simulate the incorrect appearance in fig_amtra, because otherwise the browser rendering engine would have automatically produced the same output as in the first column. Clicking on the example will show the sequence used.)
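The reordering itself is easy to observe with Python's standard `unicodedata` module. The sketch below only demonstrates normalisation, not rendering. The variable names are illustrative.

```python
import unicodedata

BEH, SHADDA, DAMMA, CGJ = "\u0628", "\u0651", "\u064F", "\u034F"

def code_points(s):
    """Format a string as a list of 4-digit hex code points."""
    return [f"{ord(c):04X}" for c in s]

# The user types beh, shadda, damma.
typed = BEH + SHADDA + DAMMA
normalised = unicodedata.normalize("NFC", typed)
print(code_points(typed), "->", code_points(normalised))

# The canonical combining classes explain the swap:
# the damma has a lower class than the shadda, so it sorts first.
print(unicodedata.combining(DAMMA), unicodedata.combining(SHADDA))

# U+034F has combining class 0, so marks are not reordered across it.
blocked = BEH + SHADDA + CGJ + DAMMA
print(unicodedata.normalize("NFC", blocked) == blocked)
```

After normalisation the damma is stored before the shadda, which is why renderers need rules such as those in AMTRA to display the shadda nearer the base.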
The Unicode Arabic block has 2 sets of digits, and Kashmiri uses the extended set. The Unicode bidi_class property for these native digits is European_Number, which makes them behave and look differently from the digits used for Arabic language text. For more information, see expressions.
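The difference in bidi class between the two digit sets can be confirmed with Python's `unicodedata` module, which reports the short property values `EN` (European_Number) and `AN` (Arabic_Number). The variable names below are illustrative.

```python
import unicodedata

# U+06F0..U+06F9: EXTENDED ARABIC-INDIC DIGITs, the set used for Kashmiri
extended = [chr(cp) for cp in range(0x06F0, 0x06FA)]
# U+0660..U+0669: ARABIC-INDIC DIGITs, used for Arabic language text
arabic_indic = [chr(cp) for cp in range(0x0660, 0x066A)]

print({unicodedata.bidirectional(d) for d in extended})
print({unicodedata.bidirectional(d) for d in arabic_indic})

# Both sets carry the numeric values 0-9.
print([unicodedata.digit(d) for d in extended])
```

Because the extended digits are `EN`, the bidi algorithm treats them like ASCII digits when resolving the order of numeric expressions.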
In addition, there are differences in glyph shapes. fig_number_shapes shows the different glyph shapes used in Arabic, Persian, Urdu and Sindhi. Kashmiri digits share the same shapes as those for Urdu.u,370
Arabic | |
---|---|
Persian | |
Urdu | |
Sindhi |
Kashmiri uses the Arabic percent sign.
I suspect that Kashmiri may use ٫ [U+066B ARABIC DECIMAL SEPARATOR] and ٬ [U+066C ARABIC THOUSANDS SEPARATOR], but need to confirm.
Kashmiri text is written horizontally and right-to-left in the main but, as in most right-to-left scripts, numbers and embedded text in other scripts are written left-to-right (producing 'bidirectional' text).
The Unicode Bidirectional Algorithm automatically takes care of the ordering for all the text in fig_bidi, as long as the 'base direction' is set to RTL. In HTML this can be set using the dir attribute, or in plain text using formatting controls.
If the base direction is not set appropriately, the directional runs will be ordered incorrectly as shown in fig_bidi_no_base_direction, making it very difficult to get the meaning.
For other aspects of dealing with right-to-left writing systems see the following sections:
For more information about how directionality and base direction work, see Unicode Bidirectional Algorithm basics. For information about plain text formatting characters see How to use Unicode controls for bidi text. And for working with markup in HTML, see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.
For authoring HTML pages, one of the most important things to remember is to use <html dir="rtl" … > at the top of the page. Also, use markup to manage direction, and do not use CSS styling.
Unicode provides a set of 10 formatting characters that can be used to control the direction of text when displayed. These characters have no visual form in the rendered text, however text editing applications may have a way to show their location.
202B (RLE), 202A (LRE), and 202C (PDF) are in widespread use to set the base direction of a range of characters. RLE/LRE comes at the start, and PDF at the end of a range of characters for which the base direction is to be set.
In Unicode 6.3, the Unicode Standard added a set of characters which do the same thing but also isolate the content from surrounding characters, in order to avoid spillover effects. They are 2067 (RLI), 2066 (LRI), and 2069 (PDI). The Unicode Standard recommends that these be used instead.
There is also 2068 (FSI), used initially to set the base direction according to the first recognised strongly-directional character.
061C (ALM) is used to produce correct sequencing of numeric data. Click on the character name, and see also expressions for details.
200F (RLM) and 200E (LRM) are invisible characters with strong directional properties that are also sometimes used to produce the correct ordering of text.
For more information about how to use these formatting characters see How to use Unicode controls for bidi text. Note, however, that when writing HTML you should generally use markup rather than these control codes. For information about that, see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.
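For plain text, where markup is not available, wrapping a run in isolate controls might look like the following sketch. The helper name `isolate` and its `direction` parameter are hypothetical. The code points are U+2067 RIGHT-TO-LEFT ISOLATE, U+2066 LEFT-TO-RIGHT ISOLATE, U+2068 FIRST STRONG ISOLATE and U+2069 POP DIRECTIONAL ISOLATE.

```python
RLI, LRI, FSI, PDI = "\u2067", "\u2066", "\u2068", "\u2069"

def isolate(text, direction="auto"):
    """Wrap text in a directional isolate for use in plain text.

    direction: "rtl", "ltr", or "auto" (use the first strong character).
    """
    opener = {"rtl": RLI, "ltr": LRI, "auto": FSI}[direction]
    return opener + text + PDI

# Embed a Kashmiri word in a left-to-right sentence.
kashmiri = "\u06A9\u0672\u0634\u064F\u0631"
sentence = "Language name: " + isolate(kashmiri, "rtl") + " (ks)"
print([f"{ord(c):04X}" for c in isolate(kashmiri, "rtl")])
```

Pairing every opener with a PDI keeps the isolated run from affecting the ordering of the surrounding text.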
A sequence of numbers used to express a range of values generally runs right to left in the Arabic language (and languages using the Thaana or Syriac scripts), whereas for Persian language text (and in Hebrew, N’Ko or Adlam scripts) it runs left to right.
For more information, see the section Expressions & sequences in the Arabic script notes.
You can experiment with examples using the Kashmiri character app.
Kashmiri written in the Arabic script is cursive, and there are combining characters and special joining behaviours.
The orthography has no case distinction, and no special transforms are needed to convert between characters.
See the Arabic overview for details.
Arabic script is always cursive, i.e. letters in a word are joined up. Fonts need to produce the appropriate glyph for a letter, according to its visual context, but the code point used doesn't change. This results in four different shapes for most letters, although some letters never join to the left and so have only two. Ligated forms also join with characters alongside them.
In the lists below 30 Kashmiri letters are dual-joining, whereas 17 join only to the right. However, the high frequency of the latter and short word lengths produce text that doesn't usually have long joined sequences (see fig_unjoined).
Most dual-joining characters add or become a swash when they don't join to the left. A number of characters, however, undergo additional shape changes across the joining forms. fig_joining_forms and fig_right_joining_forms show the basic shapes in Kashmiri and what their joining forms look like.
isolated | right-joined | dual-join | left-joined | Kashmiri letters |
---|---|---|---|---|
ب | ـب | ـبـ | بـ | |
ن | ـن | ـنـ | نـ | |
ق | ـق | ـقـ | قـ | |
ف | ـف | ـفـ | فـ | |
س | ـس | ـسـ | سـ | |
ص | ـص | ـصـ | صـ | |
ط | ـط | ـطـ | طـ | |
ک | ـک | ـکـ | کـ | |
ل | ـل | ـلـ | لـ | |
ہ | ـہ | ـہـ | ہـ | |
ھ | ـھ | ـھـ | ھـ | |
م | ـم | ـمـ | مـ | |
ع | ـع | ـعـ | عـ | |
ح | ـح | ـحـ | حـ | |
ی | ـی | ـیـ | یـ | |
ؠ | ـؠ | ـؠـ | ؠـ |
isolated | right-joined | Kashmiri letters |
---|---|---|
ا | ـا | |
ر | ـر | |
د | ـد | |
و | ـو | |
ے | ـے |
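To see why right-joining letters keep joined sequences short, the following Python sketch splits a word into visually joined runs. It is only an approximation under stated assumptions: `RIGHT_JOINING` is a small hand-picked set (the five basic shapes in the table above plus U+0672), not the full Unicode joining-type data, and the function name `joined_runs` is hypothetical. The input is assumed to be a single word made of Arabic letters and combining marks.

```python
import unicodedata

# Assumption: an illustrative subset of right-joining letters only.
# alef, alef with wavy hamza above, reh, dal, waw, yeh barree
RIGHT_JOINING = set("\u0627\u0672\u0631\u062F\u0648\u06D2")


def joined_runs(word: str) -> list[str]:
    """Split a word into runs of letters that join to each other.

    A run ends after a right-joining letter, because the next letter
    cannot connect to it. Combining marks stay with their base letter.
    """
    runs: list[str] = []
    current = ""
    break_before_next = False
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        if break_before_next and not is_mark:
            runs.append(current)
            current = ""
            break_before_next = False
        current += ch
        if ch in RIGHT_JOINING:
            break_before_next = True
    if current:
        runs.append(current)
    return runs
```

Applied to کٲشُر (U+06A9 U+0672 U+0634 U+064F U+0631), this yields two runs, because the letter at U+0672 does not join to the left.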
200D (ZWJ) and 200C (ZWNJ) are used to control the joining behaviour of cursive glyphs. They are particularly useful in educational contexts, but also have real world applications.
ZWJ permits a letter to form a cursive connection without a visible neighbour. For example, the marker for hijri dates in Arabic language text is an initial form of heh, even though it doesn't join to the left, ie. ه. For this, use ZWJ immediately after the heh, eg. الاثنين 10 رجب 1415 ه..
ZWNJ prevents two adjacent letters forming a cursive connection with each other when rendered. For example, it is used in Persian for plural suffixes, some proper names, and Ottoman Turkish vowels. Ignoring or removing the ZWNJ will result in text with a different meaning or meaningless text, eg, تنها is the plural of body, whereas تنها is the adjective alone.2 The only difference is the presence or absence of ZWNJ after noon.
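Since neither joiner is visible, the difference between the two Persian words above only shows up at the code point level. The Python sketch below builds both spellings from escapes and shows that stripping the joiners collapses them into one string; `show_code_points` and `strip_joiners` are hypothetical helper names used only here.

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER

# Persian examples from the text: teh, noon, (ZWNJ,) heh, alef.
PLURAL_OF_BODY = "\u062A\u0646" + ZWNJ + "\u0647\u0627"
ALONE = "\u062A\u0646\u0647\u0627"


def show_code_points(text: str) -> str:
    """List the code points in text, making invisible characters visible."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def strip_joiners(text: str) -> str:
    """Remove ZWJ and ZWNJ. Shown only to illustrate the information loss."""
    return text.replace(ZWNJ, "").replace(ZWJ, "")
```

Here `strip_joiners(PLURAL_OF_BODY) == ALONE`, which is why normalisation or search code should not discard these characters without considering the effect on meaning.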
034F (CGJ) is used in Arabic-script text to produce special ordering of diacritics. The name is a misnomer, as it is generally used to break the normal sequence of diacritics.
tbd
Words are separated by spaces.
tbd
Observation: Aspirated stops are represented by a combination of the stop letter plus ھ [U+06BE ARABIC LETTER HEH DOACHASHMEE]. This constitutes 2 grapheme clusters, which presumably should always be treated as a single typographic unit. Examples (click to see the structure): اَتھٕ پیٚچھَنؠ
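The observation can be sketched in Python as follows. This is a simplified approximation, not a full implementation of Unicode grapheme cluster segmentation: `simple_clusters` just attaches combining marks to the preceding base character, and `typographic_units` (a hypothetical name) then merges any cluster that starts with ھ into the preceding letter cluster. It does not check that the preceding letter is actually a stop.

```python
import unicodedata

HEH_DOACHASHMEE = "\u06BE"


def simple_clusters(text: str) -> list[str]:
    """Approximate grapheme clusters: a base plus following combining marks."""
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def typographic_units(text: str) -> list[str]:
    """Merge a heh doachashmee cluster into the letter cluster before it."""
    units: list[str] = []
    for cluster in simple_clusters(text):
        follows_letter = bool(units) and unicodedata.category(units[-1][0]) == "Lo"
        if follows_letter and cluster[0] == HEH_DOACHASHMEE:
            units[-1] += cluster
        else:
            units.append(cluster)
    return units
```

For the sequence U+0627 U+064E U+062A U+06BE U+0655, `simple_clusters` gives three clusters, while `typographic_units` gives two, keeping the aspirated stop together for operations such as letter-spacing or highlighting.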
Kashmiri uses a mixture of ASCII and Arabic punctuation.
phrase | ، ؛ : |
---|---|
sentence | ۔ ؟ ! |
Kashmiri commonly uses ASCII parentheses to insert parenthetical information into text.
type | start | end |
---|---|---|
standard | ( | ) |
The words 'left' and 'right' in the Unicode names for parentheses, brackets, and other paired characters should be ignored. LEFT should be read as if it said START, and RIGHT as END. The direction in which the glyphs point will be automatically determined according to the base direction of the text.
The number of characters that are mirrored in this way is around 550, most of which are mathematical symbols. Some are single characters, rather than pairs. The following are some more common ones.
See type samples.
The following type of quotation mark can be found in Kashmiri texts. (Of course, depending on ease of input, quotations may also be surrounded by ASCII double and single quote marks.)
type | start | end |
---|---|---|
primary | ” | “ |
Unlike brackets, these quote marks are not mirrored during display. As a result, LEFT means use on the left, and RIGHT means use on the right.
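The difference between brackets and quotation marks can be checked against the Unicode Bidi_Mirrored property, which Python exposes through `unicodedata.mirrored`. The wrapper `is_mirrored` below is a hypothetical convenience name for this sketch.

```python
import unicodedata


def is_mirrored(ch: str) -> bool:
    """True if the character's glyph is mirrored in right-to-left text."""
    return bool(unicodedata.mirrored(ch))


def mirroring_report(chars: str) -> dict[str, bool]:
    """Map each character's Unicode name to its mirrored status."""
    return {unicodedata.name(ch): is_mirrored(ch) for ch in chars}
```

Calling `mirroring_report("()\u201C\u201D")` reports the two parentheses as mirrored and the two double quotation marks as not mirrored, which matches the guidance above on how to read LEFT and RIGHT in their names.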
Lines are normally broken at word boundaries. They are not broken at the small gaps that appear where a character doesn't join on the left.
As in most writing systems, certain characters are expected not to start or end a line. For example, periods and commas shouldn't start a line, and opening parentheses shouldn't end a line. The Unicode line-break properties help applications decide whether a character may appear at the start or end of a line.
The following list gives examples of typical behaviours for characters affected by these rules. Context may affect the behaviour of some of these and other characters.
When a line break occurs in the middle of an embedded left-to-right sequence, the items in that sequence need to be rearranged visually so that it isn't necessary to read lines upwards.
latin-line-breaks shows how two Latin words are apparently reordered in the flow of text to accommodate this rule. Of course, the rearrangement is only that of the visual glyphs: nothing affects the order of the characters in memory.
tbd
The nastaliq writing style uses arrangements of joined glyphs that cascade downwards from right to left, and resemble a strongly sloping baseline.
An obvious consequence is that inline text in Kashmiri travels much further from the baseline than is usual in Latin script text. Allowances for this need to be made in line height settings on a page, and it can also be problematic when combining Latin and Arabic text on the same line using different fonts for each.
If the Arabic font supports the needed Latin letters, the font design will already take into account the relative sizes of the letters, and their placement relative to the baselines of each script. If different fonts are used, though, it's important to match the baselines and harmonise the font sizes used.
Kashmiri books, magazines, etc., are bound on the right-hand side, and pages progress from right to left.
Columns are vertical but run right-to-left across the page.
Tables, grids, and other 2-dimensional arrangements progress from right to left across a page.
Form controls should display Kashmiri text from right to left, starting at the right side of the input field. Form controls should also usually be arranged from right to left.
fig_form shows some form fields from an Arabic language web page. The same principles apply for Kashmiri. Note the position of the labels relative to the input fields and the checkbox, mirror-imaging a similar page in English. Note also that the input text in the first field appears to the right of the box.
The position of a scrollbar should depend on the user's environment, not on the content of a page. A non-Arab user viewing a web page in Arabic shouldn't have to look for the scroll bar on the left side of the window. In a system that is set up for an Arab user, however, the scrollbar can appear on the left.