Updated 25 May, 2023
This page brings together basic information about the Arabic script and its use for the Kashmiri language, using the latest orthographic changes. It aims to provide a brief, descriptive summary of the modern, printed orthography and typographic features, and to advise how to write Kashmiri using Unicode.
See the Arabic page for most of the information about how the Arabic script works, and the orthography used for the Arabic language. This page aims to provide Kashmiri-specific information.
سٲری اِنسان چھِ آزاد زامٕتؠ۔ وؠقار تہٕ حۆقوٗق چھِ ہِوی۔ تِمَن چھُ سوچ سَمَج عَطا کَرنہٕ آمُت تہٕ تِمَن پَزِ بٲے بَرادٔری ہٕنٛدِس جَذباتَس تَحَت اکھ أکِس اکار بَکار یُن ۔
For information about the script in general, see the Arabic overview. The Perso-Arabic script is recognised as the official script of the Kashmiri language by the Jammu and Kashmir government and the Jammu and Kashmir Academy of Art, Culture and Languages.wkl
Kashmiri is written in the Devanagari script by Hindus. Muslims use the Arabic script.
The Kashmiri Arabic orthography is derived from the Arabic/Persian abjads, where in normal use the script represents only consonant and long vowel sounds. However, Kashmiri is one of the Arabic orthographies that regularly indicates all vowel sounds, making it more like an alphabet.wkl,#Writing_system See the table to the right for a brief overview of features for the modern Kashmiri orthography using the Arabic script.
The Arabic alphabet has been adapted in this orthography in order to cope with the many more vowel sounds in languages such as Kashmiri.
Kashmiri is principally written using the nasta'liq style of Arabic writing. Glyphs are more drawn out, and the baseline tends to be sloping from word to word.
Kashmiri text runs right-to-left in horizontal lines, but numbers and embedded Latin text are read left-to-right.
The script is cursive, and some basic letter shapes change radically, depending on what they join to. The nastaliq styling creates diagonal baselines between joined characters, and tends to reduce clarity about where one letter ends and the next starts. (The dots and other diacritics associated with letters become particularly useful for the reader.) ❯ cursive
There is no case distinction.
Words are separated by spaces.
Modern Kashmiri has 19 basic consonant letters and 6 aspirated digraphs in its alphabet to represent native sounds, but tends to spell words loaned from Persian, Arabic and Urdu using 13 more. ❯ consonants
Letters are used to indicate nasalisation and palatalisation (which is common in Kashmiri). ❯ nasalisation ❯ palatalisation
Kashmiri is an alphabet where vowel sounds are written using a mixture of combining marks and letters. Unlike Arabic, Persian, and Urdu, all vowel diacritics are normally shown when writing Kashmiri.
Representation of vowel sounds is complicated (a) because there are far more vowel sounds than in Arabic (or Persian), and (b) because the code points used to represent a given vowel typically differ according to whether this is an isolate, initial, medial, or final joining form. The choice between precomposed and decomposed realisations of a vowel letter is also complicated. The script draws on combinations of 21 code points in order to write 16 vowel sounds. ❯ vowels
The distinction between ijam vs. tashkil has a bearing on several Kashmiri graphemes.
Word-initial vowels are preceded by or attached to ا [U+0627 ARABIC LETTER ALEF].
Additional diacritics indicate the absence of a vowel in consonant clusters, and gemination. Unlike other Arabic orthographies the jazm (sukun), used to show vowel absence, is commonly placed over the second consonant in a cluster, rather than the first – which means that that consonant may be associated with both a vowel diacritic and jazm diacritic. ❯ clusters
A mandatory ligature is used for combinations of lam + alif.
Kashmiri uses native digits, and Arabic code points for several of the more common punctuation marks. ❯ numbers
Because the Arabic script is 'cursive' (ie. joined-up) writing, letters tend to have different shapes depending on whether they join with adjacent letters or not (see cursive). Here we clarify some of the terminology used in this page to refer to these different forms.
Several characters have no left-joining form. This has an effect on the following letter shape.
When we say 'initial' forms, we generally refer to glyphs that only join to the left. Consonants that don't have a left-joining form use the unjoined glyph at the beginning of a word. Initial forms occur in word-medial position if they follow a glyph that doesn't join to the left.
Where we illustrate 'initial' forms of a vowel we typically show the word-initial form, which is always attached to or preceded by an aleph, eg. اَ or ایـ. If an initial form is immediately preceded by a consonant, the consonant takes the place of the aleph, eg. رَـ or ریـ.
In illustrations of shaping forms we normally show the 'isolated' form of a vowel as preceded by aleph, as it would be if written alone, eg. ای. In use following another letter, however, the aleph is dropped.
Word-final vowel forms come in two types. A vowel that can join with the preceding character uses the right-joining glyph, eg. بی. One that follows a letter that doesn't join to the left uses the isolated form, eg. ری. When we refer to the 'final' form, we are usually referring to the former, ie. the right-joined form.
These are sounds for the Kashmiri language.
Click on the sounds to see where else in the document they are referred to.
|High||i iː||ɨ ɨː||u uː|
|Mid||e eː||ə əː||o oː|
|stops||p b||t d||ʈ ɖ||k ɡ|
Kashmiri is not a tonal language.
The following table summarises the main vowel to character assignments. More information can be found by clicking on the characters above, or in the section vowel_mappings.
Each table cell shows word-initial, dual-joining, and word-final forms from right to left. Where the word-final form differs according to whether it is right-joining or unjoined, both are shown. Click/tap on items to see a list of the components for that cell.
|i اِ ◌ِ ◌ِ
iː ایٖ یٖ ی
|ɨ إ ◌ٕ ◌ٕ
ɨː اٟ ◌ٟ ◌ٟ
uː اوٗ وٗ وٗ
◌ٚے or ےٚ
eː ای ی ے
oː او و و
əː ٲ ٲ ٲ
ɔː – – ۄآ
aː آ ا ا
For a question about the ordering of characters in final e, see final_e. For questions about whether to use precomposed or decomposed letters, see encoding.
This is the set of characters needed to represent the Kashmiri vowels, including decompositions and compositions:
Several of the characters are used in combinations to represent a given vowel sound, in a given position. Note that whether a vowel is represented using a precomposed or decomposed form is not straightforward (see encoding).
The characters just above are vowels or consonants followed by a vowel that Unicode represents in NFC-normalised text as atomic characters, but that normalise in NFD to a code point sequence. Generally, the precomposed form is preferred, and more common, but they are canonically equivalent. These are therefore included in the Kashmiri repertoire for representing those sound combinations.
These are the dedicated vowel letters.
All word-initial vowels begin with ا, which either carries a vowel diacritic or precedes the characters that indicate the vowel (see standalone). Word-medially, it also functions alone as a vowel in its own right, to represent the sound aː.
Kashmiri uses the following combining marks for vowels.
0653 only occurs in decomposed text. All the rest appear as separate code points, although a few may also, in some cases, be combined with a base as an atomic character. Where a vowel is indicated in the IPA, the mark can produce that vowel on its own, whereas the others only occur in combination with other letters.
Kashmiri uses the following consonant characters to write vowels, generally in combination with diacritics, but also alone when representing eː and oː in non-initial positions.
In a standard Arabic orthography these characters would be regarded as matres lectionis, but since Kashmiri shows all vowel diacritics they don't have the same role here.
A majority of vowels in Kashmiri are represented by composite vowels in at least one positional form.
Word-initial vowels are nearly all composite in normal text.
In decomposed text, 3 more word-initial forms are also composites:
Only ٲ doesn't decompose.
In other locations the following composites also occur:
Word-initial standalone vowels always begin with ا, either as a carrier for a diacritic, or before the other characters that represent the vowel.
The Unicode character ٳ [U+0673 ARABIC LETTER ALEF WITH WAVY HAMZA BELOW] is explicitly deprecated by the Standard in favour of the decomposed sequence اٟ [U+0627 ARABIC LETTER ALEF + U+065F ARABIC WAVY HAMZA BELOW]. There is no normalisation equivalence.
The list above contains several other single Unicode code points that look like combinations of Kashmiri letters and vowel diacritics, but they neither decompose nor recompose during normalisation. The Unicode Standard descriptions for these characters indicate that they are intended for use with specific languages, and Kashmiri is not listed amongst those. The hamza in these characters is an ijam, rather than a vowel diacritic, ie. it is an integral part of the letter. See Ijam, tashkil, hamza.
Nevertheless, they may appear in Kashmiri text – for example, ۆ [U+06C6 ARABIC LETTER OE] is the default encoding for the vowel o in Wiktionary's list of words.
Content authors should use the decomposed forms, but because that can't be guaranteed, applications need to apply special rules to recognise both precomposed and decomposed forms as equivalent. See non_canonical for more details.
Vowel length is indicated by use of different characters or character sequences. See fig_vowelgrid.
Vowels are commonly nasalised in Kashmiri. A nasalised vowel is normally written using ن [U+0646 ARABIC LETTER NOON].
This makes a nasalised vowel indistinguishable from a vowel followed by n.
At the end of a word, ں [U+06BA ARABIC LETTER NOON GHUNNA] is used.
Kashmiri is not a tonal language.
This section maps Kashmiri vowel sounds to common graphemes in the Arabic orthography, grouped by word-initial ( i ), medial ( m ), and final ( f ). Click on a grapheme to find other mentions on this page (links appear at the bottom of the page). Click on the character name to see examples and for detailed descriptions of the character(s) shown.
The allocation of characters to vowel sounds is somewhat complicated. The complexity arises from the number of vowels in Kashmiri compared to the Arabic language, and the need to represent them all, but also because different sequences are needed for different positional forms. In addition, often more than one character sequence can achieve the same result.
Vowels in initial or isolated position are written with a preceding ا [U+0627 ARABIC LETTER ALEF].
◌ِ [U+0650 ARABIC KASRA]
◌ِ [U+0650 ARABIC KASRA]
إ [U+0625 ARABIC LETTER ALEF WITH HAMZA BELOW] Decomposes to إ [U+0627 ARABIC LETTER ALEF + U+0655 ARABIC HAMZA BELOW] and recomposes.
اٟ [U+0627 ARABIC LETTER ALEF + U+065F ARABIC WAVY HAMZA BELOW] There is a precomposed character, ٳ [U+0673 ARABIC LETTER ALEF WITH WAVY HAMZA BELOW], but it neither composes nor decomposes in normalisation, and is strongly deprecated by the Unicode Standard.
◌ُ [U+064F ARABIC DAMMA]
◌ُ [U+064F ARABIC DAMMA]
ـیـ [U+06CC ARABIC LETTER FARSI YEH]
ۄآ [U+06C4 ARABIC LETTER WAW WITH RING + U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE]. Used only in isolated and final forms. The isolated form has an ALEF before it.
◌ٔ [U+0654 ARABIC HAMZA ABOVE] There are several precomposed characters available for use with hamza above. These include:
All of the above decompose and recompose during normalisation.
◌ٔ [U+0654 ARABIC HAMZA ABOVE] (see note about precomposed characters just above).
◌َ [U+064E ARABIC FATHA]
◌َ [U+064E ARABIC FATHA]
آ [U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE] Decomposes to آ [U+0627 ARABIC LETTER ALEF + U+0653 ARABIC MADDAH ABOVE] and recomposes.
ں [U+06BA ARABIC LETTER NOON GHUNNA] when word-final.
Observation: Final ɨ seems to also be commonly spelled using ہٕ [U+06C1 ARABIC LETTER HEH GOAL + U+0655 ARABIC HAMZA BELOW], eg. طوطہٕ بہٕ
Click on the characters in the lists for detailed information.
The following constitute a basic set of consonants used for Kashmiri, that cover all standard phonemes for the Kashmiri language.
Six additional letters of the alphabet represent aspirated sounds. These are all written by combining a standard character with a following ھ [U+06BE ARABIC LETTER HEH DOACHASHMEE].
The following set of consonants map mostly to the same phonemes, but are generally for loan words and preserve the original spellings in the language of origin.
Palatalisation is a frequent feature of Kashmiri words. It is represented using ؠ [U+0620 ARABIC LETTER KASHMIRI YEH] after the consonant to be palatalised. Initial and medial forms have a small circle beneath them.
Kashmiri uses ْ [U+0652 ARABIC SUKUN] (jazm) to indicate a consonant that is not followed by a vowel, except when the consonant is word-final. Note that Kashmiri uses an inverted-v shape for the jazm, rather than the small round circle used for Arabic sukun.
Note that this is NOT ٛ [U+065B ARABIC VOWEL SIGN INVERTED SMALL V ABOVE]. That character is used as a vowel diacritic, eg. to write the letter o in Fulfulde. The ARABIC SUKUN code point has the semantic meaning intended here, and is also used for this function in Standard Arabic, Persian, Urdu, etc.. For Kashmiri you should use a font that produces the expected glyph shape. Using a different character that has the same shape but not the same semantics will cause problems for interoperable use of your text, and some fonts may fail to display it correctly (see confusables).
Observation: There appears to be a significant difference in the way jazm is used, compared to other Arabic orthographies, in that it appears above and is stored after the second character in the consonant cluster. This means that the base character may be associated with both a vowel diacritic and the jazm in memory, eg. واریُْل
This behaviour is explicitly described in Rainamkr,p11-12 and occurs in Wiktionary lemmas.
The exception is a cluster with an initial ن [U+0646 ARABIC LETTER NOON], representing either n or nasalisation. In this case, the jazm goes over the first letter in the cluster, eg. وَنْدٕ
No special mechanisms other than the use of the sukun, as described in novowel.برَْگ
No special mechanisms, except that the sukun mark is generally not used for a word-final consonant.بادَم
Kashmiri uses ْ [U+0652 ARABIC SUKUN] (jazm) to indicate a consonant cluster, as described in novowel.
The diacritic ّ [U+0651 ARABIC SHADDA] doubles the value of the consonant it is attached to.
This section maps Kashmiri consonant sounds to common graphemes in the Arabic orthography, grouped by word-initial ( i ), medial ( m ), and final ( f ) types. Click on a grapheme to find other mentions on this page (links appear at the bottom of the page). Click on the character name to see examples and for detailed descriptions of the character(s) shown.
Sounds listed as 'infrequent' are allophones, or sounds used for foreign words, etc.
پھ [U+067E ARABIC LETTER PEH + U+06BE ARABIC LETTER HEH DOACHASHMEE]
ف [U+0641 ARABIC LETTER FEH] sometimes in loan words.
ط [U+0637 ARABIC LETTER TAH] in loan words.
ک [U+06A9 ARABIC LETTER KEHEH]
ق [U+0642 ARABIC LETTER QAF] in loan words.
غ [U+063A ARABIC LETTER GHAIN] in loan words.
کھ [U+06A9 ARABIC LETTER KEHEH + U+06BE ARABIC LETTER HEH DOACHASHMEE]
خ [U+062E ARABIC LETTER KHAH] in loan words.
و [U+0648 ARABIC LETTER WAW] in some words.
ص [U+0635 ARABIC LETTER SAD] in loan words.
ث [U+062B ARABIC LETTER THEH] in loan words.
ذ [U+0630 ARABIC LETTER THAL] in loan words.
ض [U+0636 ARABIC LETTER DAD] in loan words.
ظ [U+0638 ARABIC LETTER ZAH] in loan words.
ہ [U+06C1 ARABIC LETTER HEH GOAL]
ح [U+062D ARABIC LETTER HAH] in loan words.
ڑ [U+0691 ARABIC LETTER RREH] in loan words.
Sources: Wikipedia, and Google Translate.
Arabic-script text makes use of a relatively large set of invisible formatting characters, especially in plain text, many of which are used to manage text direction. Descriptions of these characters can be found in the following sections:
In the Kashmiri orthography different sequences of Unicode characters may produce the same visual result. Here we look at those, and raise questions where clarifications are needed.
Normalisation converts the following precomposed to decomposed alternatives, and vice versa.
|إ [U+0625 ARABIC LETTER ALEF WITH HAMZA BELOW]||إ [U+0627 ARABIC LETTER ALEF + U+0655 ARABIC HAMZA BELOW]|
|أ [U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE]||أ [U+0627 ARABIC LETTER ALEF + U+0654 ARABIC HAMZA ABOVE]|
|آ [U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE]||آ [U+0627 ARABIC LETTER ALEF + U+0653 ARABIC MADDAH ABOVE]|
|ؤ [U+0624 ARABIC LETTER WAW WITH HAMZA ABOVE]||ؤ [U+0648 ARABIC LETTER WAW + U+0654 ARABIC HAMZA ABOVE]|
|ۂ [U+06C2 ARABIC LETTER HEH GOAL WITH HAMZA ABOVE]||ۂ [U+06C1 ARABIC LETTER HEH GOAL + U+0654 ARABIC HAMZA ABOVE]|
The single code point per vowel-sign is the form preferred by the Unicode Standard and the form in common use for Kashmiri. The parts are separated in Unicode Normalisation Form D (NFD), and recomposed in Unicode Normalisation Form C (NFC), so both approaches are canonically equivalent.
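The canonical equivalence described above can be checked with a short script. This is a minimal sketch using Python's standard unicodedata module; the names PAIRS, canonically_equal and check_pairs are illustrative and not part of any Kashmiri tooling.

```python
import unicodedata

# Precomposed characters from the table above, each paired with the
# decomposed sequence that normalisation converts it to and from.
PAIRS = {
    "\u0625": "\u0627\u0655",  # ALEF WITH HAMZA BELOW
    "\u0623": "\u0627\u0654",  # ALEF WITH HAMZA ABOVE
    "\u0622": "\u0627\u0653",  # ALEF WITH MADDA ABOVE
    "\u0624": "\u0648\u0654",  # WAW WITH HAMZA ABOVE
    "\u06C2": "\u06C1\u0654",  # HEH GOAL WITH HAMZA ABOVE
}


def canonically_equal(a, b):
    """Compare two strings after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def check_pairs():
    """True if every pair decomposes in NFD and recomposes in NFC."""
    return all(
        unicodedata.normalize("NFD", pre) == dec
        and unicodedata.normalize("NFC", dec) == pre
        for pre, dec in PAIRS.items()
    )
```

An application that normalises all incoming text to one form (NFC here) before searching or comparing will treat both spellings of these vowels as identical.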
The following alternatives are not converted to each other during normalisation. The diacritics in the precomposed characters are ijam, whereas those in the decomposed sequences are tashkil.
|0673 (deprecated!)||0627 065F||The Unicode Standard indicates that the first precomposed item in the list above is strongly deprecated. There are no such indications, however, for the others.|
|06CE||06CC 065A||Neither alternative on this line was supported by older versions of the Noto Nastaliq Urdu font, causing a major problem for writing the sound e in Kashmiri. But support is available in the latest version of that font, and in the Awami Nastaliq font.|
The diacritics in atomic characters without decompositions, like those in this table, are generally intended to represent ijam rather than vowel sounds. In a search on a sample that included various Wikipedia pages and 369 Wiktionary lemmas the decomposed sequences on the right side of this table typically scored most hits, and there were zero to 3 of each of the precomposed variants. Except for this vowel o: there were 30 instances of the precomposed character and only 2 of the decomposed. The Unicode Standard says that this precomposed character is for use with Uighur, Kurdish, Kazakh, Azerbaijani, and Bosnian, but doesn't indicate that it should be used for Kashmiri.
The precomposed characters listed are associated with particular languages by the annotations in the Unicode Standard. The decomposed forms are therefore recommended for use with Kashmiri, with the possible exception of OE. However, both versions have been seen in digital text in Kashmiri, so applications will need to recognise both precomposed and decomposed alternatives as the same grapheme. Input mechanisms, on the other hand, can produce one rather than the other, and that choice should be made with advisement.
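Because normalisation does not relate these alternatives, an application has to fold them itself. The following is a rough sketch of one way to do that, assuming the decomposed form is the target; NON_CANONICAL contains only the two pairs shown in the table above, and the function names are hypothetical. A real implementation would need the full list of pairs and a decision about OE.

```python
import unicodedata

# Precomposed -> decomposed pairs taken from the table above.
# NFC/NFD leave these untouched, so they are mapped by hand.
NON_CANONICAL = {
    "\u0673": "\u0627\u065F",  # deprecated ALEF WITH WAVY HAMZA BELOW
    "\u06CE": "\u06CC\u065A",  # YEH WITH SMALL V -> FARSI YEH + SMALL V ABOVE
}


def fold_kashmiri(text):
    """Replace the non-canonical precomposed forms, then normalise to NFC."""
    for pre, dec in NON_CANONICAL.items():
        text = text.replace(pre, dec)
    # Normalising last also puts any neighbouring marks in canonical order.
    return unicodedata.normalize("NFC", text)


def same_graphemes(a, b):
    """Treat precomposed and decomposed alternatives as the same text."""
    return fold_kashmiri(a) == fold_kashmiri(b)
```

Folding of this kind is intended for matching and searching. Whether stored text should be rewritten is a separate editorial decision.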
The following lists some common errors found in Kashmiri text due to the similarity of Unicode characters, or perhaps sometimes due to problems inputting the correct character. Wikipedia is a rich source of such.
|064A||06CC||The Arabic YEH doesn't drop the dots below in isolate and final positions.|
|0626||06CC 0654||This precomposed form becomes ئ [U+064A ARABIC LETTER YEH + U+0654 ARABIC HAMZA ABOVE] when the text is decomposed during normalisation, ie. the base character is replaced by U+064A instead of U+06CC.|
|0643||06A9||Common fonts tend not to show the difference between these two characters, but the ability to search and compare text is impaired unless the application is aware of and takes counter-measures against this substitution.|
|066E 06EA||0620 0640||This occurs when the KASHMIRI YEH is right-joining or dual-joining, in which case it has the ring below. The incorrect solution doesn't work well with common fonts, as well as corrupting the semantics of the text stream.|
|065B||0652||The function of this glyph is that of the sukun, so the correct semantic character should be used. Although ٛ [U+065B ARABIC VOWEL SIGN INVERTED SMALL V ABOVE] looks like the Kashmiri jazm, it was introduced to Unicode to serve as a vowel sign for African languages (as described in the name of the character). In order to produce the correct glyph using a font such as Noto it is essential to indicate that the language of the text is Kashmiri. (In HTML this can be done using the attribute
|06C5||06C4||The incorrect letter is intended for use with Kirghiz. Some fonts add a loop to the tail, similar to that of the recommended character, but other fonts render it with a bar through the tail.|
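The single-code-point substitutions in the table above lend themselves to a simple check. The sketch below is illustrative only: CONFUSABLES, find_confusables and repair_confusables are hypothetical names, and the 066E 06EA row is left out because it is a two-code-point sequence whose correct replacement depends on joining context.

```python
# Suspect code point -> recommended replacement, from the table above.
CONFUSABLES = {
    "\u064A": "\u06CC",        # ARABIC YEH -> FARSI YEH
    "\u0626": "\u06CC\u0654",  # YEH WITH HAMZA ABOVE -> FARSI YEH + HAMZA ABOVE
    "\u0643": "\u06A9",        # ARABIC KAF -> KEHEH
    "\u065B": "\u0652",        # INVERTED SMALL V ABOVE -> SUKUN (jazm)
    "\u06C5": "\u06C4",        # KIRGHIZ OE -> WAW WITH RING
}


def find_confusables(text):
    """Return (index, found, suggested) for each suspect code point."""
    return [
        (i, ch, CONFUSABLES[ch])
        for i, ch in enumerate(text)
        if ch in CONFUSABLES
    ]


def repair_confusables(text):
    """Return a copy of text with each suspect code point replaced."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)
```

A report from find_confusables is usually the safer starting point, since text that deliberately quotes Arabic or Kirghiz words would be damaged by blind replacement.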
In the online term list at Wiktionary there are items containing a final e where the order of code points is ٚے [U+065A ARABIC VOWEL SIGN SMALL V ABOVE + U+06D2 ARABIC LETTER YEH BARREE], and others where the order is reversed to give ےٚ [U+06D2 ARABIC LETTER YEH BARREE + U+065A ARABIC VOWEL SIGN SMALL V ABOVE]. Click on the following examples to see their composition: تْرٚے شےٚ
It's unclear whether this is simply driven by user preference, or by orthographic rules, or the words are wrongly encoded. If the inverted-v occurs after the consonant in the word for 'six', it would look like this: شٚے
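To survey how often each ordering occurs in a word list, the two sequences can be told apart mechanically. This is a small sketch; final_e_order and its return labels are invented for illustration, and it makes no claim about which ordering is correct.

```python
SMALL_V = "\u065A"     # ARABIC VOWEL SIGN SMALL V ABOVE
YEH_BARREE = "\u06D2"  # ARABIC LETTER YEH BARREE


def final_e_order(word):
    """Classify the code point order used for a word-final e, if any."""
    if word.endswith(SMALL_V + YEH_BARREE):
        return "mark-first"    # the mark is stored before YEH BARREE
    if word.endswith(YEH_BARREE + SMALL_V):
        return "letter-first"  # the mark is stored after YEH BARREE
    return None
```

With the two examples above, the first word (ending 065A 06D2) is classified as mark-first and the second (ending 06D2 065A) as letter-first.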
There appears to be a significant difference in the way jazm is used, compared to other Arabic orthographies, in that it commonly appears above and is stored after the second character in the consonant cluster.
It appears that the jazm diacritic is only used in consonant clusters over the letters r and j, when they appear immediately after a consonant (ie. in 'medial' position), and n (including nasalisation) when it occurs immediately before another consonant. When used with r and j, the base character may be associated with both a vowel diacritic and the jazm. Examples: واریُْل وَنْدٕ
Other consonant clusters can occur without the use of the jazm, eg. ہۆست
This behaviour is explicitly described in Rainamkr,p11-12 and occurs in Wiktionary lemmas.
However, in one case in the Wiktionary lemmas an identical base cluster shows the jazm in different places; compare ترْٕے تْرٚے
Again, it isn't clear whether this changing order of code points is normal, or whether the ordering should be fixed and the font produce the appropriate placement.
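When investigating questions like these it helps to list which base letters carry a jazm in memory, and which other marks share that base. The following is a minimal sketch, assuming marks can be identified by a non-zero canonical combining class; jazm_sites is a hypothetical helper name.

```python
import unicodedata

JAZM = "\u0652"  # ARABIC SUKUN, used for the Kashmiri jazm


def jazm_sites(word):
    """Return (base_letter, other_marks) for every base that carries a jazm."""
    clusters = []
    for ch in word:
        if unicodedata.combining(ch) and clusters:
            clusters[-1][1].append(ch)  # attach the mark to the previous base
        else:
            clusters.append((ch, []))   # start a new base letter
    return [
        (base, [m for m in marks if m != JAZM])
        for base, marks in clusters
        if JAZM in marks
    ]
```

For the first example above the jazm is found on FARSI YEH together with a DAMMA, whereas in the second it sits alone on NOON.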
The Unicode Arabic block has 2 sets of digits, and Kashmiri uses the extended set. The Unicode bidi_class property for these native digits is European_Number, which makes them behave and look differently from the digits used for Arabic language text. For more information, see expressions.
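The difference in bidi_class between the two digit sets can be inspected directly. This sketch uses Python's unicodedata module, which reports the property using the short aliases EN (European_Number) and AN (Arabic_Number); bidi_classes and to_kashmiri_digits are illustrative names.

```python
import unicodedata


def bidi_classes(start):
    """Collect the bidi_class values of the ten digits starting at `start`."""
    return {unicodedata.bidirectional(chr(start + i)) for i in range(10)}


EXTENDED = bidi_classes(0x06F0)  # EXTENDED ARABIC-INDIC DIGIT ZERO..NINE
ARABIC = bidi_classes(0x0660)    # ARABIC-INDIC DIGIT ZERO..NINE


def to_kashmiri_digits(s):
    """Convert ASCII digits to the extended set used for Kashmiri."""
    return s.translate({ord("0") + i: 0x06F0 + i for i in range(10)})
```

Here EXTENDED contains only EN and ARABIC contains only AN, which is why the two sets can be ordered differently when they appear in ranges or expressions.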
In addition, there are differences in glyph shapes. fig_number_shapes shows the different glyph shapes used in Arabic, Persian, Urdu and Sindhi. Kashmiri digits share the same shapes as those for Urdu.u,370
I suspect that Kashmiri may use ٫ [U+066B ARABIC DECIMAL SEPARATOR] and ٬ [U+066C ARABIC THOUSANDS SEPARATOR], but need to confirm.
Kashmiri text is written horizontally and right-to-left in the main but, as in most right-to-left scripts, numbers and embedded text in other scripts are written left-to-right (producing 'bidirectional' text).
The Unicode Bidirectional Algorithm automatically takes care of the ordering for all the text in fig_bidi, as long as the 'base direction' is set to RTL. In HTML this can be set using the dir attribute, or in plain text using formatting controls.
If the base direction is not set appropriately, the directional runs will be ordered incorrectly as shown in fig_bidi_no_base_direction, making it very difficult to get the meaning.
bidi_class properties for characters in the Kashmiri language.
For other aspects of dealing with right-to-left writing systems see the following sections:
For more information about how directionality and base direction work, see Unicode Bidirectional Algorithm basics. For information about plain text formatting characters see How to use Unicode controls for bidi text. And for working with markup in HTML, see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.
For authoring HTML pages, one of the most important things to remember is to use <html dir="rtl" … > at the top of the page. Also, use markup to manage direction, and do not use CSS styling.
Unicode provides a set of 10 formatting characters that can be used to control the direction of text when displayed. These characters have no visual form in the rendered text, however text editing applications may have a way to show their location.
[U+202B RIGHT-TO-LEFT EMBEDDING] (RLE), [U+202A LEFT-TO-RIGHT EMBEDDING] (LRE), and [U+202C POP DIRECTIONAL FORMATTING] (PDF) are in widespread use to set the base direction of a range of characters. RLE/LRE comes at the start, and PDF at the end of a range of characters for which the base direction is to be set.
In Unicode 6.3, the Unicode Standard added a set of characters which do the same thing but also isolate the content from surrounding characters, in order to avoid spillover effects. They are [U+2067 RIGHT-TO-LEFT ISOLATE] (RLI), [U+2066 LEFT-TO-RIGHT ISOLATE] (LRI), and [U+2069 POP DIRECTIONAL ISOLATE] (PDI). The Unicode Standard recommends that these be used instead.
There is also [U+2068 FIRST STRONG ISOLATE] (FSI), used initially to set the base direction according to the first recognised strongly-directional character.
[U+061C ARABIC LETTER MARK] (ALM) is used to produce correct sequencing of numeric data. Follow the link and see expressions for details.
[U+200F RIGHT-TO-LEFT MARK] (RLM) and [U+200E LEFT-TO-RIGHT MARK] (LRM) are invisible characters with strong directional properties that are also sometimes used to produce the correct ordering of text.
For more information about how to use these formatting characters see How to use Unicode controls for bidi text. Note, however, that when writing HTML you should generally use markup rather than these control codes. For information about that, see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.
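For plain text, where markup is not available, the isolate controls can be applied programmatically. The following is a minimal sketch; the isolate function and its direction argument are assumptions made for illustration, not an established API.

```python
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text, direction="auto"):
    """Wrap text in an isolate so it cannot disturb the surrounding order.

    direction is "rtl", "ltr", or "auto" (first strong character decides).
    """
    opener = {"rtl": RLI, "ltr": LRI, "auto": FSI}[direction]
    return opener + text + PDI
```

For example, isolate(title, "rtl") could be used when inserting a Kashmiri title into a plain-text message whose surrounding direction is not known in advance.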
A sequence of numbers used to express a range of values generally runs right to left in the Arabic language (and languages using the Thaana or Syriac scripts), whereas for Persian language text (and in Hebrew, N’Ko or Adlam scripts) it runs left to right.
For more information, see the section Expressions & sequences in the Arabic script notes.
This section brings together information about the following topics: writing styles; cursive text; context-based shaping; context-based positioning; baselines, line height, etc.; font styles; case & other character transforms.
You can experiment with examples using the Kashmiri character app.
Kashmiri written in the Arabic script is cursive, and there are combining characters and special joining behaviours.
The orthography has no case distinction, and no special transforms are needed to convert between characters.
See the Arabic overview for details.
Arabic script is always cursive, ie. letters in a word are joined up. Fonts need to produce the appropriate glyph for a letter, according to its visual context, but the code point used doesn't change. This results in four different shapes for most letters, however some letters never join to the left. Ligated forms also join with characters alongside them.
In the lists below 30 Kashmiri letters are dual-joining, whereas 17 join only to the right. However, the high frequency of the latter and short word lengths produce text that doesn't usually have long joined sequences (see fig_unjoined).
Most dual-joining characters add or become a swash when they don't join to the left. A number of characters, however, undergo additional shape changes across the joining forms. fig_joining_forms and fig_right_joining_forms show the basic shapes in Kashmiri and what their joining forms look like.
[U+200D ZERO WIDTH JOINER] (ZWJ) and [U+200C ZERO WIDTH NON-JOINER] (ZWNJ) are used to control the joining behaviour of cursive glyphs. They are particularly useful in educational contexts, but also have real world applications.
ZWJ permits a letter to form a cursive connection without a visible neighbour. For example, the marker for hijri dates in Arabic language text is an initial form of heh, even though it doesn't join to the left, ie. ه. For this, use ZWJ immediately after the heh, eg. الاثنين 10 رجب 1415 ه..
ZWNJ prevents two adjacent letters forming a cursive connection with each other when rendered. For example, it is used in Persian for plural suffixes, some proper names, and Ottoman Turkish vowels. Ignoring or removing the ZWNJ will result in text with a different meaning or meaningless text, eg, تنها is the plural of body, whereas تنها is the adjective alone.2 The only difference is the presence or absence of ZWNJ after noon.
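The point that these invisible characters carry meaning can be shown with the Persian pair just described. This is a small sketch in which the two strings are built from explicit escapes so that the ZWNJ is visible in the source; strip_join_controls is a hypothetical helper showing what a careless clean-up step would do.

```python
import unicodedata

ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER

# TEH NOON <ZWNJ> HEH ALEF versus TEH NOON HEH ALEF
with_zwnj = "\u062A\u0646" + ZWNJ + "\u0647\u0627"
without_zwnj = "\u062A\u0646\u0647\u0627"


def strip_join_controls(text):
    """Remove ZWJ and ZWNJ. This loses information and is shown as a warning."""
    return text.replace(ZWNJ, "").replace(ZWJ, "")
```

The two strings compare as different, but become identical once the controls are stripped, so filters that drop format characters (general category Cf) should make an exception for these two.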
[U+034F COMBINING GRAPHEME JOINER] is used in Arabic-script text to produce special ordering of diacritics. The name is a misnomer, as it is generally used to break the normal sequence of diacritics.
Observation: Aspirated stops are represented by a combination of the stop letter plus ھ [U+06BE ARABIC LETTER HEH DOACHASHMEE]. This constitutes 2 grapheme clusters, which presumably should always be treated as a single typographic unit. Examples (click to see the structure): اَتھٕ پیٚچھَنؠ
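One way an application might treat an aspirated digraph as a single unit is sketched below. It is a simplification under stated assumptions: marks are detected by a non-zero combining class, and HEH DOACHASHMEE is merged with any immediately preceding bare letter rather than only with the six stop letters. typographic_units is a hypothetical name.

```python
import unicodedata

HEH_DOACHASHMEE = "\u06BE"


def typographic_units(word):
    """Split a word into units, keeping letter + HEH DOACHASHMEE together."""
    # Step 1: attach combining marks to the preceding base letter.
    clusters = []
    for ch in word:
        if unicodedata.combining(ch) and clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    # Step 2: merge a HEH DOACHASHMEE cluster into a preceding bare letter.
    units = []
    for cluster in clusters:
        prev_is_bare_letter = (
            bool(units)
            and len(units[-1]) == 1
            and unicodedata.category(units[-1]) == "Lo"
        )
        if cluster[0] == HEH_DOACHASHMEE and prev_is_bare_letter:
            units[-1] += cluster
        else:
            units.append(cluster)
    return units
```

Applied to the first example above, this yields two units: the alef with its fatha, and the teh plus heh doachashmee with its hamza below.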
Words are separated by spaces.
Kashmiri uses a mixture of ASCII and Arabic punctuation.
: [U+003A COLON]
؟ [U+061F ARABIC QUESTION MARK]
Kashmiri commonly uses ASCII parentheses to insert parenthetical information into text.
( [U+0028 LEFT PARENTHESIS] and ) [U+0029 RIGHT PARENTHESIS] are used for parentheses.
The words 'left' and 'right' in the Unicode names for parentheses, brackets, and other paired characters should be ignored. LEFT should be read as if it said START, and RIGHT as END. The direction in which the glyphs point will be automatically determined according to the base direction of the text.
The number of characters that are mirrored in this way is around 550, most of which are mathematical symbols. Some are single characters, rather than pairs. The following are some more common ones.
Kashmiri texts use quotation marks around quotations. Of course, due to keyboard design, quotations may also be surrounded by ASCII double and single quote marks. Note, however, that the order of use is different from that in LTR text, because they are not automatically mirrored.
Quote marks are ” [U+201D RIGHT DOUBLE QUOTATION MARK] at the start, and “ [U+201C LEFT DOUBLE QUOTATION MARK] at the end.
Unlike the bracketing quotation marks, these characters are not mirrored during display. This means that LEFT means use on the left, and RIGHT means use on the right.
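The difference between mirrored and non-mirrored characters is recorded in the Unicode Bidi_Mirrored property, which can be queried as in the sketch below. The helper names is_mirrored and quote are illustrative, and quote simply follows the start and end marks described above.

```python
import unicodedata


def is_mirrored(ch):
    """True if the character's glyph is mirrored in right-to-left text."""
    return unicodedata.mirrored(ch) == 1


def quote(text):
    """Wrap text in the quotation marks described above for Kashmiri."""
    # Not mirrored, so the RIGHT mark is placed first in memory
    # and therefore appears on the right in right-to-left text.
    return "\u201D" + text + "\u201C"
```

Parentheses report as mirrored, so their code points are chosen by role (start or end), whereas the curly quotation marks report as not mirrored and have to be chosen by the side on which they should appear.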
Lines are normally broken at word boundaries. They are not broken at the small gaps that appear where a character doesn't join on the left.
Like most writing systems, certain characters are expected not to start or end a line. For example, periods and commas shouldn't start a line, and opening parentheses shouldn't end a line.
Show (default) line-breaking properties for characters in the modern Kashmiri orthography.
When a line break occurs in the middle of an embedded left-to-right sequence, the items in that sequence need to be rearranged visually so that it isn't necessary to read lines from top to bottom.
latin-line-breaks shows how two Latin words are apparently reordered in the flow of text to accommodate this rule. Of course, the rearrangement is only that of the visual glyphs: nothing affects the order of the characters in memory.
The nastaliq writing style uses arrangements of joined glyphs that cascade downwards from right to left, and resemble a strongly sloping baseline.
An obvious consequence is that inline text in Kashmiri travels much further from the baseline than is usual in Latin script text. Allowances for this need to be made in line height settings on a page, and it can also be problematic when combining Latin and Arabic text on the same line using different fonts for each.
If the Arabic font supports the needed Latin letters, the font design will already take into account the relative sizes of the letters, and their placement relative to the baselines of each script. If different fonts are used, though, it's important to match the baselines and harmonise the font sizes used.
This section is for any features that are specific to Kashmiri in Arabic script and that relate to the following topics: general page layout & progression; grids & tables; notes, footnotes, etc; forms & user interaction; page numbering, running headers, etc.
Kashmiri books, magazines, etc., are bound on the right-hand side, and pages progress from right to left.
Columns are vertical but run right-to-left across the page.
Tables, grids, and other 2-dimensional arrangements progress from right to left across a page.
Form controls should display Kashmiri text from right to left, starting at the right side of the input field. Form controls should also usually be arranged from right to left.
fig_form shows some form fields from an Arabic language web page. The same principles apply for Kashmiri. Note the position of the labels relative to the input fields and the checkbox, mirror-imaging a similar page in English. Note also that the input text in the first field appears to the right of the box.
The position of a scrollbar should depend on the user's environment, not on the content of a page. A non-Arab user viewing a web page in Arabic shouldn't have to look for the scroll bar on the left side of the window. In a system that is set up for an Arab user, however, the scrollbar can appear on the left.