Updated 17 May, 2025
This page brings together basic information about the XXXX script and its use for the XXXX language. It aims to provide a brief, descriptive summary of the modern, printed orthography and typographic features, and to advise how to write XXXX using Unicode.
Richard Ishida, Bangla (Bengali) Orthography Notes, 17-May-2025, https://r12a.github.io/scripts/beng/bn
A typical character panel looks like:
The source text is:
<figure class="characterBox auto" data-cols="ipa,trans,transc">𖼁␣𖼋␣𖼏␣𖼟␣𖼢␣𖼸␣𖼯␣𖼛␣𖼅␣𖼑␣𖼕</figure>
Class: characterBox produces a yellow background. Other coloured backgrounds are only used occasionally; they include auxiliaryBox (mauve), etc.
class:small renders the character at 80% size. Good for long lists of conjuncts, list counters, etc.
class:hideCh renders the character at 20% size with a border. Good for invisible characters that are represented by images (see data-notes).
class:noexpansion prevents the production of the downwards arrow that shows details for all characters.
class:nolist prevents production of the arrow that lists all characters.
class:ipaplus adds the contents of the ipa+ column in the spreadsheet (usually inherent vowel).
data-ignore a (single) character that should be ignored when listing the hex code point values; typically used for dotted circles.
data-ipa value is a comma-separated list of ipa transcriptions that will be aligned with the items displayed. You should normally remove 'ipa' from the data-cols field.
data-translit tbd.
data-links associates a link with each item.
data-extra tbd.
data-notes a comma-separated list of texts that are displayed below the main item. This is particularly useful for invisible characters because it can be used to pull in images, eg.
data-notes="<img src='../../c/Miao/large/16F8F.png' alt='ZWNJ' style='vertical-align:middle; height:2rem'>, ..."
data-last indicates that the hex code point value refers to the last item; used mostly for bicameral listings.
For cased scripts, display lists side-by-side using <div class="cased">
, eg.
A typical reference to a code point or sequence of code points looks like the following 4 examples. The page converts simple markup to more complicated markup during rendering.
𞋀
𞋖𞋭
𞋮
200B
Basic markup for these is, respectively:
<span class="ch">𞋀</span>
or <span class="hx">1E2C0</span>
<span class="ch">𞋖𞋭</span>
or <span class="hx">1E2D6 1E2ED</span>
<span class="ch">𞋮</span>
or <span class="hx">1E2EE</span>
<span class="hx img">200B</span>
The generated markup will look something like this:
<span class="codepoint" translate="no"><bdi lang="th">𞋀</bdi><a href="javascript:void(0)" target=""><span class="uname">U+1E2C0 WANCHO LETTER AA</span></span>
Each orthography notes page globally declares a language tag to be used with example text. The markup automatically adds a lang attribute for that language to the generated markup. If you want a different lang attribute to be applied, simply add lang="xx"
to the span tag (where xx is the lang value).
Every span tag must contain one of the following classes:
ch The span element content contains one or more characters (with no separators).
hx The span element content contains one or more hex code point values, separated by a space.
Optional, additional class names are as follows. Most of these can be combined with others.
img A PNG image will be used to show the character.
svg An SVG image will be used to show the character.
noname Show the character but suppress the name & codepoint value.
split Renders a sequence as 𞋖𞋭, rather than 𞋖𞋭.
circle Adds a dotted circle before the listed character(s) on display.
coda Adds a dotted circle after the listed character(s) on display. This is useful for showing character sequences from scripts like Thai, where the pronunciation is different in syllables with and without a coda.
init Adds ZWJ after the character(s), when displayed.
medi Adds ZWJ before and after the character(s), when displayed.
fina Adds ZWJ before the character(s), when displayed.
skip Adds ZWJ before the second character(s), when displayed. This is useful for cases where, for a cursive script, you want to show a combining mark followed by a joining letter.
noindex Prevents this item being picked up for inclusion in the index.
References can be added to the markup using <tt>...</tt>.
To indicate a source link that is listed in the refs.js file, use <tt> with the key for that resource. After a comma you can add a number for a page, or '#fragid', eg.<tt>ul,22</tt>
To indicate a source link that is NOT listed in the refs.js file, use <tt> containing '@shortname,url', eg. <tt>@Github,https://github.com/w3c/sealreq/issues/45</tt>
If this is something the reader should also look at, rather than just a source link, add class="more"
to the tt
element.
On devices that don't have hover any page number will be shown alongside the reference number.
General pattern for use in figures and outside figures.
ক␣্␣ত␣ক্ত |
ঞ␣্␣চ␣ঞ্চ |
ক␣্␣ষ␣্␣ম␣ক্ষ্ম |
Used alongside an example term.
গ্রাম
গ্রা␣গ␣্␣র␣া |
Sample_text_goes_here
Source: Universal Declaration of Human Rights - Chakma, articles 1 & 2
Origins of the Oriya script, 1051 – today.
Phoenician
└ Aramaic
└ Brahmi
└ Tamil-Brahmi
└ Pallava
└ Grantha
└ Malayalam
+ Tigalari
+ Dhives Akuru
+ Saurashtra
Speakers of the Eastern Cham language number about 132,000 in Bình Thuận, Ninh Thuận, and Đồng Nai provinces in southern Vietnam, as well as in Hồ Chí Minh City. The Ethnologue estimates the L1 literacy rate to be 5%–10%. The Cham script is the primary orthography for the Eastern Cham, but the largely muslim Western Cham peoples of Cambodia prefer the Arabic script.eth Historically, the Eastern Cham script was learned by boys once they reached a certain age, but not by women and girls.ws
ꨀꨇꩉ ꨌꩌ
Coming to Southeast Asia with the expansion of Indian religions, Cham was one of the first scripts to develop from the Pallava script some time around 200 CE.ws
More information: Wikipedia • Endangered Alphabets
The XXXX script is an abugida, ie. each consonant contains an inherent vowel sound. See the table to the right for a brief overview of features for the modern ZZZZ orthography.
The XXXX script is an alphabet, ie. consonants and vowels are written separately. See the table to the right for a brief overview of features for the ZZZZ language.
The XXXX script is an abjad, ie. short vowels are not normally written. See the table to the right for a brief overview of features for the ZZZZ language.
The ZZZZ XXXX orthography is derived from the Arabic/Persian abjads, where in normal use the script represents only consonant and long vowel sounds. However, the script has been adapted in this orthography in order to cope with the many more vowels sounds in Kashmiri, and this is one of the Arabic orthographies that regularly indicates all vowel sounds, making it an alphabet. See the table to the right for a brief overview of features for the modern Kashmiri orthography using the Arabic script.
XXXX text runs left-to-right in horizontal lines, but numbers and embedded Latin text are read left-to-right. There is no case distinction. Words are separated by spaces.
Modern XXXX represents native consonant sounds using 19 basic letters and 6 aspirated digraphs, but can use 13 more consonants to spell words loaned from Persian, Arabic and Urdu.
A special letter is used to indicate palatalisation, which is common in Kashmiri. Similarly to the other yeh used in Kashmiri, it has a circle below when used in syllable onsets, and a swash with no circle after a syllable coda.
Unlike other Arabic orthographies, 0652 (jazm), normally used to show vowel absence, is placed over the second consonant in an onset cluster (such as tr). That letter may therefore carry both the jazm diacritic and a vowel diacritic, which is quite unusual.
❯ basicV
Kashmiri is an alphabet where 16 vowel sounds (far more than in Arabic or Persian) are written using a mixture of 10 combining marks and 10 letters. Unlike Arabic, Persian, and Urdu, all vowel diacritics are always visible in Kashmiri texts. Representation of vowel sounds is complicated because the code points used to represent a given vowel typically differ according to its position within a word.
The distinction between ijam vs. tashkil has a bearing on several Kashmiri graphemes, and the choice between precomposed and decomposed realisations of a vowel letter can be complicated (see encoding).
Word-initial standalone vowels are preceded by or attached to either 0627 or 0639.
The jazm is used over a word-medial 0646 to indicate nasalisation of a preceding vowel sound. Apart from this use and that for medial consonants, it is not typically used to indicate vowel absence.
Kashmiri uses native digits, and Arabic code points for several of the more common punctuation marks.
The following represents the repertoire of the ZZZZ language.
Click on the sounds to reveal locations in this document where they are mentioned.
Phones in a lighter colour are non-native or allophones. Source Wikipedia.
ia iɯʔ iə | ua uə | |
ea | oa | |
ɛə | ɔə | |
auʔ aɪ aʊ | ||
Thai is a contour tone language, with 5 tones: high, mid, low, falling, & rising.
tbd
tbd
Click on the characters to find where they are mentioned in this page.
The ZZZZ alphabet has 23 consonants and 5 vowels. Each has upper and lowercase forms; shown above and below, respectively.
This table only summarises basic vowel to character assignments.
ⓘ represents the inherent vowel. Diacritics are added to the vowels to indicate nasalisation (not shown here).
post-consonant | word-final standalone | word-medial standalone | word-initial standalone | |
---|---|---|---|---|
Simple: | ||||
Vocalics: |
For additional details see vowel_mappings.
ᨠ ka U+1A20 LETTER HIGH KA
The inherent vowel for XXXXXX is pronounced a or ʌ. So ka is written by simply using the consonant letter.
The inherent vowel is not usually pronounced at the end of a word.
The suppression of the inherent vowel in a syllable coda is indicated using the 𑜫 diacritic (see the examples above).
ᬓᬶ ki U+1B13 BALINESE LETTER KA + U+1B36 BALINESE VOWEL SIGN ULU
Balinese uses the following dedicated combining marks for vowels. They may be used on their own, or in combination with others (see composite_vowels). They are all vowel signs.
To represent the sounds rə or lə, Balinese uses vocalic letters. A sequence such as *ᬭᭂ [U+1B2D BALINESE LETTER RA + U+1B42 BALINESE VOWEL SIGN PEPET] is not used. See vocalics.
Six of the vowel signs are spacing marks, meaning that they consume horizontal space when added to a base consonant.
All vowel signs are typed and stored after the base consonant, and the glyph rendering system takes care of the positioning at display time. The glyphs used to represent vowels, whether alone or in multipart vowels, are arranged around a syllable onset, which may be 2 consonants, rather than just around the immediately preceding consonant. See prebase and circumgraphs.
something
These are the dedicated vowel letters.
Kashmiri uses the following consonant characters to write vowels, generally in combination with diacritics, but also alone when representing eː and oː in non-initial positions.
Multipart vowels are only produced when text is decomposed; 5 of the circumgraphs split off the 1B35 glyph, to create the following pairs:
The short i sound is written using 𑐶 [U+11436 NEWA VOWEL SIGN I], which appears to the left of the base consonant letter or cluster.
ꤷꥉꤼꥉꤿ꥓
This combining mark is always typed and stored after the base consonant. The rendering process places the glyph before the base consonant at the time of display.
When an orthographic syllable begins with a consonant cluster that is rendered as a conjunct, the vowel sign is rendered before the start of the syllable, eg. here are 3 sets of consonant clusters, each followed by i when spoken, but the vowel sign appears to the left of each cluster.𑐗𑑂𑐏𑐶 𑐳𑑂𑐟𑐶 𑐧𑑂𑐬𑐶 jkhi sti bri
When an orthographic syllable begins with a consonant cluster, the vowel sign is actually placed before the start of the onset, ie. to the left of the syllable as a whole. See fig_prebase.
ꨚꨴꨯꨱꩃ
However, note that if the cluster is split by a visible virama, this creates two orthographic syllables and the pre-base vowel sign appears after the consonant with the virama.
ᬓᭀ ko U+1B13 BALINESE LETTER KA + U+1B40 BALINESE VOWEL SIGN TALING TEDUNG
Five vowel signs are usually produced by a single combining character with visually separate parts, that appear on different sides of the consonant onset.
This section includes some vowel signs described in the section vocalics.
Like pre-base glyphs, these are combining marks that are always stored after the base consonant. When rendered, the single code point produces multiple glyphs, which are placed on different sides of the base consonant. Click on fig_circumgraphs to see the sequence of characters in storage.
ꤷꥇꥆꥋꥆ
Glyphs can appear on up to 3 sides of the base. Some of the glyphs merge with the base character's glyph (see context).
These circumgraphs have canonically equivalent decomposed forms (see vs_encoding).
tbd
tbd
Multipart vowels are only produced when text is decomposed; 5 of the circumgraphs split off the 1B35 glyph, to create the following pairs:
Kirat Rai has no independent vowel letters, but instead uses as a carrier for standalone vowels. This can represent a zero or glottal stop onset. The vowel to be pronounced is indicated by attaching a vowel sign, eg.
Used alone, this letter represents the standalone vowel a.
The following list shows how to write each of the standalone vowels.
Standalone vowels are written using the normal vowel signs with no special additional mechanisms, eg.
𞋀𞋊𞋞
tbd
This section maps Fula vowel sounds to common graphemes in the Adlam orthography.
Important notes.
Sounds listed as 'infrequent' are allophones, or sounds used for foreign words, etc. Light coloured characters occur infrequently.
mixed ີ
mixed ິ
1752
ᝐᝒᝆ
1741
ᝁᝇᝓ
1753
ᝎᝓᝆᝓ
1742
ᝂᝇᝓ
description
This table only summarises basic consonant to character assignments.
The left column is lowercase, and the right uppercase. A number of letters have allophones which are not shown here, and consonant clusters can be unpredictable in pronunciation. See the following sections for details. Normal letters are used as final consonants, but the right-hand column lists some additional, dedicated finals.
post-consonant | word-final standalone | word-medial standalone | word-initial standalone | |
---|---|---|---|---|
Onsets | ||||
Finals: |
For additional details see consonant_mappings.
Basic consonant sounds in Bengali are written using the following letters
Click on each letter for usage notes, alternative pronunciations, and for examples of usage.
tbd
tbd
tbd
The absence of a vowel sound between two or more consonants is visually indicated in one of the following ways.
In Unicode, the conjunct formation is achieved by adding ୍ [U+0B4D ORIYA SIGN VIRAMA] between the consonants. The font hides the virama glyph automatically when a conjunct is formed.
See also finals and gemination.
tbd
tbd
tbd
tbd
tbd
tbd
tbd
This section maps Fula consonant sounds to common graphemes in the Adlam orthography.
Important notes.
Sounds listed as 'infrequent' are allophones, or sounds used for foreign words, etc. Light coloured characters occur infrequently.
onset ပ
onset ဖ
onset ဘ sometimes at the beginning of words or particles.
tbd
tbd
tbd
This section offers advice about characters or character sequences to avoid, and what to use instead. It takes into account the relevance of Unicode Normalisation Form D (NFD) and Unicode Normalisation Form C (NFC)..
Although usage is recommended here, content authors may well be unaware of such recommendations. Therefore, applications should look out for the non-recommended approach and treat it the same as the recommended approach wherever possible.
One letter only can be represented as an atomic character (the norm), or as a sequence of base letter plus combining mark. The parts are separated in Unicode Normalisation Form D (NFD), and recomposed in Unicode Normalisation Form C (NFC), so both approaches should be treated as canonically equivalent.
Atomic (recommended) | Decomposed ( NOT recommended ) |
---|---|
ێ | 064A 0654 |
Note that the base character in the decomposed sequence is 064A, and not 06CC, which is used elsewhere for yeh in Kurmanji. 064A is only used for this specific decomposed sequence; it is inappropriate to use it elsewhere in Kurmanji text.
A couple of Sorani letters are encoded in different ways in different texts, and in fact can be mixed within the same text. At the moment there doesn't appear to be a clear ruling on which is expected. This section lists the alternatives.
Alternative 1 | Alternative 2 | Notes |
---|---|---|
06BE | 0647 | Wikipedia and Wiktionary represent the sound h using only HEH DOACHASHMEE, whereas gov.krd uses only ARABIC HEH. Other online resources examined are not completely one or the other. Note that Uighur uses HEH DOACHASHMEE to represent h. |
06D5 | 0647 200E |
The majority usage seems to favour AE, although again some resources mix both to some degree (although they tend to mostly use AE). This makes sense, since the use of HEH plus ZWNJ has the appearance of a hack intended to prevent HEH joining to the left, whereas AE will do this naturally, without any formatting code point. Use of AE also gets around practical difficulties that arise because the ZWNJ character is invisible and is not readily accessible from many keyboards – difficulties that are amplified by the fact that this vowel letter is one of the most commonly used letters in the Sorani alphabet. Note that Uighur also uses AE to represent a vowel, and a different character for the sound h. |
This section lists characters that may be mistakenly rendered using sequences that look the same as or similar to the atomic code points available for Bengali.
This table lists characters that are often mistakenly used because they look the same as or similar to the code points used for Kurmanji, or perhaps because the correct character is not available on the user's keyboard.
Incorrect | Correct | Notes |
---|---|---|
064A | 06CC | The Arabic YEH doesn't drop the dots below in isolate and final positions. |
0643 | 06A9 | Common fonts tend not to show the difference between these two characters, but the ability to search and compare text is impaired unless the application is aware of and takes counter-measures against this substitution. |
This information draws on the DoNotEmit tables.
The following atomic characters look as if they could be composed of parts, but in fact there is no equivalence during normalisation, and so the atomic characters only should be used.
Atomic | Sequence ( DO NOT use! ) |
---|---|
ڵ | 0644 065A |
ێ | 06CC 065A |
ۆ | 0648 065A |
Combining marks always follow the based character, however for Sorani, which doesn't normally use combining marks, this is only relevant for ێ when it occurs in decomposed text.
Where present, characters in a syllable should always occur in the following order.
U+200C ZERO WIDTH NON-JOINER (ZWNJ) can be used to force the production of a visible virama, rather than a half-form (see visiblevirama). It can also be used to prevent the formation of vowel ligatures (see vowelligatures).
U+200D ZERO WIDTH JOINER (ZWJ) is used to produce special joining forms for YA (see consonant_syllable).
@ XXXX has a set of native, decimal digits
The CLDR standard-decimal pattern is #,##,##0.###
. The standard-percent pattern is #,##,##0%
.c
However, unlike other right-to-left scripts such as Arabic, Hebrew, Thaana, the numbers are displayed right-to-left, with the most significant digit first. This means that numbers don't produce bidirectional text in N'Ko.
ߝߌߟߊߣߊ߲ ߕߋ߬ߟߋ߫ ߂߇-߂߈/߂߀߁߀ ߕߊ߬ߡߌ߲߬ߣߍ߲ ߠߊ߫߸
Ordinal numbers are produced using diacritics.
The first ordinal is produced using ߭ [U+07ED NKO COMBINING SHORT RISING TONE], eg. ߁߭ first.
Others use ߲ [U+07F2 NKO COMBINING NASALIZATION MARK], eg. ߂߲ second. When there are multiple digits in a number, the diacritic appears only under the last in sequence, eg. ߁߂߃߲ 123rd.
The CLDR standard format for currency is ¤#,##0.00;¤-#,##0.00
, and the symbol for the Lao currency, Kip, is ₭ [U+20AD KIP SIGN].
@ XXXX text runs left to right in horizontal lines.
Show default bidi_class
properties for characters in the XXXXX orthography described here.
Experiment with examples using the XXXX character app.
Wancho letters don't interact, so no special shaping is needed.
Base characters carry only a single combining mark.
Some small adjustments may be needed for the placement of tone marks relative to the base letter, but these are mostly horizontal, since Wancho letters tend to have the same height. fig_gpos shows adjustments to the horizontal position of the tone mark, depending on the shape of the base character. It also shows kerning of the letters.
Bengali fonts need to adapt glyphs based on their context.
Principal areas where context-sensitive shaping is required include conjunct formation and vowel ligation. Positioning of glyphs is also sometimes context-sensitive, particularly where multiple diacritics are applied to a single base.
The rest of this section provides examples of context-sensitive shaping and positioning.
Words are separated by spaces.
Some words are hyphenated.
Graphemes in XXXX consist of single letters or letters with one or two combining marks. This means that text can be segmented into typographic units using grapheme clusters.
XXXX doesn't use word boundaries for text segmentation, relying instead on grapheme boundaries (sometimes called orthographic syllables).
Graphemes in XXXX consist of single letters or letters with a single combining mark.
When applied to XXXX, Unicode grapheme clusters split text into base characters plus any combining marks that follow. Therefore, grapheme clusters can be used to segment XXXX into typographic units.
Phrase, sentence, and section delimiters are described in phrase.
XXXX uses a mixture of ASCII and Arabic punctuation. Other languages using the Arabic script may use different punctuation, such as the full stop in Urdu.
XXX uses ASCII punctuation marks.
phrase |
, ; : |
---|---|
sentence |
. ? ! |
paragraph |
A9CB |
section | Ditto. |
general divider |
A9CA |
See type samples.
Wancho commonly uses ASCII parentheses to insert parenthetical information into text.
start | end | |
---|---|---|
standard | ( |
) |
Lines are generally broken between words.
As in almost all writing systems, certain punctuation characters should not appear at the end or the start of a line. The Unicode line-break properties help applications decide whether a character should appear at the start or end of a line.
Show line-breaking properties for characters in the modern XXXXXX orthography.
The following list gives examples of typical behaviours for certain characters. Context may affect the behaviour of some of these.
Click/tap on the characters to show what they are.
Line breaking should not move a danda or double danda to the beginning of a new line even if they are preceded by a space character.
tbd