Key
The table is intended to provide a general indication only. Some of the entries are open to dispute.
Total characters. This figure is based on the data in the Character Usage Lookup app. When there are 2 figures separated by a + sign, the first number indicates how many characters are in standard use, the second relates to additional infrequently used characters for that language.
Character counts do not include ASCII characters. It is assumed that those are available.
Note also that the character counts reflect the characters needed to represent both precomposed and decomposed versions of content. For example, use of a character such as â would add 3 characters to the total count: â, a, and the combining circumflex. Use of ô would then add a further 2 (because the circumflex is already counted).
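The counting rule in the example above can be sketched in a few lines of Python. This is only an illustration, not the method used by the Character Usage Lookup app: the function name `repertoire` and the `exclude_ascii` flag are hypothetical, and the sketch makes no attempt to skip compositions deprecated by the Unicode Standard. Because the worked example counts the base letter a, ASCII filtering is left as an optional flag rather than applied by default.

```python
import unicodedata

def repertoire(text, exclude_ascii=False):
    """Characters needed to represent text in both composed and decomposed form."""
    needed = set(unicodedata.normalize("NFC", text))   # precomposed forms
    needed |= set(unicodedata.normalize("NFD", text))  # base letters + combining marks
    if exclude_ascii:
        # Hypothetical option: drop characters assumed to be always available.
        needed = {ch for ch in needed if ord(ch) > 0x7F}
    return needed

print(len(repertoire("\u00e2")))        # â alone: â, a, combining circumflex
print(len(repertoire("\u00e2\u00f4")))  # adding ô contributes only ô and o
```

Using a set means the combining circumflex is counted once, however many base letters it attaches to.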
Clicking on the button Toggle character detail reveals 5 columns to the right that give character counts for specific types of character, per the Unicode general property assignment. These include:
- letters: Most alphabetic and other basic letters.
- marks: The subset of characters that are combining characters. No attempt is made to indicate how many of the base characters each combining character can combine with. In some cases, this will be limited, but in most cases a combining character will combine with a fair number of base characters.
- punctuation: Characters with the general property of punctuation.
- native digits: Non-ASCII numeric digits, but also other numeric characters where applicable (eg. characters for 20 and 30 in Amharic).
- other: Characters with the general property 'other'. These are almost always formatting characters, such as those used for bidi control, or joining/non-joining. (These figures currently need to be updated.)
Like the other character counts, these figures exclude ASCII characters, and include characters for any compositions and decompositions that may be applied (unless they are deprecated by the Unicode Standard).
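One plausible way to derive the five columns is to bucket characters by the first letter of their Unicode general category. The mapping below is an assumption made for illustration (the table's own data may be prepared differently), and `column_for` and `count_columns` are hypothetical names. Characters in categories that the five columns don't cover, such as symbols and separators, are simply left uncounted here.

```python
import unicodedata
from collections import Counter

# Assumed mapping from major general-category class to column name.
COLUMN_BY_MAJOR_CLASS = {
    "L": "letters",
    "M": "marks",
    "P": "punctuation",
    "N": "native digits",
    "C": "other",
}

def column_for(ch):
    """Return the detail column for a character, or None if it is not counted."""
    if ord(ch) < 0x80:
        return None  # ASCII is excluded from the counts
    return COLUMN_BY_MAJOR_CLASS.get(unicodedata.category(ch)[0])

def count_columns(chars):
    """Count distinct characters per detail column."""
    columns = (column_for(ch) for ch in set(chars))
    return Counter(col for col in columns if col is not None)

# Devanagari ka, vowel-sign i, danda, digit one, plus zero-width joiner.
print(count_columns("\u0915\u093f\u0964\u0967\u200d"))
```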
Type. The type column indicates whether the orthography is one of the following.
- alpha: An alphabet, ie. vowels are written as letters.
- abjad: An abjad, ie. vowels are normally not written.
- abug: An abugida, ie. consonants carry an inherent vowel which is overridden by vowel-signs to express different vowel sounds.
- syll: A syllabary, ie. characters generally represent a combination of consonant+vowel.
Clicking on the button Toggle type details reveals a number of related columns with a different-coloured background that indicate how various features are represented in the orthography. They cover the following:
- Inherent vowel: The consonants in the orthography carry an inherent vowel sound, which, when needed, can be changed using vowel-signs or nullified by a particular character. The number represents the number of sounds the inherent vowel represents.
- Vowel letters: The orthography represents vowels pronounced after consonants using standalone characters.
- l signifies that vowels are represented by letters.
- iv indicates that the orthography uses independent vowels for vowel sounds that are not preceded by a consonant.
- ml indicates that the vowel letters are matres lectionis (which typically means that they are consonants too).
- Vowel marks: The orthography uses combining marks to represent vowel sounds after a consonant.
- d indicates that these vowel sounds are represented by diacritics that are always visible.
- (d) is used where languages such as Arabic and Hebrew can write the diacritics, but normally don't.
- vs indicates that the combining marks are vowel-signs such as those used in Brahmi-derived scripts. These tend to be larger, and have more complex behaviours than simple diacritics.
- Vowel base: An entry in this column indicates that the orthography represents standalone vowels using a base character plus vowel-sign. The column shows the character(s) and the Unicode name.
- Pre-base vletters: If the orthography uses ordinary spacing letters before a consonant to represent a sound that occurs after the consonant, you will see the number of those characters here. A number in this column indicates that we are dealing with a script such as Thai or Lao, which uses visual ordering for prescript vowels, rather than combining characters.
- Pre-base vmarks: In this case, the orthography uses combining characters after the base consonant to represent vowel sounds, but the glyph for that character appears to the left of the consonant itself, eg. the short i in Hindi.
- Circumgraphs: The number of vowels that are represented by a single combining character, but for which the orthography displays multiple glyphs simultaneously on different sides of the base consonant, eg. certain Tamil vowel signs.
- Composite vowels: How many vowels are represented by using more than one vowel-related character with a single base character. This is often seen in Southeast Asian scripts, such as Thai.
- Vocalics: How many vocalic sounds are used by the script in common, modern-day usage. This number is not doubled when both an independent vowel and a vowel-sign exist for the same vocalic letter. It represents the number of sounds for which there are special letters.
- Clusters: Where consonant clusters are represented in a special way by the orthography, this column indicates the more common strategies, including the following:
- s stacked consonants.
- c conjoined consonants.
- t touching consonants.
- l consonants that ligate.
- v a visible virama (when it is the common way of indicating clusters).
- d a diacritic other than a visible virama is used, eg. the sukun for Arabic.
- r indicates that there are special rules for the handling of 'r'.
See the next column for viramas that produce conjuncts but are invisible.
- Invisible vkiller: Indicates that the orthography uses a virama or similar character to signal conjunct clusters, but the character disappears in the process. (The consonants are stacked, or merged in some other way to indicate that they are a cluster without the need for another mark.) This is really a special case of the previous column.
- Medials: The orthography uses dedicated combining or other characters to represent the second consonant in a syllable-initial cluster. Medials represented by simple letters or conjuncts are not included here. Click on 'yes' to see detailed types in the bottom right corner of the window. Abbreviations have the following meanings:
- cm combining mark.
- sj subjoined letter.
- let other, dedicated letter.
- Finals: The orthography uses dedicated combining or other characters to represent syllable or word final consonants. Click on 'yes' to see detailed types in the bottom right corner of the window. Abbreviations have the following meanings:
- cm combining mark.
- let dedicated letter.
- vk a final letter that has a vowel-killer diacritic attached.
- ss superscripts, eg. in Canadian syllabics.
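The difference between the Pre-base vletters and Pre-base vmarks entries above shows up in the stored order of code points. The following sketch prints the code points for a Thai syllable and a Hindi syllable; the helper name `describe` is hypothetical, and the two sample strings were chosen for illustration only.

```python
import unicodedata

def describe(text):
    """List (code point, general category, name) for each character in storage order."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
            for ch in text]

thai_ke = "\u0e40\u0e01"   # Thai: the vowel is an ordinary letter stored before the consonant
hindi_ki = "\u0915\u093f"  # Hindi: the vowel-sign is a combining mark stored after the consonant

for sample in (thai_ke, hindi_ki):
    for row in describe(sample):
        print(row)
```

In both cases the vowel glyph is displayed to the left of the consonant, but only the Thai text stores it first.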
Bicameral. Whether or not the script makes case distinctions.
If the result is inside parentheses, something similar to case conversion applies, but it operates in a slightly different way. See below:
- Javanese has something that approximates case alternatives for some characters only, but there are no algorithms to convert from one 'case' to another. See more detail.
- Georgian case forms are used as normal vs all-caps, and all-caps is applied to a whole word. However, Unicode has data to enable algorithms to convert between the 'cases'. See more detail.
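The contrast between the two parenthesised cases can be checked against the case mappings bundled with a programming language. The sketch below uses Python's built-in string methods; `has_case_mapping` is a hypothetical helper, and the result depends on the Unicode version of the Python build (the Georgian mapping assumes Unicode 11 or later, ie. Python 3.7 or later).

```python
def has_case_mapping(ch):
    """True if the Unicode data in this Python build maps ch to another case."""
    return ch.upper() != ch or ch.lower() != ch

georgian_an = "\u10d0"  # GEORGIAN LETTER AN (Mkhedruli)
javanese_na = "\ua9a4"  # JAVANESE LETTER NA

print(has_case_mapping(georgian_an))  # Unicode provides a mapping to the all-caps form
print(has_case_mapping(javanese_na))  # no algorithmic mapping exists
```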
Cursive script. Do the letters in this script join up, eg. as in Arabic, N'Ko, or Mongolian?
Text direction. Is this a right-to-left script (which actually usually means that bidirectional behaviour needs to be supported, for numbers and embedded foreign text)? Is it used in a vertical orientation?
A value of rtl* indicates that number digits are read RTL.
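As a rough illustration of why right-to-left usually implies bidirectional support, the sketch below looks up the Unicode bidirectional class of each character and reports the direction of the first strongly directional one. The function name is hypothetical, and this is a simplification: it is not the Unicode Bidirectional Algorithm and it says nothing about vertical text.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # right-to-left, and Arabic letter

def first_strong_direction(text):
    """Return 'rtl' or 'ltr' from the first strongly directional character, else None."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            return "rtl"
        if bidi == "L":
            return "ltr"
    return None  # digits, spaces and punctuation carry no strong direction

print(first_strong_direction("\u05e9\u05dc\u05d5\u05dd"))  # Hebrew letters
print(first_strong_direction("123"))                       # digits only
```

Digits have their own classes (for example EN for European numbers), which is why numbers embedded in right-to-left text need special handling.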
Baseline. The baseline for Latin text is labelled 'mid'. Scripts that, like Indic scripts, hang from a high baseline are labelled 'high'. Scripts like Chinese are labelled 'low'.
Word separator. A word is a unit of segmentation between the grapheme and the phrase. This column asks whether, as a general rule, there are explicit delimiters for word boundaries. The alternatives are:
- space: Words are separated by spaces, eg. Hebrew.
- wb: Words are visually separated, but by a non-space character, eg. Amharic.
- no: No explicit delimiters, eg. Chinese. When the value is followed by an asterisk, the language allows stacking of word-final consonants and following word-initial consonants (ie. separating words for line-breaks or highlighting doesn't work well, since stacks can't be split).
- zwsp: There is no visual delimiter, but a zero-width space may be used, eg. Khmer.
- syllable: Spaces are used, but they separate syllables, not words, eg. Vietnamese or Lisu.
- sb: Again, syllables are separated rather than words, but using a non-space character, eg. Tibetan.
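The values above can be read as a choice of delimiter for a simple splitter. The sketch below is an assumption-laden illustration: the table does not name the delimiter characters, so the Ethiopic wordspace (U+1361) and the Tibetan tsek (U+0F0B) are picked here as plausible examples, and `DELIMITERS` and `split_units` are hypothetical names.

```python
# Assumed delimiter for each value of the Word separator column.
DELIMITERS = {
    "space": " ",
    "wb": "\u1361",    # ETHIOPIC WORDSPACE, as one example of a visible non-space separator
    "zwsp": "\u200b",  # ZERO WIDTH SPACE
    "syllable": " ",   # spaces, but the units are syllables
    "sb": "\u0f0b",    # TIBETAN MARK INTERSYLLABIC TSHEG
}

def split_units(text, separator):
    """Split text into words or syllables according to the separator type."""
    delimiter = DELIMITERS.get(separator)
    if delimiter is None:
        # "no": there is nothing to split on; real segmentation needs a
        # dictionary or other linguistic analysis.
        return [text]
    return [unit for unit in text.split(delimiter) if unit]

tibetan = "\u0f56\u0f7c\u0f51\u0f0b\u0f61\u0f72\u0f42"
print(split_units(tibetan, "sb"))
```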
Text wrap. Indicates the primary break point for wrapping lines. It is useful to compare this column with the 'Word separator' column just described. Note that nearly all scripts have rules about which punctuation characters can appear at the end or start of a line. The alternatives are:
- word: Text wraps at word boundaries.
- syllable: Text wraps at syllable boundaries, regardless of whether word boundaries are delimited.
- char: Text wraps immediately after the last character that fits on a line, regardless of word or syllable boundaries.
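The difference between the 'word' and 'char' values can be illustrated with two small functions. This is a simplified sketch with hypothetical function names: it measures width in code points rather than grapheme clusters or rendered width, and it ignores the rules about which punctuation may start or end a line.

```python
import textwrap

def wrap_by_char(text, width):
    """Break immediately after the last character that fits, ignoring word boundaries."""
    return [text[i:i + width] for i in range(0, len(text), width)]

def wrap_by_word(text, width):
    """Break only at spaces; a word longer than the line is left unbroken."""
    return textwrap.wrap(text, width=width, break_long_words=False)

print(wrap_by_char("abcdefgh", 3))
print(wrap_by_word("the quick brown fox", 10))
```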
Hyphenation. Whether or not hyphenation is used with the script. Hyphenation here means that lines are initially broken at word boundaries, and words are then split at the end of a line as a secondary mechanism for line-breaking, in order to make justified paragraphs look better. Scripts may use other visual cues than a hyphen, and may sometimes use no visual indicator that the word was broken. Values include:
- yes <char>: Hyphenation occurs, using the character indicated as a visual marker of the word break.
- (yes) <char>: Hyphenation occurs but is rare.
- yes ∅: Words are broken to fit at the line end, but no visual indicator is added to indicate that the word continues on the next line.
- no: The primary line-break algorithm involves word boundaries, but words are not broken at the end of a line.
- n/a: The primary line-break algorithm takes no account of word boundaries (eg. Japanese, Thai, etc.).
↵ indicates the line-break. For example, the cell for Mongolian shows "↵᠆", which indicates that the Todo soft hyphen appears at the start of the second line, rather than at the end of the first. If Polish were in the list, you would see -↵-, which indicates that the hyphen appears both at the end and beginning of the line.
* indicates that although the visual marker looks like a hyphen, it is actually a different character.
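The ↵ notation can be read as a small template: whatever precedes ↵ is added at the end of the first line, and whatever follows it is added at the start of the next. The sketch below applies such a template to the two halves of a broken word. The function name and the assumption that a cell contains only the marker characters are illustrative; a cell with no ↵ is treated here as a marker at the end of the first line.

```python
BREAK = "\u21b5"  # the ↵ sign used in the table cells

def apply_break(first_part, second_part, cell):
    """Return the text at the end of one line and at the start of the next."""
    before, _, after = cell.partition(BREAK)
    return first_part + before, after + second_part

print(apply_break("abc", "def", "-\u21b5-"))      # hyphen on both lines
print(apply_break("abc", "def", "\u21b5\u1806"))  # marker only at the start of the next line
```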
Justification. Indicates the principal method(s) for full justification of text. Higher-end typographic systems will typically apply more than one method, and across whole paragraphs rather than just a single line. Here we simply aim to give an idea of the most common approach, or approaches, where there is a mixture. Alternatives include:
- sp: Spaces between words or syllables are stretched, eg. Russian. In some orthographies, eg. Thai, the stretched spaces are phrase delimiters, rather than around words.
- ic: Characters are separated by equal amounts of space across a line, eg. Chinese. (In practice, some characters tend to attract this spacing before others.)
- ig: Space is introduced between unconnected glyphs, eg. Thai not only adds space around base characters, but also between those base characters and associated vowel-signs that are not combining marks. In Tamil, vowel-signs that don't interact with the base character may be separated in narrow column text when there is only one word on a line, even though the base character and vowel-sign together make a single grapheme cluster.
- str: Connections between letters are stretched in cursive scripts. The orthography may also introduce elongated forms for certain characters, eg. swash forms in Arabic. (In fact, Arabic may also introduce ligatures to fit more words on a line.)
- sw: Some letters are given lengthened glyph shapes to fill up space, such as in Arabic.
- pad: Characters are repeated to pad out remaining space at the end of a line, eg. multiple tseks at the line end in Tibetan.
- none: Full justification is not a feature of the language, eg. Balinese.
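As an illustration of the 'sp' method only, the following sketch stretches the spaces between words so that a line fills a fixed width. It assumes a monospaced setting where width is counted in characters, which is a simplification of what typographic systems do, and the function name `justify_sp` is hypothetical.

```python
def justify_sp(words, width):
    """Fully justify one line by distributing extra spaces between words."""
    gaps = len(words) - 1
    text_len = sum(len(word) for word in words)
    if gaps <= 0 or text_len + gaps > width:
        return " ".join(words)  # nothing to stretch, or the line is already too long
    spaces, extra = divmod(width - text_len, gaps)
    parts = []
    for i, word in enumerate(words[:-1]):
        # The leftmost gaps absorb any remainder, one extra space each.
        parts.append(word + " " * (spaces + (1 if i < extra else 0)))
    parts.append(words[-1])
    return "".join(parts)

print(repr(justify_sp(["ab", "cd", "ef"], 9)))
```

An 'ic' approach would instead distribute the extra space between every pair of characters rather than only at word gaps.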
Region. This rough grouping places the script in the region where it originated, so English is in Europe, and Arabic is in West Asia. It serves to give a very rough idea of how things stack up on a regional basis. Regions are one of the following:
- nam (Northern America)
- sam (South America)
- cam (Central America)
- carib (Caribbean)
- eur (Europe - includes Russia to Urals and Georgia, but not Armenia or Azerbaijan)
- easia (East Asia - includes China, Mongolia, Japan, Korea)
- nasia (Northern Asia - Russia east of Urals)
- seasia (Southeast Asia - including Indonesia, Philippines)
- casia (Central Asia - north of Iran, S of Russia, W of China)
- wasia (Western Asia - includes Armenia, Azerbaijan, Turkey, & the Middle East)
- afr (Africa)
- oce (Oceania - includes Australia, NZ, and Pacific Islands)