The table is intended to provide a general indication only; some of the values could be disputed.
Total characters, etc. This figure is based on the data in the Character Usage Lookup app. When there are 2 figures separated by a + sign, the first indicates how many characters are in standard use, and the second gives the number of additional, infrequently used characters for that language.
Character counts do not include ASCII characters. It is assumed that those are available.
Note also that the character counts reflect the characters needed to represent both precomposed and decomposed versions of content. For example, use of a character such as â would add 3 characters to the total count: â, a, and the combining circumflex. Use of ô would then add a further 2 (because the circumflex is already counted).
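The counting rule can be sketched with Python's standard unicodedata module. This is only an illustration, not the method used by the Character Usage Lookup app: the function name repertoire and the include_ascii flag are assumptions. The flag exists because the worked example above counts the base letter a, while the preceding note excludes ASCII characters. The sketch does not filter out deprecated compositions or decompositions.

```python
import unicodedata

def repertoire(text, include_ascii=False):
    """Return the set of characters needed to write `text` in both
    precomposed (NFC) and decomposed (NFD) form."""
    needed = set(unicodedata.normalize("NFC", text))
    needed |= set(unicodedata.normalize("NFD", text))
    if not include_ascii:
        # ASCII is assumed to be available, so it is left out of the count.
        needed = {c for c in needed if ord(c) > 0x7F}
    return needed

# Counting base letters too, as the worked example above does.
a_circumflex = "\u00e2"  # LATIN SMALL LETTER A WITH CIRCUMFLEX
o_circumflex = "\u00f4"  # LATIN SMALL LETTER O WITH CIRCUMFLEX
print(len(repertoire(a_circumflex, include_ascii=True)))                 # 3
print(len(repertoire(a_circumflex + o_circumflex, include_ascii=True)))  # 5
```

With include_ascii left at its default, the same two calls give 2 and 3, because a and o are no longer counted.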
The 5 columns to the right give character counts for specific types of character, per the Unicode General Category property. These include:
- letters: Most alphabetic and other basic letters.
- marks: The subset of characters that are combining characters. No attempt is made to indicate how many of the base characters each combining character can combine with. In some cases, this will be limited, but in most cases a combining character will combine with a fair number of base characters.
- punctuation: Characters with the general category of punctuation.
- native digits: Non-ASCII numeric digits, but also other numeric characters where applicable (eg. characters for 20 and 30 in Amharic).
- format chars: Characters with the general category 'other'. These are almost always formatting characters, such as those used for bidi control, or joining/non-joining. (These figures currently need to be updated.)
As with the other character counts, these figures exclude ASCII characters, and include characters for any compositions and decompositions that may be applied (unless they are deprecated by the Unicode Standard).
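As a rough illustration of how such per-type counts could be derived, the sketch below groups characters by the first letter of their Unicode general category, using Python's unicodedata module. The COLUMN mapping and the tally function are assumptions made for this example; the table's own figures may have been compiled differently.

```python
import unicodedata
from collections import Counter

# Assumed mapping from the major general category to the column names above.
COLUMN = {
    "L": "letters",
    "M": "marks",
    "P": "punctuation",
    "N": "native digits",
    "C": "format chars",
}

def tally(chars):
    """Count distinct non-ASCII characters per column."""
    counts = Counter()
    for ch in set(chars):
        if ord(ch) < 0x80:
            continue  # ASCII is assumed to be available
        major = unicodedata.category(ch)[0]
        counts[COLUMN.get(major, "uncounted")] += 1
    return counts

# An Ethiopic letter, a combining mark, a full stop, the numbers 20 and 30,
# plus a zero-width non-joiner as a sample format character.
sample = "\u12a0\u135f\u1362\u1373\u1374\u200c"
print(dict(tally(sample)))
```

Symbols and separators (categories starting with S or Z) fall outside the five columns and land in the "uncounted" bucket here.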
Type, etc. The type column indicates whether the orthography is one of the following.
- alpha: An alphabet, ie. vowels are written as letters.
- abjad: An abjad, ie. vowels are normally not written.
- abug: An abugida, ie. consonants carry an inherent vowel which is overridden by vowel-signs to express different vowel sounds.
- syll: A syllabary, ie. characters generally represent a combination of consonant+vowel.
Immediately after the type column there are 4 related columns with a different background that indicate how vowels are represented in the orthography. They cover the following:
- Vowel letters: The orthography represents vowels that follow consonants using standalone letters. The default, l, signifies that vowels are represented by letters after the consonant. An abugida with the value 'l' here and the value 'p' in the column for prescript vowels indicates that we are dealing with an orthography that uses the visual model, eg. Thai. iv indicates that the orthography uses independent vowels for vowel sounds that are not preceded by a consonant. (iv) indicates that there is an incomplete set of independent vowels, so the orthography will use a base character and vowel-signs to represent standalone vowels, too. An abugida with vowel-signs but no independent vowels also indicates that a combination of base and vowel-sign is used for standalone vowel sounds. ml indicates that the vowel letters are matres lectionis (which typically means that they are consonants too).
- Vowel marks: The orthography uses combining marks to represent vowel sounds after a consonant. d indicates that these vowel sounds are represented by diacritics. Where languages such as Arabic and Hebrew normally don't write the diacritics, this column will contain (d). Otherwise you should assume that diacritics normally remain visible. vs indicates that the combining marks are vowel-signs.
- Circumgraphs: Indicates the number of vowels for which the orthography displays vowel glyphs simultaneously on different sides of the base consonant, eg. certain Tamil vowel signs. The number increments for each combination of glyphs used to represent a vowel sound.
- Prescript vowels: Indicates the number of vowels for which vowel signs, or parts of a circumgraph, appear to the left of the base consonant that they modify, eg. the short i in Hindi. The number increments for each combination of glyphs used to represent a vowel sound.
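The difference between a prescript vowel-sign and the visual model shows up in the stored code points. The sketch below, which uses Python's unicodedata module and a hypothetical helper named describe, lists the characters for a Hindi syllable and a Thai syllable. In both cases the vowel is displayed to the left of the consonant, but only Thai stores it first, and as a letter rather than a combining mark.

```python
import unicodedata

def describe(text):
    """List each code point with its Unicode name and general category."""
    return [
        (f"U+{ord(c):04X}", unicodedata.name(c), unicodedata.category(c))
        for c in text
    ]

hindi_ki = "\u0915\u093f"  # consonant stored first, vowel-sign second
thai_ke = "\u0e40\u0e01"   # vowel stored first, consonant second

for label, text in (("Hindi", hindi_ki), ("Thai", thai_ke)):
    print(label)
    for code, name, category in describe(text):
        print(" ", code, name, category)
```

The Devanagari vowel-sign has a mark category (Mc), which corresponds to 'vs' in the vowel marks column, while the Thai vowel has a letter category (Lo), which corresponds to 'l' in the vowel letters column.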
Contextual placement. This is typically related to combining characters, and indicates that a typical font uses OpenType rules to position a glyph according to the glyphs that surround it, eg. tone marks in Thai, or vowel signs in Arabic (if used). Nearly all scripts with combining characters will need some positioning rules to take account of where the combining character should be placed. This indicator is more concerned with whether that location varies significantly, depending on the surrounding context.
Contextual shaping. Whether different glyph shapes have to be used for a character depending on the visual context, eg. the RA in Myanmar that grows and shrinks to fit around the characters it surrounds. Note that this does not include shaping for cursive scripts.
Case sensitive. Whether or not the script makes case distinctions.
Cursive script. Do the letters in this script join up, eg. as in Arabic, N'Ko, or Mongolian?
Text direction. Is this a right-to-left script? (In practice this usually means that bidirectional behaviour needs to be supported, for numbers and embedded foreign text.) Is it used in a vertical orientation?
A value of rtl* indicates that numbers run RTL.
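Directionality can be inspected through the Unicode bidirectional class of each character. The following sketch uses Python's unicodedata.bidirectional; the helper names needs_bidi and digits_run_rtl are assumptions made for this example. N'Ko digits are used only because their bidi class is R in the Unicode data, and the sketch does not claim to reproduce how the table's rtl* values were assigned.

```python
import unicodedata

# Bidi classes assigned to strongly right-to-left characters.
RTL_CLASSES = {"R", "AL"}

def needs_bidi(text):
    """True if the text contains any strongly right-to-left character."""
    return any(unicodedata.bidirectional(c) in RTL_CLASSES for c in text)

def digits_run_rtl(digits):
    """True if every digit is itself a strong right-to-left character."""
    return all(unicodedata.bidirectional(d) == "R" for d in digits)

print(needs_bidi("\u05e9\u05dc\u05d5\u05dd"))  # Hebrew letters: True
print(unicodedata.bidirectional("\u0665"))     # ARABIC-INDIC DIGIT FIVE: AN
print(digits_run_rtl("\u07c1\u07c2"))          # N'Ko digits one and two: True
print(digits_run_rtl("12"))                    # ASCII digits: False
```

Digits with class AN or EN still run left-to-right inside right-to-left text, which is why such scripts need bidirectional support even for plain numbers.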
Baseline. The baseline for Latin text is labelled 'mid'. Scripts that, like Indic scripts, hang from a high baseline are labelled 'high'. Scripts like Chinese are labelled 'low'.
Word separator. A word is a unit of segmentation between the grapheme and the phrase. This column asks whether, as a general rule, there are explicit delimiters for word boundaries. The alternatives are:
- space: Words are separated by spaces, eg. Hebrew.
- wb<char>: Words are visually separated, but by a non-space character, eg. Amharic.
- no: No explicit delimiters, eg. Chinese.
- zwsp: There is no visual delimiter, but a zero-width space may be used, eg. Khmer.
- syllable: Spaces are used, but they separate syllables, not words, eg. Vietnamese or Lisu.
- sb<char>: Again, syllables are separated rather than words, but using a non-space character, eg. Tibetan.
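The delimiter types above can be illustrated with a small splitting routine. This is a minimal sketch: the SEPARATORS table and the segment function are hypothetical names, and only one representative delimiter is shown for each type. Orthographies in the 'no' class have no delimiter to split on and would need dictionary-based or statistical segmentation instead.

```python
# Representative delimiters, keyed by labels invented for this example.
SEPARATORS = {
    "space": " ",
    "ethiopic wordspace": "\u1361",  # a visible, non-space word separator
    "zwsp": "\u200b",                # ZERO WIDTH SPACE, invisible
    "tsek": "\u0f0b",                # TIBETAN MARK INTERSYLLABIC TSHEG
}

def segment(text, separator):
    """Split text at an explicit delimiter, dropping empty pieces."""
    return [piece for piece in text.split(separator) if piece]

# Two Khmer words joined by a zero-width space.
khmer = "\u1797\u17b6\u179f\u17b6\u200b\u1781\u17d2\u1798\u17c2\u179a"
# Two Tibetan syllables, each followed by a tsek.
tibetan = "\u0f56\u0f7c\u0f51\u0f0b\u0f66\u0f90\u0f51\u0f0b"

print(len(segment(khmer, SEPARATORS["zwsp"])))    # 2 words
print(len(segment(tibetan, SEPARATORS["tsek"])))  # 2 syllables
```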
Text wrap. Indicates the primary break point for wrapping lines. It is useful to compare this column with the 'Word separator' column just described. Note that nearly all scripts have rules about which punctuation characters can appear at the end or start of a line. The alternatives are:
- word: Text wraps at word boundaries.
- syllable: Text wraps at syllable boundaries, regardless of whether word boundaries are delimited.
- char: Text wraps immediately after the last character that fits on a line, regardless of word or syllable boundaries.
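To compare the break strategies, here is a simplified sketch of greedy line filling. The wrap function is an assumption made for this example: it measures width in code points rather than rendered glyph widths, and it ignores the rules about which punctuation may start or end a line. The 'syllable' mode would work like 'word' mode with syllables as the units.

```python
def wrap(text, width, mode):
    """Break text into lines of at most `width` code points.

    mode "word": break only at spaces (an overlong word gets its own line).
    mode "char": break after the last character that fits.
    """
    if mode == "char":
        return [text[i:i + width] for i in range(0, len(text), width)]

    lines, current = [], ""
    for word in text.split(" "):
        candidate = word if not current else current + " " + word
        if len(candidate) <= width or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines

print(wrap("one two three four", 9, "word"))  # ['one two', 'three', 'four']
print(wrap("abcdefgh", 3, "char"))            # ['abc', 'def', 'gh']
```

A production implementation of 'char' mode would break between grapheme clusters rather than code points, so that a base character is never separated from its combining marks.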
Hyphenation. Whether or not hyphenation is used with the script – by which is meant the addition of a mark at the end or beginning of a line when a word is broken at line end. Scripts that simply break text at syllable or character boundaries are not classed here as hyphenating. Values include:
- yes: Hyphenation occurs, using a hyphen-like character.
- sp <char>: A similar process to western hyphenation occurs, but using a character that does not look like a hyphen.
- n/a: Words are not broken at the end of a line.
- no: Words are broken at the end of a line, but nothing is added to indicate that the word continues on the next line.
Justification. Indicates the principal method(s) for full justification of text. Higher-end typographic systems will typically apply more than one method, and across whole paragraphs rather than just a single line. Here we simply aim to give an idea of the most common approach, or approaches, where there is a mixture. Alternatives include:
- spaces: Spaces between words or syllables are stretched, eg. Russian. In some orthographies, eg. Thai, the stretched spaces are phrase delimiters rather than word separators.
- characters: Characters are separated by equal amounts of space across a line, eg. Chinese. (In practice, some characters tend to attract this spacing before others.)
- glyphs: Space is introduced between unconnected glyphs, eg. Thai not only adds space around base characters, but also between those base characters and associated vowel-signs that are not combining marks. In Tamil, vowel-signs that don't interact with the base character may be separated in narrow column text when there is only one word on a line, even though the base character and vowel-sign together make a single grapheme cluster.
- elongation: Connections between letters are stretched in cursive scripts. The orthography may also introduce elongated forms for certain characters, eg. swash forms in Arabic. (In fact, Arabic may also introduce ligatures to fit more words on a line.)
- other: Some other mechanism is used, eg. multiple tseks at the line end in Tibetan.
- none: Full justification is not a feature of the language, eg. Balinese.
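The arithmetic behind the first two methods can be sketched as follows. Both functions are hypothetical and assume a monospaced grid in which every character is one cell (or one em) wide, which real typesetting systems do not; they also treat a single line in isolation rather than a whole paragraph.

```python
def justify_spaces(words, width):
    """'spaces' method: stretch the gaps between words to fill `width` cells."""
    gaps = len(words) - 1
    extra = width - sum(len(w) for w in words)
    if gaps == 0 or extra < gaps:
        return " ".join(words)  # nothing to stretch, or the line is too full
    base, remainder = divmod(extra, gaps)
    line = ""
    for i, word in enumerate(words[:-1]):
        # Leftover cells go to the leftmost gaps, one each.
        line += word + " " * (base + (1 if i < remainder else 0))
    return line + words[-1]

def inter_character_gap(n_chars, width_em):
    """'characters' method: equal extra space between adjacent characters,
    assuming each character is 1 em wide."""
    if n_chars < 2:
        return 0.0
    return (width_em - n_chars) / (n_chars - 1)

print(repr(justify_spaces(["ab", "cd", "e"], 8)))  # 'ab  cd e'
print(inter_character_gap(5, 7.0))                 # 0.5
```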
Region. This rough grouping places the script in the region where it originated, so English is in Europe, and Arabic is in West Asia. It serves to give a very rough idea of how things stack up on a regional basis. Each region is one of the following:
- nam (Northern America)
- sam (South America)
- cam (Central America)
- carib (Caribbean)
- eur (Europe - includes Russia to the Urals and Georgia, but not Armenia or Azerbaijan)
- easia (East Asia - includes China, Mongolia, Japan, Korea)
- nasia (Northern Asia - Russia east of the Urals)
- seasia (Southeast Asia - includes Indonesia and the Philippines)
- casia (Central Asia - north of Iran, south of Russia, west of China)
- wasia (Western Asia - includes Armenia, Azerbaijan, Turkey, and the Middle East)
- afr (Africa)
- oce (Oceania - includes Australia, NZ, and the Pacific Islands)