Script comparison table

Updated 13-Jun-2020 • tags scriptnotes.

This page provides information about the orthographic and typographic characteristics of a number of languages. It is not intended to be exhaustively scientific – merely to give a basic idea of which languages require which types of feature support. The 'details' symbol after a script name points to a page that gives a quick summary of the script.

Click on the column headings to sort by that column. See the key to look up abbreviations. If you can no longer see the headings, mousing over or clicking/tapping on cells will show the column names.

 

Key

The table is intended to provide a general indication only. There are things that could be disputed.

Total characters, etc. This figure is based on the data in the Character Usage Lookup app. When there are 2 figures separated by a + sign, the first number indicates how many characters are in standard use, and the second relates to additional infrequently used characters for that language.

Character counts do not include ASCII characters. It is assumed that those are available.

Note also that the character counts reflect the characters needed to represent both precomposed and decomposed versions of content. For example, use of a character such as â would add 3 characters to the total count: â, a, and the combining circumflex. Use of ô would then add a further 2 (because the circumflex is already counted).
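The counting rule in the â and ô example can be sketched in Python as follows. This is only an illustration, not the method used by the Character Usage Lookup app: the function name chars_needed is hypothetical, and the sketch assumes that "precomposed and decomposed versions" correspond to Unicode normalization forms NFC and NFD. To match the figures in the example, it keeps base letters such as a and o in the tally.

```python
import unicodedata

def chars_needed(text):
    """Return the set of characters needed to write `text` in both
    precomposed (NFC) and decomposed (NFD) form.

    Hypothetical helper for illustration only. Base letters are kept
    in the set so that the result matches the worked example.
    """
    needed = set()
    for form in ("NFC", "NFD"):
        # Normalize the text and record every character that appears.
        needed.update(unicodedata.normalize(form, text))
    return needed

# U+00E2 is a with circumflex, U+00F4 is o with circumflex.
first = chars_needed("\u00e2")
both = chars_needed("\u00e2\u00f4")

print(len(first))              # characters needed for the first letter alone
print(len(both) - len(first))  # further characters added by the second letter
```

Because the combining circumflex (U+0302) is already in the set after the first letter, the second letter contributes only its precomposed form and its base letter.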

The 5 columns to the right give character counts for specific types of character, per the Unicode general category property. These include:

As for the other character counts, these figures exclude ASCII characters, and include characters for any compositions and decompositions that may be applied (unless they are deprecated by the Unicode Standard).
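A rough way to produce per-type counts of this kind is to group characters by their Unicode general category. The sketch below is an assumption-laden illustration: the function name count_by_major_category is hypothetical, the grouping by the first letter of the category code (L for letters, M for marks, N for numbers, and so on) is a simplification that may not match the table's columns exactly, and the handling of compositions and decompositions is left out.

```python
import unicodedata
from collections import Counter

def count_by_major_category(chars):
    """Count distinct non-ASCII characters per major general category.

    Hypothetical helper: the key is the first letter of the value
    returned by unicodedata.category(), e.g. 'Lo' -> 'L', 'Mn' -> 'M'.
    """
    counts = Counter()
    for ch in set(chars):
        if ord(ch) <= 0x7F:
            continue  # ASCII is assumed to be available, so it is skipped
        counts[unicodedata.category(ch)[0]] += 1
    return counts

# Sample: a with circumflex, combining circumflex, Thai KO KAI,
# Thai MAI EK (tone mark), Thai digit one, and ASCII 'a'.
sample = "\u00e2\u0302\u0e01\u0e48\u0e51a"
print(dict(count_by_major_category(sample)))
```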

Type, etc. The type column indicates whether the orthography is one of the following.

Immediately after the type column there are 4 related columns with a different background that indicate how vowels are represented in the orthography. They cover the following:

Contextual placement. This is typically related to combining characters, and indicates that a typical font uses OpenType rules to position a glyph according to the glyphs that surround it, eg. tone marks in Thai, or vowel signs in Arabic (if used). Nearly all scripts with combining characters will need some positioning rules to take account of where the combining character should be placed. This indicator is more concerned with whether that location varies significantly, depending on the surrounding context.

Contextual shaping. Whether different glyph shapes have to be used for a character depending on the visual context, eg. the RA in Myanmar that grows and shrinks to fit around the character it surrounds. Note that this does not include shaping for cursive scripts.

Case sensitive. Whether or not the script makes case distinctions.

Cursive script. Do the letters in this script join up, eg. as in Arabic, N'Ko, or Mongolian?

Text direction. Is this a right-to-left script (which usually means that bidirectional behaviour needs to be supported, for numbers and embedded foreign text)? Is it used in a vertical orientation?

A value of rtl* indicates that numbers run RTL.
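Whether a piece of text needs right-to-left handling can be estimated from the Unicode bidirectional class of its characters. The following is a minimal sketch, not how the table's values were derived: the function name has_strong_rtl is hypothetical, and it assumes that a strong class of R or AL is a sufficient signal. It does not implement the Unicode Bidirectional Algorithm.

```python
import unicodedata

# Strong right-to-left bidirectional classes: R (eg. Hebrew letters)
# and AL (Arabic letters).
RTL_CLASSES = {"R", "AL"}

def has_strong_rtl(text):
    """Return True if any character in `text` has a strong RTL class.

    Hypothetical helper for illustration only.
    """
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)

print(has_strong_rtl("\u0633\u0644\u0627\u0645"))      # Arabic letters
print(has_strong_rtl("abc \u05e9\u05dc\u05d5\u05dd"))  # Latin with embedded Hebrew
print(has_strong_rtl("hello 123"))                     # Latin letters and digits only
```

Text that mixes the two kinds of character, as in the second call, is the case where bidirectional behaviour has to be supported.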

Baseline. The baseline for Latin text is labelled 'mid'. Scripts that, like Indic scripts, hang from a high baseline are labelled 'high'. Scripts like Chinese are labelled 'low'.

Word separator. A word is a unit of segmentation between the grapheme and the phrase. This column asks whether, as a general rule, there are explicit delimiters for word boundaries. The alternatives are:

Text wrap. Indicates the primary break point for wrapping lines. It is useful to compare this column with the 'Word separator' column just described. Note that nearly all scripts have rules about which punctuation characters can appear at the end or start of a line. The alternatives are:

Hyphenation. Whether or not hyphenation is used with the script – by which is meant the addition of a mark at the end or beginning of a line when a word is broken at line end. Scripts that simply break text at syllable or character boundaries are not classed here as hyphenating. Values include:

Justification. Indicates the principal method(s) for full justification of text. Higher-end typographic systems will typically apply more than one method, and across whole paragraphs rather than just a single line. Here we simply aim to give an idea of the most common approach, or approaches, where there is a mixture. Alternatives include:

Region. This rough grouping places the script in the region where it originated, so English is in Europe, and Arabic is in West Asia. It serves to give a very rough idea of how things stack up on a regional basis. Regions are one of the following:

Changed 2020-06-13 11:13 GMT.  •  Send feedback.  •  Licence CC-By © r12a.