Script comparison table

Updated 21 Oct 2014 • tags scriptnotes.

This page provides information about the characteristics of a number of scripts. It is not intended to be exhaustively scientific – merely to give a basic idea of what languages require what type of feature support. The 'details' link after a script name points to a page that gives a quick summary of the script.


Notes

The table is intended to provide a general indication only. There are things that could be disputed.

Number of characters. This figure is based on a character count for the Unicode blocks related to that script. It is approximate, however, for a number of reasons. Blocks such as punctuation are not included – this is just a figure for the main block or set of blocks dedicated to that script.

Very often a particular language will use only a small number of the total characters available in a script (think, for example, how many characters are used for English out of the 1,286 Latin characters). The figures also include archaic characters.

Combining characters. This shows the subset of the number of characters that are combining characters. No attempt is made to indicate how many of the base characters each combining character can combine with. In some cases, this will be limited, but in most cases a combining character will combine with a fair number of base characters.
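As a rough illustration of the two counts described above, the following Python sketch tallies the assigned characters and the combining marks in a single Unicode block. It is an assumption that 'combining character' means a character whose General_Category starts with M; the page does not say how its figures were produced. The function name count_block and the hard-coded Thai block range are illustrative only, and the totals depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

def count_block(first, last):
    """Return (assigned, combining) counts for code points first..last inclusive."""
    assigned = 0
    combining = 0
    for cp in range(first, last + 1):
        category = unicodedata.category(chr(cp))
        # Skip unassigned, private-use and surrogate code points.
        if category in ("Cn", "Co", "Cs"):
            continue
        assigned += 1
        # Assumption: Mn, Mc and Me marks are what the table calls combining.
        if category.startswith("M"):
            combining += 1
    return assigned, combining

# Thai block, U+0E00..U+0E7F (range hard-coded for the example).
thai_total, thai_combining = count_block(0x0E00, 0x0E7F)
print(thai_total, thai_combining)
```

A script that spans several blocks would need the counts for each block added together.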

Multiple combining characters. Whether more than one combining character can be associated with a given base character.

Case sensitive? Whether or not the script makes case distinctions.
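One way to approximate this check programmatically is to look for cased letters in a script's block. The sketch below assumes that a script makes case distinctions if its block contains any character with General_Category Lu, Ll or Lt; has_case is a hypothetical helper and the block ranges are hard-coded for the example.

```python
import unicodedata

# General categories for uppercase, lowercase and titlecase letters.
CASED = {"Lu", "Ll", "Lt"}

def has_case(first, last):
    """True if any code point in first..last inclusive is a cased letter."""
    return any(
        unicodedata.category(chr(cp)) in CASED
        for cp in range(first, last + 1)
    )

print(has_case(0x0370, 0x03FF))  # Greek and Coptic block
print(has_case(0x0E00, 0x0E7F))  # Thai block
```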

Contextual positioning. This is typically related to combining characters, and indicates that a typical font uses OpenType rules to position a glyph according to the glyphs that surround it, eg. tone marks in Thai, or vowel signs in Arabic (if used). Nearly all scripts with combining characters will need some positioning rules to take account of where the combining character should be placed. This indicator is more concerned with whether that location varies significantly, depending on the surrounding context.

Contextual shaping. Whether different glyph shapes have to be used for a character depending on the visual context, eg. the RA in Myanmar that grows and shrinks to fit around the character it surrounds. Note that this does not include shaping for cursive scripts (see below).

Cursive script. Do the letters in this script join up, eg. as in Arabic?

Text direction. Is this a right-to-left script (which usually means in practice that bidirectional behaviour needs to be supported, for numbers and embedded foreign text)? Is it used in a vertical orientation?
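To see why right-to-left text usually implies bidirectional support, it can help to inspect the Unicode bidirectional class of each character. The sketch below is a minimal check, not a bidi implementation: needs_bidi is a hypothetical helper that reports whether a string contains any strongly right-to-left character (class R or AL). Digits embedded in such text have class EN and still run left to right, which is where the bidirectional behaviour comes from.

```python
import unicodedata

# Strong right-to-left classes: R (eg. Hebrew) and AL (Arabic letters).
STRONG_RTL = {"R", "AL"}

def needs_bidi(text):
    """True if the text contains at least one strongly right-to-left character."""
    return any(unicodedata.bidirectional(ch) in STRONG_RTL for ch in text)

print(needs_bidi("abc 123"))
print(needs_bidi("\u05e9\u05dc\u05d5\u05dd 123"))  # Hebrew letters followed by digits
print(unicodedata.bidirectional("1"))              # class of a European digit
```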

Word separator. Is this a script like Thai, where spaces are used to separate phrases, not words, or like Japanese and Chinese, that don't use spaces, or Ethiopic, that has its own word separator?

Baseline. The baseline for Latin text is labelled 'mid'. Scripts that, like Indic scripts, hang from a high baseline are labelled 'high'. Scripts like Chinese are labelled 'low'.

Text wrap. At the end of a line, where is the typical break point? Is it between words, or characters? Entries labelled 'special' wrap at a character that is not a space, eg. Tibetan, which uses a tsheg between words, rather than a space.

Justification. What is the basic starting point for justification of text on a line? Typically this is related to the spaces between words. The other alternatives listed are: 'char' is typical of Chinese and Japanese, where justification starts with inter-character spaces; 'cluster' refers to scripts such as those in South East Asia, where word boundaries are taken into account, but spaces are used as phrase separators; 'word' is used for Arabic-based scripts, where justification is commonly achieved by stretching the baseline or using ligatures.

Region. This rough grouping places the script in the region where it originated, so English is in Europe, and Arabic is in the Middle East. It serves to get a very rough idea of how things stack up on a regional basis.

Digits. Does the script have a set of native digits? Note that in some cases these may not be used for a particular language.
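A quick way to check for native digits is to look for decimal digit characters in a script's block. The sketch below assumes that 'native digits' corresponds to characters with General_Category Nd; native_digits is a hypothetical helper and the block ranges are hard-coded for the example.

```python
import unicodedata

def native_digits(first, last):
    """Return the decimal digit characters found in code points first..last inclusive."""
    return [
        chr(cp)
        for cp in range(first, last + 1)
        if unicodedata.category(chr(cp)) == "Nd"
    ]

thai_digits = native_digits(0x0E00, 0x0E7F)
# Print each digit with its numeric value.
print([(d, unicodedata.digit(d)) for d in thai_digits])
```

This only shows that the digits exist in the script's block. Whether a given language actually uses them has to be established separately.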

Feature count. This is a very simplistic indicator that simply awards one point for each column after the first three columns that doesn't read 'no', 'mid' or '0'.
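The scoring rule above can be sketched in a few lines of Python. The row layout and the sample values are invented for illustration and do not come from the table itself; feature_count is a hypothetical name.

```python
# Values that score no point, as listed in the note above.
NEUTRAL_VALUES = {"no", "mid", "0"}

def feature_count(row, skip=3):
    """Award one point per cell after the first `skip` columns that is not neutral."""
    return sum(
        1
        for cell in row[skip:]
        if cell.strip().lower() not in NEUTRAL_VALUES
    )

# Hypothetical row: three leading columns, then six feature columns.
sample_row = ["Example", "100", "10", "yes", "no", "no", "rtl", "mid", "0"]
print(feature_count(sample_row))  # 2: only 'yes' and 'rtl' score
```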

First published 29 August, 2010. This version 2017-07-12 7:14 GMT.  •  Copyright r12a@w3.org. Licence CC-By.