Script features by language

Updated 21 Oct 2014 • tags scriptnotes.

This page provides information about script characteristics for a number of languages. The characteristics described are based on the exemplarCharacters lists in CLDR, ie. just the core characters needed to represent the language. It is not intended to be exhaustively scientific – merely to give a basic idea of what languages require what type of feature support. The symbol details after a script name points to a page that gives a quick summary of the script used for that language.

Click on the column headings to sort by that column.


The table is intended to provide a general indication only. Some entries could be disputed, and sometimes the dispute goes back to the CLDR data.

Number of characters. This figure is based on the simplest list of exemplarCharacters in CLDR. It is therefore the set of core characters needed to represent the language, and typically doesn't include common punctuation, currency symbols, etc. Nor does it include the additional characters that you may find in publications. For example, the English set of characters doesn't include é, which you might use for 'résumé'. It also doesn't include combinations regarded as letters that are made up of groups of characters.

Note that CLDR is not necessarily correct or complete, but the figure is intended to give an idea of the general size of the character set, rather than an absolutely accurate figure.

Where a language uses a case sensitive script, uppercase versions of letters are included in this figure.

The characteristics described below are also based on this set of characters only.
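As a rough illustration of how such a figure could be computed, here is a minimal Python sketch. It assumes the exemplar characters have already been extracted from CLDR into a plain string of single characters; the function name `character_count` is hypothetical, and this is not necessarily how the figures in the table were produced.

```python
def character_count(exemplars: str) -> int:
    """Count core characters, adding uppercase forms where the script has case."""
    chars = set(exemplars)
    for ch in exemplars:
        upper = ch.upper()
        # Only add an uppercase form that is a single, different character.
        # Characters without case (eg. Thai letters) map to themselves.
        if len(upper) == 1 and upper != ch:
            chars.add(upper)
    return len(chars)


english = "abcdefghijklmnopqrstuvwxyz"
print(character_count(english))  # 52: 26 lowercase letters plus 26 uppercase
```

For a script without case distinctions the result is simply the number of distinct characters in the list.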

Case sensitive? Whether or not the script makes case distinctions.

Combining characters. This shows the subset of the number of characters that are combining characters. No attempt is made to indicate how many of the base characters each combining character can combine with. In some cases, this will be limited, but in most cases a combining character will combine with a fair number of base characters.
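One way to arrive at a figure like this is to count the characters whose Unicode general category is a mark (Mn, Mc or Me). The sketch below uses Python's standard `unicodedata` module; the function name `combining_characters` is hypothetical, and the approach is an assumption rather than a description of how the table was built.

```python
import unicodedata


def combining_characters(exemplars: str) -> list[str]:
    """Return the characters whose general category is a mark (Mn, Mc, Me)."""
    return [ch for ch in exemplars if unicodedata.category(ch).startswith("M")]


# A Thai consonant (U+0E01), a vowel sign (U+0E34) and a tone mark (U+0E48).
sample = "\u0e01\u0e34\u0e48"
print(len(combining_characters(sample)))  # 2
```

The general category is used here rather than `unicodedata.combining()`, because the latter returns the canonical combining class, which is 0 for many combining vowel signs.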

Contextual positioning. This is typically related to combining characters, and indicates that a typical font uses OpenType rules to position a glyph according to the glyphs that surround it, eg. tone marks in Thai, or vowel signs in Arabic (if used).

Multiple combining characters. Whether more than one combining character can be associated with a given base character.

Contextual shaping. Whether different glyph shapes have to be used for a character depending on the visual context, eg. the RA in Myanmar that grows and shrinks to fit around the character it surrounds. Note that this does not include shaping for cursive scripts (see below).

Cursive script. Do the letters in this script join up, eg. as in Arabic?

Ligatures. Does the script require certain ligatures, ie. a single glyph for more than one underlying character?

Right-to-left. Is this a right-to-left script? In practice this usually means that bidirectional behaviour needs to be supported, for numbers and embedded foreign text.
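A simple way to test for this, assuming the exemplar characters are available as a string, is to look at the Unicode bidirectional class of each character. The sketch below treats a set as right-to-left if any character has class R (eg. Hebrew letters) or AL (Arabic letters); the function name `is_right_to_left` is hypothetical, and the table itself may have been compiled by hand.

```python
import unicodedata


def is_right_to_left(exemplars: str) -> bool:
    """True if any character has a strong right-to-left bidi class (R or AL)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in exemplars)


print(is_right_to_left("\u0627\u0628"))  # Arabic alef and beh: True
print(is_right_to_left("abc"))           # Latin letters: False
```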

Space not word separator. Is this a script like Thai, where spaces are used to separate phrases, not words, or like Japanese and Chinese, that don't use spaces, or Ethiopic, that has its own word separator?

Baseline. The baseline for Latin text is labelled 'mid'. Scripts that, like Indic scripts, hang from a high baseline are labelled 'high'. Scripts like Chinese are labelled 'low'.

Text wrap. At the end of a line, where is the typical break point? Is it between words, or characters? Entries labelled 'special' wrap at a character that is not a space, eg. Tibetan, which uses a tsheg between words, rather than a space.

Justification. What is the basic starting point for justification of text on a line? Typically this is related to the spaces between words. The other alternatives listed are: 'char', typical of Chinese and Japanese, where justification starts with inter-character spaces; 'cluster', for scripts such as those of South East Asia, where word boundaries are taken into account but spaces are used as phrase separators; and 'word', used for Arabic-based scripts, where justification is commonly achieved by stretching the baseline or using ligatures.

Region. This rough grouping places the language in the region where it originated, so English is in Europe, and Arabic is in the Middle East. It serves to get a very rough idea of how things stack up on a regional basis.

Feature count. This is a very simplistic indicator that awards one point for each column, after the first three, that doesn't read 'no', 'mid' or '0'.
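The rule can be sketched in a few lines of Python. The row below is invented purely for illustration, and the name `feature_count` and the list-of-strings row layout are assumptions. The caller decides which columns to pass in; the sketch just applies the rule as stated to everything after the first three cells.

```python
NEUTRAL_VALUES = {"no", "mid", "0"}


def feature_count(row: list[str]) -> int:
    """Award one point per cell after the first three that isn't 'no', 'mid' or '0'."""
    return sum(1 for cell in row[3:] if cell.strip().lower() not in NEUTRAL_VALUES)


# Hypothetical row: three leading columns, then seven feature columns.
row = ["Example language", "Example script", "40",
       "no", "5", "yes", "0", "mid", "yes", "high"]
print(feature_count(row))  # 4: the cells '5', 'yes', 'yes' and 'high'
```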

First published 29 August, 2010. This version 2015-02-15 11:45 GMT.  •  Copyright Licence CC-By.