Orthographies table

Updated 11 September, 2021

This page provides contrastive information about orthographies. It is not intended to be exhaustively scientific – merely to give a basic idea of which languages require which types of feature support. The categorisations are fairly rough and ready, but the details symbol after the script name on each line points to a page that gives more details.

Click on the column headings to sort by that column. See the key to look up abbreviations. If you can no longer see the headings, mousing over or clicking/tapping on cells will show the column names, and in some cases will provide more long-form information.

 

Key

The table is intended to provide a general indication only. There are things that could be disputed.

Total characters. This figure is based on the data in the Character Usage Lookup app. When there are two figures separated by a + sign, the first indicates how many characters are in standard use, and the second gives the number of additional, infrequently used characters for that language.
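As a rough illustration of how a 'Total characters' cell could be read programmatically, here is a minimal sketch. The function name parse_total and the sample values are hypothetical; they are not taken from the table or from the Character Usage Lookup app.

```python
def parse_total(cell):
    """Split a 'Total characters' cell into (standard, infrequent) counts.

    A cell such as "45+12" is read as 45 characters in standard use plus
    12 infrequently used ones; a cell such as "45" has no second figure.
    The sample values are invented for illustration.
    """
    standard, _, infrequent = cell.strip().partition("+")
    return int(standard), int(infrequent) if infrequent else 0


standard, infrequent = parse_total("45+12")
print(standard, infrequent)
```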

Character counts do not include ASCII characters. It is assumed that those are available.

Note also that the character counts reflect the characters needed to represent both precomposed and decomposed versions of content. For example, use of a character such as â would add 3 characters to the total count: â, a, and the combining circumflex. Use of ô would then add a further 2 (because the circumflex is already counted).
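The counting rule in this example can be sketched with Python's standard unicodedata module. This is only an approximation of the idea, not the method used to build the table: the function name needed_characters and the exclude_ascii flag are assumptions made for illustration. Note that the figures in the example above count the base letters a and o; setting exclude_ascii=True drops them, in line with the statement that ASCII characters are not counted.

```python
import unicodedata


def needed_characters(text, exclude_ascii=False):
    """Return the set of characters needed to write `text` in either
    precomposed (NFC) or decomposed (NFD) form."""
    chars = set(text)
    for form in ("NFC", "NFD"):
        chars.update(unicodedata.normalize(form, text))
    if exclude_ascii:
        # Assumption: 'ASCII' means code points below U+0080.
        chars = {c for c in chars if ord(c) >= 0x80}
    return chars


first = needed_characters("\u00e2")          # â
both = needed_characters("\u00e2\u00f4")     # â and ô
print(len(first))          # characters added by â
print(len(both - first))   # further characters added by ô
```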

Clicking on the button Toggle character detail reveals five columns to the right that give character counts for specific types of character, according to the Unicode General_Category property. These include:

Like the other character counts, these figures exclude ASCII characters, and include characters for any compositions and decompositions that may be applied (unless they are deprecated by the Unicode Standard).
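A per-type breakdown of this kind can be approximated as follows. The sketch groups characters by the first letter of their Unicode General_Category value (L for letters, M for marks, N for numbers, and so on). Whether this grouping matches the five columns in the table is an assumption, and the function name count_by_major_category is hypothetical.

```python
import unicodedata
from collections import Counter


def count_by_major_category(chars):
    """Count distinct non-ASCII characters per major General_Category."""
    counts = Counter()
    for ch in set(chars):
        if ord(ch) < 0x80:
            continue  # ASCII characters are excluded from the counts
        # category() returns values such as 'Ll', 'Mn' or 'Nd';
        # the first letter is the major class.
        counts[unicodedata.category(ch)[0]] += 1
    return dict(counts)


# a, â, combining circumflex, Arabic-Indic digit one
print(count_by_major_category("a\u00e2\u0302\u0661"))
```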

Type. The type column indicates whether the orthography is one of the following.

Clicking on the button Toggle type details reveals a number of related columns with a different-coloured background that indicate how various features are represented in the orthography. They cover the following:

Bicameral. Whether or not the script makes case distinctions.

If the result is inside parentheses, it indicates that something similar to case conversion applies, but that it operates in a slightly different way. See below:

Cursive script. Do the letters in this script join up, as they do in Arabic, N'Ko, or Mongolian?

Text direction. Is this a right-to-left script (which actually usually means that bidirectional behaviour needs to be supported, for numbers and embedded foreign text)? Is it used in a vertical orientation?

A value of rtl* indicates that number digits are read RTL.
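The direction-related questions above can be explored with the Bidi_Class values exposed by Python's unicodedata module. The following is a minimal sketch, not the way the table was compiled. The function names are hypothetical, and treating 'decimal digits with bidi class R' as a sign of right-to-left digits is an assumption about how an rtl* orthography would show up in the character data.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left bidi classes


def has_rtl(text):
    """True if the text contains any strongly right-to-left character."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


def needs_bidi(text):
    """True if right-to-left characters are mixed with left-to-right
    letters or with European/Arabic numbers."""
    classes = {unicodedata.bidirectional(ch) for ch in text}
    return bool(classes & RTL_CLASSES) and bool(classes & {"L", "EN", "AN"})


def has_rtl_digits(text):
    """True if any decimal digit in the text has bidi class R
    (assumed here to correspond to digits that are read RTL)."""
    return any(
        unicodedata.category(ch) == "Nd" and unicodedata.bidirectional(ch) == "R"
        for ch in text
    )


sample = "\u05e9\u05dc\u05d5\u05dd 2021"  # Hebrew word followed by a year
print(has_rtl(sample), needs_bidi(sample), has_rtl_digits(sample))
```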

Baseline. The baseline for Latin text is labelled 'mid'. Scripts that, like Indic scripts, hang from a high baseline are labelled 'high'. Scripts like Chinese are labelled 'low'.

Word separator. A word is a unit of segmentation between the grapheme and the phrase. This column asks whether, as a general rule, there are explicit delimiters for word boundaries. The alternatives are:

Text wrap. Indicates the primary break point for wrapping lines. It is useful to compare this column with the 'Word separator' column just described. Note that nearly all scripts have rules about which punctuation characters can appear at the end or start of a line. The alternatives are:

Hyphenation. Whether or not hyphenation is used with the script. Hyphenation here means that, after lines have initially been broken at word boundaries, words at the end of a line are split as a secondary mechanism for line-breaking, in order to make justified paragraphs look better. Scripts may use visual cues other than a hyphen, and may sometimes use no visual indicator that the word was broken. Values include:

↵ indicates the line-break. For example, the cell for Mongolian shows "↵᠆", which indicates that the Todo soft hyphen appears at the start of the second line, rather than at the end of the first. If Polish were in the list, you would see -↵-, which indicates that the hyphen appears both at the end and beginning of the line.

* indicates that although the visual marker looks like a hyphen, it is actually a different character.
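To make the ↵ notation concrete, the sketch below applies a cell value to a word broken at a given position. Everything before ↵ is appended to the end of the first line, and everything after it is placed at the start of the second. The function name break_word and the placeholder word are hypothetical, and the sketch ignores the * annotation.

```python
LINE_BREAK = "\u21b5"  # the ↵ symbol used in the Hyphenation column


def break_word(word, index, pattern):
    """Split `word` at `index` and attach the markers described by
    `pattern`, returning (end of first line, start of second line)."""
    if LINE_BREAK not in pattern:
        raise ValueError("pattern must contain the line-break symbol")
    end_marker, _, start_marker = pattern.partition(LINE_BREAK)
    return word[:index] + end_marker, start_marker + word[index:]


# Hyphen at the end of the first line only
print(break_word("abcdef", 3, "-\u21b5"))
# Hyphen at both the end of the first line and the start of the second
print(break_word("abcdef", 3, "-\u21b5-"))
# Marker at the start of the second line only (U+1806, as in the Mongolian cell)
print(break_word("abcdef", 3, "\u21b5\u1806"))
```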

Justification. Indicates the principal method(s) for full justification of text. Higher-end typographic systems will typically apply more than one method, and across whole paragraphs rather than just a single line. Here we simply aim to give an idea of the most common approach, or approaches, where there is a mixture. Alternatives include:

Region. This rough grouping places the script in the region where it originated, so English is in Europe, and Arabic is in West Asia. It serves to give a very rough idea of how things stack up on a regional basis. Regions are one of the following:

Changed 2021-09-11 13:22 GMT.  •  Licence CC-By © r12a.