Script comparison table

Updated 1 December, 2024

This page provides contrastive information about scripts. It is not intended to be exhaustively scientific – merely to give a basic idea of which scripts require which types of feature support. The categorisations are fairly rough and ready, but clicking on the link in the third column takes you to resources with more details.

Click on the column headings to sort by that column. Mousing over or clicking/tapping on cells will show detailed information at the bottom of the window.

A number of columns are initially hidden. Clicking on the buttons below toggles one or more of those additional columns into or out of view.


Key

The table is intended to provide a general indication only. There are things that could be disputed. Note also that different orthographies using the same script may use slightly different features.

Duplicate script names. Some scripts are used in more than one way, in which case they occupy multiple lines. These include the following:

The major divisions are displayed like buttons below. Any text following the button indicates a URL parameter that will cause that section to be automatically displayed.
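As a rough illustration of how such a URL parameter could be read, the following Python sketch extracts the value of `show` from a page address. The host and path are placeholders, and the function name `sections_to_show` is hypothetical; how the page itself handles the parameter is not described here.

```python
from urllib.parse import urlsplit, parse_qs


def sections_to_show(url: str) -> list[str]:
    """Return the values of the 'show' query parameter, if any."""
    query = urlsplit(url).query
    # parse_qs maps each parameter name to a list of its values.
    return parse_qs(query).get("show", [])


# Placeholder address (an assumption, not the real page URL).
example = "https://example.org/scripts/table?show=characters"
print(sections_to_show(example))
```

A URL without the parameter yields an empty list, which a caller could treat as "show only the default columns".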

Blocks. The number of Unicode blocks dedicated to the script. Blocks containing common characters, such as punctuation and many accent marks, are not included in the count. (This is significant for CJK and Latin scripts, in particular.)

?show=characters

Total chars. The number of characters listed in the dedicated Unicode blocks. This is typically more characters than are needed for a particular orthography, but excludes many characters used by orthographies that are not in the dedicated script blocks, such as punctuation, common combining marks, formatting characters, ASCII letters, etc.
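Counts of this kind, along with a breakdown by Unicode general category, can be approximated with Python's standard `unicodedata` module. The sketch below is only an illustration: it assumes a single dedicated block (Thai, U+0E00 to U+0E7F), groups characters by the first letter of their general category, and depends on the Unicode version bundled with the Python interpreter. The function name `count_block` is hypothetical, and this is not necessarily how the table's figures were produced.

```python
import unicodedata
from collections import Counter


def count_block(first: int, last: int) -> Counter:
    """Count assigned code points in a range, keyed by major category.

    Keys are the first letter of the general category:
    L (letter), M (mark), N (number), P (punctuation), S (symbol), etc.
    """
    counts = Counter()
    for cp in range(first, last + 1):
        category = unicodedata.category(chr(cp))
        if category == "Cn":  # unassigned code point: skip it
            continue
        counts[category[0]] += 1
    return counts


# Thai block, used here purely as an example range.
thai = count_block(0x0E00, 0x0E7F)
total = sum(thai.values())
print(total, dict(thai))
```

Scripts spread over several blocks would need one call per block, with the results added together.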

The six other (initially hidden) columns give character counts for specific types of character, per the Unicode General_Category property. These include:

The typical direction(s) in which the text flows. Columns indicate:

?show=vowels

Writing system. This column indicates whether the orthography is one of the following.

The other (initially hidden) columns cover the following types of character used to write vowel sounds.

'Shaping' here means that glyph shapes change according to the context (OpenType GSUB), whereas 'positioning' refers to the need to position glyphs differently according to context (OpenType GPOS).

?show=cclusters

These items are shown by clicking on the 'Consonant clusters' button.

Word separator. Indicates how and if words are separated. The following alternatives are called out:

Linebreak. Indicates the primary break point for wrapping lines. It is useful to compare this column with the 'Word separator' column just described. Note that nearly all scripts have rules about which punctuation characters can appear at the end or start of a line. The alternatives are:

Hyphenation. Whether or not hyphenation is used with the script. Hyphenation here means that, after lines have initially been broken at word boundaries, words at the end of a line are split as a secondary line-breaking mechanism, in order to make justified paragraphs look better. Scripts may use visual cues other than a hyphen, and may sometimes use no visual indicator that the word was broken. Values include:

↵ indicates the line-break. For example, the cell for Mongolian shows "↵᠆", which indicates that the Todo soft hyphen appears at the start of the second line, rather than at the end of the first. If Polish were in the list, you would see -↵-, which indicates that the hyphen appears both at the end and beginning of the line.

* indicates that although the visual marker looks like a hyphen, it is actually a different character.
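The ↵ notation can be read mechanically: whatever precedes ↵ sits at the end of the first line, and whatever follows it sits at the start of the next. The Python sketch below shows one way to split a cell value on that basis. The function name `hyphen_positions` is hypothetical, and the sketch makes no attempt to interpret the * marker.

```python
LINE_BREAK = "\u21B5"  # ↵


def hyphen_positions(cell: str) -> tuple[str, str]:
    """Split a cell into (end-of-line marker, start-of-next-line marker).

    If the cell contains no line-break symbol, the whole value is
    returned as the first item and the second item is empty.
    """
    before, _, after = cell.partition(LINE_BREAK)
    return before, after


# Mongolian: U+1806 MONGOLIAN TODO SOFT HYPHEN starts the second line.
print(hyphen_positions("\u21B5\u1806"))
# The Polish-style example: a hyphen on both sides of the break.
print(hyphen_positions("-\u21B5-"))
```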

Word-spanning conjuncts. In some scripts conjuncts (usually stacks) may include the last consonant of one word and the first consonant of a following word. This prevents words being wrapped at word boundaries, since conjuncts cannot be split. Many scripts have conjuncts, but only a few have this feature.

Grapheme clusters. Whether the text units typically used for line-breaking conform to Unicode grapheme clusters. Many Indic scripts which stack characters are not adequately served by grapheme clusters, since a grapheme cluster will end after a virama.
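The effect of a virama on clustering can be illustrated with a deliberately simplified Python sketch. It is not an implementation of the Unicode segmentation rules: `rough_clusters` merely attaches combining marks to the preceding character, and `join_across_virama` then merges a cluster that ends in a virama (canonical combining class 9) with a following cluster that starts with a letter. Both function names are hypothetical.

```python
import unicodedata

VIRAMA_CCC = 9  # canonical combining class assigned to virama characters


def rough_clusters(text: str) -> list[str]:
    """Very rough clustering: attach each mark to the preceding character."""
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def join_across_virama(clusters: list[str]) -> list[str]:
    """Merge a cluster ending in a virama with a following letter cluster."""
    merged: list[str] = []
    for cluster in clusters:
        if (
            merged
            and unicodedata.combining(merged[-1][-1]) == VIRAMA_CCC
            and unicodedata.category(cluster[0]).startswith("L")
        ):
            merged[-1] += cluster
        else:
            merged.append(cluster)
    return merged


# Devanagari KA + VIRAMA + SSA, normally rendered as one conjunct.
word = "\u0915\u094D\u0937"
print(rough_clusters(word))                      # two units, split after the virama
print(join_across_virama(rough_clusters(word)))  # one unit
```

In this sketch the first step splits the conjunct in two, which is the kind of break that would be unwelcome at a line end, while the second step keeps it whole.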

Justification. Indicates the principal method(s) for full justification of text. Higher-end typographic systems will typically apply more than one method, and across whole paragraphs rather than just a single line. Here we simply aim to give an idea of the most common approach, or approaches, where there is a mixture. Alternatives include:

Baseline. The baseline for Latin text is labelled 'romn'. Scripts that, like Indic scripts, hang from a high baseline are labelled 'hang'. Scripts like Chinese are labelled 'ideo'. Scripts that use a centre baseline

This rough grouping places the script in the region where it originated, so English is in Europe, and Arabic is in West Asia. It serves to give a very rough idea of how things stack up on a regional basis. Regions are one of the following:

The following map shows how Africa, Europe, and Asia are roughly divided up along the lines of these regional labels.

Map of Europe, Africa, & Asia.

Changed 2024-12-01 7:35 GMT.  •  Send feedback.  •  Licence CC-By © r12a.