This page provides contrastive information about orthographies. It is not intended to be exhaustively scientific, merely to give a basic idea of which languages require which types of feature support. The categorisations are fairly rough and ready, but clicking on the data in the cells takes you to pages that give more details.
The many columns are divided into sections. Clicking on the buttons below toggles one or more of those sections into or out of view.
Click on the column headings to sort by that column. Mousing over or clicking/tapping on cells will show detailed information at the bottom of the window.
Key
The table is intended to provide a general indication only. There are things that could be disputed.
The major sections are displayed like buttons below. The text following the button indicates the URL parameter that will cause that section to be automatically displayed.
The figures in the next column, preceded by a + sign, indicate how many additional characters are currently under investigation. These figures currently include many characters that may not be relevant, but that are awaiting assessment before they can be removed.
Character counts do not include ASCII characters. It is assumed that those characters are always available.
Note also that the character counts reflect the characters needed to represent both precomposed and decomposed versions of content. For example, use of a character such as â would add 3 characters to the total count: â, a, and the combining circumflex. Use of ô would then add a further 2 (because the circumflex is already counted).
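The counting rule in this example can be sketched in Python with the standard unicodedata module. This is only an illustration, not the script used to build the table: the function name required_characters and its include_ascii flag are hypothetical. ASCII is kept by default so that the totals match the worked example, which counts the base letter a; passing include_ascii=False applies the ASCII exclusion mentioned earlier.

```python
import unicodedata


def required_characters(text, include_ascii=True):
    """Collect the characters needed for both the precomposed (NFC)
    and decomposed (NFD) forms of `text`."""
    needed = set()
    for form in ("NFC", "NFD"):
        needed.update(unicodedata.normalize(form, text))
    if not include_ascii:
        # Drop ASCII, which the table assumes is always available.
        needed = {ch for ch in needed if ord(ch) > 0x7F}
    return needed


first = required_characters("\u00E2")        # â
both = required_characters("\u00E2\u00F4")   # â and ô
print(len(first))               # 3: â, a, combining circumflex
print(len(both) - len(first))   # 2: ô and o (circumflex already counted)
```

Using a set is what makes the shared combining circumflex count only once, however many base letters it appears with.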
The 5 other (initially hidden) columns give character counts for specific types of character, per the Unicode general category assignment. These include:
letters: Most alphabetic and other basic letters.
marks: The subset of characters that are combining characters. No attempt is made to indicate how many of the base characters each combining character can combine with. In some cases, this will be limited, but in most cases a combining character will combine with a fair number of base characters.
punctuation: Characters with the general category of punctuation.
native digits: Non-ASCII numeric digits, but also other numeric characters where applicable (eg. characters for 20 and 30 in Amharic).
other: Characters with the general category 'other'. These are almost always formatting characters, such as those used for bidi control, or joining/non-joining. (These figures currently need to be updated.)
Like the other character counts, these figures exclude ASCII characters, and include characters for any compositions and decompositions that may be applied (unless they are deprecated by the Unicode Standard).
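As a rough sketch of how characters could be sorted into these five columns, the Python below groups distinct non-ASCII characters by the first letter of their Unicode general category. The mapping table and the function name count_by_type are assumptions made for illustration; the page's own tooling may group characters differently, and this sketch does not expand compositions and decompositions.

```python
import unicodedata
from collections import Counter

# Assumed mapping from the first letter of a Unicode general category
# (L, M, P, N, C) to the column names used in this key.
COLUMN_FOR_MAJOR_CATEGORY = {
    "L": "letters",
    "M": "marks",
    "P": "punctuation",
    "N": "native digits",
    "C": "other",
}


def count_by_type(chars):
    """Count distinct non-ASCII characters per column."""
    counts = Counter()
    for ch in set(chars):
        if ord(ch) <= 0x7F:
            continue  # ASCII is assumed to be always available
        column = COLUMN_FOR_MAJOR_CATEGORY.get(unicodedata.category(ch)[0])
        if column is not None:  # symbols and separators are not counted here
            counts[column] += 1
    return counts


# Ethiopic syllable HA, number twenty, full stop; combining acute; ZWNJ; ASCII a.
sample = "\u1200\u1373\u1362\u0301\u200Ca"
counts = count_by_type(sample)
print(dict(counts))
```

For this sample each of the five columns receives one character, and the ASCII letter is ignored.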
?show=vowels
Writing system. This column indicates whether the orthography is one of the following.
alphabet: An alphabet, ie. vowels are written as letters.
abjad: An abjad, ie. vowels are normally not written.
abugida: An abugida, ie. consonants carry an inherent vowel which is overridden by vowel-signs to express different vowel sounds.
syllabic: A syllabary, ie. characters generally represent a combination of consonant+vowel.
The other (initially hidden) columns cover the following types of character used to write vowel sounds.
Inherent vowel: The consonants in the orthography carry an inherent vowel sound, which, when needed, can be changed using vowel-signs or nullified by a particular character. The number represents the number of sounds the inherent vowel represents.
Dedicated letters: Letter characters dedicated to representation of vowels and used after a consonant.
Vowel marks: Combining marks dedicated to representation of vowels and used after a consonant.
Vowels, other: Consonants and other characters that are also used to represent vowel sounds, either alone or as part of a multipart vowel sequence.
Multipart vowels: How many vowels have been found that are represented by using more than one vowel-related character with a single base character. This is often seen in Southeast Asian scripts, such as Thai. Some counts include glides that form part of a diphthong, while others don't. (This should be clarified at some point.)
Vowels hidden: Whether the orthography generally hides the vowel diacritics. This typically applies to abjads such as are used by Arabic, Hebrew, Urdu, etc.
Vowel signs: Whether combining marks used to indicate vowels are referred to as vowel signs. These are usually used in Brahmi-derived scripts, and tend to be larger, and have more complex behaviours than simple diacritics.
Matres lectionis: Whether some or all of the vowel characters are referred to as matres lectionis. This tends to apply to orthographies that hide vowel diacritics but use consonants where long vowels occur.
Standalone letters: The number of letter characters that are used to represent vowels that are not preceded by a consonant. If the orthography doesn't have dedicated characters for this purpose (eg. most Latin script based languages), the cell is left blank.
Standalone carrier: An entry in this column indicates that the orthography represents standalone vowels using a base character plus vowel-sign. The column shows the character(s) and the Unicode name.
Prebase letters: If the orthography uses ordinary spacing letters before a consonant to represent a sound that occurs after the consonant, you will see the number of those characters here. A number in this column indicates that we are dealing with a script such as Thai or Lao, which uses visual ordering for prescript vowels, rather than combining characters.
Prebase marks: In this case, the orthography uses combining characters after the base consonant to represent vowel sounds, but the glyph for that character appears to the left of the consonant itself, eg. the short i in Hindi.
Circumgraphs: The number of characters representing vowels by a single combining character, but where the orthography displays multiple glyphs simultaneously on different sides of the base consonant, eg. certain Tamil vowel signs.
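The difference between prebase letters, prebase marks and circumgraphs lies in how the text is stored rather than how it looks. The following Python sketch lists the stored code points for one sample of each; the helper name stored_order and the choice of samples are illustrative assumptions.

```python
import unicodedata


def stored_order(text):
    """List each code point in storage order with its general category and name."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
            for ch in text]


thai = "\u0E40\u0E01"    # เก: prebase letter, stored before the consonant
hindi = "\u0915\u093F"   # कि: prebase mark, stored after the consonant
tamil = "\u0B95\u0BCA"   # கொ: circumgraph, one vowel sign drawn on both sides

for sample in (thai, hindi, tamil):
    for entry in stored_order(sample):
        print(*entry)
    print()
```

In the Thai sample the vowel is an ordinary letter (category Lo) that precedes the consonant in memory, whereas the Devanagari and Tamil vowel signs are combining marks (category Mc) that follow it.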
?show=consonants
Vocalics: How many vocalic sounds are used by the script in common, modern-day usage. This number is not doubled when both an independent vowel and a vowel-sign exist for the same vocalic letter. It represents the number of sounds for which there are special letters.
Medials: The orthography uses dedicated combining or other characters to represent the second consonant in a syllable-initial cluster. Medials represented by simple letters or conjuncts are not included here. Mouse over the cell to see detailed types at the bottom of the window. Abbreviations have the following meanings:
cm combining mark.
sj subjoined letter.
let other, dedicated letter.
Finals: The orthography uses dedicated combining or other characters to represent syllable or word final consonants. Mouse over the cell to see detailed types at the bottom of the window. Abbreviations have the following meanings:
cm combining mark.
let dedicated letter.
vk a final letter that has a vowel-killer diacritic attached.
ss superscripts, eg. in Canadian syllabics.
?show=cclusters
These columns are initially hidden. Where consonant clusters are represented in a special way by the orthography, these columns indicate the more common strategies. Cells indicate:
Stacks: Consonants are stacked. This may or may not involve merging the consonant shapes.
Conjoined: Consonant shapes are merged horizontally, rather than stacked.
Ligated: Consonant shapes are merged in a way that is more complex than simply stacking or conjoining. Often it is difficult to see the component parts of the conjunct that is formed.
Touching: Consonant shapes are closer to each other than normal, but no shapes are altered.
Virama: A dedicated and visible character is used to indicate a cluster. The character may not always be visible, but a check mark here indicates that use of a visible marker is a relatively common approach.
Diacritic: Consonant clusters are or can be indicated by another diacritic, such as the sukun in Arabic. In orthographies that hide vowel diacritics, these are often hidden in 'unvocalised' text.
Killer type: Where a special character is introduced to 'kill' the inherent vowel, this column indicates the type of killer. Options include:
v a virama that is usually invisible while creating a conjunct, but may be visible if the font doesn't support the conjunct glyph.
i an 'invisible stacker' that never has a visible glyph (and may create a conjunct in other ways than just stacking).
k a 'pure killer', ie. a glyph that is always visible.
?show=direction
The typical direction(s) in which the text flows. Columns indicate:
Text direction: The general direction(s) of text flow. Options include:
ltr horizontal and left to right.
rtl horizontal and right to left.
tbrl vertically set, with lines progressing from right to left.
tblr vertically set, with lines progressing from left to right.
RTL numbers: In most orthographies where the text is read right to left, numbers are read left to right. In these orthographies, however, numbers are also read right to left.
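One place this distinction is visible is the Unicode bidirectional class of the digits. The Python sketch below prints that class for a few characters. N'Ko is used here only as an assumed example of an orthography with right-to-left numbers; the table itself is the authority on which orthographies are marked.

```python
import unicodedata

samples = {
    "Latin letter a": "a",
    "Hebrew letter alef": "\u05D0",
    "Arabic letter beh": "\u0628",
    "Arabic-Indic digit one": "\u0661",
    "N'Ko digit one": "\u07C1",
}

# Map each label to the character's Unicode bidirectional class.
bidi_classes = {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}
for label, bidi_class in bidi_classes.items():
    print(f"{label}: {bidi_class}")
```

The Arabic-Indic digit has class AN (Arabic number), so a run of such digits is laid out left to right inside right-to-left text, while the N'Ko digit has class R and flows right to left like the surrounding letters.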
?show=shaping
'Shaping' here means that glyph shapes change according to the context (OpenType GSUB), whereas 'positioning' refers to the need to position glyphs differently according to context (OpenType GPOS). The columns look at two specific properties of shaping.
Bicameral: Whether and how there is a mapping from one case to another. Options include:
✓ regular uppercase and lowercase mappings.
allcaps uppercase forms are only used for whole words or phrases. Georgian case forms are used as normal vs all-caps, and all-caps is applied to a whole word. However, Unicode has data to enable algorithms to convert between the 'cases'. See more detail.
partial there is not a strict upper- vs lowercase mapping, but certain characters behave a little like uppercase forms. For example, Javanese has something that approximates case alternatives for some characters only, but there are no algorithms to convert from one 'case' to another. See more detail.
Cursive: Whether letters are joined with adjacent letters, eg. as in Arabic, N'Ko, or Mongolian.
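The difference between regular and all-caps casing can be seen in the Unicode case mappings that Python exposes through str.upper() and str.title(). This is a small sketch, assuming a Python build whose Unicode data is version 11 or later (Python 3.7 onwards), since that is when the Georgian all-caps letters were added.

```python
latin = "abc"
georgian = "\u10D0\u10D1\u10D2"   # აბგ in Mkhedruli

# Latin has both uppercase and titlecase mappings.
print(latin.upper(), latin.title())

# Georgian has an uppercase (Mtavruli) mapping for whole words,
# but titlecasing leaves the first letter unchanged.
print(georgian.upper(), georgian.title())
```

Latin yields ABC and Abc, whereas the Georgian word gains Mtavruli forms under upper() but is returned unchanged by title(), which matches the whole-word usage described for allcaps.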
?show=inline
Currently this section only indicates whether and how words are separated. The following alternatives are called out:
space: Words are separated by spaces, eg. Hebrew.
wb: Words are visually separated, but by a non-space character, eg. Amharic.
no: No explicit delimiters, eg. Chinese. When followed by an asterisk, the language allows stacking of word-final consonants and following word-initial consonants (ie. separating words for line-breaks or highlighting doesn't work well, since stacks can't be split).
zwsp: There is no visual delimiter, but a zero-width space may be used, eg. Khmer.
syllable: Spaces are used, but they separate syllables, not words, eg. Vietnamese or Lisu.
sb: Again, syllables are separated rather than words, but using a non-space character, eg. Tibetan.
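To illustrate what these values imply for software, the following Python sketch splits text into words or syllables according to the separator type. The DELIMITERS table and the function split_units are hypothetical; in particular, the characters chosen for wb and sb are simply those of the Amharic and Tibetan examples and would differ for other orthographies.

```python
# Assumed delimiter for each value in the list above.
DELIMITERS = {
    "space": " ",
    "wb": "\u1361",      # ETHIOPIC WORDSPACE
    "zwsp": "\u200B",    # ZERO WIDTH SPACE
    "syllable": " ",
    "sb": "\u0F0B",      # TIBETAN MARK INTERSYLLABIC TSHEG
}


def split_units(text, separator_type):
    """Split text into words (or syllables) for the given separator type."""
    delimiter = DELIMITERS.get(separator_type)
    if delimiter is None:
        # "no": nothing to split on; segmentation would need a dictionary.
        return [text]
    return [unit for unit in text.split(delimiter) if unit]


tibetan = split_units("\u0F56\u0F7C\u0F51\u0F0B\u0F66\u0F90\u0F51\u0F0B", "sb")
khmer = split_units("ភាសា\u200Bខ្មែរ", "zwsp")
chinese = split_units("中文", "no")
print(len(tibetan), len(khmer), len(chinese))   # 2 2 1
```

Note that the space and syllable values share a delimiter: the difference lies in what the resulting units represent, not in how they are found.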
?show=para
Columns currently cover:
Linebreak. Indicates the primary break point for wrapping lines. It is useful to compare this column with the 'Word separator' column just described. Note that nearly all scripts have rules about which punctuation characters can appear at the end or start of a line. The alternatives are:
word: Text wraps at word boundaries.
syllable: Text wraps at syllable boundaries, regardless of whether word boundaries are delimited.
char: Text wraps immediately after the last character that fits on a line, regardless of word or syllable boundaries.
Hyphenation. Whether or not hyphenation is used with the script. Hyphenation here means, having initially broken lines at word boundaries, then splitting words at the end of a line as a secondary mechanism for line-breaking, in order to make justified paragraphs look better. Scripts may use other visual cues than a hyphen, and may sometimes use no visual indicator that the word was broken. Values include:
yes <char>: Hyphenation occurs, using the character indicated as a visual marker of the word break.
(yes) <char>: Hyphenation occurs but is rare.
yes ∅: Words are broken to fit at the line end, but no visual indicator is added to indicate that the word continues on the next line.
no: The primary line-break algorithm involves word boundaries, but words are not broken at the end of a line.
n/a: The primary line-break algorithm takes no account of word boundaries (eg. Japanese, Thai, etc.).
↵ indicates the line-break. For example, the cell for Mongolian shows "↵᠆", which indicates that the Todo soft hyphen appears at the start of the second line, rather than at the end of the first. If Polish were in the list, you would see -↵-, which indicates that the hyphen appears both at the end and beginning of the line.
* indicates that although the visual marker looks like a hyphen, it is actually a different character.
Justification. Indicates the principal method(s) for full justification of text. Higher-end typographic systems will typically apply more than one method, and across whole paragraphs rather than just a single line. Here we simply aim to give an idea of the most common approach, or approaches, where there is a mixture. Alternatives include:
sp: Spaces between words or syllables are stretched, eg. Russian. In some orthographies, eg. Thai, the stretched spaces are phrase delimiters, rather than around words.
ic: Characters are separated by equal amounts of space across a line, eg. Chinese. (In practice, some characters tend to attract this spacing before others.)
ig: Space is introduced between unconnected glyphs, eg. Thai not only adds space around base characters, but also between those base characters and associated vowel-signs that are not combining marks. In Tamil, vowel-signs that don't interact with the base character may be separated in narrow column text when there is only one word on a line, even though the base character and vowel-sign together make a single grapheme cluster.
str: Connections between letters are stretched in cursive scripts. The orthography may also introduce elongated forms for certain characters, eg. swash forms in Arabic. (In fact, Arabic may also introduce ligatures to fit more words on a line.)
sw: Some letters are given lengthened glyph shapes to fill up space, such as in Arabic.
pad: Characters are repeated to pad out remaining space at the end of a line, eg. multiple tseks at the line end in Tibetan.
none: Full justification is not a feature of the language, eg. Balinese.
Text space. Text spacing looks at ways in which spacing is applied between characters over and above that which is introduced during justification. Alternatives include:
sp: Regular tracking space is introduced between each letter.
base: The baseline is stretched.
Baseline. The baseline for Latin text is labelled 'romn'. Scripts that, like Indic scripts, hang from a high baseline are labelled 'hang'. Scripts like Chinese are labelled 'ideo'. Scripts that use a centre baseline are labelled 'cntr'.
romn: the baseline used for Latin script.
hang: a hanging baseline, such as used by several Indic scripts.
ideo: a low baseline, such as used by Chinese.
cntr: a vertical baseline that runs through the centre of the character glyphs, such as used by vertical Japanese or by Mongolian.
?show=more
This rough grouping places the script in the region where it originated, so English is in Europe, and Arabic is in West Asia. It serves to give a very rough idea of how things stack up on a regional basis. Regions are one of the following:
nam (Northern America)
sam (South America)
cam (Central America)
carib (Caribbean)
eur (Europe - includes Russia to Urals and Georgia, but not Armenia or Azerbaijan)
easia (East Asia - includes China, Mongolia, Japan, Korea)
nasia (Northern Asia - Russia east of Urals)
seasia (Southeast Asia - including Indonesia, Philippines)
casia (Central Asia - north of Iran, S of Russia, W of China)
wasia (Western Asia - includes Armenia, Azerbaijan, Turkey, & Middle East)
afr (Africa)
oce (Oceania - includes Australia, NZ, and Pacific Islands)