This page provides contrastive information about scripts. It is not intended to be exhaustively scientific – merely to give a basic idea of what scripts require what type of feature support. The categorisations are fairly rough and ready, but clicking on the link in the third column takes you to resources with more details.
Click on the column headings to sort by that column. Mousing over or clicking/tapping on cells will show detailed information at the bottom of the window.
A number of columns are initially hidden. Clicking on the buttons below toggles one or more of those additional columns into or out of view.
Key
The table is intended to provide a general indication only. There are things that could be disputed. Note also that different orthographies using the same script may use slightly different features.
Duplicate script names. Some scripts are used in more than one way, in which case they occupy multiple lines. These include the following:
Arabic: The original usage is as an abjad. This applies to Modern Standard Arabic. Another significantly different approach mandates the use of all diacritics, which makes it an alphabet, since all the consonants and vowels are written. This approach is used for languages such as Kashmiri, or African ajami orthographies. A third approach is also alphabetic, but uses letters rather than diacritics to write vowels. This approach is used, for example, by Uighur.
Sunuwar: The original usage is as an abugida, but in Nepal the script is used as an alphabet. In this case we kept a single line but listed both alpha and abug in the type column.
The major divisions are displayed like buttons below. Any text following the button indicates a URL parameter that will cause that section to be automatically displayed.
Blocks. The number of Unicode blocks dedicated to the script. Blocks containing common characters, such as punctuation and many accent marks, are not included in the count. (This is significant for CJK and Latin scripts, in particular.)
?show=characters
Total chars. The number of characters listed in the dedicated Unicode blocks. This is typically more characters than are needed for a particular orthography, but excludes many characters used by orthographies that are not in the dedicated script blocks, such as punctuation, common combining marks, formatting characters, ASCII letters, etc.
The 6 other (initially hidden) columns give character counts for specific types of character, per the Unicode general property assignment. These include:
letter: Most alphabetic and other basic letters.
mark: The subset of characters that are combining characters. No attempt is made to indicate how many of the base characters each combining character can combine with. In some cases, this will be limited, but in most cases a combining character will combine with a fair number of base characters.
number: Non-ASCII numeric digits, but also other numeric characters where applicable (eg. characters for 20 and 30 in Amharic).
punctuation: Characters with the general property of punctuation in the blocks concerned. Most orthographies also use other punctuation, such as that in the ASCII range.
symbol: Any characters with the general property of symbol in the blocks examined.
other: Characters with the general property 'other'. These are almost always formatting characters, such as those used for bidi control, or joining/non-joining. Many of these characters are in a common Unicode block, rather than in the dedicated blocks for which the figures are assembled, and those numbers are not reflected in this column.
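As a rough illustration of how per-type counts like these can be derived, the sketch below tallies the assigned characters in one block by the first letter of their Unicode general category. It is not the method used to build the table: the function name `count_by_type` is hypothetical, folding separators into 'other' is an assumption, and the exact figures depend on the Unicode version bundled with your Python.

```python
import unicodedata
from collections import Counter

# Map the first letter of a Unicode general category to the column names
# used in the key above. Folding separators (Z*) into "other" is an
# assumption made for this sketch.
COLUMN_FOR_MAJOR_CLASS = {
    "L": "letter",
    "M": "mark",
    "N": "number",
    "P": "punctuation",
    "S": "symbol",
    "C": "other",
    "Z": "other",
}


def count_by_type(first, last):
    """Count assigned characters in the code point range first..last."""
    counts = Counter()
    for code_point in range(first, last + 1):
        category = unicodedata.category(chr(code_point))
        if category == "Cn":  # unassigned code point, so not a character
            continue
        counts[COLUMN_FOR_MAJOR_CLASS[category[0]]] += 1
    return counts


# The Thai block spans U+0E00..U+0E7F.
thai = count_by_type(0x0E00, 0x0E7F)
print(sum(thai.values()), dict(thai))
```

Running the same function over each dedicated block of a script and summing the results would give figures comparable to the 'Total chars' column, subject to the caveats listed above about characters that live in common blocks.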
The typical direction(s) in which the text flows. Columns indicate:
Text direction: The general direction(s) of text flow. Options include:
ltr horizontal and left to right.
rtl horizontal and right to left.
tbrl vertically set, with lines progressing from right to left.
tblr vertically set, with lines progressing from left to right.
bt vertically set, with characters stacked from bottom to top.
bous boustrophedon.
RTL digits: In most orthographies where the text is read right to left, numbers are read left to right. In the orthographies marked in this column, however, numbers are also read right to left.
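The distinction between right-to-left letters and left-to-right numbers is visible in the Unicode Bidi_Class property. The following minimal sketch prints that class for a few sample characters; the helper name `bidi_class` is hypothetical. Hebrew and Arabic letters carry the strong right-to-left classes R and AL, whereas ASCII and Arabic-Indic digits have the number classes EN and AN. N'Ko digits, by contrast, carry the strong class R, just like N'Ko letters.

```python
import unicodedata


def bidi_class(char):
    """Return the Unicode Bidi_Class of a single character."""
    return unicodedata.bidirectional(char)


# Latin letter, Hebrew letter, Arabic letter, ASCII digit,
# Arabic-Indic digit, N'Ko digit.
for char in ["a", "\u05D0", "\u0627", "1", "\u0661", "\u07C1"]:
    print(f"U+{ord(char):04X} {unicodedata.name(char)}: {bidi_class(char)}")
```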
?show=vowels
Writing system. This column indicates whether the orthography is one of the following.
alpha: An alphabet, ie. vowels are written separately from the consonants. This includes scripts such as Latin, Cyrillic, Armenian, etc, but also scripts where the vowels are written using combining marks rather than letters, such as Arabic used for Kashmiri, Tai Viet, Thaana, etc.
abjad: An abjad, ie. short vowels are normally not written. This applies to scripts such as Arabic (when used for the Arabic language), Syriac, Aramaic, etc.
abug: An abugida, ie. consonants carry an inherent vowel which is overridden by vowel signs to represent other vowel sounds. Typical scripts include Devanagari, Thai, Buginese, etc.
syll: A syllabary, ie. characters generally represent a combination of consonant+vowel. These scripts include Cherokee, Vai, etc, but also Chinese, Japanese, etc.
feat: A featural syllabary, ie. characters are syllabic, but specific, standardised features of the glyphs carry information about the vowel represented. These scripts include Korean, Canadian Syllabics, and Ethiopic.
The other (initially hidden) columns cover the following types of character used to write vowel sounds.
Inherent vowel: The consonants in the orthography carry an inherent vowel sound, which, when needed, can be changed using vowel-signs or nullified by a particular character. The number indicates the number of ways in which the inherent vowel is typically pronounced in languages using this orthography.
PCV marks: The script represents post-consonant vowel sounds using combining marks. Note that this doesn't refer to marks that combine with letters to indicate nasalisation, etc, but rather to marks that on their own constitute the basic representation of a vowel sound, such as for many vowel signs.
PCV letters: The orthography uses letter characters to represent vowels pronounced after a consonant. This does NOT include letters used to represent standalone vowels (see below).
Hides vowels: Orthographies using this script generally hide vowel diacritics. This typically applies to abjads such as are used by Arabic, Hebrew, Urdu, etc. In the case of Arabic some orthographies always show all vowel diacritics, or show all vowels as letters; these alphabetic uses of the Arabic script appear on separate lines.
Visual order: The orthography uses ordinary spacing letters before a consonant to represent a vowel sound that is pronounced after the consonant (it could be part of a composite vowel). This includes scripts such as Thai, Lao & Tai Viet. (See prebase marks for the corresponding use of combining marks.)
Prebase marks: Scripts where orthographies use combining characters after the base consonant to represent vowel sounds, but the glyph for that character appears before the consonant itself when displayed, eg. the short i in Hindi. (See visual order for the corresponding use of letters.)
Circumgraphs: A vowel is represented by a single combining character, but the orthography displays multiple glyphs simultaneously on different sides of the base consonant, eg. certain Tamil vowel signs.
Composite vowels: Orthographies using this script express a single vowel sound using multiple code points (the opposite of a circumgraph). This is a common feature of Southeast Asian scripts, such as Thai and Lao. It doesn't generally include glides that form part of a diphthong, and it ignores marks used to indicate nasalisation or vowel length.
SA letters: Letter characters are used to represent standalone vowels, ie. that are NOT preceded by a consonant.
SA carrier: Standalone vowels are written using a base character plus vowel-sign. The column shows the character(s) and part of the Unicode name.
Vocalics: The script has single characters that represent a consonant plus vowel, as in Sanskrit and other indic scripts.
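The difference between the 'Visual order' and 'Prebase marks' entries above shows up in the stored code point sequence. In the sketch below (the helper `describe` is hypothetical), the Thai syllable stores a spacing vowel letter before the consonant, while the Devanagari syllable stores a combining vowel sign after the consonant and leaves it to the font and rendering engine to draw the sign on the left.

```python
import unicodedata


def describe(text):
    """List (code point, general category, name) tuples in stored order."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]


thai_ke = "\u0E40\u0E01"        # vowel letter SARA E is stored before KO KAI
devanagari_ki = "\u0915\u093F"  # VOWEL SIGN I is stored after KA, drawn before it

for label, text in [("Thai", thai_ke), ("Devanagari", devanagari_ki)]:
    print(label)
    for row in describe(text):
        print("  ", *row)
```

The general category in the second column makes the contrast explicit: the Thai vowel is a letter (Lo), whereas the Devanagari vowel sign is a spacing combining mark (Mc).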
'Shaping' here means that glyph shapes change according to the context (gsub), whereas 'positioning' refers to the need to position glyphs differently according to context (gpos).
Case: Whether and how there is a mapping from one set of characters to another. Options include:
✓ regular uppercase and lowercase mappings.
allcaps uppercase forms are only used for whole words or phrases. Georgian case forms are used as normal vs all-caps, and all-caps is applied to a whole word. However, Unicode has data to enable algorithms to convert between the 'cases'. See more detail.
similar there is not a strict upper- vs lowercase mapping, but certain characters behave a little like uppercase forms. For example, Javanese has something that approximates case alternatives for some characters only, but there are no algorithms to convert from one 'case' to another. See more detail.
Cursive script: Whether letters are joined with adjacent letters, eg. as in Arabic, N'Ko, or Mongolian, but also indic scripts that have a joining headstroke, such as Devanagari, Bengali, etc.
Comb. marks: Whether the script uses combining marks. m indicates that the script attaches multiple combining marks to a single base.
Clusters marked: Consonant glyphs are merged or a diacritic is used to indicate consonant clusters. Merging is most likely to happen when the script has inherent vowels, since the conjunct indicates a killed vowel between consonants. Diacritics include things such as the Arabic sukun.
Other ligs: Scripts where characters are ligated, other than for conjuncts, on a regular basis. For example, this applies to consonants followed by /u/ in Tamil, to the lam-alif ligature in Arabic, and to certain glyph combinations in Syriac, etc.
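The three values of the 'Case' entry above can be observed with ordinary string methods. This is a small sketch, assuming Python 3.7 or later (earlier versions bundle Unicode data without the Georgian Mtavruli letters). Latin has both uppercase and titlecase behaviour, Georgian has an uppercase mapping that is only meant for whole words and so leaves titlecasing unchanged, and Javanese has no case mappings in the Unicode data at all.

```python
latin = "abc"
georgian = "\u10D0\u10D1\u10D2"  # Mkhedruli letters an, ban, gan
javanese = "\uA98F"              # JAVANESE LETTER KA

# Regular case: both operations change the text.
print(latin.upper(), latin.title())

# All-caps only: upper() maps to Mtavruli, title() leaves the word as it is.
print(georgian.upper(), georgian.title())

# No algorithmic case mapping: the text is returned unchanged.
print(javanese.upper() == javanese)
```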
?show=cclusters
These items are shown by clicking on the 'Consonant clusters' button.
Medials: The orthography uses dedicated combining or other characters to represent the second consonant in a syllable-initial cluster. Arrangements that use ordinary conjunct forms, or ordinary letters do not count here; these must be code points specially created for medial positions. Abbreviations have the following meanings:
cm a dedicated combining mark.
let a dedicated letter.
Finals: The orthography uses dedicated combining or other characters to represent syllable or word final consonants. Arrangements that use ordinary conjunct forms, or ordinary letters do not count here; these must be code points specially created for syllable coda positions. Abbreviations have the following meanings:
cm a dedicated combining mark.
let a dedicated letter.
vk a final letter that has a vowel-killer diacritic attached.
ss superscripts, eg. in Canadian syllabics.
Stacked: Consonants are stacked to indicate consonant clusters. This may or may not involve merging the consonant shapes.
Conjoined: Consonant shapes are merged horizontally, rather than stacked.
Ligated: Consonant shapes are merged in a way that is more complex than simply stacking or conjoining. Often it is difficult to see the component parts of the conjunct that is formed.
Touching: Consonant shapes are closer to each other than normal, but no shapes are altered.
Visible killer: A dedicated and visible character is used to indicate a cluster. The character may not always be visible, but a check mark here indicates that use of a visible marker is a relatively common approach.
Diacritic: Consonant clusters are indicated by a diacritic other than a virama, such as the sukun in Arabic. In orthographies that hide vowel diacritics, these are often also hidden in 'unvocalised' text.
Killer type: Where a special character is introduced to 'kill' the inherent vowel, this column indicates the type of killer. Options include:
v a virama that is usually invisible while creating a conjunct, but may be visible if the font doesn't support the conjunct glyph.
i an 'invisible stacker' that never has a visible glyph (and may create a conjunct in other ways than just stacking).
k a 'pure killer', ie. a glyph that is always visible.
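The killer types listed above are a matter of orthographic and font behaviour rather than of character properties. The sketch below (the helper `killer_info` is hypothetical) prints the name, general category, and canonical combining class of three characters that, as far as we are aware, illustrate the three types: the Devanagari virama (v), the Myanmar virama used as an invisible stacker (i), and the Myanmar asat, which is always visible (k). All three report the same category and the same combining class of 9, so the type cannot be read off these properties alone.

```python
import unicodedata

# Assumed examples of the three killer types: v, i, and k respectively.
KILLERS = ["\u094D", "\u1039", "\u103A"]


def killer_info(char):
    """Return (name, general category, canonical combining class)."""
    return (
        unicodedata.name(char),
        unicodedata.category(char),
        unicodedata.combining(char),
    )


for char in KILLERS:
    print(f"U+{ord(char):04X}", *killer_info(char))
```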
Word separator. Indicates how and if words are separated. The following alternatives are called out:
space: Words are separated by spaces, eg. Hebrew.
wb: Words are visually separated, but by a non-space character, eg. Amharic. The character used is usually shown.
no: No explicit delimiters, eg. Chinese. When followed by an asterisk this language allows stacking of word-final consonants and following word-initial consonants (ie. separating words for line-breaks or highlighting doesn't work well, since stacks can't be split).
zwsp: There is no visual delimiter, but a zero-width space may be used, eg. Khmer.
syllable: Spaces are used, but they separate syllables, not words, eg. Vietnamese or Lisu.
sb: Again, syllables are separated rather than words, but using a non-space character, eg. Tibetan.
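Where a separator character exists, whether visible or not, text can be divided into units by splitting on it. The following is a minimal sketch only: the helper `split_on` and the sample strings are illustrative assumptions, and real segmentation for scripts with no delimiter at all (the 'no' case) needs dictionary-based or statistical methods instead.

```python
ZWSP = "\u200B"   # zero-width space, an invisible word delimiter (zwsp)
TSEK = "\u0F0B"   # Tibetan tsek, a visible syllable delimiter (sb)


def split_on(text, separator):
    """Split text on a separator character, dropping empty trailing units."""
    return [unit for unit in text.split(separator) if unit]


# Two Tibetan syllables, each followed by a tsek.
tibetan = "\u0F56\u0F7C\u0F51" + TSEK + "\u0F66\u0F90\u0F51" + TSEK
# Two Khmer words joined by a zero-width space.
khmer = "ខ្ញុំ" + ZWSP + "ទៅ"

print(len(split_on(tibetan, TSEK)))
print(len(split_on(khmer, ZWSP)))
```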
Linebreak. Indicates the primary break point for wrapping lines. It is useful to compare this column with the 'Word separator' column just described. Note that nearly all scripts have rules about which punctuation characters can appear at the end or start of a line. The alternatives are:
word: Text wraps at word boundaries.
syllable: Text wraps at syllable boundaries, regardless of whether word boundaries are delimited.
char: Text wraps immediately after the last character that fits on a line, regardless of word or syllable boundaries.
Hyphenation. Whether or not hyphenation is used with the script. Hyphenation here means splitting words at the end of a line as a secondary line-breaking mechanism, after lines have initially been broken at word boundaries, in order to make justified paragraphs look better. Scripts may use other visual cues than a hyphen, and may sometimes use no visual indicator that the word was broken. Values include:
yes <char>: Hyphenation occurs, using the character indicated as a visual marker of the word break.
(yes) <char>: Hyphenation occurs but is rare.
yes ∅: Words are broken to fit at the line end, but no visual indicator is added to indicate that the word continues on the next line.
no: The primary line-break algorithm involves word boundaries, but words are not broken at the end of a line.
n/a: The primary line-break algorithm takes no account of word boundaries (eg. Japanese, Thai, etc.).
↵ indicates the line-break. For example, the cell for Mongolian shows "↵᠆", which indicates that the Todo soft hyphen appears at the start of the second line, rather than at the end of the first. If Polish were in the list, you would see -↵-, which indicates that the hyphen appears both at the end and beginning of the line.
* indicates that although the visual marker looks like a hyphen, it is actually a different character.
Word-spanning conjuncts. In some scripts conjuncts (usually stacks) may include the last consonant of one word and the first consonant of a following word. This prevents words being wrapped at word boundaries, since conjuncts cannot be split. Many scripts have conjuncts, but only a few have this feature.
Grapheme clusters. Whether the text units typically used for line-breaking conform to Unicode grapheme clusters. Many Indic scripts which stack characters are not adequately served by grapheme clusters, since a grapheme cluster will end after a virama.
Justification. Indicates the principal method(s) for full justification of text. Higher-end typographic systems will typically apply more than one method, and across whole paragraphs rather than just a single line. Here we simply aim to give an idea of the most common approach, or approaches, where there is a mixture. Alternatives include:
sp: Spaces between words or syllables are stretched, eg. Ukrainian. In some orthographies, eg. Thai, the stretched spaces are phrase delimiters, rather than around words.
ic: Characters are separated by equal amounts of space across a line, eg. Chinese. (In practice, some characters tend to attract this spacing before others.)
ig: Space is introduced between unconnected glyphs, eg. Thai not only adds space around base characters, but also between those base characters and associated vowel-signs that are not combining marks. In Tamil, vowel-signs that don't interact with the base character may be separated in narrow column text when there is only one word on a line, even though the base character and vowel-sign together make a single grapheme cluster.
str: Connections between letters are stretched in cursive scripts. The orthography may also introduce elongated forms for certain characters, eg. swash forms in Arabic. (In fact, Arabic may also introduce ligatures to fit more words on a line.)
sw: Some letters are given lengthened swash shapes to fill up space, such as in Arabic.
pad: Characters are repeated to pad out remaining space at the end of a line, eg. multiple tseks at the line end in Tibetan.
none: Full justification is not a feature of the language, eg. Balinese.
Baseline. The baseline for Latin text is labelled 'romn'. Scripts that hang from a high baseline, like many Indic scripts, are labelled 'hang'. Scripts like Chinese are labelled 'ideo'. Scripts that use a centre baseline are labelled 'cntr'.
romn: the baseline used for Latin script.
hang: a hanging baseline, such as used by several Indic scripts.
ideo: a low baseline, such as used by Chinese.
cntr: a vertical baseline that runs through the centre of the character glyphs, such as used by vertical Japanese or by Mongolian.
This rough grouping places the script in the region where it originated, so English is in Europe, and Arabic is in West Asia. It serves to get a very rough idea of how things stack up on a regional basis. Regions are one of the following:
nam (Northern America)
sam (South America)
cam (Central America)
carib (Caribbean)
eur (Europe - includes Russia to Urals and Georgia, but not Armenia or Azerbaijan)
easia (East Asia - includes China, Mongolia, Japan, Korea)
nasia (Northern Asia - Russia east of Urals)
seasia (Southeast Asia - including Indonesia, Philippines)
casia (Central Asia - north of Iran, S of Russia, W of China)
wasia (Western Asia - includes Armenia, Azerbaijan, Turkey, & the Middle East)
afr (Africa)
oce (Oceania - includes Australia, NZ, and Pacific Islands)
The following map shows how Africa, Europe, and Asia are roughly divided up along the lines of these regional labels.