Glossary of script-related terms

Updated 5 May, 2022

This page provides explanations of terms used in the articles about writing systems and scripts. Text in italics is cited from elsewhere (most often the Unicode glossary).


abugida
A writing system in which consonants are indicated by the base letters that have an inherent vowel, and in which other vowels are indicated by additional distinguishing marks of some kind modifying the base letter. The term “abugida” is derived from the first four letters of the Ethiopic script in the Semitic order: alf, bet, gaml, dant. (See Section 6.1, Writing Systems.)
abjad
A writing system in which only consonants are indicated. The term “abjad” is derived from the first four letters of the traditional order of the Arabic script: alef, beh, jeem, dal. (See Section 6.1, Writing Systems.)
alphabet
A writing system in which both consonants and vowels are indicated. The term “alphabet” is derived from the first two letters of the Greek script: alpha, beta. (See Section 6.1, Writing Systems.)


BCCS (base & combining character sequence)
A sequence of characters following the pattern Base (Combining_mark | ZWJ | ZWNJ)*. A base character that is a letter or digit, followed by zero or more combining characters, zero width joiners, and/or zero width non-joiners. This commonly reflects the minimal typographic unit used for operating on text.
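As a rough illustration of the pattern above, the following Python sketch splits a string into BCCS-like units. It is a simplification rather than a reference implementation: the helper names `is_extender` and `split_bccs` are invented here, any character that is not a combining mark, ZWJ or ZWNJ is treated as a base (the definition restricts bases to letters and digits), and combining marks are detected through the general category reported by the standard `unicodedata` module.

```python
import unicodedata

ZWJ = "\u200D"   # zero width joiner
ZWNJ = "\u200C"  # zero width non-joiner


def is_extender(ch):
    """True for combining marks (categories Mn, Mc, Me), ZWJ and ZWNJ."""
    return unicodedata.category(ch).startswith("M") or ch in (ZWJ, ZWNJ)


def split_bccs(text):
    """Split text into units of one base plus any following extenders.

    Assumption: every non-extender starts a new unit, and an extender
    at the very start of the text becomes a unit of its own.
    """
    units = []
    for ch in text:
        if units and is_extender(ch):
            units[-1] += ch   # attach to the current unit
        else:
            units.append(ch)  # start a new unit
    return units


# The Hindi word from the grapheme entry, written with escapes:
# HA, VOWEL SIGN I, NA, VIRAMA, DA, VOWEL SIGN II
hindi = "\u0939\u093F\u0928\u094D\u0926\u0940"
units = split_bccs(hindi)
```

Under these assumptions the six code points of the Hindi word fall into three units (HA + vowel sign, NA + virama, DA + vowel sign), one more than the two user-perceived units described under grapheme.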
brahmi phonetic matrix
It is common to arrange consonants in writing systems descended from the archaic Brahmi script in tabular form. The first 5 rows of the matrix normally represent plosives and nasals, and are ordered by place of articulation, as shown in this table.
retroflex ʈ ʈʰ ɖ ɖʰ ɳ
dental t d n
bilabial p b m

The remaining two rows contain liquids and fricatives.


checked syllable
A checked syllable ends in a stop sound, usually -p, -t, -k, or -ʔ. Unchecked syllables end in sonorants m, n, ŋ, w, j or vowels.
circumgraph
When a single vowel-sign code point produces glyphs on more than one side of the consonant base, it is referred to as a circumgraph. For example, the following single codepoint representing a vowel-sign is rendered on both sides of the base consonant.
coda (syllable)
See syllable structure.
combining character
Unicode characters such as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras. They normally never appear alone unless they are being described, but are combined with a preceding base character. More than one combining character may be associated with the same base character. Many combining characters appear above, below, or inside the base character; however, some consume space along the baseline, either before or after the base character, and are referred to as spacing marks, or spacing combining characters.
combining character sequence (CCS)
A maximal sequence of characters following the pattern Base? (Combining_mark | ZWJ | ZWNJ)+. Usually a base character that is a letter or digit, followed by one or more combining characters, zero width joiners, and/or zero width non-joiners. See also BCCS.
composite vowel
A composite vowel is a single vowel sound or diphthong that is represented by more than one code point from the set of vowel-signs, repurposed consonants, and diacritics available. For example, the following Tai Tham syllable kɤʔ is represented using 4 vowel-signs (orange) attached to the single base consonant.
conjunct
In Brahmi-derived scripts it is common to find consonant clusters, ie. a sequence of consonants without intervening vowels. It is also common that the absence of intervening vowels is indicated visually by merging or changing the glyphs for the sequence in some way. This is what is referred to as a conjunct. Examples include कक → क्क, or কষ → ক্ষ. Conjunct behaviour is triggered in Unicode by adding a virama between the consonants.
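The role of the virama can be sketched in a few lines of Python. The function name `make_conjunct` is made up for this illustration; the code only builds the code point sequence, and whether a merged conjunct glyph actually appears depends on the font and rendering engine.

```python
DEVANAGARI_VIRAMA = "\u094D"
BENGALI_VIRAMA = "\u09CD"


def make_conjunct(consonants, virama):
    """Join consonant letters with a virama between each pair."""
    return virama.join(consonants)


# Devanagari KA + KA, and Bengali KA + SSA
kka = make_conjunct(["\u0915", "\u0915"], DEVANAGARI_VIRAMA)
kssa = make_conjunct(["\u0995", "\u09B7"], BENGALI_VIRAMA)
```

Each result is three code points long: consonant, virama, consonant.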
consonant cluster
A consonant cluster is a sequence of consonants with no intervening vowels.
contour tone language
A contour tone language has patterns where the pitch moves up and down over the course of a syllable. Compare with register tone language.


decomposed text
Decomposed text is usually the result of applying Unicode normalization form D (NFD), which splits Unicode characters into component parts, typically a base character plus one or more diacritics. However, a decomposed sequence of code points may also be intentionally (or unintentionally) used by a content author where a precomposed alternative exists.
dependent vowel
A symbol or sign that represents a vowel and that is attached or combined with another symbol, usually one that represents a consonant. In Semitic and Indic writing systems, vowels are normally represented by dependent vowel-signs. Dependent vowels are usually combining characters, but may also be standalone (eg. in Thai, or New Tai Lue, which has no combining characters).


grapheme
A grapheme is a user-perceived unit of text. Graphemes generally include base characters with their combining diacritics. In some orthographies they may also include groupings of two or more base characters, with their diacritics, such as for Hindi, where the following 6 characters resolve to just 2 text units: हिन्दी → हि+न्दी.
grapheme cluster
The Unicode Standard uses generalised rules to define 'grapheme clusters', which approximate the likely grapheme boundaries in a writing system. Grapheme clusters are used as a basic unit of text for operations including forwards/backwards deletion, cursor movement & selection, character counts, searching & matching, text insertion, line-breaking, justification, case conversions, and sorting. The grapheme cluster definition may need to be tailored for some orthographies, such as conjuncts like न्दी in Indic scripts.


hyphenation
Hyphenation refers to an extra set of rules applied after the basic line-break algorithm to split words at syllable or morphological boundaries in order to improve the layout of a paragraph. Hyphenation may or may not be indicated using a visual marker at the end or start of a line; however, it is commonly marked by a hyphen or other glyph.


ijam
Arabic script diacritical marks considered to be part of a basic letter form. Unicode encodes letter+ijam combinations as separate, atomic characters, which are never given decompositions in the standard. Ijam generally take the form of one-, two-, three- or four-dot markings above or below the basic letter skeleton, although other diacritic forms occur in extensions of the Arabic script in Central and South Asia and in Africa. Compare with tashkil.
independent vowel
In Indic scripts, certain vowels are depicted using independent letter symbols that stand on their own. This is often true when a word starts with a vowel or a word consists of only a vowel.
inherent vowel
An inherent vowel is a vowel sound that is automatically pronounced after a consonant letter, unless specifically suppressed. In writing systems based on a script in the Brahmi family of Indic scripts, a consonant letter symbol normally has an inherent vowel, unless otherwise indicated. The phonetic value of this vowel differs among the various languages written with these writing systems. An inherent vowel is overridden either by indicating another vowel with an explicit vowel sign or by using virama to create a dead consonant.


justification
The essence of justification (unlike letter-spacing) is that text is arranged to fit within a given distance, usually a line width.


letter spacing
Unlike justification, which fits text within a fixed space, letter-spacing adds (typically regular) amounts of space between letters, and the resulting length is a by-product of that.


mater lectionis
In the spelling of Arabic, Hebrew, and other Semitic languages, mater lectionis refers to a consonant that is used to indicate the location or length of a vowel. See also Wikipedia.
moraic syllabary
In a moraic syllabary long vowels and syllables with a final consonant are written with two symbols. For example, the Japanese word for Japan (日本) contains 2 syllables, ni-hon, but 3 morae, ni-ho-n. Similarly, the name of the city of Osaka has 3 syllables, oo-sa-ka, but 4 morae, o-o-sa-ka. The spelling of these words in the hiragana syllabic script has symbols for each mora, ie. にほん and おおさか. Morae tend to be counted as units in poetry, such as haiku, and in some languages play a role in allocation of tone.
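The hiragana examples above can be checked with a small Python sketch that counts one mora per kana symbol. The function `count_morae` is hypothetical and deliberately simplified: it assumes the input is pure kana, and the only adjustment it makes is that a small ya/yu/yo combines with the preceding kana into a single mora (as in きょ); other small kana and special cases are not handled.

```python
# Small ya/yu/yo (hiragana and katakana) do not add a mora of their own.
SMALL_Y_KANA = set("ゃゅょャュョ")


def count_morae(kana):
    """Count morae in a kana string, one per symbol except small ya/yu/yo."""
    return sum(1 for ch in kana if ch not in SMALL_Y_KANA)


nihon = count_morae("にほん")     # ni-ho-n
osaka = count_morae("おおさか")   # o-o-sa-ka
```

With this rule きょう counts as two morae (kyo-u), even though it is written with three kana.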


normalisation
Normalisation is the process of removing alternate representations of equivalent sequences from textual data, to convert the data into a form that can be binary-compared for equivalence.
normalization form C (NFC)
See also normalisation. A normalization form that erases any canonical differences, and generally produces a composed result. For example, a + umlaut is converted to ä in this form. This form most closely matches legacy usage. The formal definition is D120 in Section 3.11, Normalization Forms.
normalization form D (NFD)
See also normalisation. A normalization form that erases any canonical differences, and produces a decomposed result. For example, ä is converted to a + umlaut in this form. This form is most often used in internal processing, such as in collation. The formal definition is D118 in Section 3.11, Normalization Forms.
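The ä example used for NFC and NFD can be reproduced with Python's standard `unicodedata` module. This is a small sketch for illustration; the helper name `canonically_equal` is invented here, and comparing NFC forms is just one of several reasonable ways to test for equivalence.

```python
import unicodedata

composed = "\u00E4"     # ä as a single precomposed code point
decomposed = "a\u0308"  # a followed by COMBINING DIAERESIS

nfd = unicodedata.normalize("NFD", composed)    # splits ä into a + mark
nfc = unicodedata.normalize("NFC", decomposed)  # recombines a + mark into ä


def canonically_equal(a, b):
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

The two strings differ when compared code point by code point, but compare as equal once both are brought to the same normalization form.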
nucleus (syllable)
See syllable structure.


onset (syllable)
See syllable structure.


pre-base vowel
The glyph of a pre-base (or prescript) vowel-sign is displayed to the left of the consonant or orthographic syllable after which it is pronounced. It is still typed and stored, however, in pronunciation order.
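One way to see the stored order is to list the code points of a syllable. The sketch below uses the Devanagari syllable कि (ki), chosen here as an example because its vowel-sign is drawn to the left of the consonant; `code_point_names` is a hypothetical helper built on the standard `unicodedata` module.

```python
import unicodedata


def code_point_names(text):
    """Return the Unicode character name of each code point, in stored order."""
    return [unicodedata.name(ch) for ch in text]


ki = "\u0915\u093F"  # stored as consonant first, vowel-sign second
names = code_point_names(ki)
```

The consonant comes first in memory and the vowel-sign second, even though the vowel-sign glyph is displayed to the left.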
precomposed text/characters
A precomposed character is one that can also be broken down into separate code points representing its component parts (decomposition). Typically this will include base characters plus diacritics, such as accented Latin characters, or Indic characters with nuktas. Normalisation Form C (NFC) produces precomposed characters from many decomposed sequences.


register tone language
A register tone language contrasts only relative pitch levels. It does not have patterns where the pitch moves up and down over the course of a syllable. Compare with contour tone language.
rhyme (syllable)
See syllable structure.


spacing mark
Combining characters that consume space along the baseline, either before or after the base character.
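In the Unicode character database, spacing marks largely correspond to the general category Mc (Mark, spacing combining), as opposed to Mn (nonspacing) and Me (enclosing). The following Python sketch classifies a character on that basis; `mark_kind` is an invented helper name, and treating the general category as a stand-in for actual rendering behaviour is an assumption.

```python
import unicodedata

MARK_KINDS = {"Mc": "spacing", "Mn": "nonspacing", "Me": "enclosing"}


def mark_kind(ch):
    """Return the kind of combining mark, or None if ch is not a mark."""
    return MARK_KINDS.get(unicodedata.category(ch))


vowel_sign_i = mark_kind("\u093F")  # DEVANAGARI VOWEL SIGN I
acute = mark_kind("\u0301")         # COMBINING ACUTE ACCENT
```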
standalone vowel
Standalone vowels are not preceded by a consonant sound, and may appear at the beginning or in the middle of a word.
syllabary
A type of writing system in which each symbol typically represents both a consonant and a vowel, or in some instances more than one consonant and a vowel.
syllable structure
A model of syllable structure divides the syllable into an onset followed by a rhyme. The rhyme is typically composed of a nucleus and an optional coda. The nucleus is the most sonorous part of the syllable. A syllable always has a nucleus, but syllables may have no onset and/or coda (eg. compare 'but', 'an', 'the', 'a').


tashkil
Arabic script marks functioning to indicate vocalization of text, as well as other types of phonetic guides to correct pronunciation. They are separately encoded as combining marks. These include several subtypes: harakat (short vowel marks), tanwin (postnasalized or long vowel marks), shaddah (consonant gemination mark), and sukun (to mark lack of a following vowel). A basic Arabic letter plus any of these types of marks is never encoded as a separate, precomposed character, but must always be represented as a sequence of letter plus combining mark. Additional marks invented to indicate non-Arabic vowels, used in extensions of the Arabic script, are also encoded as separate combining marks. Compare with ijam.
tone
Variations in pitch used to convey lexical contrast. Tonal systems usually distinguish 2-5 levels (register tone languages), and many supplement those with combinations of tone on a given syllable (contour tone languages). Tone mobility is also common, whereby the tone of a particular syllable can affect or create the tones of syllables alongside it.
transliteration
In a transliteration each native character is associated with an equivalent and unique Latin-script character. The transliteration may not accurately represent pronunciation, but does allow straightforward and reversible conversion between the two scripts. Compare with transcription.
transcription
A transcription is likely to be more phonetically accurate than a transliteration (though usually still only reflects an approximation to the actual sound), and, in particular, does not usually allow completely reversible conversions.
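The reversibility that distinguishes a transliteration can be illustrated with a one-to-one mapping table. The Python sketch below uses a tiny table for four Greek letters that was made up for this example and does not follow any published transliteration scheme; the names `TO_LATIN`, `FROM_LATIN` and `transliterate` are likewise hypothetical.

```python
# Hypothetical one-to-one table, for illustration only.
TO_LATIN = {"α": "a", "β": "b", "γ": "g", "δ": "d"}

# Inverting the table is only safe because every Latin value is unique.
FROM_LATIN = {latin: native for native, latin in TO_LATIN.items()}
assert len(FROM_LATIN) == len(TO_LATIN)


def transliterate(text, table):
    """Map each character through the table, passing unknown ones through."""
    return "".join(table.get(ch, ch) for ch in text)


latin = transliterate("βαδ", TO_LATIN)
round_trip = transliterate(latin, FROM_LATIN)
```

Because each native character maps to a unique Latin character, converting to Latin and back restores the original string. A transcription, which may map several spellings to one sound, generally offers no such guarantee.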
typographic unit
A unit of text that is not normally split during a particular text operation. The actual constituents of a typographic unit vary according to the process being applied to the text. For example, in complex scripts cursor movement may use smaller typographic units than line-breaking, for the same text. See also BCCS.


vocalic
Vocalics are letters derived from Sanskrit that generally behave like vowels, but represent r or l followed by a vowel. They are often available both as vowel-signs and independent vowel letters.
vowel sign
A glyph representing a vowel, used in Indic scripts to overwrite the inherent vowel. Sometimes vowel-signs have multiple parts, which are displayed on different sides of the base consonant or cluster. In other cases, multiple vowel-signs are attached to a base to produce a particular vowel sound. Vowel-signs are usually combining characters, but may sometimes be a combination of combining characters and free-standing characters (eg. Thai, Lao), or just free-standing characters (eg. New Tai Lue). Vowel signs are typically attached to the onset of an orthographic syllable, rather than to a particular consonant, even when the syllable begins with a consonant cluster.


word boundary
The concept of 'word' is difficult to define in any language (see What is a word?). Here, a word is an often vaguely-defined, but recognisable semantic unit that is typically smaller than a phrase and may comprise one or more syllables. Word boundaries are typically important for text operations such as line-breaking, and for prosodic and phonetic rules.
Last changed 2022-05-05 10:39 GMT.