This page provides explanations of terms used in the articles about writing systems and scripts. Text in italics is cited from elsewhere (most often the Unicode glossary).
abugida
A writing system in which consonants are indicated by the base letters that have an inherent vowel, and in which other vowels are indicated by additional distinguishing marks of some kind modifying the base letter. The term “abugida” is derived from the first four letters of the Ethiopic script in the Semitic order: alf, bet, gaml, dant. (See Section 6.1, Writing Systems.)
abjad
A writing system in which only consonants are indicated. The term “abjad” is derived from the first four letters of the traditional order of the Arabic script: alef, beh, jeem, dal. (See Section 6.1, Writing Systems.)
alphabet
A writing system in which both consonants and vowels are indicated. The term “alphabet” is derived from the first two letters of the Greek script: alpha, beta. (See Section 6.1, Writing Systems.)
brahmi phonetic matrix
It is common to arrange consonants in writing systems descended from the archaic Brahmi script in tabular form. The first 5 rows of the matrix normally represent plosives and nasals, and are ordered by place of articulation, as shown in this table.
The remaining two rows contain liquids and fricatives.
Unicode characters such as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras. They do not normally appear alone unless they are being described, but are combined with a preceding base character. More than one combining character may be associated with the same base character. Many combining characters appear above, below, or inside the base character. Some, however, consume space along the baseline, either before or after the base character, and are referred to as spacing marks or spacing combining characters.
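The distinction between non-spacing and spacing combining characters can be inspected programmatically. The following is a minimal sketch using Python's standard `unicodedata` module; the helper `describe` and the two sample characters are choices made for this illustration only.

```python
import unicodedata

def describe(ch: str) -> tuple:
    """Return (name, general category, canonical combining class) for one code point."""
    return unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch)

acute = "\u0301"   # COMBINING ACUTE ACCENT, a non-spacing mark (category Mn)
sign_i = "\u093F"  # DEVANAGARI VOWEL SIGN I, a spacing mark (category Mc)

for ch in (acute, sign_i):
    print(f"U+{ord(ch):04X}", *describe(ch))

# A base character followed by a combining character is stored as two code points,
# even though it is normally displayed as a single unit.
e_acute = "e" + acute
print(len(e_acute))
```

General categories beginning with `M` (`Mn`, `Mc`, `Me`) identify combining marks; `Mc` is the category used for spacing combining marks.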
When a single vowel-sign code point produces glyphs on more than one side of the consonant base, it is referred to as a circumgraph. For example, the following single code point representing a vowel-sign is rendered on both sides of the base consonant.
A composite vowel is a single vowel sound or diphthong that is represented by more than one code point from the set of vowel-signs, repurposed consonants, and diacritics available. For example, the following Tai Tham syllable kɤʔ is represented using 4 vowel-signs (orange) attached to the single base consonant.
In Brahmi-derived scripts it is common to find consonant clusters, i.e. sequences of consonants without intervening vowels. The absence of intervening vowels is often indicated visually by merging or changing the glyphs for the sequence in some way; the result is referred to as a conjunct. Examples include कक → क्क and কষ → ক্ষ. Conjunct behaviour is triggered in Unicode by adding a virama between the consonants.
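At the code point level, the two examples above amount to inserting the script's virama between the consonants. A minimal sketch follows; the `VIRAMA` table, its keys, and the `conjunct` helper are hypothetical names introduced here, and whether a conjunct glyph actually appears depends on the font and rendering engine.

```python
# Virama code points for the two scripts used in the examples (keys are hypothetical labels).
VIRAMA = {
    "devanagari": "\u094D",  # DEVANAGARI SIGN VIRAMA
    "bengali": "\u09CD",     # BENGALI SIGN VIRAMA
}

def conjunct(first: str, second: str, script: str) -> str:
    """Join two consonants with the script's virama to request conjunct rendering."""
    return first + VIRAMA[script] + second

kka = conjunct("\u0915", "\u0915", "devanagari")  # KA + virama + KA
kssa = conjunct("\u0995", "\u09B7", "bengali")    # KA + virama + SSA

print(kka, [f"U+{ord(c):04X}" for c in kka])
print(kssa, [f"U+{ord(c):04X}" for c in kssa])
```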
Decomposed text is usually the result of applying Unicode normalization form D (NFD), which splits Unicode characters into component parts, typically a base character plus one or more diacritics. However, a decomposed sequence of code points may also be intentionally (or unintentionally) used by a content author where a precomposed alternative exists. (Example)
A symbol or sign that represents a vowel and that is attached to or combined with another symbol, usually one that represents a consonant. In Semitic and Indic writing systems, vowels are normally represented by dependent vowel-signs. Dependent vowels are usually combining characters, but may also be standalone (e.g. in Thai, or New Tai Lue, which has no combining characters). (Example)
A grapheme is a user-perceived unit of text. Graphemes generally include base characters with their combining diacritics. In some orthographies they may also include groupings of two or more base characters, with their diacritics, such as for Hindi, where the following 6 characters resolve to just 2 text units: हिन्दी → हि+न्दी.
The Unicode Standard uses generalised rules to define 'grapheme clusters', which approximate the likely grapheme boundaries in a writing system. Grapheme clusters are used as a basic unit of text for operations including forwards/backwards deletion, cursor movement and selection, character counts, searching and matching, text insertion, line-breaking, justification, case conversions, and sorting. The grapheme cluster definition may need to be tailored for some orthographies, for example to handle conjuncts such as न्दी in Indic scripts.
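To see why a generic rule may need tailoring, consider a deliberately simplified segmentation that attaches every combining mark to the preceding base character. This is only a rough sketch and not the full Unicode text segmentation algorithm; `rough_clusters` is a hypothetical helper written for this illustration.

```python
import unicodedata

def rough_clusters(text: str) -> list:
    """Group each base character with the combining marks (categories M*) that follow it.

    A simplification for illustration only, not the Unicode grapheme cluster rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # attach the mark to the current cluster
        else:
            clusters.append(ch)  # start a new cluster at each base character
    return clusters

word = "\u0939\u093F\u0928\u094D\u0926\u0940"  # the six code points of the Hindi example
print(len(word))
print(rough_clusters(word))
```

Under this simplified rule the virama stays with the consonant before it, so the conjunct is split into two units rather than kept together. That is the kind of case where a tailored definition may be wanted.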
Arabic script diacritical marks considered to be part of a basic letter form. Unicode encodes letter+ijam combinations as separate, atomic characters, which are never given decompositions in the standard. Ijam generally take the form of one-, two-, three- or four-dot markings above or below the basic letter skeleton, although other diacritic forms occur in extensions of the Arabic script in Central and South Asia and in Africa. Compare with tashkil.
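The atomic encoding of letters with ijam can be checked against the Unicode character database. The sketch below uses Python's `unicodedata` module with three dotted letters chosen for illustration; the `LETTERS` table and `is_atomic` helper are hypothetical names.

```python
import unicodedata

# Three letters that share a skeleton and differ only in their dots (ijam).
LETTERS = {
    "beh": "\u0628",
    "teh": "\u062A",
    "theh": "\u062B",
}

def is_atomic(ch: str) -> bool:
    """True if the character has no decomposition and is unchanged by NFD."""
    return unicodedata.decomposition(ch) == "" and unicodedata.normalize("NFD", ch) == ch

for label, ch in LETTERS.items():
    print(label, f"U+{ord(ch):04X}", is_atomic(ch))
```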
In Indic scripts, certain vowels are depicted using independent letter symbols that stand on their own. This is often true when a word starts with a vowel or a word consists of only a vowel.
In writing systems based on a script in the Brahmi family of Indic scripts, a consonant letter symbol normally has an inherent vowel, unless otherwise indicated. The phonetic value of this vowel differs among the various languages written with these writing systems. An inherent vowel is overridden either by indicating another vowel with an explicit vowel sign or by using virama to create a dead consonant. (Example)
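The two ways of overriding an inherent vowel can be shown with a single Devanagari consonant. This is a small sketch; the dictionary keys are approximate romanisations assumed for illustration, and `names` is a hypothetical helper.

```python
import unicodedata

KA = "\u0915"      # consonant letter, carries an inherent vowel
SIGN_I = "\u093F"  # explicit vowel sign
VIRAMA = "\u094D"  # suppresses the inherent vowel

# Keys are approximate romanisations, used here only as labels.
forms = {
    "ka": KA,           # bare consonant: inherent vowel
    "ki": KA + SIGN_I,  # inherent vowel replaced by an explicit vowel sign
    "k": KA + VIRAMA,   # dead consonant: no vowel
}

def names(text: str) -> list:
    """List the Unicode character names for each code point in the text."""
    return [unicodedata.name(c) for c in text]

for label, text in forms.items():
    print(label, text, names(text))
```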
Normalisation is the process of removing alternate representations of equivalent sequences from textual data, to convert the data into a form that can be binary-compared for equivalence.
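As a minimal sketch of comparing text for equivalence, the following normalises both strings before a plain equality test. The function name `canonically_equal` is hypothetical, and NFC is used here as one reasonable choice of form.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after converting both to the same normalisation form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u00E4"  # single code point
decomposed = "a\u0308"  # base letter plus combining diaeresis

print(precomposed == decomposed)                   # raw comparison of code points
print(canonically_equal(precomposed, decomposed))  # comparison after normalisation
```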
normalization form C (NFC)
See also normalisation. A normalization form that erases any canonical differences, and generally produces a composed result. For example, a + umlaut is converted to ä in this form. This form most closely matches legacy usage. The formal definition is D120 in Section 3.11, Normalization Forms.
normalization form D (NFD)
See also normalisation. A normalization form that erases any canonical differences, and produces a decomposed result. For example, ä is converted to a + umlaut in this form. This form is most often used in internal processing, such as in collation. The formal definition is D118 in Section 3.11, Normalization Forms.
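The ä example used in the NFC and NFD entries can be reproduced with Python's `unicodedata.normalize`. This is a small sketch; `code_points` is a hypothetical helper for display.

```python
import unicodedata

def code_points(text: str) -> list:
    """Format each code point in the text as U+XXXX."""
    return [f"U+{ord(c):04X}" for c in text]

nfc = unicodedata.normalize("NFC", "a\u0308")  # a + combining diaeresis, composed
nfd = unicodedata.normalize("NFD", "\u00E4")   # precomposed letter, decomposed

print(code_points(nfc))
print(code_points(nfd))
```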
A precomposed character is one that can also be broken down into separate code points representing its component parts (decomposition). Typically this will include base characters plus diacritics, such as accented Latin characters, or Indic characters with nuktas. Normalisation Form C (NFC) produces precomposed characters from many decomposed sequences. (example)
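The qualifier "many" matters: some precomposed characters are excluded from composition, so NFC leaves their decomposed sequences as they are. The sketch below contrasts an accented Latin letter with a Devanagari letter with nukta; the choice of these two characters is an assumption made for illustration.

```python
import unicodedata

a_umlaut = "\u00E4"  # LATIN SMALL LETTER A WITH DIAERESIS
qa = "\u0958"        # DEVANAGARI LETTER QA (KA with nukta)

# Both precomposed characters have a canonical decomposition.
print(unicodedata.decomposition(a_umlaut))
print(unicodedata.decomposition(qa))

# NFC recomposes the Latin sequence into the precomposed character ...
latin_nfc = unicodedata.normalize("NFC", "a\u0308")
# ... but leaves KA + nukta decomposed, and also decomposes the precomposed QA.
nukta_nfc = unicodedata.normalize("NFC", "\u0915\u093C")
qa_nfc = unicodedata.normalize("NFC", qa)

print(len(latin_nfc), len(nukta_nfc), len(qa_nfc))
```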
Combining characters that consume space along the baseline, either before or after the base character. (example)
Arabic script marks functioning to indicate vocalization of text, as well as other types of phonetic guides to correct pronunciation. They are separately encoded as combining marks. These include several subtypes: harakat (short vowel marks), tanwin (postnasalized or long vowel marks), shaddah (consonant gemination mark), and sukun (to mark lack of a following vowel). A basic Arabic letter plus any of these types of marks is never encoded as a separate, precomposed character, but must always be represented as a sequence of letter plus combining mark. Additional marks invented to indicate non-Arabic vowels, used in extensions of the Arabic script, are also encoded as separate combining marks. Compare with ijam.
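The statement that a letter plus tashkil is always a sequence can be checked by normalising such sequences to NFC and confirming that nothing is composed. A minimal sketch follows, with one sample mark per subtype; the `TASHKIL` table and `stays_sequence` helper are hypothetical names.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, used as the base letter

# One sample mark for each subtype named above.
TASHKIL = {
    "fatha (haraka)": "\u064E",
    "fathatan (tanwin)": "\u064B",
    "shadda": "\u0651",
    "sukun": "\u0652",
}

def stays_sequence(base: str, mark: str) -> bool:
    """True if NFC keeps base + mark as two separate code points."""
    return len(unicodedata.normalize("NFC", base + mark)) == 2

for label, mark in TASHKIL.items():
    print(label, unicodedata.category(mark), stays_sequence(BEH, mark))
```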
In a transliteration each native character is associated with an equivalent and unique Latin-script character. The transliteration may not accurately represent pronunciation, but does allow straightforward and reversible conversion between the two scripts. Compare with transcription.
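Reversibility follows from the one-to-one mapping. The toy table below is hypothetical, covers only four Greek letters, and is not a real transliteration scheme; it only sketches the round trip.

```python
# Hypothetical toy table: each native character maps to a unique Latin character.
TO_LATIN = {"α": "a", "β": "b", "γ": "g", "δ": "d"}
# Because the mapping is one-to-one, it can be inverted without loss.
TO_NATIVE = {latin: native for native, latin in TO_LATIN.items()}

def transliterate(text: str, table: dict) -> str:
    """Replace each character using the table, leaving unmapped characters unchanged."""
    return "".join(table.get(ch, ch) for ch in text)

original = "γαβα"
latin = transliterate(original, TO_LATIN)
restored = transliterate(latin, TO_NATIVE)
print(latin, restored == original)
```

If two native characters shared a Latin equivalent, the inverted table would lose one of them and the round trip would no longer be guaranteed.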
A transcription is likely to be more phonetically accurate than a transliteration (though it usually still only approximates the actual sound) and, in particular, does not usually allow completely reversible conversion.
Vocalics are letters derived from Sanskrit that generally behave like vowels, but represent r or l followed by a vowel. They are often available both as vowel-signs and independent vowel letters.
A glyph representing a vowel, used in Indic scripts to override the inherent vowel. Sometimes vowel-signs have multiple parts, which are displayed on different sides of the base consonant or cluster. In other cases, multiple vowel-signs are attached to a consonant to produce a particular vowel sound. Vowel-signs are usually combining characters, but may sometimes be a combination of combining characters and free-standing characters (e.g. Thai, Lao), or just free-standing characters (e.g. New Tai Lue). (example)