This page provides explanations of terms used in the articles about writing systems and scripts. Text in italics is cited from elsewhere (most often the Unicode glossary).
abugida
A writing system in which consonants are indicated by the base letters that have an inherent vowel, and in which other vowels are indicated by additional distinguishing marks of some kind modifying the base letter. The term “abugida” is derived from the first four letters of the Ethiopic script in the Semitic order: alf, bet, gaml, dant. (See Section 6.1, Writing Systems.)
abjad
A writing system in which only consonants are indicated. The term “abjad” is derived from the first four letters of the traditional order of the Arabic script: alef, beh, jeem, dal. (See Section 6.1, Writing Systems.)
alphabet
A writing system in which both consonants and vowels are indicated. The term “alphabet” is derived from the first two letters of the Greek script: alpha, beta. (See Section 6.1, Writing Systems.)
brahmi phonetic matrix
It is common to arrange consonants in writing systems descended from the archaic Brahmi script in tabular form. The first 5 rows of the matrix normally represent plosives and nasals, and are ordered by place of articulation, as shown in this table.
The remaining two rows contain liquids and fricatives.
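As a rough illustration of this arrangement, the sketch below builds the five plosive and nasal rows for Devanagari from their code points. The names `PLACES`, `ROW_STARTS` and `matrix` are made up for this example, and the row and column labels follow the conventional phonetic description rather than anything defined by Unicode.

```python
# Place of articulation for each row, from the back of the mouth to the front.
PLACES = ["velar", "palatal", "retroflex", "dental", "labial"]

# First code point of each row in the Devanagari block. The rows are contiguous,
# except that U+0929 (NNNA) sits between the dental and labial rows.
ROW_STARTS = [0x0915, 0x091A, 0x091F, 0x0924, 0x092A]

# Columns in each row: voiceless, voiceless aspirated, voiced,
# voiced aspirated, nasal.
matrix = ["".join(chr(start + i) for i in range(5)) for start in ROW_STARTS]

for place, row in zip(PLACES, matrix):
    print(f"{place:10} {' '.join(row)}")
```

Other Brahmi-derived scripts can be laid out the same way, although their code points are not guaranteed to follow this exact pattern.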
combining character
Unicode characters such as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras. They do not normally appear alone unless they are being described, but are combined with a preceding base character. More than one combining character may be associated with the same base character. Many combining characters appear above, below, or inside the base character. Some, however, consume space along the baseline, either before or after the base character, and are referred to as spacing marks, or spacing combining characters.
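One way to check whether a character is a combining character is to look at its Unicode general category. The sketch below uses Python's standard `unicodedata` module; the helper name `describe` and the choice of sample characters are assumptions made for illustration. Category `Mn` marks a nonspacing mark and `Mc` a spacing combining mark.

```python
import unicodedata

def describe(ch):
    """Return (name, general category, canonical combining class) for one character."""
    return unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch)

samples = [
    "\u0301",  # COMBINING ACUTE ACCENT, drawn above the base character
    "\u05B4",  # HEBREW POINT HIRIQ, drawn below the base character
    "\u093F",  # DEVANAGARI VOWEL SIGN I, a spacing mark on the baseline
]

for ch in samples:
    print(describe(ch))
```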
conjunct
In Brahmi-derived scripts it is common to find consonant clusters, i.e. sequences of consonants without intervening vowels. The absence of intervening vowels is often indicated visually by merging or changing the glyphs for the sequence in some way. The resulting form is referred to as a conjunct. Examples include कक → क्क, or কষ → ক্ষ. Conjunct behaviour is triggered in Unicode by adding a virama between the consonants.
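The two examples can be reproduced at the code point level as follows. This is a minimal sketch: the function `conjunct` and the constant names are invented for illustration, and whether a conjunct glyph actually appears depends on the font and rendering engine, not on this code.

```python
VIRAMA_DEVANAGARI = "\u094D"  # DEVANAGARI SIGN VIRAMA
VIRAMA_BENGALI = "\u09CD"     # BENGALI SIGN VIRAMA

def conjunct(first, second, virama):
    """Place a virama between two consonants so a renderer may form a conjunct."""
    return first + virama + second

kka = conjunct("\u0915", "\u0915", VIRAMA_DEVANAGARI)  # KA + virama + KA
kssa = conjunct("\u0995", "\u09B7", VIRAMA_BENGALI)    # KA + virama + SSA

# Each conjunct is stored as three code points, however it is drawn.
print(kka, len(kka))
print(kssa, len(kssa))
```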
dependent vowel
A symbol or sign that represents a vowel and that is attached to or combined with another symbol, usually one that represents a consonant. In Semitic and Indic writing systems, vowels are normally represented by dependent vowel-signs. Dependent vowels are usually combining characters, but may also be standalone (e.g. in Thai, or New Tai Lue, which has no combining characters).
grapheme
A grapheme is a user-perceived unit of text. Graphemes generally include base characters with their combining diacritics. In some orthographies they may also include groupings of two or more base characters, with their diacritics, such as for Hindi, where the following 6 characters resolve to just 2 text units: हिन्दी → हि+न्दी.
The Unicode Standard uses generalised rules to define 'grapheme clusters', which approximate the likely grapheme boundaries in a writing system. Grapheme clusters are used as a basic unit of text for operations including forwards/backwards deletion, cursor movement & selection, character counts, searching & matching, text insertion, line-breaking, justification, case conversions, and sorting. The grapheme cluster definition may need to be tailored for some orthographies, for example so that a conjunct such as न्दी in an Indic script is treated as a single unit.
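The effect of tailoring can be sketched with a deliberately simplified segmenter. This is not the Unicode grapheme cluster algorithm: it only attaches combining marks to the preceding character and, optionally, keeps a consonant attached when it follows a virama. The names `simple_clusters`, `join_virama` and `VIRAMAS` are assumptions made for this example.

```python
import unicodedata

VIRAMAS = {"\u094D"}  # DEVANAGARI SIGN VIRAMA; other scripts would need their own

def simple_clusters(text, join_virama=False):
    """Split text into rough clusters: a base character plus its combining marks.

    With join_virama=True, a character that follows a virama stays in the
    same cluster, which keeps conjuncts together.
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        after_virama = join_virama and clusters and clusters[-1][-1] in VIRAMAS
        if clusters and (is_mark or after_virama):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

hindi = "\u0939\u093F\u0928\u094D\u0926\u0940"  # the six code points of the Hindi example

print(len(hindi))                                # 6
print(simple_clusters(hindi))                    # 3 clusters without tailoring
print(simple_clusters(hindi, join_virama=True))  # 2 clusters with tailoring
```

Production code would normally rely on a library that implements the full segmentation rules rather than a hand-written loop like this one.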
independent vowel
In Indic scripts, certain vowels are depicted using independent letter symbols that stand on their own. This is often the case when a word starts with a vowel or consists of only a vowel.
inherent vowel
In writing systems based on a script in the Brahmi family of Indic scripts, a consonant letter symbol normally has an inherent vowel, unless otherwise indicated. The phonetic value of this vowel differs among the various languages written with these writing systems. An inherent vowel is overridden either by indicating another vowel with an explicit vowel sign or by using virama to create a dead consonant.
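The three cases can be seen in the underlying code points. The sketch below uses the Devanagari letter KA; the dictionary `forms` and its romanised keys are illustrative only, and the keys assume the inherent vowel is read as "a".

```python
import unicodedata

KA = "\u0915"  # DEVANAGARI LETTER KA

forms = {
    "ka": KA,             # bare consonant, carrying the inherent vowel
    "ki": KA + "\u093F",  # inherent vowel overridden by VOWEL SIGN I
    "k": KA + "\u094D",   # inherent vowel removed by VIRAMA (dead consonant)
}

for reading, text in forms.items():
    names = [unicodedata.name(ch) for ch in text]
    print(reading, text, names)
```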
normalisation
Normalisation is the process of removing alternate representations of equivalent sequences from textual data, to convert the data into a form that can be binary-compared for equivalence.
normalization form C (NFC)
See also normalisation. A normalization form that erases any canonical differences, and generally produces a composed result. For example, a + umlaut is converted to ä in this form. This form most closely matches legacy usage. The formal definition is D120 in Section 3.11, Normalization Forms.
normalization form D (NFD)
See also normalisation. A normalization form that erases any canonical differences, and produces a decomposed result. For example, ä is converted to a + umlaut in this form. This form is most often used in internal processing, such as in collation. The formal definition is D118 in Section 3.11, Normalization Forms.
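Both forms can be tried out with `unicodedata.normalize` from Python's standard library. In this sketch the "umlaut" is U+0308 COMBINING DIAERESIS, and the helper `equivalent` is a hypothetical convenience function, not part of any library.

```python
import unicodedata

composed = "\u00E4"     # ä as a single precomposed code point
decomposed = "a\u0308"  # a followed by COMBINING DIAERESIS

# The two strings look the same but do not compare equal as stored.
print(composed == decomposed)

nfc = unicodedata.normalize("NFC", decomposed)  # compose where possible
nfd = unicodedata.normalize("NFD", composed)    # fully decompose

def equivalent(a, b):
    """Compare two strings after bringing both into the same normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(len(nfc), len(nfd))
print(equivalent(composed, decomposed))
```

Either form works for the comparison, as long as both strings are normalized to the same one.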
spacing marks
Combining characters that consume space along the baseline, either before or after the base character.
transliteration
In a transliteration, each native character is associated with an equivalent and unique Latin-script character. The transliteration may not accurately represent pronunciation, but does allow straightforward and reversible conversion between the two scripts. Compare with transcription.
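The reversibility follows from the mapping being one-to-one, which the sketch below demonstrates with a toy table. The table covers four Greek letters only and is not taken from any transliteration standard; `TO_LATIN`, `TO_NATIVE`, `transliterate` and the sample string are all invented for illustration.

```python
# Toy one-to-one table: each native letter maps to a unique Latin letter.
TO_LATIN = {"α": "a", "β": "b", "γ": "g", "δ": "d"}

# Because no two native letters share a Latin letter, the table can be inverted.
TO_NATIVE = {latin: native for native, latin in TO_LATIN.items()}

def transliterate(text, table):
    """Map each character through the table, leaving unknown characters unchanged."""
    return "".join(table.get(ch, ch) for ch in text)

sample = "βαδα"  # an arbitrary letter sequence, not a real word
latin = transliterate(sample, TO_LATIN)
restored = transliterate(latin, TO_NATIVE)

print(latin)
print(restored == sample)
```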
transcription
A transcription is likely to be more phonetically accurate than a transliteration, though it usually still only approximates the actual sound. In particular, it does not usually allow completely reversible conversions.
vowel-sign
A glyph representing a vowel, used in Indic scripts to override the inherent vowel. Sometimes vowel-signs have multiple parts, which are displayed on different sides of the base consonant or cluster. In other cases, multiple vowel-signs are attached to a consonant to produce a particular vowel sound. Vowel-signs are usually combining characters, but may sometimes be a combination of combining characters and free-standing characters (e.g. Thai, Lao), or just free-standing characters (e.g. New Tai Lue).
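A multi-part vowel-sign can be inspected through its canonical decomposition. The sketch below uses BENGALI VOWEL SIGN O as the example; the variable names are illustrative. Its two parts are normally rendered on either side of the base consonant.

```python
import unicodedata

vowel_sign_o = "\u09CB"  # BENGALI VOWEL SIGN O, a single code point

# NFD splits the sign into its two canonical parts.
parts = unicodedata.normalize("NFD", vowel_sign_o)
for ch in parts:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# A syllable made of the consonant KA plus the two-part vowel-sign.
syllable = "\u0995" + vowel_sign_o
print(syllable, len(syllable))
```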