Script glossary

A

abugida: A writing system in which consonant letters have an inherent vowel. Other post-consonant vowel sounds are indicated by associating one or more letters or marks with the consonant; these override the inherent vowel and are often referred to as vowel signs. Most abugidas derive from the ancient Brahmi script. The term “abugida” is derived from the first four letters of the Ethiopic script in the Semitic order: alf, bet, gaml, dant. (See also Section 6.1, Writing Systems. Note: In resources on the r12a site, modern Ethiopic is regarded as a featural syllabary, rather than an abugida, since it doesn't require a sequence of code points to represent a CV syllable.)
abjad: A writing system in which consonant sounds are written but short vowel sounds are usually not. Usually, diacritics can be used to represent the missing vowels in order to disambiguate words or help in education, but they are not used in normal text. Abjads include the Arabic and Hebrew scripts. The term “abjad” is derived from the first four letters of the traditional order of the Arabic script: alef, beh, jeem, dal. (See also Section 6.1, Writing Systems.)
alphabet: A writing system in which both consonants and vowels are indicated. Vowels may be indicated using dedicated letters or combining marks. For example, several orthographies using the Arabic script always show all diacritics, making them alphabetic in nature. The term “alphabet” is derived from the first two letters of the Greek script: alpha, beta. (See also Section 6.1, Writing Systems.)

B

BCCS (base & combining character sequence)

A sequence of characters following the pattern Base (Combining_mark | ZWJ | ZWNJ)*. That is, a base character that is a letter or digit, followed by zero or more combining characters, zero width joiners, and/or zero width non-joiners. This commonly reflects the minimal typographic unit used for operating on text.
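
As a rough illustration of the pattern above, the following Python sketch groups a string into BCCS-like units using only the standard unicodedata module. The function names are invented for this example. Treating every character whose General_Category begins with M as a combining mark is a simplifying assumption, and the sketch does not check that the base is a letter or digit.

```python
import unicodedata

ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

def is_extender(ch):
    # Assumption: combining marks are the characters in categories Mn, Mc, Me.
    return unicodedata.category(ch).startswith("M") or ch in (ZWJ, ZWNJ)

def split_bccs(text):
    """Group text into units matching Base (Combining_mark | ZWJ | ZWNJ)*."""
    units = []
    for ch in text:
        if units and is_extender(ch):
            units[-1] += ch      # attach the mark or joiner to the current base
        else:
            units.append(ch)     # any other character starts a new unit
    return units

hindi = "\u0939\u093F\u0928\u094D\u0926\u0940"  # हिन्दी
print(split_bccs(hindi))  # three units: हि, न्, दी
```

Note that the conjunct न्दी is split after the virama here, which is one reason a BCCS is described as a minimal unit rather than a full orthographic syllable.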

brahmi phonetic matrix

It is common to arrange consonants in writing systems descended from the archaic Brahmi script in tabular form. The first 5 rows of the matrix normally represent plosives and nasals, and are ordered by place of articulation, as shown in this table.

velar	k	kʰ	g	gʰ	ŋ
palatal	ʧ	ʧʰ	ʤ	ʤʰ	ɲ
retroflex	ʈ	ʈʰ	ɖ	ɖʰ	ɳ
dental	t	tʰ	d	dʰ	n
bilabial	p	pʰ	b	bʰ	m

The remaining two rows contain liquids and fricatives.

C

checked syllable: A syllable that ends in a stop sound, usually -p, -t, -k, or -ʔ. Unchecked syllables end in one of the sonorants -m, -n, -ŋ, -w, -j, or vowels.
circumgraph: A single vowel code point that produces glyphs on more than one side of its consonant base. (Vowels that are written using multiple code points are composite vowels, rather than circumgraphs.)
coda (syllable): See syllable structure.
combining mark / combining character: Unicode characters such as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras. They do not normally appear alone unless they are being described, but are combined with a preceding base character. More than one combining mark may be associated with the same base character. Many combining marks appear above, below, or inside the base character; however, some consume space along the baseline, either before or after the base character, and are referred to as spacing marks, or spacing combining characters.
combining character sequence (CCS): Unicode definition: A maximal sequence of characters following the pattern Base? (Combining_mark | ZWJ | ZWNJ)+. Usually a base character that is a letter or digit, followed by one or more combining characters, zero width joiners, and/or zero width non-joiners.
composite vowel: A single vowel sound or diphthong that is represented by more than one code point from the available set of vowel marks, repurposed consonants, and diacritics.
conjunct: A way of indicating consonant clusters, common in Brahmi-derived scripts, by visually merging or changing the glyphs for the sequence in some way. Conjunct behaviour is generally triggered in Unicode-encoded text by adding a virama character between the consonant code points. The virama is not normally visible if a conjunct is formed.
consonant cluster: A sequence of consonants with no intervening vowels. See also conjunct.
contour tone language: A contour tone language has patterns where the pitch moves up and down over the course of a syllable. Compare with register tone language.
cursive: In this context, this is applied to scripts where letters are typically joined at the baseline (although some scripts have a few letters that only join on one side). Usually the font needs to support differences in glyph shape for the various joining contexts, which range from slight to radically different. Cursive scripts include Adlam, Arabic, Hanifi Rohingya, Mongolian, N'Ko, and Syriac. Letters in other scripts may also join, often at a hanging baseline, but they are not usually referred to as 'cursive', eg. Devanagari, Bengali, Gurmukhi, etc.
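
The distinction drawn in the combining mark entry above, between marks that sit above, below or inside the base and marks that consume space along the baseline, can be inspected through the Unicode General_Category. The sketch below is a minimal illustration using Python's unicodedata module. The function name and the wording of the labels are choices made for this example.

```python
import unicodedata

# General_Category values for combining marks, with informal labels.
MARK_KINDS = {
    "Mn": "nonspacing mark",
    "Mc": "spacing mark",
    "Me": "enclosing mark",
}

def mark_kind(ch):
    """Return an informal label for a single character's mark type."""
    return MARK_KINDS.get(unicodedata.category(ch), "not a combining mark")

print(mark_kind("\u0301"))  # COMBINING ACUTE ACCENT
print(mark_kind("\u093F"))  # DEVANAGARI VOWEL SIGN I
print(mark_kind("a"))       # LATIN SMALL LETTER A
```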

D

decomposed text: Decomposed text is usually the result of applying Unicode normalization form D (NFD), which splits Unicode characters into component parts, typically a base character plus one or more diacritics. However, a decomposed sequence of code points may also be intentionally (or unintentionally) used by a content author where a precomposed alternative exists.
dependent vowel: Vowel_Dependent is one of the categories in the Indic_Syllabic_Category property set. The Unicode Standard definition says: A symbol or sign that represents a vowel and that is attached or combined with another symbol, usually one that represents a consonant. Dependent vowels are usually combining marks, but may also be letters (eg. in Thai, or New Tai Lue, which has no combining characters).

G

geumchik rules: A set of rules in Korean that determine which punctuation marks can appear at the end or the start of a line, when line-breaking occurs. (See also kinsoku rules and jìnzé rules.)
grapheme: A grapheme is a user-perceived unit of text. Graphemes generally include base characters with their combining diacritics. In some orthographies they may also include groupings of two or more base characters, with their diacritics, such as for Hindi, where the following 6 characters resolve to just 2 text units: हिन्दी → हि+न्दी.
grapheme cluster: The Unicode Standard uses generalised rules to define 'grapheme clusters', which approximate the likely grapheme boundaries in a writing system. Grapheme clusters are used as a basic unit of text for operations including forwards/backwards deletion, cursor movement & selection, character counts, searching & matching, text insertion, line-breaking, justification, case conversions, and sorting. The grapheme cluster definition may need to be tailored for some orthographies, such as conjuncts like न्दी in Indic scripts.
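
To make the Hindi example in the grapheme entry concrete, the sketch below groups हिन्दी into the two text units हि and न्दी. It is a simplified stand-in for tailored segmentation, not the Unicode grapheme cluster algorithm. The function names are invented for this example, and detecting a virama by its canonical combining class of 9 is a heuristic assumed here for Devanagari.

```python
import unicodedata

def is_mark(ch):
    # Combining marks (Mn, Mc, Me) plus ZWNJ and ZWJ extend the current unit.
    return unicodedata.category(ch).startswith("M") or ch in "\u200C\u200D"

def ends_with_virama(unit):
    # Heuristic assumption: viramas have canonical combining class 9.
    return unicodedata.combining(unit[-1]) == 9

def tailored_units(text):
    """Group text so that a consonant following a virama stays in the same unit."""
    units = []
    for ch in text:
        if units and (is_mark(ch) or ends_with_virama(units[-1])):
            units[-1] += ch
        else:
            units.append(ch)
    return units

hindi = "\u0939\u093F\u0928\u094D\u0926\u0940"  # हिन्दी
print(len(hindi))             # number of code points
print(tailored_units(hindi))  # two units: हि and न्दी
```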

I

ideograph: (1) Any symbol that primarily denotes an idea or concept in contrast to a sound or pronunciation; for example, ♻, which denotes the concept of recycling by a series of bent arrows. (2) A generic term for the unit of writing of a logosyllabic writing system. In this sense, ideograph (or ideogram) is not systematically distinguished from logograph (or logogram). (3) A term commonly used to refer specifically to Han characters, equivalent to the Chinese, Japanese, or Korean terms also sometimes used: hànzì, kanji, or hanja.
ijam: A diacritic in the Arabic script that is considered to be an integral part of a basic letter form, such as the dots in ث [U+062B ARABIC LETTER THEH], pronounced θ. Unicode encodes letter+ijam combinations as atomic characters which are never given equivalent decompositions in the standard. Ijam generally take the form of one-, two-, three- or four-dot markings above or below the basic letter skeleton, although other diacritic forms occur, especially in extensions of the Arabic script in Central and South Asia and in Africa. For example, ۈ [U+06C8 ARABIC LETTER YU] shows a letter with ijam that represents the vowel y in the Uighur orthography. See Chapter 9 of the Unicode Standard. Compare with tashkil. See also the section Ijam, tashkil, hamza in Arabic script homographs.
independent vowel: Independent letters are used to represent zero-onset vowel sounds. They are typically found in Brahmi-derived Indic scripts, at the beginning of a word or after a word-internal vowel.
inherent vowel: A vowel sound that is automatically pronounced after a consonant letter, unless suppressed by either (a) indicating another vowel, (b) using a character specifically designed to kill the vowel sound, or (c) contextual rules. Inherent vowels are commonly found in Brahmi-derived Indic scripts, and are a defining feature of an abugida. The sound of the inherent vowel varies by language.
invisible stacker: A character that is always invisible, and that is used between each consonant in a consonant cluster to create a stacked arrangement. Used for scripts such as Myanmar, Khmer, Tai Tham, Sundanese, etc. See also virama, and pure killer.
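
The ijam entry above notes that letter+ijam combinations are atomic and have no equivalent decompositions, whereas tashkil are always separate combining marks. The following sketch checks this for the characters cited in those entries, using Python's unicodedata module. The helper name is invented for this example, and the results depend on the Unicode data shipped with the Python version in use.

```python
import unicodedata

def has_canonical_decomposition(ch):
    # decomposition() returns "" when there is no mapping; compatibility
    # mappings start with a tag such as "<isolated>" and are ignored here.
    mapping = unicodedata.decomposition(ch)
    return bool(mapping) and not mapping.startswith("<")

theh = "\u062B"            # ARABIC LETTER THEH (letter with ijam)
yu = "\u06C8"              # ARABIC LETTER YU (letter with ijam)
a_umlaut = "\u00E4"        # LATIN SMALL LETTER A WITH DIAERESIS, for contrast
theh_fatha = "\u062B\u064E"  # THEH + FATHA (letter plus tashkil)

print(has_canonical_decomposition(theh))      # ijam is part of the letter
print(has_canonical_decomposition(a_umlaut))  # base letter plus diacritic
# NFC cannot compose letter + tashkil, so the sequence keeps both code points.
print(len(unicodedata.normalize("NFC", theh_fatha)))
```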

M

mater lectionis: In the spelling of Arabic, Hebrew, and other Semitic languages, mater lectionis refers to a consonant that is used to indicate the location or length of a vowel. See also Wikipedia.
medial consonant: A medial consonant appears after the first consonant in a syllable onset and before the vowel nucleus, and is often one of -w, -r, or -l. Sometimes an onset may have multiple medials, such as -rj. Some scripts have dedicated combining marks for medial consonants.
moraic syllabary: In a moraic syllabary long vowels and syllables with a final consonant are written with two symbols. For example, the Japanese word for Japan (日本) contains 2 syllables, ni-hon, but 3 morae, ni-ho-n. Similarly, the name of the city of Osaka has 3 syllables, oo-sa-ka, but 4 morae, o-o-sa-ka. The spelling of these words in the hiragana syllabic script has symbols for each mora, ie. にほん and おおさか. Morae tend to be counted as units in poetry, such as haiku, and in some languages play a role in allocation of tone.
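
The mora counts given in the moraic syllabary entry can be reproduced by counting hiragana symbols, as in the sketch below. The function name is invented for this example. As an added assumption not stated in the entry, the small kana ゃ, ゅ and ょ are treated as part of the preceding mora; other special cases are not handled.

```python
# Small kana that combine with the preceding symbol into a single mora
# (an assumption added for this sketch).
SMALL_Y_KANA = "ゃゅょ"

def count_morae(hiragana):
    """Count morae in a hiragana string, one per symbol except small ya/yu/yo."""
    return sum(1 for ch in hiragana if ch not in SMALL_Y_KANA)

print(count_morae("にほん"))    # ni-ho-n
print(count_morae("おおさか"))  # o-o-sa-ka
```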

N

normalisation: Normalisation is the process of removing alternate representations of equivalent sequences from textual data, to convert the data into a form that can be binary-compared for equivalence.
normalization form C (NFC): See also normalisation. A normalization form that erases any canonical differences, and generally produces a composed result. For example, a + umlaut is converted to ä in this form. This form most closely matches legacy usage. The formal definition is D120 in Section 3.11, Normalization Forms.
normalization form D (NFD): See also normalisation. A normalization form that erases any canonical differences, and produces a decomposed result. For example, ä is converted to a + umlaut in this form. This form is most often used in internal processing, such as in collation. The formal definition is D118 in Section 3.11, Normalization Forms.
nucleus (syllable): See syllable structure.
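
The NFC and NFD entries above can be tried out with Python's standard unicodedata module. This is a minimal sketch using the ä example from those entries. The helper name equivalent is invented for this illustration, and comparing in NFC rather than NFD is an arbitrary choice.

```python
import unicodedata

composed = "\u00E4"     # ä as a single precomposed code point
decomposed = "a\u0308"  # a followed by COMBINING DIAERESIS

# The two strings look the same but do not compare equal as raw code points.
print(composed == decomposed)

# NFC produces the composed form; NFD produces the decomposed form.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)

def equivalent(a, b):
    """Compare two strings after normalising both to the same form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(equivalent(composed, decomposed))
```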

O

onset (syllable): See syllable structure.

orthographic syllable: A typographic unit that includes one or more grapheme clusters. Orthographic syllables are important typographic units for Brahmi-derived scripts where consonant clusters are formed into stacks or other types of conjunct that should not be split by various text operations. The sequence of characters in an orthographic syllable is typically different from that in a phonetic syllable (see the example), and may begin with the coda of the preceding phonetic syllable. In scripts with no inter-word boundaries an orthographic syllable may span word boundaries (eg. Javanese).

P

prebase: A pre-base (or prescript) vowel glyph is displayed before the consonant or orthographic syllable after which it is pronounced. If the vowel character is a combining mark, it is still typed and stored in pronunciation order, and the application will render it in the correct location. In some scripts, such as Thai, a pre-base vowel glyph is represented by a normal letter, which is typed and stored in the correct position relative to the base.
precomposed text/characters: A precomposed character is one that can also be broken down into separate code points representing its component parts (decomposition). Typically this will include base characters with diacritics, such as accented Latin characters, or Indic characters with nuktas. Normalisation Form C (NFC) produces precomposed characters from many decomposed sequences.
pure killer: A character that kills the inherent vowel. It is always visible, and doesn't produce stacking or other conjunct behaviours. Used for scripts such as Myanmar, Tibetan, Sundanese, Batak, etc. See also virama, and invisible stacker.
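
The two storage orders described in the prebase entry can be seen by listing code points. In the sketch below, the Devanagari syllable कि and the Thai syllable เก were picked as examples for this illustration, and the function name is invented. In both cases the vowel glyph is displayed before the consonant, but only in Thai is it stored first.

```python
import unicodedata

def describe(text):
    """List each code point with its General_Category, in storage order."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]

devanagari_ki = "\u0915\u093F"  # कि: consonant KA, then combining vowel sign I
thai_ke = "\u0E40\u0E01"        # เก: vowel letter SARA E, then consonant KO KAI

print(describe(devanagari_ki))  # the vowel sign is a combining mark stored second
print(describe(thai_ke))        # the vowel is an ordinary letter stored first
```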

S

shaping: Making context-sensitive changes to glyph shapes. Shaping may or may not occur at the same time as context-sensitive positioning of glyphs (such as higher diacritics over tall base characters).
spacing mark: Combining characters that consume space along the baseline, either before or after the base character.
standalone vowel: See zero-onset vowel.
syllabary: A type of writing system in which each symbol typically represents both a consonant and a vowel, or in some instances more than one consonant and a vowel. Usually there is also a set of symbols that represent zero-onset vowel sounds.
syllable structure: A model of syllable structure divides the syllable into an onset followed by a rhyme. The rhyme is typically composed of a nucleus and an optional coda. The nucleus is the most sonorous part of the syllable. A syllable always has a nucleus, but syllables may have no onset and/or coda (eg. compare 'but', 'an', 'the', 'a').

T

tashkil: An Arabic script mark that indicates vocalization of text or other types of phonetic guide which indicate pronunciation, such as in ثَ [U+062B ARABIC LETTER THEH + U+064E ARABIC FATHA], pronounced θa. These include several subtypes: harakat (short vowel marks), tanwin (postnasalized or long vowel marks), shaddah (consonant gemination mark), and sukun (to mark lack of a following vowel). A basic Arabic letter plus any of these types of marks is never encoded as an atomic, precomposed character, but must always be represented as a sequence of letter plus separate combining mark. For example, هٰ [U+0647 ARABIC LETTER HEH + U+0670 ARABIC LETTER SUPERSCRIPT ALEF] pronounced ha, is an example of a letter plus tashkil combination in Arabic (cf. the use of that diacritic in a precomposed Uighur letter). See Chapter 9 of the Unicode Standard. Compare with ijam. See also the section Ijam, tashkil, hamza in Arabic script homographs.
tone: Variations in pitch used to convey lexical contrast. Tonal systems usually distinguish 2-5 levels (register tone languages), and many supplement those with combinations of tone on a given syllable (contour tone languages). Tone mobility is also common, whereby the tone of a particular syllable can affect or create the tones of syllables alongside it.
transcription: A transcription is likely to be more phonetically accurate than a transliteration (though usually still only reflects an approximation to the actual sound), and, in particular, does not usually allow completely reversible conversions.
transliteration: In a transliteration each native character is associated with an equivalent and unique Latin-script character. The transliteration may not accurately represent pronunciation, but does allow straightforward and reversible conversion between the two scripts. Compare with transcription.
typographic unit: A unit of text that is not normally split during a particular text operation. The actual constituents of a typographic unit vary according to the process being applied to the text. For example, in complex scripts cursor movement may use smaller typographic units than line-breaking, for the same text. See also BCCS.
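
The reversibility described in the transliteration entry follows from each native character having a unique Latin-script equivalent. The sketch below shows this with a toy table for four Greek letters. The table is hypothetical and does not follow any published transliteration scheme, and the names are invented for this example.

```python
# Hypothetical one-to-one table: each native letter maps to a unique Latin letter.
TO_LATIN = {"α": "a", "β": "b", "γ": "g", "δ": "d"}
# Because the mapping is one-to-one, it can be inverted without loss.
TO_NATIVE = {latin: native for native, latin in TO_LATIN.items()}

def transliterate(text, table):
    """Replace each character using the table, leaving unknown characters as-is."""
    return "".join(table.get(ch, ch) for ch in text)

latin = transliterate("βαγδα", TO_LATIN)
print(latin)                            # Latin-script form
print(transliterate(latin, TO_NATIVE))  # converts back to the original
```

A transcription, by contrast, may map several native spellings to the same Latin string, in which case no inverse table of this kind exists.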

V

virama: From the Unicode Standard: From Sanskrit. The name of a sign used in many Indic and other Brahmi-derived scripts to suppress the inherent vowel of the consonant to which it is applied, thereby generating a dead consonant. (See Section 12.1, Devanagari.) The sign varies in shape from script to script, and may be known by other names in various languages. It may also be visible or hidden in consonant clusters, depending on the language and context. Used for scripts such as Devanagari, Bengali, Tamil, Balinese, etc. See also invisible stacker, and pure killer.
vocalic: Vocalics are letters derived from Sanskrit that generally behave like vowels, but represent r or l followed by a vowel. They are often available both as vowel-signs and independent vowel letters.
vowel sign: Glyphs which represent vowel sounds in Brahmi-derived scripts. One or more vowel signs overwrite the inherent vowel. Sometimes vowel signs have multiple parts, which may be displayed on different sides of the base consonant or cluster. Vowel signs are usually combining characters, but may sometimes be a combination of combining marks and free-standing letters (eg. Thai, Lao), or just free-standing letters (eg. New Tai Lue). Vowel signs are typically attached to the onset of an orthographic syllable, rather than to a particular consonant, even when the syllable begins with a consonant cluster.
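
As the conjunct and virama entries describe, conjunct behaviour is generally triggered by placing a virama between the consonant code points. The following sketch builds such a sequence for Devanagari. The function name is invented for this example, and whether a conjunct is actually displayed depends on the font and rendering system.

```python
DEVANAGARI_VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA

def conjunct(*consonants):
    """Join consonant letters with a virama between each adjacent pair."""
    return DEVANAGARI_VIRAMA.join(consonants)

ksha = conjunct("\u0915", "\u0937")  # KA + virama + SSA, usually rendered as क्ष
print(ksha)
print([f"U+{ord(ch):04X}" for ch in ksha])
```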
