Updated Tue 19 May 2015 • tags tibetan, scriptnotes
In these notes I synthesize information from various sources, encountered as I explore the Tibetan script as used for Tibetan. They may be updated from time to time and should not be considered authoritative.
The page contains brief notes on general script features. See also the companion document, Tibetan Character Notes, which describes the characters used in Tibetan script one by one.
For the most part I am simplifying, combining, streamlining and arranging the text from the sources listed at the bottom of the page. See those links for more information, especially about the history and phonology of the Tibetan script.
You can obtain fonts for this page free from the Web. For this page I used Tibetan Machine Uni, which is downloaded with this page as a webfont.
Tibetan is an abugida, ie. consonants carry an inherent vowel sound a that is overridden using vowel signs. Text runs from left to right.
There are various different Tibetan scripts, of two basic types: དབུ་ཅན་ dbu-can, pronounced uchen (with a head), and དབུ་མེད་ dbu-med, pronounced ume (headless). This page concentrates on the former. Pronunciations are based on the central, Lhasa dialect.
Traditional Tibetan text was written on pechas (dpe-cha དཔེ་ཆ་), loose-leaf sheets. Some of the characters used and formatting approaches are different in books and pechas.
Example of Tibetan:
འགྲོ་བ་མིའི་རིགས་རྒྱུད་ཡོངས་ལ་སྐྱེས་ཙམ་ཉིད་ནས་ཆེ་མཐོངས་དང༌། ཐོབ་ཐངགི་རང་དབང་འདྲ་མཉམ་དུ་ཡོད་ལ། ཁོང་ཚོར་རང་བྱུང་གི་བློ་རྩལ་དང་བསམ་ཚུལ་བཟང་པོ་འདོན་པའི་འོས་བབས་ཀྱང་ཡོད། དེ་བཞིན་ཕན་ཚུན་གཅིག་གིས་གཅིག་ལ་བུ་སྤུན་གྱི་འདུ་ཤེས་འཛིན་པའི་བྱ་སྤྱོད་ཀྱང་ལག་ལེན་བསྟར་དགོས་པ་ཡིན༎
Native Tibetan words use 30 consonants, but the Tibetan block contains many more. Many of the extra consonants (and other characters) are used for transliteration of other languages, principally Sanskrit and Chinese. These include the retroflex and voiced aspirated consonants. A couple of characters are extensions for Balti.
The pronunciation of Tibetan words is typically much simpler than the orthography, which involves patterns of consonants. These reduce ambiguity and can affect pronunciation and tone.
The primary consonant is called the root consonant (or radical), and the other consonants in the syllable (which normally has up to 6 consonants in total) annotate or modify it. The following rules help identify the root:
A consonant with a vowel is always the root, unless it is the phrase connector འི; letters with superscripts or subscripts are also root consonants.
In a 2-consonant syllable with no vowel, the first consonant is always the root.
In a 3-consonant syllable where the last consonant is not ས, the second consonant is likely to be the root.
In a 4-consonant syllable, the second consonant is always the root.
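As a rough illustration, the rules above can be sketched in code. This is only a heuristic, not an authoritative parser: the function name root_unit and the idea of splitting a syllable into 'units' (a full-form consonant plus any subjoined consonants and vowel signs attached to it) are my own assumptions. The function returns the unit that contains the root, and does not try to say which member of a stack is the root.

```python
import re

# A "unit" (hypothetical term): one full-form consonant, then any subjoined
# consonants, then any vowel signs.
UNIT = re.compile("[\u0f40-\u0f6c][\u0f90-\u0fbc]*[\u0f71-\u0f7d\u0f80]*")
PHRASE_CONNECTOR = "\u0f60\u0f72"  # the sequence 'a-chung + vowel sign i

def root_unit(syllable):
    """Return the unit holding the root consonant, or None if undecided."""
    units = [m.group() for m in UNIT.finditer(syllable)]
    # Rule 1: a unit with a vowel sign or a stack holds the root,
    # unless it is the phrase connector.
    for unit in units:
        if unit != PHRASE_CONNECTOR and len(unit) > 1:
            return unit
    if len(units) == 1:
        return units[0]  # trivial case, not listed in the rules above
    if len(units) == 2:
        return units[0]  # Rule 2
    if len(units) == 3 and units[-1] != "\u0f66":
        return units[1]  # Rule 3 ("likely", so still only a guess)
    if len(units) == 4:
        return units[1]  # Rule 4
    return None
```

For example, root_unit applied to the three consonants of bsad picks the second one, and for bdun it picks the consonant carrying the vowel sign. A 3-consonant syllable ending in ས is left undecided, since the rules above do not settle that case.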
The following diagram shows characters in all of the syllabic positions, and lists the characters that can appear in each of the non-root locations. The word is འགྲེམས་སྟོན་ 'grems-ston ɖɹem-ton (exhibition).
Prefixes. Characters in the prefix position are not pronounced, but de-aspirate aspirated root characters and give a higher tone value to nasal root characters. The consonant ག g may occur before 11 root characters, ད d before 6, བ b before 10, མ m before 11, and འ a before 10, eg. འཁོར་ལོ་ 'khor-lo kor-lo (wheel), བསད་ bsad sɛ́ (killed).
Suffixes. Characters in the suffix position have one of the following effects:
add their own sound ( ག ང བ མ འ ར ) , eg. དག་ dag dag (I).
modify the root's vowel value ( ད ས ), eg. བསད་ bsad sɛ́ (killed).
both of the above ( ན ལ ), eg. བདུན་ bdun dỳn (seven).
Secondary suffixes. Only two characters can appear in the secondary suffix location, according to Tibetan grammar, ས and ད, and the latter is no longer officially found in modern Tibetan. A character in this position adds no sound, nor does it affect the sounds in the rest of the syllable, eg. བསྒྲུབས་ bsgrubs ɖɹúb (established), and གྱུརད་ gyurd kjùr (became).
Superscripts. The three characters that appear in the superscript location raise the tone pitch of the syllable, but are not pronounced themselves. Each superscript character can only be used with a specified set of root characters.
Note that RA has a shape slightly different from its nominal shape in all combinations except རྙ and རླ. You should still use the normal RA character for the superscript. The font will make the needed adjustments of shape.
Subscripts. The four characters that can appear in the subscript location are also each combined with a particular subset of root characters and have different effects.
Note that three of the subscripts have shapes that are significantly different from the nominal shape of the character they represent.
Uniquely, WA can also appear as a sub-subscript as in གྲྭ་ grwa.
Consonant stacking. A standard stack has a standard consonant character at the top (although it may actually be slightly squeezed or adapted in shape), and one or more special subjoined consonant characters beneath it.
The topmost consonant in a stack always uses the standard character from the Unicode Tibetan block regardless of whether it is a root consonant or not, and consonants below it always use a character from the subjoined range.
See this example from the Unicode Standard of the word སྤྱིར་ spyir ʧí (general), which shows a stack with three consonants.
Unlike Indic scripts, there is no virama (or halant) used for Tibetan. Instead, there is a full form and a subjoined form of each consonant. The subjoined forms are combining characters. Avoiding the virama makes sense because the virama is not used by Tibetans, and the approach taken makes it easier to create the large number of stacks contained in Tibetan text.
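To make the full-form/subjoined distinction concrete, here is a minimal sketch that builds a stack from full-form consonants. It relies on the layout of the Tibetan block, where the subjoined letters U+0F90 to U+0FB9 sit exactly 0x50 above the corresponding full-form letters U+0F40 to U+0F69. The function name build_stack is hypothetical, and the fixed-form letters are deliberately excluded because they do not follow this offset.

```python
SUBJOINED_OFFSET = 0x50  # U+0F40 KA -> U+0F90 SUBJOINED KA

def build_stack(consonants):
    """Keep the topmost consonant as is; convert the rest to subjoined forms."""
    for ch in consonants:
        if not 0x0F40 <= ord(ch) <= 0x0F69:
            raise ValueError(f"not a basic full-form consonant: U+{ord(ch):04X}")
    top, *below = consonants
    return top + "".join(chr(ord(c) + SUBJOINED_OFFSET) for c in below)

# SA + PA + YA as a stack, then the vowel sign i, then RA and a tsheg.
spyir = build_stack("\u0f66\u0f54\u0f61") + "\u0f72" + "\u0f62" + "\u0f0b"
print(" ".join(f"U+{ord(c):04X}" for c in spyir))
```

Note that the vowel sign is appended after the whole stack, which matches the typing order described elsewhere on this page.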
Tibetan uses the word 'head' to refer to either the top-most consonant (ie. spatially) or the root consonant of a syllable, which may be a subjoined consonant. We therefore avoid this term here, and say 'root' or 'topmost'.
The following list shows the order in which characters should be typed, and stored in memory, for a set of stacked characters.
Where used, the character U+0F39 TIBETAN MARK TSA-PHRU ༹ occurs immediately after the consonant it modifies
A-chung and a-chen. The phonological realisation for U+0F60 TIBETAN LETTER -A འ (called འ་ཆུང་, 'a-chung) and U+0F68 TIBETAN LETTER A ཨ (called ཨ་ཆེན་, a-chen) is a. In the Lhasa dialect, the former has a high and the latter a low tone.
Both 'a-chung and a-chen can be used with vowel signs, in which case the a sound is replaced by that of the vowel.
'A-chung can also represent a nasal, so མཚམས་ mtshams (boundary) and མཐུན་ mthun (agreement) are often written འཚམས་ and འཐུན་.
'A-chung may also nasalise the juncture of two morphemes, as in དགེ་འདུན་ dge-'dun (Buddhist community), pronounced ɡenyn.
Other than in loanwords, Tibetan only allows diphthongs in diminutive expressions. 'A-chung is used to write these, as in the following: མི་ mi person → མེའུ་ me'u dwarf; རྡོ་ rdo stone → རྡེའུ་ rde'u pebble.
A subjoined 'a-chung is used to express long vowels in loan words (Tibetan doesn't have them natively), such as those borrowed from Chinese, Hindi and Mongolian. For example, ཏཱ་བླ་མ་ tā-bla-ma (grand lama) (ta from Chinese), and ཤྲཱི་ śrī (wealth) from Sanskrit. For this purpose you should use U+0F71 TIBETAN VOWEL SIGN AA ཱ, and not U+0FB0 TIBETAN SUBJOINED LETTER -A ྰ.
The Unicode Standard says of SUBJOINED LETTER -A:
U+0FB0 TIBETAN SUBJOINED LETTER -A (a-chung) should be used only in the very rare cases where a full-sized subjoined a-chung letter is required. The small vowel lengthening a-chung encoded as U+0F71 TIBETAN VOWEL SIGN AA is far more frequently used in Tibetan text, and it is therefore recommended that implementations treat this character (rather than U+0FB0) as the normal subjoined a-chung.
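Given that recommendation, a content checker might simply flag occurrences of U+0FB0 for a human to review, rather than replacing them automatically (the rare full-sized form is sometimes what is wanted). A minimal sketch, where the function name is my own:

```python
SUBJOINED_A = "\u0fb0"    # TIBETAN SUBJOINED LETTER -A (rarely needed)
VOWEL_SIGN_AA = "\u0f71"  # TIBETAN VOWEL SIGN AA (the usual choice)

def find_subjoined_a(text):
    """Return the indices of every U+0FB0 in text, for manual review."""
    return [i for i, ch in enumerate(text) if ch == SUBJOINED_A]
```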
Finally, 'a-chung can be used to disambiguate the location of an inherent vowel in a syllable. The sequence དག་ dag dàg (I) is interpreted as CVC. To express CCV add 'a-chung, eg. དགའ་ dga' gà (virtue).
Irregular pronunciations. Most consonants translate to the same basic sound unless they are modified by surrounding letters as mentioned above. In some cases, however, the pronunciation of a consonant is irregular. In particular, b is sometimes pronounced w, eg. རེ་བ་ re-ba re-wa (hope), དབང་ཆ་ dbang-cha wang-ʧa (power), and some words have an additional nasalisation which is not shown, eg. ད་ལྟ་ da-lta dan-ta (now).
Standard Tibetan has five vowels, for which there are four characters, since one vowel, a, is inherent in the consonant. Non-inherent vowels are indicated by a single mark attached to and typed after a consonant or consonant stack. In the example སྤྱིར་ ʧí (general) the vowel sign that appears above the stack is typed after the three consonants that make up the stack.
In traditional, loose-leaf Tibetan pechas a head mark or yig-mgo (yig go) is used at the beginning of the front of the folio so that you can tell which is the front.
Head marks are also used in both pechas and books to indicate the start of a headline or the start of the first paragraph in a longer text.
Head marks differ from text to text. The Unicode Standard provides a number of characters to give some basic coverage, but may not meet all needs.
A common head mark is U+0F04 TIBETAN MARK INITIAL YIG MGO MDUN MA ༄, and there is also the extension character U+0F05 TIBETAN MARK CLOSING YIG MGO SGAB MA ༅. A head mark can be written alone, or can be followed by as many as three closing marks; head marks are also followed by two shads, eg. ༄༅། །.
Three less common head marks, used in Nyingmapa and Bonpo literature, are also represented in the Tibetan block, namely:
U+0F01 TIBETAN MARK GTER YIG MGO TRUNCATED A ༁
U+0F02 TIBETAN MARK GTER YIG MGO -UM RNAM BCAD MA ༂
U+0F03 TIBETAN MARK GTER YIG MGO -UM GTER TSHEG MA ༃
Many of the characters in the Tibetan block are there for transcribing or transliterating non-Tibetan text. The Tibetan script provides for perfect mappings between Sanskrit and Tibetan, but Tibetan is also used to transliterate other languages, such as Chinese, Mongolian and English.
There are a number of consonants, including a range of aspirated consonants, and the following range of retroflex consonants.
The retroflex consonants, which are reversed versions of Tibetan consonant shapes, are often used to distinguish loan words from sequences of Tibetan syllables. For example, ཁ་ཎ་ཌ་ kha-ṇa-ḍa (Canada), མོ་ཊ་ mo-ṭa (car).
In transliterated text consonants are sometimes stacked in ways that are not allowed in native Tibetan text.
There are also additional vowel signs between U+0F71 and U+0F7D for Sanskrit transcriptions, and several are compound shapes. The component parts of these compounds should normally be typed individually, rather than using the compound codepoints. The table below shows the characters, and indicates those whose use is discouraged and strongly discouraged.
|II||EE||OO||Rev I||V R||V L||II||UU||Rev II||V RR||V LL|
U+0F7F TIBETAN SIGN RNAM BCAD ཿ (nam chay) is the visarga, and U+0F7E TIBETAN SIGN RJES SU NGA RO ཾ (ngaro) is the anusvara.
Compound consonants. The six compound consonants GHA, DDHA, DHA, BHA, DZHA and KSSA in the table above, used to represent the Indic consonants during transliteration, can be created by combining a head consonant with a subjoined HA, but the Unicode Standard recommends that the precomposed characters be used in order to maximise effectiveness of transmission and searching. I have suggested that this recommendation be changed in version 7, since many applications silently normalise text to the decomposed sequence.
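The effect of normalisation on these compound characters can be checked with Python's standard unicodedata module. The compound consonants have canonical decompositions and are excluded from recomposition, so even NFC yields the decomposed sequence; some of the compound vowel signs, such as U+0F73, behave in a similar way under NFD. The helper name codepoints is just for display and is my own.

```python
import unicodedata

def codepoints(text):
    """Format a string as a list of U+XXXX labels for inspection."""
    return [f"U+{ord(c):04X}" for c in text]

GHA = "\u0f43"            # TIBETAN LETTER GHA (precomposed)
VOWEL_SIGN_II = "\u0f73"  # TIBETAN VOWEL SIGN II (compound vowel sign)

# NFC does not keep the precomposed consonant: it comes out as GA + subjoined HA.
print(codepoints(unicodedata.normalize("NFC", GHA)))

# NFD splits the compound vowel sign into AA + I.
print(codepoints(unicodedata.normalize("NFD", VOWEL_SIGN_II)))
```

This is why searching and comparison tend to be more reliable if text is normalised first, whichever form the author typed.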
Fixed form letters. U+0F62 TIBETAN LETTER RA at the top of a stack usually has a reduced form, eg. རྐ rka. For transliterations it is sometimes desirable to retain the full form of RA where in Tibetan words it would be reduced. To do this use U+0F6A TIBETAN LETTER FIXED-FORM RA ཪ instead of the normal RA, but only where the normal RA would not produce the full form anyway, ie. do not use it for combinations such as རྙ rnya, which has the full form already.
There are also fixed form variants of subjoined RA, YA and WA.
Tibetan has its own set of numbers. My Chinese publication, however, uses European digits.
Half-numbers. By some interpretations, each of the following shapes has a value 0.5 less than the digit within which it appears. Used only in some traditional contexts, they appear as the last digit of a multidigit number, eg. ༤༬ represents 42.5. These are very rarely used, however, and other uses have been postulated. For more information see Numbers that Don't Add Up : Tibetan Half Digits, by Andrew West.
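Under that interpretation (and only under that interpretation, since other uses have been postulated), a number string could be evaluated as in the following sketch. The function name is hypothetical. The digits U+0F20 to U+0F29 are taken at face value, and the half digits U+0F2A to U+0F33 are treated as half-one through half-nine followed by half-zero.

```python
def tibetan_number(text):
    """Evaluate Tibetan digits, reading each half digit as 'digit minus 0.5'."""
    total = 0.0
    for ch in text:
        cp = ord(ch)
        if 0x0F20 <= cp <= 0x0F29:    # ordinary digits 0-9
            value = cp - 0x0F20
        elif 0x0F2A <= cp <= 0x0F32:  # half one .. half nine
            value = (cp - 0x0F2A + 1) - 0.5
        elif cp == 0x0F33:            # half zero
            value = -0.5
        else:
            raise ValueError(f"not a Tibetan digit: U+{cp:04X}")
        total = total * 10 + value
    return total

print(tibetan_number("\u0f24\u0f2c"))  # digit four, then half three
```

The example call uses the same two characters as the 42.5 example above.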
In pechas, Tibetan text is written inside a visible box which defines the margin of the page. In more recent publications this box may be invisible. Modern publications also use paragraphs. The initial line of a new paragraph may be indented.
Key divisions of the text are sections (or expressions (brjod-pa)) and topics (don-tshan), which do not necessarily equate to English phrases, sentences and paragraphs. Sections normally end with a shay, U+0F0D TIBETAN MARK SHAD །, followed by a space. Topics (eg. headlines, verses, and longer paragraphs) are often terminated or separated with shay+space+shay.
Unicode provides U+0F0E TIBETAN MARK NYIS SHAD ༎ as a means of regularising the spacing between the two shad marks, which tends to be slightly bigger than a normal space. The space between the shad marks can be stretched during justification, however, and it's not clear to me how that would work when using NYIS SHAD.
A line that ends with the root consonant U+0F40 TIBETAN LETTER KA ཀ or U+0F42 TIBETAN LETTER GA ག will normally swallow up the shay that immediately follows it, even if there is a vowel sign. For example, where you might expect to see a double shay, you might see ཀུ ། and སྐུ །. However, the shad is not omitted if these characters have a subscript, eg. གྲུ། །.
Word boundaries within a section are not indicated. Only 'syllables', known as tsheg-bar tsek bar, are separated by the tsek character, U+0F0B TIBETAN MARK INTER-SYLLABIC TSHEG ་.
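Because the tsek is the only boundary marker, splitting a section into tsheg-bar is straightforward. A minimal sketch (the function name is my own; it also treats the non-breaking U+0F0C as a separator, and leaves any shad attached to the syllable before it):

```python
import re

# U+0F0B INTER-SYLLABIC TSHEG and U+0F0C DELIMITER TSHEG BSTAR.
TSHEG = re.compile("[\u0f0b\u0f0c]")

def syllables(section):
    """Split a section into tsheg-bar, dropping empty pieces."""
    return [s for s in TSHEG.split(section) if s]
```

Note that this yields syllables, not words. Finding word boundaries needs a dictionary or some other linguistic knowledge.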
The tsek is not used before a shay, except after U+0F44 TIBETAN LETTER NGA ང. For example, note the end of the three sections in this example:
Users may use an ordinary TSHEG between NGA and SHAD, but Unicode also provides a special non-breaking character that can be used instead, U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR ༌. The word 'delimiter' in the name is a misnomer.
Whitespace in Tibetan text should use U+00A0 NO-BREAK SPACE. Spaces in Tibetan text are usually wider than spaces in English text, and typically only occur after one of the following: །, ༑, ༔ or ཿ. However, numbers and embedded Western text are surrounded by smaller spaces, eg. ལོ་ ༢༠༠༡ ཤིང་བྱ་ཟླ་ ༩ ཚེས་ ༥ ཉིན་. It looks like this is also something that the application needs to take care of.
Normally, Tibetan only breaks after the tsek, and doesn't break after spaces.
Line breaks do not occur after a tsek when it follows U+0F44 TIBETAN LETTER NGA ང (with or without a vowel sign) and precedes a shay, U+0F0D TIBETAN MARK SHAD །. The Unicode Standard also talks of other instances where Tibetan grammatical rules do not permit a break, but it isn't clear what those are.
If the character after NGA is an ordinary INTER-SYLLABIC TSHEG, then applications need to ensure that lines do not break between the TSHEG and the SHAD. Text is likely to be more portable if content authors use the TSHEG BSTAR in these locations, instead of the normal TSHEG.
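A content author or conversion tool could make that substitution automatically. The sketch below is one possible approach, with a function name of my own: it replaces an ordinary tsheg with TSHEG BSTAR only where it sits between NGA (optionally followed by vowel signs) and a SHAD. It assumes the NGA is not part of a stack.

```python
import re

TSHEG_BSTAR = "\u0f0c"

# NGA, optional vowel signs, an ordinary tsheg, and then (not consumed) a shad.
NGA_TSHEG_SHAD = re.compile("(\u0f44[\u0f71-\u0f7d\u0f80]*)\u0f0b(?=\u0f0d)")

def protect_nga_tsheg(text):
    """Use the non-breaking tsheg between NGA and SHAD."""
    return NGA_TSHEG_SHAD.sub(lambda m: m.group(1) + TSHEG_BSTAR, text)
```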
Line breaks are also possible after:
Tibetan never breaks inside a syllable, and has no hyphenation. If a word is composed of multiple syllables, it is also preferable to avoid breaking a line in the middle of the word.
A line must never start with a shad.
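Pulling the basic rules together, a very simplified break finder might look like the following. This is only a sketch under my own assumptions: it allows a break after an ordinary tsheg, never before a shad, and never after a space. It makes no attempt to keep multi-syllable words together, and it ignores the additional cases mentioned above that I could not pin down.

```python
# Shad-like marks named on this page: SHAD, NYIS SHAD, RIN CHEN SPUNGS SHAD.
SHADS = "\u0f0d\u0f0e\u0f11"
TSHEG = "\u0f0b"

def break_opportunities(text):
    """Return each index i where a line may break before text[i]."""
    result = []
    for i in range(1, len(text)):
        if text[i - 1] == TSHEG and text[i] not in SHADS:
            result.append(i)
    return result
```

Since U+0F0C is not treated as a break opportunity here, text that uses TSHEG BSTAR between NGA and SHAD is handled correctly without any special casing.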
Line breaks and rin chen spungs shad. In Tibetan, especially in pechas, it is considered a special case if the last syllable of an expression that is terminated by a shay breaks onto a new line. In that case the shay or double shay is replaced by rin chen spungs shad, U+0F11 TIBETAN MARK RIN CHEN SPUNGS SHAD ༑. At the end of a topic the rules say that only one shay should be converted, ie. ༑ །; however, it is moderately popular to convert both, ie. ༑ ༑. This change serves as an optical indication that there is a left-over syllable at the beginning of the line that actually belongs to the preceding line.
This varies in the following cases:
In an environment where the width or content of the page can change, this feature poses a problem for the content author. The application needs to be able to automatically switch between the two styles of shad as a syllable moves on or off a new line when the page is resized or when preceding content is modified.
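To give an idea of what such automatic switching involves, here is a minimal sketch that runs after line breaking. It is an illustration under my own assumptions, not a description of any existing implementation: the function name is hypothetical, only the first shay is converted, and none of the exceptions mentioned above are handled. A real application would also need to switch back to a plain shad when the text is reflowed.

```python
import re

RIN_CHEN_SPUNGS_SHAD = "\u0f11"

# A line that starts with exactly one syllable followed directly by a shad.
LEFTOVER = re.compile("^([^\u0f0b\u0f0c\u0f0d\u0f11 \u00a0]+)\u0f0d")

def mark_leftover_syllables(lines):
    """Replace the shad after a lone line-initial syllable with U+0F11."""
    marked = list(lines[:1])  # the first line cannot hold a left-over syllable
    for line in lines[1:]:
        marked.append(
            LEFTOVER.sub(lambda m: m.group(1) + RIN_CHEN_SPUNGS_SHAD, line, count=1)
        )
    return marked
```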
The Unicode Standard adds: "Not only is rin-chen-spungs-shad used as the replacement for the shay but a whole class of “ornamental shays” are used for the same purpose. All are scribal variants on a rin-chen-spungs-shad, which is correctly written with three dots above it."
There appear to be two alternative methods of justification.
Method 1: inter-character spacing. Spacing between all characters should be adapted equally. Note that the width of the white-space character should not be changed significantly, so Tibetan texts use the non-breaking space mentioned above, which doesn't change width on justification.
Method 2: tsek padding. While hand writing, authors add small spaces across the text to get the line end as near as possible to the right margin. Where space remains at the margin, it may be left as is, if it is short. Otherwise, the remaining space will be filled with tseks to make the line as flush as possible with the right margin (there will usually still be a slight raggedness to the right edge of the text).
There are a couple of detailed rules about the use of tsek padding. Justifying tseks are almost always used when the line ends in a tsek. If, however, the line ends in a shay, there are a number of alternatives.
If the line ends with a single shay the shay is followed by spaces. Tsek padding is never applied after spaces. (See examples in the figure above.)
If the line ends in a double shay (with space between), it is unusual (though possible) to add tsek padding. Instead, the space between the shays is stretched or narrowed. (See examples in the figure below.) The same applies if the second shay was removed because it was preceded by a KA or GA.
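The core of tsek padding can be sketched as follows. This is a crude approximation with names and parameters of my own: width is counted in code points, whereas a real layout engine would measure glyph advances, and min_gap stands in for the idea that a short remaining space may simply be left as it is. Lines that do not end in a tsek are returned unchanged, leaving the shay cases described above to other logic.

```python
TSHEG = "\u0f0b"

def pad_with_tsheg(line, width, min_gap=2):
    """Fill the space up to width with tshegs, if the line ends in a tsheg."""
    gap = width - len(line)
    if gap < min_gap or not line.endswith(TSHEG):
        return line
    return line + TSHEG * gap
```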
Over and above that described in the previous section, traditional Tibetan text uses very little punctuation, but there are a number of signs and symbols to choose from.
U+0F08 TIBETAN MARK SBRUL SHAD ༈ is used to separate texts that are equivalent to topics and subtopics, such as the start of a smaller text, the start of a prayer, a chapter boundary, or to mark the beginning and end of insertions into text in pechas.
This drul-shay is usually surrounded on both sides by the equivalent of about three non-breaking spaces (though no rule is specified). The drul-shay should not appear at the beginning of a new line and the whole structure of spacing-plus-shay needs to be kept together.
U+0F3C TIBETAN MARK ANG KHANG GYON ༼ and U+0F3D TIBETAN MARK ANG KHANG GYAS ༽ are paired punctuation used to form a roof over one or more digits or words. The right-hand character can also be used much like a single parenthesis in list counters.
U+0F3E TIBETAN SIGN YAR TSHES ༾ and U+0F3F TIBETAN SIGN MAR TSHES ༿ are also paired characters used in combination with digits.
U+0F34 TIBETAN MARK BSDUS RTAGS ༴ means 'etc.', and is used after the first few tsek-bar of a recurring phrase.
U+0FBE TIBETAN KU RU KHA ྾ (often repeated three times) indicates a refrain.
U+0F36 TIBETAN MARK CARET -DZUD RTAGS BZHI MIG CAN ༶ and U+0FBF TIBETAN KU RU KHA BZHI MIG CAN ྿ are used to indicate where text should be inserted within other text or as references to footnotes and marginal notes.
U+2638 WHEEL OF DHARMA ☸, which occurs sometimes in Tibetan texts, is encoded in the Miscellaneous Symbols block.
U+0F35 TIBETAN MARK NGAS BZUNG NYI ZLA ༵ and U+0F37 TIBETAN MARK NGAS BZUNG SGOR RTAGS ༷ can be used to create a similar effect to underlining or to mark emphasis.
The use of these marks is not straightforward, since they attach to a syllable rather than a character. To place them correctly, the application therefore needs to take syllable boundary positions into account. If entered as combining characters they can be added after the vowel-sign in a stack.
Application software has to ignore these characters for text processing, such as search and collation.
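For searching, one simple way to ignore these marks is to strip them from both the text and the query before comparing. A minimal sketch (the function name is my own, and a full collation implementation would do considerably more than this):

```python
# U+0F35 NGAS BZUNG NYI ZLA and U+0F37 NGAS BZUNG SGOR RTAGS map to nothing.
EMPHASIS_MARKS = {0x0F35: None, 0x0F37: None}

def search_key(text):
    """Return text with the emphasis marks removed, for matching purposes."""
    return text.translate(EMPHASIS_MARKS)
```

A search would then compare search_key(query) against search_key(document_text).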
Alternative methods of emphasis include use of a different colour, or the use of the prefix ༸.
These characters may also be used in interspersed commentaries to tag the root text that is being commented on. An alternative is to set the tsek-bar being commented on in large type and the commentary in small type.
Modern texts appear to use bold text for emphasis.
When text in smaller annotations or larger heading text is mixed with normal text, the letter-heads of all characters should align to the same height.
This is a list of main characters or character combinations needed for Tibetan.
|Extensions for Balti||ཫ ཬ|
|Dependent vowel signs||ི ུ ེ ོ ཱ|
|Additional consonants for transliteration||
|Transliteration, fixed-form consonants||ཪ ྺ ྻ ྼ|
|Additional vowel signs for transliteration||ཱི ཱུ ྲྀ ཷ ླྀ ཹ ཻ ཽ ྀ ཱྀ|
|Transliteration, vocalic modification||ཾ ཿ ྄|
|Transliteration, head letters||ྈ ྉ ྊ ྋ|
|Transliteration, subjoined signs|
|Head marks||༁ ༂ ༃ ༄ ༅ ༆ ༇|
|Punctuation||་ ། ༎ ༏ ༐ ༑ ༈ ༔ ༴ ༵ ༷|
|Paired punctuation||༺ ༻ ༼ ༽|
|Various marks, signs & symbols||༉ ༊ ༒ ༓ ༶ ༸ ༹ ྂ ྃ ྅ ྆ ྇ ྾ ྿ ࿐ ࿑ ࿒ ࿄ ࿅ ࿆ ࿇ ࿈ ࿉ ࿊ ࿋ ࿌ ࿕ ࿖ ࿗ ࿘|
|Astrological signs||༕ ༖ ༗ ༘ ༙ ༚ ༛ ༜ ༝ ༞ ༟ ༾ ༿ ࿎ ࿏|
|Cantillation signs||࿀ ࿁ ࿂ ࿃|
|Digits||༡ ༢ ༣ ༤ ༥ ༦ ༧ ༨ ༩ ༠|
|Digits minus half||༪ ༫ ༬ ༭ ༮ ༯ ༰ ༱ ༲ ༳|