An Introduction to Writing Systems & Unicode

CJK character sets

Chinese

slide

Initially there was only one form of written Chinese – what we now call Traditional Chinese. Then in the 1950s Mainland China introduced Simplified Chinese. It was simplified in two ways:

  1. the more common character shapes were reduced in complexity,

  2. a smaller set of characters was defined for common usage than had traditionally been the case (with the result that more than one character in Traditional Chinese may map to a single character in the Simplified Chinese set).

This slide shows Traditional Chinese above and Simplified Chinese below.

Traditional Chinese is still used in Taiwan and Hong Kong, and by much of the Chinese diaspora. Simplified Chinese is used in Mainland China and Singapore. It is important to stress that people speaking many different, often mutually unintelligible, Chinese dialects use one or other of these scripts to write Chinese – ie. the characters do not necessarily represent the sounds.

There are a few local characters, such as for Cantonese in Hong Kong, that are not in widespread use.

In Chinese these ideographs are called hanzi (xan.ʦɹ̩). They are often referred to as Han characters.

There is another script used with Traditional Chinese for annotations and transliteration during input. It is called zhuyin (ʈʂu.in) or bopomofo, and will be described in more detail later.

It is said that Chinese people typically use around 3-4,000 characters for most communication, but a reasonable word processor would need to support at least 10,000. Unicode supports over 70,000 Han characters.

slide

This slide shows examples of contrasting shapes in Traditional and Simplified ideographs.

The paragraph on the left is in Simplified Chinese. That on the right is Traditional. The slide shows the same two characters from each paragraph so that you can see how the shape varies. In one case, just the left-hand part of the glyph is different; in the other, the right-hand side is different.

Each of the large glyphs shown above is a separate code point in Unicode. The Simplified and Traditional shapes are not unified unless they are extremely similar. (Han unification will be explained in more detail later.)

Japanese

slide

Japanese uses three scripts in addition to Latin (which is called romaji), and mixes them all together.

Top centre on the slide is an example of ideographic characters, borrowed from Chinese, which in Japanese are called kanji. Kanji characters are used principally for the roots of words.

The example at the top right of the slide is written entirely in hiragana. Hiragana is a native Japanese syllabic script typically used for many indigenous Japanese words (as in this case) and for grammatical particles and endings. The example at the bottom of the slide shows its use to express grammatical information alongside a kanji character (the darker, initial character) that expresses the root meaning of the word.

Japanese everyday usage requires around 2,000 kanji characters – although Japanese character sets include many thousands more.

slide

The example at the bottom left of this slide shows the katakana script. This is used for foreign loan words in Japanese. The example reads ‘te-ki-su-to’, ie. ‘text’.

slide
slide

On the two slides above we see the more common characters from the hiragana (left) and katakana (right) syllabaries arranged in traditional order. A character in the same location in each table is pronounced exactly the same.

With the exception of the vowels on the top line and the letter ‘n’, all of the symbols represent a consonant followed by a vowel.

The first of the two slides highlights some script features (on the right) from hiragana. The second shows the correspondences in katakana.

Voiced consonants are indicated by attaching a dakuten mark (looks like a quote mark) to the unvoiced shape. The ‘p’ sound is indicated by the use of a han-dakuten (looks like a small circle). The slides show glyphs for ‘ha’, ‘ba’, and ‘pa’ on the top line.

A small ‘tsu’ (っ) is commonly used to lengthen the consonant sound that follows it.

Small versions of や, ゆ, and よ are used to form syllables such as ‘kya’ (きゃ), ‘kyu’ (きゅ), and ‘kyo’ (きょ) respectively.

When writing katakana the mark ー is used to indicate a lengthened vowel.
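
Incidentally, the dakuten and han-dakuten described above also exist in Unicode as combining characters (U+3099 and U+309A), alongside the precomposed voiced and semi-voiced kana; normalization relates the two representations. A minimal Python sketch:

    import unicodedata

    ha = "\u306F"                                      # は HIRAGANA LETTER HA
    ba = unicodedata.normalize("NFC", ha + "\u3099")   # は + combining dakuten
    pa = unicodedata.normalize("NFC", ha + "\u309A")   # は + combining han-dakuten
    print(ba, pa)                                      # ば ぱ (precomposed forms)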

slide

The lower example on the slide shows the small tsu being used in katakana to lengthen the ‘t’ sound that follows it. This can be transcribed as ‘intanetto’, ie. ‘internet’.

The higher example shows usage of other small versions of katakana characters. The transcription is ‘konpyuutingu’, ie. ‘computing’. In the first case the small ‘yu’ combines with the preceding ‘pi’ to produce ‘pyu’. In the second case the small ‘i’ is used with the preceding ‘te’ syllable to produce ‘ti’ – a sound that is not native to Japanese. (The nearest native equivalent would be ‘chi’.)

The higher example also shows the use of the han-dakuten and dakuten to turn ‘hi’ into ‘pi’ and ‘ku’ into ‘gu’.

There is also a lengthening mark that lengthens the ‘u’ sound before it.

slide

Han and kana characters are usually full-width, whereas Latin text is half-width or proportionally spaced.

Half-width katakana characters do exist, and for compatibility reasons there is a Unicode block for half-width kana characters. These code points should not normally be used, however. They arise from the early days of computing, when Japanese had to be fitted into a Western-biased technology.

Similarly, it is common to find full-width Latin text, especially in tables. Again, there is a Unicode block dedicated to full-width Latin characters and punctuation, but the full-width appearance should normally be achieved using a font rather than these compatibility code points.
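
If text containing these compatibility characters is encountered, the Unicode normalization form NFKC folds them back to the ordinary characters. A minimal Python sketch:

    import unicodedata

    halfwidth_kana = "\uFF83\uFF77\uFF7D\uFF84"            # ﾃｷｽﾄ, half-width katakana
    fullwidth_latin = "\uFF34\uFF45\uFF58\uFF54"           # Ｔｅｘｔ, full-width Latin
    print(unicodedata.normalize("NFKC", halfwidth_kana))   # テキスト (full-width kana)
    print(unicodedata.normalize("NFKC", fullwidth_latin))  # Text (ASCII Latin)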

Korean

slide

Korean uses a unique script called hangul. It is unique in that, although it is a syllabic script, the individual phonemes within a syllable are represented by individual shapes. The example shows how the word ‘ta-kuk-o’ is composed of 7 jamos, each expressing a single phoneme. The jamos are grouped into two-dimensional syllabic characters for display.

The initial jamo in the last syllable is not pronounced in initial position and serves purely to conform to the rule that hangul syllables always begin with a consonant.

It is possible to store hangul text in Unicode as either jamos or syllabic characters, although the latter is more common; Unicode supports both approaches.
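
A minimal Python sketch of the two representations, using the standard unicodedata module: NFD decomposes a precomposed syllable into its jamos, and NFC recomposes them.

    import unicodedata

    syllable = "\uD55C"                                    # 한, a precomposed hangul syllable
    jamos = unicodedata.normalize("NFD", syllable)
    print([hex(ord(j)) for j in jamos])                    # ['0x1112', '0x1161', '0x11ab']
    print(unicodedata.normalize("NFC", jamos) == syllable) # True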

South Korea also mixes ideographic characters borrowed from Chinese with hangul, though on nothing like the scale of Japanese. In fact, it is quite normal to find whole documents without any hanja, as the ideographic characters in Korean are called.

There are about 2,300 hangul characters in everyday use, but the Unicode Standard has code points for around 11,000.

Radicals

slide

A radical is an ideograph or a component of an ideograph that is used for indexing dictionaries and word lists, and as the basis for creating new ideographs. The 214 radicals of the KangXi dictionary are universally recognised.

The examples enlarged on the slide show the ideographic character meaning ‘word’, ‘say’ or ‘speak’ (bottom left), and three more characters that use this as a radical on their left hand side.

slide

The visual appearance of radicals may vary significantly.

Here the radical shown on the previous slide is seen as used in Simplified Chinese (top right). Although the shape differs somewhat it still represents the same radical.

On the bottom row we see the ‘water’ radical being used in two different positions in a character, and with two different shapes. This time the right-most example is found in both simplified and traditional forms.

slide

Unicode dedicates two blocks to radicals. The KangXi radicals block (pronounced kʰɑŋ.ɕi) depicted here contains the base forms of the 214 radicals.

The CJK Radicals Supplement block contains variant shapes of these radicals when they are used as parts of other characters or in simplified form. These have not been unified because they often appear independently in dictionary indices.

Characters in these blocks should never be used as ideographs.

Character sets, encodings, and multi-byte characters

Character sets & encodings

slide

A very early step in supporting a script or set of scripts is to define the set of characters needed to write it.

The slide shows a set that was defined for the North African Tifinagh script. It includes characters for a number of variants of Tifinagh besides that used in Morocco, such as writing used by the Touareg.

At this stage, this is just a bag of characters with no formal structure. It is not necessarily computer-specific – it is just a list of characters needed for writing Tifinagh, one way or another.

This is called a character set, or repertoire.

slide

Next the characters are ordered in a standard way and numbered. Each unique character has a unique number, called a code point. The code point of the character circled above is 33 in hexadecimal notation (a common way to represent code points), or 51 in decimal.

A set of characters ordered and numbered in this way is called a coded character set.
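
As a side note, hexadecimal and decimal are just two ways of writing the same number, and in Unicode terms Python's ord() and chr() convert between a character and its code point. The Tifinagh letter below is just an arbitrary example, not necessarily the character circled on the slide.

    print(int("33", 16), hex(51))   # 51 0x33, the same number in two notations
    print(hex(ord("\u2D30")))       # 0x2d30, TIFINAGH LETTER YA
    print(chr(0x2D30))              # ⴰ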

slide

In the early days of computing only 7 bits of a byte were used to encode characters, allowing for a code page containing 128 code points. This was the era of ASCII.

slide

When all 8 bits of a byte were used, code pages containing 256 code points became possible. These code pages typically retain the ASCII characters in the lower 128 code points and add characters for additional languages in the upper half. On the slide we see a Latin1 code page, ISO 8859-1, containing code points for Western European languages.

slide

Unfortunately, 256 code points were not enough to support the whole of Europe – not even Latin-based languages such as Turkish, Hungarian, etc. To support Greek characters you might see the code points re-mapped as shown on the slide (left-hand side). These alternative code pages forced you to maintain contextual information so that you could determine the intended character for the upper ranges of the code page. They also made localization difficult, since you had to keep changing code pages.
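
To make the ambiguity concrete, here is a minimal Python sketch showing the same byte value interpreted under two different ISO 8859 code pages:

    b = bytes([0xE1])               # one byte in the upper half of the code page
    print(b.decode("iso-8859-1"))   # á  (Latin1, Western European)
    print(b.decode("iso-8859-7"))   # α  (Latin/Greek)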

slide

East Asian computing immediately faced a much bigger problem than in Europe, as can be seen by the size of these common character sets. The solution was double-byte coded character sets. Two bytes per character provide 16 bits, which would theoretically allow for 2¹⁶ (ie. 65,536) possible code points. In reality these character sets tended to be based on a 7-bit model, utilizing only part of the total space available.

One significant problem persisted here – these character sets and their encodings were script specific. It was still difficult to represent Chinese, Korean and Japanese text simultaneously.

slide

Unicode encompasses all scripts and symbols needed for text in a single character set.

slide

Most modern scripts and useful symbols are currently encoded in a coding space called the Basic Multilingual Plane or BMP. There is room for 65,536 characters on this plane.

slide

Beyond the BMP, Unicode defines 16 additional supplementary planes, each the same size as the BMP, and characters are regularly added to these planes with each new version of Unicode. The Supplementary Multilingual Plane (SMP) contains characters for such things as additional alphabets, math characters, and the majority of emoji characters. A large number of additional ideographic characters have also been added to the Supplementary Ideographic Plane (SIP).

In total there are now over one million code point slots available. This means that all of the above scripts and more can be represented simultaneously with ease. Localization also becomes easier, since there is no need to enable new code pages or switch encodings – you simply begin using the characters that are available.

slide

In addition to the normal code point allocations, there is also space available in Unicode for privately defined character mappings. There is a Private Use Area in the BMP from code points E000–F8FF (6,400 code points). There are two additional, and much larger, private use areas in the supplementary character ranges.

slide

Although the terms 'character set' and 'character encoding' are often treated as interchangeable, they actually mean different things.

We have already explained that a character set or repertoire comprises the set of atomic text elements you will use for a particular purpose. We also explained that the Unicode Standard assigns a unique scalar number to every character in its character set. The resulting numbered set is referred to as a coded character set. Units of a coded character set are known as code points.

The character encoding reflects the way these code points are mapped to bytes for manipulation in a computer.

In a standard such as ISO-8859, encodings tend to use a single byte for a given character and the encoding is straightforwardly related to the position of the characters in the set.

The above distinction becomes helpful when discussing Unicode because the set of characters (ie. the character set) defined by the Unicode Standard can be encoded in a number of different ways. The type of encoding doesn’t change the number or nature of the characters in the Unicode set, just the way they are mapped to bytes for manipulation by the computer (see the next slide).

On the Web the internal character set of an XML application or HTML browser is always Unicode. A particular XML or HTML document can be encoded using another encoding, even encodings that don’t cover the full Unicode range such as ISO 8859-1 (Latin1). Having said that, we strongly recommend that you only use the UTF-8 Unicode encoding for web pages.

If you want to know more about character encodings for web pages, read Handling character encodings in HTML and CSS.

slide

This slide demonstrates a number of ways of encoding the same characters in Unicode. These encodings are UTF-8, UTF-16, and UTF-32. The text means "Hello!" in the Berber script (Tifinagh).

In the chart on the slide, the numbers below the characters represent the code point of each character in the Unicode coded character set. The other lines show the byte values used to represent that character in a particular character encoding.

UTF-8 uses 1 byte to represent characters in the old ASCII set, two bytes for characters in several more alphabetic blocks, and three bytes for the rest of the BMP. Supplementary characters use 4 bytes.

UTF-16 uses 2 bytes for any character in the BMP, and 4 bytes for supplementary characters.

UTF-32 uses 4 bytes everywhere.
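
As a rough illustration (not the exact text from the slide), the following minimal Python sketch encodes a short mixed string (an ASCII letter, a Tifinagh letter from the BMP, and a supplementary-plane Gothic letter) in the same three encodings:

    s = "A\u2D30\U00010330"         # LATIN A, TIFINAGH YA, GOTHIC AHSA
    for enc in ("utf-8", "utf-16-be", "utf-32-be"):
        data = s.encode(enc)
        print(f"{enc}: {len(data)} bytes: {data.hex(' ')}")
    # utf-8:     8 bytes: 41 e2 b4 b0 f0 90 8c b0
    # utf-16-be: 8 bytes: 00 41 2d 30 d8 00 df 30
    # utf-32-be: 12 bytes: 00 00 00 41 00 00 2d 30 00 01 03 30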

This explanation glosses over some of the detailed nomenclature related to encoding. More detail can be found in Unicode Technical Report #17, Unicode Character Encoding Model.

slide

In UTF-32, characters in the supplementary character range are encoded in bytes that correspond directly to the code point values. For example, U+10330 GOTHIC LETTER AHSA is stored as the byte sequence 00 01 03 30. In UTF-8, the character would also be represented using a 4-byte sequence, F0 90 8C B0.

UTF-16, however, represents all characters using 16-bit (2-byte) 'code units', and you can't express 0x10330 (decimal 66,352) as a 16-bit value (the maximum is decimal 65,535). To get around this, UTF-16 instead uses two special, adjacent 1,024-character ranges in Unicode referred to as high surrogates and low surrogates. The combination of a high surrogate followed by a low surrogate, when interpreted by the character encoding algorithm used for UTF-16, points to a specific character in a supplementary plane. For example, the Gothic AHSA is represented in UTF-16 as the byte sequence D8 00 DF 30, where D800 is the code point of a high surrogate, and DF30 is the code point of a low surrogate.
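
The surrogate values can be derived by simple arithmetic: subtract 0x10000 from the code point, split the resulting 20-bit value into two 10-bit halves, and add the high half to 0xD800 and the low half to 0xDC00. A minimal Python sketch for the Gothic AHSA example:

    cp = 0x10330                    # U+10330 GOTHIC LETTER AHSA
    v = cp - 0x10000                # 20-bit value 0x00330
    high = 0xD800 + (v >> 10)       # high surrogate
    low = 0xDC00 + (v & 0x3FF)      # low surrogate
    print(hex(high), hex(low))      # 0xd800 0xdf30
    print("\U00010330".encode("utf-16-be").hex(" "))   # d8 00 df 30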

You should never encounter a single surrogate character – they should always appear as high+low surrogate pairs. Also, pairs should not be split when wrapping or highlighting text, counting characters, displaying unknown character glyphs, and so on. You should also never normally see surrogate character code points in UTF-8 or UTF-32.

Unification

slide

Unicode provides a superset of most character sets in use around the world, but tries not to duplicate characters unnecessarily. For example, there are several ISO character sets in the 8859 range that all duplicate the ASCII characters. Unicode doesn't have as many codes for the letter 'a' as there are character sets - that would make for a huge and confusing character set.

The same principle applies for Han (Chinese) characters. The initial set of sources for Han encoding in Unicode, laid end to end, comprised 121,000 characters, but there were many repeats, and the final Unicode tally after elimination of duplicates was 20,902. (There are now over 70,000 Han characters encoded in Unicode.)

If Han characters had different meanings or etymologies, they were not unified. Han characters, however, are highly pictorial in nature, so the (dis-)unification process had to take the visual forms into account to some extent. Where there was a significant visual difference between Han characters that represented the same thing, they were allotted separate Unicode code points. (Unifying the Han characters was a sophisticated process, carried out over a long period by many East Asian experts.)

Factors such as those shown on this slide prevent unification, ie.

  • Different number of components
  • Same components but different positions
  • Different structure in components

slide

What is left for unification are characters representing the same thing and exhibiting no visual differences, or only relatively minor ones, such as a different sequence for writing strokes, differences in stroke overshoot and protrusion, differences in contact and bend of strokes, differences in accent and termination of strokes, etc.

Respecting character boundaries

slide

The slide shows how a string of characters maps to byte codes in memory in UTF-8. In an encoding such as UTF-8 the number of bytes actually used depends on the character in question, and only a very small number of characters are encoded using a single byte.

This means that care has to be taken to recognize and respect the integrity of the character boundaries.

Applications cannot simply handle a fixed number of bytes when performing editing operations such as inserting, deleting, wrapping, cursor positioning, etc. Collation for searching and sorting, pointing into strings, and all other operations similarly need to work out where the boundaries of the characters lie in order to successfully process the text.

Such operations need to be based on characters, not bytes.

Similarly, string lengths should be based on characters rather than bytes.
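
A minimal Python sketch of the difference (in Python 3, string operations already work in terms of characters, ie. code points, rather than bytes):

    s = "中文"                           # two Chinese characters
    print(len(s))                        # 2 characters (code points)
    print(len(s.encode("utf-8")))        # 6 bytes in UTF-8
    # removing a single byte, as a byte-oriented editor might, corrupts the text
    broken = s.encode("utf-8")[:-1]
    print(broken.decode("utf-8", errors="replace"))   # 中�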

slide

This slide illustrates how things go wrong with technology that is not multi-byte aware. In this case the author attempted to delete a Chinese character on the last line, and the application translated that to "delete a single byte". This caused a misalignment of all the following bytes, and produced garbage.

slide

Here is another example of the importance of working with characters, rather than bytes (and sometimes even larger units). In this use case, text is automatically truncated after reaching a fixed number of bytes. The top row is English, and each character is represented by a single byte. Cyrillic text, however, uses 2 bytes per character in UTF-8, so the Russian text on the 2nd line is truncated in the middle of a character. Consequently, the Russian reader will see a diamond with a question mark at the end of the line, indicating that the remaining bytes cannot be interpreted as a character.

The same happens for the Chinese text, which uses 3 bytes per character, so there is an even greater likelihood of garbage appearing at the line end.
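
If truncation to a byte limit is unavoidable, the cut at least needs to fall on a character boundary. A rough Python sketch of one way to do this (note that it still takes no account of the emoji issue discussed next):

    def truncate_utf8(text: str, max_bytes: int) -> str:
        """Truncate text so its UTF-8 form fits in max_bytes, without
        splitting a character. May still split an emoji sequence."""
        data = text.encode("utf-8")[:max_bytes]
        # 'ignore' silently drops an incomplete trailing byte sequence
        return data.decode("utf-8", errors="ignore")

    print(truncate_utf8("Привет", 5))    # Пр (the incomplete third character is dropped)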

It's a similar story for emoji, although there is an additional twist. This slide shows the composition of a couple of emoji that are built from sequences of code points, and lists the code points involved in constructing those images.

slide

Truncate the emoji sequences at a code point boundary, and rather than producing garbage, you silently change the picture. See how the family has lost a child at the bottom right! In this case the problem wasn't that a single code point was split, but that an unbreakable sequence of code points was damaged. In cases such as these, it's important to locate the boundaries of the item that will be truncated. (And by the way, note how long emoji sequences can be, and think carefully before imposing short limits on field lengths.)
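
A minimal Python sketch of the effect just described, using the family emoji, which is a sequence of several code points joined by ZWJ (zero width joiner) characters. Truncating at a code point boundary produces no garbage, but silently changes the meaning. (Locating the boundaries of such sequences requires a library that implements Unicode grapheme or emoji segmentation; Python's standard library does not provide one.)

    family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"  # 👨‍👩‍👧‍👦
    print(len(family))                   # 7 code points: 4 emoji joined by 3 ZWJs
    shortened = family[:-2]              # drop the final ZWJ + boy
    print(shortened)                     # 👨‍👩‍👧 (same kind of picture, one child fewer)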

Tools

slide

UniView is an unofficial HTML-based tool for finding Unicode characters and looking up their properties. It also acts like a character map or character picker, allowing you to create strings of Unicode characters. You can also use it to discover the contents of a string or a sequence of codepoint values, to convert to NFC or NFD normalized forms, display ranges of characters as lists or tables, highlight properties, etc.

A significant feature of UniView is that it has images for all Unicode characters (apart from some of the more recent ideographic ones), so you don't have to wrestle with fonts (although you can turn off the images if you prefer). It is always up to date with the latest Unicode version.

The Unibook Character browser is a downloadable utility for offline viewing of the character charts and character properties for The Unicode Standard, created by Asmus Freytag. It can also be used to copy&paste character codes. The utility was derived from the program used to print the character code charts for the Unicode Standard and ISO/IEC 10646.

If you need to convert Unicode characters between various escaped forms, you should try the web-based Unicode Code Converter tool.

There are also over 30 web-based Unicode Character Pickers available. These allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. They are likely to be most useful if you don't know a script well enough to use the native keyboard. The arrangement of characters makes them much more usable than a regular character map utility. The more advanced pickers provide ways to select characters from semantic or phonetic arrangements, select by shape, and select by association with a transcription.

Inputting ideographic characters

Getting to the right character quickly

slide

We have noted that East Asian character sets number their characters in the thousands. So how do you quickly find the one character you want while typing?

In the past people have tried using extremely large keyboards, or forcing people to remember the code point numbers for the character. Not surprisingly these approaches were not very popular.

The answer is to use an IME (Input Method Editor). An IME (also called a front-end processor) is software that uses a number of strategies to help you search for the character you want.

slide

This slide summarizes the typical steps when typing in Japanese using a standard IME for Windows.

The user types Japanese in romaji transcription using a QWERTY keyboard. As they type the transcription is automatically converted to hiragana or katakana. Ranges of characters are accepted by a key press as they go along. To convert a range of characters to kanji, the user presses a key such as the space bar. Typically the IME will automatically insert into the text the kanji that were last selected for the transcription that has been input. If this is not the desired kanji sequence, the user presses the key again and a selection list pops up, usually ordered in terms of frequency of selection. The user picks the kanji characters required, and confirms their choice, then moves on.

Note that there are only a few alternatives for the sequence かいぎ. If the user had looked up かい and ぎ separately they would have been faced each time with a large number of choices. The provision of a dictionary as part of the IME for lookup of longer phrases is one way of speeding up the process of text entry for the user.

Ordering by frequency and memory of the last conversion are additional methods of assisting the user to find the right character more quickly.
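
To make the process concrete, here is a toy Python sketch of the conversion step just described. The data is hypothetical and vastly simplified compared with a real IME dictionary; it simply orders candidates by frequency and remembers the user's previous choice for a reading:

    # candidate kanji for each kana reading, most frequent first (toy data)
    candidates = {
        "かいぎ": ["会議", "懐疑", "回議"],
    }
    last_choice = {}                     # remembers the previous conversion per reading

    def convert(reading):
        options = candidates.get(reading, [reading])
        first = last_choice.get(reading, options[0])
        return [first] + [o for o in options if o != first]

    def choose(reading, kanji):
        last_choice[reading] = kanji

    print(convert("かいぎ"))             # ['会議', '懐疑', '回議']
    choose("かいぎ", "懐疑")
    print(convert("かいぎ"))             # ['懐疑', '会議', '回議']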

Chinese input methods

slide

Whereas the romaji input method predominates for Japanese, there are a number of different approaches available for Chinese.

Pinyin was introduced with Simplified Chinese, and is typically used in the same geographical areas, ie. Mainland China and Singapore.

It is essentially equivalent to the romaji input method. The numbers you see in the example above indicate tones. This dramatically reduces the ambiguity of the sounds in Chinese.

One of the problems of pinyin is that the transcription is based on the Mandarin or Putonghua dialect of spoken Chinese. So to use this method you need to be able to speak that dialect.

slide

A more common input method in Taiwan uses an alphabet called zhuyin or bopomofo. This alphabet is only used for phonetic transcription of Chinese. Essentially it is the same idea as pinyin, but with different letters. The tones in this case are indicated by spacing accent marks (shown only in the top line on the slide) which in Unicode are unified with accents used in European languages.

slide

A very different approach allows the user to create the desired character on the basis of its visual appearance rather than its pronunciation.

Cangjie input uses just such an approach. The keyboard provides access to primitive ideographic components which, when combined in the right sequence, lead to the desired ideograph.

An advantage of an approach such as Cangjie is that you don’t have to speak Mandarin. A drawback is the additional training required.

Note that pen-based input is another useful approach. In fact, this is particularly helpful for people who do not speak Chinese or Japanese. Once you master a few simple rules about stroke order and direction, you can use something like Microsoft’s IME Pad to draw and select characters without any knowledge of components or pronunciation.

slide

The examples on this slide show the keystrokes required to enter the text used in the previous slides containing pinyin and bopomofo examples.

Alternative representations of characters

slide

In some cases you may come across an ideograph that your font or your character set doesn’t support. Unicode provides a way of saying, “I can’t represent it, but it looks like this character.”

The approach requires you to add the character U+303E IDEOGRAPHIC VARIATION INDICATOR, immediately followed by a similar-looking character. This at least gives the reader a chance to guess at the character that is missing.

slide

Another way of addressing the same problem is to use the ideographic description characters introduced in Unicode 3.0. As an example, the slide above uses this approach to describe a character which exists in Unicode.

This approach allows you to draw a picture showing which components make up the character you can’t represent, and where they appear. The lower line on the slide shows how you would describe the large character near the top. Note that this is interpreted recursively.
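
As a minimal illustration of the mechanism (using, for simplicity, a character that is in fact encoded), the character 你 could be described by U+2FF0 IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT followed by its two components:

    # an ideographic description sequence: ⿰ (left to right) + 亻 + 尔,
    # describing the shape of 你 without using the character itself
    ids = "\u2FF0\u4EBB\u5C14"
    print(ids)   # ⿰亻尔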

Note also that this should not be treated as in any way equivalent to an existing ideograph when collating strings.

slide

Wikipedia gives the following description of a Han character that is in use but was not encoded in Unicode at the time of writing.

Biángbiáng noodles ([...] pinyin: Biángbiáng miàn), also known as (simplified Chinese: 油泼扯面; traditional Chinese: 油潑扯麵; pinyin: Yóupō chěmiàn), are a type of noodle popular in China's Shaanxi province. The noodles, touted as one of the "ten strange wonders of Shaanxi" (Chinese: 陕西十大怪), are described as being like a belt, owing to their thickness and length. Made up of 56 strokes, the Chinese character for "biáng" is one of the most complex Chinese characters in contemporary usage, although the character is not found in modern dictionaries or even in the Kangxi dictionary.

The Wikipedia article uses ideographic description characters to show the composition of the character.

Another example of this mechanism can be found at http://www.unicode.org/Public/UNIDATA/USourceData.txt.

First published Feb 2003. This version 2021-12-10 17:17 GMT.  •  Copyright r12a@w3.org. Licence CC-By.