Picture of the page in action.
>> Use the app

This app allows you to see how Unicode characters are represented as bytes in various legacy encodings, and vice versa. You can customise the encodings you want to experiment with by clicking on “change encodings shown”. The default selection excludes most of the single-byte encodings.

The app provides a way of detecting the likely encoding of a sequence of bytes if you have no context, and also allows you to see which encodings support specific characters. The list of encodings is limited to those described for use on the Web by the Encoding specification.
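
To give a feel for the "which encodings support this character" check, here is a minimal Python sketch. It is not the app's code. The function name and the candidate list are my own illustration, and Python's codec tables can differ in detail from the tables in the Encoding specification, so treat the results as approximate.

```python
# Illustrative candidate list only. These are Python codec names, and their
# tables are not guaranteed to match the Encoding specification exactly.
CANDIDATES = ["windows-1252", "koi8-r", "shift_jis", "euc-kr", "gbk", "big5"]


def encodings_supporting(text, candidates=CANDIDATES):
    """Return the candidate encodings that can represent every character in text."""
    supported = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            # At least one character has no mapping in this encoding.
            continue
        supported.append(name)
    return supported


print(encodings_supporting("€"))
print(encodings_supporting("Ж"))
```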

The algorithms used are based on those described in the Encoding specification, and thus describe the behaviour you can expect from web browsers. The transforms may not be the same as for other conversion tools. (In some cases the browsers may also produce a different result than shown here, while the implementation of the spec proceeds. See the tests.)

Encoding algorithms convert Unicode characters to sequences of two-digit hex numbers that represent the bytes found in the target character encoding. A character that cannot be handled by an encoder will be represented as a decimal HTML character escape.
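
The encoder output described above can be approximated in a few lines of Python. This is only a sketch under assumptions: `encode_to_hex` is a name I made up, it relies on Python's built-in codecs rather than the Encoding specification's algorithms, and it encodes one character at a time, which would not suit a stateful encoding such as ISO-2022-JP.

```python
def encode_to_hex(text, encoding):
    """Show each character as two-digit hex bytes, or as a decimal escape."""
    parts = []
    for ch in text:
        try:
            encoded = ch.encode(encoding)
        except UnicodeEncodeError:
            # No mapping in this encoding: fall back to a decimal HTML escape.
            parts.append(f"&#{ord(ch)};")
        else:
            parts.extend(f"{byte:02X}" for byte in encoded)
    return " ".join(parts)


print(encode_to_hex("aé中", "iso-8859-1"))  # 61 E9 &#20013;
```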

Decoding algorithms take the byte codes just mentioned and convert them to Unicode characters. The algorithm returns replacement characters where it is unable to map a given byte to a character in the encoding.

For the decoder input you can provide a string of hex numbers separated by space or by percent signs.
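
As a rough illustration of the decoder side, the following Python sketch accepts hex numbers separated by spaces or percent signs and substitutes U+FFFD where bytes cannot be mapped. The function name is hypothetical, and Python's decoders may not agree with the Encoding specification in every corner case.

```python
import re


def decode_hex(hex_input, encoding):
    """Decode hex byte values separated by whitespace and/or percent signs."""
    tokens = [t for t in re.split(r"[\s%]+", hex_input) if t]
    data = bytes(int(token, 16) for token in tokens)
    # errors="replace" yields U+FFFD for bytes that cannot be decoded.
    return data.decode(encoding, errors="replace")


print(decode_hex("%E3%81%82", "utf-8"))
print(decode_hex("82 A0", "shift_jis"))
```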

Green backgrounds appear behind sequences where all characters or bytes were successfully mapped to a character in the given encoding. Beware, however, that the character mapped to may not be the one you expect – especially in the single-byte encodings.

To identify characters and look up information about them you will find UniView extremely useful. You can paste Unicode characters into the UniView Edit Buffer and click on the down-arrow icon below to find out what they are. (Click on the name that appears for more detailed information.) It is particularly useful for identifying escaped characters. Copy the escape(s) to the Find input area on UniView and click on Dec just below.

>> Use the picker

An update to version 17 of the Mongolian character picker is now available.

When you hover over or select a character in the selection area, the box to the left of that area displays the alternate glyph forms that are appropriate for that character. By default, this only happens when you click on a character, but you can make it happen on hover by clicking on the V in the gray selection bar to the right.

The list includes the default positional forms as well as the forms produced by following the character with a Free Variation Selector (FVS). The latter forms have been updated, based on work that took place during 2015 to standardise the forms produced by using FVS. At the moment, not all fonts will produce the expected shapes for all possible combinations. (For more information, see Notes on Mongolian variant forms.)

An additional new feature is that when the variant list is displayed, you can add an appropriate FVS character to the output area by simply clicking in the list on the shape that you want to see in the output.
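
In terms of the underlying text, clicking on a variant shape simply appends one of the Free Variation Selector characters to the letter. A minimal Python sketch follows; the helper name `with_fvs` is my own, and whether a particular letter and selector combination produces a distinct shape depends on the font.

```python
# The three Mongolian Free Variation Selectors.
FVS = {
    1: "\u180B",  # MONGOLIAN FREE VARIATION SELECTOR ONE
    2: "\u180C",  # MONGOLIAN FREE VARIATION SELECTOR TWO
    3: "\u180D",  # MONGOLIAN FREE VARIATION SELECTOR THREE
}


def with_fvs(letter, selector):
    """Return the letter followed by the requested variation selector."""
    return letter + FVS[selector]


sequence = with_fvs("\u1820", 1)  # MONGOLIAN LETTER A followed by FVS1
print([f"U+{ord(ch):04X}" for ch in sequence])  # ['U+1820', 'U+180B']
```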

This provides an easy way to check what shapes should be produced and what shapes are produced by a given font. (You can specify which font the app should use for display of the output.)

Some small improvements were also made to the user interface. The picker works best in Firefox and Edge desktop browsers, since they now have pretty good support for vertical text. It works least well in Safari (which includes the iPad browsers).

For more information about the picker, see the notes at the bottom of the picker page.

About pickers: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility. See the list of available pickers.

>> Use UniView

This update allows you to link to information about Han characters and Hangul syllables, and fixes some bugs related to the display of Han character blocks.

Information about Han characters displayed in the lower right area will have a link View data in Unihan database. As expected, this opens a new window at the page of the Unihan database corresponding to this character.

Han and Hangul characters also have a link View in PDF code charts (page XX). On Firefox and Chrome, this will open the PDF file for that block at the page that lists this character. (For Safari and Edge you will need to scroll to the page indicated.) The PDF is useful if there is no picture or font glyph for that character, but also allows you to see the variant forms of the character.

For some Han blocks, the number of characters per page in the PDF file varies slightly. In this case you will see the text approx; you may have to look at a page adjacent to the one you are taken to for these characters.

Note that some of the PDF files are quite large. If the file size exceeds 3MB, a warning is included.

>> Use UniView

Unicode 8.0.0 is released today. This new version of UniView adds the new characters encoded in Unicode 8.0.0 (including 6 new scripts). The scripts listed in the block selection menu were also reordered to match changes to the Unicode charts page.

The URL for UniView is now https://r12a.github.io/uniview/. Please change your bookmarks.

The GitHub site now holds images for all 28,000+ Unicode codepoints other than Han ideographs and Hangul syllables (in two sizes).

I also fixed the Show Age filter, and brought it up to date.

Three bopomofo letters with tone mark.

Light tone mark in annotation.

A key issue for handling of bopomofo (zhùyīn fúhào) is the placement of tone marks. When bopomofo text runs vertically (either on its own, or as a phonetic annotation), some smarts are needed to display tone marks in the right place. This may also be required (though with different rules) for bopomofo when used horizontally for phonetic annotations (i.e. above a base character), but not in all such cases. However, when bopomofo is written horizontally in any other situation (i.e. when not written above a base character), the tone mark typically follows the last bopomofo letter in the syllable, with no special handling.

From time to time questions are raised on W3C mailing lists about how to implement phonetic annotations in bopomofo. Participants in these discussions need a good understanding of the various complexities of bopomofo rendering.

To help with that, I just uploaded a new Web page Bopomofo on the Web. The aim is to provide background information, and carry useful ideas from one discussion to the next. I also add some personal thoughts on implementation alternatives, given current data.

I intend to update the page from time to time, as new information becomes available.

Version 16 of the Bengali character picker is now available.

Other than a small rearrangement of the selection table, and the significant standard features that version 16 brings, this version adds the following:

  • three new buttons for automatic transcription between Latin and Bengali. You can use these buttons to transcribe to and from Latin transcriptions using ISO 15919 or Radice approaches.
  • hinting to help identify similar characters.
  • the ability to select the base character for the display of combining characters in the selection table.

For more information about the picker, see the notes at the bottom of the picker page.

In addition, I made a number of additions and changes to Bengali script notes (an overview of the Bengali script), and Bengali character notes (an annotated list of characters in the Bengali script).

initial-letter-tibetan-01

The CSS WG needs advice on initial letter styling in non-Latin scripts, i.e. enlarged letters or syllables at the start of a paragraph like those shown in the picture. Most of the current content of the recently published Working Draft, CSS Inline Layout Module Level 3, is about styling of initial letters, but the editors need to ensure that they have covered the needs of users of non-Latin scripts.

The spec currently describes drop, sunken and raised initial characters, and allows you to manipulate them using the initial-letter and the initial-letter-align properties. You can apply those properties to text selected by ::first-letter, or to the first child of a block (such as a span).

The editors are looking for any examples of drop initials in non-Western scripts, especially Arabic and Indic scripts.

I have scanned some examples from newspapers (so, not high quality print).

In the section about initial-letter-align the spec says:

Input from those knowledgeable about non-Western typographic traditions would be very helpful in describing the appropriate alignments. More values may be required for this property.

Do you have detailed information about initial letter styling in a non-Latin script that you can contribute? If so, please write to www-style@w3.org (how to subscribe).

I’m struggling to show combining characters on a page in a consistent way across browsers.

For example, while laying out my pickers, I want users to be able to click on a representation of a character to add it to the output field. In the past I resorted to pictures of the characters, but now that webfonts are available, I want to replace those with font glyphs. (That makes for much smaller and more flexible pages.)

Take the Bengali picker that I’m currently working on. I’d like to end up with something like this:

comchacon0

I put a no-break space before each combining character, to give it some width, and because that’s what the Unicode Standard recommends (p60, Exhibiting Nonspacing Marks in Isolation). The result is close to what I was looking for in Chrome and Safari except that you can see a gap for the nbsp to the left.

comchacon1

But in IE and Firefox I get this:

comchacon2

This is especially problematic since it messes up the overall layout, but in some cases it also causes text to overlap.

I tried using a dotted circle Unicode character, instead of the no-break space. On Firefox this looked ok, but on Chrome it resulted in two dotted circles per combining character.

I considered using a consonant as the base character. It would work ok, but it would possibly widen the overall space needed (not ideal) and would make it harder to spot a combining character by shape. I tried putting a span around the base character to grey it out, but the various browsers reacted differently to the span. Vowel signs that appear on both sides of the base character no longer worked – the vowel sign appeared after. In other cases, the grey of the base character was inherited by the whole grapheme, regardless of the fact that the combining character was outside the span. (Here are some examples ে and ো.)

In the end, I settled for no preceding base character at all. The combining character was the first thing in the table cell or span that surrounded it. This gave the desired result for the font I had been using, albeit that I needed to tweak the occasional character with padding to move it slightly to the right.

On the other hand, this was not to be a complete solution either. Whereas most of the fonts I planned to use produce the dotted circle in these conditions, one of my favourites (SolaimanLipi) doesn’t produce it. This leads to significant problems, since many combining characters appear far to the left. In some cases it is not possible to click on them; in others you have to locate a blank space somewhere to the right and click on that. Not at all satisfactory.

comchacon3

I couldn’t find a better way to solve the problem, however, and since there were several Bengali fonts to choose from that did produce dotted circles, I settled for that as the best of a bad lot.

However, then I turned my attention to other pickers and tried the same solution. I found that only one of the many Thai fonts I tried for the Thai picker produced the dotted circles. So the approach here would have to be different. For Khmer, the main Windows font (Daunpenh) produced dotted circles only for some of the combining characters in Internet Explorer. And on Chrome, a sequence of two combining characters, one after the other, produced two dotted circles…

I suspect that I’ll need to choose an approach for each picker based on what fonts are available, and perhaps provide an option to insert or remove base characters before combining characters when someone wants to use a different font.
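
An option like that could be as simple as the following Python sketch, which prefixes a configurable base character to anything in a Unicode Mark category. This is only an illustration: the function and constant names are mine, and it does nothing about the rendering differences between browsers and fonts described above.

```python
import unicodedata

NBSP = "\u00A0"           # NO-BREAK SPACE
DOTTED_CIRCLE = "\u25CC"  # DOTTED CIRCLE


def display_form(ch, base=NBSP):
    """Return the text to show in a selection-table cell for one character.

    Marks (general categories Mn, Mc, Me) get the chosen base character in
    front of them. Pass base="" to show the mark with no base at all.
    """
    if base and unicodedata.category(ch).startswith("M"):
        return base + ch
    return ch


# BENGALI VOWEL SIGN E with each choice of base, then a plain consonant.
print(repr(display_form("\u09C7")))
print(repr(display_form("\u09C7", DOTTED_CIRCLE)))
print(repr(display_form("\u09C7", "")))
print(repr(display_form("\u0995")))
```

Testing the general category rather than the combining class matters here, because spacing vowel signs such as U+09C7 have a combining class of zero.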

It would be nice to standardise behaviour here, and to do so in a way that involves the no-break space, as described in the Unicode Standard, or some other base character such as – why not? – the dotted circle itself. I assume that the fix for this would have to be handled by the browser, since there are already many font cats out of the bag.

Does anyone have an alternate solution? I thought I heard someone at the last Unicode conference mention some way of controlling the behaviour of dotted circles via some script or font setting…?

Update: See Marc Durdin’s blog for more on this topic, and his experiences while trying to design on-screen keyboards for Lao and other scripts.

  1. Marc Durdin Says:

    Richard, your post tickled my memory about a draft sitting in my list of blog posts to complete, so I sat down and completed it. My blog describes how we resolved the combining issue for a Lao keyboard; the principle should be applicable to other systems. It’s a bit wordy though…

    http://marc.durdin.net/2015/01/how-to-rendering-combining-marks-consistently-across-platforms-a-long-story/

I have uploaded a new version of the Khmer character picker.

The new version uses characters instead of images for the selection table, making it faster to load and more flexible. If you prefer, you can still access the previous version.

Other than a small rearrangement of the default selection table to accommodate fonts rather than images, and the significant standard features that version 16 brings, there are no additional changes in this version.

For more information about the picker, see the notes at the bottom of the picker page.

I have updated the Devanagari picker, the Gurmukhi picker and the Uighur picker to version 16.

You may have spotted a previous, unannounced, version of the Devanagari and Uighur pickers on the site, but essentially these versions should be treated as new. The Gurmukhi picker has been updated from a very old version.

In addition to the standard features that version 16 of the character pickers brings, things to note include the addition of hints for all pickers, and automated transcription from Devanagari to ISO 15919, and vice versa for the Devanagari picker.

For more information about the pickers, see the notes at the bottom of the relevant picker page.

A couple of posts ago I mentioned that I had updated the Thai picker to version 16. I have now updated a few more. For ease of reference, I will list here the main changes between version 16 pickers and previous versions back to version 12.

  • Fonts rather than graphics. The main selection table in version 12 used images to represent characters. These have now gone, in favour of fonts. Most pickers include a web font download to ensure that you will see the characters. This reduces the size and download time significantly when you open a picker. Other source code changes have reduced the size of the files even further, so that the main file is typically only a small fraction of the size it was in version 14.

    It is also now possible, in version 16, to change the font of the main selection table and the font size.

  • UI. The whole look and feel of the user interface has changed from version 14 onwards, and includes useful links and explanations off the top of the normal work space.

    In particular, the vertical menu, introduced in version 14, has been adjusted so that input features can be turned on and off independently, and new panels appear alongside the others, rather than toggling the view from one mode to another. So, for example, you can have hints and shape-based selectors turned on at the same time. When something is switched on, its label in the menu turns orange, and the full text of the option is followed by a check mark.

  • Transcription panels. Some pickers had one or more transcription views in versions below 16. These enable you to construct some non-Latin text when working from a Latin transcription. In version 16 these alternate views are converted to panels that can be displayed at the same time as other information. They can be shown or hidden from the vertical menu. When there is ambiguity as to which characters to use, a pop up displays alternatives. Click on one to insert it into the output. There is also a panel containing non-ASCII Latin characters, which can be used when typing Latin transcriptions directly into the main output area. This panel is now hidden by default, but can be easily shown from the vertical menu.

  • Automated transcription. Version 16 pickers carry forward, and in some cases add, automated transcription converters. In some cases these are intended to generate only an approximation to the needed transcription, in order to speed up the transcription process. In other cases, they are complete. (See the notes for the picker to tell which is which.) Where there is ambiguity about how to transcribe a sequence of characters, the interface offers you a choice from alternatives. Just click on the character you want and it will replace all the options proposed. In some cases, particularly South-East Asian scripts, the text you want to transcribe has to be split into syllables first, using spaces and/or hyphens. Where this is necessary, a condense button is provided, to quickly strip out the separators after the transcription is done.

  • Layout. The default layout of the main selection table has usually been improved, to make it easier to locate characters. Rarely used, deprecated, etc, characters appear below the main table, rather than to the right.

  • Hints. Very early versions of the pickers used to automatically highlight similar and easily confusable characters when you hovered over a character in the main selection table. This feature is being reintroduced as standard for version 16 pickers. It can be turned on or off from the vertical menu. This is very helpful for people who don’t know the script well.

  • Shape-based selection. In previous versions the shape-based view replaced the default view. In version 16 the shape selectors appear below the main selection table and highlight the characters in that table. This arrangement has several advantages.

  • Applying actions to ranges of text. When clicking on the Codepoints and Escapes buttons, it is possible to apply the action to a highlighted range of characters, rather than all the characters in the output area. It is also possible to transcribe only highlighted text, when using one of the automated transcription features.

  • Phoneme bank. When composing text from a Latin transcription in previous versions you had to make choices about phonetics. Those choices were stored on the UI to speed up generation of phonetic transcriptions in addition to the native text, but this feature somewhat complicated the development and use of the transcription feature. It has been dropped in version 16. Hopefully, the transcription panels and automated transcription features will be useful enough in future.

  • Font grid. The font grid view was removed in version 16. It is of little value when the characters are already displayed using fonts.

This update to the Language Subtag Lookup tool brings back the Check function that had been out of action since last January. The code had to be pretty much completely rewritten to migrate it from the original PHP. In the process, I added support for extension and private use tags, and added several more checks. I also made various changes to the way the results are displayed.

Give it a try with this rather complicated, but valid language tag: zh-cmn-latn-CN-pinyin-fonipa-u-co-phonebk-x-mytag-yourtag

Or try this rather badly conceived language tag, to see some error messages: mena-fr-latn-fonipa-biske-x-mylongtag-x-shorter
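
To show what is going on structurally in tags like these, here is a small Python sketch that splits a tag into its parts by position and subtag length, following the general shape of the BCP 47 syntax. It is not the tool's code: it does no registry lookup, ignores grandfathered tags, and all the names are hypothetical. Subtags are lowercased for comparison, so regions come out in lower case.

```python
def _alpha(s):
    return s.isascii() and s.isalpha()


def _alnum(s):
    return s.isascii() and s.isalnum()


def _is_variant(s):
    # Variants are 5-8 alphanumerics, or 4 characters starting with a digit.
    if not _alnum(s):
        return False
    return 5 <= len(s) <= 8 or (len(s) == 4 and s[0].isdigit())


def split_language_tag(tag):
    """Split a language tag into structural parts (no registry checks)."""
    subtags = tag.lower().split("-")
    n = len(subtags)
    parts = {"language": None, "extlang": [], "script": None, "region": None,
             "variants": [], "extensions": {}, "privateuse": [], "unparsed": []}
    i = 0
    if i < n and _alpha(subtags[i]) and 2 <= len(subtags[i]) <= 8:
        parts["language"] = subtags[i]
        i += 1
        # Up to three 3-letter extlangs, only after a 2-3 letter language.
        if len(parts["language"]) <= 3:
            while (i < n and len(parts["extlang"]) < 3
                   and len(subtags[i]) == 3 and _alpha(subtags[i])):
                parts["extlang"].append(subtags[i])
                i += 1
        if i < n and len(subtags[i]) == 4 and _alpha(subtags[i]):
            parts["script"] = subtags[i]
            i += 1
        if i < n and ((len(subtags[i]) == 2 and _alpha(subtags[i]))
                      or (len(subtags[i]) == 3 and subtags[i].isascii()
                          and subtags[i].isdigit())):
            parts["region"] = subtags[i]
            i += 1
        while i < n and _is_variant(subtags[i]):
            parts["variants"].append(subtags[i])
            i += 1
        # Extensions: a singleton other than "x", then 2-8 character subtags.
        while (i < n and len(subtags[i]) == 1 and _alnum(subtags[i])
               and subtags[i] != "x"):
            singleton = subtags[i]
            i += 1
            values = []
            while i < n and 2 <= len(subtags[i]) <= 8 and _alnum(subtags[i]):
                values.append(subtags[i])
                i += 1
            parts["extensions"][singleton] = values
    if i < n and subtags[i] == "x":
        # Everything after "x" is private use.
        parts["privateuse"] = subtags[i + 1:]
        i = n
    # Anything left over did not fit the expected order.
    parts["unparsed"] = subtags[i:]
    return parts


print(split_language_tag("zh-cmn-latn-CN-pinyin-fonipa-u-co-phonebk-x-mytag-yourtag"))
```

For the second tag above, this sketch stops early and leaves a non-empty unparsed list, which is a crude stand-in for the much more detailed error messages the real tool gives.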

The IANA database information is up-to-date. The tool currently supports the IANA Subtag registry of 2014-12-17. It reports subtags for 8,081 languages, 228 extlangs, 174 scripts, 301 regions, 68 variants, and 26 grandfathered subtags.

I have uploaded another new version of the Thai character picker.

Sorry this follows so quickly on the heels of version 15, but as soon as I uploaded v15 several ideas on how to improve it popped into my head. This is the result. I will hopefully bring all the pickers, one by one, up to the new version 16 format. If you prefer, you can still access version 12.

The main changes include:

  • UI. Adjustment of the vertical menu, so that input features can be turned on and off independently, and new panels appear with the others, rather than toggling from one to another. So, for example, you can have hints and shape-based selectors turned on at the same time. When something is switched on, its label in the menu turns orange, and the full text of the option is followed by a check mark.
  • Transcription panels. Panels have been added to enable you to construct some Thai text when working from a Latin transcription. This brings the transcription inputs of version 12 into version 16, but in a more compact and simpler way, and in a way that gives you continued access to the standard table for special characters.

    There are currently options to transcribe from ISO 11940-2 (although there are some gaps in that), or from the transcription used by Benjawan Poomsan Becker in her book, Thai for Beginners. These are both transcriptions based on phonetic renderings of the Thai, so there is often ambiguity about how to transcribe a particular Latin letter into Thai. When such an ambiguity occurs, the interface offers you a choice via a small pop-up. Just click on the character you want and it will be inserted into the main output area.

    The transcription panels are useful because you can add a whole vowel at a time, rather than picking the individual vowel signs that compose it. An issue arises, however, when the vowel signs that make up a given vowel contain one that appears to the left of the syllable initial consonant(s). This is easily solved by highlighting the syllable in question and clicking on the reorder button. The vowel sign in question will then appear as the first item in the highlighted text.

    There is also a panel containing non-ASCII Latin characters, which can be used when typing Latin transcriptions directly into the main output area. (This was available in v15 too, but has been made into a panel like the others, which can be hidden when not needed.)

  • Tones for automatic IPA transcriptions. The automatic transcription to IPA now adds tone marks. These are usually correct, but, as with other aspects of the transcription, it doesn’t take into account the odd idiosyncrasy in Thai spelling, so you should always check that the output is correct. (Note that there is still an issue for some of the ambiguous transcription cases, mostly involving RA.)
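
The reorder step mentioned in the list above amounts to moving a pre-posed vowel sign to the front of the highlighted syllable. A minimal Python sketch, assuming the five Thai vowel signs U+0E40 to U+0E44 are the ones concerned, might look like this. The names are mine, and the picker's actual logic may differ.

```python
# Thai vowel signs written to the left of the initial consonant(s):
# SARA E, SARA AE, SARA O, SARA AI MAIMUAN, SARA AI MAIMALAI.
PREPOSED_VOWELS = "\u0E40\u0E41\u0E42\u0E43\u0E44"


def reorder(syllable):
    """Move the first pre-posed vowel sign to the start of the syllable."""
    for i, ch in enumerate(syllable):
        if ch in PREPOSED_VOWELS:
            return ch + syllable[:i] + syllable[i + 1:]
    return syllable  # nothing to move


# KO KAI + SARA E + SARA AA becomes SARA E + KO KAI + SARA AA.
print(reorder("\u0E01\u0E40\u0E32"))
```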

For more information about the picker, see the notes at the bottom of the picker page.

I have uploaded a new version of the Thai character picker.

The new version uses characters instead of images for the selection table, making it faster to load and more flexible, and dispenses with the transcription view. If you prefer, you can still access the previous version.

Other changes include:

  • Significant rearrangement of the default selection table. The new arrangement makes it easy to choose the right characters if you have a Latin transcription to hand, which allows the removal of the previous transcription view, at the same time as speeding up that type of picking.
  • Addition of Latin prompts to help locate letters (standard with v15).
  • Automatic transcription from Thai into ISO 11940-1, ISO 11940-2 and IPA. Note that for the last two there are some corner cases where the results are not quite correct, due to the ambiguity of the script, and note also that you need to show syllable boundaries with spaces before transcribing. (There’s a way to remove those spaces quickly afterwards.) See below for more information.
  • Hints! When switched on and you mouse over a character, other similar characters or characters incorporating the shape you moused over, are highlighted. Particularly useful for people who don’t know the script well, and may miss small differences, but also useful sometimes for finding a character if you first see something similar.
  • It also comes with the new v15 features that are standard, such as shape-based picking without losing context, range-selectable codepoint information, a rehabilitated escapes button, the ability to change the font of the table and the line-height of the output, and the ability to turn off autofocus on mobile devices to stop the keyboard jumping up all the time, etc.

For more information about the picker, see the notes at the bottom of the picker page.

More about the transcriptions: There are three buttons that allow you to convert from Thai text to Latin transcriptions. If you highlight part of the text, only that part will be transcribed.

The toISO-1 button produces an ISO 11940-1 transliteration that latinises the Thai characters without changing their order. The result doesn’t normally tell you how to pronounce the Thai text, but it can be converted back to Thai, as each Thai character is represented by a unique sequence in Latin. This transcription should produce fully conformant output. There is no need to identify syllable boundaries first.

The toISO-2 and toIPA buttons produce an output that is intended to approximately reflect actual pronunciation. It will work fine most of the time, but there are occasional ambiguities and idiosyncrasies in Thai which will cause the converter to render certain, less common syllables incorrectly. It also doesn’t automatically add accent marks to the phonetic version (though that may be added later). So the output of these buttons should be treated as something that gets you 90% of the way. NOTE: Before using these two buttons you need to add spaces or hyphens between each syllable of the Thai text. Syllable boundaries are important for correct interpretation of the text, and they are not detected automatically.

The condense button removes the spaces from the highlighted range (or the whole output area, if nothing is highlighted).
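
In outline, condensing is just a filter over either the highlighted range or the whole output. The Python sketch below is an illustration with made-up names, and it assumes that both spaces and hyphens count as separators.

```python
def condense(text, start=None, end=None, separators=" -"):
    """Strip separator characters from text[start:end], or from all of text.

    If no range is given (or the range is empty), the whole string is
    condensed. The default separators are an assumption of this sketch.
    """
    if start is None or end is None or start == end:
        start, end = 0, len(text)
    middle = "".join(ch for ch in text[start:end] if ch not in separators)
    return text[:start] + middle + text[end:]


print(condense("a b-c d"))        # abcd
print(condense("a b c d", 0, 3))  # ab c d
```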

Note: For the toISO-2 transcription I use a macron over long vowels. This is non-standard.

I have uploaded a new version of the Tibetan character picker.

The new version dispenses with the images for the selection table. If you don’t have a suitable font to display the new version of the picker, you can still access the previous version, which uses images.

Other changes include:

  • Significant rearrangement of the default table, with many less common symbols moved into a location that you need to click on to reveal. This declutters the selection table.
  • Addition of Latin prompts to help locate letters (standard with v15).
  • Hints (When switched on and you mouse over a character, other similar characters or characters incorporating the shape you moused over, are highlighted. Particularly useful for people who don’t know the script well, and may miss small differences, but also useful sometimes for finding a character if you first see something similar.)
  • A new Wylie button that converts Tibetan text into an extended Wylie Latin transcription. There are still some uncommon characters that don’t work, but it should cover most normal needs. I used diacritics over lowercase letters rather than uppercase letters, except for the fixed form characters. I also didn’t provide conversions for many of the symbols – they will appear without change in the transcription. See the notes on the page for more information.
  • The Codepoints button, which produces a list of characters in the output box, now has a new feature. If you have highlighted some text in the output box, you will only see a list of the highlighted characters. If there are no highlights, the contents of the whole output box are listed.
  • Don’t forget, if you are using the picker on an iPad or mobile device, to set Autofocus to Off before tapping on characters. This stops the device keypad popping up every time you select a character. (This is also standard for v15.)
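
The highlighted-range behaviour of the Codepoints button in the list above can be sketched in Python as follows. This is not the picker's code; the function name and the selection offsets are hypothetical stand-ins for whatever the browser reports as the current highlight.

```python
import unicodedata


def list_codepoints(output, start=None, end=None):
    """List code points and names for the highlighted range, or for everything."""
    if start is None or end is None or start == end:
        selection = output  # no highlight: use the whole output box
    else:
        selection = output[start:end]
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<no name>')}"
        for ch in selection
    ]


for line in list_codepoints("\u0F40\u0F0B"):
    print(line)
```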

There is some confusion about which shapes should be produced by fonts for Mongolian characters. Most letters have at least one isolated, initial, medial and final shape, but other shapes are produced by contextual factors, such as vowel harmony.

Unicode has a list of standardised variant shapes, dating from 27 November 2013, but that list is not complete and contains what are currently viewed by some as errors. It also doesn’t specify the expected default shapes for initial, medial and final positions.

The original list of standardised variants was based on 蒙古文编码 by Professor Quejingzhabu in 2000.

A new proposal was published on 20 January 2014, which attempts to resolve the current issues, although I think that it introduces one or two issues of its own.

The other factor in this is what the actual fonts do. Sometimes they follow the Unicode standardised variants list, other times they diverge from it. Occasionally a majority of implementations appear to diverge in the same way, suggesting that the standardised list should be adapted to reality.

To help unravel this, I put together a page called Notes on Mongolian variant forms that visually shows the changes between the various proposals, and compares the results produced by various fonts.

This is still an early draft. The information only covers the basic Mongolian range – Todo, Sibe, etc. are still to come. Also, I would like to add information about other fonts, if I can obtain them.

Update, 16 Apr 2015: The Todo, Sibe, Manchu, Sanskrit and Tibetan characters are now all done, and font information has been added for them. (And the document was moved to github.)

If you use my Unicode character pickers, you may have noticed some changes recently. I’ve moved several pickers on to version 14. Most of the noticeable changes are in the location and styling of elements on the UI – the features remain pretty much unchanged.

Pages have acquired a header at the top (which is typically hidden), that provides links to related pages, and integrates the style into that of the rest of the site. What you don’t see is a large effort to tidy the code base and style sheets.

So far, I have changed the following: Arabic block, Armenian, Balinese, Bengali, Khmer, IPA, Lao, Mongolian, Myanmar, and Tibetan.

I will convert more as and when I get time.

However, in parallel, I have already made a start on version 15, which is a significant rewrite. Gone are the graphics, to be replaced by characters and webfonts. This makes a huge improvement to the loading time of the page. I’m also hoping to introduce more automated transcription methods, and simpler shape matching approaches.

Some of the pickers I already upgraded to version 14 have mechanisms for transcription and shape-based identification that took a huge effort to create, and will take a substantial effort to upgrade to version 15. So they may stay as they are for a while. However, pickers that are easier to handle, and any new pickers, will move to the new format.

Actually, I already made a start with Gurmukhi v15, which yanks that picker out of the stone-age and into the future. There’s also a new picker for the Uighur language that uses v15 technology. I’ll write separate blogs about those.

 

[By the way, if you are viewing the pickers on a mobile device such as an iPad, don’t forget to turn Autofocus off (click on ‘more controls’ to find the switch). This will stop the onscreen keyboard popping up, annoyingly, each time you try to tap on a character.]

See the Tibetan Script Notes

Last March I pulled together some notes about the Tibetan script overall, and detailed notes about Unicode characters used in Tibetan.

I am writing these pages as I explore the Tibetan script as used for the Tibetan language. They may be updated from time to time and should not be considered authoritative. Basically I am mostly simplifying, combining, streamlining and arranging the text from the sources listed at the bottom of the page.

The first half of the script notes page describes how Unicode characters are used to write Tibetan. The second half looks at text layout in Tibetan (e.g. line-breaking, justification, emphasis, punctuation, etc.).

The character notes page lists all the characters in the Unicode Tibetan block, and provides specific usage notes for many of them per their use for writing the Tibetan language.

See the Tibetan Character Notes

Tibetan is an abugida, i.e. consonants carry an inherent vowel sound that is overridden using vowel signs. Text runs from left to right.

There are various Tibetan scripts, of two basic types: དབུ་ཅན་ dbu-can, pronounced /uchen/ (with a head), and དབུ་མེད་ dbu-med, pronounced /ume/ (headless). This page concentrates on the former. Pronunciations are based on the central, Lhasa dialect.

The pronunciation of Tibetan words is typically much simpler than the orthography, which involves patterns of consonants. These reduce ambiguity and can affect pronunciation and tone. In the notes I try to explain how that works, in an approachable way (though it’s still a little complicated, at first).

Traditional Tibetan text was written on pechas (དཔེ་ཆ་ dpe-cha), loose-leaf sheets. Some of the characters used and formatting approaches are different in books and pechas.

For similar notes on other scripts, see my docs list.


The W3C needs to make sure that the typographic needs of scripts and languages around the world are built in to technologies such as HTML, CSS, SVG, etc. so that Web pages and eBooks can look and behave as expected for people around the world.

To that end we have experts in various parts of the world documenting typographic requirements and gaps between what is needed and what is currently supported in browsers and ebook readers.

The flagship document is Requirements for Japanese Text Layout. The information in this document has been widely used, and the process used for creating it was extremely effective. It was developed in Japan, by a task force using mailing lists and holding meetings in Japanese, then converted to English for review. It was published in both languages.

We now have groups working on Indic Layout Requirements and Requirements for Hangul Text Layout and Typography, and this month I was in Beijing to discuss ongoing work on Chinese layout requirements (URL coming soon), and we heard from experts in Mongolian, Tibetan, and Uyghur who are keen to also participate in the Chinese task force and produce similar documents for their part of the world.

The Internationalization (i18n) Working Group at the W3C has also been working on other aspects of the multilingual user experience. Examples include improvements to bidirectional text support (Arabic, Hebrew, Thaana, etc.) in HTML and CSS, and support for the work on counter styles in CSS.

To support local relevance of Web pages and eBook formats we need local experts to participate in gathering information in these task forces, to review the task force outputs, and to lobby or support via coding the implementation of features in browsers and ereaders. If you are one of these people, or know some, please get in touch!

We particularly need more information about how to handle typographic features of the Arabic script.

In the hope that it will help, I have put together some information on current areas of activity at the W3C, with pointers to useful existing requirements, specifications and tests. It is not exhaustive, and I expect it to be added to and improved over time.

Look through the list and check whether your needs are being adequately covered. If not, write to www-international@w3.org (you need to subscribe first) and make the case. If the spec does cover your needs, but the browsers don’t support your needs, raise bugs against the browsers.

It’s disappointing to see that non-standard uses of Unicode codepoints are being served by the BBC on their BBC Burmese Facebook page.

Take, for example, the following text.

On the actual BBC site it looks like this (click on the Burmese text to see a list of the characters used):

အိန္ဒိယ မိန်းမငယ် ၂ဦး အမှု ဆေးစစ်ချက် ကွဲလွဲနေ

As far as I can tell, this is conformant use of Unicode codepoints.

Look at the same title on the BBC’s Facebook page, however, and you see:

အိႏၵိယ မိန္းမငယ္ ၂ဦး အမႈ ေဆးစစ္ခ်က္ ကြဲလြဲေန

Depending upon where you are reading this (as long as you have some Burmese font and rendering support), one of the two lines of Burmese text above will contain lots of garbage. For me, it’s the second (non-standard).

This non-standard approach uses visual encoding for combining characters that appear before or on both sides of the base, uses Shan or Rumai Palaung codepoints for subjoining consonants, uses the wrong codepoints for medial consonants, and uses the virama instead of the asat at the end of a word.
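One of those differences, the virama standing where the asat is expected at the end of a word, is easy to check mechanically. The following Python sketch is a rough heuristic only, not a general detector for the non-standard encoding. The function name `words_ending_in_virama` is hypothetical, the code assumes words are separated by spaces, and the two sample strings are the second word of the title above, written out as escapes as I read the codepoints.

```python
import unicodedata

VIRAMA = "\u1039"  # MYANMAR SIGN VIRAMA: invisible, stacks the next consonant
ASAT = "\u103A"    # MYANMAR SIGN ASAT: visible vowel killer


def words_ending_in_virama(text):
    """Return the space-separated words whose last character is U+1039.

    In conformant text the virama is followed by the consonant it
    subjoins, so a word-final virama is a hint (no more than that)
    that the non-standard encoding is in use.
    """
    return [word for word in text.split() if word.endswith(VIRAMA)]


# Second word of the title, as encoded on the BBC site (assumed reading).
standard = "\u1019\u102D\u1014\u103A\u1038\u1019\u1004\u101A\u103A"
# The same word as encoded on the Facebook page (assumed reading).
nonstandard = "\u1019\u102D\u1014\u1039\u1038\u1019\u1004\u101A\u1039"

for label, sample in (("standard", standard), ("non-standard", nonstandard)):
    last = sample[-1]
    flagged = len(words_ending_in_virama(sample))
    print(f"{label}: ends with U+{ord(last):04X} "
          f"{unicodedata.name(last)}, flagged words: {flagged}")
```

A real check would need to look at the other symptoms listed above as well, since a single word-final virama could also be a simple typing error.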

I assume that this is because of prevalent use of the non-standard approach on mobile devices (and that the BBC is just following that trend), caused by hacks that arose when people were impatient to get on the Web but script support was lagging in applications.

However, continuing this divergence does nobody any long-term good.

[ Find fonts and other resources for the Myanmar script ]