>> See what it can do !

>> Use it !

Picture of the page in action.

The major changes in this version relate to the way searching and property-based lookup are done on characters in the lower left panel, and to features for refining and capturing the resulting lists.

Removed the two Highlight selection boxes. These used to highlight characters in the lower left panel with a specific property value. The Show selection box on the left (formerly Show list) now does that job if you set the Local checkbox alongside it. (Local is the default for this feature.)

As part of that move, the SiR (search in range) checkbox, which used to sit alongside Custom range, has been moved below the Search for input field and renamed Local. If Local is checked, searching can now be done on any content in the lower left panel, and the results are shown as highlighting rather than as a new list.

To complement these new highlighting capabilities, a new feature was added. If you click on the icon next to Make list from highlights, the content of the lower left panel is replaced by a list of just those items that are currently highlighted – whether the highlighting results from a search or a property listing. This is also useful for refining searches: perform an initial search, convert the result to a list, then search within that list, and so on.

Finally got around to putting icons after the pull-down lists. This means that if you want to reapply, say, a block selection after doing something else, only one click is needed (rather than having to choose another option and then choose the original option again). The effect of this on the ease of use of UniView is much greater than I expected.

Added an icon to the text area. If you click on this, all the characters in the lower left panel are copied into the text area. This is very useful for capturing the result of a search, or even a whole block. Note that if a list in the lower left panel contains unassigned code points, these are not copied to the text area.

As a result of the above changes, the way Show as graphics and Show range as list work internally was essentially rewritten, but users shouldn’t see the difference.

Changed the label Character area to Text area.

>> See what it can do !

>> Use it !

Picture of the page in action.

The main change in this version is the reworking of the former Cut & paste and Code point(s) fields to make it easier to use UniView as a generalised picker.

Moved the cut & paste field downwards, made it larger, and changed the label to Character area. This should make it easier to copy, cut and paste text, and more obvious that UniView supports this. It is much clearer now that UniView provides character map/picker functionality, and not just character lookup.

Whereas previously you had to double-click to copy a character in the lower left panel into the Cut & paste field, UniView now echoes characters to the Character area every time you (single-)click on a character in that panel. This can be turned off. Double-clicking still adds the codepoint of a character in the lower left panel to the Code points field.

The Character area has its own set of icons, some of which are new: ie. you can select the text, add a space, and change the font of the text in the area (as well as turn the echo on and off). I also spruced up the icons on the UI in general.

Note that on most browsers you can insert characters at the point in the Character area where you set the cursor, or overwrite a highlighted range of characters, whereas Internet Explorer (because of the non-standard way it handles selections and ranges) will always add characters to the end of the line.

The Code points field has also been enlarged, and I moved the Show list pull-down to the left and Show as graphics and Show page as list to the right. This puts all the main commands for creating lists together on the left.

When you mouse over a character in the lower left panel you now see both hex and decimal codepoint information. (Previously you just saw an unlabelled decimal number.) You will also find decimal code point values for characters displayed in the lower right panel.

Fixed a bug in the Code points input feature so that trailing spaces no longer produce errors, but also went much further than that. You can now add random text containing codepoints or most types of hex-based escaped characters to the input field, and UniView will seek them out to create the list. For example, if you paste the following into the Code points field:

the decomposition mapping is <U+CE20, U+11B8>, and not <U+110E, U+1173, U+11B8>.

the result will be:

CE20: 츠 [Hangul Syllables]
11B8: ᆸ HANGUL JONGSEONG PIEUP
110E: ᄎ HANGUL CHOSEONG CHIEUCH
1173: ᅳ HANGUL JUNGSEONG EU
11B8: ᆸ HANGUL JONGSEONG PIEUP

Of course, UniView is not able to tell that an ordinary word like ‘Abba’ is not a hex codepoint, so you obviously need to watch out for that and a few other situations, but much of the time this should make it much easier to extract codepoint information.
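For the curious, the extraction can be imagined along these lines (a sketch of the idea, not UniView’s actual code; the function name is mine):

// Fish codepoints out of free text. Accepts bare hex as well as escape
// forms such as U+1234, \u1234 and &#x1234;.
function extractCodepoints (text) {
	var matches = text.match(/(U\+|\\u|&#x)?[0-9A-Fa-f]{2,6};?/g) || [];
	var codepoints = [];
	for (var i=0; i<matches.length; i++) {
		var hex = matches[i].replace(/^(U\+|\\u|&#x)/, '').replace(/;$/, '');
		var n = parseInt(hex, 16);
		if (n <= 0x10FFFF) { codepoints.push(n); }
		}
	// Like UniView itself, this happily reads 'Abba' as the hex number ABBA.
	return codepoints;
	}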

I still haven’t found a way to fix the display bug in Safari and Google Chrome that causes initial content in the lower left pane to be only partially displayed.

Here are some lists of characters that are useful for normalization. I’ll probably add some others later.

The lists apply to Unicode version 5.1.

The files below contain declarations for JavaScript sparse arrays. They are easy enough to convert to other formats using global search and replace. The verbose versions also provide character names and code points.

Combining characters with non-zero combining class

Characters with a non-zero canonical combining class are assigned to a sparse array indexed by codepoint. The value for each index is the combining class.

https://r12a.github.io/code/normalization/nonzerocombiningchars.txt

https://r12a.github.io/code/normalization/nonzerocombiningchars-verbose.txt

There are 498 of these.
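To give an idea of the format, the declarations amount to something like the following (illustrative values only; the variable name here is my own, and may differ in the actual files):

// Sparse array: the index is the codepoint, the value is the combining class.
var combiningClass = new Array();
combiningClass[0x0300] = 230;	// COMBINING GRAVE ACCENT
combiningClass[0x0316] = 220;	// COMBINING GRAVE ACCENT BELOW
combiningClass[0x3099] = 8;	// COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK

// Lookup: anything not in the array has combining class zero.
function getCombiningClass (cp) {
	return combiningClass[cp] || 0;
	}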

Canonically decomposable characters for NFD

This list maps single characters to their decompositions. The single character is referenced by an index into the array, and the value for that index is the string of decomposed characters.

https://r12a.github.io/code/normalization/canonicaldecomposables.txt

https://r12a.github.io/code/normalization/canonicaldecomposables-verbose.txt

There are 2042 of these characters.
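Together, the two files hold enough data for a toy NFD routine: decompose recursively, then apply canonical ordering to runs of non-starters. A minimal sketch, assuming the array names used above, a decompositions array from the second file, and BMP-only input (Hangul syllables, which real implementations decompose algorithmically, are ignored here):

// Toy NFD. decompositions maps a codepoint to a string of decomposed
// characters; getCombiningClass is as defined above.
function toNFD (str) {
	// Step 1: recursively replace characters by their decompositions.
	var out = '';
	for (var i=0; i<str.length; i++) {
		var d = decompositions[str.charCodeAt(i)];
		out += d ? toNFD(d) : str.charAt(i);
		}
	// Step 2: canonical ordering. Sort adjacent non-starters by class.
	var chars = out.split('');
	for (var j=1; j<chars.length; j++) {
		var cc = getCombiningClass(chars[j].charCodeAt(0));
		for (var k=j; k>0 && cc!==0; k--) {
			var prev = getCombiningClass(chars[k-1].charCodeAt(0));
			if (prev <= cc) { break; }	// in order, or previous is a starter
			var tmp = chars[k]; chars[k] = chars[k-1]; chars[k-1] = tmp;
			}
		}
	return chars.join('');
	}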

The following code converts a hex codepoint to a sequence of bytes that represent the Unicode codepoint in UTF-8.

This is useful because PHP’s chr() function only works on single-byte values :((.

function cp2utf8 ($hexcp) { // eg. cp2utf8('20AC') returns the three bytes E2 82 AC
	$outputString = '';
	$n = hexdec($hexcp);
	if ($n <= 0x7F) { // 1 byte
		$outputString .= chr($n);
		}
	else if ($n <= 0x7FF) { // 2 bytes
		$outputString .= chr(0xC0 | (($n>>6) & 0x1F))
		.chr(0x80 | ($n & 0x3F));
		}
	else if ($n <= 0xFFFF) { // 3 bytes
		$outputString .= chr(0xE0 | (($n>>12) & 0x0F))
		.chr(0x80 | (($n>>6) & 0x3F))
		.chr(0x80 | ($n & 0x3F));
		}
	else if ($n <= 0x10FFFF) { // 4 bytes
		$outputString .= chr(0xF0 | (($n>>18) & 0x07))
		.chr(0x80 | (($n>>12) & 0x3F)).chr(0x80 | (($n>>6) & 0x3F))
		.chr(0x80 | ($n & 0x3F));
		}
	else {
		// note that string concatenation in PHP uses '.', not '+'
		$outputString .= 'Error: ' . $n . ' not recognised!';
		}
	return $outputString;
	}

>> Use it !

Picture of the page in action.

I have just upgraded the Malayalam picker to level 7, and added a bunch of new features that should show up in other pickers at level 7 as I get time:

Shape view The pickers are aimed particularly at people who are not familiar enough with a script to use the keyboard. However, there are many ligatures and conjuncts in Malayalam, which makes it difficult to identify the character sequences needed. This view provides most of the shapes you’ll see in Malayalam text, grouped by shape. It’s something I’ve been wanting to add to the pickers for some time.

Picture of the page in action.

Phonic view This has been done in other pickers, but it has some new features over those. The sounds have been arranged along similar lines to a standard IPA chart, and multiple transcriptions are supported. In addition, you can click on the transcription text to build up a phonemic string in IPA. This is particularly useful for creating examples.

Picture of the page in action.

Regular expressions in searches The search feature was upgraded to allow regular expressions. So now you can highlight characters whose names contain GA without highlighting ones containing NGA: just search for \bga\b (or use the convenient short-cut form .ga.). Of course you can do more complicated searches too.
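A quick illustration of why the word boundaries matter when searching character names:

// \b anchors the match at word boundaries, so GA matches but NGA doesn't.
var re = /\bga\b/i;
re.test('MALAYALAM LETTER GA');	// true
re.test('MALAYALAM LETTER NGA');	// false: no boundary between N and G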

Add codepoint You can add a hex codepoint value to the box in the yellow area to insert that character into the text. This is useful for the odd unusual character, or for figuring out what a sequence of codepoints represents. You can input any number of codepoints (including surrogates) into the input box, separating them by spaces.
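Converting such a list to text is a small job; a sketch (mine, not the picker’s code) also shows why surrogates entered as two values come out as a single supplementary-plane character:

// Turn space-separated hex codepoint values into a string. Surrogate
// values pass straight through fromCharCode, so a pair such as
// 'D800 DF02' yields one supplementary-plane character.
function codepointsToText (input) {
	var hexes = input.split(/\s+/);
	var out = '';
	for (var i=0; i<hexes.length; i++) {
		if (hexes[i]) { out += String.fromCharCode(parseInt(hexes[i], 16)); }
		}
	return out;
	}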

Chillus This version of the picker supports all Unicode 5.1 characters, including the chillu characters. Because most Malayalam fonts support the old way of inputting chillu forms, you can specify in the yellow box area what you want the output to be when clicking on a chillu letter: the pre-5.1 sequence or the new atomic character. (The default is the atomic character.)
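For reference, the correspondence between the two encodings is roughly as follows (a sketch based on the Unicode 5.1 chillu mappings; the variable and function names are mine):

// Atomic Unicode 5.1 chillu -> pre-5.1 sequence (base + virama + ZWJ).
var chilluToSequence = {
	'\u0D7A': '\u0D23\u0D4D\u200D',	// CHILLU NN
	'\u0D7B': '\u0D28\u0D4D\u200D',	// CHILLU N
	'\u0D7C': '\u0D30\u0D4D\u200D',	// CHILLU RR
	'\u0D7D': '\u0D32\u0D4D\u200D',	// CHILLU L
	'\u0D7E': '\u0D33\u0D4D\u200D',	// CHILLU LL
	'\u0D7F': '\u0D15\u0D4D\u200D'	// CHILLU K
	};

// Output either form, depending on the user's preference.
function chilluOutput (atomicChar, preferAtomic) {
	return preferAtomic ? atomicChar : (chilluToSequence[atomicChar] || atomicChar);
	}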

The picker also comes with the usual set of level 7 features, such as font grid view, graphic characters, hiding of uncommon characters, optimised ordering of characters in the alphabetic view, two-tone highlighting, etc.

You can start up directly in either of the available views by appending the following to your URI: ?view=, followed by one of, respectively, alphabet, shape, phonic or fontgrid.

Enjoy.

>> See what it can do !

>> Use it !

Picture of the page in action.

A large amount of code was rewritten to enable data to be downloaded from the server via AJAX at the point of need. This eliminates the long wait when you start to use UniView without the database information in your cache. This means that there is a slightly longer delay when you view a new block, but the code is designed so that if you have already downloaded data, you don’t have to retrieve it again from the server.
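The idea can be sketched like this (the names and data format are invented for the example; UniView’s real code differs):

// Per-block cache: data is fetched from the server only the first time
// a block is requested; after that it is served from memory.
var blockCache = {};

function getBlockData (blockName, callback) {
	if (blockCache[blockName]) {
		callback(blockCache[blockName]);	// already downloaded
		return;
		}
	var xhr = new XMLHttpRequest();	// older IE needs ActiveXObject instead
	xhr.open('GET', 'data/' + blockName + '.txt', true);
	xhr.onreadystatechange = function () {
		if (xhr.readyState === 4 && xhr.status === 200) {
			blockCache[blockName] = xhr.responseText;
			callback(xhr.responseText);
			}
		};
	xhr.send(null);
	}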

The search mechanism was also rewritten. The regular expressions used must now be supported in both JavaScript and PHP (PHP is used when not searching within the current range). When ‘other’ is ticked, the search will look in the alternative name fields, but not in other property settings, so you can no longer use something like ;AL; to search for characters with a particular property. (Use ‘Show list’ instead.)

Removed several zero-width space characters from the code, which means that UniView now works with Google Chrome, except for some annoying display bugs that I’m not sure how to fix – for example, the first time you try to display any block you only seem to get the top line (although, if you click or drag the mouse, the block is actually there). This seems to be WebKit related, since it happens in Safari, too.

Please report any bugs to me, and don’t forget to refresh any UniView files in your cache before using the new version.

>> Read it !

Picture of the page in action.

I finally got to the point, after many long early morning hours, where I felt I could remove the ‘Draft’ from the heading of my Myanmar (Burmese) script notes.

This page is the result of my explorations into how the Myanmar script is used for the Burmese language in the context of the Unicode Myanmar block. It takes into account the significant changes introduced in Unicode version 5.1 in April of this year.

Btw, if you have JavaScript running you can get a list of characters in the examples by mousing over them. If you don’t have JS, you can link to the same information.

There’s also a PDF version, if you don’t want to install the (free) fonts pointed to for the examples.

Here is a summary of the script:

Burmese is a tonal, syllable-based language. The script is an abugida, ie. consonants carry an inherent vowel sound that is overridden using vowel signs.

Spaces are used to separate phrases, rather than words. Words can be separated with ZWSP to allow for easy wrapping of text.

Words are composed of syllables. These start with a consonant or initial vowel. An initial consonant may be followed by a medial consonant, which adds the sound j or w. After the vowel, a syllable may end with a nasalisation of the vowel or an unreleased glottal stop, though these final sounds can be represented by various different consonant symbols.

At the end of a syllable a final consonant usually has an ‘asat’ sign above it, to show that there is no inherent vowel.

In multisyllabic words derived from an Indian language such as Pali, where two consonants occur internally with no intervening vowel, the consonants tend to be stacked vertically, and the asat sign is not used.

Text runs from left to right.

There is a set of Myanmar numerals, which are used just like Latin digits.

So, what next? I’m quite keen to get to Mongolian. That looks really complicated. But I’ve been telling myself for a while that I ought to look at Malayalam or Tamil, so I think I’ll try Malayalam.

>> Use it !

Picture of the page in action.

I have just upgraded the Burmese picker as follows:

Rearranged characters The Myanmar3 font expects multiple combining characters to be entered in the order described in the Unicode 5.1 Standard for correct display. The panel of combining characters has been arranged so that you can easily remember that order: characters to the left precede those to the right, and characters higher up precede those lower down.

In addition to that, I have rearranged all the character positions so that it is easier to locate the various parts of a syllable as you type.

I also added some combinations of characters, so that multi-part vowels and the kinzi can be entered with a single click.

I have also moved some of the less common characters to an ‘advanced’ area to the right which can be opened and closed by clicking on the arrow-head icon.

New highlighting As you mouse over a character the picker will show you other characters that are visually similar (particularly useful for those not very familiar with the script). This new version shows the more likely confusable characters with a blue outline, and other similar characters with orange. This is useful given that many Myanmar characters look quite similar.

As always, you can turn off this feature or disable it in the URI you use to call the picker.

Font grid view Shows characters in Unicode order, using whatever font is specified in the Font list or Custom font input fields. This allows comparison of fonts (especially useful in IE, which shows if a glyph is missing from a font).

You can start up directly in either of the available views by appending the following to your URI: ?view=, followed by one of, respectively, alphabet or fontgrid.

Enjoy.

>> See what it can do !

>> Use it !

Picture of the page in action.

Those of you who have used UniView over the last couple of days will have seen that it now supports Unicode 5.1. All Unicode 5.1 character information is available; however, you will only be able to see the new characters if you have fonts that cover them. The decodeunicode graphics for the new characters are not yet available.

Last night I also fixed a long-running bug that had meant that additional information available in my character database was not accessible in Internet Explorer (due to AJAX issues). (See the related post if you are interested in the code).

There are no other changes at this time (though those two are pretty significant).

Please report any bugs to me, and don’t forget to refresh any UniView files in your cache before using the new version.

Some code I put together to import some XML retrieved via AJAX into a document (stored here so I can find it again in the future).

IE won’t let you import a cloned nodeset into a document, so I wrote this for my UniView utility. The code starts with a node in the AJAX data and creates a copy of its elements, attributes and text in the current document.

function copyNodes (ajaxnode, copiednode) {
	for (var node=ajaxnode.firstChild; node != null; node = node.nextSibling) {
		if (node.nodeType == 3){ //text
			copiednode.appendChild(document.createTextNode(node.data));
			}
		if (node.nodeType == 1){ //element
			var subnode = document.createElement(node.nodeName);
			var attlist = node.attributes;
			if (attlist != null) {  
				for (var i=0; i<attlist.length; i++){
					subnode.setAttribute(attlist[i].name, attlist[i].value);
					}
				}
			copiednode.appendChild(subnode);
			copyNodes(node, subnode);
			}
		}
	}

It doesn’t expect processing instructions, comments etc. Just elements, attributes and text. (Though of course that can be added, if needed.)
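Typical use looks like this (the target id and the xhr variable are made up for the example):

// Copy the root of the AJAX response into an empty element in the page,
// assuming xhr is the XMLHttpRequest that fetched the XML.
var target = document.getElementById('results');	// hypothetical target
copyNodes(xhr.responseXML.documentElement, target);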

>> Use it !

Picture of the page in action.

This latest picker includes all characters in the Unicode Lao block, plus a few punctuation characters. There are several alternative views.

Alphabetic By default, characters are arranged by groups, and consonants and vowels are listed in alphabetic order. Digits are in keypad order. Similar characters are highlighted by default, but this can be switched off using the ‘Hint’ selector.

Tone marks and combining vowels are reordered automatically so that vowels come first in the output character sequence.
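The reordering amounts to something like the following sketch (not the picker’s actual code; the classifier functions are assumed):

// If the last character typed is a tone mark and the new character is a
// combining vowel, insert the vowel before the tone mark.
function appendChar (text, ch, isToneMark, isCombiningVowel) {
	var last = text.charAt(text.length - 1);
	if (isCombiningVowel(ch) && isToneMark(last)) {
		return text.slice(0, -1) + ch + last;
		}
	return text + ch;
	}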

Phonic Characters are grouped and ordered by sound. I set this up for myself, because I wanted to enter Lao text that was accompanied by a transcription. Initial consonants are followed by tones and consonants that come second in a cluster, then vowels. Alternatives with the same sound are separated by a red dot. Consonants that have different sounds when word final are also listed under those sounds. (Dropped aspiration is not considered significant.)

Dashes representing consonants indicate which vowels are non-final or occur before the consonant. Where a vowel has a part that comes before a consonant, a single click should arrange the parts properly. This behaviour speeds up typing. It may not be so intuitive to people familiar with Lao, however, since it makes Lao behave like Khmer and Indic scripts.

You should add any tone mark before the vowel and the picker will automatically reorder characters as needed. If you want to wrap text around a combination of two syllable-initial characters, type the characters then click on ‘flag as cluster’ before clicking on the tone mark or vowel.

Two old vowel spellings are only displayed if you click on the grey arrow, top right.

Font grid Shows characters in Unicode order, using whatever font is specified in the Font list or Custom font input fields. This allows comparison of fonts (especially useful in IE, which shows if a glyph is missing from a font).

You can start up directly in one of the above views by appending the following to your URI: ?view=, followed by one of, respectively, alphabet, phonic or fontgrid.

Enjoy.

>> See what it can do !

>> Use it !

Picture of the page in action.

While we await Unicode 5.1, here is another update to UniView that provides a bunch of additional useful features and fixes a few bugs.

Changes include:

  • Changed the custom range input to a single field that accepts various range formats. This makes it easier to cut and paste or drag and drop ranges into the input field. The two numbers must be in hexadecimal form and separated by a colon (the default), a hyphen, one or more spaces, or one or more periods. There must be only two numbers, each in one of the following formats: 1234, &#x1234;, &#1234;, \u1234, U+1234, with between 1 and 6 hex digits. (A sketch of such a parser appears after this list.)
  • Added the ability to select whether Search looks at character names, at other parts of a record in the Unicode database, or at the other character description information (in any combination), and added a message to say how many characters were matched.
  • Added the ability to search within the range specified in the field entitled Range.
  • Added the ability to list characters with a given General or Bidirectional property (within a specified range or not).
  • Added an AJAX link to my database of information about Unicode characters. If enabled using the DB checkbox, this automatically retrieves any available data for a character when information about that character is displayed in the lower right panel. You can also specify that UniView should open with this enabled by default, using database=on in the URI used to call UniView.
  • Because of the previous improvement, I removed the ability to link in a file of information about characters. (The information in the files was a copy of the information in the database.)
  • Moved the Code point(s) and Cut & paste fields lower, to make them easier to use.
  • Fixed a bug that was preventing the Search function from finding characters in the Basic Latin block.
  • Bugfix: a range like 0036:0067 will now always show full rows; a range whose start is higher than its end will show an alert.
  • Added a reference to decodeunicode when graphics are displayed in the left column.
  • Bugfix: the search parameter no longer breaks when graphics etc. are toggled.
  • You can now specify the windowHeight parameter at startup in the URI’s query string.
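Returning to the first item in the list, a parser for the Custom range field might look something like this (a sketch under my own naming, not UniView’s code):

// Parse a custom range such as '0600:06FF', 'U+0600-U+06FF' or
// '&#x600; .. &#x6FF;' into a two-element array of numbers.
function parseRange (input) {
	var tokens = input.replace(/^\s+|\s+$/g, '').split(/[:\-\s.]+/);
	if (tokens.length !== 2) { return null; }	// exactly two numbers
	var nums = [];
	for (var i=0; i<2; i++) {
		var t = tokens[i].replace(/^(U\+|\\u|&#x|&#)/i, '').replace(/;$/, '');
		nums[i] = parseInt(t, 16);	// all formats are read as hex
		if (isNaN(nums[i])) { return null; }
		}
	if (nums[0] > nums[1]) { return null; }	// start must not exceed end
	return nums;
	}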

Please report any bugs to me, and don’t forget to refresh any UniView files in your cache before using the new version.

>> Use it !

Picture of the page in action.

The default arrangement for this picker is still shape-based (though with some small improvements), but I have added a new view that is arranged by sound.

Update: After some initial feedback, I decided to change the phonic view of the picker so that vowels are entered by single click. This will probably disconcert people familiar with typing Thai. Revised description follows.

Another update (2008-03-03): I have added additional ways of viewing the characters, and re-architected the picker as a basis for extending this to other pickers in the future. I also changed the way of dealing with initial clusters in the phonic view. I changed the text below again to reflect what’s new:

Alphabetic view By default, characters are arranged by groups, and consonants and vowels are listed in alphabetic order. Digits are in keypad order. Obsolete and rare characters are only displayed if you click on the grey arrow, top right. Similar characters are highlighted by default, but this can be switched off using the ‘Hint’ selector.

Comparison view This was the original view for the Thai picker. Characters are grouped by shape or type to enable easy identification by people who are unfamiliar with the Thai script. Vowels are shown near the bottom. Digits are on the right, in keypad order.

Phonic view Characters are grouped and ordered by sound. I set this up for myself, because I wanted to enter Thai text that was accompanied by a transcription.

Initial consonants are followed by tones and consonants that come second in a cluster, then vowels. Alternatives with the same sound are separated by a red dot. Consonants that have different sounds when word final are also listed under those sounds. (Dropped aspiration is not considered significant.)

Dashes representing consonants indicate which vowels are non-final or occur before the consonant.

Where a vowel has a part that comes before a consonant, a single click should arrange the parts properly. This behaviour speeds up typing. It may not be so intuitive to people familiar with Thai, however, since it makes Thai behave like Khmer and Indic scripts. You should add any tone mark before the vowel and the picker will automatically reorder characters as needed.

If you want to wrap text around a combination of two syllable-initial characters, type the characters then click on ‘flag as cluster’ before clicking on the tone mark or vowel.

Font grid view Shows characters in Unicode order, using whatever font is specified in the Font list or Custom font input fields. This allows comparison of fonts (especially useful in IE, which shows if a glyph is missing from a font).

You can start up directly in any one of the above views by appending the following to your URI: ?view=, followed by one of, respectively, alphabet, comparison, phonic or fontgrid.

Enjoy.

>> Use it !

Picture of the page in action.

This latest picker includes characters used for writing Vietnamese. Characters are taken from various Latin Unicode blocks.

Tones are separated from base characters in the selection area, but the output you create is always fully precomposed. If you copy and paste text into the output area, you can normalize the Vietnamese text as NFC by selecting the tab below. The Vietnamese text in the output area is also normalized when you select one of the transcription tabs.
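For a feel of what the NFC normalisation does here: typing e plus a combining circumflex plus a combining acute yields the single precomposed character ế. In modern JavaScript (not available when this picker was written) the same operation can be reproduced directly:

// NFC composes base + combining marks into precomposed characters.
var decomposed = 'e\u0302\u0301';	// e + circumflex + acute (3 characters)
var precomposed = decomposed.normalize('NFC');	// '\u1EBF' (1 character)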

The IPA N and IPA S tabs provide a basic, mostly phonemic-level, transcription of the pronunciation. N means North Vietnamese, S South Vietnamese. The sources I used for this varied a great deal, particularly in the choice of symbols to represent vowels, and there are more than two main dialects, so this is a synthesis and a rough guide. Some rare vowel combinations may be missing, although I have covered quite a number.

There are a large number of UVN fonts – so many that I didn’t know which ones to pick for the font pulldown. I chose the two that show up on Alan Wood’s page. If you think certain others are so common that they ought to be there, please let me know.

Enjoy.

This post is about the dangers of tying a specification, protocol or application to a specific version of Unicode.

For example, I was in a discussion last week about XML, and the problems caused by the fact that XML 1.0 is currently tied to a specific version of Unicode, and a very old version at that (2.0). This affects what characters you can use for things such as element and attribute names, enumerated lists for attribute values, and ids. Note that I’m not talking about the content, just those names.

I spoke about this at a W3C Technical Plenary some time back in terms of how this bars people from using certain aspects of XML applications in their own language if they use scripts that have been added to Unicode since version 2.0. This includes over 150 million people speaking languages written with Ethiopic, Canadian Syllabics, Khmer, Sinhala, Mongolian, Yi, Philippine, New Tai Lue, Buginese, Cherokee, Syloti Nagri, N’Ko, Tifinagh and other scripts.

This means, for example, that if your language is written with one of these scripts, and you write some XHTML that you want to be valid (so you can use it with AJAX or XSLT, etc.), you can’t use the same language for an id attribute value as for the content of your page. (Try validating this page now. The previous link used some Ethiopic for the name and id attribute values.)

But there’s another issue that hasn’t received so much press – and yet I think, in its own way, it can be just as problematic. Scripts that were supported by Unicode 2.0 have not stood still, and additional characters are being added to such scripts with every new Unicode release. In some cases these characters will see very general use. Take, for example, the Bengali character U+09CE BENGALI LETTER KHANDA TA.

With the release of Unicode 4.1 this character was added to the standard, with a clear admonition that it should in future be used in text, rather than the workaround people had been using previously (the sequence ta + virama + ZWJ).

This is not a rarely used character. It is a common part of the alphabet. Put Bengali in a link and you’re generally ok. Include a khanda ta letter in it, though, and you’re in trouble. It’s as if English speakers could use any word in an id, as long as it didn’t have a ‘q’ in it. It’s a recipe for confusion and frustration.

Similar, but much more far reaching, changes will be introduced to the Myanmar script (used for Burmese) in the upcoming version 5.1. Unlike the khanda ta, these changes will affect almost every word. So if your application or protocol froze its Unicode support to a version between 3.0 and 5.0, like IDNA, you will suddenly be disenfranchising Burmese users who had been perfectly happy until now.

Here are a few more examples (provided by Ken Whistler) of characters added to Unicode after the initial script adoption that will raise eyebrows for people who speak the relevant language:

  • 01F6 LATIN SMALL LETTER N WITH GRAVE: shows up in NFC pinyin data for Chinese.
  • 0219 LATIN SMALL LETTER S WITH COMMA BELOW: Romanian data.
  • 0450 CYRILLIC SMALL LETTER IE WITH GRAVE: Macedonian in NFC.
  • 0653..0655 Arabic combining maddah and hamza: Implicated in NFC normalization of common Arabic letters now.
  • 0972 DEVANAGARI LETTER CANDRA A: Marathi.
  • 097B DEVANAGARI LETTER GGA: Sindhi.
  • 0B35 ORIYA LETTER VA: Oriya.
  • 0BB6 TAMIL LETTER SHA: Needed to spell sri.
  • 0D7A..0D7F Malayalam chillu letters: Those will be ubiquitous in Malayalam data, post Unicode 5.1.
  • and a bunch of Chinese additions.

So the moral is this: decouple your application, protocol or specification from a specific version of the Unicode Standard. Allow new characters to be used by people as they come along, and users all around the world will thank you.

This came up again recently in a discussion on the W3C i18n Interest Group list, and I decided to put my thoughts in this post so that I can point people to them easily.

I think HTML4 and HTML5 should continue to support <b> and <i> tags, for backwards compatibility, but we should urge caution regarding their use and strongly encourage people to use <em> and <strong> or elements with class="…" where appropriate. (I reworded this 2008-02-01)

Here are a couple of reasons I say that:

  1. I constantly see people misusing these tags in ways that can make localization of content difficult.

    For example, just because an English document may use italicisation for emphasis, document titles and foreign words, it doesn’t hold that a Japanese translation of the document will use a single presentational convention for all three. Japanese authors may avoid both italicization and bolding, since their characters are too complicated to look good in small sizes with these effects. Japanese translators may find that the content communicates better if they use wakiten (boten marks) for emphasis, but corner brackets for 『document names』, and guillemets for 《foreign words》. These are common Japanese typographic approaches that we don’t use in English.

    The problem is that, if the English author has used <i> tags everywhere (thinking about the presentational rendering he/she wants in English), the Japanese localizer will be unable to easily apply different styling to the different types of text.

    The problem could be avoided if semantic markup is used. If the English author had used <em>..</em> and <span class="doctitle">...</span> and <span class="foreignword">..</span> to distinguish the three cases, it would allow the localizer to easily change the CSS to achieve different effects for these items, one at a time.

    Of course, over time this is equally relevant to pages that are monolingual. Suppose your new corporate publishing guidelines change, and proclaim that bolding is better than italics for document names. With semantically marked up HTML, you can easily change a whole site with one tiny edit to the CSS. In the situation described above, however, you’d have to hunt through every page for relevant <i> tags and change them individually, so that you didn’t apply the same style change to emphasis and foreign words too.

  2. Allowing authors to use <b> and <i> tags is also problematic, in my mind, because it keeps authors thinking in presentational terms, rather than helping them move to properly semantic markup. At the very least, it blurs the ideas. To an author in a hurry, it is also tempting to just slap one of these tags on the text to make it look different, rather than to stop and think about things like consistency and future-proofing. (Yes, I’ve often done it too…)

I always forget how to get around the namespace issue when transforming XHTML files to XHTML using XSL, and it always takes ages for me to figure it out again. So I’m going to make a note here to remind me. This seems to work:

<?xml version="1.0" encoding="UTF-8"?>

<xsl:transform version="2.0"
xmlns="http://www.w3.org/1999/xhtml"
xmlns:html="http://www.w3.org/1999/xhtml"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:fn="http://www.w3.org/2005/02/xpath-functions"
xmlns:xdt="http://www.w3.org/2005/02/xpath-datatypes"
xmlns:saxon="http://icl.com/saxon"
exclude-result-prefixes="saxon fn xs xdt html">

<xsl:output method="xhtml" encoding="UTF-8"
doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN" indent="no" doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" />

Then you need to refer to elements in the source to be converted by using the html: namespace prefix, eg. <xsl:template match="html:div">…</xsl:template>.

I always have to look up the template that copies everything not fiddled with in the other templates, too, so here it is, for good measure:

<xsl:template match="@*|node()">
	<xsl:copy>
		<xsl:apply-templates select="@*|node()"/>
		</xsl:copy>
	</xsl:template>

>> Use it !

Picture of the page in action.

Although I have a picker already for Arabic, Persian and Urdu, I have developed another that is specifically for inputting Urdu. One reason for this is to reduce the choice of characters so that the user is more likely to select the right character for Urdu (eg. heh goal rather than arabic heh). Another is to provide shortcuts for things like aspirated letters and some common combinations (like the word ‘allah’).

It includes characters used for Urdu in Unicode 5.0. Most of the characters in the Urdu standard UZT 1.01 are included.

The aspirated letters of the alphabet can be entered with a single click. Also, base characters with diacritics can be inserted into the text with a single click where NFC normalisation would produce a single precomposed character.

Letters of the alphabet are shown in alphabetic order at the top left, digits are in keypad order, and combining characters related to vowel sounds are shown along the bottom. The lower middle section contains useful but non-alphabetic characters and punctuation. To the right are various symbols. Hinting is implemented for visually similar glyphs.

>> Use it !

Picture of the page in action.

Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don’t know a script well enough to use the native keyboard. The arrangement of characters also makes a picker much more usable than a regular character map utility.

The Bengali picker includes all the characters in the Unicode 5.0 Bengali block. Note: There was an important addition to the Bengali block in version 4.1, a single character for khanda ta, that may not yet be supported in fonts, but has been added to this version of the picker.

Consonants are mostly in a typical articulatory arrangement, vowels are aligned with vowel signs, and digits are in keypad order. Hinting is implemented for visually similar glyphs.

A function has also been added to transliterate Bengali text to Latin, though the scheme used is not standard, and may change at short notice. Don’t use it in anger yet.