Picture of the page in action.

About the tool: Look up and see characters (using graphics or fonts) and property information, view whole character blocks or custom ranges, select characters to paste into your document, paste in and discover unknown characters, search for characters, do hex/dec/ncr conversions, highlight character types, etc. etc. The tool supports Unicode 5.2 and is written with Web Standards to work on a variety of browsers. No need to install anything.

Latest changes: The major change in this update is the addition of a function, Show age, which shows the version of Unicode in which a character was added (after version 1.1). The same information is also listed in the details given for a character in the lower right panel.

The trigger for context-sensitive help was reduced to the first character of a command name, rather than the whole command name. This improves behaviour for commands under More actions by allowing you to click on the command name rather than just the icon alongside to activate the command.

Some 'quick start' instructions were also added to the initial display to orient people new to the tool, and this help text was updated in various areas.

The highlighting mechanism was changed. Rather than highlight characters using a coloured border (which is typically not very visible), highlighting now works by greying out characters that are not highlighted. This also makes it clearer when nothing is highlighted.

In the recent past, when you converted a matrix to a list in the lower left panel, greyed-out rows would be added for non-characters. These are no longer displayed. Consequently, the command to remove such rows from the list (previously under More actions) has been removed.

A lot of invisible work went into replacing style attributes in the code with class names. This produces better source code, but doesn't affect the user experience.

>> Use it


Characters in the Unicode Bengali block.

If you're interested, I just did a major overhaul of my script notes on Bengali in Unicode. There's a new section about which characters to use when there are multiple options (e.g. RRA vs. DDA+nukta), and the page provides information about more characters from the Bengali block in Unicode (including those used in Bengali's amazingly complicated currency notation prior to 1957).

In addition, this has all been squeezed into the latest look and feel for script notes pages.

The new page is at a new location. There is a redirect on the old page.

Hope it's useful.

>> Read it


Picture of the page in action.
 
Picture of the page in action.

About the tools: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don't know a script well enough to use the native keyboard. The arrangement of characters also makes them much more usable than a regular character map utility.

Latest changes: The Urdu and Tamil pickers have been upgraded to version 10. This provided new views of the data, but also involved a thorough overhaul and redesign of the pickers. Transliteration functions have also been added for the Tamil picker.

In addition, the Urdu notes page was updated and a new Tamil notes page was created. Database entries were also updated or, in the case of Tamil, created to support the notes pages. These notes pages are the first to use a new look and feel, based on the analyse-string tool I produced earlier this year. This adds information about each character from the Unicode descriptions data to that from my own database.

  1. paulmcj Says:

    Hi Guys,

    I like your pickers and what is appealing even more is that you've added transliteration, which is a very nice feature for any picker.

    Paul

Picture of the page in action.

About the tool: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don't know a script well enough to use the native keyboard. The arrangement of characters also makes them much more usable than a regular character map utility.

Latest changes: Over the Christmas break I've applied version 10 upgrades to the following pickers: Bengali, Hebrew, Khmer, Lao, Malayalam, Myanmar, Thai and Tifinagh. In the case of Hebrew and Tifinagh, this came down to completely rewriting the pickers.

Key changes in version 10 include the following:

  • The visible layout of the shape view has been reduced in the vertical direction by showing a group of characters only when you mouse over the orange keys at the top. This makes it easier and faster to locate characters, and also improves use on screens with restricted space. The way similar characters in other groups are handled has been reinvented to fit the new approach better, and to enable faster creation of pickers in the future.
  • The visible layout of the transcription view has been adapted in a similar way to the shape view.
  • The button to dump the phonetic buffer has been moved to just below the output area.
  • The Detail button is now called the Analyse button, and both this and the Codepoints commands now bring up the new String Analyser utility, which provides much better results than the old pages.
  • A keyboard view has been added to the Tifinagh picker. This new view may pop up in other pickers in the future.

There were a number of other changes to the code, not least to the instructions for use on the main picker page and to each set of notes below the pickers themselves.

>> Use it


Picture of the page in action.

About the tool: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don't know a script well enough to use the native keyboard. The arrangement of characters also makes them much more usable than a regular character map utility.

Latest changes: This is the first version 9 picker. Changes introduced in version 9 include moving the buttons that allow you to display different views to just below the page title. Also, in version 8 pickers, there was an icon in the phonic view that allowed you to dump to the output the phonetic transcription that builds up while selecting characters. This has been replaced with a button just below the output field. There were a number of other superficial changes.

A significant addition to the Malayalam picker is the ability to convert Malayalam text into a Latin transliteration, based on ISO 15919. There was already a way to convert Latin transliterations to Malayalam script.

This version also continues to allow you to type in chillu characters either as the single characters included in Unicode v5.1, or as a sequence of consonant+virama+ZWJ. Additions to the Malayalam repertoire added in v5.2 have not yet been added to the picker.
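
For the curious, this is what the two chillu encodings look like in practice. The code point sequences below are the standard Unicode ones; the comparison is just a minimal sketch, not the picker's code:

    // Two ways of typing MALAYALAM LETTER CHILLU N:
    var atomic   = "\u0D7B";              // the single chillu character (Unicode 5.1)
    var sequence = "\u0D28\u0D4D\u200D";  // NA + VIRAMA + ZERO WIDTH JOINER

    // Rendered alike, but they are different strings:
    console.log(atomic === sequence);     // false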

>> Use it


I just received email from Derek Reid in the XMetaL team at JustSystems to say that they have significantly improved the way the XMetaL XML editor uses xml:lang attributes in the source code in conjunction with its spell-checker.

Basically, XMetaL will switch spell-checking dictionaries based on the xml:lang settings in the markup. It also supports xml:lang="" and xml:lang="zxx" for places you don't want to spell-check. It even does this when using interactive red squiggles to highlight potential misspellings.
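
For example, markup along these lines (a made-up fragment, purely to illustrate the attribute usage) would be checked against an English dictionary, switch to French for the embedded phrase, and skip the code entirely:

    <para xml:lang="en">The spell-checker uses an English dictionary here.
      <phrase xml:lang="fr">Cette phrase est vérifiée avec un dictionnaire français.</phrase>
      <code xml:lang="zxx">printf("%d\n", total);</code>
    </para>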

I wrote a blog post about this in 2007, when the capability was only partially developed. Derek says:

I read this post when you first wrote it, and after getting feedback from a large number of our clients I was finally able to convince our development and management teams to properly support language auto-switching for spell checking in conjunction with xml:lang attribute values in our product.

We made a big effort to deal with these limitations during the past year and our XMetaL Author Enterprise 6.0 release addresses most or all of them.

If you are interested, I have posted instructions on how to configure XMetaL Author Enterprise 6.0 to properly support this feature:
http://forums.xmetal.com/index.php/topic,539.msg1701

I haven't had a chance to try it out yet, but it sounds exciting. Now how about DreamWeaver…

>> Use it

Picture of the page in action.

About the tool: BCP 47 language tags are built from subtags in the IANA Subtag Registry. This tool helps you find or look up subtags and check for errors in language tags. It also provides information to guide your choices.

Latest changes: I reworked the informational text that accompanies macrolanguages, their encompassed languages, and extlang subtags. As part of that, I changed the code to allow for highlighting of specific cases, for example where legacy usage may make the macrolanguage subtag (zh) more useful for Mandarin Chinese than the more specific tags (cmn or zh-cmn).

I simplified the intro to the page, but added a link to the new article Choosing a Language Tag, which provides useful step-by-step guidelines on creating language tags.

I also changed the user interface somewhat. The input fields are easier to work with and take up less vertical space. Also, you can now submit a query by simply hitting return after typing into a field. I had originally required you to click on a submit button so that all values in other fields would be retained when the answer is shown – this was so that, while checking various subtags, you could build up a language tag in the Check field for later checking. I just found that the annoyance of continually having to resubmit after forgetting to click on the submit button wasn't worth the extra functionality (and I was also encouraged to do so by feedback from Bert Bos).

>> See what it can do

>> Use it

Picture of a part of the page.

It took me a while to find the time, but I have finally upgraded UniView to support the final 5.2 release of Unicode, plus a few extra features.

The order of blocks listed in the top left pulldown menu was changed to resemble the order in the Unicode Charts page. Several sub-block selections were also added to the list (as in the Unicode page), and are displayed in italics.

When you display details of a character in the right panel, the heading Script group is now used to indicate the sub-block-level headings in the block listings of the Unicode Standard. The link to the Unicode block now follows the heading Unicode block. These sub-block-level headings are also shown when you display a range as a list (as opposed to a matrix).

When you mouse over characters displayed in a matrix, the codepoint and name information for that character now appear just above the matrix. This makes it much easier to locate characters you are looking for.

Finally, and by no means least, small and large graphics are now available for all 1071 Egyptian Hieroglyph characters. This was the last block for which graphics were completely unavailable.

>> Read the notes

Today I put the finishing touches to and uploaded my first draft notes about the long-lost ishidic script. See what you think of it.

Here's a small section of the sample text shown at the bottom of the post. Click on it to see the whole transcript.

Part of a sample of text written in ishidic script.

  1. Liam R E Quin Says:

    At first glance it looks like a mix of Arabic and Cree, with the artificiality and non-calligraphic nature of Cree combined with the long extenders and curves of Arabic. If you tried to write with it by hand for a long time, maybe it would become more condensed, like Pitman short-hand. Or if you wrote it with a calligraphic pen – maybe I should try that!

  2. David Clarke Says:

    Do you have a translation, or is it to be treated as a lingua-cryptographic puzzle?

  3. r12a Says:

    Where would the fun be in providing a translation? 😉

    Here's a head start:
    "The decipherment of a newly discovered or perennially mysterious text is the most glamorous aspect of the study of writing systems. …"

Removed the 'beta' from the version number and replaced it with .0.1. The new version now converts u+… (i.e. lowercase u) as well as U+….

See https://r12a.github.io/tools/conversion/

Thanks to Martin Dürst for the suggestion.

>> Use it

Picture of the page in action.

I have added a bunch of additional new features to my lookup tool to help with choosing language tags. There is additional information available when you look up subtags (such as what to use if the subtag is deprecated, and what subtags macrolanguages enclose, etc.), and more tests on well-formedness with clearer explanations of the problem. Example.

This should make it a lot more useful to people who haven't read BCP 47 and want to create language tags. Hopefully, in a short while, I'll also write and link to an article that describes how to use subtags from the ground up in a procedural way, to complement the tool.

For further assistance, you can now link from a language subtag result to the SIL Ethnologue, to make it easier to check whether that subtag really does refer to the language you were thinking of.

In addition, script subtag results link to Unicode blocks in UniView.

>> Use it

Picture of the page in action.

The IANA Subtag Registry has been recently updated to contain 220 extlang subtags and the ISO 639-3 language subtags, taking the total number of subtags to almost 8,000.

I have produced a new version of my lookup tool to help with language tagging. In addition to helping you find subtags and look up the meaning of subtags, it now helps check the well-formedness of a language tag.

The tool provides access to all currently defined subtags, including the new extlang subtags.

Parsing language tags. In addition to trying to make the user interface more friendly, I added the ability to parse hyphenated tags, discover their structure, and check for errors. I'm not claiming with this release that the new parser field tests all the corner cases, but it should provide reports for most of the typical errors.

It reports errors for the following (a rough sketch of this kind of checking follows the list):

– subtags that are not in the registry (by type)
– incorrectly ordered subtags
– duplicate variant subtags and multiple subtags of other types
– overlong private use subtags
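
By way of illustration, here is a toy version of such a check, using a single regular expression loosely based on the BCP 47 syntax. This is not the tool's actual parser; it ignores grandfathered tags, extensions, duplicates and registry validity:

    // Toy BCP 47 well-formedness check: language (+ optional extlangs), then
    // optional script, region, variants, and private use. Not registry-aware.
    function looksWellFormed(tag) {
      var re = new RegExp(
        "^[a-z]{2,3}(-[a-z]{3}){0,3}" +          // language + up to 3 extlangs
        "(-[a-z]{4})?" +                         // script
        "(-([a-z]{2}|[0-9]{3}))?" +              // region
        "(-([a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*" + // variants
        "(-x(-[a-z0-9]{1,8})+)?$", "i");         // private use
      return re.test(tag);
    }

    console.log(looksWellFormed("zh-cmn-Hans-CN")); // true
    console.log(looksWellFormed("en-GB-US"));       // false: two region subtags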

Try this example.

It doesn't yet handle extensions, but then there aren't any valid ones to handle yet anyway.

I hope that's useful.

>> See what it can do

>> Use it

Picture of the page in action.

Following hot on the heels of the last release come some further significant changes to UniView aimed at making it easier to use as Unicode grows.

The big change is that UniView now starts up in graphics mode by default. This means that pages load more slowly, but (especially with the continuing growth of Unicode) also means that you are more likely to be able to see the characters you are looking for. It's easy to switch between modes at any point, using the "Use graphics" checkbox. (And if you prefer font glyphs as a default, you just need to change the URI in your bookmarked link slightly, and you can continue to work that way.)

To facilitate this change, I created my own graphics for a number of blocks which are not yet covered by decodeunicode, or which are no longer fully covered by decodeunicode. The blocks for which I provided graphics are Latin Extended-C, Latin Extended-D, Latin Extended Additional, Cyrillic Supplement, Cyrillic Extended-B, Modifier Tone Letters, Tibetan, Malayalam, Saurashtra, Ol Chiki, Myanmar, Kayah Li, Cham, Rejang, Vai, Supplemental Punctuation, and Miscellaneous Symbols and Arrows.

There are still many characters for which there are no graphics (especially the new characters in Unicode 5.2), but coverage is much better than it was. As I find more fonts, I will be able to create graphics for the remaining characters.

I also put a grey box around the characters in tables. This is particularly useful if there are no graphics or font glyphs for a block or range of characters, as it makes it easier to locate the character you are looking for.

I also fixed a bug that was preventing Chrome, Safari and IE from displaying the first two Latin blocks. I think the bug was actually in the Unicode data file.

>> See what it can do

>> Use it

Picture of the page in action.

With the family now in Japan, I had some extra time to spare this weekend, so I upgraded UniView to handle all the proposed characters for Unicode 5.2.

The properties for new and modified characters are still in beta and not officially stable; the character allocations, however, should be stable at this point. UniView therefore alerts you if you are looking at a new character.

If the Unicode database information has changed for a given character you are also warned, and provided with a link that points to the previous information for that character. These warnings will be removed from UniView when Unicode 5.2 is released.

Of course, you are unlikely to be able to actually see the new characters themselves, unless you are lucky enough to have a very new font to hand. The graphic alternatives are not available yet for these characters. I'm wondering whether it's possible for me to do something about that, but that will take a little longer. In the meantime, you might find it more useful to view blocks in list view. (Click on 'Show range as list'.)

This release also fixes a few small bugs in the HTML and JavaScript code.

A new version of this very popular tool is now available, in a new location. Although it is currently labeled 'beta', I recommend that you use it instead of the old one, and change any links and bookmarks to the new location. There are a number of new features.

There is also a vastly improved code base. If you are one of the many people who have contacted me to ask how I coded the conversions, please take a look at the new JavaScript code. It is much cleaner and more compact.

New features include:

* New mixed input field and position of some fields changed.
* New field for conversion of 0xā€¦ notation hex escapes.
* Enabled invisible and ambiguous characters to be made visible in the XML output.
* Added support for all HTML entities in HTML/XML input.
* All code rewritten to use characters as the internal representation, rather than code points (there's a short illustration of the difference after this list). Also, code is much smaller and cleaner, partly through use of regular expression matching.
* Various filters available for conversion, such as allowing ASCII or Latin1 characters to remain unconverted in NCR output.
* New icon to quickly select all contents of a field.
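
On that characters-vs-code-points point: JavaScript strings are sequences of 16-bit units, so a supplementary character occupies two units. A hedged sketch of what dealing with that involves (not the converter's actual code):

    // U+10384 UGARITIC LETTER DELTA is one character but two JS string units.
    var s = "\uD800\uDF84";   // the surrogate pair for U+10384
    console.log(s.length);    // 2: length counts 16-bit units, not characters

    // Recovering the code point from the pair by hand:
    var hi = s.charCodeAt(0), lo = s.charCodeAt(1);
    var cp = (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
    console.log(cp.toString(16).toUpperCase()); // "10384"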

There is also a new demonstration feature.

If there are no issues raised/remaining in a couple of months, I'll remove the beta tag.

>> Use it

Picture of the page in action.

This is a new tool that helps you to locate a country or territory on a map of the world. Ever wondered where Kazakhstan is? This will show you.

The map is in SVG and expands to fill the window. Territories are coloured red. Very small territories are marked by a red dot.

The map comes from Wikipedia. The list of territories comes from the regions listed in the IANA Language Subtag Registry. I can't guarantee that all the territories in the pulldown list are viewable, but nearly all are.

It's quite a big SVG file, so it takes a little while to draw. I'll try to speed that up in the future. It seems to draw much faster on Chrome or Opera than on Firefox or IE.

For the future I have some other ideas, such as displaying the country name natively, and linking to Wikipedia articles, CLDR data, etc. But that's for later.

Update: Almost every time I located a country, I found myself wondering what the countries alongside are. So now as you move your mouse over a country, the name of that country pops up.
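
In case you are wondering how that kind of popup is wired up, the general pattern looks something like this (a generic sketch with made-up ids and names, not the page's actual code):

    // Assumes each territory is an SVG path whose id is its region subtag,
    // e.g. <path id="KZ" .../>, plus an element with id "label" for the name.
    var names = { KZ: "Kazakhstan", MN: "Mongolia" };  // hypothetical lookup table
    var label = document.getElementById("label");
    var shapes = document.querySelectorAll("path");
    for (var i = 0; i < shapes.length; i++) {
      (function (shape) {
        shape.addEventListener("mouseover", function () {
          label.textContent = names[shape.id] || shape.id;
        }, false);
      })(shapes[i]);
    }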

Enjoy.

>> See what it can do!

>> Use it!

Picture of the page in action.

The major changes in this version include a new feature to normalise text as NFC or NFD, the ability to accept decimal code point values, and an overhaul of the top part of the user interface.

Added buttons to the Text area to allow conversion of the text to NFC or NFD normalization forms. (You may not notice the change until you list the characters.)

The control panel was also substantially rearranged again to hopefully make it easier for newcomers to see what they can do.

The Code point conversion feature was upgraded to handle decimal code point values.

A single character in the codepoints area or text area is now listed in the lower left panel when you click on the relevant icon, rather than in the right-hand properties panel. This is to improve consistency and avoid surprises.

Added a link to the CLDR property demo from the right panel to give access to additional properties.

Improved the parsing of codepoints when surrounded by text in the Code point input field, so that it now works with &#x…; and \u… and \U… escapes.
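
By way of illustration, matching those escapes comes down to a regular expression along these lines (an assumed sketch, not the tool's actual code):

    // Pull code point values out of surrounding text, whatever the escape syntax.
    function extractCodepoints(text) {
      var re = /&#x([0-9a-f]+);|\\U([0-9a-f]{8})|\\u([0-9a-f]{4})|u\+([0-9a-f]{1,6})/gi;
      var values = [], m;
      while ((m = re.exec(text)) !== null) {
        values.push(parseInt(m[1] || m[2] || m[3] || m[4], 16));
      }
      return values;
    }

    console.log(extractCodepoints("&#x48;, \\u0065 and U+006C")); // [72, 101, 108]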

Jettisoned some unneeded code to reduce download by around 40-50K bytes. Implemented the NFC/NFD feature using AJAX, to avoid putting the download size back up.

When you delete the contents of the text area or the code point area, the associated input field is given focus, so you are ready for input.

A couple more minor bug fixes.

I was asked to make available the code for my normalization functions in JavaScript and PHP. The links are below. I'm making the code available under a Creative Commons Attribution-Noncommercial-Share Alike licence.

Disclaimers: Note that I make no claim to have produced polished, compact or well-optimised code! The code does what I need, and I'm happy with that. You are welcome to suggest improvements, and I'm sure there are many that could be made.

As they say, this code is made available in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.

The code is a little more convoluted than it ought to be, to get around the fact that JavaScript doesn't understand supplementary characters, and PHP just doesn't naturally understand Unicode. (How I long for PHP6.)

Update: I meant to mention that there is a way of doing normalization in PHP already. I made this code available just because I had it. I created it as a learning exercise. It may be useful, however, if you are unable to load the ICU and intl packages onto your server.

To use the code, simply call nfc('your-text-string') or nfd('your-text-string') from your code and capture the result.

For PHP youā€™ll need these routines and this data.

For JavaScript look at these routines and this data. There is also a lite version of the data file that doesn't include Han characters. I use this sometimes for bandwidth savings (about 14K less).
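
For instance, a minimal JavaScript usage might look like this. The nfc and nfd function names are the ones mentioned above; the file names in the comment are placeholders for the routine and data files linked here:

    // Assumes the data file and the routines above have been loaded first
    // (placeholder names: normalization-data.js, normalization.js).
    var decomposed = nfd("\u00E9");            // é becomes e + combining acute accent
    console.log(decomposed.length);            // 2
    console.log(nfc(decomposed) === "\u00E9"); // true: recomposed to one character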

Test files: I also created some test files for PHP and for JavaScript.
Both of these expect to find a copy of http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt in the local directory. These files run 71,076 tests.
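
For anyone curious what such a harness does: each line of NormalizationTest.txt gives five semicolon-separated columns of hex code point values (c1–c5), and a conformant implementation must satisfy, among other invariants, c2 == NFC(c1) and c3 == NFD(c1). A condensed sketch (not my actual test file):

    // Condensed harness sketch; nfc and nfd are the routines above.
    function fromHex(field) {  // turn "0044 0307" into the string it denotes
      return field.split(" ").map(function (h) {
        var cp = parseInt(h, 16);
        if (cp <= 0xFFFF) return String.fromCharCode(cp);
        cp -= 0x10000;         // supplementary: encode as a surrogate pair
        return String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
      }).join("");
    }

    var lines = [  // in the real tests, read from NormalizationTest.txt
      "1E0A;1E0A;0044 0307;1E0A;0044 0307; # D WITH DOT ABOVE"
    ];
    lines.forEach(function (line) {
      if (!line || /^[#@]/.test(line)) return;  // skip comments and part markers
      var c = line.split("#")[0].split(";").slice(0, 5).map(fromHex);
      if (nfc(c[0]) !== c[1]) console.log("NFC failure: " + line);
      if (nfd(c[0]) !== c[2]) console.log("NFD failure: " + line);
    });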

Cautions: Be careful about the editor you use for the data files. I spent several hours fruitlessly debugging the routines, only to find that Notepad++ was displaying certain supplementary characters OK, but corrupting them on save. I switched to Notepad and the problem evaporated. And I probably don't need to add that editing the data files in something like DreamWeaver is a bad idea, because it will probably normalize the data before saving.

Another point: you may see Unicode replacement characters at a couple of points in the PHP source. These represent the first and last characters in the high surrogate range.

Experimenting: If you want to play with something that uses this you could try my Tłįchǫ (Dogrib) character picker, or my Normalizer tool. I will slowly fit this to all the pickers and to UniView. I have a local version of UniView waiting in the wings that uses the PHP files via AJAX, to reduce download size. For that you need a file that returns the result as plain text across the wire, such as this.
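
The AJAX round trip itself is straightforward; something along these lines would do it (the endpoint name and parameters here are placeholders, since the actual file is the one linked above):

    // Hand the text to a server-side script and get plain text back.
    function normalizeRemotely(text, form, callback) {  // form: "nfc" or "nfd"
      var xhr = new XMLHttpRequest();
      xhr.open("GET", "normalize.php?form=" + form +
                      "&txt=" + encodeURIComponent(text), true);
      xhr.onreadystatechange = function () {
        if (xhr.readyState === 4 && xhr.status === 200) {
          callback(xhr.responseText);  // the normalized string, as plain text
        }
      };
      xhr.send(null);
    }

    normalizeRemotely("e\u0301", "nfc", function (result) {
      console.log(result === "\u00E9");  // true if the server returned NFC
    });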

Well, I hope that may be of use to someone, somewhere. I hope I haven't forgotten anything.

>> Try it!

Picture of the page in action.

This tool allows you to normalise short pieces of text to Unicode forms NFC or NFD. You can paste the relevant text into a text area, or append it to the URI that calls the page, e.g. Vietnamese example.

Note that, although I spell normalisation in the British way in this post, the URI uses the American spelling, since I suspect most users of the tool will expect it to be spelt that way.

Wondering what normalisation is? In Unicode a letter like á can be represented by a single (precomposed) character or by an a followed by an acute accent (a decomposed sequence). Unicode regards these two representations as formally equivalent. If you are comparing strings, therefore, you need to know which representations are equivalent. Usually you would want to normalise your text to a given normalisation form prior to comparison, so that the comparison process can be efficient. Unicode defines four normalization forms, two of which, NFD and NFC, are handled by this tool.

Basically NFD reduces all precomposed characters to their decomposed equivalents, whereas NFC uses precomposed characters for most common situations.
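
To make that concrete, here are the two representations of á, using the standard Unicode code points (the normalization itself would be done by this tool, or by the nfc/nfd routines from the earlier post):

    // Two canonically equivalent representations of á:
    var nfcForm = "\u00E1";    // precomposed: LATIN SMALL LETTER A WITH ACUTE
    var nfdForm = "a\u0301";   // decomposed: "a" + COMBINING ACUTE ACCENT

    console.log(nfcForm === nfdForm);             // false: equivalent, not identical
    console.log(nfcForm.length, nfdForm.length);  // 1 2
    // Normalize both to the same form before comparing:
    // nfc(nfdForm) === nfcForm, and nfd(nfcForm) === nfdForm.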