New utility

I’ve been developing some small utilities in PHP to help me geotag my photos. First up for mention here is a tool for converting latitude and longitude to decimal format and to the tags needed for Flickr. I think it may be useful to people other than me.

The following input formats are acceptable:

* Any arrangement that includes at least one of the following characters ° ‘ ” or , and lists figures in the order degrees, minutes, seconds (where the last two are optional).
* Decimal formats – useful for just formatting as flickr geotags.
* Use of N,S,E,W (any case and any location) or minus signs.

The following decimal output formats are provided:

* latitude longitude: this is arranged in such a way that you could cut and paste the whole thing as one, eg. into, say, Google Earth’s search field to find a location quickly
* geotags for Flickr: this provides the regular geotag, lat= and lon= tags plus a fourth combination tag in a format that lets you create all the tags with a single cut and paste
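
To make the conversion concrete, here is a minimal sketch in Python rather than the PHP the tool itself uses. It is not the tool's code. The function names `to_decimal` and `flickr_tags` are hypothetical, the `geo:` prefix on the lat and lon tags is my assumption about the Flickr convention, and the fourth combination tag is left out because its format is not described here.

```python
import re

def to_decimal(text: str) -> float:
    """Convert one coordinate such as 51°30'26"N or -0.1275 to decimal degrees."""
    # A standalone S or W (any case), or a minus sign, makes the value negative.
    negative = bool(re.search(r"(?<![A-Za-z])[SsWw](?![A-Za-z])", text)) or "-" in text
    # Pull out the figures in order: degrees, then optional minutes and seconds.
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    if not numbers or len(numbers) > 3:
        raise ValueError(f"cannot parse coordinate: {text!r}")
    degrees, minutes, seconds = (numbers + [0.0, 0.0])[:3]
    value = degrees + minutes / 60 + seconds / 3600
    return -value if negative else value

def flickr_tags(lat: float, lon: float) -> list[str]:
    # Assumed tag format; the combined fourth tag is deliberately not reproduced.
    return ["geotagged", f"geo:lat={lat:.6f}", f"geo:lon={lon:.6f}"]

lat = to_decimal("51°30'26\"N")
lon = to_decimal("0°7'39\"W")
print(f"{lat:.6f} {lon:.6f}")
print(" ".join(flickr_tags(lat, lon)))
```

The sketch treats each coordinate separately and reads a comma only as a separator between figures, so a decimal comma would be misread.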

New picker

Someone who uses the pickers for cataloguing the Asian language collection in a UK library asked me to provide a Gujarati picker. Since he was suitably flattering about the other pickers, I thought I ought to oblige.

This picker includes all the characters in the Unicode Gujarati block.

The default shows all characters as images due to the rarity of Gujarati fonts. Consonants are mostly in a typical articulatory arrangement, vowels are aligned with vowel signs, and digits are in keypad order. I have not implemented any highlighting of similar characters, since I put this together very quickly.

Enjoy.

Updated picker

Thanks to Gernot Katzer, I realised that during the recent styling update for the Hebrew picker I missed out a whole bunch of combining marks.

It’s now fixed. Thanks Gernot!

New picker

This new picker includes all the characters in the Unicode Myanmar block.

The default shows all characters as images due to the rarity of Myanmar fonts. Consonants are mostly in a typical articulatory arrangement, vowels are aligned with vowel signs, and digits are in keypad order. Hinting is implemented for visually similar glyphs.

I don’t know a lot about Myanmar yet, so any suggestions for improving the layout are welcome. (Noting that this is supposed to help recognition of characters by people who are new to the script.)

New version

Version 4.1.0b is a minor update, but adds some useful functionality.

Changes:

  • Provided a way to start up UniView with a particular block and/or character displayed as a table in the lower panels. This should be particularly useful for pointing a person to a particular Unicode block or character in a URI. For example, if you wanted to point someone to UniView so that they immediately find the Greek and Coptic block, and a description of U+0E33: THAI CHARACTER SARA AM, you could put the following link in your email or page: http://people.w3.org/rishida/scripts/uniview/?block=greek-and-coptic&char=0e33
  • Added a link to the decodeUnicode wiki for each character that is displayed in the right-hand panel. This is a wiki where people can contribute information about Unicode blocks and characters. It is developed at the Department of Design at the University of Applied Sciences in Mainz. “The project is supported by the Federal Ministry of Education and Research (BMBF) and has the objectives of creating a basis for fundamental typographic research and facilitating a textual approach to the characters of the world for all computer users.”
  • Fixed a couple of minor bugs in the CSS.
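
If you want to generate such links rather than type them, a rough sketch follows. The `block` and `char` parameter names and the base address come from the example link above; the helper name `uniview_link` is hypothetical, and the rule for spelling block names (lower case, hyphens for spaces) is an assumption inferred from that single example.

```python
from urllib.parse import urlencode

BASE = "http://people.w3.org/rishida/scripts/uniview/"

def uniview_link(block=None, char=None):
    """Build a UniView start-up link for a block name and/or a code point."""
    params = {}
    if block:
        # Assumed naming rule: lower case, with hyphens in place of spaces.
        params["block"] = block.lower().replace(" ", "-")
    if char is not None:
        # Code point as lower-case hex, at least four digits, as in the example.
        params["char"] = f"{char:04x}"
    return BASE + ("?" + urlencode(params) if params else "")

print(uniview_link("Greek and Coptic", 0x0E33))
# → http://people.w3.org/rishida/scripts/uniview/?block=greek-and-coptic&char=0e33
```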

New version.

Just produced version 2.0 with the following changes:

  1. Added some glyphs that were missing by comparing with the information at the IPA home page.
  2. Added image/character switch, but changed it so that combining characters are only ever shown as images.
  3. Combining graphics are now distinguished by a pale blue background.
  4. Added some links and information to the bottom of the page.

The French version has not been updated.

The W3C i18n Working Group would like to hear from you if you have some knowledge/thoughts in this area. We would like to gather information about the usefulness, in general, of the ::first-letter pseudo-element in non-Latin scripts, and any particular issues or differences arising from the different characteristics of the scripts.

Please send your comments to www-international @ w3.org
Archive and subscription: http://lists.w3.org/Archives/Public/www-international/

The latest working draft of CSS3 Selectors proposes the ::first-letter pseudo-element.

See http://www.w3.org/TR/2005/WD-css3-selectors-20051215/#first-letter

The ::first-letter pseudo-element represents the first letter of the first line of a block, if it is not preceded by any other content (such as images or inline tables) on its line.

It allows that first letter to be styled individually, without markup. It may be used for “initial caps” and “drop caps”, which are common typographical effects in text in Latin script.

We commented to the CSS Working Group that they need to define ‘letter’ more carefully, and proposed that they specify that ‘letter’ equates to ‘default grapheme cluster’, as described in the Unicode Standard Annex #29.

See http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

(A rough and ready explanation of this is that base characters and any following combining characters are styled together. So

0065: e LATIN SMALL LETTER E + 0301: ́ COMBINING ACUTE ACCENT

would be handled as a single letter.)
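
As an illustration only, the rough and ready rule can be sketched in a few lines of Python. This is not the UAX #29 algorithm: it simply treats nonspacing and enclosing marks (general categories Mn and Me) as extending the preceding character, and ignores the other rules (Hangul syllables, CR LF, and so on). The function name `first_letter` is hypothetical.

```python
import unicodedata

def first_letter(text: str) -> str:
    """Return the first character plus any combining marks that follow it."""
    if not text:
        return ""
    end = 1
    # Mn (nonspacing) and Me (enclosing) marks extend the preceding character.
    while end < len(text) and unicodedata.category(text[end]) in ("Mn", "Me"):
        end += 1
    return text[:end]

word = "e\u0301cole"  # e + COMBINING ACUTE ACCENT, then "cole"
print([f"{ord(c):04X}" for c in first_letter(word)])  # → ['0065', '0301']
```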

We also suggested that implementors should then be encouraged to provide tailored algorithms on a per-language basis to cope with anomalies, particularly such as may occur in non-Latin scripts.

Here are some initial questions:

[1] Are there scripts that would never use this approach?

[2] We mention ‘initial caps’ and ‘drop caps’ above. What other types of styling would be commonly applied in other scripts if this feature were available?

[3] What script features would cause difficulties, eg. syllabic groupings (see the Indic script example below), ligatures, cursive text (eg. Arabic, Urdu, etc.), and how would the script normally deal with them?

Please send your comments to www-international @ w3.org
Archive: http://lists.w3.org/Archives/Public/www-international/

————

What follows are some examples of questions that spring to mind.

SYLLABIC INDIC SCRIPTS

In the Hindi word स्थिति (‘sthiti’) the sequence of characters in the first syllable is as follows in memory:

0938: स DEVANAGARI LETTER SA
094D: ् DEVANAGARI SIGN VIRAMA
0925: थ DEVANAGARI LETTER THA
093F: ि DEVANAGARI VOWEL SIGN I

The displayed text, however, is:

[Image: the first syllable of the word 'sthiti' in Hindi, showing the positions of the characters after display.]

Note how the vowel sign appears to the left of the first character, not the third.

The default grapheme clusters here are, I believe, 0938+094D, then each of the following two characters.
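
One way to see where that grouping comes from is to look at the general category of each character in the syllable. The sketch below (the helper name `describe` is hypothetical) lists them using Python's `unicodedata` module. The virama is a nonspacing mark (Mn), which attaches to the preceding letter, while the vowel sign is a spacing mark (Mc). Treating spacing marks as standing alone is an assumption that matches the grouping given above; segmentation libraries that follow more recent rules may group these characters differently.

```python
import unicodedata

syllable = "\u0938\u094D\u0925\u093F"  # first syllable of 'sthiti', in memory order

def describe(text):
    """Return (code point, general category, name) for each character."""
    return [(f"{ord(c):04X}", unicodedata.category(c), unicodedata.name(c)) for c in text]

for code, category, name in describe(syllable):
    print(code, category, name)
# 0938 Lo DEVANAGARI LETTER SA
# 094D Mn DEVANAGARI SIGN VIRAMA
# 0925 Lo DEVANAGARI LETTER THA
# 093F Mc DEVANAGARI VOWEL SIGN I
```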

Would Devanagari-based languages use special styling for initial syllables? If so, would they actually apply the styling to the vowel sign alone, or to the whole syllable?

LIGATURES

If a script styles the ‘first letter’, but that letter is part of a ligature (ie. a single glyph representing more than one underlying character), would it be ok to split the ligature, or should the other characters that compose the ligature also be styled?

CURSIVE SCRIPTS

Since Arabic and Mongolian letters in a word are normally joined, has first letter styling been used at all in these scripts?

CHINESE, JAPANESE, KOREAN

Do languages using these scripts do first letter styling?

RUSSIAN, GREEK, ARMENIAN, etc.

Is first letter styling common practice in these scripts too?

New paper.

I thought this paper had been lost when Xerox erased my global design site, but I just found a link to a cached copy of it at the XML Cover pages.

The paper refers to standard topics such as character encoding and language declarations, but also covers topics such as implementation of emphasis and style conventions, handling of citations, use of text in attribute values, and the need for an element like HTML’s SPAN. In addition, other topics that have traditionally been associated with translation of user interface messages become applicable due to the nature of XML documents. These include the provision of designer’s notes, identification of non-translatable text, and use of element ids for automatic translation of elements.

New version

Although the difference between 4.1 and 4.1.0a doesn’t look like much, this is a substantial update. I also updated the help/user guide.

Changes:

Added support for Unicode version 4.1.0.

Retrieves graphics from decodeunicode.org rather than the slow-loading and sparse graphics that were available from the Unicode site. Also added my own graphics where decodeunicode has gaps.

Moved the files to PHP. This enables a different approach to the inclusion of user-defined notes that now works on IE and Opera, too.

Another benefit of using PHP is that you can now prep the conversion page with data in the ‘Code point’ or ‘Cut & paste’ fields. By clicking on the appropriate icon, the conversion page will now open with the conversions already done for the relevant field.

Yet another benefit of PHP is that, if you really want to, you can now set various preferences related to the initial look and feel by specifying them as query parameters when you call UniView.

NOTE: If you want to be able to download UniView to your hard drive and you don’t have a server and PHP, let me know. If enough people ask for it, I will create a downloadable zipped package again that will work without PHP (and without the additional notes feature). I will also post notes on how to customise various aspects of the setup.

Rearranged the top of the page to allow UniView to be used in narrower windows.

New app

This picker includes all characters in the Unicode 4.0 Ethiopic block. It does not cover additions in Unicode version 4.1.

(Before creating a full Ethiopic picker, I need to get a font that covers all the new characters, and I need to figure out how to best fit the new characters into the current arrangement.)

Letters are arranged in the Unicode code page order, which is aligned with the traditional consonant-vowel matrix. I didn’t actually use a table, to save space. If you have strong views on the layout, send me some suggestions.

New app

Includes all characters in the Unicode Armenian block. Letters are arranged in the Unicode code page order, but upper and lower case letters are side by side.

[new version]

In preparation for my full-day tutorial at WWW2005 I’ve been creating and renovating presentations. As a result, the format of this tutorial has been reworked to fit the style I have recently developed for tutorials on the W3C Internationalization site.

The text remains the same, but you can now view slides as a single document (good for printing), or as individual slides with XHTML-based notes. The slides themselves are graphics, so you can see what I’m trying to show you, but there’s also a text only version of each slide. There’s also a new overview, to help you jump around easily.

I had wanted to also provide SVG slides, but the SVG export utility in the new version of Open Office (2.0 beta) takes control of positioning the text in a way that doesn’t support combining characters and shaping etc properly. Since my talk is about features that need to be supported for support of complex scripts, this actually provides an interesting object lesson! 🙁 Maybe when I have more time I’ll upload them.

New version Download zip Overview/Instructions

New features include:

  • Support for supplementary characters
  • Double-clicking on a list item to the left adds the character to the Cut&Paste field above
  • Han and Ideographic characters are shown for code points typed in the Code Point field.

About UniView
Look up characters, character blocks, paste in and discover unknown characters, store your own info about characters, search on character names, do hex/dec/ncr conversions, highlight character types, etc. etc. Check out the help file for instructions and new features, and have fun!

Supports Unicode 4.0 and has been rewritten with Web Standards to work on a variety of browsers. This is still a work in progress and has some known bugs (esp. surrogates), but I use it all the time.

Initially downloads about 1MB of Unicode data, so you should use a fast connection. Once it’s cached you’re ok. Alternatively, download the newly available zip file and run it from your PC. It’s just XHTML and JavaScript, so no worries about viruses.
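
For readers wondering what the hex/dec/ncr conversions involve, and why surrogates are fiddly for supplementary characters, here is a small sketch in Python. It is not UniView's code (which is JavaScript), and the function name `conversions` is hypothetical. The surrogate arithmetic is the standard UTF-16 calculation.

```python
def conversions(char: str) -> dict:
    """Hex, decimal, NCR and UTF-16 forms of a single character."""
    cp = ord(char)
    result = {
        "hex": f"{cp:04X}",
        "dec": str(cp),
        "ncr": f"&#x{cp:X};",
    }
    if cp > 0xFFFF:
        # Supplementary character: split into a high and a low surrogate.
        offset = cp - 0x10000
        high = 0xD800 + (offset >> 10)
        low = 0xDC00 + (offset & 0x3FF)
        result["utf16"] = f"{high:04X} {low:04X}"
    else:
        result["utf16"] = f"{cp:04X}"
    return result

print(conversions("\U00010400"))
# → {'hex': '10400', 'dec': '66560', 'ncr': '&#x10400;', 'utf16': 'D801 DC00'}
```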

Updated

I added 02BF: MODIFIER LETTER LEFT HALF RING and 02BE: MODIFIER LETTER RIGHT HALF RING.

This was in response to a request from someone who wants to transliterate Arabic, Persian, and several other languages during their work publishing Middle Eastern historical works.

New app

Includes characters needed in Morocco to represent the Berber script. Unicode codes are those expected to be standardised in Unicode 4.1. Letters are arranged in approximately the same arrangement as the proposed IRCAM keyboard. Get the Hapax Berbère font.

New article

The HTML specification suggests that the link element can be used by search engines to find alternate translations of the current page. Some browsers expose the link information on the user interface.

Andrew Cunningham and I wrote a test for this. Here is a summary of the results of some brief testing on mainstream browsers on Windows XP.
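
For reference, the markup in question looks like the output of the sketch below, which builds one link element per translation. The helper name `alternate_links` and the example.org addresses are hypothetical; the `rel`, `hreflang` and `href` attributes are those defined for the link element in HTML.

```python
from html import escape

def alternate_links(translations):
    """Build one link element per translation; maps language tag -> URL."""
    return [
        f'<link rel="alternate" hreflang="{escape(lang)}" href="{escape(url)}">'
        for lang, url in translations.items()
    ]

pages = {
    "fr": "http://example.org/page.fr.html",  # hypothetical addresses
    "de": "http://example.org/page.de.html",
}
print("\n".join(alternate_links(pages)))
```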

New article

Some browsers apply the fonts listed in the user font preferences to the display of HTML Unicode text in Traditional Chinese, Simplified Chinese, Japanese and Korean, depending on the setting of the lang/xml:lang attribute. Here is a summary of the results of some brief testing of mainstream browsers on Windows XP. I may update this as additional information becomes available.

New article

In-progress draft of notes that list the symbols used to represent Bengali, describe their use, and relate them to appropriate characters for representation in Unicode. There is an index of shapes you can use to look up Bengali glyphs and track them down to their constituent Unicode codepoints.

New version   Download zip

New features include:

  • French version available, thanks to translation by Patrick Andries
  • improvements to user interface, including ability to set height of visible area in left panel (useful for large screens) and tooltips to help understand the UI
  • deactivated display of personalised notes in IE until a fix can be found
