Picture of the page in action.

>> Use UniView

This version updates the app with the changes made during the beta phase of the specification, so that it now reflects the finalised Unicode 7.0.0.

The initial in-app help information displayed for new users was significantly updated, and the help tab now links directly to the help page.

A more significant improvement was the addition of links to character descriptions (on the right) where such details exist. This finally reintegrates the information that was previously pulled in from a database. Links are only provided where additional data actually exists. To see an example, go here and click on See character notes at the bottom right.

Rather than pull the data into the page, the link opens a new window containing the appropriate information. This has advantages for comparing data, but it was also the best solution I could find without using PHP (which is no longer available on the server I use). It also makes it easier to edit the character notes, so the amount of such detail should grow faster. In fact, some additional pages of notes were added along with this upgrade.

A pop-up window containing resource information used to appear when you used the query to show a block. This no longer happens.

Changes in version 7beta

I forgot to announce this version on my blog, so for good measure, here are the (pretty big) changes it introduced.

This version adds the 2,834 new characters encoded in the Unicode 7.0.0 beta, including characters for 23 new scripts. It also simplifies the user interface, and eliminates most of the bugs introduced in the quick port to JavaScript that constituted the previous version.

Some features that were available in version 6.1.0a are still not available, but they are minor.

Significant changes to the UI include the removal of the 'popout' box, and the merging of the search input box with that of the other features listed under Find.

In addition, the buttons that used to appear when you select a Unicode block have changed. Now the block name appears near the top right of the page with an information icon. Clicking on the icon takes you to a page listing resources for that block, rather than listing the resources in the lower right part of UniView's interface.

UniView no longer uses a database to display additional notes about characters. Instead, the information is being added to HTML files.

Korean justification

The editors of the CSS3 Text module specification are currently trying to get more information about how to handle Korean justification, particularly when hanja (Chinese ideographic characters) are mixed with Korean hangul.

Should the hanja be stretched with inter-character spacing at the same time as the Korean inter-word spaces are stretched? If only a single word fits on a line, should it be stretched using inter-character spacing? Are there more sophisticated rules involving choices of justification location, as there are in Japanese, where you adjust punctuation first and do inter-character spacing later? What if the whole text is hanja, such as for an ancient document?
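
For illustration only, here is a minimal CSS sketch of the kind of control under discussion (the text-justify property is defined in the CSS3 Text module, though its value names have varied between drafts):

:lang(ko) {
  text-align: justify;
  text-justify: inter-word;  /* stretch the spaces between words */
  }

The open questions above are about when, if ever, such a rule should fall back to inter-character spacing for hanja or for a line containing a single long word.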

If you are able to provide information, take a look at what's in the CSS3 Text module and follow the thread on public-i18n-cjk@w3.org (subscribe).

Factoids listed at the start of the EURid/UNESCO World Report on IDN Deployment 2013

5.1 million IDN domain names

Only 2% of the world's domain names are in non-Latin script

The 5 most popular browsers have strong support for IDNs in their latest versions

Poor support for IDNs in mobile devices

92% of the world's most popular websites do not recognise IDNs as URLs in links

0% of the world's most popular websites allow IDN email addresses as user accounts

99% correlation between IDN scripts and language of websites (Han, Hangul, Hiragana, Katakana)

About two weeks ago I attended the part of a 3-day Asia Pacific Top Level Domain Association (APTLD) meeting in Oman related to 'Universal Acceptance' of Internationalized Domain Names (IDNs), ie. domain names using non-ASCII characters. This refers to the fact that, although IDNs work reasonably well in the browser context, they are problematic when people try to use them in the wider world for things such as email and social media IDs, etc. The meeting was facilitated by Don Hollander, GM of APTLD.

Here's a summary of information from the presentations and discussions.

(By the way, Don Hollander and Dennis Tan Tanaka, Verisign, each gave talks about this during the MultilingualWeb workshop in Madrid the week before. You can find links to their slides from the event program.)

Basic proposition

Internationalized Domain Names (IDNs) provide much improved accessibility to the web for local communities using non-Latin scripts, and are expected to particularly ease entry for the 3 billion people not yet web-enabled. For example, in advertising (such as on the side of a bus) they are easier and much faster to recognise and remember, and they are also easier to note down and type into a browser.

The biggest collection of IDNs is under .com and .net, but there are new Brand TLDs emerging as well as IDN country codes. On the Web there is a near-perfect correlation between use of IDNs and the language of a web site.

The problems tend to arise where IDNs are used across cultural/script boundaries. These cross-cultural boundaries are encountered not just by users but by implementers/companies that create tools, such as email clients, that are deployed across multilingual regions.

It seems to be accepted that there is a case for IDNs, and that they already work pretty well in the context of the browser, but problems in widespread usage of internationalized domain names beyond the browser are delaying demand, and this apparently slow demand doesn't convince implementers to make changes – it's a chicken and egg situation.

The main question asked at the meeting was how to break the vicious cycle. The general opinion seemed to lean towards getting major players like Google, Microsoft and Apple to provide end-to-end support for IDNs throughout their product range, to encourage adoption by others.

Problems

Domain names are used beyond the browser context. Problem areas include:

  • email
    • email clients generally don't support use of non-ASCII email addresses
    • standards don't address the username part of email addresses as well as they do the domain part
    • there's an issue to do with SMTPUTF8 not being visible in all the right places
    • you can't be sure that your email will get through; it may be dropped on the floor even if only one CC address is an IDN
  • applications that accept email IDs or IDNs
    • even Russian PayPal IDs fail for the .рф domain
    • things to be considered include:
      • plain text detection: you currently need http or www at start in google docs to detect that something is a domain name
      • input validation: no central validation repository of TLDs
      • rendering: what if the user doesn't have a font?
      • storage & normalization: IDs that exist as either IDN or punycode are not unique IDs
      • security and spam controls: Google won't launch a solution without resolving phishing issues; some spam filters or anti-virus scanners think IDNs are dangerous abnormalities
      • other integrations: add contact, create mail and send mail all show different views of IDN email address
  • search: how do you search for IDNs in contacts list?
    • search in general already works pretty well on Google
    • I wasn't clear about how equivalent IDN and Latin domain names will be treated
  • mobile devices: surprisingly for the APTLD folks, it's harder to find the needed fonts and input mechanisms to allow typing IDNs in mobile devices
  • consistent rendering:
    • some browsers display as punycode in some circumstances – not very user friendly
    • there are typically differences between full and hybrid (ie. partial) internationalized domain names
    • IDNs typed in twitter are sent as punycode (mouse over the link in the tweet on a twitter page)

Initiatives

Google are working on enabling IDNs throughout their application space, including Gmail but also many other applications – they pulled back from fixing many small, unconnected bugs to develop a company-wide strategy and roll out fixes across all engineering teams. The Microsoft speaker echoed the same concerns and approaches.

In my talk, I expressed the hope that Google and MS and others would collaborate to develop synergies and standards wherever feasible. Microsoft also called for a standard approach, rather than in-house, proprietary solutions, to ensure interoperability.

However, progress is slow because changes need to be made in so many places, not just the email client.

Google expects to have some support for international email addresses this summer. You won't be able to sign up for Arabic/Chinese/etc email addresses yet, but you will be able to use Gmail to communicate with users on other providers who have internationalized addresses. Full implementation will take a little longer because there's no real way to test things without raising inappropriate user expectations if the system is live.

SaudiNIC has been running Arabic emails for some time, but it's a home-grown and closed system – they created their own protocols, because there were no IETF protocols at the time – the addresses are actually converted to punycode for transmission, but displayed as Arabic to the user (http://nic.sa).

Google uses system information about language preferences of the user to determine whether or not to display the IDN rather than punycode in Chrome's address bar, but this could cause problems for people using a shared computer, for example in an internet café, a conference laptop, etc. They are still worrying about users' reactions if they can't read/display an email address in non-ASCII script. For email, currently they're leaning towards just always showing the Unicode version, with the caveat that they will take a hard line on mixed script (other than something mixed with ASCII) where they may just reject the mail.

A trend to note is a growing number of redirects from IDN to ASCII, eg. the http://правительство.рф page shows http://government.ru in the address bar when you reach the site.

Other observations

All the Arabic email addresses I saw were shown fully right to left, ie. <tld><domain>@<username>. I wonder whether this may dislodge some of the hesitation in the IETF about the direction in which web addresses should be displayed – perhaps they should therefore also flow right-to-left? (Especially if people write domain names without http://, which these guys seem to think they will.)

Many of the people in the room wanted to dispense with the http:// for display of web addresses, to eliminate the ASCII altogether, and also to get rid of www. The problem is how to identify the string as a domain name – is the dot sufficient? We saw some examples of this, but they had something like "see this link" alongside.

By the way, Google is exploring the idea of showing the user, by default, only the domain name of a URL in future versions of the Chrome browser address bar. A Google employee at the workshop said "I think URLs are going away as far as something to be displayed to users – the only thing that matters is the domain name … users don't understand the rest of the URL". I personally don't agree with this.

One participant proposed that government mandates could be very helpful in encouraging adaptation of technologies to support international domain names.

My comments

I gave a talk and was on a panel. Basically my message was:

Most of the technical development for IDNs and IRIs was done at the IETF and the Unicode Consortium, but with significant support from people involved in the W3C Internationalization Working Group. Although the W3C hasn't been leading this work, it is interested in understanding the issues and providing support where appropriate. We are, however, also interested in wider issues surrounding the full path name of the URL (not just the domain name), 3rd level domain labels, fragment identifiers, IRI vs punycode for domain name escaping, etc. We also view domain names as general resource identifiers (eg. for use in linked data), not just for a web presence and marketing.

I passed on a message that groups such as the Wikimedia folks I met with in Madrid the week before are developing a very wide range of fonts and input mechanisms that may help users input non-Latin IDs on terminals, mobile devices and such like, especially when travelling abroad. It's something to look into. (For more information about Wikimedia's jQuery extensions, see here and here.)

I mentioned bidi issues related to both the overall direction of Arabic/Hebrew/etc URLs/domain names, and the more difficult question of how to handle mixed-direction text, which can make the logical http://www.oman/muscat render to the user as http://www.muscat/oman when 'muscat' and 'oman' are in Arabic, due to the default properties of the Unicode bidi algorithm. Community guidance would be a help in resolving this issue.

I said that the W3C is all about getting people together to find interoperable solutions via consensus, and that we could help with networking to bring the right people together. I'm not proposing that we should take on ownership of the general problem of Universal Acceptance, but I did suggest that if they can develop specific objectives for a given aspect of the problem, and identify a natural community of stakeholders for that issue, then they could use our Community Groups to give some structure to and facilitate discussions.

I also suggested that we all engage in grass-roots lobbying, requesting that service/tool providers allow us to use IDNs.

Conclusions

At the end of the first day, Don Hollander summed up what he had gathered from the presentations and discussions as follows:

People want IDNs to work, they are out there, and they are not going away. Things don't appear quite so dire as he had previously thought, given that browser support is generally good, closed email communities are developing, and search and indexing works reasonably well. Also Google and Microsoft are working on it, albeit perhaps slower than people would like (but that's because of the complexity involved). There are, however, still issues.

The question is how to go forward from here. He asked whether APTLD should coordinate all communities at a high level with a global alliance. After comments from panelists and participants, he concluded that APTLD should hold regular meetings to assess and monitor the situation, but should focus on advocacy. The objective would be to raise visibility of the issues and solutions. "The greatest contribution from Google and Microsoft may be to raise the awareness of their thousands of geeks." ICANN offered to play a facilitation role and to generate more publicity.

One participant warned that we need a platform for forward motion, rather than just endless talking. I also said that in my panel contributions. I was a little disappointed (though not particularly surprised) that APTLD didn't try to grasp the nettle and set up subcommittees to bring players together to take practical steps to address interoperable solutions, but hopefully the advocacy will help move things forward and developments by companies such as Google and Microsoft will help start a ball rolling that will eventually break the deadlock.

I've been trying to understand how web pages need to support justification of Arabic text, so that there are straight lines down both left and right margins.

The following is an extract from a talk I gave at the MultilingualWeb workshop in Madrid at the beginning of May. (See the whole talk.) It's very high level, and basically just draws out some of the uncertainties that seem to surround the topic.

Let's suppose that we want to justify the following Arabic text, so that there are straight lines at both left and right margins.

Arabic justification #1

Unjustified Arabic text

Generally speaking, received wisdom says that Arabic does this by stretching the baseline inside words, rather than stretching the inter-word spacing (as would be the case in English text).

To keep it simple, let's just focus on the top two lines.

One way you may hear that this can be done is by using a special baseline extension character in Unicode, U+0640 ARABIC TATWEEL.
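
To make that concrete, here is a hypothetical example using the word كتاب (book): inserting three tatweel characters between the taa and the alef stretches the joining stroke.

كتاب → كتـــاب
U+0643 U+062A U+0627 U+0628 → U+0643 U+062A U+0640 U+0640 U+0640 U+0627 U+0628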

Arabic justification #2

Justification using tatweels

The picture above shows Arabic text from a newspaper where we have justified the first two lines using tatweels in exactly the same way it was done in the newspaper.

Apart from the fact that this looks ugly, one of the big problems with this approach is that there are complex rules for the placement of baseline extensions. These include:

  • extensions can only appear between certain characters, and are forbidden around other characters
  • the number of allowable extensions per word and per line is usually kept to a minimum
  • words vary in appropriateness for extension, depending on word length
  • there are rules about where in the line extensions can appear – usually not at the beginning
  • different font styles have different rules

An ordinary web author who is trying to add tatweels to manually justify the text may not know how to apply these rules.

A fundamental problem on the Web is that when text size or font is changed, or a window is stretched, etc, the tatweels will end up in the wrong place and cause problems. The tatweel approach is of no use for paragraphs of text that will be resized as the user stretches the window of a web page.

In the next picture we have simply switched to a font in the Naskh style. You can see that the tatweels applied to the word that was previously at the end of the first line now make the word too long to fit there. The word has wrapped to the beginning of the next line, and we have a large gap at the end of the first line.

Arabic justification #3

Tatweels in the wrong place due to just a font change

To further compound the difficulties mentioned above regarding the rules of placement for extensions, each different style of Arabic font has different rules. For example, the rules for where and how words are elongated are different in the Nastaliq version of the same text which you can see below. (All the characters are exactly the same, only the font has changed.) (See a description of how to justify Urdu text in the Nastaliq style.)

Arabic justification #4: Nastaliq
Same text in the Nastaliq font style

And fonts in the Ruqah style never use elongation at all. (We'll come back to how you justify text using Ruqah-style fonts in a moment.)

Arabic justification #5: Ruqah
Same text in the Ruqah font style

In the next picture we have removed all the tatweel characters, and we are showing the text using a Naskh-style font. Note that this text has more ligatures on the first line, so it is able to fit in more of the text on that line than the first font we saw. We'll again focus on the first two lines, and consider how to justify them.

Arabic justification #6: Naskh
Same text in the Naskh font style

High end systems have the ability to allow relevant characters to be elongated by working with the font glyphs themselves, rather than requiring additional baseline extension characters.

Arabic justification #7: kashida elongation
Justification using letter elongation (kashida)

In principle, if you are going to elongate words, this is a better solution for a dynamic environment. It means, however, that:

  1. the rules for applying the right-sized elongations to the right characters have to be applied at runtime by the application and font working together, and as the user or author stretches the window, changes font size, adds text, etc, the location and size of elongations need to be reconfigured
  2. there needs to be some agreement about what those rules are, or at least a workable set of rules for an off-the-shelf, one-size-fits-all solution.

The latter is the fundamental issue we face. There is very little high-quality information available about how to do this, and a lack of consensus not only about what the rules are, but also about how justification should be done.

Some experts will tell you that text elongation is the primary method for justifying Arabic text (for example), while others will tell you that inter-word and intra-word spacing (where there are gaps in the letter-joins within a single word) should be the primary approach, and kashida elongation may or may not be used in addition where the space method is strained.

Arabic justification #8: space based

Justification using inter-word spacing

The space-based approach, of course, makes a lot of sense if you are dealing with fonts of the Ruqah style, which do not accept elongation. However, the fact that the rules for justification need to change according to the font that is used presents a new challenge for a browser that wants to implement justification for Arabic. How does the browser know the characteristics of the font being used and apply different rules as the font is changed? Fonts don't currently indicate this information.

Looking at magazines and books on a recent trip to Oman I found lots of justification. Sometimes the justification was done using spaces, other times using elongations, and sometimes there was a mixture of both. In a later post I'll show some examples.

By the way, for all the complexity described so far, this is still quite a simplistic overview of what's involved in Arabic justification. For example, high-end systems that justify Arabic text also allow the typesetter to adjust the length of a line of text by manual adjustments that tweak such things as alternate letter shapes, various joining styles, different lengths of elongation, and discretionary ligation forms.

The key messages:

  1. We need an Arabic Layout Requirements document to capture the script needs.
  2. Then we need to figure out how to adapt Open Web Platform technologies to implement the requirements.
  3. To start all this, we need experts to provide information and develop consensus.

Any volunteers to create an Arabic Layout Requirements document? The W3C would like to hear from you!

When it comes to wrapping text at the end of a line in a web page, there are some special rules that should be applied if you know the language of the text is either Chinese or Japanese (ie. if the markup contains a lang attribute to that effect).

The CSS3 Text module attempts to describe these rules, and we have some tests to check what browsers currently do for Japanese and Chinese.
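
As a rough illustration of the kind of language-sensitive behaviour involved, the CSS3 Text module defines a line-break property that a stylesheet could apply selectively using the markup's language information (a sketch only; value names and browser support vary):

:lang(ja), :lang(zh) {
  line-break: strict;  /* apply the stricter set of line-breaking restrictions */
  }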

There's an open question in the editor's draft about whether Korean has any special behaviours that need to be documented in the spec, when the markup uses lang to identify the content as Korean.

If you want to provide information, take a look at what's in the CSS3 Text module and write to www-international@w3.org and copy public-i18n-cjk@w3.org.

If you put a span tag around one or two letters in an Arabic word, say to change the colour, it breaks the cursiveness in WebKit and Blink browsers. You can change things like colour in Mozilla and IE, but changing the font breaks the connection.
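
Here is a minimal test case of the kind of markup in question: the middle letters of the word العربية (chosen here just for illustration) are wrapped in a span to change their colour, and in a correct rendering the letters should continue to join across the span boundaries.

<p dir="rtl" lang="ar">ال<span style="color: #c00">عرب</span>ية</p>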

Breaking on colour change makes it hard to represent educational texts and things such as the Omantel logo, which I saw all over Muscat recently. (Omantel is the largest internet provider in Oman.) Note how, despite the colour change, the Arabic letters in the logo below (on the left) still join.

Picture of the Omantel logo.
Multi-coloured Omantel Arabic logo on a building in Muscat.

Here's an example of an educational page that colours parts of words. You currently have to use Firefox or IE to get the desired effect.

This led to questions about what to do if you convert block elements, such as li, into inline elements that sit side by side. You probably don't want the character at the end of one li tag to join with the next one. And if there is padding or margins between them, should this cause bidi isolation as well as preventing joining behaviour?

See a related thread on the W3C Internationalization and CSS lists.

Following up on a suggestion by Nathan Hill of SOAS, I added a la-swe glyph to the default view of the picker alongside the medial consonants. If you click on it, it produces U+1039 MYANMAR SIGN VIRAMA + U+101C MYANMAR LETTER LA.

I also rearranged the font pull-down list a little, adding information about what fonts are available on your Mac OS X or Windows7 system, and added a placeholder, like I did recently for the Khmer picker.

You can find the Myanmar picker at https://r12a.github.io/pickers/burmese/

Picture of the page in action.

This kind of list could be used to set font-family styles for CSS, if you want to be reasonably sure what the user will see, or it could be used just to find a font you like for a particular script.

I've updated the page to show the fonts added in Windows8. This is the list:

  • Aldhabi (Urdu Nastaliq)
  • Urdu Typesetting (Urdu Nastaliq)
  • Gadugi (Cherokee/Unified Canadian Aboriginal Syllabics)
  • Myanmar Text (Myanmar)
  • Nirmala UI (10 Indic scripts)

There were also two additional UI fonts for Chinese, Jhenghei UI (Traditional) and Yahei UI (Simplified), which I haven't listed. Also Microsoft Uighur acquired a bold font.

>> See the list

See the blog post for the first version or the page for more information.

Update, 25 Jan 2013

Patrick Andries pointed out that Tifinagh using the Windows Ebrima font was missing from the list. Not any more.

Following up on a very good suggestion by Roger Sperberg, I added two webfonts to the Khmer picker and arranged the font selection list so that you can see which fonts are available on your Mac OS X or Windows7 system.

The webfonts make it possible to use the picker on an iPad or other device that doesn't have a Khmer font installed. I added two webfonts because one worked on my iPad and the other didn't, and it was vice versa on my Snow Leopard Macbook.

I also added an HTML5 placeholder for the output box. (I'm wishing you could style that differently from the standard content – and wishing that markup designers would think about this sort of thing and stop using attributes for natural language text…).

You can find the Khmer picker at https://r12a.github.io/pickers/khmer/

Picture of the page in action.

>> Use UniView

The main addition in this version is a couple of buttons that appear when you ask UniView to display a block.

Clicking on Show annotated list generates a list of all characters in the block, with annotations.

Clicking on Show script links displays a list of links to key sources of information about the script of the block, links to relevant articles and apps on the r12a.github.io site, and related fonts and input methods. This provides a very quick way of finding this information. One particularly useful link ('Historical documentation', which links to a Scriptsource.org page) allows you to find the proposals for all additions to Unicode related to the relevant script. These proposals are a mine of useful information about the individual characters in a block, and SIL staff should get a medal for trawling through all the relevant data to provide this.

In addition, there were some changes to the user interface, including the following:

  • The order of information in the lower right panel (detailed character information) was slightly changed, and two alternative representations of the character were added: an HTML escape and a URI escape.
  • The search box at the top left was constrained to appear closer to the other controls when the window is stretched wide.

Various bugs were also fixed.

>> Use it

This HTML page allows you to expand information in the lines of the UnicodeData.txt file, edit them and generate a new version. It also checks the data for validity in a number of areas.

It can be helpful if you have the misfortune to pore over the source code of the UnicodeData.txt file and find your eyes blurring as you count fields. And it is particularly useful for people submitting proposals for new scripts or characters to the Unicode Consortium, to help them generate correct lists of Unicode properties for inclusion in the proposal. (You can even build the whole thing in the UI, error-free, by starting with a number of blank lines, such as 1111;NAME;;;;;;;;;;;;;.)
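
For reference, here is a real line from the file, with its 15 semicolon-separated fields (code point, name, general category, canonical combining class, bidi category, decomposition, three numeric fields, mirrored flag, Unicode 1 name, comment, and the uppercase, lowercase and titlecase mappings):

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;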

The image below shows the page in action. I dropped in a couple of lines from the Ahom script proposal, and vandalised them slightly. The first panel shows that the app has spotted an error. I used the column to the right to edit out the error in the second panel, and regenerated the lines in the box below.

Picture of the page in action.

Having made edits, you can copy and paste the output back into the top box to send it through the sausage machine again, and check that there are no remaining errors.

You can add a whole script block at a time to the top box, or a single line – as you like.

Well, it's a bit esoteric, but hopefully it will be useful to someone somewhere.

This is a note to myself, so I remember what I did today.

I'd like to let people know when I have uploaded new photos to Flickr, but I've been relying so far on IFTTT, and the results haven't been great. The notifications don't actually point at the Flickr site, and notifications are fired off every time I upload something – which is too often.

It looks like I'll need to manually fire off notifications, but I don't want to have to do it separately for Twitter, Facebook and Google+. However I have set things up so that Google+ posts (public ones) go to Twitter, and Twitter posts go to Facebook.

Then the question was how to set up the Google+ post. Here's the sequence of events, starting with the manual bit where I create a post in Google+:

[1] Add a blurb to the Google+ post, add a link using the link button to the set's URI, and delete the description (leave the photo). (It didn't work to link to a photo in lightview – you just get a link to the photo page.)

Picture of desktop google+ post.

This produces a slightly different result on my iPad and HTC, with two photos shown and 3 more accessible via an expansion button.

Picture of mobile google+ post.

[2] This produces an impressive tweet on a desktop machine, if you click on "Show photo", that includes a link to a slide show. It also has links to many, but not all, of the photos individually. That was a bit of a surprise.

Picture of twitter post.

On a mobile device, no photo is shown in the twitter stream. You just get the link (much like in [4] below).

[3] Not all of that gets forwarded to Facebook. Just the principal photo, and the blurb, but it also pulls the description back in from the flickr site.

Picture of Facebook post.

[4] And finally, this is what I see in the right column of my blog.

Picture of blog page twitter listing.

Characters in the Unicode Balinese block.

I just uploaded an initial draft of an article Balinese Script Notes. It lists the Unicode characters used to represent Balinese text, and briefly describes their use. It starts with brief notes on general script features and discussions about which Unicode characters are most appropriate when there is a choice.

The script type is abugida – consonants carry an inherent vowel. It's a complex script derived from Brahmi, and has lots of contextual shaping and positioning going on. Text runs left-to-right, and words are not separated by spaces.

I think it's one of the most attractive scripts in Unicode, and for that reason I've been wanting to learn more about it for some time now.

>> Read it

Picture of the page in action.

>> Use it

This picker contains characters from the Unicode Balinese block needed for writing the Balinese language. Characters needed for Sasak are also available in the Advanced section. Balinese musical notation characters are not included.

About the tool: Pickers allow you to quickly create phrases in a script by clicking on Unicode characters arranged in a way that aids their identification. Pickers are likely to be most useful if you don't know a script well enough to use the native keyboard. The arrangement of characters also makes it much more usable than a regular character map utility.

About this picker: Characters are grouped to aid input. The consonant block includes characters needed for Kawi and Sanskrit as well as the native Balinese characters, all arranged according to the Brahmi pronunciation grid.

The picker has only a default view and a font grid view. It's difficult to put in the time for the shape-based, keyboard-based, and various transcription-based views found in some other pickers. In a new departure, however, I have included a list of Latin characters on the default view to assist in writing transcriptions alongside Balinese text.

There is, however, a significant issue with this picker, due to the lack of support for Balinese as a script in computers. The only Unicode-based Balinese font I know of is Aksara Bali, but that font seems to only work as expected in Firefox on Mac OS X. Furthermore, the Aksara Bali font doesn't handle ra repa as described in the Unicode Standard. The sequence <consonant , adeg-adeg, ra repa> produces a visible adeg-adeg, rather than the post-fixed form of ra repa. The sequence <consonant , vowel sign ra repa> produces the post-fixed form of ra repa, rather than the subjoined form. You can produce the post-fixed form with this font by using <consonant , vowel sign ra repa> and the subjoined form by using <consonant , adeg-adeg, ra, pepet>, but these sequences will produce content that cannot be matched against sequences using the Unicode approach, and content that may fail with other Unicode-compliant fonts in the future.

Hopefully some new, fully Unicode-compliant fonts will come along soon. This is one of the most beautiful scripts I have come across.

(Btw, I'm working on a set of notes for Balinese characters, linked from UniView, with some feature innovations to get around the font issue. Look out for that later. And I'm thinking I should develop a Javanese picker to go with this one. Just need a bit of time…)

For the curious, here's the first article of the Universal Declaration of Human Rights, as typed in the Balinese picker. Translation by Tri Ediwan (reproduced from Omniglot).

A translate attribute was recently added to HTML5. At the three MultilingualWeb workshops we have run over the past two years, the idea of this kind of 'translate flag' has constantly excited strong interest from localizers, content creators, and from folks working with language technology.

How it works

Typically authors or automated script environments will put the attribute in the markup of a page. You may also find that, in industrial translation scenarios, localizers may add attributes during the translation preparation stage, as a way of avoiding the multiplicative effects of dealing with mistranslations in a large number of languages.

There is no effect on the rendered page (although you could, of course, style it if you found a good reason for doing so). The attribute will typically be used by workflow tools when the time comes to translate the text – be it by the careful craft of human translators, or by quick gist-translation APIs and services in the cloud.

The attribute can appear on any element, and it takes just two values: yes or no. If the value is no, translation tools should protect the text of the element from translation. The translation tool in question could be an automated translation engine, like those used in the online services offered by Google and Microsoft. Or it could be a human translator's 'workbench' tool, which would prevent the translator inadvertently changing the text.

Setting this translate flag on an element applies the value to all contained elements and to all attribute values of those elements.

You don't have to use translate="yes" for this to work. If a page has no translate attribute, a translation system or translator should assume that all the text is to be translated. The yes value is likely to see little use, though it could be very useful if you need to override a translate flag on a parent element and indicate some bits of text that should be translated. You may want to translate the natural language text in examples of source code, for example, but leave the code untranslated.
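
For example, a hypothetical snippet like this keeps the code sample itself untranslated while still letting the embedded natural-language comment be translated:

<pre translate="no">
var msg = "Hello";  <span translate="yes">// greeting shown to the user</span>
</pre>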

Why it is needed

You come across a need for this quite frequently. There is an example in the HTML5 spec about the Bee Game. Here is a similar, but real example from my days at Xerox, where the documentation being translated referred to a machine with text on the hardware that wasn't translated.

<p>Click the Resume button on the Status Display or the
<span class="panelmsg" translate="no">CONTINUE</span> button
on the printer panel.</p>

Here are a couple more (real) examples of content that could benefit from the translate attribute. The first is from a book, quoting a title of a work.

<p>The question in the title <cite translate="no">How Far Can You Go?</cite> applies to both the undermining of traditional religious belief by radical theology and the undermining of literary convention by the device of "breaking frame"...</p>

The next example is from a page about French bread – the French for bread is 'pain'.

<p>Welcome to <strong translate="no">french pain</strong> on Facebook. Join now to write reviews and connect with <strong translate="no">french pain</strong>. Help your friends discover great places to visit by recommending <strong translate="no">french pain</strong>.</p>

So adding the translate attribute to your page can help readers better understand your content when they run it through automatic translation systems, and can save a significant amount of cost and hassle for translation vendors with large throughput in many languages.

What about Google Translate and Microsoft Translator?

Both the Google and Microsoft online translation services already provide the ability to prevent translation of content by adding markup to your content, although they do it in (multiple) different ways. Hopefully, the new attribute will help significantly by providing a standard approach.

Both Google and Microsoft currently support class="notranslate", but replacing a class attribute value with an attribute that is a formal part of the language makes this feature much more reliable, especially in wider contexts. For example, a translation prep tool would be able to rely on the meaning of the HTML5 translate attribute always being what is expected. Also it becomes easier to port the concept to other scenarios, such as other translation APIs or localization standards such as XLIFF.
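
In other words, where you previously needed the vendor-supported class, you can now use the standard attribute; both are shown here for comparison:

<!-- vendor-specific approach supported by Google and Microsoft -->
<span class="notranslate">french pain</span>

<!-- standard HTML5 approach -->
<span translate="no">french pain</span>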

As it happens, the online service of Microsoft (who actually proposed a translate flag for HTML5 some time ago) already supported translate="no". This, of course, was a proprietary extension until now, and Google didn't support it. However, just yesterday morning I received word, by coincidence, that Webkit/Chromium has just added support for the translate attribute, and yesterday afternoon Google added support for translate="no" to its online translation service. See the results of some tests I put together this morning. (Neither yet supports the translate="yes" override.)

In these proprietary systems, however, there are a good number of other non-standard ways to express similar ideas, even just sticking with Google and Microsoft.

Microsoft apparently supports style="notranslate". This is not one of the options Google lists for their online service, but on the other hand they have things that are not available via Microsoft's service.

For example, if you have an entire page that should not be translated, you can add <meta name="google" value="notranslate"> inside the head element of your page and Google won't translate any of the content on that page. (However they also support <meta name="google" content="notranslate">.) This shouldn't be Google specific, and a single way of doing this, ie. translate="no" on the html tag, is far cleaner.
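
With the attribute, the whole-page case reduces to a single declaration on the root element:

<html translate="no">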

It's also not made clear, by the way, when dealing with either translation service, how to make sub-elements translatable inside an element where translate has been set to no – which may sometimes be needed.

As already mentioned, the new HTML5 translate attribute provides a simple and standard feature of HTML that can replace and simplify all these different approaches, and will help authors develop content that will work with other systems too.

Can't we just use the lang attribute?

It was inevitable that someone would suggest this during the discussions around how to implement a translate flag; however, overloading language tags is not the solution. For example, a language tag can indicate which text is to be spellchecked against a particular dictionary. This has nothing to do with whether that text is to be translated or not. They are different concepts. In a document that has lang="en" in the html header, if you set lang="notranslate" lower down the page, that text will now not be spellchecked, since the language is no longer English. (Nor, for that matter, will styling work, voice browsers pronounce correctly, etc.)
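
To see how the two concepts combine, consider a sketch like this: the lang attributes tell spellcheckers and voice browsers which language each part of the text is in, while the translate attribute separately says that the quoted title should be left alone during translation.

<p lang="fr">Il a aimé le livre <cite lang="en" translate="no">How Far Can You Go?</cite>.</p>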

Going beyond the translate attribute

The W3C's ITS (Internationalization Tag Set) Recommendation proposes the use of a translate flag such as the attribute just added to HTML5, but also goes beyond that in describing a way to assign translate flag values to particular elements or combinations of markup throughout a document or set of documents. For example, you could say, if it makes sense for your content, that by default, all p elements with a particular class name should have the translate flag set to no for a specific set of documents.
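
As a rough sketch (the class name is invented), an ITS rules file expressing that kind of document-wide default looks something like this:

<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its">
  <its:translateRule selector="//p[@class='legalnotice']" translate="no"/>
</its:rules>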

Microsoft offers something along these lines already, although it is much less powerful than the ITS approach. If you use <meta name="microsoft" content="notranslateclasses myclass1 myclass2" /> anywhere on the page (or as part of a widget snippet) it ensures that any of the CSS classes listed following "notranslateclasses" should behave the same as the "notranslate" class.

Microsoft's and Google's translation engines also don't translate content within code elements. Note, however, that you don't seem to have any choice about this – there don't seem to be instructions about how to override this if you do want your code element content translated.

By the way, there are plans afoot to set up a new MultilingualWeb-LT Working Group at the W3C in conjunction with a European Commission project to further develop ideas around the ITS spec, and create reference implementations. They will be looking, amongst many other things, at ways of integrating the new translate attribute into localization industry workflows and standards. Keep an eye out for it.

  1. Arle Lommel Says:

    Thanks Richard for posting this explanation. I think this is good news, and not just for online MT (as you note). So much content that goes through human workflows is in HTML (or XHTML) format, and so far what to translate has, for the most part, been left up to the translator's discretion. While translators will generally get it right (which they can do because they are able to infer a lot about texts) they do sometimes get it wrong. Having this notion of a translate directive embedded at a fundamental level into core technologies is a major coup for the translation and localization industry because it will raise awareness and simplify processes.

    I'm excited by the news that so many tools are already picking up on this new attribute. I've seen far too many instances where good ideas languish for years without adoption. I think this success vindicates the agile approach we anticipate for MultilingualWeb-LT of picking small, manageable problems and addressing them with simple solutions. While there is a place and need for mega-standards like XLIFF, TMX, TBX, etc., I think the future of standards will increasingly revolve around small standards that work together to create a whole greater than the sum of their parts. What's fantastic about this approach is that developers will be able to implement the parts that matter to them without the overhead of implementing complex formats of limited value.

  2. HTML5 Translatability Attribute: Don't Translate, Won't Translate Revisited | Blogos Says:

    […] attribute that allows fine-grained control over what HTML content should be translated, or not. Richard Ishida of the W3C has all the details of the attribute and its applicability, as well as some interesting insight […]

Picture of the page in action.

I've wanted to get around to this for years now. Here is a list of fonts that come with Windows7 and Mac OS X Snow Leopard/Lion, grouped by script.

This kind of list could be used to set font-family styles for CSS, if you want to be reasonably sure what the user will see, or it could be used just to find a font you like for a particular script. I'm still working on the list, and there are some caveats.
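
As a sketch (the font names here are from memory; check the list itself for the fonts you can actually rely on), a style for Tibetan text might name one font from each platform before falling back to a generic family:

.tibetan {
  font-family: "Microsoft Himalaya", Kailasa, sans-serif;
  }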

>> See the list

Some of the fonts listed above may be disabled on the user's system. I'm making an assumption that someone who reads Tibetan will have the Tibetan font turned on, but for my articles that explain writing systems to people in English, such assumptions may not hold.

The list I used to identify Windows fonts is Windows7-specific and fairly stable, but the Mac font list spans more than one version of Mac OS X: I could only find an unofficial list of fonts for Snow Leopard, and there were some fonts on that list that I didn't have on my system. Where a Mac font is new with Lion (and there are a significant number) it is indicated. See the official list of fonts on Mac OS X Lion.

There shouldn't be any fonts listed here for a given script that aren't supplied with Windows7 or Mac OS X Snow Leopard/Lion, but there are probably supplied fonts that are not yet listed here (typically these will be large fonts that cover multiple scripts). In particular, note that I haven't yet made a list of fonts that support Latin, Greek and Cyrillic (mainly because there are so many of them, and partly because I'm wondering how useful it would be).

The text used is as much as would fit on one line of article 1 of the Universal Declaration of Human Rights, taken from this Unicode page, wherever I could find it. I created a few instances myself, where it was missing, and occasionally I resorted to arbitrary lists of characters.

You can obtain a character-based version of the text used by looking at the source text: look for the title attribute on the section heading.

Things still to do:

  • create sections for Latin, Greek and Cyrillic fonts
  • check for fonts covering multiple Unicode blocks
  • figure out how to tell, and how to show which is the system default
  • work out and show what's not available in Windows XP
  • work out what's new in Lion, and whether it's worth including them
  • figure out whether people with different locale setups see different things
  • recapture all font images that need it at 36px, rather than varying sizes

Update, 19 Feb 2012

I uploaded a new version of the font list with the following main changes:

  • If you click on an image you see text with that font applied (if you have it on your system, of course). The text can be zoomed from 14px to 100px (using a nice HTML5 slider, if you have the right browser! [try Chrome, Safari or Opera]). This text includes a little Latin text so you can see the relationship between that and the script.
  • All font graphics are now standardised so that text is imaged at a font size of 36px. This makes it more difficult to see some fonts (unless you can use the zoom text feature), but gives a better idea of how fonts vary in default size.
  • I added a few extra fonts which contained multiple script support.
  • I split Chinese into Simplified and Traditional sections.
  • Various other improvements, such as adding real text for N'Ko, correcting the Traditional Chinese text, flipping headers to the left for RTL fonts, reordering fonts so that similar ones are near to each other, etc.

Picture of the page in action.

>> Use UniView

The major change in this update is the update of the data to support Unicode version 6.1.0, which should be released today. (See the list of links to new Unicode blocks below.)

There are also a number of feature and bug related changes.

What UniView does: Look up and see characters (using graphics or fonts) and property information, view whole character blocks or custom ranges, select characters to paste into your document, paste in and discover unknown characters, search for characters, do hex/dec/ncr conversions, highlight character types, etc. etc. Supports Unicode 6.1 and written with Web Standards to work on a variety of browsers. No need to install anything.

List of changes:

  • One significant change enables you to display information in a separate window, rather than overwriting the information currently displayed. This can be done by typing/pasting/dragging a set of characters or character code values into the new Popout area and selecting the  icon alongside the Characters or Copy & paste input fields (depending on what you put in the popout window).

  • Two new icons were added to the Copy & paste area:

    Analyse Clicking on this will display the characters in the area in the lower right part of the page with all relevant characters converted to uppercase, lowercase and titlecase. Characters that had no case conversion information are also listed.

    Analyse Clicking on this produces the same kind of output as clicking on the icon just above, but shows the mappings for those characters that have been changed, eg. e→E.

  • Where character information displayed in the lower right panel has a case or decomposition mapping, UniView now displays the characters involved, rather than just giving the hex value(s), eg. Uppercase mapping: 0043 C. You will need a font on your system to see the characters displayed in this way, but whether or not you have a font, this provides a quick and easy way to copy the case-changed character (rather than having to copy the hex value and convert it first).

  • There is also a new line, slightly further down, when UniView is in graphic mode. This line starts with 'As text:', and shows the character using whatever default font you have on your system. Of course, if you don't have a font that includes that character you won't see it. This has been added to make it easier to copy and paste a character into text.


  • Fixed some small bugs, such as problems with search when U+29DC INCOMPLETE INFINITY is returned.

Enjoy.

Here are direct links to the new blocks added to Unicode 6.1:

These are notes on using CSS @font-face to gain finer control over the fonts applied to characters in particular Unicode ranges of your text, without resorting to additional markup. Possibilities and problems.

Changing the font used for certain characters

Most non-English fonts mix glyphs from different writing systems. Usually the font contains glyphs for Latin characters plus a non-Latin script, for example English+Japanese, or English+Thai, etc.

Normally the font designer will take care to harmonise the Latin script glyphs with the non-Latin, but there may be cases where you want to change the design of the glyphs for, say, an embedded script without adding markup to your page.

For example, if I apply the MS-Mincho font to some content in Japanese with embedded Latin text I'm likely to see the following:

Let's suppose I'd like the English text to appear in a different, proportionally-spaced font. I could put markup around the English and set a class on the markup to apply the font I want, but this is very time-consuming and bloats your code.

An alternative is to use @font-face. Here is an example:

@font-face { 
  font-family: myJapanesefont;
  src: local(MS-Mincho);
  }
@font-face {
  font-family: myJapanesefont;
  src: local(Gentium);
  unicode-range: U+41-5A, U+61-7A, U+C0-FF;
  }
p { 
  font-family: myJapanesefont; 
  }

Note: When specifying src the local() keyword indicates that font-face should look for the font on the user's system. Of course, to improve interoperability, you may want to specify a number of alternatives here, or a downloadable WOFF font. The most interoperable value to use for local() is the Postscript name of the font. (On the Mac open Font Book, select the font, and choose Preview > Show Font Information to find this.)

The result would be:

The first font-face declaration associates the MS-Mincho font with the name 'myJapanesefont'. The second font-face declaration associates the Gentium font with the Unicode code points in the Latin-1 letter range (of course, you can extend this if you use Latin characters outside that range and they are covered by the font).

Note how I was careful to set the unicode-range values to exclude punctuation (such as the exclamation mark) that would be used by (and harmonised with) the Japanese characters.

Adding support for new characters to a font

You can use the same approach for fonts that don't have support for a particular Unicode range.

For example, the Nafees Nastaliq font has no glyphs for the Latin range (other than digits), so the browser falls back to the system default.

With the following code, I can produce a more pleasant font for the 'W3C' part:

@font-face {
  font-family: myUrduFont;
  src: local(NafeesNastaleeq);
  }
@font-face {
  font-family: myUrduFont;
  src: local(BookAntiqua);
  unicode-range: U+30-FF;
  }
div p { 
  font-family: myUrduFont; 
  font-size: 60px;
  }

A big fly in the ointment

If you look at the ranges in the unicode-range value, you'll see that I kept to just the letters of the alphabet in the Japanese example, and the missing glyphs in the Urdu case.

There are a number of characters that are used by all scripts, however, and these cause problems because you can't apply fonts based on the context – even if you could work out what that context was.

In the case of the Japanese example above, numbers are left to be rendered by the Mincho font, but when those characters appear in the Latin text, they look incorrectly sized. Look, for example, at the 3 in W3C below.

The same problem arises with spaces and punctuation marks. The exclamation mark was left in the Mincho font in the Japanese example because, in this case, it is part of the Japanese text. Punctuation of this kind could, however, be associated with the Latin text. See the question mark in this example.

Even more problematic are the spaces in that example. They are too wide in the Latin text. In Urdu text we have the opposite problem: use Urdu space glyphs in Latin text and you don't see them at all (there should be a gap between W3C and i18n below).

With my W3C hat on, I start wondering whether there are any rules we can use to apply different glyphs for some characters depending on the script context in which they are used, but then I realise that this is going to bring in all the problems we already have for bidi text when dealing with punctuation or spaces between flows of text in different scripts. I'm not sure it's a tractable problem without resorting to markup to delimit the boundaries. But then, of course, we end up right back where we started.

So it seems, disappointingly, that the unicode-range property is destined to be of only limited usefulness for me. That's a real shame.

Another small issue

The examples don't show major problems, but I assume that sometimes the fonts you want to bring together using font-face will have very different aspect ratios, so you may need to use something like font-size-adjust to balance the size of the fonts being used.

Browser support

The CSS code above worked for me in Chrome and Safari on Mac OS X 10.6, but didn't work in Firefox or Opera. Nor did it work in IE9 on Windows7.

  1. Sylvain Galineau Says:

    Richard, any links to the actual testcases? I'd love to fix this! Thanks.

  2. r12a Says:

    I didnā€™t save the original tests, so I quickly recreated something you can use, which you can find below.

    Styles
    ------
    
    div.ja { font-family: "MS Mincho"; font-size: 60px; }
    div.ur { font-family: "Nafees", "Nafees Nastaleeq"; font-size: 60px; }
    
    @font-face {
      font-family: myJapanesefont;
      src: local(MS-Mincho);
      }
    @font-face {
      font-family: myJapanesefont;
      src: local(Gentium);
      unicode-range: U+41-5A, U+61-7A, U+C0-FF;
      }
    div.ja p.test {
      font-family: myJapanesefont;
      }
    @font-face {
      font-family: myUrduFont;
      src: local(NafeesNastaleeq);
      }
    @font-face {
      font-family: myUrduFont;
      src: local(BookAntiqua);
      unicode-range: U+30-FF;
      }
    div.ur p.test {
      font-family: myUrduFont;
      font-size: 60px;
      }
    
    
    
    Code
    ----
    
    <div class="ja">
       <p>今ꗄćÆAmĆ©lie!</p>
       <p class="test">今ꗄćÆAmĆ©lie!</p>
    
       <p>å›½éš›åŒ–ę“»å‹• W3C</p>
       <p class="test">å›½éš›åŒ–ę“»å‹• W3C</p>
     
       <p>"Comment Ƨa va?"ćØčØ€ć„ć¾ć—ćŸć€‚</p>
       <p class="test">"Comment Ƨa va?"ćØčØ€ć„ć¾ć—ćŸć€‚</p>
      </div>
    
    
    <div class="ur" dir="rtl">
       <p>Ų¶Ų§ŲØŲ·Ū Ł„Ų³Ų§Ł†ŪŒ Ų¹ŲÆŁ…ŪŒŲŖ ŲŒ W3C</p>
       <p class="test" >Ų¶Ų§ŲØŲ·Ū Ł„Ų³Ų§Ł†ŪŒ Ų¹ŲÆŁ…ŪŒŲŖ ŲŒ W3C</p>
    
        <p>Ų¶Ų§ŲØŲ·Ū Ł„Ų³Ų§Ł†ŪŒ Ų¹ŲÆŁ…ŪŒŲŖ ŲŒ W3C i18n</p>
       <p class="test">Ų¶Ų§ŲØŲ·Ū Ł„Ų³Ų§Ł†ŪŒ Ų¹ŲÆŁ…ŪŒŲŖ ŲŒ W3C i18n</p>
    </div>
    
    
  3. John Daggett Says:

    The ā€˜unicode-rangeā€™ descriptor for @font-face rules simply provides a mechanism for selectively downloading font data for a set of fonts that support a wide range of scripts. It is *not* a way of stitching together fonts and making them work well together. Thatā€™s the job of a good type designer!! There are lots of folks these days focused on problems of how to harmonize type systems across scripts. These sorts of designers could design a set of typefaces that would stitch together well.

    And if youā€™re trying to combine anything with MS Mincho, er, well, good luck and godspeedā€¦

  4. Karim Ratib Says:

    Thanks for the explanation of unicode-range. Today (Oct 2013) FF still does not support this descriptor. Iā€™ve written a jQuery plugin to emulate it: https://github.com/infojunkie/jquery-unicode-range.

    You can also vote on the FF feature request to bump it up: https://bugzilla.mozilla.org/show_bug.cgi?id=475891

    Cheers!

There appears to be some confusion about XHTML1.0 vs XHTML5. Here is my best shot at an explanation of what XHTML5 is.

* This post is written for people with some background in MIME types and html/xml formats. In case thatā€™s not you, this may give you enough to follow the idea: ā€˜served asā€™ means sent from a server to the browser with a MIME type declaration in the HTTP protocol header that says that the content of the page is HTML (text/html) or XML (eg. application/xhtml+xml). See examples and more explanations.

XHTML5 is an HTML5 document served as* application/xhtml+xml (or another XML mime type). The syntax rules for XHTML5 documents are simply those rules given by the XML specification. The vocabulary (elements and attributes) is defined by the HTML5 spec.

Anything served as text/html is not XHTML5.

Note that HTML5 (without the X) can be written in a style that looks like XML syntax. For example, using a / in empty elements (eg. <br/>), or using quotes around attributes. But code written this way is still HTML5, not XHTML5, if it is served as text/html.
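
For instance, the following markup is written in an XML-looking style (the file name is invented), but it is still HTML5, not XHTML5, as long as the page is served as text/html:

<img src="logo.png" alt="logo"/>
<br/>
<input type="text" value="still HTML5"/>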

There are normally other differences between HTML5 and XHTML5. For example, XHTML5 documents may have an XML declaration at the start of the document. HTML5 documents cannot have that. XHTML5 documents are likely to have a more complicated doctype (to facilitate XML processing). And XHTML5 documents will have an xmlns attribute on the html tag. There are a few other HTML5 features that are not compatible with XML, and must be avoided.
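
As a rough sketch (the title and content are invented), an XHTML5 document might therefore look like this, and be served as application/xhtml+xml:

<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>An XHTML5 document</title>
  </head>
  <body>
    <p>Parsed by the browser's XML parser.</p>
  </body>
</html>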

Similar differences existed between HTML 4.01 and XHTML 1.0. However, moving on from XHTML 1.0 will typically involve a subtle but significant shift in thinking. You might have written XHTML 1.0 with no intention of serving it as anything other than text/html. XHTML in the XHTML 1.0 sense tended to be seen largely as a difference in syntax; it was originally designed to be served as XML, but (with some customisations to suit HTML documents) could be, and usually was, served with an HTML mime type. XHTML in the XHTML5 sense means HTML5 documents served with an XML mime type (and appropriate customisations to suit XML documents), ie. itā€™s the MIME type, not the syntax, that makes it XHTML.

Which brings us to Polyglot documents. A polyglot document contains markup from the common subset of HTML5 and XML that can be processed as either HTML or XHTML, and can be served as either text/html or application/xhtml+xml, ie. as either HTML5 or XHTML5, without errors or warnings in either case. The polyglot spec defines the things that allow this compatibility (such as using no XML declaration, proper casing of element names, etc.) and the things to avoid. It also mandates at least one additional constraint, ie. disallowing UTF-16 encoded documents.
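
For illustration only (based on the constraints just listed rather than on the full polyglot spec), such a document might look like this:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <meta charset="UTF-8"/>
    <title>A polyglot document</title>
  </head>
  <body>
    <p>No XML declaration, lower-case element names, quoted attributes, UTF-8.</p>
  </body>
</html>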

  1. Ivan Herman Says:

    Richard,

so, if I am a client, how do I tell the difference between XHTML1.1 and XHTML5? Both return application/xhtml+xml. Is there a standard way to do that?

  2. Daniel Miguel Says:

    The bad news on html5 is that some people think they traveled back to 1997 and write smelly markup. That is where the xml specs should be the way to go. I write regular html5 markup, but keep the xml specs in mind. The source is much more readable and just more ā€¦ elegant šŸ™‚ One more benefit of this approach is that you can port to several xml formats very quickly.

    Nice article.

  3. Gagan Jain Says:

    Simple article but essential details are included for clarity. Makes a lot of sense to move to XHTML5 rather than HTML5.

  4. George Walsh Says:

    Because I am not concerned with backward compatibility and happily accept the discipline of application/xhtml+xml, I was wondering whether it would be proper to amend apacheā€™s HTTP_ACCEPT by removing ā€˜text/htmlā€™ and force serving everything as application/xhtml+xml. The idea is to ONLY accept that type (no fall-back) and have php notify the user agent accordingly. Failing that, I guess content negotiation is my only alternative, which means I do not understand what xhtml/xml must be defined.

    I am struggling for a controlled, disciplined environment at the outset. I have used xhtml1.1 in the past without problems. (X)HTML5 is definitely my goal.

    George

One of the more useful features of UniView is its ability to list the characters in a string with names and codepoints. This is particularly useful when you canā€™t tell what a string of characters contains because you donā€™t have a font, or because the script is too complex, etc.

'ishida' in Persian in a nastaliq font style

For example, I was recently sent an email where my name was written in Persian as Ų§ŪŒŲ“ŪŒā€ŒŲÆŲ§. The image shows how it looks in a nastaliq font.

To see the component characters, drop the string into UniViewā€™s Copy & Paste field and click on the downwards pointing arrow icon. Here is the result:

list of characters

Note how you can now see that thereā€™s an invisible control character in the string. Note also that you see a graphic image for each character, which is a big help if the string you are investigating is just a sequence of boxes on your system.

Not only can you discover characters in this way, but you can create lists of characters which can be pasted into another document, and customise the format of those lists.

Pasting the list elsewhere

If you select this list and paste it into a document, youā€™ll see something like this:

  0627  ARABIC LETTER ALEF
  06CC  ARABIC LETTER FARSI YEH
  0634  ARABIC LETTER SHEEN
  06CC  ARABIC LETTER FARSI YEH
  200C  ZERO WIDTH NON-JOINER
  062F  ARABIC LETTER DAL
  0627  ARABIC LETTER ALEF

You can make the characters appear by deselecting Use graphics on the Look up tab. (Of course, you need an Arabic font to see the list as intended.)

Ų§  ā€Ž0627  ARABIC LETTER ALEF
ŪŒ  ā€Ž06CC  ARABIC LETTER FARSI YEH
Ų“  ā€Ž0634  ARABIC LETTER SHEEN
ŪŒ  ā€Ž06CC  ARABIC LETTER FARSI YEH
ā€Œ  ā€Ž200C  ZERO WIDTH NON-JOINER
ŲÆ  ā€Ž062F  ARABIC LETTER DAL
Ų§  ā€Ž0627  ARABIC LETTER ALEF

Customising the list format

What may be less obvious is that you can also customise the format of this list using the settings under the Options tab. For example, using the List format settings, I can produce a list that moves the character column between the number and the name, like this:

  0627  Ų§  ARABIC LETTER ALEF
  ā€Ž06CC  ŪŒ  ARABIC LETTER FARSI YEH
  ā€Ž0634  Ų“  ARABIC LETTER SHEEN
  ā€Ž06CC  ŪŒ  ARABIC LETTER FARSI YEH
  ā€Ž200C  ā€Œ  ZERO WIDTH NON-JOINER
  ā€Ž062F  ŲÆ  ARABIC LETTER DAL
  ā€Ž0627  Ų§  ARABIC LETTER ALEF

Or I can remove one or more columns from the list, such as:

  Ų§  ARABIC LETTER ALEF
  ŪŒ  ARABIC LETTER FARSI YEH
  Ų“  ARABIC LETTER SHEEN
  ŪŒ  ARABIC LETTER FARSI YEH
  ā€Œ  ZERO WIDTH NON-JOINER
  ŲÆ  ARABIC LETTER DAL
  Ų§  ARABIC LETTER ALEF

With the option Show U+ in lists I can also add or remove the U+ before the codepoint value. For example, this lets me produce the following list:

  ā€ŽU+0627  ARABIC LETTER ALEF
  ā€ŽU+06CC  ARABIC LETTER FARSI YEH
  ā€ŽU+0634  ARABIC LETTER SHEEN
  ā€ŽU+06CC  ARABIC LETTER FARSI YEH
  ā€ŽU+200C  ZERO WIDTH NON-JOINER
  ā€ŽU+062F  ARABIC LETTER DAL
  ā€ŽU+0627  ARABIC LETTER ALEF

Other lists in UniView

Weā€™ve shown how you can make a list of characters in the Copy & Paste box, but donā€™t forget that you can create lists for a Unicode block, custom range of characters, search list results, or list of codepoint values, etc. And not only that, but you can filter lists in various ways.

Here is just one quick example of how you can obtain a list of numbers for the Devanagari script.

  1. On the Look up tab, select Devanagari from the Unicode block pull down list.
  2. Select Show range as list and (optionally) deselect Use graphics.
  3. Under the Filter tab, select Number from the Show properties pull down list.
  4. Click on Make list from highlights.

You end up with the following list, which you can paste into your document.

ą„¦  ā€Ž0966  DEVANAGARI DIGIT ZERO
ą„§  ā€Ž0967  DEVANAGARI DIGIT ONE
ą„Ø  ā€Ž0968  DEVANAGARI DIGIT TWO
ą„©  ā€Ž0969  DEVANAGARI DIGIT THREE
ą„Ŗ  ā€Ž096A  DEVANAGARI DIGIT FOUR
ą„«  ā€Ž096B  DEVANAGARI DIGIT FIVE
ą„¬  ā€Ž096C  DEVANAGARI DIGIT SIX
ą„­  ā€Ž096D  DEVANAGARI DIGIT SEVEN
ą„®  ā€Ž096E  DEVANAGARI DIGIT EIGHT
ą„Æ  ā€Ž096F  DEVANAGARI DIGIT NINE

(Of course, you can also customise the layout of this list as described in the previous section.)

Try it out.

Reversing the process: from list to string

To complete the circle, you can also cut & paste any of the lists in the blog text above into UniView, to explore each characterā€™s properties or recreate the string.

Select one of the lists above and paste it into the Characters input field on the Look up tab. Hit the downwards pointing arrow icon alongside, and UniView will recreate the list for you. Click on each character to view detailed information about it.

If you want to recreate the string from the list, simply click on the upwards pointing arrow icon below the Copy & paste box, and the list of characters will be reconstituted in the box as a string.

VoilĆ !