I’ve been wanting to improve the editing behaviour of my pickers for quite some time, so that users could interact more easily with the keyboard and insert characters into the middle of a composition, not just at the end. In fact, the output area now maintains the focus all the time – a major improvement to the usability of the pickers.

This week I made those things happen, and created a new template with some other changes, too.

An updated Bengali picker is first out of the box, but look out for a brand new Urdu-specific picker to follow close on its heels. I will retrofit the new template to other pickers as time allows, or need dictates.

I also beefed up the font selection list with a large number of TrueType and OpenType fonts, and improved the reference material at the bottom.

I improved the mechanism that highlights similar characters, giving more fine-grained control over the associations between characters.

I also added a field just under the title that gives information about the character the user is mousing over, and added a search field to help users find characters for which they know the Unicode name or number. I plan to extend the information associated with characters in future to include native names (eg. e-kar) and other useful search info.

I also changed the scripting and HTML so that a single click can now produce multiple characters in the composition field. This will allow users to input ligatures like the Indic ‘ksha’ or Urdu aspirated consonants, or complex sequences tied to ligatures (like the word ‘Allah’), with a simple click.

Some things have also been removed. There is no DEL button now, since you can interact more easily with the keyboard for that. Spaces are available from the (now rationalised) character area, rather than a button. And there is no longer an option to switch between graphics and characters for the selection. This is partly for simplicity, and partly to make it easier to represent some of the slightly more complicated selection options I want to add in future – for example, specific shapes are appropriate for Urdu Arabic characters, and I don’t want to leave it to chance as to whether the user’s system has the right fonts to produce the desired shapes.

Getting to this point actually required a huge amount of unseen work, since I had to wrap all the images in button markup and move and change attributes, etc., so that the composition box retains the focus in IE (it already worked fine in Firefox, Opera and Safari). I also, of course, made significant, but probably not noticeable, changes to the JavaScript and CSS.
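
The heart of the new behaviour is small enough to sketch here. This is a simplified illustration, not the picker’s actual code: it inserts one or more characters at the caret in the output area and hands the focus straight back, which also covers the multi-character insertions described above.

// Insert a string at the caret in the output textarea, replacing any
// selection, then restore the focus so editing can continue seamlessly.
function insertAtCaret(output, chars) {
    const start = output.selectionStart;
    const end = output.selectionEnd;
    // 'end' leaves the caret just after the inserted text.
    output.setRangeText(chars, start, end, 'end');
    output.focus();
}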

I just read a post by Ivan Herman about how Hungary has joined the Schengen Agreement, and will soon be removing border controls on the EU side. That put me in mind of the first time I tried to pass through the Iron Curtain.

I was travelling from Vienna to Budapest (probably about 25 years ago) and I had decided to go through Sopron, rather than Hegyeshalom, so I could see something a little more off the beaten track. This was a month-long InterRail trip, so I was able to follow my whim and jump on whatever train I wanted. The train connections worked, and I found myself heading south from Vienna.

Eventually, the train passed into Hungary and stopped. I needed a visa, so I got off with a bunch of other (Hungarian-looking) people, and traipsed over to a small outbuilding, where I found myself at the back of a queue of people jostling bags of various sizes and dressed and coiffed in what looked to me to be a very Eastern European fashion. Looking out of the window, everything was grey. I could see rail tracks and points and small, grey buildings, but also several very tall towers with machine gun nests perched on top (quite large-looking machine guns). The queue moved slowly, and I was surprised at one point to see my train pulling away and disappearing. It seemed a bit odd (and I was glad I’d brought all my stuff with me), but I figured this was probably normal, and I’d just have to catch another train.

I finally arrived at the desk and asked for a visa. The guy behind the desk started talking to me in a somewhat animated fashion, but I had no idea what he was saying. I hadn’t learned German yet, and Hungarian was completely incomprehensible to me. I kept trying to explain, politely, in English, that I needed a visa. Finally, he gave me an exasperated look and called someone out of a nearby room. The guy who emerged was huge, bald and intimidatingly business-like. (Some time later I saw the film Midnight Express, and realised that the prison guard and he could have been the same person.) He shouted at me “Nicht visa!”. And I tried to explain, in English, that, yes, I had no visa, but would like to obtain one, please. This didn’t appear to get across clearly, because he simply repeated “Nicht visa!!” several times, increasing in volume.

Finally, the tension broke and gave way to action. He motioned for me to follow him out of the building, and we started walking away across a couple of sets of railway tracks. I noticed, feeling slightly less at ease but still hopeful, that I was flanked by a soldier with a gun on either side. They weren’t exactly giving me encouraging looks, and as I glanced up at the machine gun towers and at the surrounding barbed wire, I began to wish I knew what was happening.

Soon we arrived at the end of a short train. The very last carriage of this train looked like something you’d expect to see in a Wild West film. It had a kind of standing area at each end with a railing, a door into the carriage and steps leading down to the ground on either side. I was ushered up one set of steps and into what turned out to be an empty carriage. The door was shut behind me, and within a minute or so, as I remember it, the train started moving off, in the same direction my earlier train had disappeared. So I wasn’t just being sent back across the border.

That last realisation started to trouble me a little, since I still had no visa and no idea what was happening. It didn’t help that there was a small round window in the door at each end of the carriage, through which I could see guards sitting on the steps at each corner, all holding machine guns at the ready. As the towers slid away behind us, night started to fall.

Twenty-five years has dulled the memory of some of what happened next, but eventually I got off at a small station, having reached the end of the line. The guards were gone, and the station turned out to be quite modern and clean looking. I still couldn’t understand anything anyone was saying, so I still had no idea where I was, but I was able to figure out that I was somehow back in Austria. It was much later that I was to realise that Sopron is on a peninsula that sticks into Austria, and I had come in one side and been sent out the other.

I slept that night on the floor of the main station building, and the next morning set off to find someone who spoke English and could tell me where I was – and just as importantly how to get into Hungary. The town was quite small, maybe just a village. In spite of that it took me a while, but I eventually came across a chap in a supermarket who was able to explain to me that visas are not issued on entry into Hungary by train via Sopron. I was ahead of him there. He also offered to drive me to the border, telling me that I would be able to get a visa at the road entry point.

It’s nice to think about that person whenever I relive this story. He really went out of his way, leaving work to assist a complete stranger, with no fuss or thought of reward. I wonder whether he remembers me. I doubt it. Of course, these days he may even be reading this blog post



So it was that, eventually, I got the stamp in my passport that I needed, and somehow found my way onto another train heading for Budapest. Well, that wasn’t quite the end of the fun, which continued when I tried to meet up with my father in the capital. But that, as they say, is another story



Tim Greenwood just pointed out to me a ‘bug’ in my converter program, which is actually, to my mind, a bug in Firefox (although I imagine it was implemented by someone as a feature).

If you type A0 (the hex code for a non-breaking space) in the Hexadecimal code points field, then press Convert, you will get a blank space in the Characters field that should be U+00A0 NO-BREAK SPACE. Then press Convert or View Names above this Characters field and you’ll find that what was supposed to be an NBSP has changed into an ordinary space. IE7, Opera and Safari all continue to show the character in the field as an NBSP.

(However, all four browsers substitute an ordinary space when you copy and paste the text from the Characters field into something else.)

I tried this with a range of other types of space, but saw no such behaviour (try it). They all remained themselves.
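
If you want to check for yourself what a field really contains, something like the following in the browser console will do it (a sketch that assumes a text field with id chars – not the converter’s actual markup):

// List the code points in the field, in hex. U+00A0 NO-BREAK SPACE should
// appear as "A0"; if it shows as "20", the browser has quietly substituted
// an ordinary space.
const value = document.getElementById('chars').value;
console.log([...value].map(c => c.codePointAt(0).toString(16).toUpperCase()));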


The word Mandalay in Myanmar script.

I’ve been brushing up on the Myanmar script, since major changes are on the way with Unicode 5.1.

I upgraded my Myanmar picker to handle the new characters, and I edited my notes on how the script works.

The new characters will make a big difference to how you author text in Unicode, and people will need to update existing pages to bring them in line with the new approach. The changes should make it much easier to create content in Burmese, in addition to addressing some niggly problems with making the script work correctly. One reason the changes were sanctioned is that there is currently very little Burmese content out there in Unicode.

I’ll be updating my character by character notes later too.

The only problem with all this is that existing fonts will all need to be changed to support the new world order (or Myanmar order). I found one font that is already 5.1-ready, from the Myanmar Unicode & NLP Research Center. If you don’t want to download that font, you’ll need to read the PDF version of my notes on the script.

That would be a pity, however, since I had some fun adding JavaScript to the article today, so that it displays a breakdown, character by character, of each example as you mouse over it (using images, so you see it properly).

>> Use it!

Picture of the page in action.

This web-based tool helps you convert between a number of Unicode escape and code formats.

Changes in the new version:

  • Convert from JavaScript, Java and C escape notation, and to JavaScript/Java escapes (with a switch to show C-style supplementary characters) – see the sketch after this list
  • Convert to and from CSS escape notation
  • Convert from HTML/XML code with escapes to code with just characters
  • Convert < > " or & in HTML/XML code to entities
  • Option to show ASCII characters when converting to NCRs
  • View a set of characters in UniView by clicking on the View in UniView button
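
The escape conversions differ mainly in how they treat supplementary characters: JavaScript and Java escapes use a surrogate pair, while C-style notation uses a single 8-digit escape. A rough sketch of that distinction (assumed logic for illustration, not the tool’s source):

// Convert a string to \uHHHH escapes; supplementary characters become
// either a surrogate pair or, in C style, one \UHHHHHHHH escape.
function toEscapes(str, cStyle) {
    return [...str].map(ch => {
        const cp = ch.codePointAt(0);
        if (cp > 0xFFFF) {
            if (cStyle) return '\\U' + cp.toString(16).toUpperCase().padStart(8, '0');
            return ch.split('')   // the two UTF-16 code units
                .map(u => '\\u' + u.charCodeAt(0).toString(16).toUpperCase().padStart(4, '0'))
                .join('');
        }
        return '\\u' + cp.toString(16).toUpperCase().padStart(4, '0');
    }).join('');
}

// toEscapes('𐍈')       -> "\uD800\uDF48"
// toEscapes('𐍈', true) -> "\U00010348"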

For CSS output I chose the 6-figure version with no optional space, since I thought it was clearest. I’ve had a request to change it to the shortest form (4 or 6 figures) followed by space. If other people prefer that, I may change it.

Update: Markus Scherer convinced me to change the CSS output. So rather than 6-figure escapes with no space, the output now contains 6-figure escapes followed by a space for supplementary characters, and 4-figure escapes followed by a space elsewhere.
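
For the record, the rule now applied is simple enough to state in a few lines of code (assumed logic matching the description above, not the tool’s actual source). The trailing space matters because a CSS escape would otherwise absorb a following hex digit:

// 6-digit escapes for supplementary characters, 4-digit escapes otherwise,
// each terminated by a space.
function toCssEscapes(str) {
    return [...str].map(ch => {
        const cp = ch.codePointAt(0);
        const width = cp > 0xFFFF ? 6 : 4;
        return '\\' + cp.toString(16).toUpperCase().padStart(width, '0') + ' ';
    }).join('');
}

// toCssEscapes('a😀') -> "\0061 \01F600 "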

>> See what it can do!

>> Use it!

Picture of the page in action.

I found a little more time to work on UniView while flying to the US for the I18n & Unicode Conference yesterday, adding a bunch of additional useful features.

Changes include:

  • Extended the ability to open UniView with data displayed from a URI. In addition to specifying a block and a character, you can now specify a range, a list of codepoints, a list of characters, or a search string. This is useful for pointing people to results using URIs in links or email.
  • Switching between graphics or fonts for display of characters now refreshes the right panel also.
  • Clicking on the information about the script group of a character displayed in the right panel will cause that block to be displayed in the left panel. This is particularly useful when you find a single character and want to know what’s around it.
  • Replaced the use of hyphens to specify block names in URI queries with underscores or %20. This may break some existing URIs, but fixes a bug that meant that block names that actually contain hyphens were not displaying.
  • Added an option to the right hand panel to display the current character in the Unicode Conversion tool.
  • Fixed some other bugs related to specifying Basic Latin block in a URI.
  • Reinstated CJK Unified Ideographs and Hangul Syllables in the block selection pull-down, but added a warning and opt-out if the block you are about to display contains more than 2000 characters. Also added a warning and opt-out if you try to specify a range of over 2000 characters.

Please report any bugs to me, and don’t forget to refresh any UniView files in your cache before using the new version.

>> See what it can do!

>> Use it!

Picture of the page in action.

In little pockets of time recently I’ve been making some significant improvements to my UniView tool, the character map on steroids.

Changes include:

  • Substantially revised the code so that handling of ideographic and Hangul characters and other characters not in the Unidata database is much improved. For example, ideographs now display in the left panel for a specified range, and property values are available in the right panel.
  • Added regular expression support to the search input field (see the sketch after this list).
  • Changes to the user interface: moved highlighting controls to the initial screens and moved others, such as the chart numbering toggle, to the submenu under “Options”; provided wider input fields for codepoint and cut & paste input; replaced the graphics and list toggle icons with checkboxes; provided an icon to quickly clear the contents of the codepoint and cut & paste input fields. A link to the UniHan database was added alongside the Cut & paste input field: when clicked, this icon looks up the first character in either field. A link to the UniHan database was also added to the right panel when a Unified CJK character is displayed there.
  • The Codepoint input field now accepts more than one codepoint (separated by spaces).
  • When you double-click on a character in the left panel, the codepoint is appended to the Codepoint input field and the character is added to the Cut & paste field.
  • When you click the Show as graphics checkbox, the change is immediately applied to whatever is in the left panel. If you are looking at, say, a list of characters generated by the Codepoint input, it redisplays that same list rather than reverting to a range.
  • Set the default font to “Arial Unicode MS, sans-serif”.
  • Added a message for those who do not have JavaScript turned on, and messages asking you to wait while data is downloaded on initial startup.
  • Fixed the icons linking to the converter tool, so that the contents of the adjacent field are passed to the converter and converted automatically.
  • Added links in the right panel to FileFormat pages (in addition to decodeUnicode). The FileFormat pages provide useful information for Java and .Net users about a given character.
  • Removed the option to specify your own character notes (I’m not aware that anyone ever did, since it hasn’t worked for a while now and no-one has complained). This is because AJAX technology will not allow an XML file to be included from another domain. When that is fixed I will reinstate it.
  • Fixed a number of other bugs, particularly related to supplementary character support and highlighting.
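
The regular expression search mentioned in the list above boils down to filtering the character-name data with a user-supplied pattern. A minimal sketch of the idea (assuming a simple codepoint-to-name map, which is not UniView’s actual data format):

// Return the characters whose Unicode name matches the given pattern.
function searchNames(names, pattern) {
    const re = new RegExp(pattern, 'i');
    return Object.entries(names)
        .filter(([cp, name]) => re.test(name))
        .map(([cp, name]) => ({ codepoint: Number(cp), name: name }));
}

// Example: searchNames(names, 'LEFT.*ARROW') finds left-pointing arrows.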

Please report any bugs to me, and don’t forget to refresh any UniView files in your cache before using the new version.

I’m at the ITS face-to-face meeting in Prague, Czech Republic, and I’ve been trying to learn to read Czech words. Jirka Kosek showed me a Czech tongue-twister last night at dinner.

Strč prst skrz krk.

How amazing is that? A whole sentence without vowels! (It means “Put your finger down your throat.” – I’m wondering whether that has something to do with the missing vowels
)

See a video of Jirka pronouncing it.

>> Use it!

Picture of the page in action.

This web-based tool helps you convert between Unicode character numbers, characters, UTF-8 and UTF-16 code units in hex, percent escapes, Unicode U+hex notation, and Numeric Character References (hex and decimal).

Changes in the new version:

  • Convert to and from Unicode U+hex notation
  • Get a list of Unicode names for a sequence of characters by clicking on the View Names button
  • You now have to click a button to start the conversion, rather than remove focus from the input area. This provides better control and a more intuitive approach.

It also allows you to separate a sequence of characters by spaces. Paste the characters into the Characters field and click Convert. Then click Convert immediately in the Unicode U+hex notation field. (The latter field is the only one that changes the data after an initial conversion.)
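
All of these formats can be derived from the code point, as the sketch below shows using standard JavaScript APIs (illustrative only – not the tool’s code):

// Derive several of the tool's output formats for a single character.
function describeChar(ch) {
    const cp = ch.codePointAt(0);
    const hex = cp.toString(16).toUpperCase();
    return {
        uPlus:      'U+' + hex.padStart(4, '0'),
        ncrHex:     '&#x' + hex + ';',
        ncrDecimal: '&#' + cp + ';',
        utf8:       Array.from(new TextEncoder().encode(ch),
                        b => b.toString(16).toUpperCase().padStart(2, '0')).join(' '),
        percent:    encodeURIComponent(ch),
    };
}

// describeChar('€') -> { uPlus: 'U+20AC', ncrHex: '&#x20AC;',
//     ncrDecimal: '&#8364;', utf8: 'E2 82 AC', percent: '%E2%82%AC' }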

I acquired the domain name rishida.net this weekend, and I have been setting up the server so that all my content is accessible via http://rishida.net/ rather than http://r12a.github.io/.

Update: Note that I no longer own rishida.net or have anything to do with the internationalisation-related content that someone has put up there. Go to https://r12a.github.io/ instead.

This should make it much easier for me to type in the URI and tell people over the phone what the URI is. It should also be easier for people to remember the address for, say, my photos (r12a.github.io/photos).

You could, of course, still type in the old address if you prefer, but I’d suggest you use the new one in your bookmarks from now on.


Multiple scripts in XMetal’s tags-on view (click to enlarge).

I received a query from someone asking:

I try to edit lao and thai text with XMetal 5.0, but nothing is displayed but squares. In fact, Unicode characters seems to be correctly saved in the XML file and displayed in Firefox (for example), but i can’t get a correct display in XMetal. Is it a font problem ?

There are two places this needs to be addressed:

  1. in the plain text view
  2. in the tags-on view

For the plain text view, it is a question of setting a font that shows Lao and Thai (or whatever other language/script you need) in Tools>Options>Plain Text View>Font. You can only set one font at a time, so a wide-ranging Unicode font like Arial Unicode MS or Code2000 may be useful for Windows users.

For the tags-on view (which is the view I use most of the time) you need to edit the CSS file that sets the editor’s styling for the DOCTYPE you are working with. This may be in one of a number of places. The one I edit is C:\Program Files\Blast Radius\XMetaL 4.6\Author\Display\xhtml1-transitional.css.

I added the following to mine. I chose fonts I have on my PC and set font sizes relative to the size I set for my body element. You should, of course, choose your own fonts and sizes.

[lang="am"] { font-family: "Code2000", serif; font-size: 120%; }
[lang="ar"] {font-family: "Traditional Arabic", sans-serif; font-size: 200%; }
[lang="bn"] {font-family: SolaimanLipi, sans-serif; font-size: 200%; }
[lang="dz"] { font-family: "Tibetan Machine Uni", serif; font-size: 140%; }
[lang="he"] {font-family: "Arial Unicode MS", sans-serif; font-size: 120%;}
[lang="hi"] {font-family: Mangal, sans-serif;  font-size: 120%;}
[lang="kk"] {font-family: "Arial Unicode MS", sans-serif;  }
[lang="iu"] {font-family: Pigiarniq, Uqammaq, sans-serif; font-size: 120%; }
[lang="ko"] { font-family: Batang, sans-serif; font-size: 120%;}
[lang="ne"] {font-family: Mangal, sans-serif;  font-size: 120%; }
[lang="pa"] { font-family: Raavi, sans-serif; font-size: 120%;}
[lang="te"] {font-family: Gautami, sans-serif; font-size: 140%;}
[lang="my"] {font-family: Myanmar1, sans-serif; font-size: 200%;}
[lang="th"] {font-family: "Cordia New", sans-serif; font-size: 200%; }
[lang="ur"] { font-family: "Nafees Nastaleeq", serif; font-size: 130%;}
[lang="ve"] { font-family: "Arial Unicode MS", sans-serif; }
[lang="zh-Hans"] { font-family: "Simsun", sans-serif; font-size: 140%; }
[lang="zh-Hant"] { font-family: "Mingliu", sans-serif; font-size: 140%; }

Note that I would have preferred to say :lang(am) { font-family: 
 } etc., but XMetal 4.6 seems to require you to specify the attribute value as shown above. (You also have to specify class selectors as [class="myclass"] {
} rather than .myclass {
}.)

I see from a recent Bugzilla report and some cursory testing that a (very) long-standing bug in Mozilla related to complex scripts has now been fixed.

Complex scripts include many non-Latin scripts that use combining characters or ligatures, or that apply shaping to adjacent characters like Arabic script.

It used to be that, when you highlighted text in a complex script, as you extended the edges of the highlighted area you would break apart combining characters from their base character, split ligatures and disrupt the joining behaviour of Arabic script characters.

The good news is that this no longer happens – it was fixed by the new text frame code. The bad news is that the highlighting still happens character by character, rather than at grapheme boundaries – which can make it tricky to know whether you got the combining characters or not.

UPDATE: I hear from Kevin Brosnan that the following will be fixed in Firefox 3. Hurrah! And thank you Mozilla team.

What doesn’t appear to be fixed is the behaviour of Asian scripts when CSS text-align: justify is applied. 🙁

I raised a bug report about this. I was amazed, after hearing about this from Indians and Pakistanis too, that there didn’t seem to be a bug report already. Come on users, don’t leave this up to the W3C!

Basically, the issue is that if you apply text-align: justify to some text in an Indian or Tibetan script, the combining characters all get rendered alongside their base characters, ie. you go from this (showing, respectively, Tibetan, Devanagari (Hindi and Nepali), Punjabi, Telugu and Thai text):

Picture of text with no alignment.

to this:

Picture of text with justify alignment.

Strangely, the effect doesn’t seem to apply to the Thai text, nor to other text with combining characters that I’ve tried.

That’s a pretty big bug for people in the affected region because it effectively means that text-align:justify can’t be used.

>> Use it!

Picture of the page in action.

This tool allows you to see what is assigned to event.keyCode and event.charCode in the DOM after the events keydown, keypress, and keyup are detected by the browser. Use it across different browsers with different keyboard mappings to see how things differ.

It’s a bit esoteric, but it may be of interest to someone. I wanted to play with this a bit to help me understand the background to the DOM Level 3 Events Specification.
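
The whole thing is little more than three event listeners; here is a minimal sketch of the idea (not the tool’s actual code):

// Log keyCode and charCode for each keyboard event type. charCode is only
// ever populated for keypress, and both properties are legacy features of
// the events model that this tool was written to explore.
for (const type of ['keydown', 'keypress', 'keyup']) {
    document.addEventListener(type, e => {
        console.log(type + ': keyCode=' + e.keyCode + ' charCode=' + e.charCode);
    });
}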

Sarmad Hussain, at the Center for Research in Urdu Language Processing, FAST National University, Pakistan, is looking at enabling Urdu IDNs based on ICANN recommendations, though this may lead to similar approaches in a number of other countries.

Sarmad writes: “We are trying to make the URL enabled in Urdu for people who are not literate in any other language (a large majority of the literate population in Pakistan). ICANN has only given specs for the Domain Name in other languages (through its RFCs). Until they allow the TLDs in Urdu, we are considering an application-end solution: have a plug-in for a browser, for people who want to use it, which [takes a] URL in Urdu, strips and maps all the TLD information to .com, .pk, etc. and converts the domain name to punycode. Thus, people can type URLs in pure Urdu which are converted to the mixed English-Urdu URLs by the application layer, which ICANN currently allows.”

“We are currently trying to figure out what would be the ‘academic’ requirements/solutions for a language. To practically solve the problem, organizations like ICANN would need to come up with the solutions.”

There are some aspects to Sarmad’s proposal, arising from the nature of the Arabic script used for Urdu, that raise some interesting questions about the way IDN works for this kind of language. These have to do with the choice of characters allowed in a domain name. For example, there is a suggestion that users should be able to use certain characters when writing a URI in Urdu which are then either removed (eg. vowel diacritics) or converted to other characters (eg. Arabic characters) during the conversion to punycode.

This is not something that is normally relevant for English-only URIs, because of the relative simplicity of our alphabet. There is much more potential for ambiguity in the choice of characters in Urdu. Note, however, that the proposals Sarmad is making are language-specific, not script-specific, ie. Arabic or Persian (also written with the Arabic script) would need slightly different rules.

I find myself wondering whether you could use a plug-in to strip out or convert the characters while converting to punycode. People typing IDNs in Urdu would need to be aware of the need for a plug-in, and would still need to know how to type in IDNs if they found themselves using a browser that didn’t have the plug-in (eg. the businessman who is visiting a corporation in the US that prevents ad hoc downloads of software). On the one hand, I wonder whether we can expect a user who sees a URI on a hard copy brochure containing vowel diacritics to know what to do if their browser or mail client doesn’t support the plug-in. On the other hand, a person writing a clickable URI in HTML or an email would not be able to guarantee that users would have access to the plug-in. In that case, they would be unwise to use things like short vowel diacritics, since the user cannot easily change the link if they don’t have a plug-in. Imagine a vowelled IDN coming through in a plain text email, for example: the reader may need to edit the email text to get to the resource rather than just click on it. Not likely to be popular.

Another alternative is to do such removal and conversion of characters as part of the standard punycode conversion process. This, I suspect, would require every browser to have access to standardised tables of characters that should be ignored or converted for any language. But there is an additional problem, in that the language would need to be determined correctly before such rules were applied – that is, the language of the original URI. That too seems a bit difficult.
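
To make the character-mapping idea concrete, here is a sketch of the kind of preprocessing being discussed (the exact list of characters to remove or convert is precisely what Sarmad is researching, so the range below is purely illustrative):

// Illustrative only: strip Arabic-script vowel diacritics (U+064B to
// U+0652 covers the common harakat) before handing the label to a real
// IDNA/punycode conversion step.
function preprocessUrduLabel(label) {
    return label.replace(/[\u064B-\u0652]/g, '');
}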

So I can see the problem, but I’m not sure what the solution would be. I’m inclined to think that creating a plug-in might create more trouble than benefit, by replacing the problems of errors and ambiguities with the problems of uninteroperable IDNs.

I have posted this to the www-international list for discussion.

Follow this link to see lists of characters that may be removed or converted.


Ruby text above and below Japanese characters.

My last post mentioned an extension that takes care of Thai line breaking. In this post I want to point to another useful extension that handles ruby annotation.

Typically ruby is used in East Asian scripts to provide phonetic transcriptions of obscure characters, or characters that the reader is not expected to be familiar with. For example it is widely used in education materials and children’s texts. It is also occasionally used to convey information about the meaning of ideographic characters. For more information see Ruby Markup and Styling.

Ruby markup (called 振り仮名 [furigana] in Japan) is described by the W3C’s Ruby Annotation spec. It comes in two flavours, simple and complex.

Ruby markup is a part of XHTML 1.1 (served as XML), but native support is not widely available. IE doesn’t support XHTML 1.1, but it does support simple ruby markup in HTML and XHTML 1.0. This extension provides support in Firefox for both simple and complex ruby, in HTML, XHTML 1.0 and XHTML 1.1.

It passes all the I18n Activity ruby tests, with the exception of one *very* minor nit related to spacing of complex ruby annotation.


Before and after applying the extension.

Samphan Raruenrom has produced a Firefox extension based on ICU to handle Thai line breaking.

Thai line breaks respect word boundaries, but there are no spaces between words in written Thai. Spaces are used instead as phrase separators (like English comma and full stop). This means that dictionary-based lookup is needed to properly wrap Thai text.
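
As an aside, the same dictionary-based approach that the extension gets from ICU is nowadays exposed directly in JavaScript via Intl.Segmenter (shown purely to illustrate the idea – it did not exist when this extension was written):

// Segment Thai text into words using the engine's dictionary-based rules.
const segmenter = new Intl.Segmenter('th', { granularity: 'word' });
const words = [...segmenter.segment('สวัสดีครับ')]
    .filter(s => s.isWordLike)
    .map(s => s.segment);
// words -> something like ['สวัสดี', 'ครับ']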

The current release works on Windows and the current Firefox release, 2.0.0.4. The next release will also support Linux and will support future Mozilla Firefox/Thunderbird releases.

You can test this on our i18n articles translated into Thai.

This replaces work on a separate Thai version of Firefox.

UPDATE: This post has now been updated, reviewed and released as part of a W3C article. See http://www.w3.org/International/questions/qa-personal-names.

Here are some more thoughts on dealing with multi-cultural names in web forms, databases, or ontologies. See the previous post.

Script

The first thing that English speakers must remember about other people’s names is that a large majority of them don’t use the Latin alphabet, and most of those that do use accents and characters that don’t occur in English. It seems obvious, once I’ve said it, but it has some important consequences for designers that are often overlooked.

If you are designing an English form, you need to decide whether you are expecting people to enter names in their own script or in an ASCII-only transcription. What people type into the form will often depend on whether the form and its page are in their language or not. If the page is in their language, don’t be surprised to get back non-Latin or accented Latin characters.

If you hope to get ASCII-only, you need to tell the user.

The decision about which is most appropriate will depend to some extent on what you are collecting people’s names for, and how you intend to use them.

  • Are you collecting the person’s name just to have an identifier in your system? If so, it may not matter whether the name is stored in ASCII-only or native script.
  • Or do you plan to call them by name on a welcome page or in correspondence? If you will correspond using their name on pages written in their language, it would seem sensible to have the name in the native script.
  • Is it important for people in your organization who handle queries to be able to recognise and use the person’s name? If so, you may want to ask for a transcription.
  • Will their name be displayed or searchable (for example Flickr optionally shows people’s names as well as their user name on their profile page)? If so, you may want to store the name in both ASCII and native script, in which case you probably need to ask the user to submit their name in both native script and ASCII-only form, using separate fields.

Note that if you intend to parse a name, you may need to use country or language-specific algorithms to do so correctly (see the previous blog on personal names).

If you do accept non-ASCII names, you should use UTF-8 encoding in your pages, your back end databases and in all the scripts in between. This will significantly simplify your life.


Icons chosen by McDonald’s to represent, from left to right, Calories, Protein, Fat, Carbohydrates and Salt.

I just read a fascinating article about how McDonald’s set about testing the cultural acceptability of a range of icons intended to point to nutritional information. It talks about the process and gives examples of some of the issues. Very nice.

Interesting, also, that they still ended up with local variants in some cases.

Creating a New Language for Nutrition: McDonald’s Universal Icons for 109 Countries

Some applications insert a signature or Byte Order Mark (BOM) at the beginning of UTF-8 text. For example, Notepad always adds a BOM when saving as UTF-8.

Older text editors or browsers will display the BOM as a blank line on-screen; others will display unexpected characters, such as ï»ż. This may also occur in the latest browsers if a file that starts with a BOM is included into another file by PHP.

For more information, see the article Unexpected characters or blank lines and the test pages and results on the W3C site.

If you have problems that you think might be related to this, the following may help.

Checking for the BOM

I created a small utility that checks for a BOM at the beginning of a file. Just type in the URI for the file and it will take a look. (Note, if it’s a file included by PHP that you think is causing the problem, type in the URI of the included file.)
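
The check itself is trivial: the first three bytes of the file either are or are not EF BB BF. A minimal sketch of the same test in JavaScript (Node.js, checking a local file rather than a URI):

// Report whether a local file starts with the UTF-8 BOM (EF BB BF).
const fs = require('fs');

function hasBom(path) {
    const fd = fs.openSync(path, 'r');
    const buf = Buffer.alloc(3);
    const bytesRead = fs.readSync(fd, buf, 0, 3, 0);
    fs.closeSync(fd);
    return bytesRead === 3 && buf.equals(Buffer.from([0xEF, 0xBB, 0xBF]));
}

console.log(hasBom(process.argv[2]) ? 'BOM found.' : 'No BOM found.');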

Removing the BOM

If there is a BOM, you will probably want to remove it. One way would be to save the file using a BOM-aware editor that allows you to specify that you don’t want a BOM at the start of the file. For example, if Dreamweaver detects a BOM the Save As dialogue box will have a check mark alongside the text “Include Unicode Signature (BOM)”. Just uncheck the box and save.

Another way would be to run a script on your file. Here is some simple Perl scripting to check for a BOM and remove it if it exists (developed by Martin DĂŒrst and tweaked a little by myself).

# Program to remove a leading UTF-8 BOM from a file.
# Works both STDIN -> STDOUT and in place (with the filename as argument).

use strict;
use warnings;

if ($#ARGV > 0) {
    print STDERR "Too many arguments!\n";
    exit;
    }

my @file;   # file content
my $lineno = 0;

my $filename = $ARGV[0];
if ($filename) {
    open( BOMFILE, $filename ) || die "Could not open source file for reading.";
    while (<BOMFILE>) {
        if ($lineno++ == 0) {
            if ( index( $_, "\xEF\xBB\xBF" ) == 0 ) {  # the three UTF-8 BOM bytes
                s/^\xEF\xBB\xBF//;
                print "BOM found and removed.\n";
                }
            else { print "No BOM found.\n"; }
            }
        push @file, $_ ;
        }
    close (BOMFILE)  || die "Can't close source file after reading.";

    open (NOBOMFILE, ">$filename") || die "Could not open source file for writing.";
    foreach my $line (@file) {
        print NOBOMFILE $line;
        }
    close (NOBOMFILE)  || die "Can't close source file after writing.";
    }
else {  # STDIN -> STDOUT
    while (<>) {
    if (!$lineno++) {
        s/^\xEF\xBB\xBF//;
        }
    push @file, $_ ;
    }

    foreach my $line (@file) {
        print $line;
        }
    }