Picture of the page in action.

The page Mongolian variant forms has been significantly changed and moved to a new location. The main changes include the following:

  • The ‘NP’ column is now the ‘DS01’ column. It shows the shapes discussed by the Mongolian experts on the public-i18n-mongolian@w3.org list with a view to updating and standardising the results of using Unicode variants. The column reflects the state of that document as of 25 Nov 2016. Expected updates to the document are shown using “[current] ≫ [expected]”. As those and other changes are made to DS01 I will update this variant comparison page.
  • I added a column for L2/16-309 (the output of the Hohhot discussions) as of 26 Oct 2016. Differences between DS01 and L2/16-309 are all highlighted for quick reference.
  • I also added a column for the Unicode Standard v9 chart data, and again highlighted differences from DS01 or L2/16-309. To see this column, click on the vertical blue bar, bottom right, then set Hidden Columns to Show.
  • Showing the hidden columns also reveals a usage column and a notes column (information in both cases taken from L2/16-309).
  • All the font information has been rechecked, and some bugs fixed.
  • Other editorial changes include the hiding of former notes, and the simplification and update of the page intro.

Picture of the page in action.

Another significant development is the creation of a Shape Index. That document lists all shapes used in the page Mongolian Variants, and enables you to jump to the appropriate table in that page so that you can see how it’s used.

The classification used is not intended to be etymologically or philosophically pure. It is intended as a simple practical tool to help locate shapes, and one that also works for novices.

I hope these changes will help us identify and resolve the remaining differences between the documents. I will update these pages as the source documents change (please advise me when new versions are developed). Of course, comments are welcome.

If you want to join the discussion about Mongolian variants you could join this mailing list. (Hit the subscribe link.)

Some more notes for future reference. The following scripts in Unicode 9.0 have both upper and lower case characters.

In modern use

Latin
Greek
Cyrillic
Armenian
Cherokee*
Georgian**
Adlam

Limited modern use

Coptic
Warang Citi
Old Hungarian
Osage

No longer used

Glagolitic
Deseret

Latin here includes characters used for phonetic transcriptions.

There are also case correspondences in the following Unicode blocks: Number Forms, Enclosed Alphanumerics, Alphabetic Presentation Forms, Halfwidth and Fullwidth Forms. Mathematical Alphanumeric Symbols also has both upper and lower case forms, but they are not convertible, one to the other.

* Cherokee is quite unusual, in that until Unicode version 8 there were no lowercase Cherokee characters in Unicode, and Cherokee was written using only what are now referred to as uppercase letters. The lowercase letters had been used in the Cherokee New Testament, and are beginning to see more use in modern text too (which is why they were added to Unicode), but the overwhelming majority of pre-existing Cherokee text is still written using the 'uppercase' letters. Furthermore, many Cherokee fonts still only support the uppercase forms.

** Georgian is a little special, too. Like Cherokee, Georgian has an unusual case history: the archaic, liturgical Khutsuri version of the script is bicameral, with asomtavruli capitals and nuskhuri lowercase. A new attempt was made to write Georgian in a bicameral way in the 1950s, using asomtavruli capitals and mkhedruli lowercase, but it didn’t catch on. For modern Georgian text, caseless mkhedruli characters are now the standard; however, Unicode 11 added a set of mtavruli characters with case mappings to mkhedruli. The mtavruli letters have similar forms to the mkhedruli except that, in principle, all letters written in the mtavruli style appear with an equal height, standing on the baseline, similar to small caps in the Latin script. These are never used for sentence- or word-initial uppercasing, only for the equivalent of ALL-CAPS, for example in headings, signage, or emphasised text. Read more.

Updated 27 Nov 2017: Added explanation for Cherokee, and expanded on Georgian explanation.

Updated 8 Apr 2019: Updated Georgian to take into account changes in Unicode 11.

See also the list of Right-to-left scripts.

  1. Amir Aharoni Says:

    Isn’t Adlam bicameral?

  2. r12a Says:

    Yes, indeed. Hmmm. I used http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B:General_Category=Uppercase_Letter:%5D to find the cased letters, but it seems that that CLDR table doesn’t know about Adlam or Osage (which is also bicameral) :(. I searched using UniView instead, and added Adlam and Osage to the list, plus a note about Mathematical Alphanumeric Symbols. Thanks, Amir.

Picture of the page in action.

A new Pinyin phonetics web app is now available. As you type Hanyu pinyin in the top box, a phonetic transcription appears in the lower box.

I put this app together to help me get closer to the correct pronunciation of names of people, cities, etc. that I come across while reading about Chinese history. There was often some combination of letters that I couldn’t quite remember how to pronounce.

It’s not intended to be perfect, though I think it’s pretty good overall. It works with input whether or not it has tonal accents. If you can suggest improvements, please raise a github issue.

See the notes file for a description of the less obvious phonetic symbols.

In case you want something to play with, here’s the text in the picture: Dìzào zhēnzhèng quánqiú tōngxíng de wànwéiwǎng. It means “Making the World Wide Web truly worldwide”, and the Han version is 缔造真正全球通行的万维网.

This shows the durations of dynasties and kingdoms of China during the period known as the 16 Kingdoms. Click on the image below to see an interactive version with a guide that follows your cursor and indicates the year.

Chart of timelines

See a map of territories around 409 CE. The dates and ethnic data are from Wikipedia.

Update 2016-10-03: I found it easier to work with the chart if the kingdoms are grouped by name/proximity, so changed the default to that. You can, however, still access the strictly chronological version.

Picture of the page in action.

A new Persian Character Picker web app is now available. The picker allows you to produce or analyse runs of Persian text using the Arabic script. Character pickers are especially useful for people who don’t know a script well, as characters are displayed in ways that aid identification.

The picker is able to produce UN transcriptions of the text in the box. The transcription appears just below the input box, where you can copy it, move it into the input box at the caret, or delete it. In order to obtain a full transcription it is necessary to add short vowel diacritics to places that could have more than one pronunciation, but the picker can work out the vowels needed for many letter combinations.

See the help file for more information.

This shows the durations of dynasties and kingdoms of China in the 900s. Click on the image below to see an interactive version that shows a guide that follows your cursor and indicates the year.

Chart of timelines

See a map of territories around 944 CE.

Examples of case conversion.

These are notes culled from various places. There may well be some copy-pasting involved, but I did it long enough ago that I no longer remember all the sources. But these are notes, not an article.

Case conversions are not always possible in Unicode by applying a fixed offset to a code point, although this does work for the ASCII range (where upper- and lowercase letters are 32 code points apart) and for many characters in the Latin Extended blocks (where upper- and lowercase forms alternate, so the offset is 1). There are many cases where the corresponding cased character is in another block, or at an irregularly offset location.
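Here is a minimal Python sketch of that point (the offsets come from UnicodeData; the code is just an illustration, not a recipe):

    print(ord('a') - ord('A'))        # 32 – ASCII pairs are a fixed offset apart
    print(ord('ā') - ord('Ā'))        # 1 – Latin Extended-A pairs alternate upper/lower

    # ...but ÿ (U+00FF) uppercases to Ÿ (U+0178), in a different block at an irregular offset,
    # so relying on the Unicode case mappings is the only safe general approach:
    print('ÿ'.upper())                # 'Ÿ'
    print(chr(ord('ÿ') - 32))         # 'ß' – naive ASCII-style arithmetic gives the wrong answer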

In addition, there are linguistic issues that mean that simple mappings of one character to another are not sufficient for case conversion.

In German, the uppercase of ß is SS. German and Greek cannot, however, be easily transformed from upper to lower case: German because SS could be converted either to ß or ss, depending on the word; Greek because all tonos marks are omitted in upper case, eg. does ΑΘΗΝΑ convert to Αθηνά (the goddess) or Αθήνα (capital of Greece)? German may also sometimes uppercase ß to ẞ, for things like signboards.

Greek also converts uppercase sigma to either a final or a non-final form when lowercasing, depending on its position in the word, eg. ΟΔΥΣΣΕΥΣ becomes οδυσσευς. This contextual difference is easy to manage, however, compared to the lexical issues in the previous paragraph.

In Serbo-Croatian there is an important distinction between uppercase and titlecase. The single letter dž converts to DŽ when the whole word is uppercased, but Dž when titlecased. Both of these forms revert to dž in lowercase, so there is no ambiguity here.
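The three-way mapping is easy to check with Python's built-in case functions, since Unicode encodes all three forms as single code points (a quick illustration, nothing more):

    dz = '\u01C6'                               # dž LATIN SMALL LETTER DZ WITH CARON
    print(dz.upper())                           # 'DŽ' (U+01C4) – whole-word uppercase
    print(dz.title())                           # 'Dž' (U+01C5) – titlecase
    print('\u01C4'.lower(), '\u01C5'.lower())   # both revert to 'dž' (U+01C6)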

In Dutch, the titlecase of ijsvogel is IJsvogel, ie. ij counts as a single letter, so the first two characters have to be titlecased together. There is a single character IJ (U+0132 LATIN CAPITAL LIGATURE IJ) in Unicode that will behave as expected, but this single character is very often not available on a keyboard, and so the word is commonly written with the two letters I+J.
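A generic titlecasing routine illustrates the problem (Python's built-in title() here, purely as a demonstration):

    print('ijsvogel'.title())          # 'Ijsvogel' – only the i is capitalised
    print('\u0133svogel'.title())      # 'Ĳsvogel' – the single character ĳ (U+0133) titlecases to Ĳ (U+0132)
    # Producing 'IJsvogel' from i+j needs a Dutch-specific rule; the default mappings won't do it.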

In Greek, tonos diacritics are dropped during uppercasing, but not dialytika. Greek diphthongs with tonos over the first vowel are converted during uppercasing to no tonos but a dialytika over the second vowel in the diphthong, eg. Νεράιδα becomes ΝΕΡΑΪΔΑ. A letter with both tonos and dialytika above drops the tonos but keeps the dialytika, eg. ευφυΐα becomes ΕΥΦΥΪΑ. Also, contrary to the initial rule mentioned here, Greek does not drop the tonos on the disjunctive eta (usually meaning ‘or’), eg. ήσουν ή εγώ ή εσύ becomes ΗΣΟΥΝ Ή ΕΓΩ Ή ΕΣΥ (note that the initial eta is not disjunctive, and so does drop the tonos). This is to maintain the distinction between the ‘either/or’ word ή and η, the feminine singular nominative form of the article.

A Greek titlecased vowel, ie. a vowel at the start of a word that is uppercased, retains its tonos accent, eg. Όμηρος.

Turkish, Azeri, Tatar and Bashkir pair dotted and dotless i’s, which requires language-specific handling for case conversion. For example, the name of the second largest city in Turkey is “Diyarbakır”, which contains both the dotted and dotless letters i. When rendered in upper case, this word appears like this: DİYARBAKIR.
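The default (language-neutral) Unicode mappings show why this needs special handling; a small Python illustration of the failure, not a fix:

    print('diyarbakır'.upper())        # 'DIYARBAKIR' – both i (dotted) and ı (dotless) become plain I
    # For Turkish the expected result is 'DİYARBAKIR'; the dotted/dotless distinction is lost
    # unless the conversion knows the language (eg. via ICU's locale-aware casing).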

Lithuanian also has language-specific rules that retain the dot over i when combined with accents, eg. i̇̀ i̇́ i̇̃, whereas the capital I has no dot.

Sometimes European French omits accents from uppercase letters, whereas French Canadian typically does not. However, this is more of a stylistic than a linguistic rule. Sometimes French people uppercase œ to OE, but this is mostly due to issues with lack of keyboard support, it seems (as is the issue with French accents).

Capitalisation may ignore leading symbols and punctuation for a word, and titlecase the first casing letter. This skipping applies not only to non-letters: a caseless letter such as the (non-casing version of the) glottal stop, ʔ, may also be ignored at the start of a word, and the following letter titlecased, in IPA or Americanist phonetic transcriptions. (Note that, to avoid confusion, there are separate case-paired characters available for use in orthographies such as Chipewyan, Dogrib and Slavey. These are Ɂ and ɂ.)

Another issue for titlecasing is that not all words in a sequence are necessarily titlecased. German capitalises nouns, but not verbs or adjectives. French and Italian may expect to titlecase the ‘A’ in “L’Action”, since that is the start of a word. In English, it is common not to titlecase words like ‘for’, ‘of’, ‘the’ and so forth in titles.

Unicode provides only algorithms for generic case conversion and case folding. CLDR provides some more detail, though it is hard to programmatically achieve all the requirements for case conversion.

Case folding is a way of converting to a standard sequence of (lowercase) characters that can be used for comparisons of strings. (Note that this sequence may not represent normal lowercase text: for example, both the uppercase Greek sigma and lowercase final sigma are converted to a normal sigma, and the German ß is converted to ‘ss’.) There are also different flavours of case folding available: common, full, and simple.
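Python's str.casefold() implements Unicode full case folding, which is enough to illustrate the point:

    print('Straße'.casefold() == 'STRASSE'.casefold())       # True – ß folds to 'ss'
    print('ΟΔΥΣΣΕΥΣ'.casefold() == 'Οδυσσευς'.casefold())    # True – σ and ς both fold to σ
    print('Straße'.casefold())                               # 'strasse' – not normal lowercase text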

  1. François Yergeau Says:

    “Sometimes French Canadian omits accents from uppercase letters, whereas French typically does not.” It’s the converse actually. And the reason is the same as for œ: keyboard support. French keyboards lack accented uppercase, whereas French Canadian ones provide them (shift + whatever produces the accented lowercase).

This difference goes back to mechanical typewriter keyboards, so it’s not that recent using human generations as a yardstick. It has been around long enough that numerous French people believe that uppercase should not be accented (which is totally wrong), sometimes having been taught that in school!

  2. r12a Says:

    Thanks for catching that, François. I fixed the text.

  3. Steven Pemberton Says:

    “In Dutch, the titlecase of ijsland is IJsland, ie. the first two letters have to be titlecased.”

Two points here. Strictly speaking, IJ is a single character U+0132: IJ, and its lower case character is U+0133: ij, although they are often typed as two characters I J and i j because (as with the French examples above) the single character versions are not on the keyboard. However, if you spell IJsland out loud, you always spell it with six letters, not seven. IJ is always considered a single letter no matter how it is typed.

A nit-pick: IJsland is a proper noun, and so doesn’t have a lower-cased version. You could use IJsvogel (kingfisher).

  4. r12a Says:

@Steven you’re right, I should have spelled that out a little more clearly, and I have done so now. And thanks for ijsvogel, that’s a useful substitution.

  5. Patrick Schlüter Says:

For French in the European Commission it is official policy to write French with accented uppercase letters.

The language subtag lookup tool now has links to Wikipedia search for all languages and scripts listed. This helps for finding information about languages, now that SIL limits access to their Ethnologue, and offers a new source of information for the scripts listed.

Picture of the page in action.

These are just some notes for future reference. The following scripts in Unicode 9.0 are normally written from right to left.

Scripts containing characters with the bidirectional property ARABIC RIGHT-TO-LEFT are marked with an asterisk. The remaining scripts have characters with the property RIGHT-TO-LEFT:

In modern use

Adlam
Arabic *
Hebrew
Nko
Syriac *
Thaana *

Limited modern use

Mende Kikakui (small numbers)
Old Hungarian
Samaritan (religious)

Archaic

Avestan
Cypriot
Hatran
Imperial Aramaic
Kharoshthi
Lydian
Manichaean
Meroitic
Mandaic
Nabataean
Old South Arabian
Old North Arabian
Old Turkic
Pahlavi (Inscriptional)
Palmyrene
Parthian (Inscriptional)
Phoenician

See also the list of bicameral scripts.

Picture of the page in action.

An updated version of the Unicode Character Converter web app is now available. This app allows you to convert characters between various different formats and notations.

Significant changes include the following:

  • It’s now possible to generate ECMAScript 6-style escapes for supplementary characters in the JavaScript output field, eg. \u{10398} rather than \uD800\uDF98 (see the sketch just after this list).
  • In many cases, clicking on a checkbox option now applies the change straight away if there is content in the associated output field. (There are 4 output fields where this doesn’t happen because we aren’t dealing with escapes and there are problems with spaces and delimiters.)
  • By default, the JavaScript output no longer escapes the ASCII characters that can be represented by \n, \r, \t, \' and \". A new checkbox is provided to force those transformations if needed. This should make the JS transform much more useful for general conversions.
  • The code to transform to HTML/XML can now replace RLI, LRI, FSI and PDI if the Convert bidi controls to HTML markup option is set.
  • The code to transform to HTML/XML can convert many more invisible or ambiguous characters to escapes if the Escape invisible characters option is set.
  • UTF-16 code units are all at least 4 digits long.
  • Fixed a bug related to U+00A0 when converting to HTML/XML.
  • The order of the output fields was changed, and various small improvements were made to the user interface.
  • Revamped and updated the notes.
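For reference, here is the arithmetic behind the two JavaScript escape notations, as a rough Python sketch (not the converter's actual code):

    cp = 0x10398                              # a supplementary character

    es6 = '\\u{%X}' % cp                      # \u{10398} – ECMAScript 6 style
    hi = 0xD800 + ((cp - 0x10000) >> 10)      # high surrogate
    lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)    # low surrogate
    pair = '\\u%04X\\u%04X' % (hi, lo)        # \uD800\uDF98 – surrogate pair style

    print(es6, pair)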

Many thanks to the people who wrote in with suggestions.

Picture of the page in action.

UniView now supports Unicode version 9, which is being released today, including all changes made during the beta period. (As before, images are not available for the Tangut additions, but the character information is available.)

This version of UniView also introduces a new filter feature. Below each block or range of characters is a set of links that allows you to quickly highlight characters with the property letter, mark, number, punctuation, or symbol. For more fine-grained property distinctions, see the Filter panel.

In addition, for some blocks there are other links available that reflect tags assigned to characters. This tagging is far from exhaustive! For instance, clicking on sanskrit will not show all characters used in Sanskrit.

The tags are just intended to be an aid to help you find certain characters quickly by exposing words that appear in the character descriptions or block subsection titles. For example, if you want to find the Bengali currency symbol while viewing the Bengali block, click on currency and all characters other than those related to currency will be dimmed.

(Since the highlight function is used for this, don’t forget that, if you happen to highlight a useful subset of characters and want to work with just those, you can use the Make list from highlights command, or click on the upwards pointing arrow icon below the text area to move those characters into the text area.)

Picture of the page in action.
>> See the chronology
>> See the maps

This blog post introduces the first of a set of historical maps of Europe that can be displayed at the same scale so that you can compare political or ethnographic boundaries from one time to the next. The first set covers the period from 362 AD to 830 AD.

A key aim here is to allow you to switch from map to map and see how boundaries evolve across an unchanging background.

The information in the maps is derived mostly from information in Colin McEvedy’s excellent series of books, in particular (so far) The New Penguin Atlas of Medieval History, but also sometimes brings in information from the Times History of Europe. Boundaries are approximate for a number of reasons: first, in the earlier times especially, the borders were only approximate anyway; second, I have deduced the boundary information from small-scale maps and (so far) only a little additional research; third, the sources sometimes differ about where boundaries lay. I hope to refine the data during future research; in the meantime, take this information as grosso modo.

The link below the picture takes you to a chronological summary of events that lie behind the changes in the maps. Click on the large dates to open maps in a separate window. (Note that all maps will open in that window, and you may have to ensure that it isn’t hidden behind the chronology page.)

The background to the SVG overlay is a map that shows relief and rivers, as well as modern country boundaries (the dark lines). These are things that, as good as McEvedy’s maps were, I always found myself missing when looking for useful reference points. Since the outlines and text are created in SVG, you can zoom in to see details.

This is just the first stage, and the maps are still largely first drafts. The plan is to refine the details for existing maps and add many more. So far we only deal with Europe. In the future I’d like to deal with other places, if I can find sources.

  1. Ann Bassetti Says:

    Thanks, Richard; these are very interesting! I’m intrigued to learn of many ancient cultures I never heard of, and even better to see their locations and movements on sequential charts. History illustrated and enlivened (especially with SVG)! Makes me wonder what cultural or linguistic shadows still exist from these ancient peoples; I’m sure there are many.

    I would love to see similar information for parts of the world not typically represented in American / European history lessons, e.g., Africa, Asia, South America. So much to learn …

Picture of the page in action.

UniView now supports the characters introduced for the beta version of Unicode 9. Any changes made during the beta period will be added when Unicode 9 is officially released. (Images are not available for the Tangut additions, but the character information is available.)

It also brings in notes for individual characters where those notes exist, if Show notes is selected. These notes are not authoritative, but are provided in case they prove useful.

A new icon was added below the text area to add commas between each character in the text area.

Links to the help page that used to appear on mousing over a control have been removed. Instead there is a noticeable, blue link to the help page, and the help page has been reorganised and uses image maps so that it is easier to find information. The reorganisation puts more emphasis on learning by exploration, rather than learning by reading.

Various tweaks were made to the user interface.

Picture of the page in action.

I’ve been doing more work on the Egyptian Hieroglyph picker over the weekend.

The data behind the keyword search has now been completely updated to reflect descriptions by Gardiner and Allen. If you work with those lists it should now be easy to locate hieroglyphs using keywords. The search mechanism has also been rewritten so that you don’t need to type keywords in a particular order for them to match. I also strip out various common function words and do some other optimisation before attempting a match.
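To give an idea of what the order-independent matching involves, here is a rough sketch (not the picker's actual code; the dictionary entries below are hypothetical):

    STOP_WORDS = {'a', 'an', 'the', 'of', 'with', 'and'}

    def matches(query, description):
        # every remaining query word must appear somewhere in the description, in any order
        terms = [w for w in query.lower().split() if w not in STOP_WORDS]
        return all(t in description.lower() for t in terms)

    descriptions = {'𓀀': 'seated man', '𓃾': 'head of an ox'}   # hypothetical keyword data
    print([glyph for glyph, d in descriptions.items() if matches('ox head', d)])   # order doesn't matter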

The other headline news is the addition of various controls above the text area, including one that will render MdC text as a two-dimensional arrangement of hieroglyphs. To do this, I adapted WikiHiero’s PHP code to run in javascript. You can see an example of the output in the picture attached to this post. If you want to try it, the MdC text to put in the text area is:
anx-G5-zmA:tA:tA-nbty-zmA:tA:tA-sw:t-bit:t-< -zA-ra:.-mn:n-T:w-Htp:t*p->-anx-D:t:N17-!

The result should look like this:

Picture of hieroglyphs.

Other new controls allow you to convert MdC text to hieroglyphs, and vice versa, or to type in a Unicode phonetic transcription and find the hieroglyphs it represents. (This may still need a little more work.)

I also moved the help text from the notes area to a separate file, with a nice clickable picture of the picker at the top that will link to particular features. You can get to that page by clicking on the blue Help box near the bottom of the picker.

Finally, you can now set the text area to display characters from right to left, in right-aligned lines, using more controls > Output direction. Unfortunately, I don’t know of a font that under these conditions will flip the hieroglyphs horizontally so that they face the right way.

For more information about the new features, and how to use the picker, see the Help page.

Picture of the page in action.

Over the weekend I added a set of new features to the picker for Egyptian Hieroglyphs, aimed at making it easier to locate a particular hieroglyph. Here is a run-down of various methods now available.

Category-based input

This was the original method. Characters are grouped into standard categories. Click on one of the orange characters, chosen as a nominal representative of the class, to show below all the characters in that category. Click on one of those to add it to the output box. As you mouse over the orange characters, you’ll see the name of the category appear just below the output box.

Keyword-search-based input

The app associates most hieroglyphs with keywords that describe the glyph. You can search for glyphs using those keywords in the input field labelled Search for.

Searching for ripple will match both ripple and ripples. Searching for king will match king and walking. If you want to only match whole words, surround the search term with colons, ie. :ripple: or :king:.

Note that the keywords are written in British English, so you need to look for sceptre rather than scepter.

The search input is treated as a regular expression, so if you want to search for two words that may have other words between them, use .*. For example, ox .* palm will match ox horns with stripped palm branch.
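Here is one way the conventions above could map onto regular expressions – an assumption for illustration, not the picker's implementation:

    import re

    def to_regex(term):
        if term.startswith(':') and term.endswith(':'):
            return r'\b' + re.escape(term.strip(':')) + r'\b'   # :king: matches whole words only
        return term                                             # otherwise the term is used as a regex as-is

    print(bool(re.search(to_regex(':king:'), 'walking man')))                    # False – no whole-word match
    print(bool(re.search('ox .* palm', 'ox horns with stripped palm branch')))   # True – wildcard between words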

Many of the hieroglyphs have also been associated with keywords related to their use. If you select Include usage, these keywords will also be included in the search. Note that this keyword list is not exhaustive by any means, but it may occasionally be useful. For example, a search for Anubis will produce 𓁢 𓃢 𓃣 𓃤.

(Note: to search for a character based on the Unicode name for that character, eg. w004, use the search box in the yellow area.)

Searching for pronunciations

Many of the hieroglyphs are associated with 1, 2 or 3 consonant pronunciations. These can be looked up as follows.

Type the sequence of consonants into the output box and highlight them. Then click on Look up from Latin. Hieroglyphs that match that character or sequence of characters will be displayed below the output box, and can be added to the output box by clicking on them. (Note that if you still have the search string highlighted in the output box those characters will be replaced by the hieroglyph.)

You will find the panel Latin characters useful for typing characters that are not accessible via your keyboard. The panel is displayed by clicking on the higher L in the grey bar to the left. Click on a character to add it to the output area.

For example, if you want to obtain the hieroglyph 𓎝, which is represented by the 3-character sequence wꜣḥ, add wꜣḥ to the output area and select it. Then click on Latin characters. You will see the character you need just above the SPACE button. Click on that hieroglyph and it will replace the wꜣḥ text in the output area. (Unhighlight the text in the output area if you want to keep both and add the hieroglyph at the cursor position.)

Input panels accessed from the vertical grey bar

The vertical grey bar to the left allows you to turn on/off a number of panels that can help create the text you want.

Latin characters. This panel displays Latin characters you are likely to need for transcription. It is particularly useful for setting up a search by pronunciation (see above).

Latin to Egyptian. This panel also displays Latin characters used for transcription, but when you click on them they insert hieroglyphs into the output area. These are 24 hieroglyphs represented by a single consonant. Think of it as a shortcut if you want to find 1-consonant hieroglyphs by pronunciation.

Where a single consonant can be represented by more than one hieroglyph, a small pop-up will present you with the available choices. Just click on the one you want.

Egyptian alphabet. This panel directly displays, as hieroglyphs, the 26 signs that the previous panel produces. In many cases this is the quickest way of typing in these hieroglyphs.

Picture of the page in action.

I have just published a picker for Egyptian Hieroglyphs.

This Unicode character picker allows you to produce or analyse runs of Egyptian Hieroglyph text using the Latin script.

Characters are grouped into standard categories. Click on one of the orange characters, chosen as a nominal representative of the class, to show below all the characters in that category. Click on one of those to add it to the output box. As you mouse over the orange characters, you’ll see the name of the category appear just below the output box.

Just above the orange characters you can find buttons to insert RLO and PDF controls. RLO will make the characters that follow it progress from right to left. Alternatively, you can select more controls > Output direction to set the direction of the output box to RTL/LTR override. The latter approach will align the text to the right of the box. I haven’t yet found a Unicode font that also flips the glyphs horizontally as a result. I’m not entirely sure about the best way to apply directionality to Egyptian hieroglyphs, so I’m happy to hear suggestions.

Alongside the direction controls are some characters used for markup in the Manuel de Codage, which allows you to prepare text for an engine that knows how to lay it out two-dimensionally. (The picker doesn’t do that.)

The Latin Characters panel, opened from the grey bar to the left, provides characters needed for transcription.

In case you’re interested, here is the text you can see in the picture. (You’ll need a font to see this, of course. Try the free Noto Sans font, if you don’t have one – or copy-paste these lines into the picker, where you have a webfont.)
𓀀𓅃𓆣𓁿
<-i-mn:n-R4:t*p->
𓍹𓇋-𓏠:𓈖-𓊵:𓏏*𓊪𓍺

The last two lines spell the name of Amenhotep using Manuel de Codage markup, according to the Unicode Standard (p 432).

  1. tripu Says:

    Great addition, Richard!
    Unicode is awesome 🙂

    The last couple of paragraphs, with the example, made me think that many people (me included) really appreciate samples of the language/script, especially in the form of historical or scientific details or curiosities… I mean that when talking about such an esoteric script, one really wants to learn a little bit more, and see what the language looks like, how it’s composed… Those comments make the post more enjoyable, for those who are curious about languages and glyphs. So, more of that, please! 🙂

  2. r12a Says:

    @tripu then you may also like the Script Summaries pages. And the Script features by language page. That material is still in development, but has some useful info already. (Both were developed to support the tutorial An Introduction to Writing Systems & Unicode.)

    The other thing you may want to check out is the set of Script links, which point you to useful information about particular scripts, including some more in-depth write-ups on my site about certain scripts.

I just received a query from someone who wanted to know how to figure out what characters are in and what characters are not in a particular legacy character encoding. So rather than just send the information to her I thought I’d write it as a blog post so that others can get the same information. I’m going to write this quickly, so let me know if there are parts that are hard to follow, or that you consider incorrect, and I’ll fix it.

A few preliminary notes to set us up: When I refer to ‘legacy encodings’, I mean any character encoding that isn’t UTF-8. Though, actually, I will only consider those that are specified in the Encoding spec, and I will use the data provided by that spec to determine what characters each encoding contains (since that’s what it aims to do for Web-based content). You may come across other implementations of a given character encoding, with different characters in it, but bear in mind that those are unlikely to work on the Web.

Also, the tools I will use refer to a given character encoding using the preferred name. You can use the table in the Encoding spec to map alternative names to the preferred name I use.

What characters are in encoding X?

Let’s suppose you want to know what characters are in the character encoding you know as cseucpkdfmtjapanese. A quick check in the Encoding spec shows that the preferred name for this encoding is euc-jp.

Go to https://r12a.github.io/apps/encodings/ and look for the selection control near the bottom of the page labelled show all the characters in this encoding.

Select euc-jp. It opens a new window that shows you all the characters.

picture of the result

This is impressive, but so large a list that it’s not as useful as it could be.

So highlight and copy all the characters in the text area and go to https://r12a.github.io/apps/listcharacters/.

Paste the characters into the big empty box, and hit the button Analyse characters above.

This will now list for you those same characters, but organised by Unicode block. At the bottom of the page it gives a total character count, and adds up the number of Unicode blocks involved.

picture of the result
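If you prefer to script this kind of check locally, Python's codec machinery gives a rough equivalent. Bear in mind that Python's codecs are not guaranteed to match the Encoding spec exactly, so treat the results as approximate:

    def in_encoding(char, encoding='euc_jp'):
        # True if the character can be encoded, ie. is part of the (Python) encoding
        try:
            char.encode(encoding)
            return True
        except UnicodeEncodeError:
            return False

    for ch in 'aé漢字':                      # arbitrary test characters
        print(ch, 'U+%04X' % ord(ch), in_encoding(ch))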

What characters are not in encoding X?

If instead you actually want to know what characters are not in the encoding for a given Unicode block you can follow these steps.

Go to UniView (https://r12a.github.io/uniview/) and select the block you are interested in where it says Show block, or alternatively type the range into the control labelled Show range (eg. 0370:03FF).

Let’s imagine you are interested in Greek characters and you have therefore selected the Greek and Coptic block (or typed 0370:03FF in the Show range control).

On the edit buffer area (top right) you’ll see a small icon with an arrow pointing upwards. Click on this to bring all the characters in the block into the edit buffer area. Then hit the icon just to its left to highlight all the characters and then copy them to the clipboard.

picture of the result

Next open https://r12a.github.io/apps/encodings/ and paste the characters into the input area labelled with Unicode characters to encode, and hit the Convert button.

picture of the result

The Encoding converter app will list all the characters in a number of encodings. If the character is part of the encoding, it will be represented as two-digit hex codes. If not, and this is what you’re looking for, it will be represented as decimal HTML escapes (eg. &#880;). This way you can get the decimal code point values for all the characters not in the encoding. (If all the characters exist in the encoding, the block will turn green.)
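If you want to turn that escape output back into a character list programmatically, a couple of lines of Python will do it (the sample output string below is made up for illustration):

    import re, html

    output = '30 a2 &#880; &#881;'                    # hypothetical result for one encoding
    missing = [html.unescape(e) for e in re.findall(r'&#\d+;', output)]
    print(missing)                                    # ['Ͱ', 'ͱ']
    print(['U+%04X' % ord(c) for c in missing])       # ['U+0370', 'U+0371']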

(If you want to see the list of characters, copy the results for the encoding you are interested in, go back to UniView and paste the characters into the input field labelled Find. Then click on Dec. Ignore all ASCII characters in the list that is produced.)

Note, by the way, that you can tailor the encodings that are shown by the Encoding converter by clicking on change encodings shown and then selecting the encodings you are interested in. There are 36 to choose from.

Picture of the page in action.
>> Use the picker

Following closely on the heels of the Old Norse and Runic pickers comes a new Old English (Anglo-Saxon) picker.

This Unicode character picker allows you to produce or analyse runs of Old English text using the Latin script.

In addition to helping you to type Old English Latin-based text, the picker allows you to automatically generate phonetic and runic transcriptions. These should be used with caution! The transcriptions are only intended to be a rough guide, and there may occasionally be slight inaccuracies that need patching.

The picture in this blog post shows examples of Old English text, and phonetic and runic transcriptions of the same, from the beginning of Beowulf. Click on it to see it larger, or copy-paste the following into the picker, and try out the commands on the top right: Hwæt! wē Gār-Dena in ġēar-dagum þēod-cyninga þrym gefrūnon, hū ðā æþelingas ellen fremedon.

If you want to work more with runes, check out the Runic picker.

Picture of the page in action.
>> Use the picker

Character pickers are especially useful for people who don’t know a script well, as characters are displayed in ways that aid identification. These pickers also provide tools to manipulate the text.

The Runic character picker allows you to produce or analyse runs of Runic text. It allows you to type in runes for the Elder fuþark, Younger fuþark (both long-branch and short-twig variants), the Medieval fuþark and the Anglo-Saxon fuþork. To help beginners, each of the above has its own keyboard-style layout that associates the runes with characters on the keyboard to make it easier to locate them.

It can also produce a Latin transliteration for a sequence of runes, or automatically produce runes from a Latin transliteration. (Note that these transcriptions do not indicate pronunciation – they are standard Latin substitutes for graphemes, rather than actual Old Norse or Old English, etc., text. To convert Old Norse to runes, see the description of the Old Norse picker below. This will soon be joined by another picker which will do the same for Anglo-Saxon runes.)

Writing in runes is not an exact science. Actual runic text is subject to many variations dependent on chronology, location and the author’s idiosyncrasies. It should be particularly noted that the automated transcription tools provided with this picker are intended as aids to speed up transcription, rather than to produce absolutely accurate renderings of specific texts. The output may need to be tweaked to produce the desired results.

You can use the RLO/PDF buttons below the keyboard to make the runic text run right-to-left, eg. ‮ᚹᚪᚱᚦᚷᚪ‬, and if you have the right font (such as Junicode, which is included as the default webfont, or a Babelstone font), make the glyphs face to the left also. The Babelstone fonts also implement a number of bind-runes for Anglo-Saxon (but are missing those for Old Norse) if you put a ZWJ character between the characters you want to ligate. For example: ᚻ‍ᛖ‍ᛚ. You can also produce two glyphs mirrored around the central stave by putting ZWJ between two identical characters, eg. ᚢ‍ᚢ. (Click on the picture of the picker in this blog post to see examples.)

Picture of the page in action.
>> Use the picker

The Old Norse picker allows you to produce or analyse runs of Old Norse text using the Latin script. It is based on a standardised orthography.

In addition to helping you to type Old Norse Latin-based text, the picker allows you to automatically generate phonetic and runic transcriptions. These should be used with caution! The phonetic transcriptions are only intended to be a rough guide, and, as mentioned earlier, real-life runic text is often highly idiosyncratic, not to mention that it varies depending on the time period and region.

The runic transcription tools in this app produce runes of the Younger fuþark – used for Old Norse after the Elder and before the Medieval fuþarks. This transcription tool has its own idiosyncrasies, which may not always match real-life usage of runes. One particular idiosyncrasy is that the output always conforms to the same set of rules; others include the decision not to remove homorganic nasals before certain following letters. More information about this is given in the notes.

You can see an example of the output from these tools in the picture of the Old Norse picker that is attached to this blog post. Here’s some Old Norse text you can play with: Ok sem leið at jólum, gørðusk menn þar ókátir. Bǫðvarr spurði Hǫtt hverju þat sætti; hann sagði honum at dýr eitt hafi komit þar tvá vetr í samt, mikit ok ógurligt.

The picker also has a couple of tools to help you work with A New Introduction to Old Norse.