NOTE: Nowadays, items are only posted here very occasionally. For updates to the various apps and documents, see the r12a Twitter feed.

I've often heard from people that there needs to be a way to represent linguistic interlinear glossed text in HTML (with CSS).

(To my mind, the term 'interlinear gloss' isn't especially helpful in describing the particular use case I have in mind. Rather than inline annotations, where a word or two appears above or below the main text of a document in the interline space, here I'm talking about table-like, multi-line text containing aligned items, as in the example below. This arrangement is common for linguistic analyses of text.)

I also often hear the suggestion that an extension to ruby markup is needed to allow for this. I disagree. I also often hear that we need some new HTML markup for this. I don't think so.

I suspect that the following approach may tick all the necessary boxes, without any additions to anything. This example is quite simple, but it can easily be extended to include more lines and more interesting styling options. Note also that the text wraps according to the width of the window.

Ge'ez: | Pronunciation: | Gloss:
ወሶበ፡ | wä-sobä | and-when
ሰማዐ፡ | sämʾä | heard.he
ኢሳይያሰ፡ | ʾIsayəyyas | ʾIsayəyyas
ዘንተ፡ | zäntä | this
ነገረ፡ | nägärä | statement
እምአፉሆሙ፡ | ʾəm-ʾafu-homu | from-mouth-their
ለአግብርተ፡ | läʾägbərtä | servants.of
ሰይጣን፡ | säyṭan | Satan
ቦአ፡ | boʾä | went.he
ኀበ፡ | ḫäbä | to
ንጉሥ፡ | nəguś | king
ወይቤሎ፡ | wä-yəbel-o | and-he.said-to.him
ለንጉሥ፡ ... | lä-nəguś ... | to-king ...

When ʾIsayəyyas heard this statement from the mouth of the servants of Satan, he went to the king and said to the king ...

(Don't read anything into the text itself! It's just the first part of the gloss provided by Daniels & Bright for the Ethiopic section of The World's Writing Systems.)

The code

The markup uses divs and spans, which are then styled using CSS flexbox properties.

<div class="multilineGlossedText">
<div class="stack"><span class="legend">Ge'ez:</span><span class="legend">Pronunciation:</span><span class="legend">Gloss:</span></div>
<div class="stack"><span class="base" lang="gez">ወሶበ፡</span><span class="trans">wä-sobä</span><span class="gloss">and-when</span></div>
<div class="stack"><span class="base" lang="gez">ሰማዐ፡</span><span class="trans">sämʾä</span><span class="gloss">heard.he</span></div>
<div class="stack"><span class="base" lang="gez">ኢሳይያሰ፡</span><span class="trans">ʾIsayəyyas</span><span class="gloss">ʾIsayəyyas</span></div>
...
<div class="stack"><span class="base" lang="gez">ለንጉሥ፡ ...</span><span class="trans">lä-nəguś ...</span><span class="gloss">to-king ...</span></div>
</div>

This is the CSS code.

.multilineGlossedText {
  display: flex;
  flex-direction: row;
  flex-wrap: wrap;
}
.stack {
  display: flex;
  flex-direction: column;
  flex-wrap: nowrap;
  margin-right: .75em;
  margin-top: .5em;
}
.legend {
  font-style: italic;
}
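Because every stack has the same shape, the markup lends itself to being generated from data. The following JavaScript is a minimal sketch of that idea. The function names (escapeHtml, buildStack, buildGlossedText) are hypothetical and not part of the example above; only the class names and the lang attribute mirror the markup shown.

```javascript
// Escape characters that are special in HTML text and attribute values.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build one stack: base text, transcription and gloss, in that order.
// The lang value is applied to the base text only, as in the markup above.
function buildStack(base, trans, gloss, lang) {
  return '<div class="stack">' +
    `<span class="base" lang="${escapeHtml(lang)}">${escapeHtml(base)}</span>` +
    `<span class="trans">${escapeHtml(trans)}</span>` +
    `<span class="gloss">${escapeHtml(gloss)}</span>` +
    "</div>";
}

// Build the whole block: a legend stack followed by one stack per row.
// Each row is an array of [base, transcription, gloss].
function buildGlossedText(legends, rows, lang) {
  const legend = '<div class="stack">' +
    legends.map((l) => `<span class="legend">${escapeHtml(l)}</span>`).join("") +
    "</div>";
  const stacks = rows.map(([base, trans, gloss]) => buildStack(base, trans, gloss, lang));
  return '<div class="multilineGlossedText">\n' +
    [legend, ...stacks].join("\n") +
    "\n</div>";
}
```

Calling buildGlossedText(["Ge'ez:", "Pronunciation:", "Gloss:"], rows, "gez") with one [base, transcription, gloss] array per word should produce markup of the same form as the hand-written version. Adding a further line of annotation would mean adding one more span per stack.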

RTL text example

This example annotates text that is written right-to-left. It shows transcriptions written RTL and LTR, but without actually needing to reverse the string in the source. There's a lot more styling going on for the different types of data. It also applies small caps to a morphological identifier. In the original, the Arabic and reversed transcription were separate from the rest of the annotations, but here I bring them all together.

Ottoman : | Rev. Translit : | Transliteration : | Mod.Turkish : | Transliteration : | Gloss :
←خواجه | ←ḫvach | ḫvach | Hoca | hoʤʼɑ | teacher
مركبني | mrkbny | mrkbny | merkebini | mɛrkɛbɪ'ni | his.donkey
ضايع | żay’ | żay’ | zayi | zaji | lost
ايتمش | aytmş | aytmş | etmiş | ɛt'mɪʃ | he.made
هم | hm | hm | hem | 'hɛm | both
آرار | ʾarar | ʾarar | arar | a'rar | searching.for
هم | hm | hm | hem | 'hɛm | and
شكر | şkr | şkr | şükr | ʃykr | thanks
ايدر | aydr | aydr | eder | ɛdɛr | he.made
ايمش | ɑymş | ɑymş | imiş | ɪ'mɪʃ | past
سبب | sbb | sbb | sebebi | sɛ'bɛbi | the.cause.of
تشكرى | tşkry | tşkry | tɛşɛkkürü | tɛʃɛkːyry | thanks
صورمشلر | ṣvrmşlr. | ṣvrmşlr. | sormuşlar. | sormʊʃ-'lar. | asked-they

The teacher lost his donkey. He was both searching for it and was expressing his thanks. They asked the cause of being grateful.
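One standard way to display a Latin transliteration right-to-left without reversing the string in the source is the HTML bdo element (the CSS equivalent is unicode-bidi: bidi-override combined with direction: rtl). The sketch below assumes that mechanism. It is not necessarily what the example above uses, and the function names are hypothetical.

```javascript
// Escape characters that are special in HTML text.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Wrap a transliteration so that it is displayed right-to-left, character by
// character, while the source string stays in its normal left-to-right order.
// <bdo dir="rtl"> overrides the bidirectional algorithm for its content.
function reversedTransliteration(translit) {
  return `<bdo dir="rtl" class="trans">${escapeHtml(translit)}</bdo>`;
}
```

For example, reversedTransliteration("mrkbny") leaves the stored string as mrkbny but asks the browser to render it as ynbkrm. Setting dir="rtl" on the flex container should also make the stacks themselves flow from right to left, since a row-direction flex container follows the inline direction.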

I created a GitHub issue for further discussion of this idea.

 

A new template has been applied to the Script summary pages.

It includes the following changes:

  • a new section containing historical and usage information, drawn from ScriptSource and Wikipedia.
  • a panel that can be revealed to include information from Script links
  • character lists, by language, that point to Character usage lookup app
  • more items moved under the Text Layout section
  • the dialog box to change fonts (blue vertical bar, bottom right) now allows you to apply fonts and font sizes separately to the samples and the examples.
  • the text in the floating summary tables no longer links to sections in the text: instead the orange highlighting indicates features that differ from Latin script.

In addition, an effort was made to better standardise the organization of information across pages, and there were editorial changes to most pages.

Future changes will include a clearer separation of the summary of the script itself and information about a particular writing system that uses that script. For example, the Arabic script summary page describes the Arabic script in general, and also describes the way the script is used for Arabic language text. There is a separate page, however, describing use of the script for Urdu, including differences from Arabic and relevant phonetic information, etc.

 

The previous version of the Arabic script summary page was quite short on detail. A significant amount of new information has been added to the page, affecting most of the sections.

Picture of the page in action.

A new Mandaic Character Picker web app is now available. The Mandaic script is used for writing Mandaic, an Iraqi language spoken by about 5,500 people. It is also the script of Classical Mandaic, the liturgical language of the Mandaean religion.

The picker allows you to produce or analyse runs of text using the Mandaic script. Character pickers are especially useful for people who don’t know a script well, as characters are arranged in ways that aid identification.

Apart from the standard functions, a Transliterate Mandaic selector above the text area converts the content of the text area (or the highlighted text) to a Latin transliteration. A second selector, Transcribe Mandaic, attempts to produce a slightly more phonetic transcription (though still only approximate), and typically presents alternatives for ambiguous text.

Mandaic is a right-to-left script, so there are controls that allow you to change the base direction for the selection panels and for the output area.

See the help file for more information.

There is also a Mandaic script summary and a set of Mandaic character notes, which respectively give an overview of how the script works, and provide detailed information for individual characters in the Mandaic block.

Picture of the app in action.

A new Character usage lookup web app is now available. It draws on data sources (initially CLDR and Unicode's UDHR pages, with additional information from Wikipedia) to associate characters with a range of languages (440 at the moment). The app lets you find characters used by a particular language, or languages that use a given non-ASCII character.
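The two lookups described here (characters for a language, and languages for a character) can be sketched as a map and its inverse. This is only an illustration: the three-language sample below is a hand-written stand-in rather than the app's CLDR or UDHR data, and the function names are hypothetical.

```javascript
// Tiny illustrative sample only, keyed by BCP47 language tag.
const sampleData = {
  de: "\u00E4\u00F6\u00FC\u00DF",             // ä ö ü ß
  sv: "\u00E5\u00E4\u00F6",                   // å ä ö
  tr: "\u00E7\u011F\u0131\u00F6\u015F\u00FC", // ç ğ ı ö ş ü
};

// Invert "language -> characters" into "character -> set of languages".
function buildCharacterIndex(data) {
  const index = new Map();
  for (const [lang, chars] of Object.entries(data)) {
    for (const ch of chars) {
      if (ch.codePointAt(0) < 0x80) continue; // skip ASCII characters
      if (!index.has(ch)) index.set(ch, new Set());
      index.get(ch).add(lang);
    }
  }
  return index;
}

// Return the languages in the index that use the given character.
function languagesUsing(index, ch) {
  return Array.from(index.get(ch) || []);
}
```

With this sample, languagesUsing(buildCharacterIndex(sampleData), "ö") returns all three languages, while "ß" returns only de.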

ASCII characters are ignored. Only the core characters from CLDR are shown (not the auxiliary), but every character that appears in a UDHR transcription is shown. Going forward, I expect to add information from other sources. Characters shown for a language include all characters produced by applying uppercase, lowercase, NFC, and NFD to the set of characters attributed to that language by its source. Chinese languages, Japanese, and Korean are not listed.
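The expansion by case and normalization form can be sketched with the standard toUpperCase, toLowerCase and normalize string methods. This is one reading of the description, not the app's actual code: it applies NFC and NFD to each case variant, and the function name and the ignoreAscii parameter are hypothetical.

```javascript
// Expand a set of characters with their uppercase, lowercase, NFC and NFD
// forms, collecting every resulting code point.
function expandCharacterSet(chars, ignoreAscii = true) {
  const expanded = new Set();
  for (const ch of chars) {
    for (const cased of [ch, ch.toUpperCase(), ch.toLowerCase()]) {
      for (const form of [cased.normalize("NFC"), cased.normalize("NFD")]) {
        // Iterating a string yields whole code points, not UTF-16 code units.
        for (const cp of form) {
          if (ignoreAscii && cp.codePointAt(0) < 0x80) continue;
          expanded.add(cp);
        }
      }
    }
  }
  return expanded;
}
```

Starting from é alone, this yields é, É and the combining acute accent U+0301. The decomposed base letters e and E are dropped because they are ASCII.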

The Native speakers column gives the estimated number of native speakers for all the languages listed, to give a rough idea of how prevalent that character is. It doesn't include people who speak a language as a second language, and that number is often a multiple of the native speaker total. On the other hand, the figure counts speakers rather than literate users, so it represents potential users of the character. Depending on the language, therefore, the figures may be low, or at least conservative, for many languages, and possibly high for some (typically small languages, or where an alternate orthography is used).

The following tips may be useful:

  • Mouse over the characters displayed to see their Unicode code point value and name. The U icon will show all characters in that cell in UniView. This can be useful if you don't have fonts for that script, since UniView uses images by default.
  • The control Use language name works as follows: Type in a name, or part of a name, of a language. Select the language you want from the suggestions offered. This will put the BCP47 code into the control. Hit return and you'll display information for that language.
    Unfortunately, this doesn't work with Safari, or (therefore) iOS. If you need to find a BCP47 code for a language, go to https://r12a.github.io/app-subtags/.
  • When adding characters you want to look up to the input field, you can add Unicode code point numbers with space to either side, or escapes. For example, for આ any of the following escapes will work: &#x0A86; \u0A86 \u{A86} \0A86 U+0A86 0xA86. No extra space is needed between escapes, and supplementary characters work too.
  • After you have generated a list of languages that use a given character, if you click on a language name then details for that language will be displayed above.
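The escape formats listed in the tips can be recognised with a single regular expression. The following is a rough sketch rather than the app's parser: the function name is hypothetical, bare code point numbers separated by spaces are not handled, and two abutting 0x escapes are ambiguous because the hex digits run together.

```javascript
// One alternative per escape format: &#x0A86;  \u{A86}  \u0A86  \0A86  U+0A86  0xA86
const ESCAPE = /&#x([0-9A-Fa-f]+);|\\u\{([0-9A-Fa-f]+)\}|\\u([0-9A-Fa-f]{4})|\\([0-9A-Fa-f]{4,6})|U\+([0-9A-Fa-f]{4,6})|0x([0-9A-Fa-f]+)/g;

// Replace every recognised escape in the input with the character it names.
function expandEscapes(input) {
  return input.replace(ESCAPE, (match, ...rest) => {
    // The first six extra arguments are the capture groups; exactly one is set.
    const hex = rest.slice(0, 6).find((group) => group !== undefined);
    const codePoint = parseInt(hex, 16);
    // Leave the text untouched if the value is outside the Unicode range.
    return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : match;
  });
}
```

Each of the six forms given for આ expands to U+0A86 with this function, and supplementary characters such as U+1F600 work because String.fromCodePoint is used rather than String.fromCharCode.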

Update, 8 Jan 2018:

New: Bambara & Eastern Maninkakan using N'Ko script, Sundanese using Sundanese script, & Neo-Mandaic.

Changes to: Assyrian Neo-Aramaic.

Character counts were also added, just above the Source field.

Picture of the app in action.

The List characters web app lets you find out what characters are contained in some text. Simply drop the text into the large box and click Analyse the text above. The result is grouped by Unicode blocks.

This update adds three new buttons: Convert to NFC, Convert to NFD, and Get all forms. The first two buttons are self-explanatory if you are familiar with Unicode Normalization forms. (If not, see the explanation in my Unicode tutorial.)

The third button makes four copies of the text, and to each applies one of the following: (1) all lowercase, (2) all uppercase, (3) all NFC, and (4) all NFD. When you then hit the Analyse the text above button you get far more results, as shown in the picture above, which shows what the same (Vietnamese) text produces before and after using this new button.
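The effect of the third button can be approximated as follows. This is a sketch under assumptions: the function names are hypothetical, the copies are joined with line breaks, and the unique-character count ignores white space, which may differ from what the app does. Grouping by Unicode block is not attempted.

```javascript
// Make four copies of the text: lowercase, uppercase, NFC and NFD.
function getAllForms(text) {
  return [
    text.toLowerCase(),
    text.toUpperCase(),
    text.normalize("NFC"),
    text.normalize("NFD"),
  ].join("\n");
}

// List the distinct characters (code points) in the text, in code point order.
function uniqueCharacters(text) {
  return [...new Set(text)]
    .filter((ch) => !/\s/.test(ch))
    .sort((a, b) => a.codePointAt(0) - b.codePointAt(0));
}
```

For the single Vietnamese letter ệ (U+1EC7), uniqueCharacters finds one character before the expansion and five afterwards: e, the combining circumflex U+0302, the combining dot below U+0323, Ệ and ệ.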

There is also a new line showing the total number of unique characters in the text.

Picture of the page in action.

A new Makasar Character Picker web app is now available. The Makasar script was formerly used for the Makassarese language and will be added to the Unicode Standard in version 11. The script is sometimes called Makassarese Bird Script or Old Makassarese.

The picker allows you to produce or analyse runs of text using the Makasar script. Character pickers are especially useful for people who don’t know a script well, as characters are arranged in ways that aid identification.

Apart from the standard functions, the Makasar to Latin selector above the text area converts the content of the text area (or the highlighted text) to a Latin transliteration. Note that this is unable to capture geminated consonant sounds, syllable-final consonants, or missing vowel sounds, since those are not indicated by the Makasar script. It does, however, handle consonant reduplication when signalled by angka or doubled vowel-signs.

I was unable to make a functional webfont from MakasarGraphite, which is the only font I know of for Makasar, so you will need to download the font for the picker to work correctly. Even then, because this is a Graphite font, it will only work on Firefox. Also note that Makasar code points are not officially assigned until Unicode 11 is released.

See the help file for more information.

Picture of the page in action.

A new Buginese Character Picker web app is now available. The picker allows you to produce or analyse runs of text using the Buginese script. Character pickers are especially useful for people who don’t know a script well, as characters are arranged in ways that aid identification.

Apart from the standard functions, the Buginese to Latin selector above the text area converts the content of the text area (or the highlighted text) to a Latin transliteration. (Note that it only converts characters in the Buginese text; it doesn't add invisible endings or gemination.)

See the help file for more information.

In addition there is a Buginese script summary and accompanying individual character notes.

Information about Buginese has also been added to the Script comparison chart.

Picture of the page in action.

A new Sundanese Character Picker web app is now available. The picker allows you to produce or analyse runs of text using the Sundanese script. Character pickers are especially useful for people who don’t know a script well, as characters are arranged in ways that aid identification.

Apart from the standard functions, the Sundanese to Latin selector above the text area converts the content of the text area (or the highlighted text) to a Latin transliteration.

See the help file for more information.

Picture of the page in action.

A Sundanese script summary and accompanying individual character notes are now available.

There is some sample text near the top of the document. If you highlight part of it, the page shows you the Unicode characters used for that selection. Clicking on red example text also shows the Unicode characters that underlie the text.
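Listing the code points behind a selection is straightforward in JavaScript. The function below is a hypothetical sketch, not the page's own code, and it gives only the code point values. Character names would need a copy of the Unicode character data, which is not included here.

```javascript
// Return the code points in a string in U+XXXX notation.
// Array.from iterates by code point, so supplementary characters stay whole.
function listCodePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}
```

In a browser, the highlighted text could be obtained with window.getSelection().toString() and passed to this function.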

There's also a panel on the right that summarises key features of the script, taking its information from the Script comparison chart.

As always, this is not authoritative, peer-reviewed information – these are just notes I have gathered or copied from various places as I learned. But maybe it will be helpful to someone. Please raise a github issue if you want to propose some changes.

Picture of the page in action.

A new Javanese Character Picker web app is now available. The picker allows you to produce or analyse runs of text using the Javanese script. Character pickers are especially useful for people who don’t know a script well, as characters are arranged in ways that aid identification.

Apart from the standard functions, there are a few special features:

  • Click on the SS button to replace the standard set of consonants with subjoined forms on the selection area.
  • The picker comes with a shape selector (click on S on the grey bar to the left). This is particularly useful for people unfamiliar with the script. The orange bar shows a set of characters, each of which begins (on the left) with a different shape. Click on one and the Javanese letters that start with the same shape are highlighted above and displayed below. Click on either to add to the text area.
  • The Javanese to Latin selector above the text area converts the content of the text area (or the highlighted text) to a Latin transliteration.

See the help file for more information.

Picture of the page in action.

A Javanese script summary and accompanying individual character notes are now available.

There is some sample text near the top of the document. If you highlight part of it, the page shows you the Unicode characters used for that selection. Clicking on red example text also shows the Unicode characters that underlie the text.

There's also a panel on the right that summarises key features of the script, taking its information from the Script comparison chart.

As always, this is not authoritative, peer-reviewed information – these are just notes I have gathered or copied from various places as I learned. But maybe it will be helpful to someone. Please raise a github issue if you want to propose some changes.

Picture of the page in action.

A new Gujarati Character Picker web app is now available. The picker allows you to produce or analyse runs of text using the Gujarati script. Character pickers are especially useful for people who don’t know a script well, as characters are arranged in ways that aid identification.

So far this picker has only basic functionality. Under the hood, however, it uses a new architecture for pickers that centralises as much of the code as possible, which is why it is version 21. This should not be visually apparent.

I used to have a Gujarati picker, but it fell into disuse several years ago. This version, although still basic, is already a significant improvement on the previous one.

See the help file for more information.

It's now possible to save any changes you make to the settings, such as font size and family, direction, language, etc, so that they are unchanged when you start your next session.

Under more controls there is a reset button, which will restore the original default settings, if you need it.
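Saving and resetting settings of this kind is commonly done with the browser's Web Storage API. The sketch below assumes that approach; the key name, the default values and the function names are all hypothetical. The storage object is passed in as a parameter so that the same code works with window.localStorage or with a stand-in.

```javascript
const SETTINGS_KEY = "pickerSettings"; // hypothetical storage key

// Hypothetical defaults, used when nothing has been saved or after a reset.
const DEFAULT_SETTINGS = {
  fontSize: "24px",
  fontFamily: "serif",
  direction: "ltr",
  language: "",
};

// Store the current settings as JSON.
function saveSettings(storage, settings) {
  storage.setItem(SETTINGS_KEY, JSON.stringify(settings));
}

// Read saved settings, falling back to the defaults for anything missing.
function loadSettings(storage) {
  const stored = storage.getItem(SETTINGS_KEY);
  if (stored === null) return { ...DEFAULT_SETTINGS };
  try {
    return { ...DEFAULT_SETTINGS, ...JSON.parse(stored) };
  } catch (error) {
    return { ...DEFAULT_SETTINGS }; // ignore corrupt data
  }
}

// Remove the saved settings and return the defaults.
function resetSettings(storage) {
  storage.removeItem(SETTINGS_KEY);
  return { ...DEFAULT_SETTINGS };
}
```

In a page this would be called as saveSettings(window.localStorage, settings). Because Web Storage is local to one browser on one device, settings saved this way do not follow you to another computer.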

Picture of the page in action.

In the above picture you can also see a new control that allows you to change the language for examples generated using the Make example or Character markup controls. You should use a BCP47 language tag here (i.e. the value that will appear in the lang attribute in the HTML code).

In addition, it is now possible to increase the contrast for the text on the user interface. Click on the toggle button at the top right of the page.

Picture of the page in action.

The settings for both these new controls are remembered for your next session, if you agree to store them on your device. Settings are only saved for the computer or device you are using. If you open the picker on a different computer or device you'll need to set things up again for that one.

See the list of pickers.

Picture of the page in action.

A new Cherokee Character Picker web app is now available. The picker allows you to produce or analyse runs of text using the Cherokee script. Character pickers are especially useful for people who don’t know a script well, as characters are arranged in ways that aid identification.

The picker supports both uppercase and lowercase letters. Click on the Shift button to switch between them.

The default selection panel reflects the syllabic sounds of Cherokee. If you select S in the vertical grey bar, the selection panel changes to show letters arranged based on (broad) similarity to ASCII characters. This may help people unfamiliar with the Cherokee script to find characters.

See the help file for more information, and see an earlier blog post for new features in the version 20 pickers.

Picture of the page in action.

A Cherokee script summary and accompanying individual character notes are now available.

There is some sample text near the top of the document. If you highlight part of it, the page shows you the Unicode characters used for that selection. Clicking on red example text also shows the Unicode characters that underlie the text.

There's also a panel on the right that summarises key features of the script, taking its information from the revamped Script comparison chart.

As always, this is not authoritative, peer-reviewed information – these are just notes I have gathered or copied from various places as I learned. But maybe it will be helpful to someone. Please raise a github issue if you want to propose some changes.

Picture of the page in action.

A new N'Ko Character Picker web app is now available. The picker allows you to produce or analyse runs of text using the N'Ko script. Character pickers are especially useful for people who don’t know a script well, as characters are arranged in ways that aid identification.

This is a fairly simple picker for now, with no vertical grey bar selections. There are just two things to note:

  • Switch base direction of the text area using →︎ ↔︎ ←︎, just below the text area.
  • Switch the direction of the input table. (Under more controls.)

See the help file for more information, and see an earlier blog post for new features in the version 20 pickers.

Picture of the page in action.

A N'Ko script summary and accompanying individual character notes are now available.

There is some sample text near the top of the document. If you highlight part of it, the page shows you the Unicode characters used for that selection. Clicking on red example text also shows the Unicode characters that underlie the text.

There's also a panel on the right that summarises key features of the script, taking its information from the revamped Script comparison chart.

As always, this is not authoritative, peer-reviewed information – these are just notes I have gathered or copied from various places as I learned. But maybe it will be helpful to someone. Please raise a github issue if you want to propose some changes.

Picture of the page in action.

A new Thaana Character Picker web app is now available. The picker allows you to produce or analyse runs of Dhivehi text using the Thaana script. Character pickers are especially useful for people who don’t know a script well, as characters are arranged in ways that aid identification.

Highlights:

  • Click on the M on the grey bar to display a panel that allows you to generate Thaana text from characters used for the official Maldivian Latin transcription. Click on the transcription character and the appropriate Thaana character will be produced. Where there are multiple possible choices, these choices are presented in a small pop-up box; click on the choice you want in order to add it to the text area. Arabic characters are included.
  • Switch base direction of the text area using →︎ ↔︎ ←︎, just below the text area.
  • Switch the direction of the input table. (Under more controls.)

See the help file for more information, and see an earlier blog post for new features in the version 20 pickers.

Picture of the page in action.

A Thaana script summary and accompanying individual character notes are now available.

There is some sample text near the top of the document. If you highlight part of it, the page shows you the Unicode characters used for that selection. Clicking on red example text also shows the Unicode characters that underlie the text.

There's also a panel on the right that summarises key features of the script, taking its information from the revamped Script comparison chart.

As always, this is not authoritative, peer-reviewed information – these are just notes I have gathered or copied from various places as I learned. But maybe it will be helpful to someone. Please raise a github issue if you want to propose some changes.

Picture of the page in action.

A new Aramaic Character Picker web app is now available. The picker allows you to produce or analyse runs of Assyrian Neo-Aramaic text using the Syriac script. Character pickers are especially useful for people who don’t know a script well, as characters are arranged in ways that aid identification.

Highlights:

  • Tailored specifically to typing Aramaic. Let me know if there are characters missing. There are some additional characters for Garshuni transcriptions hidden under a reveal.
  • See the cursive shapes associated with a character as you mouse over it. (Click on S in the vertical grey bar.)
  • Speed up transcription from transliteration characters to Syriac. (Click on the bottom T in the vertical grey bar.)
  • Switch base direction of the text area using →︎ ↔︎ ←︎, just below the text area.
  • Switch the direction of the input table. (Under more controls.)
  • New features in the latest picker template:
    • Create a URL to share with others that will show what you have in the text area when they follow the link.
    • Add some sample text to the text area for experimentation.
    • Lowercase the bicameral text in the text area.

See the help file for more information.

Picture of the page in action.

A new Syriac Character Picker web app is now available. The picker allows you to produce or analyse runs of text using the maḏnḥāyā (ܡܲܕ݂ܢܚܵܝܵܐ) (eastern), ʾesṭrangēlā (ܐܣܛܪܢܓܠܐ), and serṭā (ܣܶܪܛܳܐ) (western) styles of the Syriac script. Character pickers are especially useful for people who don’t know a script well, as characters are arranged in ways that aid identification.

Highlights:

  • Covers the three main styles of Syriac writing, with webfonts for each included. Let me know if there are characters missing. There are additional characters for Garshuni, Persian and Sogdian transcriptions hidden under a reveal, and a good many extra punctuation characters and marks under another. I suspect I should pull some of the characters in the reveals up into the main area. Let me know if you think there are some clear candidates for that.
  • See the cursive shapes associated with a character as you mouse over it. (Click on S in the vertical grey bar.) This panel shows joining forms for all three different styles at the same time.
  • Switch base direction of the text area using →︎ ↔︎ ←︎, just below the text area.
  • Switch the direction of the input table. (Under more controls.)
  • New features in the latest picker template:
    • Create a URL to share with others that will show what you have in the text area when they follow the link.
    • Add some sample text to the text area for experimentation.
    • Lowercase the bicameral text in the text area.

See the help file for more information.

Picture of the page in action.

A Syriac script summary and accompanying individual character notes are now available. They cover maḏnḥāyā (ܡܲܕ݂ܢܚܵܝܵܐ) (eastern), ʾesṭrangēlā (ܐܣܛܪܢܓܠܐ), and serṭā (ܣܶܪܛܳܐ) (western) styles, with a particular emphasis on Assyrian Neo-Aramaic.

The summary uses a new approach, which has been applied to the other notes of this kind. There is some sample text near the top of the document (and if you highlight part of it, the page shows you the Unicode characters used for that selection).

There's also a panel on the right that summarises key features of the script, taking its information from the revamped Script comparison chart.

As always, this is not authoritative, peer-reviewed information – these are just notes I have gathered or copied from various places as I learned. But maybe it will be helpful to someone. Please raise a github issue if you want to propose some changes.

 

Version 20 is the first time all the pickers share the same template. It comes with a few new features worth mentioning.

Icons at the top left of the text area.

New text area icons. The icons on the left above the input box allow you (listing them from left to right) to copy the text to the clipboard, select the text, and delete it. The other icons on the line are new. The box with the arrow generates a URL - if you send the URL to someone and they click on it, they will see what you see in the text box. The plus sign adds some sample text to the text area, so you can play around with the controls, or just see what the script looks like. The blue circle opens the associated help file.
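A shareable URL of this kind can be built by putting the text area content into a query parameter. The sketch below uses the standard URL and URLSearchParams interfaces. The parameter name text and the function names are assumptions for illustration, not the pickers' actual scheme.

```javascript
// Build a URL that carries the text area content in a query parameter.
function makeShareUrl(pageUrl, text) {
  const url = new URL(pageUrl);
  url.searchParams.set("text", text); // "text" is a hypothetical parameter name
  return url.toString();
}

// Recover the shared text from a URL, or an empty string if there is none.
function readSharedText(href) {
  return new URL(href).searchParams.get("text") || "";
}
```

The searchParams interface percent-encodes the text when the URL is serialised and decodes it again on reading, so non-ASCII scripts and characters such as & survive the round trip.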

The help text is in a separate location in v20, rather than under a reveal at the bottom of the page. This allows for easier reading. There's also a picture of the picker; click on a feature to jump to a description.

The small L only appears on a couple of pickers. It turns any bicameral text in the text area into uniform lowercase.

Direction-setting icons below the text area.

Direction arrows. The set of arrows immediately below the text area, on the right, appears for scripts normally written right-to-left. The arrows allow you to set the direction of the text area to LTR, auto, and RTL, respectively.

Example generator. The feature that generates an example (Make example) now outputs the transliteration before the IPA transcription (i.e. the order is reversed). This seemed to make more sense, as it keeps the transliteration closer to the actual native text. (This may not be reflected yet in all the help files.)

Smallprint. The smallprint line at the very bottom of the picker page now has links to the GitHub commit list for the picker, and a link that generates a new GitHub issue for any comments you want to make.

Other pickers. Finally, there's a reveal at the bottom of the page to let you link to other pickers quickly.