Bidi in plain text (draft!)

Updated Thu 2 Jul 2015 • tags bidi, scriptnotes

Increasingly, questions are arising at the W3C about how to specify bidi handling for plain text environments or structures that are not part of an HTML/XML page. Recent examples include formats such as JSON, CSV, Web Annotation, Activity Stream, and WebShare.

My aim in this page is to provide background information that can help with those discussions, and carry useful ideas from one discussion to the next. I also add some personal thoughts on implementation alternatives, given current data.

I'm using this page to clarify things for myself. It is only a draft, and will be updated from time to time, as new information becomes available, as feedback arrives, and as ideas are clarified. Latest update was 2016-07-28 15:59.

Establishing base direction

The importance of establishing the base direction

A base direction (ie. directional context) needs to be determined in order to display bidirectional text correctly. By hook or by crook this will be set, whether explicitly or by default.

If you are not familiar with what the Unicode Bidirectional Algorithm (UBA) does and doesn't do, read this first.

What is a paragraph?

In what follows, the word paragraph indicates a run of text followed by a hard line-break in plain text, but may signify different things in other situations. In CSV it equates to 'cell', so a single line of comma-separated items is actually a set of comma-separated paragraphs.  In HTML it equates to the lowest level of block element, which is often a p element, but may be things such as div, li, etc., if they only contain text and/or inline elements. In JSON, it often equates to a quoted string value, but if a string value uses markup then paragraphs are associated with block elements, and if the string value is multiple lines of plain text then each line is a paragraph.

Ways base direction can be set for paragraphs

There are a number of possible ways of setting the base direction. In an attempt to keep this from being overly abstract, I'll use HTML to illustrate some alternatives.

When I say metadata I mean information which could be an annotation associated with the data, could be markup in scenarios that allow that, could be a higher-level protocol such as HTTP (in theory), etc.

  1. the base direction of a paragraph may be set by the application applying metadata to the paragraph. (This is what happens when you set the dir attribute on the html tag or some other tag in HTML.) If the base direction is set by metadata, there is no need to look for the first-strong character to determine the base direction – it is already determined.
    • the metadata may specifically indicate that first-strong heuristics should be used. Then you would expect to consider the actual characters used in order to determine the base direction. (This is what happens if you set dir=auto on an HTML element.)
    • the application may expect metadata, but there may be no such information provided. In this case you would usually expect there to be a default direction specified, and the base direction for a cell would be set to that default. The default is usually ltr. (This is what happens if you have no dir attributes in your HTML file.)
    • Where a format contains many paragraphs or chunks of information, and the language of text in all those chunks is the same, it is best to allow a default base direction to be set for and inherited by all. For example, for a subtitling file containing many cues, all written in Arabic, it would be best to allow the author to say at the start of the file that the whole file is in Arabic, and for every cue to apply that default unless local metadata is available to override it.
  2. if the application expects no metadata it should use heuristics to determine the base direction for each paragraph/cell. A typical solution, and one described by UAX 9 Unicode Bidirectional Algorithm, is to look for the first-strong character in the paragraph/cell. (This doesn't apply to HTML, since HTML specifies a default direction. It is likely to apply if you are looking at plain text files that are not expected to be associated with metadata.)
    • Not all paragraphs using the first-strong method will have the correct base direction applied. In some cases, an Arabic or Hebrew, etc, paragraph may start with strong LTR characters. There must be a way to deal with this.
    • Where a syntactic unit contains multiple lines of plain text (for example, a multiline cue text in a subtitling file), the first-strong heuristic needs to be applied to each line separately.
    • There may be special rules that involve ignoring something at the start of the paragraph before finding the first strong character.
    • In some cases there are no strong characters in a paragraph, and the base direction can be critically important for the data to be understood correctly, eg. telephone numbers or MAC addresses. There needs to be a way to resort to an appropriate default for these cases.
  3. whether or not there is any metadata specified, if the paragraph contains a string that starts with one of the Unicode bidi control characters RLI, LRI, FSI, LRE, RLE, LRO, or RLO and ends with PDF/PDI, these characters will determine the base direction for the contained string. These characters, when placed in the content, explicitly override any previously set direction by creating an inline range and assigning a base direction to it.
    • The effect of such characters does not extend past paragraph boundaries, but the range ought to be explicitly ended using the PDF/PDI control character, especially if a paragraph end is not easily detectable by the application.
    • Because isolation is needed for bidirectional text to work properly, the Unicode Standard says that the isolating control codes RLI, LRI and FSI should be used rather than LRE or RLE. Unfortunately, those characters are still not widely supported.
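
The alternatives in the list above can be sketched in a few lines of Python. This is only an illustration, not a full UBA implementation: the function names and the metadata values ('ltr', 'rtl', 'auto', modelled on the HTML dir attribute) are my own assumptions, and the first-strong scan is simplified in that it does not skip over text enclosed in isolates, which UAX 9 requires.

```python
import unicodedata

def first_strong_direction(paragraph, default="ltr"):
    """Return 'ltr' or 'rtl' based on the first strong character, else the default."""
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    # No strong character at all (eg. a telephone number): fall back to the default.
    return default

def resolve_base_direction(paragraph, metadata_dir=None, default="ltr"):
    """Hypothetical resolver for an application that expects metadata (case 1)."""
    if metadata_dir in ("ltr", "rtl"):
        return metadata_dir          # metadata wins, the characters are not inspected
    if metadata_dir == "auto":
        return first_strong_direction(paragraph, default)
    return default                   # metadata expected but missing

def directions_per_line(text, default="ltr"):
    """Case 2: no metadata expected, so each line of plain text is its own paragraph."""
    return [first_strong_direction(line, default) for line in text.split("\n")]
```

For example, resolve_base_direction("مصر") returns 'ltr' because no metadata was supplied, whereas resolve_base_direction("مصر", "auto") returns 'rtl'.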

Where there is a possibility to use metadata rather than control codes, it is advisable to use it – particularly for content that is created by an author.

In fact, for structural components above the paragraph level, it is not possible to use the Unicode bidi control characters to define direction for the paragraphs they contain, since the effect is terminated by a paragraph end.

Problems with control characters

Reasons to avoid relying on control characters at the paragraph level to set direction include the following:

  1. they are invisible in most editors and are therefore difficult to work with.
  2. it is sometimes necessary to choose which to use based on context or the type of the data, and this means that a content author typically needs to select the control codes – specifying control codes in this way for all paragraphs is time-consuming and error-prone.
  3. processors that extract parts of the data, add to it, or reuse it in combination with other text may handle the control codes incorrectly.
  4. search and comparison algorithms should ignore these characters, but typically don't.
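
As an illustration of point 4, here is a minimal sketch of a comparison that ignores directional formatting characters. The function names are hypothetical, and the decision to strip LRM and RLM along with the embedding, override and isolate controls is my own assumption about what a search routine would want.

```python
# Embedding/override controls, isolate controls, and the two directional marks.
BIDI_CONTROLS = dict.fromkeys(map(ord, (
    "\u202a\u202b\u202c\u202d\u202e"   # LRE, RLE, PDF, LRO, RLO
    "\u2066\u2067\u2068\u2069"         # LRI, RLI, FSI, PDI
    "\u200e\u200f"                     # LRM, RLM
)))

def strip_bidi_controls(text):
    """Remove directional formatting characters (mapping to None deletes them)."""
    return text.translate(BIDI_CONTROLS)

def equal_ignoring_bidi_controls(a, b):
    """Compare two strings as a search routine might, ignoring bidi controls."""
    return strip_bidi_controls(a) == strip_bidi_controls(b)
```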

Inline changes to base direction

There may also be embedded ranges of text within a single paragraph that need to have a different base direction. For example,

"The title was '!NOITASILANOITANRETNI'."

where the span within the single quotes is in Hebrew/Arabic/Divehi, etc., and needs to have a RTL base direction, instead of the LTR base direction of the surrounding paragraph, in order to place the exclamation mark correctly.

Again, it's typically easier and safer for authors to use markup to indicate such inline ranges. In HTML you would usually use an inline element with dir attribute to establish the base direction for such runs of text. If you can't mark up the text, such as in HTML's title element, or any environment that handles only plain text content, you have to resort to Unicode's paired control characters to establish the base direction for such an internal range.

Furthermore, inline ranges where the base direction is changed should be isolated from surrounding text, so that the UBA doesn't produce incorrect results due to interference across boundaries. See an example of how this can produce incorrect ordering of things such as text followed by numbers in HTML, or another example of how it can affect lists.

This means that if you are using Unicode control codes you should use RLI/LRI...PDI rather than RLE/LRE...PDF.  These isolating codes are fairly new, and applications may not yet support them.
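
A minimal sketch of wrapping an inline range in isolating controls follows. The helper name isolate and its 'ltr'/'rtl'/'auto' argument are hypothetical; the code points are those of LRI, RLI, FSI and PDI. Whether the result displays correctly still depends on the rendering application supporting isolates.

```python
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"

def isolate(text, direction="auto"):
    """Wrap text in an isolate initiator and PDI; 'auto' uses FSI (first-strong)."""
    initiator = {"ltr": LRI, "rtl": RLI, "auto": FSI}[direction]
    return initiator + text + PDI

# A Hebrew title ending in a Latin acronym, embedded in an LTR sentence.
title = "פעילות הבינאום, W3C"
sentence = 'The title is "' + isolate(title, "rtl") + '".'
```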

RLM and LRM

A word about the Unicode characters U+200F RIGHT-TO-LEFT MARK (RLM) and U+200E LEFT-TO-RIGHT MARK (LRM) is warranted at this point.

The first point to be clear about is that neither RLM nor LRM establishes the base direction for a range of text.  They are simply invisible characters with strong directional properties.

This means that you cannot use RLM, for example, to make the text W3C appear to the left of the Hebrew text in the following example.

The title is "פעילות הבינאום, W3C".

For this you can only use metadata or the paired control characters.

Of course, if you are detecting base direction using first-strong heuristics then RLM and LRM can be useful for setting the base direction where the text in question begins with something that would otherwise give the wrong result, eg.

"نشاط التدويل" is how you say "i18n Activity" in Arabic.

Here an LRM could be placed at the start of the text, before the strong RTL Arabic characters, to prevent the algorithm from assuming that the text should be right-to-left. (Remember that if metadata is used to set the base direction, that character is ignored, unless the metadata specifically says that first-strong heuristics should be used.)
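
The effect of a leading mark on first-strong detection can be checked with a short sketch. The helper names are hypothetical; the only facts relied on are that LRM has bidi class L and RLM has bidi class R.

```python
import unicodedata

LRM, RLM = "\u200e", "\u200f"

def first_strong_class(text):
    """Return the bidi class of the first strong character ('L', 'R', 'AL'), or None."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("L", "R", "AL"):
            return bidi_class
    return None

def with_leading_mark(text, direction):
    """Prepend LRM or RLM so that first-strong detection yields the wanted direction."""
    return (LRM if direction == "ltr" else RLM) + text

text = '"نشاط التدويل" is how you say "i18n Activity" in Arabic.'
print(first_strong_class(text))                            # AL (would be treated as rtl)
print(first_strong_class(with_leading_mark(text, "ltr")))  # L
```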

Implications for CSV data

It's worth noting that the order in which columns in a plain-text CSV file are displayed will be affected by the contents, and bidirectional or rtl+numeric data will be hard for humans to read unless Unicode control characters are used in abundance. For example, without any additional information over and above the UBA, take the following data:

  1. col 1 (region code): EG
  2. col 2 (per capita GDP): $3.724
  3. col 3 (country name): مصر
  4. col 4 (capital): القاهرة
  5. col 5 (population): "88,978,000"
  6. col 6 (gps): 30°2′N 31°13′E

If we put that data into a single line, separated by commas, and opened it in a simple text editor that supports the Unicode bidi algorithm, we would see the following if the base direction is LTR:

If the base direction is RTL, we would see:

Note how there are different problems in each case, and that some of the values appear to be different from what was intended. Don't overlook, by the way, that although the order of the Arabic text looks the same, the items appear to be in the wrong columns, respectively, in the LTR version.

Of course, the above is only smoke and mirrors: the underlying order of characters is accurate, and readable by an application, and starts always with column 1.

Now let's look at what's possible if we can associate metadata with the table, columns, cells, etc.

If most of the content of a CSV file should be treated as rtl, it is easiest to indicate this in the table metadata, and allow it to be inherited by all cells. It's also worth specifying a default direction, ie. for the case where no metadata is provided.

However, certain cells may need to have a specific direction in order for the data to be readable, and it's not always easy to detect for which cells that applies. For example, cells in a rtl table that contain MAC addresses, equations, negative signs, telephone numbers, and such may need to be given a LTR direction within an overall RTL dataset in order to be comprehensible to the end user. Consider these examples, which show how, in some cases, such as MAC addresses, the user could actually be completely unaware that the data they are seeing is incorrect if the appropriate base direction setting was missing.

It is possible that what is needed for these cases can be achieved much of the time by setting the direction for the column, such that it is inherited by those cells. This is based on the assumption that all cells in a column contain the same type of information in the same format.
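
The inheritance described here (cell, then column, then table, then a specified default) could be resolved along the following lines. This is a sketch only: the metadata is modelled as plain dictionaries with a hypothetical "dir" key, and the column names are invented for illustration; no existing CSV metadata format is implied.

```python
def cell_direction(table_meta=None, column_meta=None, cell_meta=None, default="ltr"):
    """Return the most specific direction available, falling back to the default."""
    for meta in (cell_meta, column_meta, table_meta):
        if meta and meta.get("dir") in ("ltr", "rtl"):
            return meta["dir"]
    return default

# Hypothetical metadata: an rtl table with one ltr column for MAC addresses.
table = {"dir": "rtl"}
columns = {"name": {}, "mac_address": {"dir": "ltr"}}

print(cell_direction(table, columns["name"]))         # rtl (inherited from the table)
print(cell_direction(table, columns["mac_address"]))  # ltr (set on the column)
print(cell_direction())                               # ltr (no metadata at all)
```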

There may also be some linguistic variations for things like equations and ranges: for example, in Arabic a range of 'ten to twenty' is likely to be expressed visually as "20-10" (ie. base direction needs to be rtl), whereas in Hebrew you may see "10-20" (base direction ltr). (As always, the underlying sequence of codepoints should be the same.)

The bottom line seems to be that, unless you take drastic action and fill the file with directional control codes, CSV files containing bidi text or RTL text with numbers will not always be human readable.  They should, however, be machine readable, as long as the data is in logical, rather than visual, order.

The next questions to be answered are what direction information is needed for machines to correctly display results, eg. on a web page, in a spreadsheet, etc? And how is it best to provide that information?

tbc...

Implications for WebVTT data

If a WebVTT script is in a language such as Arabic, Hebrew, Divehi, Persian, Urdu, etc., then most, if not all, of the cue text will need to have a base direction of RTL. There needs to be a way to apply that automatically to all the cues in the file, with mechanisms to change the text in a particular cue or line where a different base direction is needed.

This base direction would only be applied to the content of the cue, not to any of the additional information in the file, such as time settings, ids, etc.  Unless you have a clever editor, it would also not be applied to the display of the cue text in the raw text file, either.  This may produce occasional difficulties for editing of bidirectional text in the source.

It appears to me that there are two possible approaches to automatically propagating base direction to the cue text: heuristically or declaratively. Heuristically means testing the text itself, and declaratively means providing markup or other metadata to indicate the preferred base direction.

The currently specc'ed approach relies on heuristics, though with a slight twist which will be explained below. What follows is my current understanding of how the WebVTT spec does this.

[1] heuristic (currently specced) approach

Establish the base direction of all of the lines in a given cue text by detecting the direction of the first strong character in the first line of the cue, ie. where there are multiple lines, assign the base direction for following lines based on that of the first line, ignoring the normal UBA approach of treating each line as a paragraph. (Line breaks in UBA constitute paragraph breaks and the base direction needs to be redetected using first-strong heuristics for each paragraph. This is also different to the way CSS deals with plain text, since it follows the UBA rules.)
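
The difference between the two behaviours can be sketched as follows. This reflects only my description above, not the WebVTT specification text itself; the function names are hypothetical, the first-strong scan ignores markup and isolates, and the 'ltr' default for lines without strong characters is an assumption.

```python
import unicodedata

def first_strong_direction(text, default="ltr"):
    """Return 'ltr' or 'rtl' based on the first strong character, else the default."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return default

def cue_line_directions(cue_text):
    """Approach described above: the first line decides for every line of the cue."""
    lines = cue_text.split("\n")
    return [first_strong_direction(lines[0])] * len(lines)

def uba_line_directions(cue_text):
    """Plain UBA behaviour: each line is a paragraph and is detected separately."""
    return [first_strong_direction(line) for line in cue_text.split("\n")]

cue = "שלום!\nHello!"
print(cue_line_directions(cue))  # ['rtl', 'rtl']
print(uba_line_directions(cue))  # ['rtl', 'ltr']
```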

Strategies for where this approach fails:

(a) a line that should be rtl, but starts with non-rtl characters (and vice versa), such as

00:38.500 --> 00:39.500 
  <v Maha>"C مدخل إلى!"

The C should appear to the right of the line, and the ! to the left, but you will get the reverse.

Solution: put &rlm; at the start of the line.

(b) the same applies to a line with no strong character (such as a telephone number) or a mixture of strong and non-strong characters (such as a MAC address) that has to be ordered in a particular way, eg.

00:38.500 --> 00:39.500
bahrain مصر kuwait

(c) multiline cues in multiple scripts/languages, eg.

00:22.000 --> 00:24.000
שלום!
Hello!

The exclamation mark in the first line will appear to the left, as expected, because the first-strong character in the cue is a Hebrew character (rtl). That of the second line will also appear to the left, which is NOT expected, since the second line's directionality is set by that of the first line.

Solution: &lrm; at the start of the second line will not have any effect, since the implementation doesn't test for first-strong characters in the second line. The only way I can think of to fix this is to set an embedding level using LRI ... PDI characters around 'Hello!'. Apart from the fact that it is a pain to do this for a large number of cases (such as in the video from which this example was taken), that the characters at either end of the text lend some fragility to the line, and that those Unicode control characters are not available on keyboards, RLI/LRI/FSI ... PDI are not currently supported by browsers, and it will be necessary to use RLE/LRE ... PDF instead (which actually have the same problems).

(d) the implementation must ignore anything that appears before the actual cue text, eg.

00:16.000 --> 00:18.000
<v Maha>السلام عليكم!

the characters <v Maha> must be ignored. Same goes if you have span or other markup at the start of the line.
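
A rough sketch of ignoring leading markup before looking for the first strong character is shown below. The regular expression is a crude stand-in for a real WebVTT cue text parser, and html.unescape is used only as an approximation of WebVTT's handling of escapes such as &rlm; and &lrm;. Both are assumptions made for illustration, as are the function names.

```python
import html
import re
import unicodedata

TAG = re.compile(r"<[^>]*>")  # crude match for cue markup such as <v Maha> or </v>

def visible_text(cue_line):
    """Drop tags, then decode character references such as &rlm; and &lrm;."""
    return html.unescape(TAG.sub("", cue_line))

def first_strong_direction(text, default="ltr"):
    """Return 'ltr' or 'rtl' based on the first strong character, else the default."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return default

line = "<v Maha>السلام عليكم!"
print(first_strong_direction(line))                # ltr (the 'v' in the tag is found first)
print(first_strong_direction(visible_text(line)))  # rtl
```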

(e) inline text may need a different base direction, eg.

00:37.000 --> 00:37.500
The title of my new book is "مدخل إلى C++". 
No wait...!

One way to do this would be to use RLI/LRI ... PDI control characters. These are generally not preferred because they are invisible and can be difficult to manage, and because they are not easy to input.

Since WebVTT supports span elements(?), this would offer the opportunity to apply directionality declaratively, which is often easier to work with.

[2] declarative approach

Give WebVTT the ability to say

STYLE
direction:rtl;

at the top of the file, then the default base direction for the content is established by that statement, and displayed text for all lines of cue text should get a base direction of rtl, regardless of their first-strong character, unless some lower level directive intervenes. The important thing to bear in mind is that this approach is incompatible with first-strong heuristics, and &rlm; or &lrm; at the start of the paragraph are of no consequence.

When you have paragraphs/lines that should not have a direction of rtl (like those mentioned above) you need a way to change their base direction using some kind of metadata annotation, on a per paragraph basis.

One could probably easily enough allow for some metadata declaration at the cue level to change the direction of content. However, it is actually necessary to be able to change the direction of content at the level of any individual paragraph/line, eg. it may be the second line in the cue that has to be set to ltr. Since lines in WebVTT cues are not bounded by markup, I'm not sure how one would do this using metadata/markup.

So what I'm saying is that, if we have the file-level declaration for direction, it has to come with some other mechanism for indicating the desired base direction for individual paragraphs.

One solution might be to use Unicode embedding control characters, but, as described above, many people would prefer to use a declarative approach because of the difficulties involved in using control characters. It may also be possible to surround each line in question with a span, but this is a rather cumbersome and inefficient approach.

First published 3 Dec 2010. This version 2016-07-28 15:59 GMT.  •  Copyright r12a@w3.org. Licence CC-By.