I argue that the explicit embedding codes specified in UAX 9 (LRE, RLE, LRO, RLO, PDF) are awkward, both for the programmer, who needs to know too much in order to emit them, and for the user, who gets a stream contaminated with too much semi-visual information, which reduces its usefulness as logical-order text. I also argue that the flat document model targeted by UAX 9 is insufficient for defining the bidirectional behavior of real-life documents.
I propose an alternative scheme for implicit bidi in (possibly) hierarchical documents, and an alternative set of codes with better properties, which together with RLM/LRM (against which I have nothing) allow representing embedded text in purely logical order and more implicitly.
First, let me make this clear: I do not propose that Unicode's current codes be immediately replaced. In fact, that isn't possible - at best they can be supplemented with the new codes, and the old ones can be deprecated if that proves wise. I do not even propose specific code points at this stage. First I want to argue that it would have been better if Unicode had used a scheme like this from the beginning, without considering the practical difficulty of adopting it now. So please don't come up with "no chance to introduce it now" arguments before discussing the technical merit of this.
There is one other way in which this proposal can be immediately useful. UAX 9 says:
The algorithm is designed so that the use of explicit codes can be equivalently represented by out-of-line information, such as stylesheet information. However, any alternative representation will be defined by reference to the behavior of the explicit codes in this algorithm.
It follows from my arguments that either a higher-level protocol would be as awkward as the existing codes, or its definition by reference to the existing codes would be awkward ;-). I claim that the proposed codes are much more suitable as a model for higher-level protocols. I don't see any practical value in conformance to this requirement of UAX 9.
The most important virtue of the Unicode bidirectional algorithm is its implicitness. This means that most programs can support bidi, at least in simple cases, with low effort. Low effort directly translates into wide support. It's easy to underestimate the number of programs that lose bidi support for each extra degree of effort required. Unfortunately, when it comes to text embedding, UAX 9 loses its implicitness - any application that wants to emit and/or process the explicit codes must gain a rather full understanding of the bidi situation.
The implicit bidi algorithm basically starts with flat information: a categorization of the text's characters. Assuming that there are only two levels of language nesting [1], this is sufficient to determine the structure, if you also know the base direction. This is a basic fact of bidirectional life: the same mix of RTL and LTR characters can be correctly interpreted with only one base direction. In most cases, assuming the wrong base direction gives results as nonsensical as compiling a C program in which all opening and closing braces were interchanged.
[1] Actually, it can sometimes implicitly determine a three-level structure, taking numbers into account. That doesn't change my basic assertions, so for simplicity I'll keep numbers out of my examples.
Thus, the base direction of a text should be considered part of that text and deserves to be encoded in it. The "first strong character" heuristic specified by UAX 9 allows this direction to be guessed correctly from the text itself in 90% of the cases - this is great! In the remaining cases it can be overridden by the relatively unobtrusive LRM/RLM marks.
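To make this concrete, here is a minimal sketch of the first-strong heuristic in Python (the function name and the fallback default are my own choices, not anything mandated by UAX 9):

    import unicodedata

    def first_strong_direction(text, default="L"):
        """Guess the base direction from the first strong character (UAX 9 rules P2/P3)."""
        for ch in text:
            bc = unicodedata.bidirectional(ch)
            if bc == "L":              # strong left-to-right
                return "L"
            if bc in ("R", "AL"):      # strong right-to-left (Hebrew, Arabic, ...)
                return "R"
        return default                 # no strong character at all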
Now consider two pieces of text being combined in some way. For example, consider the message template the file "%s" wasn't found., where a second text - the filename some name - is substituted for the %s. Now, the filename might be an arbitrary mix of RTL and LTR text, and the same can be said of the template, because it can be translated (e.g. via gettext). Each of them has an intrinsic base direction that can be encoded in the text itself, by the first strong character. That's very good - if it couldn't be encoded in the text, the programmer would have to associate an external base-direction property with each string - and who would bother to do that?
However, when such texts are naively combined, the result is still unsatisfactory. The whole resulting text is interpreted according to the base direction of the template - so if the filename has the other direction, you might see it garbled. The filename can also interfere with the ordering of the surrounding template. Moreover, the nesting of the filename as a continuous part of the string can get broken.
Let's look at a few examples of a template and a filename. In all examples in this file, capital letters represent strongly RTL characters while lowercase letters represent strongly LTR ones; bidi control codes are shown in angle brackets (e.g. <LRM>). The texts in all examples are the same, only the case (=directionality) varies; this makes the examples look quite artificial but simplifies comparison between them.
First, let's take a worst-case mix and naively check whether it works.
Template:
Logical:  THE file "%s" wasn't FOUND.
Base dir: R
Levels:   111122222222222222221111111
Visual:   .DNUOF file "%s" wasn't EHT
Filename:
Logical:  some NAME
Base dir: L
Levels:   000001111
Visual:   some EMAN
Combined message:
Logical:  THE file "some NAME" wasn't FOUND.
Base dir: R
Levels:   1111222222222211111112222221111111
Visual:   .DNUOF wasn't "EMAN file "some EHT
Here I managed to combine all three kinds of breakage I could think of into one example: the filename is interpreted with the wrong base direction, it disturbs the ordering of the surrounding template ("wasn't" jumped across it), and it is no longer displayed as one continuous run ("file" ended up between its two words).
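Incidentally, these examples can be reproduced with the third-party python-bidi package, whose upper_is_rtl flag matches the capitals-are-RTL convention used here (I'm assuming its classic get_display API):

    from bidi.algorithm import get_display   # pip install python-bidi

    logical = 'THE file "some NAME" wasn\'t FOUND.'
    # upper_is_rtl treats capital letters as strongly RTL, as in the examples;
    # base_dir forces the paragraph direction instead of the first-strong guess.
    print(get_display(logical, upper_is_rtl=True, base_dir='R'))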
[2] Unfortunately, the Unicode bidi algorithm is specified in terms of levels starting with zero for the outermost text and increasing as you enter embedded texts. This contradicts my instinct that "higher level" means the bigger thing and "lower level" means the inner details. To avoid confusion, I'll be using the terms "deeper" and "shallower".
After some experimentation, I learnt that one should not rely on implicit bidi except at the deepest two levels [2]. Indeed, most of these problems go away if the outer text - the template in our scenario - has only LTR or only RTL characters. This immediately kills the opportunity for the first two kinds of breakage. When one must mix two directions in the template, the base one can be insulated with explicit embedding marks, so that only the inner directionality interacts with the embedded text. For example:
Template:
Logical:  THE <LRE>file "%s" wasn't<PDF> FOUND.
Base dir: R
Levels:   1111     2222222222222222     1111111
Visual:   .DNUOF file "%s" wasn't EHT
Combined message:
Logical:  THE <LRE>file "some NAME" wasn't<PDF> FOUND.
Base dir: R
Levels:   1111     22222222222333322222222     1111111
Visual:   .DNUOF file "some EMAN" wasn't EHT
Now it happens to be perfect, because the direction of the explicit embedding matches the intrinsic direction of the filename. However, if the filename had a base RTL direction, the result would be wrong:
Filename:
Logical:  SOME name
Base dir: R
Levels:   111112222
Visual:   name EMOS
Combined message:
Logical:  THE <LRE>file "SOME name" wasn't<PDF> FOUND.
Base dir: R
Levels:   1111     22222233332222222222222     1111111
Visual:   .DNUOF file "EMOS name" wasn't EHT
Note how the filename's order was changed by the embedding. To correct this we would have to add another explicit embedding:
Combined message:
Logical:  THE <LRE>file "<RLE>SOME name<PDF>" wasn't<PDF> FOUND.
Base dir: R
Levels:   1111     222222     333334444     22222222     1111111
Visual:   .DNUOF file "name EMOS" wasn't EHT
So we have added embedding codes between levels 1-2 and 2-3; only the deepest two levels, 3-4, are separated implicitly. This is generally the right way to use the embedding codes - you can't "skip" levels, except for the deepest two.
Now I claim that the first addition of embedding codes was acceptable but the second is very problematic - it's the main argument I have against the embedding codes of UAX 9! Why, what's the difference between them?
In the first case, the embedding was done inside the template string, which was all created at the same time by a human. Adding embedding characters there isn't a problem. In the second case, whether the embedding should be done depends on the filename, which isn't known beforehand. So the decision whether to add these codes must be made at run time. Alternatively, you could always add embedding codes, but then you'd have to decide dynamically whether to use RLE or LRE. There is no escape from this: the explicit embedding codes of UAX 9 force a base direction upon the embedded text, but only that text itself knows what this direction should be.
So put yourself in the place of the good-hearted B. D. Hacker who wanted to make his program bidi-friendly. He now learns that he must scan any string he embeds to determine its base direction and duplicate this information in the embedding codes. Instead of a simple printf he needs some dynamic code-insertion function... Who will bother with this???
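For illustration, here is roughly what that dynamic insertion function must look like - a minimal sketch, where embed() and the sample strings are mine and real Hebrew letters stand in for the capital-letter convention of the examples:

    import unicodedata

    LRE, RLE, PDF = "\u202A", "\u202B", "\u202C"

    def embed(text):
        """Wrap text in LRE..PDF or RLE..PDF according to its first strong character."""
        for ch in text:
            bc = unicodedata.bidirectional(ch)
            if bc == "L":
                return LRE + text + PDF
            if bc in ("R", "AL"):
                return RLE + text + PDF
        return text                     # no strong characters: leave it alone

    template = 'the file "%s" wasn\'t found.'
    filename = "\u05E9\u05DD name"      # an RTL-then-LTR filename
    print(template % embed(filename))   # instead of a plain printf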
Before continuing, I must admit there exists one way to use the explicit embedding codes without dynamically scanning the embedded strings for their base direction. It's very simple: just wrap all strings in either LRE..PDF or RLE..PDF. Then you can embed strings simply by inserting them into another string (which in turn starts/ends with explicit codes, to allow further embedding). This actually makes more sense than any other scheme, since, as we said, (almost) any text has an intrinsic base direction.
The trouble with this is that the price is too high. It requires all texts out there to be wrapped in explicit codes, even in simple cases where the text is only English or only Hebrew. So that's what it's all about - the difference between N and N-1 codes for an N-levels-deep embedding - because the latter means no codes at all in the simple cases. Assigning bidi categories to characters lets us handle the deepest couple of levels implicitly, which allows 90% of the cases to just work and the others to break more gracefully.
One way to perhaps improve this would be to omit the LRE..PDF from LTR texts (only RTL texts would be wrapped in RLE..PDF). To allow unmarked LTR texts to be embedded, each embedding place would always be wrapped in LRE..PDF; this would still work because if the embedded text were RTL, you would get one embedding directly inside another, with correct final results. Still, this leaves the same unacceptable burden on RTL text users.
Note: Yudit advocates a similar approach (see Yudit.bidi.txt).
UAX 9 assumes plain text or equivalent uses. It assumes that bidi is confined to paragraphs; within a paragraph, it treats the text as an unstructured sequence of characters and runs the bidi algorithm over the whole paragraph to resolve bidi levels. The paragraph is then split into lines and each line is reordered according to the resolved levels.
This fails to account for the fact that real-life documents are hierarchical. It is more or less enough if all you care about is the order of the text. However, there is another part to bidi: layout. RTL documents (or document parts) use a mirrored layout: paragraphs are usually right-aligned, bullets are on the right, table columns go from right to left, etc.
The minimal thing almost everybody does, even though UAX 9 says nothing about it, is to align a paragraph in plain text to the right or left according to its base direction. For more complex documents, it is necessary to know the direction of document structures bigger than paragraphs - lists, tables, etc. We also always want to know the direction of elements contained in paragraphs - usually for the sake of reordering, sometimes also for layout. In short, we'd like to derive a direction property for each element in the document hierarchy.
For example, let's look at the CSS2 spec's bidi behaviour. It defines a direction property, which determines the basic layout direction and maps the logical directions (start, end) to physical directions (left, right), allowing logical direction specifications in CSS2 rules.
It also defines a unicode-bidi property which, together with direction, can emulate the effect of RLE/LRE/RLO/LRO..PDF embeddings:
The final order of characters in each block-level element is the same as if the bidi control codes had been added as described above, markup had been stripped, and the resulting character sequence had been passed to an implementation of the Unicode bidirectional algorithm for plain text that produced the same line-breaks as the styled text. In this process, non-textual entities such as images are treated as neutral characters, unless their unicode-bidi property has a value other than normal, in which case they are treated as strong characters in the direction specified for the element.
Definitely a tribute to the requirement to define alternative representations by reference to the explicit codes of UAX 9. Awkward at best! Note that this does not suggest running the Unicode bidi algorithm on the original HTML (minus markup), because the placement of the imaginary bidi codes is only known after CSS has been applied to the document, which must already be in some tree form for that. The possibility of DOM manipulation completely kills this. So this description seems to suggest the most convoluted processing order I've seen in a long time: parse the document into a tree; apply CSS to the tree; serialize the styled tree back into a flat character sequence, inserting the imaginary bidi codes and stripping the markup; run the plain-text bidi algorithm over that sequence (reproducing the styled text's line breaks); and finally map the resolved order back onto the tree for display.
Hopefully, nobody would be crazy enough to actually do this. What can be done is to apply the Unicode bidi algorithm directly to the element tree as if it were serial. This is still awkward: it requires re-implementing the Unicode bidi algorithm to work on an elaborate element tree pretending to be a sequence of characters.
The artificial definition of tree processing through flat paragraph processing also means that the implications of the algorithm for the resulting display are hard to understand. The only way to keep your sanity is to think directly of the effect of the bidi codes on the tree - which is quite comprehensible - but how can you be sure the prescribed process doesn't differ from your understanding in subtle ways?
So far I've talked of LRE/RLE; let's now talk of the directional overrides. What are they useful for?
The first obvious use is what their name says: to override the direction of characters when you expect that the receiving software might not know their correct bidi category (because they are recently added, or simply from the Private Use Area). There is a flaw in the behavior of the Unicode algorithm with respect to this usage: characters inside directional overrides are not considered strongly directional when determining the paragraph direction. This suggests that UAX 9 doesn't intend LRO/RLO to really override the directionality of characters but only to serve as an awkward means of getting desired output. If you account for this problem, this use seems completely valid (though rare) and I don't propose any replacement for it. Interestingly, UAX 9 never mentions it.
Suppressing downstream bidi processing. Imagine an application that does its own bidi processing; it wants to make sure the visual order it produces will be respected and not bidi-processed again. Directional overrides provide a great way to achieve this, because they turn every character into a strong character of a known direction: a later run of the bidi algorithm finds no neutrals and no implicit runs left to reorder, so the stream is displayed exactly in the order it was emitted.
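As a sketch (the helper name is hypothetical): an application that has already produced visually ordered text can protect it like this:

    LRO, PDF = "\u202D", "\u202C"

    def protect_visual(visual_text):
        """Force already-reordered text to display strictly left to right,
        so a downstream bidi pass has nothing left to reorder."""
        return LRO + visual_text + PDF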
This use is also completely valid and I don't propose any replacement for it. Interestingly, UAX 9 never mentions it either...
"Part numbers": UAX 9 suggests:
The right-to-left override, for example, can be used to force a part number made of mixed English, digits and Hebrew letters to be written from right to left.
Let's think about this a bit more. Using RLO..PDF will obviously work for part numbers where each component is one character long. When a component is longer than one character, it will still be displayed strictly right-to-left, which is hardly what you'd want for the English parts or the numbers! Other examples of a similar nature are file paths or domain names containing mixed RTL and LTR parts; it'd be nice to get all parts in a single progressive order, restricting bidi to each part individually (although these things are usually not marked for bidi in any way and there is little we can do about it, so perhaps this would only add to the confusion).
Directional overrides are the wrong tool for this task: they simply don't describe the needed embedding structure! They force the whole "part number" a level deeper and directionally overridden. What you really want is to embed each part separately and leave the punctuation between parts at its present level, somehow holding it back from the promotion to the level of the surrounding directional runs that is normally applied to neutrals.
So you must put each part inside LRE/RLE..PDF to prevent its directionality from being overridden:
Logical:  section <RLE>AB<PDF>'<LRE>12<PDF>'<RLE>CD<PDF>
Base dir: L
Levels:   00000000     11     0     22     0     11
Visual:   section BA'12'DC
Note that the outer codes can in this case be non-overriding (LRE/RLE), with precisely the same results!
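Here is a sketch of how a program might emit such a part number (the helper names are hypothetical; each component is wrapped according to its own first strong character, with the separators left outside at the surrounding level):

    import unicodedata

    LRE, RLE, PDF = "\u202A", "\u202B", "\u202C"

    def embed_part(part):
        """Wrap one component according to its first strong character."""
        for ch in part:
            bc = unicodedata.bidirectional(ch)
            if bc == "L":
                return LRE + part + PDF
            if bc in ("R", "AL"):
                return RLE + part + PDF
        return LRE + part + PDF         # all-neutral parts (e.g. digits): default to LTR

    def format_part_number(parts, sep="'"):
        # Separators stay outside the embeddings, at the level of the surrounding text.
        return sep.join(embed_part(p) for p in parts)

    print("section " + format_part_number(["\u05D0\u05D1", "12", "\u05D2\u05D3"]))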
The basic idea is that direction should be derived implicitly from the text itself, at every level of the document, rather than imposed on it from outside.
My proposal consists of two parts: implicit embedding codes, which determine the base direction of the embedded text from the text itself, and a hierarchical processing model. The behaviour of the proposed codes is defined in terms of this model, which makes the embedding behaviour easier to define and understand; the two parts could be separated, but I make no such attempt.
Since complex documents already take some tree form by the time we want to resolve their bidi directions, and since we want to derive the directions of all nodes of the tree, I propose a hierarchical processing model. Let's assume a document is a tree of nested elements, where every node knows the logical order of its children and [some of] the leaves are characters.
Now let's generalize the "first strong character" heuristic. If it can implicitly determine the base direction of a paragraph in 90% of the cases, it should be able to determine the base direction of, e.g., a table. The bidi categories propagate upward through the tree: a character's category is defined by Unicode; the category of an element is the category of its first sub-element with strong directionality (for now, let's assume there is one).
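A minimal sketch of this propagation (Element, Text and the function names are hypothetical, chosen only to illustrate the model):

    import unicodedata

    class Text:
        def __init__(self, text):
            self.text = text

        def category(self):
            """First strong bidi category of the characters, or None."""
            for ch in self.text:
                bc = unicodedata.bidirectional(ch)
                if bc == "L":
                    return "L"
                if bc in ("R", "AL"):
                    return "R"
            return None

    class Element:
        def __init__(self, *children):
            self.children = children    # in logical order
            self.direction = None

        def category(self):
            """The same first-strong heuristic, one level up: the category
            of the first strongly directional child."""
            for child in self.children:
                cat = child.category()
                if cat is not None:
                    return cat
            return None

    def resolve_directions(node, inherited="L"):
        """Walk the tree top-down, giving every element an implicit direction."""
        if isinstance(node, Element):
            node.direction = node.category() or inherited
            for child in node.children:
                resolve_directions(child, node.direction)

    # A table whose first strong character is Hebrew resolves as RTL:
    table = Element(Element(Text("\u05D0\u05D1 cell")), Element(Text("more")))
    resolve_directions(table)
    print(table.direction)              # -> 'R'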
The Unicode bidi algorithm can then be run on each element alone, treating each child element as a single character of its resolved category.
The UAX 9 model (the document is a sequence of paragraphs and a paragraph is a sequence of characters) is a subset of this model: consider each paragraph a separate document and its children plain characters, and you get exactly the UAX 9 behavior.
UAX 9 already hints at it, saying:
For the purpose of the bidirectional algorithm, inline objects (such as graphics) are treated as if they are an OBJECT REPLACEMENT CHARACTER (U+FFFC).