
Substituting malformed UTF-8 sequences in a decoder



Florian Weimer wrote on 2000-07-22 18:23 UTC:
> "H. Peter Anvin" <hpa@zytor.com> writes:
> > The alternate spelling
> > 
> > 	11000001 10001011
> > 
> > ... is not the character K <U+004B> but INVALID SEQUENCE.  One
> > possible thing to do in a decoder is to emit U+FFFD SUBSTITUTION
> > CHARACTER on encountering illegal sequences.
> 
> Is there any consensus whether to use one or two U+FFFD characters in
> such situations?

I believe I have finally found the clear RightThingToDo[TM] variant
[see option D) below], but it is neither of your options and probably
also not yet widely known.

It might be worth discussing first in more detail the reasons for the
various choices before presenting the solution:

A) Emit a single U+FFFD per malformed sequence

This goes essentially back to the interpretation of some holy
scripture, namely the words of the Lord in the book ISO 10646-1:1993(E).
Section R.7 on "Incorrect sequences of octets: Interpretation by
receiving devices" says what a malformed UTF-8 sequence is
(unfortunately it does not yet define overlong sequences as malformed!)
and says that a receiving device "shall interpret that malformed
sequence in the same way that it interprets a character that is outside
the adopted subset that has been identified for the device (see 2.3c)".
Here, a malformed sequence is to be treated like a single unknown
character, irrespective of how many bytes the sequence contains.
Section 2.3c just says the obvious thing:

  "[...] Any corresponding characters that are not within the adopted
   subset shall be indicated to the user in a way which need not allow
   them to be distinguished from each other.

   NOTES

   1  An indication to the user may consist of making available the same
   character to represent all characters not in the adopted subset, or
   providing a distinctive audible or visible signal when appropriate to
   the type of user.

   2  See also annex H for receiving devices with retransmission capability."

References:

  http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html

This is the approach that I have chosen for the UTF-8 decoder in xterm.
Xterm has a definition of what a malformed sequence is that is closely
aligned with ISO 10646-1 section R.7, except that xterm also treats
overlong sequences as malformed, as I hope a future version of the
standard will as well. The UTF-8 decoder in xterm does not keep track
of how many bytes it has already read for a character, so implementing
semantics B) would have required me to add an additional variable to
the decoder data structure. This suggests that semantics A) may also be
slightly simpler to implement if you write your own UTF-8 decoder
(though this is not entirely clear).

The final column indicator bar in

  http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

should line up nicely if this test file, which is full of malformed
UTF-8 sequences, is sent to xterm or another monospaced output device
that follows semantics A).
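
For illustration, a minimal decoder implementing semantics A) might
look roughly as follows in C (this is not xterm's actual code, and all
identifiers are mine). A small state machine remembers how many
continuation bytes are still expected and the smallest value a legal
sequence of that length may encode, and emits exactly one U+FFFD per
incomplete or overlong sequence; the 5- and 6-byte forms and other
details are left out for brevity:

  #include <stdio.h>

  #define REPLACEMENT 0xFFFDu

  static void emit(unsigned int c) { printf("U+%04X\n", c); }

  void decode_a(const unsigned char *s, size_t n)
  {
      unsigned int cp = 0, min = 0;
      int rest = 0;                      /* continuation bytes still expected */

      for (size_t i = 0; i < n; i++) {
          unsigned char b = s[i];
          if (rest) {
              if ((b & 0xC0) == 0x80) {  /* expected continuation byte */
                  cp = (cp << 6) | (b & 0x3F);
                  if (--rest == 0)
                      emit(cp < min ? REPLACEMENT : cp);  /* overlong -> U+FFFD */
              } else {                   /* sequence ends prematurely */
                  rest = 0;
                  emit(REPLACEMENT);     /* one U+FFFD for the whole fragment */
                  i--;                   /* re-examine this byte as a new start */
              }
          } else if (b < 0x80) { emit(b); }
          else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; rest = 1; min = 0x80; }
          else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; rest = 2; min = 0x800; }
          else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; rest = 3; min = 0x10000; }
          else emit(REPLACEMENT);        /* lone continuation or invalid byte */
      }
      if (rest) emit(REPLACEMENT);       /* input ended inside a sequence */
  }

  int main(void)
  {
      /* 'K', the overlong "K" 0xC1 0x8B, and a truncated sequence 0xE9 */
      static const unsigned char demo[] = { 'K', 0xC1, 0x8B, 0xE9 };
      decode_a(demo, sizeof demo);       /* prints U+004B, U+FFFD, U+FFFD */
      return 0;
  }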

B) Emit a U+FFFD for every byte in a malformed UTF-8 sequence

If you use an existing UTF-8 decoder, for example the one provided by
your C runtime environment in the form of the mbtowc() function (ISO C,
section 7.20.7.2), then the interface to this UTF-8 decoder might not
provide you with any way of finding out where a malformed sequence ends
(especially if it is followed immediately by another malformed
sequence). All you get back from mbtowc() and similar functions is the
information that the byte sequence you want to have converted starts
with a malformed sequence. A very simple way of using this information
is to emit a U+FFFD character and then call mbtowc() again one byte
later. This results in a U+FFFD being emitted for every byte in a
malformed UTF-8 sequence.
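
As a rough sketch, such a loop could look like this in C (the locale
name used below is only an assumption; any UTF-8 locale installed on
the system would do):

  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>

  void decode_b(const char *s, size_t n)
  {
      wchar_t wc;

      mbtowc(NULL, NULL, 0);                   /* reset conversion state */
      while (n > 0) {
          int len = mbtowc(&wc, s, n);
          if (len < 0) {                       /* malformed: one U+FFFD per byte */
              printf("U+FFFD\n");
              mbtowc(NULL, NULL, 0);
              len = 1;                         /* retry one byte later */
          } else {
              printf("U+%04lX\n", (unsigned long) wc);
              if (len == 0)                    /* mbtowc returns 0 for L'\0' */
                  len = 1;
          }
          s += len;
          n -= len;
      }
  }

  int main(void)
  {
      if (!setlocale(LC_CTYPE, "en_US.UTF-8")) /* assumed locale name */
          return 1;
      decode_b("K\xC1\x8B", 3);                /* 'K' + the overlong "K" */
      return 0;                                /* prints U+004B, U+FFFD, U+FFFD */
  }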

I expect that a significant number of applications will delegate their
UTF-8 encoding/decoding to C's multibyte functions (which then
automatically add support for various legacy multibyte encodings as
well), and will therefore most likely adopt these semantics.

In order to allow semantics A) to be used with the C multibyte
functions, a locale would have to be added that never signals a
malformed sequence by returning -1 in mbtowc(), but that decodes it
properly into U+FFFD instead. If no -1 is returned for a malformed
sequence, the return value will be the length of the malformed sequence,
such that it can be skipped like a regular sequence. Note that malformed
UTF-8 sequences cannot be longer than the longest normal UTF-8 sequence.

Glibc 2.2 does not at the moment provide such a UTF-8 locale with
in-band signalling of malformed sequences. Note also that in such a
locale, a correctly UTF-8 encoded U+FFFD could not be distinguished
from a malformed sequence.

References:

  http://www.cl.cam.ac.uk/~mgk25/volatile/ISO-C-FDIS.1999-04.txt

C) Emit a U+FFFD only for the first in a run of consecutive malformed
   UTF-8 sequences

A user of mbtowc() could also issue a single U+FFFD when it encounters
the first malformed UTF-8 sequence and then suppress all further
emissions of the REPLACEMENT CHARACTER until a valid character or EOF
is encountered. This is perhaps slightly more complicated to implement
and leaves the receiver of the decoded character stream with no
indication of how many bytes were skipped during the decoding process
as part of malformed sequences. In other words, someone could
(maliciously?) hide a few gigabytes of data in the form of malformed
UTF-8 sequences behind a single displayed REPLACEMENT CHARACTER, which
has potentially interesting security implications: the UTF-8/UTF-16
length ratio loses its upper limit. With my security hat on, I would
therefore advise strongly against this approach.
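
For completeness, here is a sketch of this variant, under the same
assumptions as the previous sketch. It differs only in an in_error
flag that collapses each run of consecutive malformed sequences into a
single U+FFFD (and thereby discards the byte count):

  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>

  void decode_c(const char *s, size_t n)
  {
      wchar_t wc;
      int in_error = 0;                        /* inside a run of bad bytes? */

      mbtowc(NULL, NULL, 0);                   /* reset conversion state */
      while (n > 0) {
          int len = mbtowc(&wc, s, n);
          if (len < 0) {
              if (!in_error)                   /* report only the first one */
                  printf("U+FFFD\n");
              in_error = 1;
              mbtowc(NULL, NULL, 0);
              len = 1;
          } else {
              in_error = 0;
              printf("U+%04lX\n", (unsigned long) wc);
              if (len == 0)
                  len = 1;
          }
          s += len;
          n -= len;
      }
  }

  int main(void)
  {
      if (!setlocale(LC_CTYPE, "en_US.UTF-8")) /* assumed locale name */
          return 1;
      decode_c("K\xC1\x8B\x80", 4);            /* three bad bytes in a row */
      return 0;                                /* prints only U+004B, U+FFFD */
  }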

D) Emit a malformed UTF-16 sequence for every byte in a malformed
   UTF-8 sequence

All the previous options for converting malformed UTF-8 sequences to
UTF-16 destroy information. This can be highly undesirable in
applications such as text file editors, where guaranteed binary
transparency is a desirable feature. (For example, I frequently edit
executable code or graphics files with the Emacs text editor, and I
hate the idea that my editor might automatically make U+FFFD
substitutions at locations that I haven't even edited when I save the
file again.)

I therefore suggested the following approach on the unicode@unicode.org
mailing list on 1999-11-02. Instead of using U+FFFD, simply encode
malformed UTF-8 sequences as malformed UTF-16 sequences. Malformed
UTF-8 sequences consist exclusively of the bytes 0x80 - 0xff, and each
of these bytes can be represented using a 16-bit value from the UTF-16
low-half surrogate zone U+DC80 to U+DCFF. Thus, the overlong "K"
(U+004B) 0xc1 0x8b from the above example would be represented in
UTF-16 as U+DCC1 U+DC8B. If we simply make sure that every UTF-8
encoded surrogate character is also treated as a malformed sequence,
then there is no way that a single high-half surrogate could precede
the encoded malformed sequence and cause a valid UTF-16 sequence to
emerge.

This way 100% binary transparent UTF-8 -> UTF-16 -> UTF-8 round-trip
compatibility can be achieved quite easily.
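
As a rough sketch of the byte <-> code unit mapping (the function
names below are mine and purely illustrative), using the overlong "K"
from above as the test case:

  #include <stdio.h>

  /* byte of a malformed UTF-8 sequence -> lone UTF-16 low surrogate */
  unsigned short escape_byte(unsigned char b)
  {
      return (unsigned short)(0xDC00 + b);      /* 0x80..0xFF -> U+DC80..U+DCFF */
  }

  /* lone UTF-16 low surrogate in U+DC80..U+DCFF -> original byte */
  int unescape_unit(unsigned short u, unsigned char *b)
  {
      if (u >= 0xDC80 && u <= 0xDCFF) {
          *b = (unsigned char)(u - 0xDC00);
          return 1;                             /* was an escaped malformed byte */
      }
      return 0;                                 /* an ordinary UTF-16 code unit */
  }

  int main(void)
  {
      /* the overlong "K" from the example above: 0xC1 0x8B */
      unsigned char malformed[2] = { 0xC1, 0x8B };
      unsigned short utf16[2];
      unsigned char back[2];

      for (int i = 0; i < 2; i++)
          utf16[i] = escape_byte(malformed[i]); /* -> U+DCC1 U+DC8B */
      for (int i = 0; i < 2; i++)
          unescape_unit(utf16[i], &back[i]);    /* -> 0xC1 0x8B again */

      printf("U+%04X U+%04X -> 0x%02X 0x%02X\n",
             (unsigned) utf16[0], (unsigned) utf16[1],
             (unsigned) back[0], (unsigned) back[1]);
      return 0;
  }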

On an output device, a lone low-half surrogate character should be
treated just like a character outside the adopted subset of
representable characters; that is, for the end user the display would
look exactly as with semantics B), i.e. one symbol per byte of a
malformed sequence. However, in contrast to semantics B), no
information is thrown away, and a cut&paste in an editor or terminal
emulator is guaranteed to reconstruct the original byte sequence. This
should greatly reduce the incidence of accidental corruption of binary
data by UTF-8 -> UTF-16 -> UTF-8 conversion round trips.


Summary:

I personally think now that a lossless conversion of malformed UTF-8
sequences as outlined in option D) is ultimately the RightThingToDo[TM]
and technically superior to all the other discussed alternatives. There
are potentially a couple of ways of doing this, and in the interest of
interoperability, it would be useful to have a quotable, more formal
standard or recommendation in this area. Maybe it is time to write a
Unicode technical report on this issue.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>
