Index of /~est/libutf8b

      Name                         Last modified       Size  Description

      Parent Directory             08-Apr-2008 22:55   -
      old/                         03-Nov-2006 03:35   -
      libutf8b-0.1rc3.tar.gz.md5   03-Nov-2006 08:20   1k
      libutf8b-0.1rc3.tar.gz.sha   03-Nov-2006 08:20   1k
      libutf8b-0.1rc3.tar.gz       03-Nov-2006 03:06   569k

This directory contains a C implementation of a UTF-8b codec and a
Python codec based on it.


UTF-8B

UTF-8b is a mapping from byte streams to Unicode code point streams
that provides exceptionally clean handling of garbage bytes (i.e.,
bytes that are not part of a valid UTF-8 encoding) in the input
stream.  Each such byte is mapped to one of 256 different, guaranteed
undefined, Unicode code points.  This leads to:

  - a very clean error interface: the decoder client can just check for
      garbage codepoints being returned

  - decode/encode round-tripping on arbitrary byte streams (encoding
      the decoded output reproduces the original bytes)

  - no loss of input information

  - stronger assurances of decoder correctness since the decoder has to
      be defined over arbitrary data

See kuhn-utf-8b.html for more background on UTF-8b.
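
The properties listed above can be illustrated without libutf8b itself.
The sketch below uses Python 3's built-in "surrogateescape" error
handler (PEP 383), which follows the same idea by mapping each
undecodable byte to a lone surrogate in the range U+DC80..U+DCFF.  It
is not this library's API, and libutf8b's own code point assignment may
differ; the helper name is_garbage is made up for the example.

```python
# Illustration only: Python 3's built-in "surrogateescape" handler,
# not libutf8b's codec.  The escape range below is Python's, and is an
# assumption as far as libutf8b is concerned.
raw = b"caf\xc3\xa9 \xff\xfe ok"   # valid UTF-8 plus two garbage bytes

# Decoding never fails: garbage bytes become escape code points.
text = raw.decode("utf-8", errors="surrogateescape")


def is_garbage(ch):
    """True if ch stands in for a byte that was not valid UTF-8."""
    return 0xDC80 <= ord(ch) <= 0xDCFF


# Error interface: the client simply scans for escape code points.
garbage_positions = [i for i, ch in enumerate(text) if is_garbage(ch)]

# Round trip: encoding the decoded text gives back the original bytes.
round_tripped = text.encode("utf-8", errors="surrogateescape")

print(garbage_positions)      # [5, 6]
print(round_tripped == raw)   # True
```

The client never handles a decoding exception here; it only inspects
the decoded text, which is the error interface described above.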


LIBUTF8B

In addition to bulk codec functions, the library provides an
incremental decoder that can be fed input bytes one at a time.
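
As a rough picture of what byte-at-a-time decoding looks like, the
sketch below feeds single bytes to Python's standard incremental UTF-8
decoder with the "surrogateescape" handler.  This is a stand-in for
illustration under that assumption, not libutf8b's C interface; see the
included `decode' program for the real thing.

```python
import codecs

# Illustration only: Python's standard incremental decoder, not the
# libutf8b C API.
decoder = codecs.getincrementaldecoder("utf-8")(errors="surrogateescape")

# Euro sign (3 bytes), one stray continuation byte, then "A".
raw = b"\xe2\x82\xac\x80A"

pieces = []
for i in range(len(raw)):
    # Feed exactly one byte; the decoder buffers incomplete sequences
    # and returns "" until a full code point is available.
    pieces.append(decoder.decode(raw[i:i + 1]))

# Flush any bytes still buffered at end of input.
pieces.append(decoder.decode(b"", final=True))

decoded = "".join(pieces)
print([len(p) for p in pieces])
print(decoded.encode("utf-8", errors="surrogateescape") == raw)
```

The first two calls return empty strings because the euro sign is not
complete until its third byte arrives.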

This source code is released under a BSD license.  See the file
COPYING for more details.

The included `decode' program provides an example use of the
incremental decoder.

INSTALL

To install the C library:

  ./configure && make install

To install the Python codec:

  sudo python setup.py install


TEST

To build and run the tests:

  ./configure && make test

The last line of output of both scripts should be "all passed".