Index of /~est/libutf8b
Name                          Last modified        Size   Description
Parent Directory              08-Apr-2008 22:55    -
old/                          03-Nov-2006 03:35    -
libutf8b-0.1rc3.tar.gz.sha    03-Nov-2006 08:20    1k
libutf8b-0.1rc3.tar.gz.md5    03-Nov-2006 08:20    1k
libutf8b-0.1rc3.tar.gz        03-Nov-2006 03:06    569k
This directory contains a C implementation of a UTF-8b codec and a
Python codec based on it.
UTF-8B
UTF-8b is a mapping from byte streams to Unicode codepoint streams
that provides exceptionally clean handling of garbage bytes (i.e.,
bytes that are not part of a valid UTF-8 encoding) in the input
stream. They are mapped to 256 different, guaranteed undefined,
Unicode codepoints. This leads to:
- a very clean error interface: the decoder client can just check for
garbage codepoints being returned
- lossless decode/encode round trips on arbitrary byte streams:
encoding the decoded output reproduces the original bytes
- no loss of input information
- stronger assurances of decoder correctness since the decoder has to
be defined over arbitrary data
See kuhn-utf-8b.html for more background on UTF-8b.
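The properties listed above can be tried out without building the
library. The sketch below does not use libutf8b or its Python codec.
It assumes Python 3's built-in "surrogateescape" error handler as a
stand-in, which follows the same idea but maps each garbage byte to one
of the 128 lone surrogates U+DC80..U+DCFF, so the exact codepoints may
differ from the ones libutf8b uses. The helper names decode_utf8b,
encode_utf8b and garbage_positions are made up for this illustration.

```python
# Stand-in for a UTF-8b codec: Python 3's built-in "surrogateescape"
# error handler. This is NOT the libutf8b API; the function names
# below are hypothetical and exist only for this illustration.

def decode_utf8b(data: bytes) -> str:
    """Decode arbitrary bytes; non-UTF-8 bytes become escape codepoints."""
    return data.decode("utf-8", errors="surrogateescape")

def encode_utf8b(text: str) -> bytes:
    """Inverse of decode_utf8b: escape codepoints turn back into raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")

def garbage_positions(text: str) -> list:
    """Indices of codepoints that stand for garbage input bytes.

    Assumes the stand-in's escape range U+DC80..U+DCFF.
    """
    return [i for i, ch in enumerate(text) if 0xDC80 <= ord(ch) <= 0xDCFF]

raw = b"ok \xff\xfe caf\xc3\xa9"      # two garbage bytes, then valid UTF-8
text = decode_utf8b(raw)

print(garbage_positions(text))        # error check: where was the garbage?
print(encode_utf8b(text) == raw)      # round trip reproduces the input
```

The client never handles a decoding exception here: it inspects the
returned codepoints, and encoding the result gives back the original
bytes.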
LIBUTF8B
In addition to bulk codec functions, the library provides an
incremental decoder that can be fed input bytes one at a time.
This source code is released under a BSD license. See the file
COPYING for more details.
The included `decode' program provides an example use of the
incremental decoder.
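For readers without the C sources at hand, the byte-at-a-time feeding
pattern can be sketched in Python. This is not the libutf8b incremental
decoder or its `decode' program. It assumes the standard library's
incremental UTF-8 decoder with the "surrogateescape" error handler as a
stand-in, and decode_incrementally is a hypothetical helper name.

```python
import codecs

# Stand-in for an incremental UTF-8b decoder, built from the Python
# standard library. This is NOT the libutf8b API; decode_incrementally
# is a hypothetical name used only for this illustration.

def decode_incrementally(data: bytes) -> str:
    """Feed the decoder one byte at a time and collect its output."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="surrogateescape")
    pieces = []
    for byte in data:
        # The decoder buffers incomplete multi-byte sequences and
        # returns an empty string until a codepoint is complete.
        pieces.append(decoder.decode(bytes([byte])))
    # Signal end of input so a truncated trailing sequence is flushed
    # as escape codepoints rather than silently dropped.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)

# Valid text, a garbage byte, and a truncated three-byte sequence.
sample = b"a\xe2\x82\xacb\xff\xe2\x82"
result = decode_incrementally(sample)

# Incremental and bulk decoding agree, and the round trip is lossless.
print(result == sample.decode("utf-8", errors="surrogateescape"))
print(result.encode("utf-8", errors="surrogateescape") == sample)
```

The final call with final=True matters: without it, the two trailing
bytes of the truncated sequence would stay in the decoder's buffer.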
INSTALL
To install the C library:
./configure; make install
To install the Python codec:
sudo python setup.py install
TEST
To build and run the tests:
./configure; make test
The last line of output of both scripts should be "all passed".