libutf8 - a Unicode/UTF-8 locale plugin

This library provides UTF-8 locale support, for use on systems which don't have UTF-8 locales, or whose UTF-8 locales are unreasonably slow.

It provides support for

All this according to the ISO/ANSI C specifications, and with support for old 8-bit locales and Unicode UTF-8 locales.

libutf8 is for you if your application supports 8-bit and multibyte locales such as Chinese or Japanese, and you wish to add UTF-8 locale support but your system lacks the corresponding support.

libutf8 is also for you if your application supports only 8-bit locales and you wish to add UTF-8 locale support. Because libutf8 implements an ISO/ANSI C compatible set of types and functions, the support for libutf8 you add will also automatically work (without libutf8) with other multibyte locales, to the extent that the system supports them.
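As a rough illustration of the ISO C multibyte interface referred to above, the following sketch counts the characters of a string in the current locale. It uses only the standard <wchar.h> declarations; the header an application includes when building against libutf8 is not shown here, and count_characters is a hypothetical helper name, not part of libutf8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not part of libutf8): count the characters, not the
   bytes, of a NUL-terminated multibyte string, interpreted in the current
   LC_CTYPE locale.  Returns (size_t)-1 if the string contains an invalid
   sequence or ends in the middle of a multibyte character. */
size_t count_characters(const char *s)
{
    mbstate_t state;
    size_t remaining;
    size_t count = 0;

    assert(s != NULL);
    memset(&state, 0, sizeof state);    /* start in the initial state */
    remaining = strlen(s);

    while (remaining > 0) {
        /* A null first argument asks mbrtowc to parse without storing. */
        size_t n = mbrtowc(NULL, s, remaining, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete input */
        if (n == 0)
            n = 1;                      /* defensive: never loop forever */
        s += n;
        remaining -= n;
        count++;
    }
    return count;
}
```

Because the function never looks at the encoding itself, the same code works in an 8-bit locale, in a UTF-8 locale, and in any other multibyte locale the underlying implementation supports.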

libutf8 concentrates on 8-bit and UTF-8 encodings and therefore does not suffer from the complexity needed to support other multibyte locales.

To use this library, as a C/C++ package developer:

Installation

As usual for GNU packages:
$ ./configure --prefix=/usr/local
$ make
$ make install

Special configuration options:

--with-traditional-mbstowcs
The traditional semantics of the mbrtowc, mbrlen, mbsrtowcs and mbsnrtowcs functions in ISO C 89 Amendment 1 is to process only complete multibyte characters. When an incomplete multibyte character is encountered, processing stops before this character. An mbstate_t contains shift state only (i.e., for 8-bit and UTF-8 encodings, no information at all).

The new ISO C 99 semantics is to process all available bytes of an incomplete multibyte character, and store in an mbstate_t the parse state of an incomplete multibyte character, as far as it has been read.

libutf8 by default implements the new semantics. --with-traditional-mbstowcs enables the traditional one instead.
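The difference between the two semantics can be sketched with a small stand-alone UTF-8 decoder. This is an illustration only, not libutf8's implementation: utf8_state, utf8_step and the two UTF8_* constants are hypothetical names, and the decoder omits the checks for overlong forms and surrogates that a real implementation would need. It follows the new semantics: every available byte is consumed, and the partially read character is kept in the state object.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parse state: what an mbstate_t has to remember under the
   new semantics when input stops in the middle of a character. */
typedef struct {
    unsigned long partial;  /* bits collected so far */
    int pending;            /* continuation bytes still expected */
} utf8_state;

#define UTF8_INVALID    ((size_t)-1)
#define UTF8_INCOMPLETE ((size_t)-2)

/* Decode at most one character from buf[0..len-1].
   Returns the number of bytes consumed when a character was completed,
   UTF8_INCOMPLETE when all len bytes were absorbed into *st without
   completing a character, or UTF8_INVALID on a malformed sequence. */
size_t utf8_step(utf8_state *st, const unsigned char *buf, size_t len,
                 unsigned long *cp)
{
    size_t i;

    assert(st != NULL && buf != NULL && cp != NULL);

    for (i = 0; i < len; i++) {
        unsigned char b = buf[i];

        if (st->pending == 0) {
            if (b < 0x80) {                    /* single-byte character */
                *cp = b;
                return i + 1;
            } else if ((b & 0xE0) == 0xC0) {   /* lead byte of 2 */
                st->partial = b & 0x1F;
                st->pending = 1;
            } else if ((b & 0xF0) == 0xE0) {   /* lead byte of 3 */
                st->partial = b & 0x0F;
                st->pending = 2;
            } else if ((b & 0xF8) == 0xF0) {   /* lead byte of 4 */
                st->partial = b & 0x07;
                st->pending = 3;
            } else {
                return UTF8_INVALID;
            }
        } else {
            if ((b & 0xC0) != 0x80)            /* not a continuation byte */
                return UTF8_INVALID;
            st->partial = (st->partial << 6) | (b & 0x3F);
            if (--st->pending == 0) {
                *cp = st->partial;
                st->partial = 0;
                return i + 1;
            }
        }
    }
    return UTF8_INCOMPLETE;    /* input exhausted; progress saved in *st */
}
```

Feeding the first two bytes of a three-byte character (for example 0xE2 0x82 of U+20AC) returns UTF8_INCOMPLETE with both bytes absorbed; a later call with the final byte 0xAC completes the character. Under the traditional semantics the first call would instead consume nothing, and the caller would have to present all three bytes together.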

--with-nontraditional-wcstombs
The traditional semantics of the wcsrtombs and wcsnrtombs functions in ISO C 89 Amendment 1 is to write only complete multibyte characters. When a multibyte character cannot be stored in the destination buffer without overflowing it, conversion stops before this character. An mbstate_t contains shift state only (i.e., for 8-bit and UTF-8 encodings, no information at all).

The new ISO C 99 semantics is to write as many bytes as allowed, even at the risk of writing an incomplete multibyte character. An mbstate_t keeps track of how far the current multibyte character has been written.

libutf8 by default implements the traditional semantics. --with-nontraditional-wcstombs enables the new one instead.
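The traditional behaviour, stopping before a character that does not fit, can be sketched as a small stand-alone UTF-8 encoder. Again this is an illustration under stated assumptions rather than libutf8's code: utf8_length and utf8_encode_whole are hypothetical names, and an invalid code point simply ends the conversion here, whereas the real functions report an error.

```c
#include <assert.h>
#include <stddef.h>

/* Number of UTF-8 bytes needed for cp, or 0 if cp is out of range. */
static size_t utf8_length(unsigned long cp)
{
    if (cp < 0x80)     return 1;
    if (cp < 0x800)    return 2;
    if (cp < 0x10000)  return 3;
    if (cp < 0x110000) return 4;
    return 0;
}

/* Encode src[0..nsrc-1] into dst, which holds dstcap bytes.  Only whole
   characters are written: conversion stops before the first character
   that would overflow dst.  Stores the number of characters converted
   in *consumed and returns the number of bytes written. */
size_t utf8_encode_whole(const unsigned long *src, size_t nsrc,
                         unsigned char *dst, size_t dstcap,
                         size_t *consumed)
{
    size_t written = 0;
    size_t i;

    assert(src != NULL && dst != NULL && consumed != NULL);

    for (i = 0; i < nsrc; i++) {
        unsigned long cp = src[i];
        size_t n = utf8_length(cp);
        unsigned char *p = dst + written;

        /* Simplification: an invalid code point just ends the loop. */
        if (n == 0 || n > dstcap - written)
            break;

        switch (n) {
        case 1:
            p[0] = (unsigned char)cp;
            break;
        case 2:
            p[0] = (unsigned char)(0xC0 | (cp >> 6));
            p[1] = (unsigned char)(0x80 | (cp & 0x3F));
            break;
        case 3:
            p[0] = (unsigned char)(0xE0 | (cp >> 12));
            p[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            p[2] = (unsigned char)(0x80 | (cp & 0x3F));
            break;
        default:
            p[0] = (unsigned char)(0xF0 | (cp >> 18));
            p[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            p[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            p[3] = (unsigned char)(0x80 | (cp & 0x3F));
            break;
        }
        written += n;
    }
    *consumed = i;
    return written;
}
```

With the three characters U+0041, U+20AC, U+0042 and a 3-byte destination, only the first character is written: the euro sign needs three bytes and just two remain, so conversion stops before it. Under the new semantics two of its three bytes would be written and the state object would record that one byte is still outstanding.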

This library can be built and installed in two variants:

Distribution: http://www.haible.de/bruno/libutf8-0.8.1.tar.gz
libutf8 package
Bruno Haible <bruno-antispam@antispam.clisp.org>

Last modified: 27 May 2009.