Advice for Programmers

Advice for application writers and protocol designers

Don't use encodings with slightly differing variants for exchanging text between different machines.
Example: Don't use ISO-8859-1 here - half of the computers in Europe use ISO-8859-15 nowadays, and errors such as this one go unnoticed in a simple test but later cause problems for users.
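
The mismatch is easy to miss because the two encodings agree on almost every byte. A minimal sketch, assuming Python's built-in codecs as a stand-in for whatever converter is actually in use (the helper name differing_bytes is hypothetical), lists the positions at which they disagree:

```python
# Compare how every single byte decodes under two single-byte encodings.
def differing_bytes(enc_a, enc_b):
    diffs = []
    for value in range(256):
        raw = bytes([value])
        a = raw.decode(enc_a)
        b = raw.decode(enc_b)
        if a != b:
            diffs.append((value, a, b))
    return diffs

# Print each byte value with the code point it yields in either encoding.
for value, a, b in differing_bytes("iso8859-1", "iso8859-15"):
    print(f"0x{value:02X}: U+{ord(a):04X} vs U+{ord(b):04X}")
```

A test text that happens to avoid these few positions, for instance one without a euro sign, passes under either label, which is why the error surfaces only later.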

Don't label text in a wrong way.
Example: Text in windows-1252 should not be labelled as ISO-8859-1 unless you have verified that it contains no byte in the range 0x80..0x9F. Similarly, text in windows-936 should not be labelled as GBK or GB2312 unless you have verified that it contains no character from the windows-936 superset.
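
The verification for the windows-1252 case amounts to a scan over the bytes. The following is a sketch only; the function names safe_to_label_as_latin1 and label_for are hypothetical and not part of any library:

```python
def safe_to_label_as_latin1(data: bytes) -> bool:
    """Return True if no byte falls in the range 0x80..0x9F."""
    return not any(0x80 <= b <= 0x9F for b in data)

def label_for(data: bytes) -> str:
    """Pick the label for text that is known to be windows-1252."""
    return "ISO-8859-1" if safe_to_label_as_latin1(data) else "windows-1252"

print(label_for("caf\u00e9".encode("cp1252")))          # no byte in 0x80..0x9F
print(label_for("price: \u20ac5".encode("cp1252")))     # euro sign is byte 0x80
```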

Advice for programmers dealing with encodings

Don't write a conversion facility of your own.
Reason: You don't have the manpower to write converters for all encodings that are needed around the world. Your program would be unusable in half of the world. Better use the character conversion facilities in the operating system, or GNU libiconv.
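
As an illustration of relying on an existing facility, the sketch below delegates the whole conversion to Python's built-in codec library, which here plays the role of the operating system's converter. The wrapper name convert is hypothetical. In C, the corresponding calls in GNU libiconv are iconv_open, iconv and iconv_close.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes between two encodings without any hand-written tables."""
    # Both steps raise an exception on input that is invalid in the source
    # encoding or not representable in the target encoding.
    return data.decode(source).encode(target)

latin1_bytes = "Gr\u00fc\u00dfe".encode("iso8859-1")
print(convert(latin1_bytes, "iso8859-1", "utf-8"))
```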

Don't create a new charset of your own.
Reason: We already have too many charsets. We already have the choice among dozens of charsets for Chinese, three for Georgian, three for Kazakh, etc. The Unicode charset (preferred encoding: UTF-8) is widely implemented. Better use Unicode, the sooner the better, and get rid of legacy encodings that need support in every charset conversion facility.

Don't make small modifications to the conversion tables of an encoding without renaming it.
This leads to big confusion: The same text document will look different on different systems, and users will often not even be aware of the problem (since they normally use just one system). Example: The Big5 family. Different companies added extensions to this encoding. Often these extensions are incompatible with each other, and never are they labelled as such. This effectively limits the set of computers and the timespan in which a given document can be viewed.
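
The size of the problem can be estimated by comparing two members of the family directly. The sketch below assumes the tables that Python ships under the names "big5" and "cp950"; other systems may carry yet other variants, and the helper names are hypothetical.

```python
def decode_or_none(raw, encoding):
    """Decode raw bytes, returning None if the encoding rejects them."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

def disagreements(enc_a, enc_b):
    """List the two-byte sequences on which the two encodings disagree."""
    found = []
    for lead in range(0x81, 0xFF):
        for trail in range(0x40, 0xFF):
            raw = bytes([lead, trail])
            a = decode_or_none(raw, enc_a)
            b = decode_or_none(raw, enc_b)
            # A difference is either two distinct characters, or a sequence
            # that only one of the two variants accepts.
            if a != b:
                found.append((raw, a, b))
    return found

diffs = disagreements("big5", "cp950")
print(len(diffs), "two-byte sequences decode differently")
```

Every sequence in that list is a character that changes or disappears when a document labelled simply as "Big5" moves between the two systems.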

Don't make small modifications to the conversion tables while creating a variant.
This causes the same document to be shown differently on different systems or in different situations.

Don't convert between Shift_JIS, EUC-JP and ISO-2022-JP-2 using the formulas.
Reason: Nowadays Microsoft dictates the evolution of Shift_JIS, whereas ISO-2022-JP-2 is standardized. Some characters have been "redefined" or "remapped" by Microsoft in the newer CP932 definitions; these characters no longer agree with the characters at the same position in Shift_JIS of 15 years ago - which are those to which the formulas map.
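
One commonly cited remapped position can be observed with a small sketch. It assumes that Python's "shift_jis" codec reflects the older JIS-based table and that its "cp932" codec reflects the Microsoft definition; the byte pair 0x81 0x60 is chosen here purely for illustration.

```python
raw = b"\x81\x60"

old = raw.decode("shift_jis")   # table derived from the JIS standard
new = raw.decode("cp932")       # Microsoft's definition of the same bytes

# Show which Unicode code point each table assigns to the byte pair.
print(f"shift_jis: U+{ord(old):04X}")
print(f"cp932:     U+{ord(new):04X}")
print("same character" if old == new else "different characters")
```

A formula that shifts byte values between Shift_JIS and EUC-JP knows nothing about such remappings, so it silently produces the character from the old table.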

If an encoding evolves following a vendor's monopoly, follow that vendor.
If a vendor has a monopoly and defines a charset differently than it did 3 years ago, over time up to 99% of the users will rely on the new definition. Software that uses a different conversion table - even if it's the original conversion table from 3 years ago - will then confuse the users.

Advice regarding character sets
Bruno Haible <bruno@clisp.org>

Last modified: 17 June 2007.