Advice for Programmers
Advice for application writers and protocol designers
- Don't use encodings with slightly differing variants for exchanging
text between different machines.
- Example: Don't use ISO-8859-1 here - half of the computers in Europe use
ISO-8859-15 nowadays, and errors such as this one are not noticed in a
simple test but later cause problems for users.
- Don't label text with the wrong encoding name.
- Example: Text in windows-1252 should not be labelled as ISO-8859-1 except
after verification that it contains no byte in the range 0x80..0x9F.
Similarly, text in windows-936 should not be labelled as GBK or GB2312
unless you have verified that it contains no character from the
windows-936 superset.
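The two examples above can be sketched in Python. This is only an illustration under stated assumptions: the helper name safe_to_label_as_latin1 is hypothetical, and the codec names are those of Python's standard library.

```python
def safe_to_label_as_latin1(data: bytes) -> bool:
    """Return True if no byte lies in 0x80..0x9F, the range in which
    windows-1252 and ISO-8859-1 disagree."""
    return not any(0x80 <= b <= 0x9F for b in data)


# The euro sign occupies byte 0x80 in windows-1252, so this text must
# keep its windows-1252 label.
euro_text = "5 €".encode("cp1252")
# Text without any byte in 0x80..0x9F may be labelled either way.
plain_text = "café".encode("cp1252")

euro_ok = safe_to_label_as_latin1(euro_text)
plain_ok = safe_to_label_as_latin1(plain_text)

# The same byte read under two slightly differing variants:
# currency sign in ISO-8859-1, euro sign in ISO-8859-15.
as_latin1 = b"\xa4".decode("iso-8859-1")
as_latin9 = b"\xa4".decode("iso-8859-15")
```

A simple test with unaccented text passes under either label, which is why such mislabelling tends to surface only later, in users' data.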
Advice for programmers dealing with encodings
- Don't write a conversion facility of your own.
- Reason: You don't have the manpower to write converters for all
encodings that are needed around the world. Your program would be
unusable in half of the world. Use the character conversion facilities
of the operating system, or GNU libiconv, instead.
- Don't create a new charset of your own.
- Reason: We already have too many charsets. We already have the choice
among dozens of charsets for Chinese, three for Georgian, three for
Kazakh, etc. The Unicode charset (preferred encoding: UTF-8) is
widely implemented. It is better to use Unicode, the sooner the better,
and to get rid of legacy encodings that need support in every charset
conversion facility.
- Don't make small modifications to the conversion tables of an encoding
without renaming it.
- This leads to great confusion: The same text document will look different
on different systems, and users will often not even be aware of the
problem (since they normally use just one system). Example: The Big5
family. Different companies added extensions to this encoding. Often
these extensions are incompatible with each other, and they are never
labelled as such. This effectively limits the set of computers and the
timespan in which a given document can be viewed.
- Don't make small modifications to the conversion tables while creating
a variant.
- This causes the same document to be shown differently on different
systems or in different situations.
- Don't convert between Shift_JIS, EUC-JP and ISO-2022-JP-2 using the
well-known arithmetic formulas.
- Reason: Nowadays Microsoft dictates the evolution of Shift_JIS, whereas
ISO-2022-JP-2 is standardized. Some characters have been "redefined" or
"remapped" by Microsoft in the newer CP932 definitions; these characters
no longer agree with the characters at the same position in Shift_JIS of
15 years ago - which are those to which the formulas map.
- If an encoding evolves following a vendor's monopoly, follow that
vendor.
- If a vendor has a monopoly and defines a charset differently than 3 years
ago, over time up to 99% of the users will rely on the new definition.
Software that uses a different conversion table - even if it's the
original conversion table from 3 years ago - will then confuse users.
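The advice above about relying on existing conversion facilities and on exact encoding names can be sketched in Python. This is a minimal illustration, not a definitive recipe: the helper name convert is hypothetical, and the codec names and tables are those of Python's standard library, assumed here to be suitable for the data at hand.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Convert between two named encodings via Unicode, using the
    runtime's own conversion tables instead of hand-written ones."""
    return data.decode(source).encode(target)


# Legacy text converted to UTF-8 with the library's table.
legacy = "Grüße".encode("iso-8859-15")
utf8 = convert(legacy, "iso-8859-15", "utf-8")

# The same two bytes read under two closely related encoding names.
raw = b"\x81\x60"
as_shift_jis = raw.decode("shift_jis")
as_cp932 = raw.decode("cp932")
tables_differ = as_shift_jis != as_cp932
```

With Python's tables, the byte pair 0x81 0x60 decodes to U+301C under shift_jis but to U+FF5E under cp932. A formula-based conversion works on byte positions only and cannot see such differences, so the choice of table and of label has to be made explicitly.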
Bruno Haible <bruno@clisp.org>
Last modified: 17 June 2007.