Unraveling Unicode problems in WikkaWiki

While I was setting up the wiki, I ran into some problems with non-English letters, such as ö. So I had to learn more about Unicode and encodings (a task that does not come up often if you program in languages such as Fortran, where you normally have other classes of problems to handle). I found this very interesting page on JoelOnSoftware, and another FAQ for Unix/Linux.

Basically, if I understood correctly, everything can be summed up as follows:

The problems I had with the letter ö arose from the wrong encoding WikkaWiki was assuming: it declared the document as iso-8859-1 encoded. ISO-8859-1 (also known as Latin-1) is a single-byte encoding that extends ASCII, where values from 0xA0 to 0xFF map to most Western European symbols. However, the data I wrote was UTF-8 encoded, which introduced a mismatch between the data and the decoding performed by the browser (ISO-8859-1, as specified in the HTML). Since ö has code point U+00F6 (binary 11110110), it gets a two-byte UTF-8 encoding of the form 110yyyyy 10zzzzzz: 11000011 10110110, or 0xC3 0xB6. Each of these two bytes was interpreted as a single-byte ISO-8859-1 value, leading to a weird Ã¶ (capital A with tilde followed by a pilcrow, the end-of-paragraph sign) instead of an ö (o with umlaut).
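The arithmetic above can be checked with a few lines of Python. This is only a sketch to illustrate the mismatch, not anything WikkaWiki itself does; the helper name utf8_two_byte is made up for this example and only covers the two-byte case.

```python
def utf8_two_byte(code_point):
    """Hand-encode a code point in U+0080..U+07FF as 110yyyyy 10zzzzzz."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only two-byte UTF-8 sequences are handled here")
    first = 0b11000000 | (code_point >> 6)         # 110yyyyy: upper 5 bits
    second = 0b10000000 | (code_point & 0b111111)  # 10zzzzzz: lower 6 bits
    return bytes([first, second])

code_point = ord("\u00f6")                  # ö, U+00F6
encoded = utf8_two_byte(code_point)

print(format(code_point, "08b"))            # 11110110
print(encoded.hex())                        # c3b6
print(encoded == "\u00f6".encode("utf-8"))  # True: matches Python's own encoder

# Decoding the UTF-8 bytes as ISO-8859-1 turns each byte into its own character.
misread = encoded.decode("iso-8859-1")
print([hex(ord(ch)) for ch in misread])     # ['0xc3', '0xb6']
```

The last line shows the two characters the browser ended up displaying: U+00C3 (capital A with tilde) and U+00B6 (pilcrow).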

To solve this, I changed the default meta tag content in WikkaWiki from

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

to

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

and now it works correctly.
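To see why this one-line change is enough, here is a rough Python sketch of the decision the browser makes. It is a simplification under stated assumptions: the function declared_charset and its regular expression are invented for this example, and a real browser also takes the HTTP Content-Type header into account, which this sketch ignores.

```python
import re

def declared_charset(html, default="iso-8859-1"):
    """Return the charset named in a meta Content-Type tag, or a default."""
    match = re.search(r"charset=([\w-]+)", html, re.IGNORECASE)
    return match.group(1).lower() if match else default

# The bytes actually stored in the page for the letter ö: 0xC3 0xB6.
page_body = "\u00f6".encode("utf-8")

old_tag = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
new_tag = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'

for tag in (old_tag, new_tag):
    charset = declared_charset(tag)
    text = page_body.decode(charset)
    # Print the charset, how many characters came out, and their code points.
    print(charset, len(text), [hex(ord(ch)) for ch in text])
```

With the old tag the same two bytes decode to two characters; with the new tag they decode to the single character U+00F6. The stored data never changes, only the way it is interpreted.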

Other references: