If you do not live in an English-speaking country, your language will make use of all sorts of odd characters, such as "ä", "ö", "ü", and "ß" in German or "你" and "我" in Chinese. Of course, you intend that such characters, if used in your web page, blog posts, or text files, come out correctly on the screen of your readers. In my course on Distributed Computing, there actually is a lesson on text encoding, but I can only briefly touch the topic there. Here I just want to summarize it a bit more comprehensively.
Character Set and Encoding
Your web page resides on your computer or server as text and will be transmitted to the reader's computer via a network connection as a stream of bits. From this it follows that there must be at least two conversion steps: from text to bits and back from bits to text.
But there is more to it: character sets and encodings. A letter, punctuation mark, or other symbol is called a character. In the memory of a computer, each character is represented as a number, called a "code point". The meaning of this number depends on the character set in use. For example, in the character set defined by the Windows-1252 encoding, the number 252 stands for the lower-case character "ü". Under DOS Codepage 872, code point 252 stands for "Ч" (and "ü" is undefined).
Anyway, knowing the character set, a computer program (or operating system function) which paints (renders) the text "knows" how to paint the right picture (glyph) on the screen for a given code point. We can conclude that the relationship between code points and text characters is one part of the text encoding. The second part is how to represent the sequence of code points as bit strings, which must be done when storing them or sending them over the internet. Traditionally, this is done by simply translating these code points, which are nothing else than integer numbers, to their default binary representation. Since this was the only relevant approach for quite some time, the terms character set and encoding are often used interchangeably and not distinguished. However, today, several different encoding methods may exist even for the same character set, as we will discuss later.
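A small illustration in Python (the code-page names below are those of Python's codec registry; I use cp866, a Cyrillic DOS code page, in place of the one mentioned above, since it is available in the standard library): the same character yields different bytes under different encodings, and the same byte means different characters under different character sets.

```python
# The character set maps characters to code points (integers).
# "ü" has code point 252 in Unicode, the same value as in Windows-1252:
print(ord("ü"))              # 252

# The encoding maps code points to bytes; the result depends on
# the encoding chosen:
print("ü".encode("cp1252"))  # b'\xfc' - one byte with value 252
print("ü".encode("utf-8"))   # b'\xc3\xbc' - two bytes

# Decoding the same byte under a different (here: Cyrillic DOS)
# character set yields a completely different character:
print(b"\xfc".decode("cp866"))
```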
In summary, whenever creating and publishing a website or text document, the following process will take place:
- Writing the text in the editor. The data is entered by pressing buttons depicting the right characters on the keyboard.
- In the memory of the computer, however, each character is represented as a code point (an integer number) of the character set used by the operating system or program. Each time you hit a key on the keyboard and input a character, a corresponding code point is stored in memory. The program renders this text using an appropriate font containing the right glyphs.
- You save or send your text.
- Before transmitting the list of numbers representing the text over the web or storing it in a file, the list of code points is encoded as stream of bits. For every character set, there might be several different methods to do that. The most primitive way is to just store the integers directly, say with one or two bytes per integer/character.
- The data is now stored in a file or travelling over a network connection.
- When loading the file again or receiving the contents via a network connection, the bit stream must be decoded again into separate code points. This requires knowledge of both the right decoding procedure and the character set.
- The code points must then be displayed, again by using the right character set (and corresponding font).
Steps 2 to 4 are referred to as encoding, steps 6 and 7 as decoding. Any error during these processes can turn a "我" into a "ÎÒ", which, for the Chinese reader, is significantly less useful. The most likely error is that the receiving side performs a decoding which does not match the encoding and character set used on the sending side.
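The "我" to "ÎÒ" mishap can be reproduced in a few lines of Python: the text is encoded with one method (here GB2312) but decoded with a mismatched one (Windows-1252).

```python
text = "我"
wire = text.encode("gb2312")   # sender encodes: two bytes, b'\xce\xd2'

print(wire.decode("gb2312"))   # matching decoding: 我
print(wire.decode("cp1252"))   # mismatched decoding: ÎÒ
```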
There are basically two families of character sets that are relevant today: those that use at most one byte of memory per character (code point) and are thus limited to at most 256 characters, and those which can use more. The former group is the older one and its most prominent members are ASCII, Windows-1252, and ISO/IEC_8859-1. For quite a few languages based on the Latin alphabet (such as English, German, and French), one byte per character is enough to represent all possible characters.
However, languages like Chinese have significantly more characters than the 256 that can be identified with the 8 bits of a byte. For such languages, dedicated character sets that use more than one byte have been designed, such as GB2312 [国家标准2312]. The problem with per-language character sets is that there are many different languages and characters in the world. Having many different character sets will necessarily lead to incompatibilities, which, in turn, require more and more code to deal with.
At some point in time, a globally unique way to identify all characters from all languages was designed: the Universal Character Set (UCS), which basically assigns a number (code point) to any character in the world. The term Unicode is used almost synonymously with UCS, but Unicode provides additional definitions for, e.g., text string comparison. Anyway, with UCS and Unicode defining more than 110'000 characters from 100 scripts, half of the problem of dealing with text on computers, the translation between an integer number and the corresponding character, is solved. Most computers can correctly display the right character for a given UCS code point, or at least show some placeholder symbol like "?" if encountering a character which has no useful representation in the current font.
Moreover, the first code points in the UCS, in ASCII, and in ISO/IEC_8859-1 are the same, namely those for the Latin/English alphabet plus some Western European languages' special characters. After that, characters from other languages such as Russian, Chinese, and Korean, as well as mathematical symbols, follow (not in that order). Today, the single-byte character sets are disappearing and Unicode/UCS has become the way to go.
In the old days of ASCII, Windows-1252, and ISO/IEC_8859-1, text encoding was easy: Since each character occupied one byte in memory, writing these bytes directly into a file or sending them directly over the network was the way to go. No special encoding rules were required.
The question arises how to encode these code points into bit sequences. Unicode code points exceed the range of a single byte (and even of two bytes), so naively storing each one as a fixed-size four-byte integer would quadruple the size of a text that previously needed one byte per character. Thus, there are several ways to encode Unicode text, such as UTF-8, UTF-7, UTF-16, and UTF-32.
UTF-8 is today the de-facto standard on the internet. It has the advantage that the characters of the ASCII range, which have the same code points in Unicode as in ASCII and ISO/IEC_8859-1, are encoded to bytes with exactly the same values as in those encodings. In other words, an English text encoded in UTF-8 looks exactly the same as if encoded in ASCII, and a German text differs only at the few umlauts and "ß", each of which occupies two bytes instead of one. For web development and data exchange, this encoding is thus quite useful when working in these languages, since the characters have a high chance of being displayed and processed correctly even if, e.g., the bit stream is decoded as an ASCII string instead of UTF-8. Of course, this only works if no characters from, say, the Chinese language are included, which would just appear as rubbish on the screen. The second advantage of UTF-8 is that the data size for English text does not change at all, and for German or French text grows only marginally. A Chinese or Russian character will, however, occupy multiple bytes.
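A quick Python check of these compatibility claims: for characters in the ASCII range the UTF-8 bytes are identical to the ASCII/ISO/IEC_8859-1 bytes, while characters outside this range need two or more bytes.

```python
# ASCII-range text: byte-identical under ASCII and UTF-8.
print("Hello".encode("utf-8") == "Hello".encode("ascii"))  # True

# A German umlaut lies outside the ASCII range: one byte in
# ISO/IEC 8859-1 ("latin-1"), but two bytes in UTF-8.
print("ü".encode("latin-1"))      # b'\xfc'
print("ü".encode("utf-8"))        # b'\xc3\xbc'

# A Chinese character occupies three bytes in UTF-8.
print(len("我".encode("utf-8")))  # 3
```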
Still, different encodings from UTF-8 to GB2312 have different rules to match bit patterns to characters and may belong to different character sets. If your browser receives text that it assumes is encoded using method A but which is actually encoded with method B, it will potentially display rubbish. Since web pages are text, you need to tell the browser in which way this text is encoded.
Where is the Encoding?
So far, we have discussed that there are different character sets and encodings, and how it is important that the program loading your document knows which you used when storing/sending it. But how does it know that? In case of the web, there are two places, to my knowledge, where this information is conveyed: The HTTP header and the meta tags in HTML.
Whenever visiting a web page on the internet, your browser first opens a TCP/IP connection to the web server. Over this connection, the two will use a text-based protocol, HTTP, to discuss what should be done. The web browser tells the web server which document you want to view (the request), and the web server answers with a response stating either that it has this document (and then sends it) or that it does not (e.g., with the 404 Not Found error).
During the request, your browser not only tells the web server which document you want, but can also state which character sets it can deal with (Accept-Charset) and which languages you prefer (Accept-Language). In its response, the web server will not only send the document, but can also state in which language it is (Content-Language) and what document format is used (Content-Type); the latter may also contain an identifier of the character set and encoding.
Of course, if you have your website in a shared hosting environment, you typically have little control over what the web server is doing and sending. In some cases, you may be able to edit the .htaccess file of the server, but that requires quite some knowledge and is not always possible. In other words, in many cases, you will not be able to effectively control this part of your web page delivery. Thus, the web server may send no encoding information or, in the worst case, even a wrong encoding.
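If your hoster runs Apache and does permit such overrides, a single directive in the .htaccess file makes the server announce UTF-8 in every Content-Type header for text documents (a sketch; whether this works depends entirely on your hoster's configuration):

```apache
# Announce UTF-8 in the Content-Type response header for text documents.
# Only effective if the hoster allows this override in .htaccess.
AddDefaultCharset UTF-8
```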
HTML Meta Tag
The other option is to use the meta tag in the HTML header. By putting the following line into the <head> element of your HTML document, you tell the browser that the document is an HTML document based on the Unicode character set and encoded using UTF-8.
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
http-equiv is not only interpreted by the web browser, but should (theoretically) also be interpreted by the web server before sending the HTML document. In the ideal case, the web server would put the field Content-Type: text/html; charset=UTF-8 into the HTTP header of its response. If that does not happen, the browser can still find the right way to decode the web page: It will probably begin decoding it using UTF-8 or ISO/IEC_8859-1 until it encounters this meta tag. Then it can switch to the right encoding and, if necessary, decode the page again from the beginning.
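The charset identifier lives inside a MIME-style header value, so a receiver has to parse it out of the Content-Type field. A small sketch in Python, using the standard library's email machinery (which handles exactly this header syntax):

```python
from email.message import Message

# Build a header like the one sent by the web server (or found in
# the http-equiv meta tag) and extract its parts:
msg = Message()
msg["Content-Type"] = "text/html; charset=UTF-8"

print(msg.get_content_type())     # 'text/html'
print(msg.get_content_charset())  # 'utf-8' (normalized to lower case)
```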
Potential Sources of Error
Still, if you work on a web page, there are quite a few things that can go wrong. For instance, the web server may not parse the <meta> tag you have set in your HTML file. It may thus send a content type without encoding information to the user, or maybe even the wrong content type. OK, in a correctly configured system, this should not happen.
What can happen is that the web browser of the person visiting your page is misconfigured. Some browsers have a menu "View" with another sub-menu such as "Character Encoding" under which the user can select the encoding to be used to display a web page. This will then override whatever settings you made in your page and whatever the web server said in its HTTP Response. So even if you do everything right, your non-English characters may still come out as garbage.
If your text documents are in HTML, XHTML, XML, or plain SGML format, there is one more way of "encoding" Unicode characters: using XML/HTML character entities. Since web pages usually are in HTML or XHTML format, this is an interesting additional option. The basic idea is to represent "special" characters like "φ" or "猫" by character sequences that only use the basic Latin/English alphabet. These two characters correspond to the Unicode code points 966 (hexadecimal 3C6) and 29483 (hexadecimal 732B). The numeric character entities identifying these two code points are &#966; (or &#x3C6;) and &#29483; (or &#x732B;). In other words, we just take the code point as a decimal number and put an ampersand and a hash in front of it and a semicolon after it. If we want to write the hexadecimal value of the code point instead, we additionally insert an "x" after the hash.
The code points 966 and 29483 are way out of the range of a single byte. In other words, any binary encoding, be it UTF-8 or UTF-16, would encode them using at least two bytes. However, if we represent these code points by character entities, which only include basic Latin characters, punctuation marks, and digits, such multi-byte codes need not occur in our data stream at all. Moreover, we could then encode the whole text using legacy encodings like ASCII as well. This is cool since UTF-8, ASCII, ISO/IEC_8859-1, and GB2312 are all compatible for these characters. Even if the program loading our text data used a wrong decoding procedure, chances are very high that our text still comes out correct! And this is true regardless of what settings the browsers visiting your page have: in the worst case, they render the characters as "?" if they have no appropriate fonts.
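The entity mechanics can be sketched in a few lines of Python. (to_entities is a hypothetical helper name, not a standard function; html.unescape is the standard library's entity decoder.)

```python
import html

def to_entities(text):
    # Replace every non-ASCII character by its decimal numeric
    # character reference; ASCII characters pass through unchanged.
    return "".join(ch if ord(ch) < 128 else "&#%d;" % ord(ch) for ch in text)

s = to_entities("φ and 猫")
print(s)                 # '&#966; and &#29483;' - pure ASCII now
print(html.unescape(s))  # 'φ and 猫' - what an HTML/XML parser reconstructs
```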
The price is that the character decoding procedure is shifted up by one abstraction level. Entities only work in XML, HTML, XHTML, and other SGML-based document formats. If you use them in plain text files, they will appear as is, without being translated to the proper Unicode characters. A text editor using a UTF-8 based decoding procedure will display the entity text literally, i.e., as "&#x732B;" instead of "猫". Only an HTML or XML parser will further decode "&#x732B;" to "猫". For the web, however, we can expect that this will work, since web pages are HTML or XHTML documents.
Another price you pay is that the characters occupy more space. Obviously, when writing a purely Chinese text, encoding everything into HTML entities will at least double its size. However, if you only occasionally include foreign characters or use a language like German where there are few special characters, that increase in size may be acceptable.
When working with texts containing non-English characters, I suggest the following:
- Use Unicode characters and store/send the data UTF-8 encoded.
- If serving files from a web server, make sure that it sends the right encoding as part of the HTTP response. Edit the .htaccess file if necessary.
- If creating HTML documents, insert the correct <meta> tag with the right encoding in the header.
- If creating HTML or XML documents, use entities instead of plain special characters where possible.
- Consider doing this conversion automatically instead of manually, in order not to overlook characters.
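Python, for example, can do the entity conversion automatically: the "xmlcharrefreplace" error handler of its codec machinery substitutes a numeric character entity for every character that the target encoding cannot represent.

```python
# Encode to plain ASCII; every non-ASCII character automatically
# becomes a decimal numeric character reference:
data = "Das Kätzchen 猫".encode("ascii", errors="xmlcharrefreplace")
print(data)  # b'Das K&#228;tzchen &#29483;'
```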
There are several other topics concerning text representation and encoding on computers, such as:
- Endianness and Byte Order Marks (BOMs) in Unicode encodings
- Writing directions: Left-to-right versus right-to-left (e.g., in Arabic script).
- Different character sequences for "new line" under Windows and Unix/Linux.
- The existence or lack of data type support for (Unicode?) strings in different programming languages.
(Just to name a few…)