Yongji Wang's Blog: HTTP The Definitive Guide (Internationalization)

Internationalization

This chapter covers two primary internationalization issues for the Web: character set encodings and language tags.
HTTP applications use character set encodings to request and display text in different alphabets, and they use language tags to describe and restrict content to languages the user understands.

HTTP Support for International Content
Servers tell clients about a document’s alphabet and language with the HTTP Content-Type charset parameter and Content-Language headers. These headers describe what’s in the entity body’s “box of bits,” how to convert the contents into the proper characters that can be displayed onscreen, and what spoken language the words represent.

At the same time, the client needs to tell the server which languages the user understands and which alphabetic coding algorithms the browser has installed. The client sends Accept-Charset and Accept-Language headers to tell the server which character set encoding algorithms and languages the client understands, and which of them are preferred.

Accept-Language: fr, en;q=0.8
Accept-Charset: iso-8859-1, utf-8
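
As a rough illustration of how a server might read these headers, here is a minimal sketch that splits an Accept-Language or Accept-Charset value into entries ordered by preference. The function name parse_accept is hypothetical, and the sketch assumes that a missing q parameter means a weight of 1.0; it is not a complete header parser.

```python
def parse_accept(header):
    """Parse an Accept-* header value into (value, q) pairs, best first."""
    items = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        value, _, params = part.partition(";")
        q = 1.0  # assumption: a missing q parameter means full preference
        for param in params.split(";"):
            name, _, raw = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(raw)
                except ValueError:
                    q = 0.0  # treat an unparsable weight as "not acceptable"
        items.append((value.strip().lower(), q))
    # sorted() is stable, so entries with equal q keep their header order
    return sorted(items, key=lambda item: item[1], reverse=True)


print(parse_accept("fr, en;q=0.8"))        # [('fr', 1.0), ('en', 0.8)]
print(parse_accept("iso-8859-1, utf-8"))   # [('iso-8859-1', 1.0), ('utf-8', 1.0)]
```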

Character Sets and HTTP
Charset Is a Character-to-Bits Encoding

How Character Sets and Encodings Work

Bits-to-character conversions happen in two steps:

In Figure 16-2a, bits from a document are converted into a character code that identifies a particular numbered character in a particular coded character set. In the example, the decoded character code is numbered 225.
In Figure 16-2b, the character code is used to select a particular element of the coded character set. In iso-8859-6, the value 225 corresponds to “ARABIC LETTER FEH.” The algorithms used in Steps a and b are determined from the MIME charset tag.
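
The two steps can be sketched with Python's built-in codecs. This small example takes the single byte 11100001 (code 225), decodes it as iso-8859-6, and then decodes the same byte as iso-8859-1 to show how the wrong charset yields the wrong character. The variable names are illustrative only.

```python
import unicodedata

raw = bytes([0b11100001])          # one byte from the document body

# Step a: the encoding scheme turns the bits into a character code.
code = raw[0]                      # 225 under the 8-bit identity encoding

# Step b: the coded character set maps that code to a character.
arabic = raw.decode("iso-8859-6")
latin = raw.decode("iso-8859-1")   # same bits, different charset

print(code, unicodedata.name(arabic))  # 225 ARABIC LETTER FEH
print(code, unicodedata.name(latin))   # 225 LATIN SMALL LETTER A WITH ACUTE
```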

The Wrong Charset Gives the Wrong Characters
Standardized MIME Charset Values
Content-Type Charset Header and META Tags
Web servers send the client the MIME charset tag in the Content-Type header, using the charset parameter:
Content-Type: text/html; charset=iso-2022-jp

For HTML content, character sets might be found in <META HTTP-EQUIV="Content-Type"> tags that describe the charset.

If a client cannot infer a character encoding, it assumes iso-8859-1.
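
A client-side lookup of the charset parameter, with the iso-8859-1 fallback described above, might look like the following sketch. It leans on the standard library's email.message.Message to parse the header value; the helper name charset_of is hypothetical, and the sketch does not look inside the HTML for META tags.

```python
from email.message import Message


def charset_of(content_type, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset, or None if absent
    return msg.get_content_charset() or default


print(charset_of("text/html; charset=iso-2022-jp"))  # iso-2022-jp
print(charset_of("text/html"))                       # iso-8859-1
```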

The Accept-Charset Header
HTTP clients can tell servers precisely which character systems they support, using the Accept-Charset request header.

Multilingual Character Encoding Primer
Character Set Terminology

Character
Glyph - A character may have multiple glyphs if it can be written in different ways.
Coded character
Coding space - A range of integers that we plan to use as character code values.
Code width - The number of bits in each (fixed-size) character code.
Character repertoire - A particular working set of characters (a subset of all the characters in the world).
Coded character set
Character encoding scheme - An algorithm to encode numeric character codes into a sequence of content bits (and to decode them back).

Charset Is Poorly Named
Technically, the MIME charset tag (used in the Content-Type charset parameter and the Accept-Charset header) doesn’t specify a character set at all. The MIME charset value names a total algorithm for mapping data bits to codes to unique characters. It combines the two separate concepts of character encoding scheme and coded character set.

Characters

Glyphs, Ligatures, and Presentation Forms

Here’s the general rule: if the meaning of the text changes when you replace one glyph with another, the glyphs are different characters. Otherwise, they are the same characters, with a different stylistic presentation.

Coded Character Sets
US-ASCII: The mother of all character sets
“American Standard Code for Information Interchange.”
HTTP messages (headers, URIs, etc.) use US-ASCII.

iso-8859
The iso-8859 character set standards are 8-bit supersets of US-ASCII that use the high bit to add characters for international writing.

iso-8859-1, also known as Latin1, is the default character set for HTML.

JIS X 0201
JIS X 0201 is an extremely minimal character set that extends ASCII with Japanese half-width katakana characters. JIS stands for “Japanese Industrial Standard.”

JIS X 0208 and JIS X 0212
The JIS X 0208 character set was the first multi-byte Japanese character set; it defined 6,879 coded characters, most of which are Chinese-based kanji. The JIS X 0212 character set adds an additional 6,067 characters.

UCS
The Universal Character Set (UCS) is a worldwide standards effort to combine all of the world’s characters into a single coded character set.

Character Encoding Schemes

Fixed width
Variable width (nonmodal) - Variable-width encodings use different numbers of bits for different character code numbers.
Variable width (modal) - Modal encodings use special “escape” patterns to shift between different modes. For example, a modal encoding can be used to switch between multiple, overlapping character sets in the middle of text.

Encoding schemes:

8-bit - It supports only character sets with a code range of 256 characters. The iso-8859 family of character sets uses the 8-bit identity encoding.

UTF-8 - UTF stands for “UCS Transformation Format.” UTF-8 uses a nonmodal, variable-length encoding for the character code values, where the leading bits of the first byte tell the length of the encoded character in bytes, and each subsequent byte contains six bits of code value. For example, character code 90 (ASCII “Z”) would be encoded as 1 byte (01011010), while code 5073 (13-bit binary value 1001111010001) would be encoded into 3 bytes:
11100001 10001111 10010001
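
To make the bit layout concrete, here is a hand-rolled sketch of the 1-, 2-, and 3-byte UTF-8 forms. The function names utf8_encode and bits are hypothetical, 4-byte sequences are deliberately left out, and in practice Python's own str.encode("utf-8") does this work.

```python
def utf8_encode(code):
    """Encode one character code (0..0xFFFF) as UTF-8 bytes by hand."""
    if code < 0x80:                       # 1 byte:  0xxxxxxx
        return bytes([code])
    if code < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code >> 6),
                      0x80 | (code & 0x3F)])
    if code < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code >> 12),
                      0x80 | ((code >> 6) & 0x3F),
                      0x80 | (code & 0x3F)])
    raise ValueError("codes above 0xFFFF need 4 bytes; omitted in this sketch")


def bits(data):
    """Render bytes as space-separated 8-bit binary strings."""
    return " ".join(f"{b:08b}" for b in data)


print(bits(utf8_encode(90)))     # 01011010
print(bits(utf8_encode(5073)))   # 11100001 10001111 10010001
```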

iso-2022-jp
iso-2022-jp is a variable-length, modal encoding, with all values less than 128 to prevent problems with non–8-bit-clean software.
The encoding context always is set to one of four predefined character sets. Special “escape sequences” shift from one set to another.

euc-jp
EUC stands for “Extended Unix Code,” first developed to support Asian characters on Unix operating systems.
Like iso-2022-jp, the euc-jp encoding is a variable-length encoding that allows the use of several standard Japanese character sets. But unlike iso-2022-jp, the euc-jp encoding is not modal. There are no escape sequences to shift between modes.
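
The difference between the modal and nonmodal encodings can be observed with Python's built-in codecs. This sketch encodes the same three kanji both ways and checks for the ESC byte (0x1B) and for bytes with the high bit set; the sample string is an arbitrary choice for illustration.

```python
text = "日本語"  # three kanji, chosen only as sample input

modal = text.encode("iso-2022-jp")   # modal: escape sequences switch sets
nonmodal = text.encode("euc-jp")     # nonmodal: no escape sequences

ESC = 0x1B

# iso-2022-jp: contains escape sequences, every byte stays below 128
print(ESC in modal, all(b < 128 for b in modal))

# euc-jp: no escapes; these kanji are encoded with high-bit bytes
print(ESC in nonmodal, all(b >= 128 for b in nonmodal))

# Both decode back to the same characters.
print(modal.decode("iso-2022-jp") == nonmodal.decode("euc-jp") == text)
```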

Language Tags and HTTP
The Content-Language Header

The Accept-Language Header
Clients use Accept-Language and Accept-Charset to request content they can understand.

Types of Language Tags
Language tags can be used to represent:

General language classes (as in “es” for Spanish)
Country-specific languages (as in “en-GB” for English in Great Britain)
Dialects of languages (as in “no-bok” for Norwegian “Book Language”)
Regional languages (as in “sgn-US-MA” for Martha’s Vineyard sign language)
Standardized nonvariant languages (e.g., “i-navajo”)
Nonstandard languages (e.g., “x-snowboarder-slang”)

Subtags
Language tags have one or more parts, separated by hyphens, called subtags:

The first subtag is called the primary subtag. Its values are standardized.
The second subtag is optional and follows its own naming standard.
Any trailing subtags are unregistered.

Capitalization
Language tags are case-insensitive. However, by convention, lowercase is used for general languages, while uppercase is used to signify particular countries.

IANA Language Tag Registrations

First Subtag: Namespace
If the first subtag has:

Two characters, it is a language code from the ISO 639 and 639-1 standards
Three characters, it is a language code listed in the ISO 639-2 standard and extensions
The letter “i,” the language tag is explicitly IANA-registered
The letter “x,” the language tag is a private, nonstandard, extension subtag

Second Subtag: Namespace
If the second subtag has:

Two characters, it’s a country/region defined by ISO 3166
Three to eight characters, it may be registered with the IANA
One character, it is illegal

Remaining Subtags: Namespace
There are no rules for the third and following subtags, apart from being up to eight characters (letters and digits).
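
The namespace rules for the first and second subtags can be sketched as a small classifier. It only looks at the shape of each subtag and does not consult the ISO or IANA registries; the function name describe_language_tag and the wording of the labels are made up for this example.

```python
def describe_language_tag(tag):
    """Classify the first two subtags of a language tag by length alone."""
    subtags = tag.split("-")
    first = subtags[0].lower()
    notes = []

    if first == "i":
        notes.append("IANA-registered tag")
    elif first == "x":
        notes.append("private extension tag")
    elif len(first) == 2:
        notes.append("ISO 639 language code")
    elif len(first) == 3:
        notes.append("ISO 639-2 language code")
    else:
        notes.append("unrecognized first subtag")

    if len(subtags) > 1:
        second = subtags[1]
        if len(second) == 1:
            notes.append("illegal second subtag")
        elif len(second) == 2:
            notes.append("ISO 3166 country/region")
        elif 3 <= len(second) <= 8:
            notes.append("possibly IANA-registered second subtag")
        else:
            notes.append("second subtag longer than eight characters")
    return notes


print(describe_language_tag("en-GB"))
print(describe_language_tag("sgn-US-MA"))
print(describe_language_tag("i-navajo"))
```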

Configuring Language Preferences

Internationalized URIs
Global Transcribability Versus Meaningful Characters

URI Character Repertoire

Escaping International Characters
Note that escape values should be in the range of US-ASCII codes (0–127).
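
One way to check a URI against this rule is to pull out every %XX escape and test its value. The sketch below assumes a simple regular expression is enough to find the escapes; the helper name escapes_are_ascii is hypothetical.

```python
import re

# A percent sign followed by exactly two hexadecimal digits.
ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def escapes_are_ascii(uri):
    """Return True if every %XX escape in the URI has a value in 0-127."""
    return all(int(hex_digits, 16) < 128 for hex_digits in ESCAPE.findall(uri))


print(escapes_are_ascii("/docs/my%20file.html"))  # True  (0x20 is a space)
print(escapes_are_ascii("/caf%E9"))               # False (0xE9 is above 127)
```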

Modal Switches in URIs

Other Considerations
Headers and Out-of-Spec Data
HTTP headers must consist of characters from the US-ASCII character set.

Dates
Domain Names

Wednesday, May 28, 2014