Wednesday, May 28, 2014

HTTP The Definitive Guide (Entities and Encodings)

Entities and Encodings
In particular, HTTP ensures that its cargo:
  • Can be identified correctly (using Content-Type media formats and Content-Language headers) so browsers and other clients can process the content properly
  • Can be unpacked properly (using Content-Length and Content-Encoding headers) 
  • Is fresh (using entity validators and cache-expiration controls)
  • Meets the user’s needs (based on content-negotiation Accept headers)
  • Moves quickly and efficiently through the network (using range requests, delta encoding, and other data compression)
  • Arrives complete and untampered with (using transfer encoding headers and Content-MD5 checksums)
Messages Are Crates, Entities Are Cargo
HTTP/1.1 defines 10 primary entity header fields:
Content-Type
        The kind of object carried by the entity.
Content-Length
        The length or size of the message being sent.
Content-Location
        An alternate location for the object at the time of the request.
Content-Range
        If this is a partial entity, this header defines which pieces of the whole are included.
Content-MD5
        A checksum of the contents of the entity body.
Last-Modified
        The date on which this particular content was created or modified at the server.
Expires
        The date and time at which this entity data will become stale.
Allow
        What request methods are legal on this resource; e.g., GET and HEAD.
ETag
        A unique validator for this particular instance of the document. The ETag header is not defined formally as an entity header, but it is an important header for many operations involving entities.
Cache-Control
        Directives on how this document can be cached. The Cache-Control header, like the ETag header, is not defined formally as an entity header.

Entity Bodies

  • In Figure 15-2a, the entity body begins at byte number 65, right after the end-of-headers CRLF. The entity body contains the ASCII characters for “Hi! I’m a message!”
  • In Figure 15-2b, the entity body begins at byte number 67. The entity body contains the binary contents of the GIF image. GIF files begin with a 6-byte version signature, a 16-bit width, and a 16-bit height. You can see all three of these directly in the entity body.
Content-Length: The Entity’s Size
The Content-Length header is mandatory for messages with entity bodies, unless the message is transported using chunked encoding.

Detecting Truncation
Older versions of HTTP used connection close to delimit the end of a message. But, without Content-Length, clients cannot distinguish between successful connection close at the end of a message and connection close due to a server crash in the middle of a message. Clients need Content-Length to detect message truncation.

Message truncation is especially severe for caching proxy servers. Caching proxy servers generally do not cache HTTP bodies that don’t have an explicit Content-Length header, to reduce the risk of caching truncated messages.
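A minimal sketch of the truncation check described above, assuming the headers have already been parsed into a dict and the body bytes were read until the connection closed. The function name `is_truncated` is made up for illustration.

```python
def is_truncated(headers: dict, body: bytes) -> bool:
    """Return True if fewer body bytes arrived than Content-Length promised."""
    declared = headers.get("Content-Length")
    if declared is None:
        # Without Content-Length, a normal close and a server crash look identical.
        raise ValueError("cannot detect truncation without Content-Length")
    return len(body) < int(declared)
```

A caching proxy could run a check like this before storing a response and refuse to cache anything for which it raises or returns True.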

Incorrect Content-Length
An incorrect Content-Length can cause even more damage than a missing Content-Length.

Content-Length and Persistent Connections
Content-Length is essential for persistent connections. If the response comes across a persistent connection, another HTTP response can immediately follow the current response. The Content-Length header lets the client know where one message ends and the next begins. Because the connection is persistent, the client cannot use connection close to identify the message’s end. Without a Content-Length header, HTTP applications won’t know where one entity body ends and the next message begins.
As we will see in “Transfer Encoding and Chunked Encoding,” there is one situation where you can use persistent connections without having a Content-Length header: when you use chunked encoding.
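To make the persistent-connection point concrete, here is a rough sketch that splits a byte stream holding several back-to-back responses, using only Content-Length to find each boundary. `split_responses` is a hypothetical helper; it assumes every response is complete and ignores chunked encoding.

```python
def split_responses(stream: bytes) -> list:
    """Split back-to-back responses into (head, body) pairs using Content-Length."""
    messages = []
    while stream:
        # The blank line (CRLF CRLF) ends the headers.
        head, _, rest = stream.partition(b"\r\n\r\n")
        length = 0
        for line in head.split(b"\r\n")[1:]:  # skip the status line
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        messages.append((head, rest[:length]))
        # Whatever follows the body is the start of the next message.
        stream = rest[length:]
    return messages
```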

Content Encoding
If the body has been content-encoded, the Content-Length header specifies the length, in bytes, of the encoded body, not the length of the original, unencoded body.

Rules for Determining Entity Body Length
The rules should be applied in order; the first match applies.

  • If a particular HTTP message type is not allowed to have a body, ignore the Content-Length header for body calculations.
  • If a message contains a Transfer-Encoding header (other than the default HTTP “identity” encoding), the entity will be terminated by a special pattern called a “zero-byte chunk,” unless the message is terminated first by closing the connection.
  • If a message has a Content-Length header (and the message type allows entity bodies), the Content-Length value contains the body length, unless there is a non-identity Transfer-Encoding header. If a message is received with both a Content-Length header field and a non-identity Transfer-Encoding header field, you must ignore the Content-Length, because the transfer encoding will change the way entity bodies are represented and transferred (and probably the number of bytes transmitted).
  • If the message uses the “multipart/byteranges” media type and the entity length is not otherwise specified (in the Content-Length header), each part of the multipart message will specify its own size. This multipart type is the only entity body type that self-delimits its own size, so this media type must not be sent unless the sender knows the recipient can parse it.
  • If none of the above rules match, the entity ends when the connection closes.
  • To be compatible with HTTP/1.0 applications, any HTTP/1.1 request that has an entity body also must include a valid Content-Length header field (unless the server is known to be HTTP/1.1-compliant).
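The first five rules can be sketched as an ordered series of checks. This is only an illustration for responses: the function name and return labels are invented here, and the list of body-less message types (responses to HEAD, 1xx, 204, 304) comes from the HTTP/1.1 specification rather than from these notes.

```python
from typing import Optional, Tuple

def response_body_framing(method: str, status: int, headers: dict) -> Tuple[str, Optional[int]]:
    """Return (how the body is delimited, length if known). First match wins."""
    h = {k.lower(): v for k, v in headers.items()}
    # Rule 1: message types that may not carry a body; Content-Length is ignored.
    if method == "HEAD" or 100 <= status < 200 or status in (204, 304):
        return ("none", 0)
    # Rule 2: a non-identity Transfer-Encoding overrides any Content-Length.
    if h.get("transfer-encoding", "identity").strip().lower() != "identity":
        return ("chunked", None)
    # Rule 3: otherwise Content-Length gives the body length.
    if "content-length" in h:
        return ("content-length", int(h["content-length"]))
    # Rule 4: multipart/byteranges parts delimit themselves.
    if h.get("content-type", "").lower().startswith("multipart/byteranges"):
        return ("multipart", None)
    # Rule 5: nothing else applies, so the body ends when the connection closes.
    return ("close", None)
```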
Entity Digests
The Content-MD5 header is used by servers to send the result of running the MD5 algorithm on the entity body. The Content-MD5 header contains the MD5 of the content after all content encodings have been applied to the entity body and before any transfer encodings have been applied to it.
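A small sketch of computing the header value. Content-MD5 carries the base64 encoding of the 128-bit MD5 digest; the input here is assumed to be the body after content encoding and before transfer encoding, as described above. `content_md5` is an illustrative name.

```python
import base64
import hashlib

def content_md5(content_encoded_body: bytes) -> str:
    """Return the Content-MD5 value: base64 of the raw MD5 digest."""
    digest = hashlib.md5(content_encoded_body).digest()
    return base64.b64encode(digest).decode("ascii")
```

A receiver would recompute the value over the body it received (after removing any transfer encoding) and compare it with the header.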

Media Type and Charset
The Content-Type header field describes the MIME type of the entity body.
If the entity has gone through content encoding, for example, the Content-Type header will still specify the entity body type before the encoding.

Character Encodings for Text Media
        Content-Type: text/html; charset=iso-8859-4
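As an illustration, the charset parameter can be pulled out of a Content-Type value with Python's standard email package and then used to decode the body bytes. The helper name and the iso-8859-1 fallback are assumptions made for this sketch.

```python
from email.message import Message

def charset_of(content_type: str, default: str = "iso-8859-1") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_param returns the fallback when the parameter is absent.
    return msg.get_param("charset", default).lower()

def decode_body(content_type: str, body: bytes) -> str:
    """Turn entity body bytes into text using the declared charset."""
    return body.decode(charset_of(content_type))
```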

Multipart Media Types
MIME “multipart” email messages contain multiple messages stuck together and sent as a single, complex message. Each component is self-contained, with its own set of headers describing its content; the different components are concatenated together and delimited by a string. HTTP also supports multipart bodies; however, they typically are sent in only one of two situations: in fill-in form submissions and in range responses carrying pieces of a document.

Multipart Form Submissions
When an HTTP fill-in form is submitted, variable-length text fields and uploaded objects are sent as separate parts of a multipart body, allowing forms to be filled out with values of different types and lengths.
Example:

1.
Content-Type: multipart/form-data; boundary=AaB03x

--AaB03x
Content-Disposition: form-data; name="submit-name"

Sally
--AaB03x
Content-Disposition: form-data; name="files"; filename="essayfile.txt"
Content-Type: text/plain

...contents of essayfile.txt...
--AaB03x--

2.
Content-Type: multipart/form-data; boundary=AaB03x

--AaB03x
Content-Disposition: form-data; name="submit-name"

Sally
--AaB03x
Content-Disposition: form-data; name="files"
Content-Type: multipart/mixed; boundary=BbC04y

--BbC04y
Content-Disposition: file; filename="essayfile.txt"
Content-Type: text/plain

...contents of essayfile.txt...
--BbC04y
Content-Disposition: file; filename="imagefile.gif"
Content-Type: image/gif
Content-Transfer-Encoding: binary

...contents of imagefile.gif...
--BbC04y--
--AaB03x--
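One way to take a multipart/form-data body apart is to reuse Python's MIME parser from the standard email package. The sketch below is a simplification: `parse_form_data` is a made-up name, nested multipart/mixed parts (as in the second example) are not expanded, and each part is assumed to have a blank line between its headers and its content.

```python
import email

def parse_form_data(content_type: str, body: bytes) -> dict:
    """Map each form field name to its raw bytes."""
    # Prepend the Content-Type header so the parser sees a complete MIME message.
    raw = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n" + body
    message = email.message_from_bytes(raw)
    fields = {}
    for part in message.get_payload():
        name = part.get_param("name", header="content-disposition")
        # For a nested multipart part, get_payload(decode=True) returns None.
        fields[name] = part.get_payload(decode=True)
    return fields
```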

Multipart Range Responses
HTTP responses to range requests also can be multipart. Such responses come with a Content-Type: multipart/byteranges header and a multipart body with the different ranges.

Example:

HTTP/1.0 206 Partial content
Server: Microsoft-IIS/5.0
Date: Sun, 10 Dec 2000 19:11:20 GMT
Content-Location: http://www.joes-hardware.com/gettysburg.txt
Content-Type: multipart/x-byteranges; boundary=--[abcdefghijklmnopqrstuvwxyz]--
Last-Modified: Sat, 09 Dec 2000 00:38:47 GMT

--[abcdefghijklmnopqrstuvwxyz]--
Content-Type: text/plain
Content-Range: bytes 0-174/1441

Fourscore and seven years ago our fathers brought forth on this continent
a new nation, conceived in liberty and dedicated to the proposition that
all men are created equal.
--[abcdefghijklmnopqrstuvwxyz]--
Content-Type: text/plain
Content-Range: bytes 552-761/1441

But in a larger sense, we can not dedicate, we can not consecrate,
we can not hallow this ground. The brave men, living and dead who
struggled here have consecrated it far above our poor power to add
or detract.
--[abcdefghijklmnopqrstuvwxyz]--
Content-Type: text/plain
Content-Range: bytes 1344-1440/1441

and that government of the people, by the people, for the people shall
not perish from the earth.
--[abcdefghijklmnopqrstuvwxyz]--

Content Encoding
The Content-Encoding Process
The content-encoding process is:

  • A web server generates an original response message, with original Content-Type and Content-Length headers.
  • A content-encoding server (perhaps the origin server or a downstream proxy) creates an encoded message. The encoded message has the same Content-Type but (if, for example, the body is compressed) a different Content-Length. The content-encoding server adds a Content-Encoding header to the encoded message, so that a receiving application can decode it.
  • A receiving program gets the encoded message, decodes it, and obtains the original.
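The three steps can be sketched with gzip as the content encoding. The function names are illustrative, and headers are modeled as a plain dict. Note that Content-Type is left alone while Content-Length is recomputed from the encoded bytes.

```python
import gzip

def content_encode(headers: dict, body: bytes):
    """Step 2: produce the encoded message from the original one."""
    encoded = gzip.compress(body)
    new_headers = dict(headers)                        # Content-Type is unchanged
    new_headers["Content-Encoding"] = "gzip"           # tells the receiver how to decode
    new_headers["Content-Length"] = str(len(encoded))  # length of the encoded body
    return new_headers, encoded

def content_decode(headers: dict, body: bytes) -> bytes:
    """Step 3: recover the original body at the receiver."""
    if headers.get("Content-Encoding", "").lower() == "gzip":
        return gzip.decompress(body)
    return body
```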

Content-Encoding Types
Accept-Encoding Headers
The Accept-Encoding field contains a comma-separated list of supported encodings.
Here are a few examples:
        Accept-Encoding: compress, gzip
        Accept-Encoding:
        Accept-Encoding: *
        Accept-Encoding: compress;q=0.5, gzip;q=1.0
        Accept-Encoding: gzip;q=1.0, identity; q=0.5, *;q=0
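A rough sketch of how a server might read these values. Both function names are hypothetical, and the selection rule is simplified: an encoding not listed gets the `*` value if present, identity defaults to 1.0, and ties go to the earliest entry in the server's own list.

```python
def parse_accept_encoding(value: str) -> dict:
    """Parse an Accept-Encoding value into {encoding: q-value}."""
    prefs = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        coding, _, params = item.partition(";")
        params = params.strip()
        # A missing q parameter means q=1.0.
        q = float(params[2:]) if params.lower().startswith("q=") else 1.0
        prefs[coding.strip().lower()] = q
    return prefs

def choose_encoding(value: str, supported: list):
    """Pick the supported encoding the client likes best, or None."""
    prefs = parse_accept_encoding(value)

    def q(coding):
        if coding in prefs:
            return prefs[coding]
        if "*" in prefs:
            return prefs["*"]
        return 1.0 if coding == "identity" else 0.0

    best = max(supported, key=q)
    return best if q(best) > 0 else None
```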

Transfer Encoding and Chunked Encoding
Transfer encodings also are reversible transformations performed on the entity body, but they are applied for architectural reasons and are independent of the format of the content. You apply a transfer encoding to a message to change the way message data is transferred across the network.
Safe Transport
In HTTP, there are only a few reasons why transporting message bodies can cause trouble. Two of these are:
  • Unknown size
  • Security

Transfer-Encoding Headers
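Transfer-Encoding
        Tells the receiver what encoding has been performed on the message in order for it to be safely transported.
TE
        Used in the request header to tell the server what extension transfer encodings are okay to use.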
All transfer-encoding values are case-insensitive. HTTP/1.1 uses transfer-encoding values in the TE header field and in the Transfer-Encoding header field. The latest HTTP specification defines only one transfer encoding, chunked encoding.

Chunked Encoding
Chunked encoding breaks messages into chunks of known size. Each chunk is sent one after another, eliminating the need for the size of the full message to be known before it is sent.

Chunking and persistent connections
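When a server generates a body dynamically, it does not know the body’s size before it starts sending, so it cannot write a Content-Length header up front; on a persistent connection, it also cannot signal the end of the body by closing the connection. Chunked encoding provides a solution for this dilemma, by allowing servers to send the body in chunks, specifying only the size of each chunk. As the body is dynamically generated, a server can buffer up a portion of it, send its size and the chunk, and then repeat the process until the full body has been sent. The server can signal the end of the body with a chunk of size 0 and still keep the connection open and ready for the next response.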
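A minimal sketch of the wire format: each chunk is its size in hexadecimal, CRLF, the data, CRLF, and a chunk of size 0 ends the body. The function names are invented here, and the decoder ignores chunk extensions and trailers.

```python
def encode_chunked(pieces) -> bytes:
    """Encode an iterable of byte strings as a chunked body."""
    out = bytearray()
    for piece in pieces:
        if piece:  # an empty piece would look like the terminating chunk
            out += b"%x\r\n" % len(piece) + piece + b"\r\n"
    out += b"0\r\n\r\n"  # zero-size chunk, then the blank line ending the body
    return bytes(out)

def decode_chunked(data: bytes) -> bytes:
    """Reassemble the body from a complete chunked message body."""
    body = bytearray()
    pos = 0
    while True:
        eol = data.index(b"\r\n", pos)
        size = int(data[pos:eol].split(b";")[0], 16)  # drop any chunk extension
        if size == 0:
            return bytes(body)
        start = eol + 2
        body += data[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data
```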

Trailers in chunked messages
Any of the HTTP headers can be sent as trailers, except for the Transfer-Encoding, Trailer, and Content-Length headers.

Combining Content and Transfer Encodings
Transfer-Encoding Rules

  • The set of transfer encodings must include “chunked.” The only exception is if the message is terminated by closing the connection.
  • When the chunked transfer encoding is used, it is required to be the last transfer encoding applied to the message body.
  • The chunked transfer encoding must not be applied to a message body more than once.
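These three rules are easy to check mechanically. The sketch below assumes the Transfer-Encoding value lists encodings in the order they were applied; the function name and the problem strings are made up for illustration.

```python
def check_transfer_encodings(header_value: str, closes_connection: bool = False) -> list:
    """Return a list of rule violations for a Transfer-Encoding value."""
    codings = [c.strip().lower() for c in header_value.split(",") if c.strip()]
    problems = []
    # Rule 1: chunked is required unless the message ends by closing the connection.
    if "chunked" not in codings and not closes_connection:
        problems.append("chunked is missing")
    # Rule 2: chunked must be the last encoding applied.
    if "chunked" in codings and codings[-1] != "chunked":
        problems.append("chunked must be last")
    # Rule 3: chunked must not be applied more than once.
    if codings.count("chunked") > 1:
        problems.append("chunked applied more than once")
    return problems
```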
Time-Varying Instances
The HTTP protocol specifies operations for a class of requests and responses, called instance manipulations, that operate on instances of an object. The two main instance-manipulation methods are range requests and delta encoding.
Validators and Freshness

Freshness
Servers tell clients how long content can be cached and still considered fresh using one of two headers: Expires and Cache-Control.
Expires: Sun, 18 Mar 2001 23:59:59 GMT

The Cache-Control header actually is very powerful. It can be used by both servers and clients to describe freshness using more directives than just specifying an age or expiration time.
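A simplified freshness check, assuming a cache that recorded when it received the response. It looks only at the max-age directive and the Expires header, with max-age taking precedence, and ignores the Age header and every other directive. `is_fresh` is a hypothetical helper, and both timestamps must be timezone-aware.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def is_fresh(headers: dict, response_time: datetime, now: datetime) -> bool:
    """Return True if a cached response may still be served without revalidation."""
    for directive in headers.get("Cache-Control", "").split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age":
            # max-age is a lifetime in seconds, relative to when the response arrived.
            return (now - response_time).total_seconds() < int(value)
    if "Expires" in headers:
        # Expires is an absolute HTTP date.
        return now < parsedate_to_datetime(headers["Expires"])
    return False  # no explicit freshness information
```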
Conditionals and Validators
        GET /announce.html HTTP/1.0
        If-Modified-Since: Sat, 29 Jun 2002 14:30:00 GMT
The If-Modified-Since conditional header tests the last-modified date of a document instance, so we say that the last-modified date is the validator. The If-None-Match conditional header tests the ETag value of a document, which is a special keyword or version-identifying tag associated with the entity. Last-Modified and ETag are the two primary validators used by HTTP.

HTTP groups validators into two classes: weak validators and strong validators.
The last-modified time is considered a weak validator because, although it specifies the time at which the resource was last modified, it specifies that time to an accuracy of at most one second.
The ETag header is considered a strong validator, because the server can place a distinct value in the ETag header every time a value changes.
The server might advertise a “weak” entity tag by prefixing the tag with “W/”.
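The sketch below shows how a server might evaluate the two conditional headers for a GET and how weak and strong entity tags compare. All names are illustrative. Splitting If-None-Match on commas is a simplification, and giving If-None-Match priority over If-Modified-Since follows the HTTP specification rather than these notes.

```python
from email.utils import parsedate_to_datetime

def strong_match(a: str, b: str) -> bool:
    """Strong comparison: identical tags, and neither is marked weak."""
    return not a.startswith("W/") and not b.startswith("W/") and a == b

def weak_match(a: str, b: str) -> bool:
    """Weak comparison: tags match once any W/ prefix is removed."""
    strip = lambda tag: tag[2:] if tag.startswith("W/") else tag
    return strip(a) == strip(b)

def not_modified(request_headers: dict, etag: str, last_modified: str) -> bool:
    """Return True if the server can answer 304 Not Modified."""
    inm = request_headers.get("If-None-Match")
    if inm is not None:
        candidates = [t.strip() for t in inm.split(",")]
        return "*" in candidates or any(weak_match(t, etag) for t in candidates)
    ims = request_headers.get("If-Modified-Since")
    if ims is not None:
        # Unchanged if the document is no newer than the client's copy.
        return parsedate_to_datetime(last_modified) <= parsedate_to_datetime(ims)
    return False
```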

Range Requests
HTTP goes further: it allows clients to actually request just part or a range of a document.
Example:

GET /bigfile.html HTTP/1.1
Host: www.joes-hardware.com
Range: bytes=4000-
User-Agent: Mozilla/4.61 [en] (WinNT; I)
In the case where clients request multiple ranges in a single request, responses come back as a single entity, with a multipart body and a Content-Type: multipart/byteranges header.
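As an illustration of how a server could interpret the Range header, the sketch below turns a byte-range value into inclusive (first, last) offsets for a document of known size. `resolve_range` is an invented name; ranges that cannot be satisfied are dropped, and malformed input is not handled gracefully.

```python
def resolve_range(header_value: str, size: int) -> list:
    """Resolve 'bytes=...' into a list of inclusive (first, last) byte offsets."""
    unit, _, spec = header_value.partition("=")
    if unit.strip().lower() != "bytes":
        raise ValueError("only byte ranges are supported in this sketch")
    ranges = []
    for part in spec.split(","):
        first, _, last = part.strip().partition("-")
        if first == "":
            # Suffix form "-N": the final N bytes of the document.
            start, end = max(size - int(last), 0), size - 1
        else:
            start = int(first)
            # Open-ended form "N-" runs to the end of the document.
            end = min(int(last), size - 1) if last else size - 1
        if start <= end:
            ranges.append((start, end))
    return ranges
```

A single resolved range would be returned as an ordinary 206 response; more than one would go into a multipart body.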

Servers can advertise to clients that they accept ranges by including the header Accept-Ranges in their responses.

HTTP/1.1 200 OK
Date: Fri, 05 Nov 1999 22:35:15 GMT
Server: Apache/1.2.4
Accept-Ranges: bytes
A client’s range request makes sense only if the client and server have the same version of a document.
Delta Encoding
Delta encoding is an extension to the HTTP protocol that optimizes transfers by communicating changes instead of entire objects. Delta encoding is a type of instance manipulation, because it relies on clients and servers exchanging information about particular instances of an object.

A-IM is short for Accept-Instance-Manipulation.

Instance Manipulations, Delta Generators, and Delta Appliers


The Unix diff -e algorithm does a line-by-line comparison of files. This obviously is okay for text files but breaks down for binary files. The vcdiff algorithm is more powerful, working even for non-text files and generally producing smaller deltas than diff -e.

A server supporting delta encoding must keep all the different copies of that page as it changes over time, in order to figure out what’s changed between any requesting client’s copy and the latest copy.
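To illustrate why the server has to keep old copies, here is a toy store keyed by ETag that produces a delta from whatever version the client holds to the latest one. The class and method names are invented, and Python's difflib unified diff is used as a stand-in: it is a line-based text diff, not the diff -e or vcdiff format.

```python
import difflib

class DeltaStore:
    """Keeps every published version so deltas can be computed later."""

    def __init__(self):
        self.versions = {}   # etag -> document text
        self.latest = None   # etag of the newest version

    def publish(self, etag: str, text: str) -> None:
        self.versions[etag] = text
        self.latest = etag

    def delta_from(self, client_etag: str):
        """Return diff lines from the client's version to the latest, or None."""
        if client_etag not in self.versions or self.latest is None:
            return None  # unknown base version: the full document must be sent
        old = self.versions[client_etag].splitlines()
        new = self.versions[self.latest].splitlines()
        return list(difflib.unified_diff(old, new, lineterm=""))
```

If the store had discarded the client's version, `delta_from` could only return None, which is the cost described above.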