Monday, June 9, 2014

HTTP The Definitive Guide (Logging and Usage Tracking)

Logging and Usage Tracking
What to Log?
For the most part, logging is done for two reasons: to look for problems on the server or proxy (e.g., which requests are failing), and to generate statistics about how web sites are accessed.

A few examples of commonly logged fields are:

  • HTTP method
  • HTTP version of client and server
  • URL of the requested resource
  • HTTP status code of the response
  • Size of the request and response messages (including any entity bodies)
  • Timestamp of when the transaction occurred
  • Referer and User-Agent header values
Log Formats
Common Log Format


Combined Log Format

The Combined Log Format is very similar to the Common Log Format; in fact, it mirrors it exactly, with the addition of two fields: the Referer and User-Agent request headers.
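As a rough illustration of how these two formats relate, here is a minimal Python sketch that parses one log line. It assumes the usual field order (remote host, ident, authenticated user, timestamp, request line, status, size) followed by the optional Referer and User-Agent fields. The names COMBINED_LOG_PATTERN and parse_log_line, and the sample entry, are made up for this example.

```python
import re

# Common Log Format fields, followed by the two optional Combined Log Format
# additions (Referer and User-Agent). The pattern is an illustrative sketch.
COMBINED_LOG_PATTERN = re.compile(
    r'(?P<remotehost>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)")?'
)


def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = COMBINED_LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A "-" in the size field means no body bytes were logged.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


# Made-up sample entry, for illustration only.
SAMPLE_LINE = (
    '209.1.32.44 - - [03/Oct/1999:14:16:00 -0400] '
    '"GET / HTTP/1.0" 200 1024 '
    '"http://www.example.com/start.html" "Mozilla/4.0"'
)
```

A plain Common Log Format line parses with the same function; the referer and user_agent entries are then None.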

Netscape Extended Log Format
The first seven fields in the Netscape Extended Log Format are identical to those in the Common Log Format (see Table 21-1). Table 21-3 lists, in order, the new fields that the Netscape Extended Log Format introduces.


Netscape Extended 2 Log Format

The Netscape Extended 2 Log Format derives from the Netscape Extended Log Format, and its initial fields are identical to those listed in Table 21-3.
Table 21-4 lists, in order, the additional fields of the Netscape Extended 2 Log Format.


Squid Proxy Log Format


Hit Metering
The Hit Metering protocol requires caches to periodically report cache access statistics to origin servers.

Overview

The Meter Header


A Word on Privacy


























Sunday, June 8, 2014

HTTP The Definitive Guide (Redirection and Load Balancing)

Redirection and Load Balancing
In this chapter, we’ll take a look at the following redirection techniques, how they work, and what their load-balancing capabilities are (if any):

  • HTTP redirection
  • DNS redirection
  • Anycast routing
  • Policy routing
  • IP MAC forwarding
  • IP address forwarding
  • The Web Cache Coordination Protocol (WCCP)
  • The Internet Cache Protocol (ICP)
  • The Hyper Text Caching Protocol (HTCP)
  • The Network Element Control Protocol (NECP)
  • The Cache Array Routing Protocol (CARP)
  • The Web Proxy Autodiscovery Protocol (WPAD)
Where to Redirect

Servers, proxies, caches, and gateways all appear to clients as servers, in the sense that a client sends them an HTTP request, and they process it. Many redirection techniques work for servers, proxies, caches, and gateways because of their common, server-like traits.
Web servers handle requests on a per-IP basis.
Proxies tend to handle requests on a per-protocol basis.


Overview of Redirection Protocols
The direction that an HTTP message takes on its way through the Internet is affected by the HTTP applications and routing devices it passes from, through, and toward. For example:

  • The browser application that creates the client’s message could be configured to send it to a proxy server.
  • DNS resolvers choose the IP address that is used for addressing the message. This IP address can be different for different clients in different geographical locations.
  • As the message passes through networks, it is divided into addressed packets; switches and routers examine the TCP/IP addressing on the packets and make decisions about routing the packets on that basis.
  • Web servers can bounce requests back to different web servers with HTTP redirects.
Table 20-1 summarizes the redirection methods used to redirect messages to servers, each of which is discussed later in this chapter.

Table 20-2 summarizes the redirection methods used to redirect messages to proxy servers.

General Redirection Methods

HTTP Redirection
DNS Redirection
DNS allows several IP addresses to be associated with a single domain, and DNS resolvers can be configured or programmed to return varying IP addresses.

DNS round robin
DNS round robin uses a feature of DNS hostname resolution to balance load across a farm of web servers. It is a pure load-balancing strategy, and it does not take into account any factors about the location of the client relative to the server or the current stress on the server.
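The rotation idea can be sketched in a few lines of Python. This is a toy model, not a real DNS server: the class name RoundRobinResolver and the addresses are invented for the example, and it assumes that clients simply use the first address in each answer.

```python
from collections import deque


class RoundRobinResolver:
    """Toy resolver that rotates its address list after every lookup."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def resolve(self):
        answer = list(self._addresses)
        # Rotate left so the next lookup starts with a different address.
        self._addresses.rotate(-1)
        return answer


# Hypothetical server farm addresses.
resolver = RoundRobinResolver(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
# Each client takes the first address of the answer it receives.
first_choices = [resolver.resolve()[0] for _ in range(4)]
```

Because the rotation ignores server load and client location, successive clients are spread evenly across the farm and nothing more.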

Multiple addresses and round-robin address rotation
DNS round robin for load balancing

The impact of DNS caching

Other DNS-based redirection algorithms

  • Load-balancing algorithms - Some DNS servers keep track of the load on the web servers and place the least-loaded web servers at the front of the list.
  • Proximity-routing algorithms - DNS servers can attempt to direct users to nearby web servers, when the farm of web servers is geographically dispersed.
  • Fault-masking algorithms - DNS servers can monitor the health of the network and route requests away from service interruptions or other faults.

Anycast Addressing
In anycast addressing, several geographically dispersed web servers have the exact same IP address and rely on the “shortest-path” routing capabilities of backbone routers to send client requests to the server nearest to the client.
IP MAC Forwarding

Because MAC address forwarding is point-to-point only, the server or proxy has to be located one hop away from the switch.

IP Address Forwarding
In IP address forwarding, a switch or other layer 4–aware device examines TCP/IP addressing on incoming packets and routes packets accordingly by changing the destination IP address, instead of the destination MAC address.
This type of forwarding is also called Network Address Translation (NAT).

Two ways to control the return path of the response are:
  • Change the source IP address of the packet to the IP address of the switch. This is called full NAT: the IP forwarding device translates both the destination and source IP addresses.
  • Leave the source IP address as the client’s IP address, and make sure (from a hardware perspective) that no routes exist directly from server to client (bypassing the switch). This is sometimes called half NAT.
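The difference between the two approaches comes down to which address fields the forwarding device rewrites. The following Python sketch models only that rewrite; the Packet type, the function names, and all addresses are hypothetical, and real devices also track ports and connection state.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str


def half_nat(packet, server_ip):
    """Rewrite only the destination; the server still sees the client's address."""
    return replace(packet, dst_ip=server_ip)


def full_nat(packet, server_ip, switch_ip):
    """Rewrite destination and source, so replies go back to the switch."""
    return replace(packet, src_ip=switch_ip, dst_ip=server_ip)


# Hypothetical addresses: a client, the switch's virtual IP, and a real server.
incoming = Packet(src_ip="198.51.100.7", dst_ip="203.0.113.1")
to_server_half = half_nat(incoming, server_ip="10.0.0.5")
to_server_full = full_nat(incoming, server_ip="10.0.0.5", switch_ip="10.0.0.1")
```

With half NAT the server replies to the client's address directly, which is why the network must be arranged so that the reply still passes through the switch.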

Network Element Control Protocol
The Network Element Control Protocol (NECP) allows network elements (NEs), devices such as routers and switches that forward IP packets, to talk with server elements (SEs), devices such as web servers and proxy caches that serve application layer requests.

Messages
Proxy Redirection Methods
Explicit Browser Configuration
Proxy Auto-configuration

Web Proxy Autodiscovery Protocol
PAC file autodiscovery
An HTTP client that implements the WPAD protocol:

  • Uses WPAD to find the configuration URL (CURL) of the PAC file
  • Fetches the PAC file (a.k.a. configuration file, or CFILE) corresponding to the CURL
  • Executes the PAC file to determine the proxy server
  • Sends HTTP requests to the proxy server returned by the PAC file


WPAD algorithm
The current WPAD specification defines the following techniques, in order:

  • DHCP (Dynamic Host Configuration Protocol)
  • SLP (Service Location Protocol)
  • DNS well-known hostnames
  • DNS SRV records
  • DNS service URLs in TXT records
Of these five mechanisms, only the DHCP and DNS well-known hostname techniques are required for WPAD clients.

Consider a client with hostname johns-desktop.development.foo.com. This is the sequence of discovery attempts a complete WPAD client would perform:

  • DHCP
  • SLP
  • DNS A lookup on “QNAME=wpad.development.foo.com”
  • DNS SRV lookup on “QNAME=wpad.development.foo.com”
  • DNS TXT lookup on “QNAME=wpad.development.foo.com”
  • DNS A lookup on “QNAME=wpad.foo.com”
  • DNS SRV lookup on “QNAME=wpad.foo.com”
  • DNS TXT lookup on “QNAME=wpad.foo.com”
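The DNS part of this sequence can be generated mechanically by stripping one leading label at a time. The Python sketch below reproduces the six DNS lookups listed above; the function name wpad_dns_lookups and the min_labels stopping rule are assumptions made for this example, not part of the WPAD specification.

```python
def wpad_dns_lookups(hostname, min_labels=2):
    """List the (record type, QNAME) pairs a client would try, in order.

    min_labels is an assumed stopping rule: stop once fewer than that many
    labels remain, so the client never queries a bare top-level domain.
    """
    labels = hostname.split(".")[1:]  # drop the host's own label
    lookups = []
    while len(labels) >= min_labels:
        qname = "wpad." + ".".join(labels)
        for record_type in ("A", "SRV", "TXT"):
            lookups.append((record_type, qname))
        labels = labels[1:]  # move up one level in the domain hierarchy
    return lookups


lookups = wpad_dns_lookups("johns-desktop.development.foo.com")
```

The DHCP and SLP attempts that precede these lookups are not modeled here.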
CURL discovery using DHCP
DNS A record lookup
Retrieving the PAC file
Once a candidate CURL is created, the WPAD client usually makes a GET request to the CURL. When making requests, WPAD clients are required to send Accept headers with appropriate CFILE format information that they are capable of handling.
For example:
Accept: application/x-ns-proxy-autoconfig
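A request like this could be built with Python's standard urllib.request module. This sketch only constructs the request object and does not send it; the function name build_pac_request and the CURL shown are assumptions for illustration.

```python
import urllib.request


def build_pac_request(curl):
    """Build (but do not send) a GET request for the PAC file at the CURL."""
    return urllib.request.Request(
        curl,
        headers={"Accept": "application/x-ns-proxy-autoconfig"},
        method="GET",
    )


# Hypothetical CURL; a real client would use whatever WPAD discovered.
request = build_pac_request("http://wpad.foo.com/wpad.dat")
```

Passing the request to urllib.request.urlopen would perform the actual fetch.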

When to execute WPAD
The web proxy autodiscovery process is required to occur at least as frequently as one of the following:

  • Upon startup of the web client—WPAD is performed only for the start of the first instance. Subsequent instances inherit the settings.
  • Whenever there is an indication from the networking stack that the IP address of the client host has changed.
WPAD spoofing
Timeouts

Administrator considerations
Administrators should configure at least one of the DHCP or DNS A record lookup methods in their environments, as those are the only two that all compatible clients are required to implement.

Cache Redirection Methods
WCCP Redirection
Cisco Systems developed the Web Cache Coordination Protocol (WCCP) to enable routers to redirect web traffic to proxy caches.

How WCCP redirection works

Start with a network containing WCCP-enabled routers and caches that can communicate with one another.

  • A set of routers and their target caches form a WCCP service group. The configuration of the service group specifies what traffic is sent where, how traffic is sent, and how load should be balanced among the caches in the service group.
  • If the service group is configured to redirect HTTP traffic, routers in the service group send HTTP requests to caches in the service group.
  • When an HTTP request arrives at a router in the service group, the router chooses one of the caches in the service group to serve the request (based on either a hash on the request’s IP address or a mask/value set pairing scheme).
  • The router sends the request packets to the cache, either by encapsulating the packets with the cache’s IP address or by IP MAC forwarding.
  • If the cache cannot serve the request, the packets are returned to the router for normal forwarding.
  • The members of the service group exchange heartbeat messages with one another, continually verifying one another’s availability.
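The hash-based selection step can be pictured with a small Python sketch. It is a deliberately simplified model: the bucket count, the XOR hash, and the names assign_buckets and choose_cache are assumptions for this example and are not the actual WCCP hash function or assignment messages.

```python
import ipaddress

NUM_BUCKETS = 256  # assumed bucket count for this sketch


def assign_buckets(caches):
    """Spread the hash buckets evenly across the caches in the service group."""
    return [caches[i % len(caches)] for i in range(NUM_BUCKETS)]


def choose_cache(dest_ip, bucket_table):
    """Pick a cache from a toy hash of the request's destination IP address."""
    value = 0
    # Toy hash: XOR the four address bytes together (not the real WCCP hash).
    for byte in ipaddress.IPv4Address(dest_ip).packed:
        value ^= byte
    return bucket_table[value % len(bucket_table)]


# Hypothetical cache names.
table = assign_buckets(["cache-a", "cache-b", "cache-c"])
chosen = choose_cache("192.0.2.10", table)
```

Because the choice depends only on the address and the bucket table, requests for the same destination keep landing on the same cache until the assignment changes.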
WCCP2 messages
Message components
Each WCCP2 message consists of a header and components. The WCCP header information contains the message type (Here I Am, I See You, Assignment, or Removal Query), WCCP version, and message length (not including the length of the header).


Service groups
A service group consists of a set of WCCP-enabled routers and caches that exchange WCCP messages.

GRE packet encapsulation
Routers that support WCCP redirect HTTP packets to a particular server by encapsulating them with the server’s IP address. The packet encapsulation also contains an IP header proto field that indicates Generic Routing Encapsulation (GRE).
WCCP load balancing

Internet Cache Protocol
The Internet Cache Protocol (ICP) allows caches to look for content hits in sibling caches.
ICP can be thought of as a cache clustering protocol.



Cache Array Routing Protocol
The Cache Array Routing Protocol (CARP) is a standard proposed by Microsoft Corporation and Netscape Communications Corporation to administer a collection of proxy servers such that an array of proxy servers appears to clients as one logical cache.
In contrast to a group of standalone caches linked by ICP, the collection of servers connected using CARP operates as a single, large server, with each component server containing only a fraction of the total cached documents.
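One way to picture how each document ends up on exactly one proxy is to score every proxy against the requested URL and pick the highest score. The Python sketch below does this with SHA-256; this is a stand-in chosen for the example, not the hash function that CARP itself defines, and the names score and choose_proxy and the proxy hostnames are hypothetical.

```python
import hashlib


def score(proxy, url):
    """Combine a proxy name and a URL into a repeatable numeric score."""
    digest = hashlib.sha256((proxy + "|" + url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def choose_proxy(proxies, url):
    """Return the proxy with the highest score for this URL."""
    return max(proxies, key=lambda proxy: score(proxy, url))


# Hypothetical proxy array.
proxies = ["proxy1.example.com", "proxy2.example.com", "proxy3.example.com"]
owner = choose_proxy(proxies, "http://www.example.com/index.html")
```

Every client that knows the same proxy list computes the same owner for a URL, so no inter-proxy query is needed, and removing one proxy only remaps the URLs that proxy owned.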
Hyper Text Caching Protocol
The difference between an ICP and an HTCP transaction is in the level of detail in the requests and responses.

HTCP Authentication
Setting Caching Policies





























HTTP The Definitive Guide (Publishing Systems)

Publishing Systems
FrontPage Server Extensions for Publishing Support
FrontPage Server Extensions
Our primary interest lies in the publishing protocol between the FP clients and FPSE. This protocol provides an example of designing extensions to the core services available in HTTP without changing HTTP semantics.

FrontPage Vocabulary

  • Virtual server - One of several web sites running on the same server. A web server that supports virtual servers is called a multi-hosting web server. A machine that is configured with multiple IP addresses is called a multi-homed server.
  • Root web - The default, top-level content directory of a web server, or, in a multi-hosting environment, the top-level content directory of a virtual web server.
  • Subweb - A named subdirectory of the root web or another subweb that is a complete FPSE extended web.

The FrontPage RPC Protocol
FrontPage Security Model

WebDAV and Collaborative Authoring
Web Distributed Authoring and Versioning (WebDAV) adds an extra dimension to web publishing—collaboration.
WebDAV Methods

  • PROPFIND - Retrieves the properties of a resource.
  • PROPPATCH - Sets one or more properties on one or many resources.
  • MKCOL - Creates collections.
  • COPY - Copies a resource or a collection of resources from a given source to a given destination. The destination need not be on the same machine.
  • MOVE - Moves a resource or a collection of resources from a given source to a given destination. The destination need not be on the same machine.
  • LOCK - Locks a resource or multiple resources.
  • UNLOCK - Unlocks a previously locked resource.
WebDAV and XML

XML provides WebDAV with:
  • A method of formatting instructions describing how data is to be handled
  • A method of formatting complex responses from the server
  • A method of communicating customized information about the collections and resources handled
  • A flexible vehicle for the data itself
  • A robust solution for most of the internationalization issues
WebDAV Headers

  • DAV
  • Depth
  • Destination
  • If
  • Lock-Token
  • Overwrite
  • Timeout
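To show how a method, a header, and an XML body fit together, here is a small Python sketch that builds a PROPFIND request with a Depth header and an allprop body. It only constructs the request and does not send it; the function name build_propfind and the URL are assumptions for this example.

```python
import urllib.request

# XML body asking the server for all properties of the resource.
PROPFIND_BODY = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<D:propfind xmlns:D="DAV:">\n'
    '  <D:allprop/>\n'
    '</D:propfind>\n'
).encode("utf-8")


def build_propfind(url, depth="1"):
    """Build (but do not send) a PROPFIND request for the given URL."""
    return urllib.request.Request(
        url,
        data=PROPFIND_BODY,
        headers={
            "Depth": depth,  # "0", "1", or "infinity"
            "Content-Type": 'text/xml; charset="utf-8"',
        },
        method="PROPFIND",
    )


# Hypothetical collection URL.
request = build_propfind("http://www.example.com/docs/")
```

A Depth of 1 asks for the collection and its immediate members; 0 would cover the collection alone.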
WebDAV Locking and Overwrite Prevention


WebDAV supports two types of locks:
  • Exclusive write locking of a resource or a collection - An exclusive write lock guarantees write privileges only to the lock owner. This type of locking completely eliminates potential conflicts.
  • Shared write locking of a resource or a collection - A shared write lock allows a group of people to work on a given document. This type of locking works well in an environment where all the authors are aware of each other’s activities.
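The two lock types can be compared with a toy in-memory lock table. This Python sketch is only a model of the rules described above: the class LockTable and its methods are invented for the example, and it leaves out lock tokens, timeouts, and collections, which a real WebDAV server has to handle.

```python
class LockTable:
    """Toy model of exclusive and shared write locks on resources."""

    def __init__(self):
        self._locks = {}  # resource -> (scope, set of owners)

    def lock(self, resource, owner, scope):
        """Try to take a lock; return True on success, False on conflict."""
        if scope not in ("exclusive", "shared"):
            raise ValueError("scope must be 'exclusive' or 'shared'")
        current = self._locks.get(resource)
        if current is None:
            self._locks[resource] = (scope, {owner})
            return True
        current_scope, owners = current
        # Only shared locks can coexist on the same resource.
        if scope == "shared" and current_scope == "shared":
            owners.add(owner)
            return True
        return False

    def can_write(self, resource, user):
        """An unlocked resource is writable; otherwise only lock owners may write."""
        current = self._locks.get(resource)
        return current is None or user in current[1]

    def unlock(self, resource, owner):
        """Release the owner's lock; return False if the owner holds none."""
        current = self._locks.get(resource)
        if current is None or owner not in current[1]:
            return False
        current[1].discard(owner)
        if not current[1]:
            del self._locks[resource]
        return True
```

Note that a shared lock does nothing to stop two owners from overwriting each other, which is why it suits authors who coordinate among themselves.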
The LOCK Method

The UNLOCK Method

Properties and META Data
The PROPFIND Method
The PROPPATCH Method
Collections and Namespace Management
The MKCOL Method
The DELETE Method
The COPY and MOVE Methods
Overwrite header effect
COPY/MOVE of properties
Locked resources and COPY/MOVE


Enhanced HTTP/1.1 Methods
The PUT method
The OPTIONS method
Using the OPTIONS method, the client tries to establish the capability of the WebDAV server.

Version Management in WebDAV

Future of WebDAV






































HTTP The Definitive Guide (Web Hosting)

Web Hosting
Hosting Services
A Simple Example: Dedicated Hosting
Virtual Hosting
Virtual Server Request Lacks Host Information
Making Virtual Hosting Work


Virtual hosting by URL path
In general, URL-based virtual hosting is a poor solution and is seldom used.

Virtual hosting by port number
Virtual hosting by IP address
A much better approach (in common use) is virtual IP addressing. Here, each virtual web site gets one or more unique IP addresses. The IP addresses for all of the virtual web sites are attached to the same shared server. The server can look up the destination IP address of the HTTP connection and use that to determine what web site the client thinks it is connected to.
Virtual IP hosting works, but it causes some difficulties, especially for large hosters:

  • Computer systems usually have a limit on how many virtual IP addresses can be bound to a machine. Hosters that want hundreds or thousands of virtual sites to be hosted on a shared server may be out of luck.
  • IP addresses are a scarce commodity. Hosters with many virtual sites might not be able to obtain enough virtual IP addresses for the hosted web sites.
  • The IP address shortage is made worse when hosters replicate their servers for additional capacity. Different virtual IP addresses may be needed on each replicated server, depending on the load-balancing architecture, so the number of IP addresses needed can multiply by the number of replicated servers.
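The lookup at the heart of virtual IP hosting amounts to a table keyed by the connection's destination address. Here is a minimal Python sketch under that assumption; the table VIRTUAL_SITES, the function docroot_for_connection, the addresses, and the directory paths are all hypothetical.

```python
# Hypothetical mapping from virtual IP address to a site's content directory.
VIRTUAL_SITES = {
    "192.0.2.10": "/var/www/site-a",
    "192.0.2.11": "/var/www/site-b",
    "192.0.2.12": "/var/www/site-b",  # a site may own more than one address
}


def docroot_for_connection(local_ip):
    """Return the content directory for the address the client connected to."""
    return VIRTUAL_SITES.get(local_ip)


site = docroot_for_connection("192.0.2.11")
```

In a real server the local address would come from the accepted connection, for example the first element of socket.getsockname() on a TCP socket. The table also shows the scaling problem: every hosted site consumes at least one entry, and therefore at least one IP address.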


Virtual hosting by Host header


HTTP/1.1 Host Headers
Syntax and usage
Interpreting Host headers
Host headers and proxies

Making Web Sites Reliable
There are several times during which web sites commonly break:

  • Server downtime
  • Traffic spikes: suddenly everyone wants to see a particular news broadcast or rush to a sale. Sudden spikes can overload a web server, slowing it down or stopping it completely.
  • Network outages or losses

Mirrored Server Farms
A server farm is a bank of identically configured web servers that can cover for each other. The content on each server in the farm can be mirrored, so that if one has a problem, another can fill in.



In the Figure 18-7 scenario, there are a couple of ways that client requests would be directed to a particular server:
  • HTTP redirection - The URL for the content could resolve to the IP address of the master server, which could then send redirects to replica servers.
  • DNS redirection - The URL for the content could resolve to four IP addresses, and the DNS server could choose the IP address that it sends to clients.
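The HTTP redirection option can be sketched as a master server that answers every request with a redirect to one of its replicas. The Python below is an assumption-laden illustration: make_redirector, the replica hostnames, the simple rotation policy, and the choice of status code 302 are all picked for the example.

```python
import itertools
from urllib.parse import urlsplit, urlunsplit


def make_redirector(replica_hosts):
    """Return a function that maps a request URL to a redirect response."""
    rotation = itertools.cycle(replica_hosts)

    def redirect(request_url):
        parts = urlsplit(request_url)
        # Keep the scheme, path, and query; swap in the next replica's host.
        location = urlunsplit(
            (parts.scheme, next(rotation), parts.path, parts.query, parts.fragment)
        )
        return 302, {"Location": location}

    return redirect


# Hypothetical replica servers behind the master.
redirect = make_redirector(["replica1.example.com", "replica2.example.com"])
first = redirect("http://www.example.com/sale/index.html?id=7")
second = redirect("http://www.example.com/sale/index.html?id=7")
```

The cost of this approach is an extra round trip per request, since the client must contact the master before it reaches a replica.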
Content Distribution Networks
A content distribution network (CDN) is simply a network whose purpose is the distribution of specific content. The nodes of the network can be web servers, surrogates, or proxy caches.

Surrogate Caches in CDNs
Proxy Caches in CDNs

Making Web Sites Fast