Yongji Wang's Blog: HTTP The Definitive Guide (Web Robots)

Web Robots

Crawlers and Crawling

Where to Start: The “Root Set”

Extracting Links and Normalizing Relative Links

Cycle Avoidance

Loops and Dups

Trails of Breadcrumbs

Trees and hash tables
Lossy presence bit maps
Checkpoints
Partitioning

Aliases and Robot Cycles

Canonicalizing URLs
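
A minimal sketch of what canonicalizing might look like, so that aliases of the same resource compare equal. The particular rules here (lowercasing the scheme and host, dropping a default port, normalizing percent-escapes, and discarding the fragment) and the function name `canonicalize` are assumptions chosen for illustration, not a complete or authoritative rule set.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

# Default ports that can be dropped without changing the resource (assumed set).
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Return a normalized form of url so that simple aliases compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # Decode and re-encode the path so "%7E" and "~" end up identical.
    # Simplification: this also decodes escaped reserved characters such as %2F.
    path = quote(unquote(parts.path or "/"), safe="/~")
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

With these rules, `HTTP://www.Foo.com:80/%7Efred/index.html#top` and `http://www.foo.com/~fred/index.html` reduce to the same string, so a crawler's "already visited" set would treat them as one page.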

Filesystem Link Cycles

Dynamic Virtual Web Spaces

Avoiding Loops and Dups

Robotic HTTP

Identifying Request Headers

User-Agent

Tells the server the name of the robot making the request.

From

Provides the email address of the robot’s user/administrator.

Accept

Tells the server what media types are okay to send. This can help ensure that the robot receives only the content it is interested in (text, images, etc.).

Referer

Provides the URL of the document that contains the current request-URL.
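
The identifying headers above could be attached to a request roughly as follows. This is only a sketch: the robot name `ExampleBot/0.1`, the contact address, and the helper `build_robot_request` are hypothetical placeholders, and the request is built but not sent.

```python
from typing import Optional
from urllib.request import Request


def build_robot_request(url: str, referer: Optional[str] = None) -> Request:
    """Build (but do not send) a request carrying the robot's identifying headers."""
    headers = {
        # Name of the robot making the request (hypothetical name).
        "User-Agent": "ExampleBot/0.1",
        # Contact address of the robot's administrator (placeholder address).
        "From": "bot-admin@example.com",
        # Media types the robot is willing to receive.
        "Accept": "text/html, text/plain",
    }
    if referer is not None:
        # URL of the document in which the link to `url` was found.
        headers["Referer"] = referer
    return Request(url, headers=headers)
```

A caller would pass the resulting object to `urllib.request.urlopen` to perform the fetch.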

Virtual Hosting

Conditional Requests

Response Handling

Status codes

Entities

User-Agent Targeting

Misbehaving Robots

Runaway robots
Stale URLs
Long, wrong URLs
Nosy robots
Dynamic gateway access

Excluding Robots

The Robots Exclusion Standard

Fetching robots.txt

GET /robots.txt HTTP/1.0
Host: www.joes-hardware.com
User-Agent: Slurp/2.0
Date: Wed Oct 3 20:22:48 EST 2001

Response codes

• If the server responds with a success status (HTTP status code 2XX), the robot must parse the content and apply the exclusion rules to fetches from that site.

• If the server response indicates the resource does not exist (HTTP status code 404), the robot can assume that no exclusion rules are active and that access to the site is not restricted by robots.txt.

• If the server response indicates access restrictions (HTTP status code 401 or 403), the robot should regard access to the site as completely restricted.

• If the request attempt results in temporary failure (HTTP status code 503), the robot should defer visits to the site until the resource can be retrieved.

• If the server response indicates redirection (HTTP status code 3XX), the robot should follow the redirects until the resource is found.
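
The rules above amount to a small decision table keyed on the status code of the robots.txt fetch. The sketch below is one way to write it down; the names `RobotsPolicy` and `policy_for_status` are hypothetical. Status codes the list does not mention raise an error here rather than being guessed at.

```python
from enum import Enum


class RobotsPolicy(Enum):
    PARSE_RULES = "parse the file and apply its exclusion rules"
    ALLOW_ALL = "no exclusion rules are active"
    DENY_ALL = "treat the site as completely restricted"
    RETRY_LATER = "defer visits until robots.txt can be retrieved"
    FOLLOW_REDIRECT = "follow the redirect and fetch again"


def policy_for_status(status: int) -> RobotsPolicy:
    """Map the status code of a robots.txt fetch to the action listed above."""
    if 200 <= status < 300:
        return RobotsPolicy.PARSE_RULES
    if 300 <= status < 400:
        return RobotsPolicy.FOLLOW_REDIRECT
    if status in (401, 403):
        return RobotsPolicy.DENY_ALL
    if status == 404:
        return RobotsPolicy.ALLOW_ALL
    if status == 503:
        return RobotsPolicy.RETRY_LATER
    # The list above does not cover other codes, so this sketch refuses to guess.
    raise ValueError(f"no rule listed for status {status}")
```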

robots.txt File Format

Caching and Expiration of robots.txt

HTML Robot-Control META Tags

Robot META directives

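NOINDEX

Tells a robot not to index the contents of the page.

NOFOLLOW

Tells a robot not to crawl any outgoing links in the page.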

INDEX

Tells a robot that it may index the contents of the page.

FOLLOW

Tells a robot that it may crawl any outgoing links in the page.

NOARCHIVE

Tells a robot that it should not cache a local copy of the page.

ALL

Equivalent to INDEX, FOLLOW.

NONE

Equivalent to NOINDEX, NOFOLLOW.
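
A robot might turn the comma-separated directive list from a robots META tag into a set of permissions along these lines. This is a sketch under stated assumptions: the function name `parse_robots_meta` is hypothetical, everything is assumed to be permitted when no directive says otherwise, and a restrictive directive is assumed to win over a permissive one when both appear.

```python
def parse_robots_meta(content: str) -> dict:
    """Interpret a robots META directive list such as "NOINDEX, FOLLOW"."""
    # Directives are treated as case-insensitive and comma-separated.
    tokens = {t.strip().upper() for t in content.split(",") if t.strip()}

    # Assumed defaults: indexing, link following, and caching are all allowed.
    # INDEX, FOLLOW, and ALL only restate these defaults, so they need no handling.
    perms = {"index": True, "follow": True, "archive": True}

    # NONE is equivalent to NOINDEX, NOFOLLOW.
    if "NONE" in tokens:
        tokens |= {"NOINDEX", "NOFOLLOW"}

    # Restrictive directives override the defaults (assumed precedence).
    if "NOINDEX" in tokens:
        perms["index"] = False
    if "NOFOLLOW" in tokens:
        perms["follow"] = False
    if "NOARCHIVE" in tokens:
        perms["archive"] = False
    return perms
```

For example, `parse_robots_meta("noindex, follow")` leaves link following and caching allowed but disables indexing.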

Search engine META tags

Robot Etiquette

Modern Search Engine Architecture

Full-Text Index

Posting the Query

Sorting and Presenting the Results


Wednesday, May 14, 2014
