Wednesday, May 14, 2014

HTTP The Definitive Guide (Web Robots)

Web Robots
Crawlers and Crawling
Where to Start: The “Root Set”

Extracting Links and Normalizing Relative Links

Cycle Avoidance
Loops and Dups

Trails of Breadcrumbs
  • Trees and hash tables
  • Lossy presence bit maps
  • Checkpoints
  • Partitioning
Aliases and Robot Cycles
Canonicalizing URLs
Filesystem Link Cycles
Dynamic Virtual Web Spaces
Avoiding Loops and Dups

Robotic HTTP
Identifying Request Headers
User-Agent
     Tells the server the name of the robot making the request.
From
     Provides the email address of the robot’s user/administrator.
Accept
     Tells the server what media types are okay to send. This can help ensure that the robot receives only content in which it’s interested (text, images, etc.).
Referer
     Provides the URL of the document that contains the current request-URL.
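
As a rough illustration of the four headers above, here is a minimal sketch using Python's standard urllib.request. The robot name, contact address, and URLs are hypothetical placeholders, and the request is only built, never sent.

```python
from urllib.request import Request


def build_robot_request(url, referer=None):
    """Build (but do not send) a request carrying the identifying headers."""
    headers = {
        "User-Agent": "ExampleBot/1.0",      # hypothetical robot name
        "From": "robot-admin@example.com",   # hypothetical administrator address
        "Accept": "text/html, text/plain",   # media types this robot wants
    }
    if referer is not None:
        # URL of the document in which the link to `url` was found
        headers["Referer"] = referer
    return Request(url, headers=headers)


req = build_robot_request(
    "http://www.example.com/page.html",
    referer="http://www.example.com/index.html",
)
```

Passing the result to urllib.request.urlopen would perform the fetch; that step is left out so the sketch runs without network access.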


Virtual Hosting

Conditional Requests
Response Handling
Status codes
Entities
User-Agent Targeting
Misbehaving Robots
  • Runaway robots
  • Stale URLs
  • Long, wrong URLs
  • Nosy robots
  • Dynamic gateway access


Excluding Robots


The Robots Exclusion Standard
Fetching robots.txt
    GET /robots.txt HTTP/1.0
    Host: www.joes-hardware.com
    User-Agent: Slurp/2.0
    Date: Wed Oct 3 20:22:48 EST 2001

Response codes
• If the server responds with a success status (HTTP status code 2XX), the robot must parse the content and apply the exclusion rules to fetches from that site.
• If the server response indicates the resource does not exist (HTTP status code 404), the robot can assume that no exclusion rules are active and that access to the site is not restricted by robots.txt.
• If the server response indicates access restrictions (HTTP status code 401 or 403), the robot should regard access to the site as completely restricted.
• If the request attempt results in temporary failure (HTTP status code 503), the robot should defer visits to the site until the resource can be retrieved.
• If the server response indicates redirection (HTTP status code 3XX), the robot should follow the redirects until the resource is found.
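
The response-code rules above can be sketched as a small decision function. This is only an illustration: the function name and the returned labels are made up for this example, and status codes the list does not mention are reported as "unspecified" rather than guessed at.

```python
def robots_txt_policy(status):
    """Map the HTTP status of a robots.txt fetch to a crawl decision."""
    if 200 <= status < 300:
        return "parse"            # apply the exclusion rules in the file
    if 300 <= status < 400:
        return "follow-redirect"  # keep following until the file is found
    if status in (401, 403):
        return "disallow-all"     # treat the site as completely restricted
    if status == 404:
        return "allow-all"        # no exclusion rules are active
    if status == 503:
        return "retry-later"      # defer visits until the file is retrievable
    return "unspecified"          # not covered by the rules listed above


for code in (200, 301, 403, 404, 503):
    print(code, robots_txt_policy(code))
```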

robots.txt File Format

Caching and Expiration of robots.txt

HTML Robot-Control META Tags
<META NAME="ROBOTS" CONTENT=directive-list>

Robot META directives
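NOINDEX
Tells a robot not to process the page’s content and not to include it in any index.
NOFOLLOW
Tells a robot not to crawl any outgoing links from the page.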
INDEX
Tells a robot that it may index the contents of the page.
FOLLOW
Tells a robot that it may crawl any outgoing links in the page.
NOARCHIVE
Tells a robot that it should not cache a local copy of the page.
ALL
Equivalent to INDEX, FOLLOW.
NONE
Equivalent to NOINDEX, NOFOLLOW.
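
To make the directive semantics concrete, here is a minimal sketch of how a robot might interpret the CONTENT value. It assumes the directive list is comma-separated and case-insensitive, that everything is permitted when no directive says otherwise, and that a later directive overrides an earlier conflicting one; the function name is hypothetical.

```python
def parse_robots_meta(content):
    """Return (may_index, may_follow, may_archive) for a directive list."""
    # Assumption: a page with no directives places no restrictions.
    may_index, may_follow, may_archive = True, True, True
    for token in content.split(","):
        directive = token.strip().upper()
        if directive == "NOINDEX":
            may_index = False
        elif directive == "NOFOLLOW":
            may_follow = False
        elif directive == "INDEX":
            may_index = True
        elif directive == "FOLLOW":
            may_follow = True
        elif directive == "NOARCHIVE":
            may_archive = False
        elif directive == "ALL":      # same as INDEX, FOLLOW
            may_index = may_follow = True
        elif directive == "NONE":     # same as NOINDEX, NOFOLLOW
            may_index = may_follow = False
        # Unrecognized directives are ignored in this sketch.
    return may_index, may_follow, may_archive


print(parse_robots_meta("NOINDEX, FOLLOW"))
```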

Search engine META tags
Robot Etiquette


Modern Search Engine Architecture



Full-Text Index
Posting the Query
Sorting and Presenting the Results


















