Web Robots
Crawlers and Crawling
Where to Start: The “Root Set”
Extracting Links and Normalizing Relative Links
Cycle Avoidance
Loops and Dups
Trails of Breadcrumbs
- Trees and hash tables
- Lossy presence bit maps
- Checkpoints
- Partitioning
Aliases and Robot Cycles
Canonicalizing URLs
Filesystem Link Cycles
Dynamic Virtual Web Spaces
Avoiding Loops and Dups
Robotic HTTP
Identifying Request Headers
User-Agent
Tells the server the name of the robot making the request.
From
Provides the email address of the robot’s user/administrator.
Accept
Tells the server what media types are okay to send. This can help ensure that the robot receives only content in which it’s interested (text, images, etc.).
Referer
Provides the URL of the document that contains the current request-URL.
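As a rough illustration of the four identifying headers above, the sketch below builds (but does not send) a request with Python's standard urllib. The robot name ExampleBot/1.0, the contact address, the page paths, and the helper name build_robot_request are all hypothetical, chosen only for this example.

```python
from urllib.request import Request


def build_robot_request(url, referer=None):
    """Build (without sending) a request that carries the identifying headers."""
    headers = {
        "User-Agent": "ExampleBot/1.0",      # hypothetical robot name
        "From": "robot-admin@example.com",   # hypothetical administrator address
        "Accept": "text/html, text/plain",   # media types this robot wants
    }
    if referer is not None:
        # URL of the document in which the link to `url` was found
        headers["Referer"] = referer
    return Request(url, headers=headers)


# Hypothetical page paths on the host used elsewhere in these notes.
req = build_robot_request(
    "http://www.joes-hardware.com/tools.html",
    referer="http://www.joes-hardware.com/index.html",
)
for name, value in req.header_items():
    print(f"{name}: {value}")
```

Note that urllib normalizes header capitalization internally (for example, "User-agent"); header names are case-insensitive in HTTP, so this does not change their meaning.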
Virtual Hosting
Conditional Requests
Response Handling
Status codes
Entities
User-Agent Targeting
Misbehaving Robots
- Runaway robots
- Stale URLs
- Long, wrong URLs
- Nosy robots
- Dynamic gateway access
Excluding Robots
The Robots Exclusion Standard
Fetching robots.txt
GET /robots.txt HTTP/1.0
Host: www.joes-hardware.com
User-Agent: Slurp/2.0
Date: Wed Oct 3 20:22:48 EST 2001
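The request above always targets /robots.txt at the root of the host. A minimal sketch of deriving that URL from any page URL on the same site follows; the function name robots_txt_url and the sample page path are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host (with any port); drop the path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page path on the host from the example request.
print(robots_txt_url("http://www.joes-hardware.com/tools/hammers.html?sale=1"))
```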
Response codes
• If the server responds with a success status (HTTP status code 2XX), the robot must parse the content and apply the exclusion rules to fetches from that site.
• If the server response indicates the resource does not exist (HTTP status code 404), the robot can assume that no exclusion rules are active and that access to the site is not restricted by robots.txt.
• If the server response indicates access restrictions (HTTP status code 401 or 403), the robot should regard access to the site as completely restricted.
• If the request attempt results in temporary failure (HTTP status code 503), the robot should defer visits to the site until the resource can be retrieved.
• If the server response indicates redirection (HTTP status code 3XX), the robot should follow the redirects until the resource is found.
robots.txt File Format
Caching and Expiration of robots.txt
HTML Robot-Control META Tags
<META NAME="ROBOTS" CONTENT=directive-list>
Robot META directives
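NOINDEX
Tells a robot not to index the contents of the page.
NOFOLLOW
Tells a robot not to crawl any outgoing links in the page.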
INDEX
Tells a robot that it may index the contents of the page.
FOLLOW
Tells a robot that it may crawl any outgoing links in the page.
NOARCHIVE
Tells a robot that it should not cache a local copy of the page.
ALL
Equivalent to INDEX, FOLLOW.
NONE
Equivalent to NOINDEX, NOFOLLOW.
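A minimal sketch of how a robot might interpret the directive-list follows. It makes three assumptions that these notes do not state: a page with no directives is treated as indexable and followable, a later directive overrides an earlier conflicting one, and unknown directives are ignored. The function name parse_robots_meta is hypothetical.

```python
def parse_robots_meta(content):
    """Turn a META ROBOTS directive-list into index/follow/archive flags."""
    flags = {"index": True, "follow": True, "archive": True}  # assumed defaults
    for token in content.split(","):
        directive = token.strip().upper()  # directives are matched case-insensitively here
        if directive == "NOINDEX":
            flags["index"] = False
        elif directive == "NOFOLLOW":
            flags["follow"] = False
        elif directive == "INDEX":
            flags["index"] = True
        elif directive == "FOLLOW":
            flags["follow"] = True
        elif directive == "NOARCHIVE":
            flags["archive"] = False
        elif directive == "ALL":       # equivalent to INDEX, FOLLOW
            flags["index"] = flags["follow"] = True
        elif directive == "NONE":      # equivalent to NOINDEX, NOFOLLOW
            flags["index"] = flags["follow"] = False
        # Anything else is ignored in this sketch.
    return flags


print(parse_robots_meta("NOINDEX, FOLLOW"))
print(parse_robots_meta("none, noarchive"))
```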
Search engine META tags
Robot Etiquette
Modern Search Engine Architecture
Full-Text Index
Posting the Query
Sorting and Presenting the Results