Robotic good behaviour

Robot exclusion

The Peregrinator follows the Standard for Robot Exclusion, so Web managers can control its access to their servers via a
User-agent:    Peregrinator-Mathematics
record in their /robots.txt file.

So far few mathematics Web administrators are making use of this file: of about 90 servers indexed by the Peregrinator on its first run, only two (one of them my own) had a /robots.txt. (Another half dozen erroneously return their default page when /robots.txt is requested: this seems to be an idiosyncrasy of the GN HTTP daemon.)

Guiding the Peregrinator

Administrators of Web servers containing mathematical and statistical information are encouraged to assist the indexing project by providing a /robots.txt entry to indicate to the Peregrinator which subtrees contain information not relevant to mathematics. For example:
User-agent:    Peregrinator-Mathematics
Disallow:      /CompSci/ # (though the Maths-CS boundary is ill-defined)
Disallow:      /Games/
Disallow:      /Unixhelp/
Disallow:      /Bigdummy/
Disallow:      /cgi-bin/finger
Disallow:      /man/
This would of course require variations depending on what is present and how it is named on a given server.
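As a rough way to check how such a record would be interpreted, the sketch below feeds the example above to the robots.txt parser in Python's standard library. This is not the Peregrinator's own code; the sample paths (such as /Maths/index.html) and the second robot name are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example record from above, exactly as it might appear in /robots.txt.
ROBOTS_TXT = """\
User-agent:    Peregrinator-Mathematics
Disallow:      /CompSci/ # (though the Maths-CS boundary is ill-defined)
Disallow:      /Games/
Disallow:      /Unixhelp/
Disallow:      /Bigdummy/
Disallow:      /cgi-bin/finger
Disallow:      /man/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

AGENT = "Peregrinator-Mathematics"

# Hypothetical paths, chosen only to exercise the rules.
for path in ["/Maths/index.html", "/Games/chess.html", "/CompSci/algorithms.html"]:
    verdict = "allowed" if parser.can_fetch(AGENT, path) else "excluded"
    print(path, verdict)
```

Note that the record names one robot only, so a robot with a different name is not affected by these Disallow lines.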

More drastic exclusions

To prevent all access by this robot, include the following record:
User-agent:    Peregrinator-Mathematics
Disallow:      /
To block all robots, you could say:
User-agent:    *
Disallow:      /
However, it would be a pity if many sites did this, as the information they contain would not then be automatically indexed by any conforming robot.
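The difference between the two records can be seen in a small sketch, again using Python's standard robots.txt parser rather than anything taken from the Peregrinator itself. The helper name allowed, the robot name SomeOtherRobot and the path /index.html are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, agent, path):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

# Record that shuts out the Peregrinator only.
BLOCK_PEREGRINATOR = """\
User-agent:    Peregrinator-Mathematics
Disallow:      /
"""

# Record that shuts out every conforming robot.
BLOCK_ALL = """\
User-agent:    *
Disallow:      /
"""

for name, text in [("Peregrinator only", BLOCK_PEREGRINATOR), ("all robots", BLOCK_ALL)]:
    for agent in ["Peregrinator-Mathematics", "SomeOtherRobot"]:
        print(name, agent, allowed(text, agent, "/index.html"))
```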

Access timing

The Peregrinator also limits its accesses to any individual server to at most one every several minutes, and its average rate over its running time is considerably lower. But because it concentrates its efforts on the hundred or so servers concerned with the topic in question, accesses are not as easy to space out as they are for a general-purpose indexing robot doing a breadth-first traversal of the whole Web.
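One simple way to enforce a per-server minimum gap is to remember when each host was last contacted. The sketch below is an illustration of that idea, not the Peregrinator's actual scheduler; the class name HostThrottle and the 180-second interval are assumptions, since the text says only "several minutes".

```python
import time

class HostThrottle:
    """Track the last access per host and report how long to wait before the next."""

    def __init__(self, min_interval=180.0, clock=time.monotonic):
        self.min_interval = min_interval  # assumed value, in seconds
        self.clock = clock                # injectable so the logic can be tested
        self.last_access = {}             # host name -> time of last request

    def seconds_until_allowed(self, host):
        last = self.last_access.get(host)
        if last is None:
            return 0.0                    # never visited: no wait needed
        return max(0.0, last + self.min_interval - self.clock())

    def record_access(self, host):
        self.last_access[host] = self.clock()
```

A robot working through a queue of URLs could skip any host whose wait is still positive and come back to it later, which is where a small set of target servers makes spacing harder: there are fewer other hosts to visit in the meantime.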

Since so far /robots.txt files are rare, the Peregrinator only checks for new or changed ones about once a week.
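A weekly recheck of this kind might be decided as in the sketch below, which relies on the timestamp kept by Python's standard robots.txt parser. The function name needs_refresh and the exact seven-day threshold are assumptions; the text says only "about once a week".

```python
import time
from urllib.robotparser import RobotFileParser

ONE_WEEK = 7 * 24 * 60 * 60  # assumed threshold, in seconds

def needs_refresh(parser, now=None, max_age=ONE_WEEK):
    """Return True if the cached robots.txt was never fetched or is at least max_age old."""
    if now is None:
        now = time.time()
    last_checked = parser.mtime()  # 0 until modified() has been called
    return last_checked == 0 or now - last_checked >= max_age
```

After each successful fetch the caller would invoke parser.modified() to stamp the current time, so that the next week's check starts from that moment.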


JSR, Sydney Mathematics and Statistics, 26 Aug 1994 (modified 7 Nov 1996)