How to assist the Peregrinator (and other robots)

Here are some suggestions that WWW administrators might consider to make it easier for the Peregrinator to index their site, and for MathSearch to provide meaningful responses to queries. Some of these points may also help other robots and search engines.

Provide guidance on robot scope

Set up a /robots.txt file, listing the parts of your server that robots should not visit, if you don't already have one.
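
As a rough illustration of how a robot might interpret such a file, the sketch below feeds a small sample /robots.txt to Python's standard urllib.robotparser module. The excluded paths (/backup/, /cgi-bin/) and the robot name ExampleRobot are hypothetical, chosen only for the example; this is not a description of how the Peregrinator itself reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical /robots.txt contents; the excluded paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /backup/
Disallow: /cgi-bin/
"""


def allowed(path, agent="ExampleRobot"):
    """Return True if the sample rules let `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/index.html", "/backup/old.html"):
        print(path, allowed(path))
```

Checking a few of your own paths in this way before publishing the file can catch rules that exclude more, or less, than intended.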

Give Web pages useful titles

Give each page an informative title (enclosed by <TITLE>. . .</TITLE>) which will make sense out of context: this title will probably be displayed to the person making the query when the page is matched by the search engine. The title should never be blank, and should provide an accurate and specific description of the contents of the page. Use a different title for each page.
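
One way to audit titles across a site is sketched below, using Python's standard html.parser module. The helper names page_title and title_problems, and the sample paths in the usage note, are assumptions made for this example; the sketch only reports pages whose title is blank or shared with another page.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text found between <TITLE> and </TITLE>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def page_title(html_text):
    """Return the page title with whitespace normalised ('' if none)."""
    extractor = TitleExtractor()
    extractor.feed(html_text)
    extractor.close()
    return " ".join("".join(extractor.parts).split())


def title_problems(pages):
    """pages maps a path to its HTML text; returns (blank, repeated)."""
    by_title = {}
    for path, text in pages.items():
        by_title.setdefault(page_title(text), []).append(path)
    blank = by_title.pop("", [])
    repeated = {t: paths for t, paths in by_title.items() if len(paths) > 1}
    return blank, repeated
```

For example, given two pages that both carry the title "Algebra Seminar" and a third page with no title at all, title_problems lists the third page as blank and the first two as repeated.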

You may wish to include a brief form of the name of your site in page titles. Unlike some other search engines, MathSearch displays the server's DNS name along with the title, but some other form of the site name may be more helpful to people making queries. Note that while such a site name can be useful in titles, it will often not be wanted in the page heading (enclosed in <H1>. . .</H1>).

Organize your directories

Make your directory hierarchies meaningful so as to simplify /robots.txt. In particular, separate files based on content, and also on language if yours is a multi-lingual server.

Don't put large numbers of unrelated files in the same WWW server directory, especially not the top one.

Backups and old copies should preferably not be in the WWW hierarchy at all. If they are, exclude them via /robots.txt, or make sure that they are not accessible by following links from your home page: e.g., forbid server-generated directory indexing, or use other server features such as NCSA httpd's IndexIgnore.

Avoid having multiple paths leading to the same page: robots have difficulty detecting such redundancy, and the result is that several identical responses will appear in query results. A common example which is hard to avoid is that http://server/ and http://server/index.html are often identical. But it is quite easy to avoid less common cases such as soft links, or having /~USER/ and /home/USER/ point to the same directory.
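
To show why only some of this redundancy can be detected from the URL alone, here is a minimal sketch of a normalisation step a robot could apply. It assumes the server's default document is named index.html; the function name canonical is hypothetical, and this is not claimed to be what the Peregrinator does. Soft links and aliases such as /~USER/ versus /home/USER/ cannot be resolved this way, which is why they are better avoided on the server.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical(url, index_name="index.html"):
    """Map http://server/index.html and http://server to http://server/.

    Assumes `index_name` is the server's default document (an assumption
    that a robot cannot verify from the URL itself).
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/" + index_name):
        # Drop the file name but keep the trailing slash.
        path = path[: -len(index_name)]
    # Host names are case-insensitive; fragments never reach the server.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))
```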

Use valid HTML

If a page has content type "text/html", make sure the content really is HTML, not plain text or written in some programming language.

To display angle brackets in HTML, e.g., in a mail address or Usenet message-id, escape them correctly using the ampersand notation: write &lt; for < and &gt; for >.
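
If pages are generated by a program, the escaping can be done mechanically. The sketch below uses html.escape from Python's standard library; the wrapper name escape_for_html and the sample address are made up for the example. Note that html.escape also replaces a literal ampersand with &amp;, which HTML requires as well.

```python
import html


def escape_for_html(text):
    """Escape &, < and > so the text displays literally in an HTML page."""
    return html.escape(text, quote=False)


if __name__ == "__main__":
    # A mail address in angle brackets, as it might appear in a header.
    print(escape_for_html("<user@example.org>"))
```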

Regulate your server

Some servers, notably GN, seem to return the site's home page when a nonexistent page is requested, instead of indicating an error. This behaviour is helpful to humans but confusing to robots: you may wish to consider disabling it somehow.

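Similarly, CGI scripts should return an appropriate status code: for example, an error status when a request cannot be satisfied, rather than a normal response that merely contains an error message.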
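
A CGI script can set the status by emitting a Status header before its other headers. The following is a minimal sketch of that idea; the function name cgi_response and the page text are assumptions made for the example, and a real script would choose the status according to what actually went wrong.

```python
import sys


def cgi_response(status, reason, body):
    """Build a CGI response whose Status header carries the given code."""
    return (
        f"Status: {status} {reason}\r\n"
        "Content-Type: text/html\r\n"
        "\r\n"
        + body
    )


if __name__ == "__main__":
    # Report a missing document as an error, not as an ordinary page.
    page = "<TITLE>Not found</TITLE><H1>Not found</H1>\n"
    sys.stdout.write(cgi_response(404, "Not Found", page))
```

A robot receiving this response can tell from the status that there is nothing to index, while a human still sees a readable explanation.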

Use strong key words

When you have the opportunity to name something that people may later search for, choose an unusual name consisting of letters only. (The Peregrinator considers only strings of letters when constructing key words to index.) The more striking the name, the easier it will be for people to remember, and the fewer extraneous matches a search is likely to return.
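
The effect of the letters-only rule can be illustrated with a short sketch. It assumes that "letters" means the ASCII letters A-Z and a-z, and the function name letter_runs is hypothetical; the Peregrinator's actual rules for case, accents and stop words are not stated here.

```python
import re


def letter_runs(text):
    """Split text into maximal runs of ASCII letters (an assumption)."""
    return re.findall(r"[A-Za-z]+", text)


if __name__ == "__main__":
    # A name containing a digit, such as "K3", is reduced to a fragment.
    print(letter_runs("MathSearch v2 indexes K3 surfaces"))
```

Under these assumptions a name like "K3" contributes only the single letter "K", which matches far too much to be useful, whereas "MathSearch" survives intact.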


JSR, Sydney Mathematics and Statistics, 12 Oct 1994 (amended 17 Oct 1994) . . . SMSsearch