Peregrinator: A Web-Indexing Robot

The Peregrinator is a robot for traversing and indexing sections of the Web. For a list of other such robots and an overall description, see WWW Robots, Wanderers, and Spiders by Martijn Koster.

Development

In June 1994, after looking at publicly available tools (such as WAIS) for indexing and searching pages on the local WWW server, I decided to develop a simple system in Perl V4 that would fit naturally into the Web, both when gathering an index and when searching it. In particular, a Perl CGI script would be used for searching, rather than an independent query server. The end result of this work can be seen in SMSsearch.

I then realized, in part inspired by Brian Pinkerton's description of the WebCrawler, that it would be both easy and worthwhile to modify the program so that it could index not only the local server but also a wider range of WWW sites. However, it seemed too ambitious to attempt the difficult task of indexing the Web at large, especially as this was already done well by the WebCrawler, the World Wide Web Worm, Jonathon Fletcher's JumpStation (no longer operating), and increasingly many others.

Topical indexing

Instead I set out to make an index of Web pages on a particular topic, namely mathematics and statistics. As it turned out, this could be seen as a (small) start from the bottom up on a proposal Ian Cooper describes from the top down in his paper Indexing the World (July 1994), where he talks of "linking many small indices into a huge virtual index". There are also a number of similar ideas, and much useful detail, in Charlie Stross's Searching the Web (June 1994). Undoubtedly many people are thinking independently and interdependently along the same lines in the evolutionary environment of the Web.

How such index linking can be done remains to be seen. But there appear to be a number of advantages in developing a series of indices dedicated to particular topic areas:

The resulting index of mathematics and statistics servers, generated by the Peregrinator robot, is MathSearch.

Further details


JSR, Sydney Mathematics and Statistics, 26 Aug 1994 (amended 30 Oct 1996)