Peregrinator: A Web-Indexing Robot
The Peregrinator is a robot for traversing and indexing sections of the Web.
For a list of other such robots and an overall description, see
WWW Robots, Wanderers, and Spiders
by Martijn Koster.
Development
In June 1994, after looking at publicly available tools -- such as WAIS --
for indexing and searching pages on the local WWW server, I decided to develop
a simple system, to be written in Perl
V4, which would fit naturally into the Web both when gathering
an index and when searching it. In particular, a Perl CGI script would be
used for searching, rather than an independent query server. The end result
of this work can be seen in SMSsearch.
I then realized, in part inspired by
Brian Pinkerton's description of
the WebCrawler, that
it would be both easy and worthwhile to modify the program so that it could
index not only the local server but also a wider range of WWW sites. However,
it seemed too ambitious to attempt the difficult task of indexing the Web at
large, especially as this was already being done well by the WebCrawler, the
World Wide Web Worm,
Jonathon Fletcher's JumpStation (no longer operating), and increasingly many
others.
Topical indexing
Instead I set out to make an index of Web pages on a particular topic, namely
mathematics and statistics. As it turned out, this could be seen as
a (small) start from the bottom up on a proposal Ian Cooper describes from
the top down in his paper
Indexing the World
(July 1994), where he talks of ``linking many small
indices into a huge virtual index''. There are also a number of similar
ideas, and much useful detail, in Charlie Stross's
Searching the Web (June 1994). Undoubtedly many people are thinking independently
and interdependently along the same lines in the
evolutionary environment of the Web.
How such index linking can be done remains to be seen. But there appear to
be a number of advantages in developing a series of indices dedicated
to particular topic areas:
- specialist knowledge can predict where on the Web information sources
about the topic will be located;
- the database for such an index should be of more manageable size, perhaps
allowing greater detail to be stored (e.g., full-content and/or sentence-level
indexing);
- keywords and ``stop'' or ``noise'' words can be specifically chosen
to be relevant to the topic;
- administrators of target servers may be more willing to accept robot
accesses if these are for the benefit of their own discipline;
- because the subject material is limited, fewer people will query the index
server, reducing load;
- extraneous matches, that is, irrelevant search results which happen
to fit the query, are less likely if the user is searching a database
dedicated to the topic of interest;
- because fewer pages are indexed, the robot can cycle through them more
frequently, keeping the index more up to date.
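Two of the points above, restricting the robot to servers known to carry the topic and choosing stop words with the topic in mind, can be illustrated with a minimal sketch. It is written in Python for brevity and is not the Peregrinator's own Perl code; the host names, the word lists, and the function names in_scope and index_terms are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Hypothetical seed hosts, chosen by someone who knows the subject area.
TOPIC_HOSTS = {"maths.example.edu", "stats.example.org"}

# General noise words, plus words so common within the topic that they
# discriminate poorly between pages. Both lists are illustrative only.
GENERAL_STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}
TOPIC_STOP_WORDS = {"mathematics", "mathematical", "statistics", "department"}
STOP_WORDS = GENERAL_STOP_WORDS | TOPIC_STOP_WORDS


def in_scope(url):
    """Return True if the URL's host is one of the topical seed hosts."""
    host = urlparse(url).hostname
    return host is not None and host in TOPIC_HOSTS


def index_terms(text):
    """Split text into lowercase words and drop general and topical stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]
```

Under these assumptions, a robot would fetch only URLs for which in_scope returns true, and would store index_terms(page_text) for each page rather than every word. In a general-purpose index a word such as "statistics" is a useful keyword; in a statistics index it appears on nearly every page, which is why it is treated as noise here.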
The resulting index of mathematics and statistics servers, generated by the
Peregrinator robot, is MathSearch.
Further details
JSR,
Sydney Mathematics and Statistics, 26 Aug 1994 (amended 30 Oct 1996)