About 90 Web servers in mathematics and statistics institutions were chosen for the Peregrinator to index: most of these were from the Penn State list. Only servers in English-speaking countries were selected: this regrettable linguistic chauvinism was made necessary because the word-stemming algorithm is specific to English. Only HTML pages are indexed.
Starting in early August 1994, each chosen site was processed by the Peregrinator, from its top page; local URLs were followed up, but all off-site URLs were ignored. Even so, a considerable number of irrelevant pages were found: some such subtrees were identified manually and excluded from further examination. (Web administrators at the sites are encouraged to help guide the robot, and more generally to organize their served files into subtrees according to topic.)
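The link-following rule described above (follow local URLs, ignore off-site ones, skip manually excluded subtrees) can be sketched as follows. This is only an illustration, not the Peregrinator's actual code: the function name `local_links`, the `excluded_prefixes` parameter, and the example host are all hypothetical.

```python
from urllib.parse import urljoin, urlparse


def local_links(page_url, hrefs, excluded_prefixes=()):
    """Return absolute URLs from hrefs that stay on page_url's host
    and do not fall under any excluded subtree (hypothetical helper)."""
    host = urlparse(page_url).netloc.lower()
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, ftp:, and similar schemes
        if parts.netloc.lower() != host:
            continue  # off-site URL: ignored
        if any(parts.path.startswith(prefix) for prefix in excluded_prefixes):
            continue  # subtree excluded by hand as irrelevant
        kept.append(absolute)
    return kept


links = local_links(
    "http://www.example.edu/index.html",
    ["research/", "http://other.example.org/x.html", "/sports/scores.html"],
    excluded_prefixes=("/sports/",),
)
```

Here only `http://www.example.edu/research/` would be queued: the second link is off-site and the third lies in an excluded subtree.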
This approach will of course not cover anywhere near all Web material on the topic. But it may cover enough to allow people with queries to reach something relevant directly and, by following further links, to reach most of the important relevant documents in only a few hops. Further, unlike automatic relevance detection, the approach is easily implemented.
By late August 1994, nearly 6000 pages had been indexed, and the queue of URLs still to be processed was nearing exhaustion. At this point, reprocessing of documents not seen for a long period commenced: it is intended that the robot will continue to cycle through the documents, revisiting each one every 4 to 6 weeks.
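One simple way to pick documents for reprocessing is to select those whose last visit is older than some threshold, oldest first. The sketch below assumes a 4-week threshold and a hypothetical `last_seen` mapping from URL to last visit time; the robot's actual scheduling is not described here.

```python
from datetime import datetime, timedelta


def due_for_revisit(last_seen, now, max_age=timedelta(weeks=4)):
    """Return URLs last visited at least max_age ago, oldest first.

    last_seen maps URL -> datetime of the last visit (hypothetical layout).
    """
    stale = [(seen, url) for url, seen in last_seen.items() if now - seen >= max_age]
    return [url for seen, url in sorted(stale)]


last_seen = {
    "http://www.example.edu/a.html": datetime(1994, 8, 1),
    "http://www.example.edu/b.html": datetime(1994, 8, 25),
    "http://www.example.edu/c.html": datetime(1994, 7, 20),
}
queue = due_for_revisit(last_seen, now=datetime(1994, 9, 1))
```

With these made-up dates, `c.html` and then `a.html` are due, while `b.html` was seen too recently.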
MathSearch was announced on sci.math.research on 6 September 1994.
Web administrators can determine whether their site is included by searching for a phrase which occurs in one of their mathematics- or statistics-oriented pages, but which is unlikely to appear elsewhere. Additional mathematics or statistics servers may be added to the robot's list on request, provided they have a /robots.txt entry excluding pages whose content is not mathematical or statistical, or is not in English.
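As an illustration of such a /robots.txt entry, the sketch below checks a sample file with Python's standard `urllib.robotparser`. The user-agent token `Peregrinator`, the excluded paths, and the host name are assumptions made for the example; administrators should confirm the token the robot actually sends.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep the robot out of non-mathematical subtrees.
ROBOTS_TXT = """\
User-agent: Peregrinator
Disallow: /sports/
Disallow: /admin/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Mathematical pages remain fetchable; excluded subtrees do not.
allowed = parser.can_fetch("Peregrinator", "http://www.example.edu/research/algebra.html")
blocked = parser.can_fetch("Peregrinator", "http://www.example.edu/sports/scores.html")
```

Under these rules `allowed` is `True` and `blocked` is `False`, while other robots are left unrestricted by the empty `Disallow:` line.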
| Date | Active servers | URLs (thousands) | Stems (thousands) | Stem entries (millions) |
|---|---|---|---|---|
| Oct 1994 | <=107 | 8.8 | ? | ? |
| Jan 1995 | <=138 | 14 | 67 | 2.7 |
| Jan 1996 | <=180 | 47 | 144 | 6.9 |
| Jun 1996 | 160 | 65 | 170 | 9.7 |
| Aug 1998 | ~165 | 156 | 287 | 22.9 |
| Jul 1999 | 187 | 198 | 330 | 30.7 |
| Jul 2000 | 183 | 214 | 352 | 33.9 |
| Jul 2001 | 185 | 256 | 387 | 40.2 |
In June 1996, when the number of URLs exceeded 65,000, the data structure storing the URL number had to be changed from a 2-byte to a 4-byte integer, increasing the total storage required for the database from about 57 MB to about 77 MB.
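The limit and the storage increase can be checked with a little arithmetic. The sketch below assumes the 2-byte field was an unsigned integer (maximum 65,535) and that each stem entry stores exactly one URL number; neither detail is stated above. Under those assumptions, widening the field by 2 bytes across the 9.7 million stem entries listed for June 1996 adds about 19.4 MB, which is consistent with the reported growth of roughly 20 MB.

```python
import struct

MAX_2_BYTE = 2**16 - 1  # 65535: largest value an unsigned 2-byte integer holds

# URL number 65535 still fits in 2 bytes ...
packed_ok = struct.pack("<H", MAX_2_BYTE)

# ... but 65536 does not, so a wider field is needed.
try:
    struct.pack("<H", MAX_2_BYTE + 1)
    overflowed = False
except struct.error:
    overflowed = True

packed_wide = struct.pack("<I", MAX_2_BYTE + 1)  # 4-byte unsigned integer


def extra_storage_mb(entries, old_width=2, new_width=4):
    """Extra megabytes needed when every entry's field grows from
    old_width to new_width bytes (assumes one URL number per entry)."""
    return entries * (new_width - old_width) / 1e6


growth = extra_storage_mb(9_700_000)  # about 19.4 MB
```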
In July 1999, a list of the most frequently linked pages in the index was added; at that time the search engine was receiving an average of over 500 hits a day.