Indexing and searching

The Peregrinator indexes the full contents of documents, and records for each stem which sentences of which documents the stem appears in.

This arrangement allows queries based not just on keywords but on "phrases": more precisely, unordered sets of stems, all of which must occur in the same sentence.

Several phrases may be specified: MathSearch returns only those documents that contain, for each phrase, at least one sentence in which the phrase occurs.
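The sentence-level index and the phrase matching described above can be sketched in a few lines of Python. This is only an illustration, not the Peregrinator's or MathSearch's actual code: the function names, the crude stem() stand-in and the sentence splitting on punctuation are all assumptions made for the example.

```python
import re
from collections import defaultdict

def stem(word):
    # Hypothetical stand-in for a real stemmer: lowercase, strip a trailing "s".
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def build_index(documents):
    """Map each stem to the set of (document number, sentence number) pairs containing it."""
    index = defaultdict(set)
    for doc_no, text in enumerate(documents):
        # Assumed sentence boundary: any run of ., ! or ?
        for sent_no, sentence in enumerate(re.split(r"[.!?]+", text)):
            for word in re.findall(r"[A-Za-z]+", sentence):
                index[stem(word)].add((doc_no, sent_no))
    return index

def match_phrase(index, phrase):
    """Return the (document, sentence) pairs in which every stem of the phrase occurs."""
    stems = [stem(w) for w in phrase.split()]
    if not stems:
        return set()
    hits = set(index.get(stems[0], set()))
    for s in stems[1:]:
        hits &= index.get(s, set())
    return hits

def search(index, phrases):
    """Return the documents that contain, for each phrase, at least one matching sentence."""
    result = None
    for phrase in phrases:
        docs = {doc_no for doc_no, _ in match_phrase(index, phrase)}
        result = docs if result is None else result & docs
    return result if result is not None else set()
```

Because a phrase is an unordered set of stems, "lie group" and "group lie" match the same sentences. Giving the two words as separate phrases is a weaker query, since they may then occur in different sentences of the same document.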

However, MathSearch also provides links to new queries for fewer phrases, and even for subphrases, provided the number of documents involved is not too large. This allows you to specify a query precisely, but to fall back to a broader query (without retyping) if the first one is unsatisfactory.
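One way such fallback queries could be generated is sketched below. The representation (a query as a list of phrases, each phrase a tuple of stems) and the function name broader_queries are assumptions for illustration. The limit on the number of documents involved is not modelled here, and how MathSearch actually chooses which links to offer is not stated.

```python
def broader_queries(query):
    """Return the queries one step broader than `query`.

    `query` is a list of phrases; each phrase is a tuple of stems.
    """
    broader = []
    # Fewer phrases: drop one whole phrase at a time.
    if len(query) > 1:
        for i in range(len(query)):
            broader.append(query[:i] + query[i + 1:])
    # Subphrases: drop one stem from one phrase at a time.
    for i, phrase in enumerate(query):
        if len(phrase) > 1:
            for j in range(len(phrase)):
                shorter = phrase[:j] + phrase[j + 1:]
                broader.append(query[:i] + [shorter] + query[i + 1:])
    return broader
```

Each returned query could be offered as a link, so that a reader can relax one constraint at a time rather than retyping the whole query.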

Try a few queries to MathSearch to see how this works in practice.

Data storage

The data generated by the Peregrinator and used by MathSearch is stored as a DBM file containing one record per stem. Each record provides an offset and a count, which together index into a flat file of (document number, sentence number) pairs. The flat file is typically around 10 megabytes, too large to store directly in the DBM file. Initially, the DBM file also contained the URL, title and modification date records for each page in the index.
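The layout described above might look roughly like the following sketch. Python's standard dbm.dumb module stands in for the original DBM library, and the binary record formats (two unsigned 32-bit integers per pair and per stem record, with the offset counted in pairs) are assumptions, as are the function names.

```python
import dbm.dumb
import struct

PAIR = struct.Struct("<II")    # assumed layout: document number, sentence number
RECORD = struct.Struct("<II")  # assumed layout: offset (counted in pairs), count

def write_index(postings, dbm_path, flat_path):
    """Write `postings` (stem -> list of (doc_no, sent_no) pairs) to the two stores."""
    with dbm.dumb.open(dbm_path, "n") as db, open(flat_path, "wb") as flat:
        offset = 0
        for stem, pairs in sorted(postings.items()):
            for doc_no, sent_no in pairs:
                flat.write(PAIR.pack(doc_no, sent_no))
            # The DBM record holds only where this stem's pairs start and how many there are.
            db[stem.encode()] = RECORD.pack(offset, len(pairs))
            offset += len(pairs)

def lookup(stem, dbm_path, flat_path):
    """Return the (doc_no, sent_no) pairs recorded for `stem`, or [] if it is unknown."""
    with dbm.dumb.open(dbm_path, "r") as db:
        key = stem.encode()
        if key not in db:
            return []
        offset, count = RECORD.unpack(db[key])
    with open(flat_path, "rb") as flat:
        flat.seek(offset * PAIR.size)
        data = flat.read(count * PAIR.size)
    return [PAIR.unpack_from(data, i * PAIR.size) for i in range(count)]
```

With this split, each DBM record stays a fixed eight bytes however often a stem occurs, and a lookup costs one DBM fetch plus one seek and one read in the flat file.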

At the start of November 1994, the .pag component of the DBM file suddenly increased in length from 30MB to 124MB, apparently because some threshold had been passed. The .pag file should be sparse, but when it is read by MathSearch the holes seem to fill up. As a result, there was insufficient disk space to store both the current version and a backup.

This problem was solved by moving all the URL data out of the DBM file, and storing it in two files, again with offsets and lengths in one and data in the other. The total storage required then dropped from 129MB to 9MB. The moral seems to be: avoid using DBM files for large amounts of data.
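A two-file store of this kind can be sketched as follows. The details are assumptions for illustration: records are taken to be addressed by page number, each index entry is a fixed-size (byte offset, byte length) pair, and each record is a single text string. The original file formats are not described.

```python
import struct

ENTRY = struct.Struct("<QI")  # assumed layout: byte offset, byte length

def write_records(records, index_path, data_path):
    """Write each record's bytes to the data file and its (offset, length) to the index file."""
    with open(index_path, "wb") as idx, open(data_path, "wb") as data:
        offset = 0
        for record in records:
            blob = record.encode("utf-8")
            idx.write(ENTRY.pack(offset, len(blob)))
            data.write(blob)
            offset += len(blob)

def read_record(n, index_path, data_path):
    """Fetch record number n using one seek in each file."""
    with open(index_path, "rb") as idx:
        idx.seek(n * ENTRY.size)
        offset, length = ENTRY.unpack(idx.read(ENTRY.size))
    with open(data_path, "rb") as data:
        data.seek(offset)
        return data.read(length).decode("utf-8")
```

Both files are written sequentially and contain no holes, so their size on disk is simply the size of the data plus twelve bytes per record.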


JSR, Sydney Mathematics and Statistics, 8 Nov 1994 (amended 5 Jan 1995) . . . SMSsearch