Canonical stems and noise words

Stems

Before insertion into the index generated by the Peregrinator, words (defined as consisting of letters only: digits and punctuation are excluded) are converted to lower case and reduced to a canonical stem by removal of suffixes. The approach used is from
M. F. Porter, An algorithm for suffix stripping, Program (Automated Library and Information Systems) 14 (3) 130-7, July 1980.
Note that the suffix-stripping algorithm is entirely specific to English-language vocabulary.

Code for the implementation of this algorithm is available as:

    Porter.pm      V2.1     21 Jun 1999   7509 bytes

A small number of words (e.g., news and relativity) are handled badly by the algorithm; an exceptions list is kept for these.
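
The normalisation steps described above (letters only, lower case, exceptions list, suffix removal) can be sketched roughly as follows. This is an illustration only, not the code in Porter.pm: the names `canonical_stems`, `toy_stem`, `EXCEPTIONS` and `TOY_SUFFIXES` are hypothetical, the stems given for the exception words are assumed, and `toy_stem` is a deliberately crude stand-in that does not implement Porter's rules. The sketch also assumes that a mixed token such as "abc123" contributes only its run of letters.

```python
import re

# Hypothetical exceptions list: words mapped straight to the stem we want,
# bypassing the suffix stripper. The chosen stems are assumptions.
EXCEPTIONS = {"news": "news", "relativity": "relativity"}

# Toy suffix list for illustration only. These are NOT Porter's rules.
TOY_SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def canonical_stems(text, stem=toy_stem):
    """Return the canonical stem of every word in text, in order."""
    stems = []
    # A word is a maximal run of letters; digits and punctuation are skipped.
    for word in re.findall(r"[A-Za-z]+", text):
        word = word.lower()
        # Consult the exceptions list before falling back to the stemmer.
        stems.append(EXCEPTIONS.get(word, stem(word)))
    return stems


print(canonical_stems("Indexing 3 groups; News!"))
```

Checking the exceptions list before stemming keeps the stemmer itself unchanged; in this sketch "news" would otherwise lose its final "s". A real stemmer can be passed in through the `stem` parameter.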

Noise words

There are also lists of ``noise words'' (or ``stop words'') and ``noise stems''. Each input word is first compared exactly against the noise words and then, after stemming, against the noise stems; the word is ignored if either comparison matches. Thus common invariant words, and common stems, can be kept out of the index. The choice of noise words and stems can depend on the topic in question: for a mathematics index, for example, it is desirable not to treat ``group'' as a noise stem because, though common in everyday use, it has a technical mathematical meaning.
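
The two-stage comparison can be sketched as below. The vocabulary shown is made up for the example and is not the content of vocab.pl; the names `index_terms`, `strip_plural`, `NOISE_WORDS` and `NOISE_STEMS` are hypothetical, and `strip_plural` is a trivial stand-in for the real stemmer.

```python
# Hypothetical noise vocabulary, invented for this example.
NOISE_WORDS = {"the", "of", "and"}   # compared against the unstemmed word
NOISE_STEMS = {"use"}                # compared against the stem


def strip_plural(word):
    """Stand-in stemmer: drop one trailing 's' from words over three letters."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def index_terms(words, stem=strip_plural,
                noise_words=NOISE_WORDS, noise_stems=NOISE_STEMS):
    """Return the stems worth indexing, dropping noise words and noise stems."""
    kept = []
    for word in words:
        # Stage 1: exact match of the input word against the noise words.
        if word in noise_words:
            continue
        # Stage 2: match of the stemmed word against the noise stems.
        stemmed = stem(word)
        if stemmed in noise_stems:
            continue
        kept.append(stemmed)
    return kept


print(index_terms(["the", "groups", "uses", "of", "rings"]))
```

Because "group" is absent from both sets here, it reaches the index, which is the behaviour wanted for a mathematics collection.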

As a sample, here is the noise vocabulary used on Peregrinator's first run for MathSearch:

    vocab.pl    V1.0    12 Aug 1994    3136 bytes

JSR, Sydney Mathematics and Statistics, 26 Aug 1994 . . . SMSsearch