Canonical stems and noise words
Stems
Before being inserted into the index generated by the
Peregrinator, words (defined as consisting of
letters only: digits and punctuation are excluded) are converted to lower case
and reduced to a canonical stem by removing suffixes. The approach used is
from
M. F. Porter, "An algorithm for suffix stripping", Program (Automated
Library and Information Systems) 14 (3), 130-137, July 1980.
Note that the suffix-stripping algorithm is entirely specific to English-language
vocabulary.
Code for the implementation of this algorithm is available as:
Porter.pm V2.1 21 Jun 1999 7509 bytes
A small number of words (e.g., news and relativity) are handled
badly by the algorithm, so an exceptions list is kept for these.
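The preparation described above (letters-only words, lower-casing, an exceptions lookup, then suffix removal) can be sketched as follows. This is an illustrative sketch only, not the code in Porter.pm: it implements just step 1a of Porter's algorithm (the plural rules) rather than the full stemmer. The names `EXCEPTIONS`, `step_1a` and `canonical_stems` are hypothetical, as are the contents of the exceptions table and the choice to treat every non-letter character as a word separator.

```python
import re

# Hypothetical exceptions table: word -> stem to use in place of the
# algorithm's output. The real list kept by the Peregrinator is not shown here.
EXCEPTIONS = {"news": "news"}


def step_1a(word):
    """Apply only step 1a of Porter's algorithm (plural endings).

    Rules, tried in order: SSES -> SS, IES -> I, SS -> SS, S -> (nothing).
    The full algorithm has several further steps that are omitted here.
    """
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


def canonical_stems(text):
    """Return the stems of the letters-only words in text, in order."""
    # Assumption: any run of ASCII letters is a word; digits and
    # punctuation act as separators and are discarded.
    stems = []
    for word in re.findall(r"[A-Za-z]+", text):
        word = word.lower()
        # The exceptions table takes priority over the stemmer.
        if word in EXCEPTIONS:
            stems.append(EXCEPTIONS[word])
        else:
            stems.append(step_1a(word))
    return stems
```

Without the exceptions entry, `step_1a("news")` would return `"new"`, which shows why such a list is useful: the plural rule cannot tell that the final s of news is part of the word.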
Noise words
There are also lists of ``noise words'' (or ``stop words'') and ``noise
stems'': each input text word is compared exactly against the noise words, and
again, after stemming, against the noise stems; the word is ignored if either
comparison matches. Thus common invariant words, and common stems, can be kept
out of the index. The choice of noise words and stems can depend on the topic
in question: e.g., for a mathematics index, it is desirable not to treat
``group'' as a noise stem because, though common in everyday use, it has a
technical mathematical meaning.
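The two-stage check described above can be sketched as follows. This is a minimal illustration, not the contents of vocab.pl: the function names `index_terms` and `toy_stem` and the sample vocabularies are hypothetical, the stemmer is a trivial stand-in for the real one, and the words are assumed to be lower-cased already.

```python
def toy_stem(word):
    """Stand-in stemmer for illustration only: drops a single final 's'."""
    return word[:-1] if word.endswith("s") else word


def index_terms(words, stem, noise_words, noise_stems):
    """Return the stems to be indexed, skipping noise words and noise stems."""
    kept = []
    for word in words:
        # First check: exact comparison of the unstemmed word.
        if word in noise_words:
            continue
        # Second check: comparison of the stem against the noise stems.
        stemmed = stem(word)
        if stemmed in noise_stems:
            continue
        kept.append(stemmed)
    return kept


# Hypothetical vocabularies. Note that "group" is deliberately absent
# from the noise stems, as suggested for a mathematics index.
NOISE_WORDS = {"the", "are", "of"}
NOISE_STEMS = {"result"}

terms = index_terms(
    ["the", "groups", "results", "are", "rings"],
    toy_stem,
    NOISE_WORDS,
    NOISE_STEMS,
)
```

Here `terms` is `["group", "ring"]`: "the" and "are" are rejected as noise words, and "results" is rejected because its stem is a noise stem.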
As a sample, here is the noise vocabulary used on Peregrinator's first
run for MathSearch:
vocab.pl V1.0 12 Aug 1994 3136 bytes
JSR,
Sydney Mathematics and Statistics, 26 Aug 1994