Paice/Husk stemmer

 back   


Principle  -   Use  -   Performances  -   Add a new language  -   Toolkit  -   Download  -   Java version  -   Links


Principle
What is a stemmer? What do I need it for?

A stemmer is an algorithm for removing inflectional and derivational endings in order to reduce word forms to a common stem.
Search engines use stemmers to better perform the information retrieval. The key terms of a query or document are represented by stems rather than by the original words. Different variants of a term can then be conflated to a single representative form, and it reduces the dictionary size, that is, the number of distinct terms needed for representing a set of documents. A smaller dictionary size results in a saving of storage space and processing time.

Stemmers families and the Paice/Husk algorithm

Two main families of stemmers are presented in the litterature: the algorithmic stemmers and the dictionary-based stemmers. An algorithmic stemmer is often faster and enables the stemming of unknown words (actually, each word is unknown to it). It has a higher error rate, but can conflate words that should not be grouped together (a phenomenon known as over-stemming). The dictionary-based approach does not make any error on known words, but can produce some on those it does not list. It is slower too, and needs nevertheless to remove some suffixes before looking-up the given word in the dictionary.
The Paice/Husk algorithm belongs to the algorithmic stemmers branch. It is based on a set of rules to extract stems, and besides keeps those rules outside its code. Thus, it is possible to process a new language simply by using another set of rules without rewriting its code, with just a few adjustments (for each language, it needs the list of its accented vowels and the validity rules for its stems). So that algorithm can easily handle a new language.

Top of page

Natural Language Processing and Text Mining Natural Language Processing and Text Mining
Springer
Text Mining: Applications and Theory Text Mining: Applications and Theory
Wiley
Foundations of Statistical Natural Language Processing Foundations of Statistical Natural Language Processing
Christopher D. Manning, Hinrich Schütze
The MIT Press

Use

The Paice/Husk algorithm implementation is composed of a set of functions that use the stems extraction rules set to a given word and check the acceptability of the proposed stem (PaiceHuskStemmer.php), and the set of rewriting rules (PaiceHuskStemRules_en.php for English - adapted from the rules at The Lancaster (Paice/Husk) Stemming Algorithm -, and PaiceHuskStemRules_fr.php for French).
The main function takes the word from which we would like to get the stem and the language code ('en' for English, and 'fr' for French). Thus, PaiceHuskStemmer('abusively','en') will return the stem 'abusive', whereas PaiceHuskStemmer('constitutionnellement','fr') will return the stem 'constitu'.

Top of page

Performances

First of all, this implementation in PHP is not intended to be the fastest one. It sure can be accelerated a lot, and the mere question of using the PHP language is a real one when trying to extract stems from a real-world text unit of a given size. Nevertheless, it should be useful to compare that stemmer with a given set of rules to another one or to the same with a different set of rules. The following metrics are listed in the Stemming performance page and implemented in the function getStatistics from the StemmingToolKit.php.

* Number of words per conflation class
This is the average size of the groups of words coverted to a particular stem (regardless of whether they are all correct). Thus, if the words “engineer”, “engineering” and “engineered”, and no others, were all stemmed to the stem “engineer”, then the size of that conflation class would be 3. If the conflation of 1,000 different words resulted in 250 distinct stems, then the mean number of words per conflation class would be 4.
This metric is obviously dependent on the number of words processed, but for a word collection of given size, a higher value indicates a heavier stemmer. The value is easily calculated as follows:
MWC = Mean number of words per conflation class
N = Number of unique words before Stemming
S = Number of unique stems after Stemming
MWC = N / S

* Index Compression Factor
The Index Compression Factor represents the extent to which a collection of unique words is reduced (compressed) by stemming, the idea being that the heavier the Stemmer, the greater the Index Compression Factor. This can be calculated by:
ICF = Index Compression Factor
N = Number of unique words before Stemming
S = Number of unique stems after Stemming
ICF = (N - S) / N

* The Word Change Factor
This is a simply the proportion of the words in a sample that have been changed in any way by the stemming process, the idea being that the larger the number of words that are altered, the greater the strength of the Stemmer.

* Number of Characters Removed
There are various metrics which use the principle that strong stemmers remove more characters from words than weaker stemmers. One way is to compute the average number of letters removed when a stemmer is applied to a text collection. Thus, suppose that the nine words “react”, “reacts”, “reacting”, “reacted”, “reaction”, “reactions”, “reactive”, “reactivity” and “reactivities” are all reduced to “react”; the numbers of characters removed are then 0, 1, 3, 2, 3, 4, 3, 5 and 7, respectively. This gives a mean removal rate of 28/9 = 3.11.

The sample text is taken from the novel “Pêcheur d'Islande” (Fishers from Island, 1886) by Pierre Loti, ie a sample of the same type (novel) and from the same period (end of the 19th century) than those used as corpus to extract the stemming rules.
Sample file: sample_fr.inc.php (view source)
Stop list file: stoplist_fr.inc.php (view source)

Size of the sample: 23,135 words (dispatched in 427 text units)
Number of unique words: 4,278

 Paice/Husk stemmer 
Number of stems found:  3,265
Number of unique stems found:  3,265
Min/Max value of found stems length:  2 / 15
Mean value of found stems length: 5.63
Number of words per conflation class:  1.31
Index Compression:  0.24
Word Change Factor:  0.26
Number of characters removed:  10,079
Mean removal rate:  1.12
Median removal rate:  1

Top of page

Add a new language
How to stem a new language (French, for instance)?

In the file PaiceHuskStemmer.php, we need to:
          - create a new rule pattern for that language, including all accented letters
               Accented forms of letters in French are: à, â, è, é, ê, ë, î, ï, ô, û, ù and ç.
               So we can now add the following line:

$rule_pattern_fr = "/^([a-zàâèéêëîïôûùç]*)(\*){0,1}(\d)([a-zàâèéêëîïôûùç]*)([.|>])/";
          - list all the vowels for that language
               French vowels are: a, à, â, e, è, é, ê, ë, i, î, ï, o, ô, u, û, ù and y.
          - add new case in the checkAcceptability function to define the rules of writing stems (we need there the list of that language vowels)

The big part is about writing out the ordered list of rewriting rules to extract the stems, ie the files PaiceHuskStemRules_en.php (for English) and PaiceHuskStemRules_fr.php (for French). We have to use a corpus of a minimal size (about 1 million words seems to be a good start). The French corpus is composed of ten novels (source: Gallica): Hernani ou L'honneur castillan (1830), Lucrèce Borgia (1833), Marie Tudor (1833), Ruy Blas (1905), Le Rhin, Lettres à un ami (1906) and Les travailleurs de la mer (1911) by Victor Hugo, and Nana (1880), Germinal (1885), La bête humaine (1890) and La débâcle (1892) by Emile Zola.
That corpus counts 1,172,726 different forms.
In order to be able to extract the stems from a given text, we have to list the more used suffixes in its language. The words frequency in the corpus has been computed, then they have been sorted by frequency and listed in reverse form (to highlight the suffix), then by alphabetical order for each frequency (listSuffixes function of the StemmingToolKit.php). It has then been possible to identify the most common suffixes to extract of it a set of rewriting rules. The rewriting rules are therefore found in an incremental way. That method enables the coverage of the most used vocabulary for that language, but it can also give quite a big number of rules.

Step by step method
- take the first word of the frequency-sorted list (corpus_stems_fr.txt file generated by the listSuffixes function of the StemmingToolKit.php)
- link the form with its stem in the test file words2stems_fr.inc.php (view source)
- run the Paice/Husk algorithm on that file (developPaiceHusk function of the StemmingToolKit.php)
If the found stem does not match our expectation, we have to add/modify/reorder the PaiceHuskStemRules_fr.php rules and even alter some stems from the test file until the algorithm returns the “good” stem for each one from the test file. We can see here that we have to make choices to define which words will have the same stems as well as the strength of the stemmer that will use those rules.

Top of page

Toolkit
List of the StemmingToolKit.php library functions

getStatistics:  displays some statistics for the Paice/Husk stemmer
other used functions: standardizePunctuation and indexText
requires: PaiceHuskStemmer.php, the stop list array for the language (stoplist_en.inc.php, stoplist_fr.inc.php (view sources))
standardizePunctuation:  returns the given text in which each punctuation mark has a space before and after it
localizePunctuation:  uses typographic rules to return a well-written text from a standardized-punctuated one
indexText:  indexes the given text and returns an array of:
     - 'original': the original text
     - 'modified': the modified text, ie the standardized-punctuation form
     - 'index': an array that lists for each word of the modified text that is not in the stop list its form (the form of the word in the original text), its index in the modified text and its stem.
listSuffixes:  lists the most common suffixes of a corpus in the given language which are not listed in the matching words2stems_<language>.inc.php file
requires: corpus_fr.txt (for French) which is a plain-text corpus, and words2stems_fr.inc.php file (for French) which defines an array associating for each word its stem (optional, useful if you do not want to list words for which your rules already return the stem you were looking for)
returns: its output in the file corpus_stems_fr.txt (for French)
output sample:
		frequency: 0.01474%
		---------
		atser	(resta)
		ecnirp	(prince)
		snolla	(allons)
		xuoneg	(genoux)
developPaiceHusk:  displays for each word of the words2stems array the word and the returned stem after having processed it with the new set of rules if the returned stem is different from the expected one
requires: words2stems_fr.inc.php file (for French) which defines an array associating for each word its expected stem
Top of page

Download

This application is public domain: you can download it and use it for whatever you will. If you bring any modification to it you wish to share, the new version will be added on this site.
PaiceHuskStemmer.zip (80kb)


Java version

A Java version of this application has been translated by Tito Colonna who I want to thank for his comments and his critical reading of the code. That version handles only French (but adding another language is relatively easy), does not use a stoplist, nor does handle compound words. The compound words are not handled either by the PHP version, as we consider that two words linked by a hyphen make only one (in English, beggar-my-neighbour has a specific meaning that would be lost by separating the three words in beggar my neighbour), and the normalization does not handle it either.
PaiceHuskStemmer.java.txt


Links
The Lancaster (Paice/Husk) Stemming Algorithm
The Porter Stemming Algorithm
Snowball (includes lists of accented letters, sample vocabulary with matching stems, stop lists...)


 back