BaseModel package for baseform or wordform lookup
Background
For many language technology applications, there is a need to find the form of a word which is found as the entry point in a lexicon, usually the baseform. In Forsbom (2007, 2006b), we report on an experiment trying to find the best-performing wordform-baseform mapping model, computed from various subsets of a Swedish vocabulary pool derived from the Stockholm-Umeå corpus (Forsbom 2006a).
The vocabulary pool is a frequency-and-dispersion-ranked list of baseforms and their wordforms. The mapper application is a standalone version of the Base Model described in Wicentowski (2002), where wordforms are stored in a suffix trie and every trie node holds suffix change probabilities filtered on part-of-speech. If the models are reversed, they can be used for wordform generation instead.
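As a rough illustration of the idea, the following Python sketch stores wordforms in a suffix trie and keeps, at each node, counts of suffix changes per part-of-speech tag. It is a simplified stand-in, not the package's Perl implementation or the full Base Model of Wicentowski (2002): the names (suffix_change, SuffixTrieModel), the backoff to the deepest matching node, the plain relative-frequency ranking, and the toy training triples with the tags NN and VB are all assumptions made for the example.

```python
from collections import defaultdict


def suffix_change(wordform, baseform):
    """Return (strip, add): the wordform suffix to remove and the
    baseform suffix to append after the longest common prefix."""
    i = 0
    while i < min(len(wordform), len(baseform)) and wordform[i] == baseform[i]:
        i += 1
    return wordform[i:], baseform[i:]


class Node:
    def __init__(self):
        self.children = {}
        # counts[pos][(strip, add)] = number of training pairs passing this node
        self.counts = defaultdict(lambda: defaultdict(int))


class SuffixTrieModel:
    """Hypothetical, simplified suffix-trie mapper (not the package's API)."""

    def __init__(self):
        self.root = Node()

    def train(self, triples):
        for wordform, pos, baseform in triples:
            change = suffix_change(wordform, baseform)
            node = self.root
            node.counts[pos][change] += 1
            # Insert the wordform from its last character towards its first.
            for ch in reversed(wordform):
                node = node.children.setdefault(ch, Node())
                node.counts[pos][change] += 1

    def lookup(self, wordform, pos, top_n=5):
        # Follow the longest known suffix; remember the deepest node
        # that has seen this part-of-speech tag.
        node = self.root
        best = node if pos in node.counts else None
        for ch in reversed(wordform):
            node = node.children.get(ch)
            if node is None:
                break
            if pos in node.counts:
                best = node
        if best is None:
            return []
        changes = best.counts[pos]
        total = sum(changes.values())
        ranked = sorted(changes.items(), key=lambda kv: (-kv[1], kv[0]))
        results = []
        for (strip, add), count in ranked:
            if wordform.endswith(strip):  # skip changes that cannot apply
                stem = wordform[:len(wordform) - len(strip)]
                results.append((stem + add, count / total))
        return results[:top_n]


# Toy training data (invented for this example).
model = SuffixTrieModel()
model.train([
    ("bilar", "NN", "bil"),
    ("stolar", "NN", "stol"),
    ("flickor", "NN", "flicka"),
    ("springer", "VB", "springa"),
])
print(model.lookup("båtar", "NN"))  # [('båt', 1.0)]
print(model.lookup("gator", "NN"))  # [('gata', 1.0)]
```

Under the same assumptions, training on (baseform, tag, wordform) triples instead would give a model for the reverse direction, which is the sense in which a reversed model can be used for wordform generation.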
Eight subsets from the vocabulary pool were used as models and evaluated in both directions on a testset of wordform-baseform mappings. Four of the models were also evaluated for baseform mapping on five randomly selected texts from the Scarrie corpus. The subsets were selected based on dispersion, tagset used, wordform frequency and frequency ratio among alternative wordforms.
For baseform mapping of the testset, six models performed on par with SWETWOL, a state-of-the-art commercial system, with 0.4-0.8% error rates for the top 5 ranked alternatives. When only the top-ranked alternative was used, the lowest error rates were 4.35% (baseform mapping) and 5.51% (wordform mapping). Wordform frequency filtering was always detrimental, but each of the other selection features gave better results in at least some situations.
For baseform mapping of the corpus texts, using only the top 1 alternative, all four models did slightly better than SWETWOL and Lexware, with 99.3% accuracy for the best two models (using the PAROLE tagset).
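For readers who want to reproduce this kind of figure on their own data, the top-n error rate can be computed along the following lines. This is a minimal Python sketch: the function name top_n_error_rate and the four toy items are invented for illustration and are not taken from the evaluation reported above.

```python
def top_n_error_rate(ranked_predictions, gold_baseforms, n):
    """Percentage of test items whose gold baseform is NOT among
    the first n ranked candidates."""
    if len(ranked_predictions) != len(gold_baseforms):
        raise ValueError("predictions and gold lists differ in length")
    if not gold_baseforms:
        raise ValueError("empty test set")
    errors = sum(
        1
        for candidates, gold in zip(ranked_predictions, gold_baseforms)
        if gold not in candidates[:n]
    )
    return 100.0 * errors / len(gold_baseforms)


# Toy data (invented): one ranked candidate list per test wordform.
predictions = [["bil"], ["gat", "gata"], ["springa"], ["mäna", "män"]]
gold = ["bil", "gata", "springa", "man"]

print(top_n_error_rate(predictions, gold, 1))  # 50.0
print(top_n_error_rate(predictions, gold, 5))  # 25.0
```

The gap between the two numbers mirrors the pattern reported above: an item counts as correct at top 5 as soon as the right baseform appears anywhere among the first five candidates, so top 5 error rates are never higher than top 1 error rates.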
Comment
During the discussion of the paper, Hercules Dalianis suggested using the CST Lemmatizer (Jongejan and Haltrup 2005). I did a quick test on baseform mapping with the top 1 alternative, using the FullFull model and the corresponding test set. The results were a bit worse: a 6.05% error rate, compared to 4.35% (FullFull) and 5.14% (BaseFull) with BaseModel. With a frequency file, the results were slightly worse still: 6.12%.
Package
This package includes a Perl module that computes a wordform-baseform mapping model filtered on part-of-speech tags, a server wrapper example (in Perl), and two client examples (in Perl and Java). The data to be used for the mapping model is a tabulated text file with wordform, part-of-speech-tag, and baseform.
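A reader for such a data file might look like the Python sketch below. It assumes one entry per line with exactly three tab-separated fields in the order wordform, part-of-speech tag, baseform; the exact delimiter and encoding should be checked against the package documentation. The function name read_triples and the sample entries (with NN and VB as placeholder tags) are hypothetical.

```python
import io


def read_triples(stream):
    """Yield (wordform, pos_tag, baseform) tuples from a tab-separated
    stream, skipping blank lines. The field order is an assumption."""
    for line_no, line in enumerate(stream, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 tab-separated fields, "
                f"got {len(fields)}"
            )
        yield tuple(fields)


# Invented sample standing in for a real data file.
sample = io.StringIO(
    "bilar\tNN\tbil\n"
    "flickor\tNN\tflicka\n"
    "\n"
    "springer\tVB\tspringa\n"
)
triples = list(read_triples(sample))
print(triples)
```

For a real file, the same function could be given a handle opened with open(path, encoding="utf-8"), assuming the data is UTF-8 encoded.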
The scripts and models are licensed under the GNU General Public License. The testsets are licensed under the Creative Commons ShareAlike 1.0 license (http://creativecommons.org/licenses/sa/1.0/), since they are derived from DSSO (Westerberg 2003), which is distributed under that license.
The package has been tested on Linux (kernel 2.4.22 on Mandrake 9.2, 2.6.14 on Fedora Core 4, and 2.6.17-11 on Ubuntu 6.10).
- Download as package (gzipped tar file)
- View documentation
- Try a demo
References
Eva Forsbom. 2007. Inducing Baseform Models from a Swedish Vocabulary Pool. In Proceedings of the 16th Nordic Conference of Computational Linguistics NODALIDA-2007, pp. 51-58. Tartu, Estonia, May 25-26. (pdf)
Eva Forsbom. 2006a. A Swedish Base Vocabulary Pool. Presentation at the Swedish Language Technology Conference. Göteborg, October 27-28. (Extended abstract pdf, BaseVocabulary package.)
Eva Forsbom. 2006b. Inducing Baseform Models from a Swedish Vocabulary Pool. (pdf, longer and older version of the NODALIDA paper above)
Bart Jongejan and Dorte Haltrup. 2005. The CST Lemmatizer. Version 2.9 (6 October 2005). Center for Sprogteknologi, University of Copenhagen. (http://www.cst.dk/download/cstlemma/current/doc/cstlemma.pdf)
Tom Westerberg. 2003. Den Stora Svenska Ordlistan [The Large Swedish Dictionary]. Version 1.13. (http://dsso.se/)
Richard Wicentowski. 2002. Modeling and Learning Multilingual Inflectional Morphology in a Minimally Supervised Framework. PhD thesis, Johns Hopkins University, Baltimore, Maryland, USA.
