Tagger models
Description
The part-of-speech models are based on the following bootstrap method:
- Train training model on the entire source (labelled) corpus.
- Tag raw (unlabelled) bootstrap corpus using training model.
- Train evaluation model on tagged bootstrap corpus only.
- Evaluate evaluation model on 10 folds of source corpus.
  - Possibly broken down by genre.
  - Evaluate with tnt-diff (in the absence of a free tool). For evaluation with tnt-diff, we also need to train a lexical TnT model on the same material to get information on known and unknown words.
  - Use 10 folds to get standard deviation.
- (Train final tag model on source and bootstrapped corpus for actual usage.)
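The steps above can be sketched roughly as follows. This is only an illustration of the bootstrap flow, not the actual setup: a toy most-frequent-tag tagger stands in for TnT, accuracy is computed directly instead of with tnt-diff, and all function names (train, tag, accuracy, folds, bootstrap) are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean, stdev


def train(tagged_sents):
    """Toy stand-in for a real tagger: most frequent tag per word,
    with the overall most frequent tag as default for unknown words."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in tagged_sents:
        for word, t in sent:
            counts[word][t] += 1
            all_tags[t] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    default = all_tags.most_common(1)[0][0]
    return lexicon, default


def tag(model, sents):
    """Tag untagged sentences (lists of words) with a trained model."""
    lexicon, default = model
    return [[(w, lexicon.get(w, default)) for w in sent] for sent in sents]


def accuracy(model, gold_sents):
    """Token-level accuracy of the model against gold-standard tags."""
    lexicon, default = model
    correct = total = 0
    for sent in gold_sents:
        for word, gold in sent:
            correct += lexicon.get(word, default) == gold
            total += 1
    return correct / total


def folds(items, k=10):
    """Split items into k interleaved folds (assumes len(items) >= k)."""
    return [items[i::k] for i in range(k)]


def bootstrap(source, raw, k=10):
    # 1. Train training model on the labelled source corpus.
    training_model = train(source)
    # 2. Tag the raw bootstrap corpus with the training model.
    tagged_bootstrap = tag(training_model, raw)
    # 3. Train evaluation model on the tagged bootstrap corpus only.
    evaluation_model = train(tagged_bootstrap)
    # 4. Evaluate on k folds of the source corpus; the spread over
    #    the folds gives the standard deviation.
    scores = [accuracy(evaluation_model, fold) for fold in folds(source, k)]
    # 5. Train the final model on source plus bootstrapped corpus.
    final_model = train(source + tagged_bootstrap)
    return final_model, mean(scores), stdev(scores)
```

Here source is a list of sentences given as (word, tag) pairs and raw is a list of sentences given as plain word lists. Note that the evaluation model never sees the source corpus directly, which is why it can be evaluated on all folds of it.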
The models here are a selection of final tag models, i.e. the ones I think could be useful. More details on the models can be found in Forsbom (2006, 2008b, 2009).
Some of the bootstrapped models (Forsbom 2009) have been used by me in the following projects:
- Text-centered thesauri: Combining knowledge bases for lexical cohesion analysis in information access and text summarisation
- Text and Language in Assessment of Mathematics and Natural Science
Other bootstrapped models, for example those from Forsbom (2008b), have been used by others in the following project:
