Morphology Induction (level 2, 5 credit points)
Course Description
This course is an extended version of the course Minimally Supervised
Induction of Morphology (MSIM) by the Nordic Graduate School of
Language Technology (2 credit points). The extension (3 credit points)
is organised as an individual project course, including reading the
literature list of MSIM.
MSIM focuses mostly on unsupervised learning methods, which could
derive features to be used in a supervised learning method capable of
handling noisy data well.
The project consists of implementing trie-based supervised learning of
a mapping from inflected Swedish words to their base forms, as a
stand-alone analyser. Various models derived from the Stockholm-Umeå
Corpus will be evaluated 1) against the same test set (verb inflection)
as in Wicentowski's PhD thesis, and 2) against a few texts from another
corpus.
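As a rough illustration of what trie-based supervised learning of base forms could look like, the sketch below stores reversed inflected forms in a trie and counts suffix-rewrite rules at each node. It is only one possible design under stated assumptions, not the method prescribed for the project: the names make_rule, Node, SuffixTrieLemmatiser and accuracy are hypothetical, and the word pairs in the usage note are toy examples rather than data from the Stockholm-Umeå Corpus.

```python
from collections import Counter


def make_rule(inflected, base):
    """Return (suffix_to_strip, suffix_to_add) after dropping the shared prefix."""
    i = 0
    while i < min(len(inflected), len(base)) and inflected[i] == base[i]:
        i += 1
    return inflected[i:], base[i:]


class Node:
    """One trie node: child nodes by character, plus rule counts seen here."""

    def __init__(self):
        self.children = {}
        self.rules = Counter()


class SuffixTrieLemmatiser:
    """Hypothetical analyser: a trie over reversed inflected forms."""

    def __init__(self):
        self.root = Node()

    def train(self, pairs):
        # Each (inflected, base) pair adds its rule to every node on the
        # path spelled by the inflected form read from its last letter.
        for inflected, base in pairs:
            rule = make_rule(inflected, base)
            node = self.root
            node.rules[rule] += 1
            for ch in reversed(inflected):
                node = node.children.setdefault(ch, Node())
                node.rules[rule] += 1

    def lemmatise(self, word):
        # Follow the longest suffix of the word that the trie knows.
        path = [self.root]
        node = self.root
        for ch in reversed(word):
            node = node.children.get(ch)
            if node is None:
                break
            path.append(node)
        # Back off from the deepest node towards the root and apply the
        # most frequent rule whose stripped suffix matches the word.
        for node in reversed(path):
            for (strip, add), _count in node.rules.most_common():
                if word.endswith(strip):
                    return word[:len(word) - len(strip)] + add
        return word  # no applicable rule: leave the word unchanged


def accuracy(model, pairs):
    """Fraction of (inflected, base) pairs whose base form is predicted."""
    correct = sum(model.lemmatise(word) == base for word, base in pairs)
    return correct / len(pairs)
```

Trained on a handful of toy verb pairs such as ("hoppade", "hoppa"), ("kastar", "kasta") and ("läser", "läsa"), such a model generalises by suffix: an unseen form is mapped using the rule stored at the longest matching suffix. The accuracy function corresponds to the kind of held-out evaluation described above, where predictions are compared against a test set of known base forms.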
Literature
- Baroni et al 2002
-
Marco Baroni, Johannes Matiasek, and Harald Trost.
2002.
Unsupervised discovery of morphologically related words based on
orthographic and semantic similarity.
In Proceedings of the ACL-02 Workshop on Morphological and
Phonological Learning.
- Baroni 2003
-
Marco Baroni.
2003.
Yearbook of Morphology 2003, chapter Distribution-driven
morpheme discovery: A computational/experimental study, pages 213-248.
Springer, Dordrecht.
- Belkin and Goldsmith 2002
-
Mikhail Belkin and John Goldsmith.
2002.
Using eigenvectors of the bigram graph to infer morpheme identity.
In Proceedings of the ACL-02 Workshop on Morphological and
Phonological Learning.
- Cavar et al 2004
-
Damir Cavar, Joshua Herring, Toshikazu Ikuta, Paul Rodrigues, and Giancarlo
Schrementi.
2004.
On induction of morphology grammars and its role in bootstrapping.
In Proceedings of the 9th Conference on Formal Grammar.
- Clark 2001a
-
Alexander Clark.
2001a.
Learning morphology with pair hidden Markov models.
In Proceedings of the Student Workshop at ACL 2001.
- Clark 2001b
-
Alexander Clark.
2001b.
Partially supervised learning of morphology with stochastic
transducers.
In Proceedings of Natural Language Processing Pacific Rim
Symposium (NLPRS).
- Clark 2002
-
Alexander Clark.
2002.
Memory-based learning of morphology with stochastic transducers.
In Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics.
- Clark 2003
-
Alexander Clark.
2003.
Combining distributional and morphological information for part of
speech induction.
In Proceedings of EACL 2003.
- Creutz and Lagus 2002
-
Mathias Creutz and Krista Lagus.
2002.
Unsupervised discovery of morphemes.
In Proceedings of the ACL-02 Workshop on Morphological and
Phonological Learning.
- Creutz and Lagus 2004
-
Mathias Creutz and Krista Lagus.
2004.
Induction of a simple morphology for highly-inflecting languages.
In Proceedings of the 7th Meeting of the ACL Special Interest
Group in Computational Phonology (SIGPHON), pages 43-51.
- Creutz and Lagus 2005
-
Mathias Creutz and Krista Lagus.
2005.
Inducing the morphological lexicon of a natural language from
unannotated text.
In Proceedings of the International and Interdisciplinary
Conference on Adaptive Knowledge Representation and Reasoning (AKRR'05).
- Creutz and Lagus 2005
-
Mathias Creutz and Krista Lagus.
2005.
Unsupervised morpheme segmentation and morphology induction from text
corpora using Morfessor 1.0.
Technical Report A81, Helsinki University of Technology.
- Creutz and Lindén 2004
-
Mathias Creutz and Krister Lindén.
2004.
Morpheme segmentation gold standards for Finnish and English.
Technical Report A77, Helsinki University of Technology.
- Creutz et al 2005
-
Mathias Creutz, Krista Lagus, Krister Lindén, and Sami Virpioja.
2005.
Morfessor and Hutmegs: Unsupervised morpheme segmentation for
highly-inflecting and compounding languages.
In Proceedings of the Second Baltic Conference on Human Language
Technologies, pages 107-112.
- Creutz 2003
-
Mathias Creutz.
2003.
Unsupervised segmentation of words using prior distributions of morph
length and frequency.
In Proceedings of ACL-03, the 41st Annual Meeting of the
Association for Computational Linguistics, pages 280-287.
- Déjean 1998
-
Hervé Déjean.
1998.
Morphemes as necessary concept for structures discovery from untagged
corpora.
In Proceedings of the ACL-98 Workshop on New Methods in Language
Processing and Computational Natural Language Learning.
- de Marcken 1996
-
Carl G. de Marcken.
1996.
Unsupervised Language Acquisition.
Ph.D. thesis, Massachusetts Institute of Technology.
- Gaussier 1999
-
Éric Gaussier.
1999.
Unsupervised learning of derivational morphology from inflectional
lexicons.
In Proceedings of the ACL-99 Workshop on Unsupervised Learning
in Natural Language Processing.
- Goldsmith 2001
-
John Goldsmith.
2001.
Unsupervised learning of the morphology of a natural language.
Computational Linguistics, 27(2):153-198.
- Hafer and Weiss 1974
-
Margaret A. Hafer and Stephen F. Weiss.
1974.
Word segmentation by letter successor varieties.
Information Storage and Retrieval, 10:371-385.
- Hajic 2000
-
Jan Hajic.
2000.
Morphological tagging: Data vs. dictionaries.
In Proceedings of the 1st Meeting of the North American Chapter
of the Association for Computational Linguistics.
- Hajic and Hladká 1998
-
Jan Hajic and Barbora Hladká.
1998.
Tagging inflective languages: Prediction of morphological categories
for a rich structured tagset.
In Proceedings of the 36th Annual Meeting of the Association for
Computational Linguistics, pages 483-490.
- Harris 1955
-
Zellig Harris.
1955.
From phoneme to morpheme.
Language, 31:190-222.
- Harris 1967
-
Zellig Harris.
1967.
Morpheme boundaries within words: Report on a computer test.
In Transformations and Discourse Analysis Papers. Department of
Linguistics, University of Pennsylvania.
- Hirsimäki et al 2005
-
Teemu Hirsimäki, Mathias Creutz, Vesa Siivola, and Mikko Kurimo.
2005.
Morphologically motivated language models in speech recognition.
In Proceedings of the International and Interdisciplinary
Conference on Adaptive Knowledge Representation and Reasoning (AKRR'05).
- Hu et al 2005
-
Yu Hu, Irina Matveeva, John Goldsmith, and Colin Sprague.
2005.
Refining the SED heuristic for morpheme discovery: Another look at
Swahili.
In Proceedings of the Workshop on Psychocomputational Models of
Human Language Acquisition, pages 28-35.
- Hu et al 2005
-
Yu Hu, Irina Matveeva, John Goldsmith, and Colin Sprague.
2005.
Using morphology and syntax together in unsupervised learning.
In Proceedings of the Workshop on Psychocomputational Models of
Human Language Acquisition, pages 20-27.
- Hyyrö 2003
-
Heikki Hyyrö.
2003.
Practical Methods for Approximate String Matching.
Ph.D. thesis, University of Tampere.
- Karttunen and Beesley 2001
-
Lauri Karttunen and Kenneth R. Beesley.
2001.
A short history of two-level morphology.
In ESSLLI-2001 Special Event: Twenty Years of Finite-State
Morphology.
- Karypis 2003
-
George Karypis.
2003.
Cluto: A Clustering Toolkit, Release 2.1.1.
- Kazakov 1997
-
Dimitar Kazakov.
1997.
Unsupervised learning of naïve morphology with genetic algorithms.
In W. Daelemans, A. van den Bosch, and A. Weijters, editors, Workshop Notes of the ECML/MLnet Workshop on Empirical Learning of Natural
Language Processing Tasks, pages 105-112, Prague.
- Kazakov 2000
-
Dimitar Kazakov.
2000.
Achievements and prospects of learning word morphology with inductive
logic programming.
Lecture Notes in Computer Science: Learning Language in Logic,
pages 89-109.
- McCarthy et al 2004
-
D. McCarthy, R. Koeling, J. Weeds, and J. Carroll.
2004.
Finding predominant word senses in untagged text.
In Proceedings of ACL 2004.
- Needleman and Wunsch 1970
-
Saul Needleman and Christian Wunsch.
1970.
A general method applicable to the search for similarities in the
amino acid sequence of two proteins.
Journal of Molecular Biology, 48(3):443-453.
- Neuvel and Fulop 2002
-
Sylvain Neuvel and Sean A. Fulop.
2002.
Unsupervised learning of morphology without morphemes.
In Proceedings of the ACL-02 Workshop on Morphological and
Phonological Learning.
- Oflazer and Nirenburg 1999
-
Kemal Oflazer and Sergei Nirenburg.
1999.
Practical bootstrapping of morphological analyzers.
In Proceedings of CoNLL-99: Computational Natural Language
Learning.
- Oflazer and Tur 1996
-
Kemal Oflazer and Gokhan Tur.
1996.
Combining hand-crafted rules and unsupervised learning in
constraint-based morphological disambiguation.
In Proceedings of the First Conference on Empirical Methods in
Natural Language Processing.
- Oflazer et al 2001
-
Kemal Oflazer, Sergei Nirenburg, and Marjorie McShane.
2001.
Bootstrapping morphological analyzers by combining human elicitation
and machine learning.
Computational Linguistics, 27(1).
- Schone and Jurafsky 2000
-
Patrick Schone and Daniel Jurafsky.
2000.
Knowledge-free induction of morphology using latent semantic
analysis.
In Proceedings of the Fourth Conference on Computational Natural
Language Learning and of the Second Learning Language in Logic Workshop.
- Schone and Jurafsky 2001
-
Patrick Schone and Daniel Jurafsky.
2001.
Knowledge-free induction of inflectional morphologies.
In Proceedings of the Second Meeting of the North American
Chapter of the Association for Computational Linguistics.
- Schone 2001
-
Patrick Schone.
2001.
Toward Knowledge-Free Induction of Machine Readable
Dictionaries.
Ph.D. thesis, University of Colorado.
- Sharma et al 2002
-
Utpal Sharma, Jugal Kalita, and Rajib Das.
2002.
Unsupervised learning of morphology for building lexicon for a highly
inflectional language.
In Proceedings of the ACL-02 Workshop on Morphological and
Phonological Learning.
- Siivola et al 2003
-
Vesa Siivola, Teemu Hirsimäki, Mathias Creutz, and Mikko Kurimo.
2003.
Unlimited vocabulary speech recognition based on morphs discovered in
an unsupervised manner.
In Proceedings of the 8th European Conference on Speech
Communication and Technology (Eurospeech), pages 2293-2296.
- Smith and Waterman 1981
-
Temple F. Smith and Michael S. Waterman.
1981.
Identification of common molecular subsequences.
Journal of Molecular Biology, 147(1):195-197.
- Snover et al 2002
-
Matthew G. Snover, Gaja E. Jarosz, and Michael R. Brent.
2002.
Unsupervised learning of morphology using a novel directed search
algorithm: Taking the first step.
In Proceedings of the ACL-02 Workshop on Morphological and
Phonological Learning.
- Wicentowski 2004
-
Richard Wicentowski.
2004.
Multilingual noise-robust supervised morphological analysis using the
wordframe model.
In Proceedings of the Seventh Meeting of the ACL Special Interest
Group on Computational Phonology (SIGPHON), pages 70-77.
- Yarowsky and Wicentowski 2000
-
David Yarowsky and Richard Wicentowski.
2000.
Minimally supervised morphological analysis by multimodal alignment.
In Proceedings of the 38th Annual Meeting of the Association for
Computational Linguistics, pages 207-216.
- Yarowsky et al 2001
-
D. Yarowsky, G. Ngai, and R. Wicentowski.
2001.
Inducing multilingual text analysis tools via robust projection
across aligned corpora.
In Proceedings of the First International Conference on Human
Language Technology Research.
Examination
A report on the implementation and evaluation of the project.