Morphology Induction

Morphology Induction (level 2, 5 credit points)

Course Description

This course is an extended version of the course Minimally Supervised Induction of Morphology (MSIM), offered by the Nordic Graduate School of Language Technology (2 credit points). The extension (3 credit points) is organised as an individual project and includes reading the MSIM literature list.

MSIM focuses mostly on unsupervised learning methods. Such methods can derive features for use in a supervised learning method that handles noisy data well.

The project consists of implementing a stand-alone analyser that uses trie-based supervised learning to map inflected Swedish words to their base forms. Various models derived from the Stockholm-Umeå Corpus will be evaluated (1) against the same test set (verb inflection) as in Wicentowski's PhD thesis, and (2) against a few texts from another corpus.
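As an illustration only, the following is a minimal sketch of one way a trie-based learner of this kind could be organised; it is not the prescribed design for the project. It assumes that each training pair is reduced to a rewrite rule (ending to strip, ending to add) and that rules are counted in a trie over reversed word forms, so that an unseen word backs off to the longest ending it shares with the training data. The class name SuffixTrieLemmatizer, its methods, and the toy verb pairs are hypothetical and are not taken from the Stockholm-Umeå Corpus.

```python
from collections import Counter


class SuffixTrieLemmatizer:
    """Toy trie over reversed word forms (hypothetical design).

    Every node counts the rewrite rules of the training words that pass
    through it, so deeper nodes correspond to longer shared endings.
    """

    def __init__(self):
        self.root = {"rules": Counter(), "children": {}}

    @staticmethod
    def _rule(inflected, base):
        # A rule is (ending to strip, ending to add), taken after the
        # longest common prefix of the inflected form and the base form.
        i = 0
        while i < min(len(inflected), len(base)) and inflected[i] == base[i]:
            i += 1
        return inflected[i:], base[i:]

    def train(self, pairs):
        for inflected, base in pairs:
            rule = self._rule(inflected, base)
            node = self.root
            node["rules"][rule] += 1
            # Insert the word back to front, so endings share a path.
            for ch in reversed(inflected):
                node = node["children"].setdefault(
                    ch, {"rules": Counter(), "children": {}}
                )
                node["rules"][rule] += 1

    def lemmatize(self, word):
        # Follow the longest known ending, remembering each node passed.
        node = self.root
        path = [node]
        for ch in reversed(word):
            node = node["children"].get(ch)
            if node is None:
                break
            path.append(node)
        # Back off from the deepest node until some rule fits the word.
        for node in reversed(path):
            for (strip, add), _count in node["rules"].most_common():
                if word.endswith(strip):
                    return word[: len(word) - len(strip)] + add
        return word  # no applicable rule: return the word unchanged


# Toy training pairs (inflected verb form, base form), invented for this sketch.
TOY_PAIRS = [
    ("hoppade", "hoppa"),
    ("kastade", "kasta"),
    ("läser", "läsa"),
    ("köper", "köpa"),
    ("springer", "springa"),
]

model = SuffixTrieLemmatizer()
model.train(TOY_PAIRS)
print(model.lemmatize("pratade"))  # prata
print(model.lemmatize("skriver"))  # skriva
```

Neither test word occurs in the toy training data; each is resolved through the longest ending it shares with a training word. A real model would need a much larger training set and some treatment of stem-internal changes, which this sketch ignores.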

Literature

Baroni et al 2002
Marco Baroni, Johannes Matiasek, and Harald Trost.
2002.
Unsupervised discovery of morphologically related words based on orthographic and semantic similarity.
In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning.

Baroni 2003
Marco Baroni.
2003.
Distribution-driven morpheme discovery: A computational/experimental study.
In Yearbook of Morphology 2003, pages 213-248.
Springer, Dordrecht.

Belkin and Goldsmith 2002
Mikhail Belkin and John Goldsmith.
2002.
Using eigenvectors of the bigram graph to infer morpheme identity.
In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning.

Cavar et al 2004
Damir Cavar, Joshua Herring, Toshikazu Ikuta, Paul Rodrigues, and Giancarlo Schrementi.
2004.
On induction of morphology grammars and its role in bootstrapping.
In Proceedings of the 9th Conference on Formal Grammar.

Clark 2001a
Alexander Clark.
2001a.
Learning morphology with pair hidden Markov models.
In Proceedings of the Student Workshop at ACL 2001.

Clark 2001b
Alexander Clark.
2001b.
Partially supervised learning of morphology with stochastic transducers.
In Proceedings of Natural Language Processing Pacific Rim Symposium (NLPRS).

Clark 2002
Alexander Clark.
2002.
Memory-based learning of morphology with stochastic transducers.
In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

Clark 2003
Alexander Clark.
2003.
Combining distributional and morphological information for part of speech induction.
In Proceedings of EACL 2003.

Creutz and Lagus 2002
Mathias Creutz and Krista Lagus.
2002.
Unsupervised discovery of morphemes.
In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning.

Creutz and Lagus 2004
Mathias Creutz and Krista Lagus.
2004.
Induction of a simple morphology for highly-inflecting languages.
In Proceedings of the 7th Meeting of the ACL Special Interest Group in Computational Phonology (SIGPHON), pages 43-51.

Creutz and Lagus 2005a
Mathias Creutz and Krista Lagus.
2005a.
Inducing the morphological lexicon of a natural language from unannotated text.
In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR'05).

Creutz and Lagus 2005b
Mathias Creutz and Krista Lagus.
2005b.
Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0.
Technical Report A81, Helsinki University of Technology.

Creutz and Lindén 2004
Mathias Creutz and Krister Lindén.
2004.
Morpheme segmentation gold standards for Finnish and English.
Technical Report A77, Helsinki University of Technology.

Creutz et al 2005
Mathias Creutz, Krista Lagus, Krister Lindén, and Sami Virpioja.
2005.
Morfessor and Hutmegs: Unsupervised morpheme segmentation for highly-inflecting and compounding languages.
In Proceedings of the Second Baltic Conference on Human Language Technologies, pages 107-112.

Creutz 2003
Mathias Creutz.
2003.
Unsupervised segmentation of words using prior distributions of morph length and frequency.
In Proceedings of ACL-03, the 41st Annual Meeting of the Association for Computational Linguistics, pages 280-287.

Déjean 1998
Hervé Déjean.
1998.
Morphemes as necessary concept for structures discovery from untagged corpora.
In Proceedings of the ACL-98 Workshop on New Methods in Language Processing and Computational Natural Language Learning.

de Marcken 1996
Carl G. de Marcken.
1996.
Unsupervised Language Acquisition.
Ph.D. thesis, Massachusetts Institute of Technology.

Gaussier 1999
Éric Gaussier.
1999.
Unsupervised learning of derivational morphology from inflectional lexicons.
In Proceedings of the ACL-99 Workshop on Unsupervised Learning in Natural Language Processing.

Goldsmith 2001
John Goldsmith.
2001.
Unsupervised learning of the morphology of a natural language.
Computational Linguistics, 27(2):153-198.

Hafer and Weiss 1974
Margaret A. Hafer and Stephen F. Weiss.
1974.
Word segmentation by letter successor varieties.
Information Storage and Retrieval, 10:371-385.

Hajič 2000
Jan Hajič.
2000.
Morphological tagging: Data vs. dictionaries.
In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics.

Hajič and Hladká 1998
Jan Hajič and Barbora Hladká.
1998.
Tagging inflective languages: Prediction of morphological categories for a rich structured tagset.
In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, pages 483-490.

Harris 1955
Zellig Harris.
1955.
From phoneme to morpheme.
Language, 31:190-222.

Harris 1967
Zellig Harris.
1967.
Morpheme boundaries within words: Report on a computer test.
In Transformations and Discourse Analysis Papers. Department of Linguistics, University of Pennsylvania.

Hirsimäki et al 2005
Teemu Hirsimäki, Mathias Creutz, Vesa Siivola, and Mikko Kurimo.
2005.
Morphologically motivated language models in speech recognition.
In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR'05).

Hu et al 2005a
Yu Hu, Irina Matveeva, John Goldsmith, and Colin Sprague.
2005a.
Refining the SED heuristic for morpheme discovery: Another look at Swahili.
In Proceedings of the Workshop on Psychocomputational Models of Human Language Acquisition, pages 28-35.

Hu et al 2005b
Yu Hu, Irina Matveeva, John Goldsmith, and Colin Sprague.
2005b.
Using morphology and syntax together in unsupervised learning.
In Proceedings of the Workshop on Psychocomputational Models of Human Language Acquisition, pages 20-27.

Hyyrö 2003
Heikki Hyyrö.
2003.
Practical Methods for Approximate String Matching.
Ph.D. thesis, University of Tampere.

Karttunen and Beesley 2001
Lauri Karttunen and Kenneth R. Beesley.
2001.
A short history of two-level morphology.
In ESSLLI-2001 Special Event: Twenty Years of Finite-State Morphology.

Karypis 2003
George Karypis.
2003.
Cluto: A Clustering Toolkit, Release 2.1.1.

Kazakov 1997
Dimitar Kazakov.
1997.
Unsupervised learning of naïve morphology with genetic algorithms.
In W. Daelemans, A. van den Bosch, and A. Weijters, editors, Workshop Notes of the ECML/MLnet Workshop on Empirical Learning of Natural Language Processing Tasks, pages 105-112, Prague.

Kazakov 2000
Dimitar Kazakov.
2000.
Achievements and prospects of learning word morphology with inductive logic programming.
Lecture Notes in Computer Science: Learning Language in Logic, pages 89-109.

McCarthy et al 2004
D. McCarthy, R. Koeling, J. Weeds, and J. Carroll.
2004.
Finding predominant word senses in untagged text.
In Proceedings of ACL 2004.

Needleman and Wunsch 1970
Saul Needleman and Christian Wunsch.
1970.
A general method applicable to the search for similarities in the amino acid sequence of two proteins.
Journal of Molecular Biology, 48(3):443-453.

Neuvel and Fulop 2002
Sylvain Neuvel and Sean A. Fulop.
2002.
Unsupervised learning of morphology without morphemes.
In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning.

Oflazer and Nirenburg 1999
Kemal Oflazer and Sergei Nirenburg.
1999.
Practical bootstrapping of morphological analyzers.
In Proceedings of CoNLL-99: Computational Natural Language Learning.

Oflazer and Tur 1996
Kemal Oflazer and Gokhan Tur.
1996.
Combining hand-crafted rules and unsupervised learning in constraint-based morphological disambiguation.
In Proceedings of the First Conference on Empirical Methods in Natural Language Processing.

Oflazer et al 2001
Kemal Oflazer, Sergei Nirenburg, and Marjorie McShane.
2001.
Bootstrapping morphological analyzers by combining human elicitation and machine learning.
Computational Linguistics, 27(1).

Schone and Jurafsky 2000
Patrick Schone and Daniel Jurafsky.
2000.
Knowledge-free induction of morphology using latent semantic analysis.
In Proceedings of the Fourth Conference on Computational Natural Language Learning and of the Second Learning Language in Logic Workshop.

Schone and Jurafsky 2001
Patrick Schone and Daniel Jurafsky.
2001.
Knowledge-free induction of inflectional morphologies.
In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics.

Schone 2001
Patrick Schone.
2001.
Toward Knowledge-Free Induction of Machine Readable Dictionaries.
Ph.D. thesis, University of Colorado.

Sharma et al 2002
Utpal Sharma, Jugal Kalita, and Rajib Das.
2002.
Unsupervised learning of morphology for building lexicon for a highly inflectional language.
In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning.

Siivola et al 2003
Vesa Siivola, Teemu Hirsimäki, Mathias Creutz, and Mikko Kurimo.
2003.
Unlimited vocabulary speech recognition based on morphs discovered in an unsupervised manner.
In Proceedings of the 8th European Conference on Speech Communication and Technology (Eurospeech), pages 2293-2296.

Smith and Waterman 1981
Temple F. Smith and Michael S. Waterman.
1981.
Identification of common molecular subsequences.
Journal of Molecular Biology, 147(1):195-197.

Snover et al 2002
Matthew G. Snover, Gaja E. Jarosz, and Michael R. Brent.
2002.
Unsupervised learning of morphology using a novel directed search algorithm: Taking the first step.
In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning.

Wicentowski 2004
Richard Wicentowski.
2004.
Multilingual noise-robust supervised morphological analysis using the wordframe model.
In Proceedings of the Seventh Meeting of the ACL Special Interest Group on Computational Phonology (SIGPHON), pages 70-77.

Yarowsky and Wicentowski 2000
David Yarowsky and Richard Wicentowski.
2000.
Minimally supervised morphological analysis by multimodal alignment.
In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 207-216.

Yarowsky et al 2001
D. Yarowsky, G. Ngai, and R. Wicentowski.
2001.
Inducing multilingual text analysis tools via robust projection across aligned corpora.
In Proceedings of the First International Conference on Human Language Technology Research.

Examination

A report on the implementation and evaluation of the project.