DReaM: The Dictionary/Grammar Reading Machine: Computational Tools for Accessing the World's Linguistic Heritage (2018-2020)

The DReaM Project is a JPICH Digital Heritage-funded project.

Project Members

  • Sweden
  • Netherlands
  • France

    Associate Partners (APs)

  • AP1: Prof. Dr. Gerhard Jäger, University of Tübingen, Tübingen, Germany
  • AP2: Dr. Justyna Olko, University of Warsaw, Warsaw, Poland
  • AP3: Prof. Dr. Qibin Ran, Nankai University, Tianjin, China
  • AP4: Prof. Dr. Valery Solovev, Kazan Federal University, Kazan, Russia
  • AP5: Dr. Guillaume Jacques, Centre de recherches linguistiques sur l'Asie orientale, Paris, France
  • AP6: Dr. Dmitry Idiatov, Langage, Langues et Cultures d'Afrique Noire (LLACAN), Paris, France
  • AP7: Prof. Dr. Martin Haspelmath, Max Planck Institute for the Science of Human History, Jena, Germany

    Work Packages

    Work Package  Description                               Responsibility
    WP1.1         document scanning                         Harald Hammarström and Søren Wichmann
    WP1.2         OCR and OCR postcorrection                Søren Wichmann and Shafqat Virk
    WP1.3         importing data to corpus infrastructures  Systems Developer (N. N.)
    WP2.1         digitization of dictionaries              Guillaume Segerer and PhD Student (N. N.)
    WP2.2         web interface for digital dictionaries    Guillaume Segerer and PhD Student (N. N.)
    WP2.3         dictionary App development                Guillaume Segerer and Rémy Bonnet
    WP2.4         surveys and evaluation                    PhD Student (N. N.)
    WP3.1         linguistic Information Extraction         Søren Wichmann, Shafqat Virk and Harald Hammarström
    WP3.2         language Factoid Database                 Søren Wichmann, Shafqat Virk and Harald Hammarström
    WP3.3         presentation of results                   Harald Hammarström, Marian Klamer and Stéphane Robert

    Bibliography of relevant publications

    This bibliography is also available as a .bib file.

    Bender, Emily M., Joshua Crowgey, Michael Wayne Goodman & Fei Xia. (2014) Learning Grammar Specifications from IGT: A Case Study of Chintang. In Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages, 43-53. Baltimore, Maryland, USA: Association for Computational Linguistics.
    Bickel, Balthasar. (2015) Distributional typology: statistical inquiries into the dynamics of linguistic diversity. In Bernd Heine & Heiko Narrog (eds.), The Oxford Handbook of Linguistic Analysis, 901-923. 2nd edn. Oxford: Oxford University Press.
    Borin, Lars, Shafqat Virk & Anju Saxena. (2016) Towards a Big Data View on South Asian Linguistic Diversity. In WILDRE-3 - 3rd Workshop on Indian Language Data: Resources and Evaluation, 87-92. ELRA.
    Cooper, Doug. (2014) Logistics of the Asia-Pacific Linguistic Data Warehouse. Paper presented at the Language Comparison with Linguistic Databases: RefLex and Typological Databases, 7-8 Oct 2014.
    Cysouw, Michael. (2011) Typology without Types: Quantitatively inducing a Numeral Typology. Poster presented at the 9th biannual meeting of the Association for Linguistic Typology, ALT9, Hong Kong, China.
    Dryer, Matthew S. (2006) Descriptive theories, explanatory theories, and basic linguistic theory. In Felix Ameka, Alan Dench & Nicholas Evans (eds.), Catching Language: Issues in Grammar Writing, 207-234. Berlin: Mouton de Gruyter.
    Dryer, Matthew. (forthcoming) World Atlas of Word Order in Language. Oxford: Oxford University Press.
    Evans, Nicholas & Stephen Levinson. (2009) The Myth of Language Universals: Language diversity and its importance for cognitive science. Behavioral and Brain Sciences 32(5). 429-492.
    Güldemann, Tom. (2010) "Sprachraum" and geography: Linguistic macro-areas in Africa. In Alfred Lameli, Roland Kehrein & Stefan Rabanus (eds.), Language and Space: An International Handbook of Linguistic Variation Volume 2: Language Mapping (Handbooks of Linguistics and Communication Science 30/2), 561-585. Berlin: Mouton de Gruyter.
    Hammarström, Harald, Shafqat Mumtaz Virk & Markus Forsberg. (2017) Poor Man's OCR Post-Correction: Unsupervised Recognition of Variant Spelling Applied to a Multilingual Document Collection. In Proceedings of the Digital Access to Textual Cultural Heritage (DATeCH) conference, 71-75. Göttingen: ACM.
    Hammarström, Harald, Shafqat Virk & Markus Forsberg. (2017) Extracting Grammar from Grammars: From Raw-Text Descriptions to Grammatical Characteristics of the Languages of the World. Presentation at the Computational Linguistics Seminar, Uppsala.
    Hammarström, Harald, Shafqat Virk & Markus Forsberg. (2017) Automatically Filling in Grambank. Presentation at the Glottobank meeting, Waiheke.
    Hammarström, Harald. (2013) Three Approaches to Prefix and Suffix Statistics in the Languages of the World. Paper presented at the Workshop on Corpus-based Quantitative Typology (CoQuaT 2013).
    Harris, Zellig S. (1951) Methods in structural linguistics. Chicago: University of Chicago Press.
    Himmelmann, Nikolaus. (2014) Asymmetries in the prosodic phrasing of function words: Another look at the suffixing preference. Language 90(4). 927-960.
    Kamholz, David, Jonathan Pool & Susan Colowick. (2014) PanLex: Building a Resource for Panlingual Lexical Translation. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk & Stelios Piperidis (eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Reykjavik, Iceland: European Language Resources Association (ELRA).
    Littell, Patrick, Aidan Pine & Henry Davis. (2017) Waldayu and Waldayu Mobile: Modern digital dictionary interfaces for endangered languages. Association for Computational Linguistics.
    Macklin-Cordes, Jayden L., Nathaniel L. Blackbourne, Thomas J. Bott, Jacqueline Cook, T. Mark Ellison, Jordan Hollis, Edith E. Kirlew, Genevieve C. Richards, Sanle Zhao & Erich R. Round. (2017) Robots who read grammars. Poster presented at CoEDL Fest 2017, Alexandra Park Conference Centre, Alexandra Headlands, QLD.
    Manning, Christopher D., Prabhakar Raghavan & Hinrich Schütze. (2008) Introduction to Information Retrieval. Cambridge: Cambridge University Press.
    Mikolov, Tomas, Ilya Sutskever, Kai Chen, Gregory S. Corrado & Jeffrey Dean. (2013) Distributed Representations of Words and Phrases and their Compositionality. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani & Kilian Q. Weinberger (eds.), Advances in Neural Information Processing Systems 26 (NIPS 2013), 3111-3119. Lake Tahoe, Nevada: Neural Information Processing Systems.
    Nivre, Joakim, Željko Agić, Lars Ahrenberg & Maria Jesus Aranzabe. (2017) Universal Dependencies 2.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University in Prague.
    Plank, Frank. (2009) WALS values evaluated. Linguistic Typology 13(1). 41-75.
    Polyakov, Vladimir N., Valery D. Solovyev, Søren Wichmann & Oleg Belyaev. (2009) Using WALS and Jazyki Mira. Linguistic Typology 13. 137-167.
    Saussure, Ferdinand de. (1916) Cours de linguistique générale. Paris: Payot.
    Segerer, Guillaume. (2016) RefLex: la reconstruction sans peine. Faits de Langues 47. 201-214.
    Tsunoda, Tasaku. (2005) Language Endangerment and Language Revitalization (Trends in Linguistics: Studies and Monographs 148). Berlin: Mouton de Gruyter.
    Virk, Shafqat Mumtaz, Lars Borin, Anju Saxena & Harald Hammarström. (2017) Automatic Extraction of Typological Linguistic Features from Descriptive Grammars. In Kamil Ekštein & Václav Matoušek (eds.), Text, Speech, and Dialogue: 20th International Conference, TSD 2017, Prague, Czech Republic, August 27-31, 2017, Proceedings (Lecture Notes in Computer Science 10415), 111-119. Berlin: Springer.
    Virk, Shafqat, Markus Forsberg & Harald Hammarström. (2017) TextCat for Language Profiling. Submitted.
    Xia, Fei, William D. Lewis, Michael Wayne Goodman, Glenn Slayden, Ryan Georgi, Joshua Crowgey & Emily M. Bender. (2016) Enriching a massively multilingual database of interlinear glossed text. Language Resources and Evaluation 50(2). 1-29.