Uppsala Persian Dependency Treebank: UPDT

What is UPDT?

The Uppsala Persian Dependency Treebank (UPDT) is a dependency-based, syntactically annotated corpus. The treebank consists of 6,000 sentences (151,671 tokens) of written text in CoNLL format. It was developed through a bootstrapping procedure that combined the open-source, data-driven dependency parser MaltParser (Nivre et al., 2006) with manual validation of the annotation.
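
As a rough illustration of the CoNLL format mentioned above, the following Python sketch reads such data sentence by sentence. It assumes the ten-column, tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with a blank line between sentences; how UPDT fills each column should be checked against the annotation guidelines. The names Token and read_conll and the sample tokens are invented for illustration and are not taken from the treebank.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """A reduced view of one CoNLL-X token line (hypothetical helper type)."""
    id: int       # column 1: token index within the sentence, starting at 1
    form: str     # column 2: word form
    postag: str   # column 5: fine-grained part-of-speech tag
    head: int     # column 7: index of the head token, 0 for the root
    deprel: str   # column 8: dependency relation to the head


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence; blank lines separate sentences."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append(
            Token(int(cols[0]), cols[1], cols[4], int(cols[6]), cols[7])
        )
    if sentence:  # the last sentence may lack a trailing blank line
        yield sentence


# Invented two-sentence sample with placeholder word forms (not UPDT data).
SAMPLE = (
    "1\tw1\t_\tN\tN\t_\t2\tnsubj\t_\t_\n"
    "2\tw2\t_\tV\tV\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tw3\t_\tV\tV\t_\t0\troot\t_\t_\n"
)

sentences = list(read_conll(SAMPLE.splitlines()))
for sent in sentences:
    print([(tok.form, tok.head, tok.deprel) for tok in sent])
```

To read a real file, the same function can be given an open file handle, for example read_conll(open(path, encoding="utf-8")), where path points to the downloaded treebank file.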

The treebank data is extracted from the open-source, validated Uppsala Persian Corpus (UPC), which was created from online material comprising newspaper articles and general texts on various topics (e.g. culture, technology, fiction, and art). The corpus is annotated with part-of-speech tags.

The treebank annotation scheme is based on Stanford Typed Dependencies (de Marneffe et al., 2006; de Marneffe and Manning, 2008). The full set of dependency relations used in the annotation, together with the guidelines for sentence segmentation, tokenization, and morphological annotation, is described in detail in the Uppsala Persian Dependency Treebank Annotation Guidelines.

Download UPDT

The treebank is licensed under the Creative Commons Attribution 3.0 License and can be downloaded below:

When using the treebank, please cite the following paper:

Seraji, Mojgan, Beáta Megyesi, and Joakim Nivre. 2012. Bootstrapping a Persian Dependency Treebank. Linguistic Issues in Language Technology 7(18), 1-10.

Parsing Experiments

The UPDT has been split sequentially into 10 segments: segments 1-8 are used for training (80%), segment 9 for development (10%), and segment 10 for testing (10%).
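
The split described above can be sketched in Python as follows. This is only an illustration under stated assumptions: that the 10 segments are contiguous, roughly equal in size, and counted in sentences rather than tokens. The function name sequential_split is hypothetical, and a list of integers stands in for the treebank's sentences.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sequential_split(sentences: Sequence[T]) -> Tuple[List[T], List[T], List[T]]:
    """Cut the data, in its original order, into 10 contiguous segments.

    Segments 1-8 form the training set, segment 9 the development set,
    and segment 10 the test set. If the total is not divisible by 10,
    the earliest segments each receive one extra item.
    """
    parts = 10
    size, remainder = divmod(len(sentences), parts)
    segments: List[List[T]] = []
    start = 0
    for i in range(parts):
        end = start + size + (1 if i < remainder else 0)
        segments.append(list(sentences[start:end]))
        start = end
    train = [item for segment in segments[:8] for item in segment]
    dev = segments[8]
    test = segments[9]
    return train, dev, test


# Stand-in for the 6,000 sentences: integers 0..5999 in corpus order.
train, dev, test = sequential_split(list(range(6000)))
print(len(train), len(dev), len(test))  # 4800 600 600 under the assumptions above
```

Because the split is sequential rather than random, no shuffling is applied, so each set keeps the original corpus order.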

Feedback and bug reports

Please contact mojgan.seraji@lingfil.uu.se with feedback and bug reports.

Participants

Acknowledgments

We would like to thank Recorded Future Inc. for their contribution and financial support for the development of the treebank.

References

1. De Marneffe, Marie-Catherine, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC).

2. De Marneffe, Marie-Catherine, and Christopher D. Manning. 2008. Stanford Typed Dependencies Representation. In Proceedings of the COLING’08 Workshop on Cross-Framework and Cross-Domain Parser Evaluation.

3. Nivre, Joakim, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC).

4. Seraji, Mojgan, Beáta Megyesi, and Joakim Nivre. 2012. A Basic Language Resource Kit for Persian. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC). Istanbul, Turkey.

5. Seraji, Mojgan, Beáta Megyesi, and Joakim Nivre. 2012. Dependency Parsers for Persian. In Proceedings of the 10th Workshop on Asian Language Resources, COLING 2012, 24th International Conference on Computational Linguistics. Mumbai, India.