In this lab, you will have the opportunity to get familiar with Docent, the document-level decoder for phrase-based SMT developed by the computational linguistics group at Uppsala University. You will learn how to run the decoder and explore its most important parameters.
This lab is designed to be completed during class time. Since you may not be able to finish all the assignments, you may choose whether you want to start working with the readability models or with the search operations (see below). You aren't required to submit a written report on this lab. Instead, we ask you to tell the others orally about your experiments and findings at the end of the class.
Students who are not present during class time will be asked to do the lab on their own and submit a written report.
Docent is a decoder for phrase-based SMT that translates complete documents and makes it possible to create feature models having access to the entire document context, including its translation proposed by the MT system. It is based on local search with hill climbing instead of the dynamic programming algorithm called stack decoding that is used in most other SMT decoders.
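The difference from stack decoding is easier to see with a toy example. The following Python sketch is not Docent's code. It only illustrates the general idea of hill climbing over a complete state: start from a full candidate, propose a random local change, and keep the change only if the score improves. The scoring function, the word-swap proposal and the target sentence are made-up stand-ins for the real feature models and search operations.

```python
import random

def hill_climb(state, score, propose, max_steps, rng):
    """Generic hill climbing: accept a proposal only if it scores strictly higher."""
    best = score(state)
    for _ in range(max_steps):
        candidate = propose(state, rng)
        candidate_score = score(candidate)
        if candidate_score > best:
            state, best = candidate, candidate_score
    return state, best

# Toy problem (hypothetical): reorder words so that they match a reference order.
TARGET = ["the", "decoder", "translates", "whole", "documents"]

def toy_score(words):
    """Number of positions where the word agrees with TARGET."""
    return sum(1 for w, t in zip(words, TARGET) if w == t)

def swap_two(words, rng):
    """Propose a new state by swapping two randomly chosen positions."""
    i, j = rng.sample(range(len(words)), 2)
    new = list(words)
    new[i], new[j] = new[j], new[i]
    return new

rng = random.Random(0)
start = ["documents", "whole", "the", "translates", "decoder"]
final, final_score = hill_climb(start, toy_score, swap_two, 1000, rng)
print(final, final_score)
```

Note that the search always holds a complete candidate, so a score function may look at the whole state at any time. This is what makes document-level features possible in this kind of decoder.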
Docent is open source software and is released on GitHub. There's also some documentation on the GitHub site: https://github.com/chardmeier/docent/wiki.
The search algorithm implemented in Docent is described in the following publication:
Hardmeier, C., Nivre, J. and Tiedemann, J. Document-Wide Decoding for Phrase-Based Statistical Machine Translation. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1179-1190, Association for Computational Linguistics, 2012.
The software itself is described in a system demonstration paper:
Hardmeier, C., Stymne, S., Tiedemann, J. and Nivre, J. Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 193-198, Association for Computational Linguistics, 2013.
In this lab, we're going to work with two document-level features, TTR (type-token ratio) and OVIX, designed to improve text readability. You can find details about them in our Nodalida paper:
Stymne, S., Tiedemann, J., Hardmeier, C. and Nivre, J. Statistical Machine Translation with Readability Constraints. In: Proceedings of Nodalida 2013, pages 375-386, NEALT, 2013.
The Docent decoder potentially creates many output files, so we recommend that you start by creating a working directory and make sure that the current path of your shell is set to that working directory whenever you run Docent.
Start by familiarising yourself with the file formats used by Docent. The setup of the decoder, including the description of the feature models, their weights, the search parameters etc., is specified in an XML configuration file, whose format is described on the Wiki page.
You will find a working configuration file for a Swedish-English system in /local/kurs/mt/lab-docent/config.xml. Copy this file to your newly created working directory. The configuration file refers to a number of other files such as the phrase table and language model. You needn't copy those; it's enough to have your own copy of the configuration to work on. Take a good look at the configuration file and compare its contents with the description on the Wiki page. Note that some models and weights have been commented out by adding <!-- and --> markers.
The input text for a document-level decoder also requires a special encoding, because information about document boundaries must be retained. The standard input format used by Docent is an XML-based format called NIST-XML that is commonly used by MT competitions. We've provided an input file for you to work with. You can find it in /local/kurs/mt/lab-docent/testset.xml. The test set contains two small excerpts from Europarl and a newspaper article from Dagens Nyheter (3 June 2013). Take a look at the file. Note that the text in the file is tokenised.
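If you want to inspect the input programmatically, a few lines of Python are enough. The sketch below assumes the usual NIST-XML layout with doc elements (carrying a docid attribute) that contain seg elements. The inline sample is invented for illustration, so check the element and attribute names against the actual testset.xml before relying on it.

```python
import xml.etree.ElementTree as ET

# Invented sample in the assumed NIST-XML layout (not taken from testset.xml).
SAMPLE = """<srcset setid="toy" srclang="sv">
  <doc docid="doc1">
    <seg id="1">det här är en mening .</seg>
    <seg id="2">det här är en annan mening .</seg>
  </doc>
  <doc docid="doc2">
    <seg id="1">ett nytt dokument börjar här .</seg>
  </doc>
</srcset>
"""

def read_documents(xml_text):
    """Map each document id to its list of tokenised segments."""
    root = ET.fromstring(xml_text)
    documents = {}
    for doc in root.iter("doc"):
        segments = [(seg.text or "").split() for seg in doc.iter("seg")]
        documents[doc.get("docid")] = segments
    return documents

docs = read_documents(SAMPLE)
for docid, segments in docs.items():
    n_tokens = sum(len(s) for s in segments)
    print(docid, len(segments), "segments,", n_tokens, "tokens")
```

Since the text is already tokenised, splitting on whitespace is sufficient to get the tokens. To read the real test set, you could pass the contents of the file to read_documents instead of SAMPLE.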
Here's how to invoke the decoder:
/local/kurs/mt/bin64/detailed-docent -b burnin -i sampleInterval -x maxSteps \
    -n /local/kurs/mt/lab-docent/testset.xml config.xml outstem
Here, config.xml is your copy of the configuration file, and all files created will have names starting with outstem. The other three parameters, burnin, sampleInterval and maxSteps, are iteration counts. The decoder will first run for burnin iterations without creating any output. Then it will dump its current state to a file and continue running, creating a new file every sampleInterval iterations. After maxSteps iterations, it will stop.
To begin with, try running the decoder for a small number of iterations and see what happens. For this lab, we recommend that you generally set the burnin period to 0. You could start by running the decoder for 10000 iterations and sample every 1000 iterations or so. 10000 iterations will not be enough to create good translations, but it will give you an impression of how the decoder works and how long decoding takes. Then you could gradually raise the maxSteps parameter, and remember to adjust the sampleInterval so the total number of dumps produced remains reasonable. Try to find out for how many iterations you need to run Docent before additional running time no longer gives you a noticeable improvement.
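To keep the number of dumps under control, it can help to work out the sampling schedule before you start a run. The sketch below is a small planning aid, not part of Docent. It assumes that the first dump happens right after the burn-in period and that a dump is also written at maxSteps when that iteration falls on the schedule; the decoder's exact behaviour at the boundaries may differ. The function names are our own.

```python
def dump_points(burnin, sample_interval, max_steps):
    """Iteration counts at which a state dump would be expected (assumed schedule)."""
    if sample_interval <= 0:
        raise ValueError("sample_interval must be positive")
    return list(range(burnin, max_steps + 1, sample_interval))

def interval_for(max_steps, burnin=0, target_dumps=10):
    """Pick a sample interval that yields roughly target_dumps dumps after burn-in."""
    return max(1, (max_steps - burnin) // target_dumps)

for max_steps in (10_000, 100_000, 1_000_000):
    interval = interval_for(max_steps)
    points = dump_points(0, interval, max_steps)
    print(max_steps, interval, len(points))
```

With a fixed target of about ten dumps, the interval simply grows in proportion to maxSteps, so longer runs do not flood your working directory with files.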
Take a look at the modifications made by the decoder between the various sampling points. In /local/kurs/mt/lab-docent, you will find a script called compare.sh that helps you compare two output files by showing just the output lines that are different.
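If you would rather do the comparison in Python, the idea can be sketched as follows. This is not the compare.sh script, whose implementation we don't reproduce here. It assumes that both outputs contain one segment per line in the same order, so that lines can be compared position by position. The example sentences are invented.

```python
def differing_lines(lines_a, lines_b):
    """Return (line number, line in A, line in B) for every position that differs."""
    diffs = []
    for number, (a, b) in enumerate(zip(lines_a, lines_b), start=1):
        if a != b:
            diffs.append((number, a, b))
    return diffs

# Invented example: two dumps of the same three-segment document.
earlier = ["this is a test .", "the house is red .", "we thank you ."]
later = ["this is a test .", "the house is red .", "we would like to thank you ."]

for number, a, b in differing_lines(earlier, later):
    print(f"{number}\n  < {a}\n  > {b}")
```

For real output files, you could read each file with something like open(path, encoding="utf-8").read().splitlines() and pass the two resulting lists to differing_lines.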
The configuration file contains disabled entries for two readability models, TTR and OVIX. TTR (type-token ratio) is the ratio of the number of types, i.e. unique words, to the number of tokens, i.e. the total number of words, in a text. A text with a low type-token ratio has less lexical variation than a text with a high type-token ratio. OVIX is a reformulation of the type-token ratio that is less sensitive to document length, which can be a problem with the type-token ratio in some contexts. For the formula of OVIX, see the Nodalida paper linked above. Note that the type-token ratio only captures one aspect of readability. There are many other aspects that are not treated by these models.
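To get a feeling for the two measures, you can compute them yourself on small texts. The sketch below uses the OVIX formula as it is commonly given in the literature, log(tokens) / log(2 - log(types) / log(tokens)); please check it against the Nodalida paper, which is the authoritative source for this lab. It works on whitespace-tokenised text, does no lowercasing, and is not the implementation used inside Docent.

```python
import math

def ttr(tokens):
    """Type-token ratio: number of distinct words divided by number of words."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens)

def ovix(tokens):
    """OVIX as commonly defined: log(n) / log(2 - log(types) / log(n))."""
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    if n_tokens < 2:
        raise ValueError("OVIX needs at least two tokens")
    denominator = math.log(2 - math.log(n_types) / math.log(n_tokens))
    if denominator == 0:
        # Every token is a distinct type, so the index is unbounded.
        return math.inf
    return math.log(n_tokens) / denominator

text = "the cat saw the dog and the dog saw the cat".split()
doubled = text + text  # same vocabulary, twice as long

print("TTR: ", ttr(text), ttr(doubled))
print("OVIX:", ovix(text), ovix(doubled))
```

Repeating the same text halves the TTR, because the number of types stays the same while the number of tokens doubles. You can compare this with how much OVIX changes on the same pair of texts.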
1) Try enabling the readability features, individually or in combination, and run the decoder to see what happens. Remember that you also have to enable the corresponding entry in the weights section whenever you enable a model. Try varying the feature weights for the readability features. High weights for the readability models may have the effect of producing excessively long translations. If you encounter this problem, try increasing the word penalty weight. Otherwise, you should leave the weights of the baseline features constant.
2) Compare the output produced by the translation system under the different settings. What happens if you run the decoder with a very high weight for the TTR and/or the OVIX model? What if you use a weight similar to those of the other models? Can you find a weight setting that has a positive impact on readability without messing up the translations? Provide example translations.
Now let's take a closer look at what the decoder actually does. Remember that the local search process applies modifications to document states by running certain state operations. The operations available to the decoder are listed in the <state-generator> section of the configuration file. Each operation has a weight that specifies how often it will be attempted relative to the other operations. Some operations also have other parameters. The various decay parameters control draws from a geometric distribution. They should be between 0 and 1; the higher the decay parameter, the more likely it is that high numbers are drawn (i.e., that longer chunks are moved, or that chunks are moved across longer distances). At regular intervals, the decoder tells you how many operations of each type were attempted and how many of them were accepted because they improved the score.
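The effect of the operation weights and the decay parameters can be illustrated with a small simulation. The sketch below is not taken from Docent. It assumes that operations are chosen with probability proportional to their weights, and that a decay parameter d gives P(k) = (1 - d) * d^(k-1) for k = 1, 2, ..., which is one parameterisation consistent with the description above; Docent's exact definition may differ. The weights are invented, and the operation names are used only as labels.

```python
import random
from collections import Counter

# Invented weights, for illustration only.
OPERATION_WEIGHTS = {
    "ChangePhraseTranslationOperation": 0.8,
    "SwapPhrasesOperation": 0.1,
    "ResegmentOperation": 0.1,
}

def pick_operation(weights, rng):
    """Choose an operation with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

def draw_length(decay, rng):
    """Geometric draw: keep extending with probability decay (assumed form)."""
    if not 0 <= decay < 1:
        raise ValueError("decay must be in [0, 1)")
    k = 1
    while rng.random() < decay:
        k += 1
    return k

rng = random.Random(1)
counts = Counter(pick_operation(OPERATION_WEIGHTS, rng) for _ in range(5000))
print(counts.most_common())

for decay in (0.1, 0.5, 0.9):
    mean = sum(draw_length(decay, rng) for _ in range(5000)) / 5000
    print(decay, round(mean, 2))
```

Under this parameterisation the expected value of a draw is 1 / (1 - d), so moving the decay parameter close to 1 makes long chunks and long-distance moves much more frequent.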
3) Try experimenting with different sets of operations and different parameters. How is the translation quality affected?
4) If you want more information about what the operations are doing, you can pass the debug option -d component to the decoder. Possible values of component include ChangePhraseTranslationOperation, SwapPhrasesOperation, ResegmentOperation and SimulatedAnnealing. The first three output information about the operations proposed, and the last one tells you whether an operation was accepted or rejected. Note that this will write potentially huge amounts of information to stderr, so it's best to redirect the output to a file. You may also find the format a bit difficult to understand.
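If you run your experiments from a script, the redirection can be done there as well. The following sketch builds the command line shown earlier in this lab and sends stderr to a log file. The helper names are our own, and placing -d next to the other options is an assumption; check the decoder's usage message if the option is not accepted in that position.

```python
import subprocess

DOCENT = "/local/kurs/mt/bin64/detailed-docent"
TESTSET = "/local/kurs/mt/lab-docent/testset.xml"

def docent_command(config, outstem, burnin, sample_interval, max_steps,
                   debug_component=None):
    """Build the argument list for one decoder run (option order is assumed)."""
    cmd = [DOCENT, "-b", str(burnin), "-i", str(sample_interval),
           "-x", str(max_steps)]
    if debug_component is not None:
        cmd += ["-d", debug_component]
    cmd += ["-n", TESTSET, config, outstem]
    return cmd

def run_with_log(cmd, log_path):
    """Run the command and write everything it prints on stderr to log_path."""
    with open(log_path, "w") as log:
        return subprocess.run(cmd, stderr=log).returncode

cmd = docent_command("config.xml", "debugrun", 0, 1000, 10000,
                     debug_component="SwapPhrasesOperation")
print(" ".join(cmd))
# To actually start the decoder: run_with_log(cmd, "debugrun.stderr.log")
```

Keep the number of iterations small when debugging, since the log grows with every operation that is proposed.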
Don't hesitate to ask for help if you feel lost.
Report your results for the above assignments 1 to 3. Assignment 4 is optional.
Include your name and the name of your lab partner in the report.
Upload your report, written in English and in PDF format, to Studentportalen.
Deadline for handing in the written report: May 26th, 2017
Last possible deadline for handing in the report: June 2nd, 2017