Lab1 - Machine translation evaluation


In this lab session you will gain hands-on experience of using real machine translation services. You will compare the quality of different systems using both manual and automatic evaluation methods. Finally, you will assess the pros and cons of machine-translated texts.


In this section we will carry out some experiments with text from two very different domains: course plans (from Luleå University) and movie subtitles.


1) Create a new directory for this lab and copy the following six files into this new work directory:

mkdir lab1/
cd lab1/

Source files:

Reference files:

Translated files:
The following files have been translated using the rule-based Convertus engine. The Convertus system specializes in course syllabi.

2) Translate the sentences in the Swedish source files into English using Google Translate. Save the translation results in a separate text file in your work directory.

First evaluation

3) Spend about 20 minutes going through the reference and translated files (Google). Then answer the following questions:

  1. What were your first impressions of the MT results, and how did they compare to your previous ideas of how well MT works?
  2. What kinds of errors did the Google system make?
  3. Were there any problems with segmentation and if so what might have caused them?
  4. What potential pros and cons do you see for a professional translator in editing MT output rather than translating from scratch?

Automatic evaluation

4) Tokenize the translated texts (Convertus and Google) using Unix commands or a simple Python script, i.e. separate common punctuation marks from the surrounding words with space characters. Your tokenized version should contain one sentence per line, with tokens separated by a single space. You will also need to tokenize the reference for the course syllabus text. This is necessary for the evaluation script! Include your code in the report.
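
As a starting point, a minimal Python sketch of such a tokenizer might look like the following. The set of punctuation characters and the function names are assumptions made for illustration, not part of the lab material, and the approach is deliberately naive: it will, for example, also split decimal numbers such as "3.5" and does not treat apostrophes.

```python
import re

# Assumption: "common punctuation" means the characters listed here.
# Extend or restrict the set to suit your data.
PUNCT = re.compile(r'([.,!?;:()"])')


def tokenize(line):
    """Surround each punctuation mark with spaces, then collapse whitespace."""
    spaced = PUNCT.sub(r" \1 ", line)
    return " ".join(spaced.split())


def tokenize_lines(lines):
    """Tokenize an iterable of sentences, keeping one sentence per line."""
    return [tokenize(line) for line in lines]
```

To process a file, read it line by line, pass the lines to tokenize_lines, and write each result on its own line to a new file in your work directory.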

5) Use multi-bleu.perl to compute BLEU scores for the translated texts, using the reference translations provided. Report the scores obtained.

perl multi-bleu.perl reference.txt < translation.txt
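
To make clear what kind of score the script reports, here is a simplified Python sketch of corpus-level BLEU as defined by Papineni et al. (2002): clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It assumes already tokenized input with exactly one reference per sentence, the function names are illustrative, and it is not guaranteed to reproduce the output of multi-bleu.perl exactly. Use the Perl script for the scores you report.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per line, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref_line, hyp_line in zip(references, hypotheses):
        ref, hyp = ref_line.split(), hyp_line.split()
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            ref_counts = ngrams(ref, n)
            hyp_counts = ngrams(hyp, n)
            # Clipping: an n-gram is credited at most as often as it occurs
            # in the reference sentence.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, a zero match count for any order gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)
```

The sketch shows why tokenization matters: n-grams are matched as exact token sequences, so "mat." and "mat ." count as different tokens.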

6) If you want to compare with another system, repeat the process with Bing Translator (this is not necessary to pass the lab).

Manual evaluation

Read the English translations and evaluate the first 20 lines of each file (for both Convertus and Google Translate) using the following subjective assessment scale:

  1. Correct translation: 3 points
  2. Includes the main content; however, there are grammatical problems that do not affect the understanding of the main message: 2 points
  3. Parts of the original message are lost but the main content is still understandable in its context: 1 point
  4. Unacceptable translation: 0 points

7) Compute the average score of your manual evaluations for each translation file. Report the scores obtained.
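
The average is simply the sum of the per-line scores divided by the number of lines evaluated. A small Python sketch of the computation is given below. The function name is hypothetical, and the check for values outside the 0 to 3 scale is an added safeguard rather than a requirement of the lab.

```python
def average_score(scores):
    """Mean of manual scores, each of which must be 0, 1, 2 or 3."""
    if not scores:
        raise ValueError("no scores given")
    for s in scores:
        if s not in (0, 1, 2, 3):
            raise ValueError("score outside the 0-3 scale: %r" % (s,))
    return sum(scores) / len(scores)
```

For example, four lines scored 3, 2, 2 and 1 (invented values, not real results) give an average of 2.0.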


8) Compare the results of the manual and the automatic evaluations and answer the following questions:
  1. Do the manual and the automatic evaluations correlate well with each other? Why?
  2. Can you see problems with the reference translations that may negatively influence the automatic evaluation? Do you observe acceptable translations that are not well matched with the reference and, therefore, are penalized by the automatic metrics without objective reason? Discuss possible solutions to the problems you discover.
  3. Did the domain of the texts affect the quality of the translations? Were there differences between the two translation engines? Did they make different kinds of mistakes? Why do you think this is so?

Guidelines for lab report

Report your results for assignments 3 to 8 above. Assignment 6 is optional.
Include your name and the name of your lab partner in the report.
Upload your report, written in English and in PDF format, to Studentportalen.
Deadline for handing in the report: April 14th, 2017
Last possible deadline for handing in the report: June 2nd, 2017

Background information

Chapter 8 in the course textbook
Original publication: Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2002). BLEU: a Method for Automatic Evaluation of Machine Translation.

© 2017. UPPSALA UNIVERSITET, Institutionen för lingvistik och filologi, Box 635, 751 26 Uppsala