Uppsala universitet  

Projects in Machine Translation
Master Programme

Aim

The goals of the master projects are
  1. to study background literature and prepare a presentation for the final seminars in the MT course;
  2. to carry out a practical assignment related to the topic selected for the seminar and to prepare a final report describing the results.

Deadlines

  • April 14: Hand in your topic preferences
  • May 29 and 31: Seminar presentations (detailed schedule TBA)
  • June 2: Hand in final group report
  • June 2: Hand in individual reflection reports

Organisation

You will work in groups of 3-4 students. We will put together the groups based on your topic preferences. The list of topic suggestions can be found at the bottom of this page. Hand in your preferences by email to Fabienne by April 14 at the latest. Give a ranked list of at least three different topics you would consider working on. We will try our best to accommodate everyone's wishes, but we cannot guarantee that you will get your preferred topics. If you fail to hand in your preferences by April 14, you will be assigned to a topic arbitrarily.

Project work

For each topic you should carry out a practical project in which you apply some of the concepts related to your topic. This includes setting up and running MT systems, normally with Moses, and evaluating and comparing systems that differ in some aspect related to your topic. Each project will be assigned a supervisor, with whom you should discuss how to set up your project. The project work should be presented in a group report written in English.

The work may be divided within the group so that different members perform different experiments, but each member must set up and run at least one MT system. You are jointly responsible for writing the report, and each member should understand and be familiar with all parts of it. It is obligatory to have at least one meeting with your supervisor, where you discuss how you divide the work and show that everyone can set up and run MT systems.

In the final report, you are expected to

  • describe the background in terms of the concepts, approaches and techniques within your selected topic, including references to journal and/or conference articles
  • describe your project and motivate your experimental setup
  • summarize, evaluate and analyze your results
  • describe possible shortcomings and ideas for improvement

The deadline for handing in the group report is June 2 at 20:00, via Studentportalen.

Individual reflection report

In addition to the group report, each student should also hand in a short individual reflection report. The report should be 1-2 A4 pages (not more), and can be written in English. The report should consist of two parts:

  • A description of your role in the project group and what you personally did in the project, including which MT systems you trained.
  • A brief discussion of how your project work relates to a recent conference article of your choice that is related to your topic. Do not pick the same article as the other group members.

The deadline for handing in the individual reports is June 2 at 20:00, via Studentportalen.

Seminar Presentations

The goal of the seminars is to give all students an overview of the topics selected by the master students for their projects. Please try to give a comprehensible introduction to the topic you have selected. Motivate the ideas and concepts and try to be as pedagogical as possible. Allow discussions and questions. The overall time for your presentation is 30 minutes, including all discussions and questions. This means that you should prepare a presentation of about 20-25 minutes.

It is up to the students in each group to decide how to organise the presentation. Each student should present some part of the work. All students in the group should know the contents of the whole presentation and be prepared to answer questions. It is compulsory for all students to attend both seminars. Please inform Fabienne beforehand if you cannot participate. We will then find an alternative solution. The presentation should be given in English.

The seminars will be held on May 29 and 31.

Project Topics

Here is a list of project ideas and a short description of their main goals. Note that the exact descriptions are suggestions that should be discussed with your supervisor. For some projects the subjects will also be discussed briefly during other lectures. If this is the case, discuss the contents of your seminar with the respective teacher.

Note that there might be fewer groups than there are topic suggestions, depending on your wishes and the course coverage.

Each project consists of the following parts:

  • building a baseline system (to have something to compare with)
  • building one (or more) new system(s) (depending on the topic)

Building a system consists of training, tuning and testing it.

List of Topics

  • Parameter Tuning

  • Factored SMT models
    • Seminar: Explain the basic concepts of factored SMT
    • Project: train and compare various factored SMT models
      - include factors such as POS tags, lemmas, syntactic function
      - compare various combinations of translation and generation steps
      - tools: Moses, tagger and lemmatizer (hunpos, TreeTagger) and parsers with existing models
      - data: translated movie subtitles

  • Language Modeling and Domains
    • Seminar: Explain the basic concepts of n-gram LMs
    • Project: Explore language models and their parameters
      - investigate the effect of data size on translation quality
      - compare the use of in-domain versus out-of-domain data (perplexity and translation quality)
      - combinations of in-domain and out-of-domain LMs
      - tools: KenLM, SRILM, Moses
      - data: translated movie subtitles and data from other domains

  • Word alignment and Phrase-Based SMT
    • Seminar: Explain word alignment algorithms and phrase extraction strategies
    • Project: Explore the impact of word alignment on SMT quality
      - different settings for GIZA++
      - different symmetrization heuristics
      - difference between alignment of wordforms, lemmas, (POS tags?)
      - other alignment tools: anymalign, (Berkeley aligner?)
      - tools: GIZA++, Moses, anymalign, TreeTagger, ...
      - data: translated movie subtitles

  • Re-ordering and SMT
    • Seminar: Explain different re-ordering strategies
    • Project: Apply and compare different re-ordering approaches
      - lexicalized re-ordering models
      - re-ordering constraints (see Moses: hybrid translation)
      - pre-ordering (before training/decoding)
      - tools: Moses, external or own tools
      - data: translated movie subtitles

  • Tree-based SMT
    • Seminar: Explain the basic concepts of tree-based SMT
    • Project: train and compare various tree-based SMT models
      - hierarchical phrase-based SMT (no linguistic syntax)
      - linguistic syntax in source and/or target language
      - tools: POS tagger (e.g. hunpos) and parsers with existing models
      - data: translated movie subtitles

  • Domains and evaluation
    • Seminar: Explain the impact of domain on MT and domain adaptation strategies
    • Project: Explore the influence of the domain of the training and test data, and evaluate using several different methods
      - Vary the domain in training, dev and test data
      - Train on mixed data or data from a single domain
      - Possibly: explore methods for domain adaptation
      - Evaluate using different automatic metrics
      - Evaluate using some manual or semi-automatic method
      - tools: Moses, evaluation metrics
      - data: translated movie subtitles and data from other domains

  • Compounds in SMT
    • Seminar: Explain how compound words can be treated in MT
    • Project: Explore how to handle compound words for MT from and possibly to a compounding language
      - Compound splitting
      - Train MT systems with split compounds
      - Possibly: explore merging strategies for translating into compounding languages
      - tools: Moses, external or own tools
      - data: translated movie subtitles

  • Lattices and confusion networks
    • Seminar: Explain how lattices and confusion networks are used in MT and give some examples of when they have been used
    • Project: Identify areas where lattices and/or confusion networks can be useful and apply them to an MT system
      - Figure out things that can be represented by lattices and/or confusion networks
      - Run MT systems with lattices and/or confusion networks
      - tools: Moses, external or own tools
      - data: translated movie subtitles

Resources

General resources for the projects will be listed and linked here.

Tools

You will mainly work with the Moses SMT toolkit, parts of which you are already familiar with from the lab sessions. For all projects, you will first have to build a baseline system following the instructions given in this tutorial. This is required in order to be able to compare the results of the systems you will build during the project. Remember that BLEU scores are relative: we are always interested in the improvement over a baseline system. An absolute BLEU score on its own is not informative.
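
To illustrate why BLEU is reported relative to a baseline, here is a minimal Python sketch of a simplified corpus-level BLEU (one reference per sentence, whitespace tokenization, no smoothing). It is not the scoring script used in the course, and the function names and toy sentences are invented for illustration. For your actual experiments, use the evaluation tools provided with Moses.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100), one reference per sentence."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Clipping: an n-gram is credited at most as often as it
            # occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, a zero n-gram match count gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty punishes output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    reference = ["the cat sat on the mat", "he went home early today"]
    baseline = ["the cat sat on mat", "he went to home today"]
    system = ["the cat sat on the mat", "he went home early"]
    base_score = corpus_bleu(baseline, reference)
    new_score = corpus_bleu(system, reference)
    print(f"baseline: {base_score:.2f}  system: {new_score:.2f}  "
          f"difference: {new_score - base_score:+.2f}")
```

Both systems are scored against the same reference, so the difference between the two scores is the number to report. The scores themselves depend heavily on the test set, the tokenization and the number of references.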

IMPORTANT! You do not have to install Moses on the university computers. At least one version of it is already installed. We will make sure that everything works and provide you with all the paths you need when the projects start.

Other tools and resources that you might need in some projects:

  • hunpos - POS tagger; pre-trained POS tagging models
  • TreeTagger - another POS tagger with many pre-trained models; also includes lemmatization!
  • MaltParser - a data-driven dependency parser generator; pre-trained models for various languages (NOTE - you will need version 1.4.1 for using those models!)
  • Simple tools to convert POS-tagged data to CoNLL format (for parsing) and MaltParser output to XML trees (for tree-based SMT training) are available at /local/kurs/mt/projects/tools/:
    • tagged2conll.pl - convert TAB-separated POS-tagging output to CoNLL format for parsing
    • malt2tree.pl - convert TAB-separated parser output (CoNLL format) to XML format to be used with Moses syntax-based models. NOTE - This only works for projective trees!
  • anymalign - an alternative word aligner
  • The Berkeley Word Aligner
  • More links to tools
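
The note on malt2tree.pl above says that it only works for projective trees. The following Python sketch shows one way to check this before conversion. It assumes that the parser output follows the CoNLL-X layout (TAB-separated columns, HEAD in the seventh column, 0 for the root, blank line between sentences). The function names are made up for this example and are not part of the course tools.

```python
def read_heads(lines):
    """Yield one list of HEAD values per sentence from CoNLL-X lines."""
    heads = []
    for line in lines:
        if not line.strip():
            # A blank line ends the current sentence.
            if heads:
                yield heads
                heads = []
            continue
        columns = line.rstrip("\n").split("\t")
        heads.append(int(columns[6]))  # HEAD is the 7th column (assumption: CoNLL-X)
    if heads:
        yield heads


def is_projective(heads):
    """Return True if no two dependency arcs cross.

    heads[i] is the head of token i + 1; 0 denotes the artificial root,
    which is placed to the left of the first token.
    """
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for left1, right1 in arcs:
        for left2, right2 in arcs:
            # Two arcs cross if exactly one end of the second arc lies
            # strictly inside the first.
            if left1 < left2 < right1 < right2:
                return False
    return True


def count_non_projective(path):
    """Count the sentences in a CoNLL file that are not projective."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for heads in read_heads(handle) if not is_projective(heads))
```

For example, is_projective([2, 0, 2]) holds for a three-token sentence whose second token is the root, while is_projective([3, 4, 0, 1]) does not, because the arc between tokens 1 and 3 crosses the arc between tokens 2 and 4.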

Data

The basic resource is translated subtitles as collected in the OPUS corpus collection. For some of the projects you might need more diverse data; if so, discuss this with your supervisor. A selection of small and medium-sized data sets is available on our server (stp) in

/local/kurs/mt/projects/data

Each parallel data set includes

  • training data (xx-yy.train.xx, xx-yy.train.yy)
  • development data (xx-yy.dev.xx, xx-yy.dev.yy)
  • test data (xx-yy.test.xx, xx-yy.test.yy)

Here, xx is the language ID of the source language and yy is the language ID of the target language (you may, of course, also use the data in the other translation direction).
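
The naming scheme above can be turned into file paths with a small helper. The sketch below is in Python and also checks that both sides of a data set have the same number of lines, as expected for sentence-aligned data. It assumes that the files are UTF-8 encoded and lie directly in the data directory; the helper names are made up for this example.

```python
from pathlib import Path

# Directory named on this page; pass another one if you work on a copy.
DATA_DIR = Path("/local/kurs/mt/projects/data")


def parallel_paths(src, trg, part, data_dir=DATA_DIR):
    """Return (source file, target file) for part 'train', 'dev' or 'test'."""
    base = f"{src}-{trg}.{part}"
    return Path(data_dir) / f"{base}.{src}", Path(data_dir) / f"{base}.{trg}"


def count_lines(path):
    """Count the lines of a text file (assumption: UTF-8 encoding)."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_aligned(src, trg, part, data_dir=DATA_DIR):
    """Return the number of sentence pairs, or raise if the line counts differ."""
    src_file, trg_file = parallel_paths(src, trg, part, data_dir)
    n_src, n_trg = count_lines(src_file), count_lines(trg_file)
    if n_src != n_trg:
        raise ValueError(
            f"{src_file.name} has {n_src} lines but {trg_file.name} has {n_trg}"
        )
    return n_src
```

Calling check_aligned("en", "sv", "dev") on the server would then return the number of sentence pairs in the English-Swedish development set. A check like this is also useful after any preprocessing step of your own, since tools that drop or merge lines on one side silently break the alignment.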

Currently, there are data sets available for

  • English - Swedish (en-sv)
  • English - French (en-fr)
  • English - Spanish (en-es)
  • German - English (de-en)
  • German - Swedish (de-sv)
  • French - Swedish (fr-sv)

We can produce data sets for other language pairs if you like. Just ask us.

All parallel data sets are sentence-aligned (corresponding lines are aligned with each other), tokenized and "true-cased" (see the Moses homepage to understand what that means). True-casing is not perfect for movie subtitles, as there are often dashes or other marker characters at the beginning of a sentence. You may recase the data using the Moses tools if you like.

There are also monolingual data sets for all languages above. They have the basename mono and an extension corresponding to the language ID (de, en, es, fr, sv).