Uppsala universitet  

Projects in Machine Translation
Bachelor Programme

Organisation
Project Work
Individual Reflection Report
Seminar Presentation
Possible Topics
Resources

Aim

The goals of the bachelor projects are
  1. to study background literature
  2. to carry out a practical assignment related to the topic selected for the seminar and to prepare a final report describing the results

Deadlines

  • April 14: Hand in your topic preferences
  • May 29 and 31: Seminar presentations (detailed schedule TBD)
  • June 2: Hand in final group report
  • June 2: Hand in individual reflection reports

Organisation

You will work in groups of 3-4 students. We will put the groups together based on the topics you prefer to work on. The list of topic suggestions can be found at the bottom of this page. Hand in your preferences by email to Fabienne by April 14 at the latest. Give a ranked list of at least three different topics you would consider working on. We will try our best to accommodate everyone's wishes, but we cannot guarantee that you will get your preferred topics. If you fail to hand in your preferences by April 14, you will be assigned to a topic arbitrarily.

Project work

For each topic you should carry out a practical project in which you apply some of the concepts related to your topic. This includes setting up and running MT systems, normally with Moses, and evaluating and comparing systems that differ in some aspect related to your topic. Each project will be assigned a supervisor, with whom you should discuss how to set up your project. The project work should be presented in a written group report, in either English or Swedish.

It is possible to divide the work in the group so that different persons perform different experiments, but each person in the group must set up and run at least one MT system. You are jointly responsible for writing the report, and each person should understand and be familiar with all parts of it. It is obligatory to have at least one meeting with your supervisor where you discuss how you divide the work and show that everyone can set up and run MT systems.

In the final report, you are expected to

  • describe the background in terms of the concepts, approaches and techniques within your selected topic, including references to journal and/or conference articles
  • describe your project and motivate your experimental setup
  • summarize, evaluate and analyze your results
  • describe possible shortcomings and ideas for improvement

The deadline for handing in the report is June 2, 20:00, via Studentportalen.

Individual reflection report

In addition to the group report, each student should also hand in a short individual reflection report. The report should be about 1-2 A4 pages (not more), and can be written in English or Swedish. The report should consist of two parts:

  • A description of your role in the project group and what you personally did in the project, including which MT systems you trained.
  • Pick a recent conference article related to your topic, and briefly discuss how your project work relates to it. Do not pick the same article as the other group members.

The deadline for handing in the reports is June 2, 20:00, via Studentportalen.

Project Topics

Here is a list of project ideas and a short description of their main goals. Note that the exact descriptions are suggestions that should be discussed with your supervisor. For some projects the subjects will also be discussed briefly during other lectures. If this is the case, discuss the contents of your project with the respective teacher.

Note that there might be fewer groups than there are topic suggestions, depending on your wishes and the course coverage.

Each project consists of the following parts:

  • building a baseline system (to have something to compare with)
  • building one (or more) new system(s) (depending on the topic)

Building a system consists of training, tuning and testing it.

List of Topics

  • Factored SMT models
    • Report: Explain the basic concepts of factored SMT
    • Project: train and compare various factored SMT models
      - include factors such as POS tags, lemmas, syntactic function
      - compare various combinations of translation and generation steps
      - tools: Moses, tagger and lemmatizer (hunpos, TreeTagger) and parsers with existing models
      - data: translated movie subtitles

  • Language Modeling and Domains
    • Report: Explain the basic concepts of n-gram LMs
    • Project: Explore language models and their parameters
      - investigate the effect of data size on translation quality
      - compare the use of in-domain versus out-of-domain data (perplexity and translation quality)
      - combinations of in-domain and out-of-domain LMs
      - tools: KenLM, SRILM, Moses
      - data: translated movie subtitles and data from other domains

  • Re-ordering and SMT
    • Report: Explain different re-ordering strategies
    • Project: Apply and compare different re-ordering approaches
      - lexicalized re-ordering models
      - re-ordering constraints (see Moses: hybrid translation)
      - pre-ordering (before training/decoding)
      - tools: Moses, external or own tools
      - data: translated movie subtitles

  • Compounds in SMT
    • Report: Explain how compound words can be treated in MT
    • Project: Explore how to handle compound words for MT from and possibly to a compounding language
      - compound splitting
      - train MT systems with split compounds
      - tools: Moses, external or own tools
      - data: translated movie subtitles
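
As a rough illustration of the compound splitting step in the last topic, the following Python sketch splits a word into two known parts when the parts are, on average, more frequent than the unsplit word. The word counts, the function name split_compound, the minimum part length and the linking element "s" are all assumptions made for this example; a real project would derive the counts from the training data and agree on the method with the supervisor.

```python
from math import sqrt

# Hypothetical toy word counts; in a project these would come from the training data.
COUNTS = {"fotboll": 50, "fot": 120, "boll": 200, "lag": 300, "fotbollslag": 2}


def split_compound(word, counts, min_len=3, linkers=("", "s")):
    """Return the best two-part split of `word`, or [word] if no split scores higher."""
    best_parts = [word]
    best_score = counts.get(word, 0)
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        for linker in linkers:
            # Optionally strip a linking element (assumed here: "s") from the first part.
            if linker and not head.endswith(linker):
                continue
            stem = head[: len(head) - len(linker)] if linker else head
            if len(stem) < min_len:
                continue
            if stem in counts and tail in counts:
                # Score a split by the geometric mean of the part frequencies.
                score = sqrt(counts[stem] * counts[tail])
                if score > best_score:
                    best_parts, best_score = [stem, tail], score
    return best_parts


print(split_compound("fotbollslag", COUNTS))
```

The split output could then be used to preprocess the training, development and test data before training an MT system, which is compared against a baseline trained on unsplit data.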

Resources

General resources for the projects will be listed and linked here.

Tools

You will mainly work with the Moses SMT toolkit, parts of which you are already familiar with from the lab sessions. For all projects, you will first have to build a baseline system following the instructions given in this tutorial. This is required in order to compare the results of the systems you build during the project. Remember that BLEU scores are relative: what matters is the improvement over a baseline system. An absolute BLEU score on its own is not informative.
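
To make the point about relative scores concrete, here is a minimal Python sketch that computes an unsmoothed corpus-level BLEU score with a single reference per sentence and reports the difference between a baseline and a modified system. The function names and the toy sentences are invented for illustration. For the actual experiments, use the evaluation tools provided with the course setup rather than this function.

```python
from collections import Counter
from math import exp, log


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), one reference per sentence, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp_tok, n), ngrams(ref_tok, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * exp(log_precision)


# Invented toy data: the same test set translated by two systems.
references = ["the cat sat on the mat", "he did not see the film"]
baseline_output = ["the cat sat on mat", "he did not see film"]
system_output = ["the cat sat on the mat", "he did not see film"]

baseline_bleu = corpus_bleu(baseline_output, references)
system_bleu = corpus_bleu(system_output, references)
print(f"baseline: {baseline_bleu:.2f}  system: {system_bleu:.2f}  "
      f"difference: {system_bleu - baseline_bleu:+.2f}")
```

The number to report and discuss is the difference between the two systems on the same test set, not either score in isolation.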

IMPORTANT! You do not have to install Moses on the university computers. At least one version of it is already installed. We will make sure that everything works and provide you with all the paths you need when the projects start.

Other tools and resources that you might need in some projects:

  • hunpos - POS tagger; pre-trained POS tagging models
  • TreeTagger - another POS tagger with many pre-trained models; also includes lemmatization
  • MaltParser - a data-driven dependency parser generator; pre-trained parsing models for various languages (NOTE - you will need version 1.4.1 to use those models!)
  • Simple tools to convert POS-tagged data to CoNLL format (for parsing) and MaltParser output to XML trees (for tree-based SMT training) are available at /local/kurs/mt/projects/tools/:
    • tagged2conll.pl - convert TAB separated POS-tagging output to CoNLL format for parsing
    • malt2tree.pl - convert TAB separated parser output (CoNLL format) to XML format to be used with Moses syntax-based models. NOTE - This only works for projective trees!
  • anymalign - an alternative word aligner
  • The Berkeley Word Aligner
  • More links to tools
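
To give an idea of what a conversion from tagger output to CoNLL format involves, here is a small Python sketch. It is not the implementation of tagged2conll.pl. It assumes well-formed input with one "token TAB tag" pair per line and a blank line between sentences, and it writes the ten columns of the CoNLL-X format with "_" for everything the tagger does not provide. The function name is hypothetical.

```python
def tagged_to_conll(lines):
    """Convert 'token<TAB>tag' lines (blank line = sentence break) to CoNLL-X rows."""
    out, token_id = [], 0
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the sentence; emit a single separator line.
            if token_id:
                out.append("")
            token_id = 0
            continue
        form, tag = line.split("\t")[:2]
        token_id += 1
        # Columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
        out.append("\t".join([str(token_id), form, "_", tag, tag, "_", "_", "_", "_", "_"]))
    if token_id:
        out.append("")  # close the last sentence
    return out


rows = tagged_to_conll(["I\tPRP\n", "sleep\tVBP\n", "\n", "Yes\tUH\n"])
print("\n".join(rows))
```

For the projects themselves, use the scripts provided on the server, since they match the data and models used in the course.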

Data

The basic resource is translated subtitles as collected in the OPUS corpus collection. For some of the projects you might need more diverse data; discuss this with your supervisor if that is the case. A selection of small/medium sized data sets is available on our server (stp) in

/local/kurs/mt/projects/data

Each parallel data set includes

  • training data (xx-yy.train.xx, xx-yy.train.yy)
  • development data (xx-yy.dev.xx, xx-yy.dev.yy)
  • test data (xx-yy.test.xx, xx-yy.test.yy)

Here xx is the language ID of the source language and yy is the language ID of the target language (you may, of course, also use the data in the other translation direction).
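
The naming scheme can be expressed as a small Python helper that builds the file names for a given language pair. This is only a sketch: the function name is hypothetical, and it assumes that the files lie directly in the data directory, which may not match the actual layout on the server.

```python
from pathlib import Path

# Location of the data sets as given in the text.
DATA_DIR = Path("/local/kurs/mt/projects/data")


def dataset_files(src, tgt, data_dir=DATA_DIR):
    """Map each split to its (source file, target file) pair for the data set src-tgt."""
    pair = f"{src}-{tgt}"
    return {
        split: (data_dir / f"{pair}.{split}.{src}", data_dir / f"{pair}.{split}.{tgt}")
        for split in ("train", "dev", "test")
    }


for split, (src_file, tgt_file) in dataset_files("en", "sv").items():
    print(split, src_file.name, tgt_file.name)
```

To translate in the other direction, keep the pair name as it is on disk and simply swap which of the two files you treat as source and target.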

Currently, there are data sets available for

  • English - Swedish (en-sv)
  • English - French (en-fr)
  • English - Spanish (en-es)
  • German - English (de-en)
  • German - Swedish (de-sv)
  • French - Swedish (fr-sv)

We can produce data sets for other language pairs if you like. Just ask us.

All parallel data sets are sentence aligned (corresponding lines are aligned with each other), tokenized and "true-cased" (look at the Moses homepage to understand what that means). True-casing is not perfect in the case of movie subtitles, as there are often dashes or other marker characters at the beginning of a sentence. You may recase the data using the Moses tools if you like.
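
Because sentence alignment here means that line n of the source file corresponds to line n of the target file, any preprocessing you apply must keep the two files in step. The following Python sketch reads two such files in parallel and stops if they differ in length. The function name is hypothetical and UTF-8 encoding is an assumption.

```python
from itertools import zip_longest


def read_parallel(src_path, tgt_path):
    """Yield (source, target) sentence pairs from two line-aligned files."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            # zip_longest pads the shorter file with None, which reveals a length mismatch.
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield s.rstrip("\n"), t.rstrip("\n")
```

A check like this is useful after any step that filters or rewrites sentences, since a single dropped line shifts every following sentence pair.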

There are also monolingual data sets for all languages above. They have the basename mono and an extension corresponding to the language ID (de, en, es, fr, sv).