Lab 3: Parallel Corpora and Alignment

0. Introduction

There are two tasks in this lab. You are required to complete both in order to pass.

Task 1: Parallel Corpora and Sentence Alignment

Aim

The goal of the first task is to experiment with the collection of parallel corpora and with automatic sentence alignment. We will use translated movie subtitles to build a small parallel corpus and test existing alignment approaches.

Assignments

(1) Building a Parallel Subtitle Corpus

Use on-line resources such as http://www.opensubtitles.org or http://www.undertexter.se to collect subtitles in English and one other language that one of you is familiar with. Collect subtitles in both languages for at least 4 movies. Try to focus on a specific genre or period of time. Make sure that the subtitles refer to exactly the same movie. Sometimes there are several subtitle files referring to different versions and splits of the same movie (check, for example, the indicated number of CDs).

Important: Download files in SRT format to make it possible to use the provided conversion tools! If necessary, unzip the files using, for example:
unzip file.movie1.eng.zip

Convert the subtitles to XML format by following these steps:

  1. Copy Perl scripts to your working directory:
    cp /local/kurs/mt/lab-sentalign/*.pl .

  2. Convert to Unix format (the files often include CR (carriage return) characters from the Windows world):
    dos2unix file.movie1.eng.srt
    (Use sed -i 's/\r$//' file.srt if dos2unix is not available on your system)

  3. Convert to XML:
    ./srt2xml.pl -l eng file.movie1.eng.srt > movie1.en.xml
    ./srt2xml.pl -l swe file.movie1.swe.srt > movie1.sv.xml
    The parameter -l lang is used to guess the character encoding. Replace 'swe' with the three-letter code for your language if you choose a language other than Swedish. Always look at the result of the conversion to check whether it worked. Use, for example, file movie1.sv.xml and less movie1.sv.xml to check the XML output.

If everything works, the files should look like this:

<?xml version="1.0" encoding="utf-8"?>
<document>
...
  <s id="9">
    <w id="9.1">Avslås</w>
    <w id="9.2">.</w>
    <time id="T8E" value="00:01:26,200" />
  </s>
  <s id="10">
    <time id="T9S" value="00:01:28,500" />
    <i>
    <w id="10.1">Rätt</w>
    <w id="10.2">till</w>
    <w id="10.3">Habeas</w>
    <w id="10.4">Corpus</w>
    <w id="10.5">:</w>
    </i>
  </s>
    
If they do not, try to find the problem or download other, cleaner files. Also check whether both language versions cover the entire movie: look at the end of the subtitle files and compare the time values.
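
One quick way to spot conversion problems is to check that the output is well-formed XML, for example with Python's built-in parser (just a convenience check, not part of the provided tools):

    python3 -c "import sys, xml.etree.ElementTree as ET; ET.parse(sys.argv[1])" movie1.sv.xml

If the file is broken, this prints a ParseError with the line and column of the first problem; if the file is well-formed, it prints nothing.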

In some cases there are extra bytes at the beginning of a file that specify its encoding (a so-called byte order mark, BOM). You can get rid of those bytes by running the following command:

tail --bytes=+4 movie1.eng.srt > movie1.eng.clean.srt
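
If you are unsure whether a file starts with a BOM, you can inspect its first bytes, for example like this (the UTF-8 BOM is the three-byte sequence EF BB BF, which is exactly what the tail command above strips):

    python3 -c "import sys; print(open(sys.argv[1], 'rb').read(3))" movie1.eng.srt

If this prints b'\xef\xbb\xbf', the file has a UTF-8 BOM.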

If you need to unpack rar-files: /home/stp12/aarons/bin/rar/unrar x movie1.eng.rar

Don't forget to clean up your directory. Delete all unnecessary files and keep only the clean XML files you want to use in your corpus.

(2) Automatic Alignment of Movie Subtitles

In this part, we will try different alignment approaches to align the collected texts.

The first task is to align all your data files using the length-based method developed by Gale and Church (see the lecture notes and the original paper, "A Program for Aligning Sentences in Bilingual Corpora", Gale & Church 1993). Use the following command (store all your subtitles in the folder "data"):

~joerg/projects/uplug/uplug align/sent -src data/movie1.en.xml -trg data/movie1.sv.xml > data/movie1.length.ensv.xml

This will take a few seconds (or a minute, depending on the size of the texts to be aligned). The alignment information in this example will be stored in data/movie1.length.ensv.xml. This alignment is coded as XML-based stand-off annotation that connects sentences in the original corpus files by their sentence IDs. Take a look at the alignment file to see the structure of this annotation. You can use the following command to extract the aligned sentences from the original documents:

~joerg/projects/uplug/tools/readalign data/movie1.length.ensv.xml | less
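
To get an intuition for what the length-based method does, here is a minimal Python sketch of its match cost (this is not the uplug implementation): the cost of linking two sentences depends only on how surprising the ratio of their character lengths is, using the constants from the Gale & Church paper. The real aligner embeds this cost in a dynamic program over both sentence sequences and also considers 1-0, 0-1, 2-1, 1-2 and 2-2 links.

    import math

    def match_cost(l1, l2, c=1.0, s2=6.8):
        """Cost (negative log probability) of a 1-1 link between two
        sentences of l1 and l2 characters (Gale & Church 1993)."""
        if l1 == 0 and l2 == 0:
            return 0.0
        # how many standard deviations the observed length difference is
        # away from what the model expects
        delta = (l2 - l1 * c) / math.sqrt(max(l1, 1) * s2)
        # two-tailed probability of a deviation at least this large
        p = math.erfc(abs(delta) / math.sqrt(2.0))
        return -math.log(max(p, 1e-300))

    print(match_cost(100, 103))  # similar lengths -> low cost
    print(match_cost(100, 35))   # very different lengths -> high cost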

Select 2 subtitle pairs (from 2 movies) that you would like to evaluate. Look at the first 20 aligned segments of each of these and evaluate the alignment quality. How many links are OK? Estimate the precision based on your evaluation for each subtitle pair. Report the evaluation scores in your report.

Another approach is to use the time information given in the subtitle files to find alignments between sentences. Run a time-based aligner using the following command:

./srtalign.pl data/movie1.en.xml data/movie1.sv.xml > data/movie1.time.ensv.xml

Use readalign from above once again to examine the result.

Do this for all your subtitle pairs.
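
The underlying idea is easy to see in a small sketch (again, this is not the actual srtalign.pl code): convert the timestamps to seconds and link sentences whose display intervals overlap.

    def to_seconds(ts):
        """Convert an SRT timestamp like '00:01:26,200' to seconds."""
        hms, ms = ts.split(",")
        h, m, s = hms.split(":")
        return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

    def overlap(a, b):
        """Length in seconds of the overlap of two (start, end) intervals."""
        return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

    # hypothetical display intervals of one English and one Swedish subtitle
    eng = (to_seconds("00:01:26,200"), to_seconds("00:01:28,400"))
    swe = (to_seconds("00:01:26,500"), to_seconds("00:01:28,900"))
    if overlap(eng, swe) > 0:
        print("candidate link between the two sentences")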

Also run experiments with additional options to the alignment program. One problem with time-based alignment is that movie subtitles are not always synchronized with the movie in exactly the same way. The alignment script provides heuristics to synchronize the time information based on matching cognates. Try, for example, the following command:

./srtalign.pl -v -c 0.7 -b data/movie1.en.xml data/movie1.sv.xml > data/movie1.cognates.ensv.xml

The flag "-c" is used to specify a threshold for a string similarity measure called LCSR (the longest common subsequence ratio) that will be used to find possible pairs of cognates. A value of 1 refers to identical strings; 0 means no match at all. With the flag "-b", the script tries to find the best cognate pair, i.e. the one that improves the alignment the most. Alignment quality is measured in terms of the ratio between empty and non-empty alignments. You can try without this flag to see what happens. The flag "-v" enables verbose output.
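
For reference, LCSR is simple to compute yourself (a sketch, not the script's own code): divide the length of the longest common subsequence of two strings by the length of the longer one.

    def lcsr(a, b):
        """Longest common subsequence ratio of two strings."""
        if not a or not b:
            return 0.0
        # dp[i][j] = length of the LCS of a[:i] and b[:j]
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, ca in enumerate(a, 1):
            for j, cb in enumerate(b, 1):
                dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
        return dp[len(a)][len(b)] / max(len(a), len(b))

    print(lcsr("habeas", "habeas"))  # 1.0 for identical strings
    print(lcsr("corpus", "korpus"))  # 5/6, a likely cognate pair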

After each run, look at the alignment using readalign and decide what kind of settings you would like to use for your final alignment. Discuss in your lab report how you made your decisions. Evaluate the final alignment in the same way as above (with the length-based method) using the same subtitle pairs. Discuss the differences and the result in general in your report. Can you see specific problems with one or the other alignment approach? Are there files that are particularly difficult to align? Can you explain why?

Task 2: Word Alignment

Important: There are two different assignments for word alignment. The first assignment is a programming exercise, whereas the second involves experimenting with pre-written code. Assignment 1 is for master's students. Bachelor's students are free to choose between the two assignments; please indicate in your report which one you chose.

Preparations

Copy the "blocks world" corpus into your work directory:

mkdir lab-wordalign
cp /local/kurs/mt/lab-wordalign/corpus* lab-wordalign/

This is a small corpus of Swedish and English sentence pairs. Sentences at corresponding lines are aligned with each other. Open the files to see what they look like.


Assignment 1: Implement IBM Model 1

  1. Implement IBM model 1 as outlined in the pseudo-code on page 91 of Philipp Koehn's textbook on statistical machine translation (you can also find it on page 29 of the lecture slides). Use your favorite programming language to implement the algorithm. The program should be able to read the example corpus and produce lexical translation parameters t(e|f).

  2. Run your word alignment program with the example corpus with English as the 'e' language and Swedish as the 'f' language. Report the 10 highest lexical translation probabilities t(e|f) after each of the first 5 iterations.

  3. Modify your program in such a way that it also reports the perplexity of the model with respect to the training data, as explained in section 4.2.4 of the course book (a reminder of the definition is given at the end of this assignment). Report the perplexity for the first 5 iterations.

  4. Word-align your best sentence-aligned movie subtitle from the first part of the lab using your own implementation of IBM model 1. Report the perplexity for the first 5 iterations and 10 highest lexical translation probabilities after 10 iterations.

In your report, discuss and analyze your implementation of IBM model 1 and your observations in detail. Suggest improvements and discuss possible ideas for better alignment. Also hand in your implementation with some instructions on how to run the code. Don't forget to add appropriate comments to your source code. (Please write your own code from scratch; copying existing code from the web will not be accepted.)
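
Reminder: following the course book's definition, the perplexity PP of the model with respect to the training corpus is computed from the probabilities that the model assigns to the aligned sentence pairs (e, f):

    log2 PP = - sum over all sentence pairs (e,f) of log2 p(e|f)

The higher the probability the model assigns to the training data, the lower the perplexity, so you should expect it to decrease from one EM iteration to the next.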

Assignment 2: Experiment with word alignment software

  1. Prepare the data for running the word alignment software:
    /local/kurs/mt/bin/plain2snt.out corpus.sv corpus.en

    This extracts vocabulary files (with the file extension .vcb) and sentence alignment files (with the file extension .snt) using word type IDs taken from the vocabulary files.

    Look at the vocabulary files and report the 3 most frequent words in the English corpus and the 3 most frequent ones in the Swedish corpus (frequencies are given in the third column).

    Look at the sentence alignment files. There are two of them, one for the alignment direction from Swedish to English and one for the other direction. Each aligned unit is stored on three lines: the first gives the frequency of this particular sentence pair, the second specifies the words in the source language, and the third the words in the target language.

    Using the IDs in the vocabulary files, figure out which one of the files aligns Swedish to English and which one the other way around (include the solution in your report).


  2. Run mkcls to automatically create word classes:

    /local/kurs/mt/bin/mkcls -c5 -pcorpus.en -Vcorpus.en.vcb.classes
    /local/kurs/mt/bin/mkcls -c5 -pcorpus.sv -Vcorpus.sv.vcb.classes

    This creates 5 word classes for English and 5 for Swedish. Look at the word classes (in the files with the extension .cats) and report the results for both languages in your report. Each class has a unique ID, and the words in that class follow the ID. Do the classes make sense? Can you see patterns that correspond to linguistic intuition?

    Run once again with 3 classes only (-c3) and see what happens. Add comments about the new result in your report. Then run with 5 classes again to recreate the original word classes for the next experiments.


  3. Compute word co-occurrence for the parallel corpus (a list of word pairs that co-occur in aligned sentences):

    /local/kurs/mt/bin/snt2cooc.out corpus.sv.vcb corpus.en.vcb corpus.sv_corpus.en.snt > corpus.sven.cooc

    This list is used for the initial estimation of the word alignment models. Look at the output file and check whether it makes sense. Note that ID 0 is reserved for the special NULL word!


  4. Run IBM 1 using the following command (all on one line):

    /local/kurs/mt/bin/GIZA++ -S corpus.sv.vcb -T corpus.en.vcb -C corpus.sv_corpus.en.snt -cooc corpus.sven.cooc -mh 0 -model3iterations 0 -model4iterations 0 -model1dumpfrequency 1 -o ibm1

    This runs 5 iterations of IBM model 1 on a single core and dumps the lexical translation probabilities into the files ibm1.t1.[1-5]. The last number refers to the iteration. Look at Part V (Output File Formats) in /local/kurs/mt/giza-pp/GIZA++-v2/README to understand the format of this file (the T-table; a schematic, invented example of the T-table and Viterbi alignment formats is given after this list). Take an example of a lexical translation probability (do not select a NULL alignment where one of the word IDs is 0) and report how the probability changes for this example during the 5 iterations. Replace the word IDs by the actual words by looking them up in the vocabulary files!

    The files ibm1.A1.[1-5].part0 show the Viterbi alignment after each iteration. Check /local/kurs/mt/giza-pp/GIZA++-v2/README again to understand how to read this file format. Look at the changes throughout the training process and discuss them in your report.


  5. Now, run IBM models 1 - 3 in one training process using the following command

    /local/kurs/mt/bin/GIZA++ -S corpus.sv.vcb -T corpus.en.vcb -C corpus.sv_corpus.en.snt -cooc corpus.sven.cooc -mh 0 -model2iterations 5 -model3iterations 5 -model4iterations 0 -o ibm3

    Look at the final IBM 3 probability tables created by GIZA++ (ibm3.a3.final, ibm3.d3.final, ibm3.n3.final, ibm3.p0_3.final, ibm3.t3.final) and explain in your own words what kind of parameters they contain. Look at /local/kurs/mt/giza-pp/GIZA++-v2/README for the documentation of the output files (Part V). Give at least one example from each parameter file and explain what it represents. Replace word IDs by the actual word forms where they appear (for example, in the t-table).

    Look at the development of the model perplexity (trn-pp) on the training data in ibm3.perp. Where do you see the largest improvements in perplexity (lower scores are better)?

    Finally, look at the Viterbi word alignment in ibm3.A3.final. List (in a table) how the words are aligned with each other in the first sentence pair.

  6. Finally, run GIZA++ in the opposite direction, using English as the source language (-S) and Swedish as the target language (-T). Note that you also need to use the other sentence alignment file.

    Look at the Viterbi alignment file again and inspect the alignment of the first sentence pair. List the links again and discuss differences compared to the alignment in the opposite direction. Also check the other output files and comment on other interesting differences if you find any.
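
For orientation, the two GIZA++ output formats you will read most often look roughly as follows; the words, IDs and numbers below are invented, and the README remains the authoritative description. A line in a T-table holds a source word ID, a target word ID and the current estimate of the lexical translation probability:

    4 7 0.824591

A sentence pair in a Viterbi alignment file (A1/A3) takes three lines: a header with the pair number and the alignment score, the target sentence, and the source sentence in which each token (including NULL) is followed by the list of target positions aligned to it:

    # Sentence pair (1) source length 4 target length 4 alignment score : 2.6e-05
    take the red block
    NULL ({ }) ta ({ 1 }) den ({ 2 }) röda ({ 3 }) klossen ({ 4 })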

Feel free to experiment with further settings if you like. An overview of all possible parameters is shown if you run GIZA++ without command line arguments.

Don't forget to clean up after all your experiments to save space on our file server!

Lab report

Prepare a lab report that discusses all your results from sentence alignment and word alignment. Include answers to all questions mentioned in the assignments of each part of the lab and report the results of your experiments. Also give some basic statistics of your parallel corpus and mention problems and difficulties you had while working on the assignments. If you implemented IBM Model 1, remember to include your source code.

Submit your report as a pdf through the student portal. Deadline for handing in the report: May 5, 2017.