The goal of this first task of the lab is to test automatic sentence alignment and the collection of parallel corpora. We will use translated movie subtitles to build a small parallel corpus and test existing alignment approaches on it.
Use on-line resources such as http://www.opensubtitles.org or http://www.undertexter.se to collect subtitles in English and one other language that one of you is familiar with.
Collect subtitles in both languages for at least 4 movies.
Try to focus on a specific genre or period of time.
Make sure that the subtitles refer to exactly the same movie.
Sometimes there are several subtitle files that refer to different versions and splits of the same movie (look, for example, at the indicated number of CDs).
Important: Download files in SRT-format to make it possible to use the provided conversion tools! If necessary, unzip the files using for example
Convert the subtitles to XML format by following these steps:
cp /local/kurs/mt/lab-sentalign/*.pl .
dos2unix file.srt (or sed -i 's/\r$//' file.srt if dos2unix is not available on your system)
./srt2xml.pl -l eng file.movie1.eng.srt > movie1.en.xml
./srt2xml.pl -l swe file.movie1.swe.srt > movie1.sv.xml
-l lang is used to guess the character encoding. Replace 'swe' with the three-letter code for your language if you choose a language other than Swedish. Always look at the result of the conversion to check whether it worked. Use, for example,
less movie1.sv.xml
to check the XML output.
If everything works, the files should look like this:
<?xml version="1.0" encoding="utf-8"?>
<document>
...
<s id="9">
  <w id="9.1">Avslås</w>
  <w id="9.2">.</w>
  <time id="T8E" value="00:01:26,200" />
</s>
<s id="10">
  <time id="T9S" value="00:01:28,500" />
  <i>
    <w id="10.1">Rätt</w>
    <w id="10.2">till</w>
    <w id="10.3">Habeas</w>
    <w id="10.4">Corpus</w>
    <w id="10.5">:</w>
  </i>
</s>
If they do not, try to find the problem or download other, cleaner files. Also check whether both language versions cover the entire movie: look at the end of the subtitle files and compare the time values.
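The coverage check described above can also be scripted. The following is a minimal sketch, assuming the converted files follow the structure shown in the sample (sentences in `s` elements, timestamps in the `value` attribute of `time` elements); the function name is hypothetical and not part of the provided tools.

```python
import xml.etree.ElementTree as ET


def summarize_subtitle_xml(xml_bytes):
    """Return (number of <s> elements, value of the last <time> element or None).

    Assumes the layout shown in the sample output of srt2xml.pl.
    """
    root = ET.fromstring(xml_bytes)
    # Count sentence elements anywhere in the document.
    sentences = sum(1 for _ in root.iter("s"))
    # Collect all time stamps in document order; the last one marks the end.
    times = [t.get("value") for t in root.iter("time")]
    return sentences, (times[-1] if times else None)
```

Calling it on both language versions, for example `summarize_subtitle_xml(open("movie1.sv.xml", "rb").read())`, lets you compare the final time values side by side. If they differ by many minutes, the two files probably do not cover the same cut of the movie.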
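In some cases a file starts with a few extra bytes that specify its encoding (a so-called BOM, byte order mark). You can remove those bytes with the following command. Run it only on files that actually start with a three-byte BOM, because it drops the first three bytes unconditionally: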
tail --bytes=+4 movie1.eng.srt > movie1.eng.clean.srt
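A safer variant removes the bytes only when they are really there. This is a small sketch, assuming the file is UTF-8 encoded; the function name is hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 byte order mark if present; otherwise return data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Unlike the `tail` command, this leaves files without a BOM untouched, so it can be applied to every downloaded file without first checking each one.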
If you need to unpack rar-files:
/home/stp12/aarons/bin/rar/unrar -x movie1.eng.rar
Don't forget to clean up your directory. Delete all unnecessary files and keep only the clean XML files you want to use in your corpus.
In this part, we will try different alignment approaches to align the collected texts.
The first task is to align all your data files using the length-based method developed by Gale and Church (see the lecture notes and the original paper: A Program for Aligning Sentences in Bilingual Corpora, Gale & Church 1993).
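To get a feeling for what "length-based" means, here is a minimal sketch of the core length score. It assumes the model as it is usually summarized: the length difference of two segments is roughly normally distributed, with an expected ratio c = 1 and variance s² = 6.8 per character. These constants and the normalization by the mean length are assumptions taken from common descriptions of the method. The sketch leaves out the prior probabilities for the alignment types (1-1, 1-0, 2-1, ...) and the dynamic programming search, and it is not the code that uplug runs.

```python
import math

C = 1.0   # assumed expected number of target characters per source character
S2 = 6.8  # assumed variance of the length difference per character


def length_delta(len_src, len_trg, c=C, s2=S2):
    """Standardized length difference between a source and a target segment."""
    if len_src == 0 and len_trg == 0:
        return 0.0
    mean = (len_src + len_trg / c) / 2.0
    return (len_trg - len_src * c) / math.sqrt(mean * s2)


def match_cost(len_src, len_trg):
    """Negative log of the two-sided tail probability of the length difference.

    Lower cost means the two lengths fit together better.
    """
    delta = length_delta(len_src, len_trg)
    # 2 * (1 - Phi(|delta|)) equals erfc(|delta| / sqrt(2)).
    prob = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(prob, 1e-300))
```

Segments of similar length get a low cost and segments of very different length get a high cost. This also hints at why short, repetitive subtitle lines are hard for this method: many candidate pairs have nearly the same length.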
Use the following command (store all your subtitles in the folder "data"):
~joerg/projects/uplug/uplug align/sent -src data/movie1.en.xml -trg data/movie1.sv.xml > data/movie1.length.ensv.xml
This will take a few seconds (or a minute, depending on the size of the texts to be aligned).
The alignment information in this example will be stored in
This alignment is coded as XML-based stand-off annotation that connects sentences in the original corpus files by their sentence IDs.
Take a look at the alignment file to see the structure of this annotation.
You can use the following command to extract the aligned sentences from the original documents:
~joerg/projects/uplug/tools/readalign data/movie1.length.ensv.xml | less
Select 2 subtitle pairs (from 2 movies) that you would like to evaluate. Look at the first 20 aligned segments of each of these and evaluate the alignment quality. How many links are OK? Estimate the precision based on your evaluation for each subtitle pair. Report the evaluation scores in your report.
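The precision estimate is simply the share of inspected links that you judged correct. A minimal sketch, where the list of judgements is your own manual annotation (one True/False per inspected link):

```python
def precision(judgements):
    """Fraction of inspected alignment links judged correct.

    judgements: list of booleans, one per inspected link (True = link is OK).
    """
    if not judgements:
        raise ValueError("no links were judged")
    return sum(1 for ok in judgements if ok) / len(judgements)
```

For example, 17 correct links among the first 20 give `precision([True] * 17 + [False] * 3)`, that is 17/20. With only 20 links per file the estimate is coarse, so report the raw counts next to the score.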
Another approach is to use the time information given in the subtitle files to find alignments between sentences. Run a time-based aligner using the following command:
./srtalign.pl data/movie1.en.xml data/movie1.sv.xml > data/movie1.time.ensv.xml
Use readalign from above once again to examine the result.
Do this for all your subtitle pairs.
Also run experiments with additional options to the alignment program. One problem with time-based alignment is that movie subtitles are not always synchronized with the movie in exactly the same way. The alignment script provides heuristics to synchronize the time information based on matching cognates. Try, for example, the following command:
./srtalign.pl -v -c 0.7 -b data/movie1.en.xml data/movie1.sv.xml > data/movie1.cognates.ensv.xml
The flag "-c" specifies a threshold for a string similarity measure called LCSR (longest common subsequence ratio) that is used to find possible pairs of cognates. A value of 1 means identical strings, 0 means no match at all. With the flag "-b", the script tries to find the cognate pair that improves the alignment the most. Alignment quality is measured in terms of the ratio between empty alignments and non-empty ones. You can try without this flag to see what happens. The flag "-v" enables verbose output.
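To see which word pairs pass a given threshold, you can compute the measure yourself. The sketch below assumes the usual definition of LCSR, the length of the longest common subsequence divided by the length of the longer string. How srtalign.pl normalizes the strings (case, punctuation, minimum word length) is not covered here and may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, 1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a, b):
    """Longest common subsequence ratio: LCS length over the longer string length."""
    if not a or not b:
        return 0.0  # choice made for this sketch: empty strings never match
    return lcs_length(a, b) / max(len(a), len(b))
```

With a threshold of 0.7, a pair such as "president" and "presidenten" (LCSR 9/11) counts as a cognate candidate, while unrelated words score close to 0. Raising the threshold gives fewer but safer anchor points for the synchronization.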
After each run, look at the alignment using
readalign and decide what kind of settings you would like to use for your final alignment.
Discuss in your lab report how you made your decisions.
Evaluate the final alignment in the same way as above (with the length-based method) using the same subtitle pairs.
Discuss the differences and the result in general in your report.
Can you see specific problems with one or the other alignment approach?
Are there files that are particularly difficult to align?
Can you explain why?
Important: There are two different assignments for word alignment. The first assignment is a programming exercise, whereas the second assignment involves experimenting with pre-written code. Assignment 1 is for master's students. Bachelor's students are free to choose between the two assignments; please indicate in your report which one you chose.
Copy the "blocks world" corpus into your work directory:
cp /local/kurs/mt/lab-wordalign/corpus* lab-wordalign/
This is a small corpus of Swedish and English sentence pairs. Sentences at corresponding lines are aligned with each other. Open the files to see what they look like.
/local/kurs/mt/bin/plain2snt.out corpus.sv corpus.en
This extracts vocabulary files (with the file extension .vcb) and sentence alignment files (with the file extension .snt) using word type IDs taken from the vocabulary files.
Look at the vocabulary files and report the 3 most frequent words in the English corpus and the 3 most frequent ones in the Swedish corpus (frequencies are given in the third column).
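For larger vocabularies it is easier to sort the file than to read it by eye. A minimal sketch, assuming each line of a .vcb file holds three whitespace-separated columns (word ID, word form, frequency); the function name is hypothetical.

```python
def top_words(vcb_lines, n=3):
    """Return the n most frequent (word, frequency) pairs from .vcb lines."""
    entries = []
    for line in vcb_lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip empty or malformed lines
        _word_id, word, freq = fields
        entries.append((word, int(freq)))
    # Sort by frequency, highest first; ties keep their file order.
    entries.sort(key=lambda entry: entry[1], reverse=True)
    return entries[:n]
```

A call such as `top_words(open("corpus.en.vcb", encoding="utf-8"))` returns the three entries to report for English.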
Look at the sentence alignment files. There are two of them, one for the alignment direction from Swedish to English and one for the other alignment direction. Each aligned unit is stored on three lines: The first one gives the frequency of this particular sentence pair, the second specifies words in the source language and the third one words in the target language.
Using the IDs in the vocabulary files, figure out which one of the files aligns Swedish to English and which one the other way around (include the solution in your report).
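Decoding a few units back into words makes the direction easy to see. The sketch below assumes the three-line layout described above (frequency, first sentence as IDs, second sentence as IDs), no blank lines inside a unit, and .vcb lines of the form "ID word frequency"; all function names are hypothetical.

```python
def read_vocab(vcb_lines):
    """Map word ID -> word form from .vcb lines ("ID word frequency")."""
    vocab = {}
    for line in vcb_lines:
        fields = line.split()
        if len(fields) == 3:
            vocab[int(fields[0])] = fields[1]
    return vocab


def read_snt(snt_lines):
    """Group .snt lines into (frequency, first_ids, second_ids) triples."""
    rows = [line.split() for line in snt_lines if line.strip()]
    if len(rows) % 3 != 0:
        raise ValueError("expected three lines per aligned unit")
    units = []
    for i in range(0, len(rows), 3):
        frequency = int(rows[i][0])
        first = [int(x) for x in rows[i + 1]]
        second = [int(x) for x in rows[i + 2]]
        units.append((frequency, first, second))
    return units


def decode(ids, vocab):
    """Turn a list of word IDs into words; unknown IDs are shown as '?ID'."""
    return [vocab.get(i, "?%d" % i) for i in ids]
```

If the second line of a unit decodes into sensible words with one vocabulary but into gaps or wrong words with the other, you know which language that line holds. Note that both vocabularies number their words independently, so the same ID usually means different words in the two files.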
/local/kurs/mt/bin/mkcls -c5 -pcorpus.en -Vcorpus.en.vcb.classes
/local/kurs/mt/bin/mkcls -c5 -pcorpus.sv -Vcorpus.sv.vcb.classes
This creates 5 word classes for English and 5 for Swedish. Look at the word classes (in the files with the extension .cats) and include the result for both languages in your report. Each class has a unique ID, and the words in that class follow the ID. Do the classes make sense? Can you see a pattern that corresponds to linguistic intuition?
Run once again with 3 classes only (-c3) and see what happens. Add comments about the new result in your report. Run now with 5 classes again to create the original word classes for the next experiments.
/local/kurs/mt/bin/snt2cooc.out corpus.sv.vcb corpus.en.vcb corpus.sv_corpus.en.snt > corpus.sven.cooc
These lists are used for the initial estimates of the word alignment models. Look at the output file and see if it makes sense. Note that ID=0 is reserved for the special NULL word!
/local/kurs/mt/bin/GIZA++ -S corpus.sv.vcb -T corpus.en.vcb -C corpus.sv_corpus.en.snt -cooc corpus.sven.cooc -mh 0 -model3iterations 0 -model4iterations 0 -model1dumpfrequency 1 -o ibm1
This runs 5 iterations of IBM Model 1 on a single core and dumps lexical translation probabilities into the files ibm1.t1.[1-5]; the last number refers to the iteration. Look at Part V (Output File Formats) of the documentation to understand the format of this file (T-table). Take an example of a lexical translation probability (do not select a NULL alignment, where one of the word IDs is 0) and report how the probability changes for this example during the 5 iterations. Replace the word IDs with the actual words by looking them up in the vocabulary files!
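To understand why the probabilities change from iteration to iteration, it helps to look at the EM procedure behind IBM Model 1. The following is a minimal textbook-style sketch working on words instead of IDs. It is not the GIZA++ implementation (no smoothing, no pruning, no word classes), and the function name and the toy sentences in the usage note are invented for illustration.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=5):
    """Estimate lexical translation probabilities t[(trg, src)] = P(trg | src).

    pairs: list of (source_words, target_words) sentence pairs.
    A NULL word is added to every source sentence.
    """
    pairs = [(["NULL"] + list(src), list(trg)) for src, trg in pairs]
    trg_vocab = {word for _, trg in pairs for word in trg}
    uniform = 1.0 / len(trg_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialization

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (trg, src) links
        total = defaultdict(float)  # expected count of src being linked
        # E-step: distribute each target word over the source words.
        for src, trg in pairs:
            for f in trg:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    share = t[(f, e)] / norm
                    count[(f, e)] += share
                    total[e] += share
        # M-step: renormalize the expected counts per source word.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t
```

With toy data such as `[(["en", "bok"], ["a", "book"]), (["en", "stol"], ["a", "chair"]), (["min", "bok"], ["my", "book"])]`, the probability of "a" given "en" grows with every iteration, because "en" and "a" co-occur more consistently than "en" and the nouns. The same effect is what you should observe in the dumped T-tables.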
The files ibm1.A1.[1-5].part0 show the Viterbi alignment after each iteration. Look at the documentation again to understand how to read this file format. Look at the changes throughout the training process and discuss them in your report.
/local/kurs/mt/bin/GIZA++ -S corpus.sv.vcb -T corpus.en.vcb -C corpus.sv_corpus.en.snt -cooc corpus.sven.cooc -mh 0 -model2iterations 5 -model3iterations 5 -model4iterations 0 -o ibm3
Look at the final IBM 3 probability tables created by GIZA++ (including ibm3.n3.final, ibm3.p0_3.final, ibm3.t3.final) and explain in your own words what kind of parameter they refer to. See the documentation of output files (Part V). Give at least one example from the tables of each parameter file and explain what it represents. Replace word IDs by the actual word forms where they appear (for example, in the t-table).
Look at the development of model perplexity (trn-pp) on the training data in ibm3.perp. Where do you see the largest improvements in perplexity? (Lower scores are better.)
Finally, look at the Viterbi word alignment in ibm3.A3.final. List (in a table) how the words are aligned with each other in the first sentence pair.
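The links can also be extracted with a few lines of code. This sketch assumes the commonly seen layout of GIZA++ Viterbi alignment files: one line with the plain target sentence, followed by one line in which each source word (starting with NULL) is followed by the positions of the target words it is linked to, written as `word ({ 1 2 })`. Check the documentation (Part V) to confirm that your files look like this. The function name and the example sentence are invented.

```python
import re

# One source token followed by its (possibly empty) list of target positions.
TOKEN = re.compile(r"(\S+) \(\{([\d ]*)\}\)")


def parse_alignment(target_line, aligned_source_line):
    """Return (source_word, target_word) links; NULL links are included if present."""
    target = target_line.split()
    links = []
    for src_word, positions in TOKEN.findall(aligned_source_line):
        for pos in positions.split():
            # Target positions are 1-based in this format.
            links.append((src_word, target[int(pos) - 1]))
    return links
```

For example, `parse_alignment("the red block", "NULL ({ }) det ({ 1 }) röda ({ 2 }) blocket ({ 3 })")` yields the three word pairs that you would put into the rows of your table.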
Look at the Viterbi alignment file again and look at the alignment of the first sentence pair. List the links again and discuss how the alignment differs from the one in the opposite direction. Also check the other output files and comment on other interesting differences if you find any.
Feel free to experiment with further settings if you like. An overview of all possible parameters is shown if you run GIZA++ without command line arguments.
Don't forget to clean up after all your experiments to save space on our file server!
Prepare a lab report that discusses all your results from sentence alignment and word alignment. Include answers to all questions mentioned in the assignments (in bold) of each part of the lab and report the results of your experiments. Also give some basic statistics of your parallel corpus and mention problems and difficulties you had while working on the assignments. If you implemented IBM Model 1, remember to send your source code.
Submit your report as a pdf through the student portal. Deadline for handing in the report: May 5, 2017.