Practical Assignment 1: Language Modeling
This practical assignment is about language modeling, that is,
probability estimation for arbitrary strings of symbols (in our
case sequences of words). Data for the assignment have been generated
using a probabilistic grammar for a blocks world language and contains
sentences like take the blue cube and
put the block on the red square.
The vocabulary is limited to 12 words and the grammar is very
restricted in order to make the task feasible with relatively little
data and without advanced smoothing methods. Since the data have been
generated by a probabilistic grammar, it is possible to say exactly
what is the probability of each sentence in the test corpus. The task
in this practical assignment is to see how close to these probabilities
we can get with different n-gram models. The main point of the assignment
is to give a conceptual understanding of probabilistic modeling, not
to build practically useful language models.
Data
The training data consist of 100 sentences and the test data of 10 sentences,
both generated by the same probabilistic grammar:
For the test data the exact probability of each sentence is given, in order
to enable the evaluation of different language models. The normal way of
evaluating language models is to compute the probability or entropy of
the test set given the model. The principle is that a better model assigns
a higher probability (lower entropy) to the test set. In this assignment,
we in addition have access to the true probabilities, which means that we
can check how well we can approximate the true probabilities.
(More about n-gram models,
probability and
entropy.)
Software
We will use a simple Prolog program that computes probabilities for strings
according to a unigram, bigram or trigram model induced from the training data.
The probability estimates are pure maximum likelihood estimates, without any
kind of smoothing. In order to use the program, you have to do the following:
- Download ngram.pl and save it in a suitable working directory.
- Start the Prolog interpreter from the same directory by running the command swipl
(which starts SWI-Prolog):
prompt$ swipl
Welcome to SWI-Prolog (Multi-threaded, Version 5.6.35)
Copyright (c) 1990-2007 University of Amsterdam.
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to redistribute it under certain conditions.
Please visit http://www.swi-prolog.org for details.
For help, use ?- help(Topic). or ?- apropos(Word).
?-
- Compile the file ngram.pl by entering [ngram]. and hitting return
at the Prolog prompt.
?- [ngram].
% ngram compiled 0.00 sec, 5,592 bytes
Yes
?-
In order to use the program, you also have to load the training data and test data. You do this by
compiling the files training_data.pl
and test_data.pl.
The former contains unigram, bigram and trigram frequencies (in the form of a Prolog database);
the latter contains the sentences in the test set in different formats tailored for
ngram.pl.
Task 1: String Probabilities
The first task is to check what probability the different models assign to the sentences in
the test set. You should test the unigram, bigram and trigram models and compare them to each
other and to the true probabilities. In order to test sentence number 1 with the unigram model,
you type test(1, unigram) at the Prolog prompt and hit return:
?- test(1, unigram).
Sentence: take the block on the green circle
Probability: 3.23519e-09
Yes
?-
In order to test other sentences, you replace 1 by one of the numbers 2-10. In order to test
other models, you replace unigram by bigram or trigram. If you want to test your own
sentences, you can instead use the predicate test_sentence, which takes a model name and a list of
words as arguments. For example, you can compute the probability for the sentence
take the block according to the bigram model like this:
?- test_sentence(bigram, [take,the,block]).
Sentence: take the block
Probability: 0.0126059
Yes
?-
Go through all ten sentences with all three models and study how the probabilities vary for different
models and sentences. Check how they relate to the true probabilities for each sentence and try to
explain the patterns you find.
Taks 2: Corpus Probability and Entropy
The normal way of evaluating a language model is to compute the probability or
entropy that the model assigns to a given test corpus. The principle is that
a better model gives higher probability (lower entropy). To compute these metrics,
you use the predicate test_corpus as follows:
?- test_corpus(unigram).
Probability: 1.45072e-89
Entropy per word: 3.43157
Entropy per sentence: 29.5115
Yes
?-
By replacing unigram by bigram and trigram, you can
test all three models. Compare the result for a given model with the true
probability (entropy) given for the test data.
Report
Write a short report on the evaluation of the different models (1-2 pages).
Try to explain the differences you observe both between different models and
between the models and the true probabilities.
Deadline: 20 March