Practical Assignment 1: Language Modeling

This practical assignment is about language modeling, that is, probability estimation for arbitrary strings of symbols (in our case sequences of words). Data for the assignment have been generated using a probabilistic grammar for a blocks world language and contains sentences like take the blue cube and put the block on the red square. The vocabulary is limited to 12 words and the grammar is very restricted in order to make the task feasible with relatively little data and without advanced smoothing methods. Since the data have been generated by a probabilistic grammar, it is possible to say exactly what is the probability of each sentence in the test corpus. The task in this practical assignment is to see how close to these probabilities we can get with different n-gram models. The main point of the assignment is to give a conceptual understanding of probabilistic modeling, not to build practically useful language models.

Data

The training data consist of 100 sentences and the test data of 10 sentences, both generated by the same probabilistic grammar: For the test data the exact probability of each sentence is given, in order to enable the evaluation of different language models. The normal way of evaluating language models is to compute the probability or entropy of the test set given the model. The principle is that a better model assigns a higher probability (lower entropy) to the test set. In this assignment, we in addition have access to the true probabilities, which means that we can check how well we can approximate the true probabilities. (More about n-gram models, probability and entropy.)

Software

We will use a simple Prolog program that computes probabilities for strings according to a unigram, bigram or trigram model induced from the training data. The probability estimates are pure maximum likelihood estimates, without any kind of smoothing. In order to use the program, you have to do the following:
  1. Download ngram.pl and save it in a suitable working directory.
  2. Start the Prolog interpreter from the same directory by running the command swipl (which starts SWI-Prolog):
    prompt$ swipl
    Welcome to SWI-Prolog (Multi-threaded, Version 5.6.35)
    Copyright (c) 1990-2007 University of Amsterdam.
    SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software,
    and you are welcome to redistribute it under certain conditions.
    Please visit http://www.swi-prolog.org for details.
    
    For help, use ?- help(Topic). or ?- apropos(Word).
    
    ?-
    
  3. Compile the file ngram.pl by entering [ngram]. and hitting return at the Prolog prompt.
    ?- [ngram].
    % ngram compiled 0.00 sec, 5,592 bytes
    
    Yes
    ?- 
    
In order to use the program, you also have to load the training data and test data. You do this by compiling the files training_data.pl and test_data.pl. The former contains unigram, bigram and trigram frequencies (in the form of a Prolog database); the latter contains the sentences in the test set in different formats tailored for ngram.pl.

Task 1: String Probabilities

The first task is to check what probability the different models assign to the sentences in the test set. You should test the unigram, bigram and trigram models and compare them to each other and to the true probabilities. In order to test sentence number 1 with the unigram model, you type test(1, unigram) at the Prolog prompt and hit return:
?- test(1, unigram).
Sentence: take the block on the green circle
Probability: 3.23519e-09

Yes
?-
In order to test other sentences, you replace 1 by one of the numbers 2-10. In order to test other models, you replace unigram by bigram or trigram. If you want to test your own sentences, you can instead use the predicate test_sentence, which takes a model name and a list of words as arguments. For example, you can compute the probability for the sentence take the block according to the bigram model like this:
?- test_sentence(bigram, [take,the,block]).
Sentence: take the block
Probability: 0.0126059

Yes
?-
Go through all ten sentences with all three models and study how the probabilities vary for different models and sentences. Check how they relate to the true probabilities for each sentence and try to explain the patterns you find.

Taks 2: Corpus Probability and Entropy

The normal way of evaluating a language model is to compute the probability or entropy that the model assigns to a given test corpus. The principle is that a better model gives higher probability (lower entropy). To compute these metrics, you use the predicate test_corpus as follows:
?- test_corpus(unigram).
Probability: 1.45072e-89
Entropy per word: 3.43157
Entropy per sentence: 29.5115

Yes
?-
By replacing unigram by bigram and trigram, you can test all three models. Compare the result for a given model with the true probability (entropy) given for the test data.

Report

Write a short report on the evaluation of the different models (1-2 pages). Try to explain the differences you observe both between different models and between the models and the true probabilities. Deadline: 20 March