Practical Assignment 2: PCFG Parsing

This assignment is about statistical parsing. We will create a treebank PCFG and create a simple parser using DCG in Prolog. The data for this assignment have been generated using a probabilistic grammar for a blocks world language and consist of sentences like take the blue cube and put the block on the red square. Even if the vocabulary and grammar of this language is very restricted, the language does in fact contain ambiguous sentences, and the task is to investigate how often the most probable analysis according to the treebank PCFG is in fact the preferred interpretation.

Data

The training data consists of 100 sentences and the test data of 10 sentences. These are exactly the same sentences as in the first assignment, but they are now distributed as treebanks, i.e., sets of syntactically analyzed sentences:

PCFG Parsing in Prolog

A simple way of parsing with a PCFG is to use the built-in DCG in Prolog. The only requirements are that every category symbol (term) is augmented with an extra argument for the probability of the phrase and that every rule is augmented with a Prolog call to multiply the probability of the daughters with the probability of the rule itself in order to obtain the probability of the entire phrase. For example, the following PCFG rule with probability 0.4
S --> NP VP : 0.4
can be written
s(P0) --> np(P1), vp(P2), {P0 is P1*P2*0.4}.
Please note that variables start with uppercase letters and constants with lowercase letters in Prolog.

When we use the grammar for parsing it is practical to augment each category symbol (term) with an additional argument for a parse tree. Thus:

s(P0,s(NP,VP)) --> np(P1,np(NP)), vp(P2,vp(VP)), {P0 is P1*P2*0.4}.
For this assignment, you don't need to understand how parsing is performed in Prolog, but you need to understand the rule format.

Software

You have at your disposal a simple Prolog program that parses sentences using a grammar in the format described above. To use this program you need to do the following:
  1. Download the file parser.pl and save it in a suitable working directory.
  2. Start the Prolog interpreter from the same working directory by running the command swipl (which starts SWI-Prolog):
    prompt$ swipl
    Welcome to SWI-Prolog (Multi-threaded, Version 5.6.35)
    Copyright (c) 1990-2007 University of Amsterdam.
    SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software,
    and you are welcome to redistribute it under certain conditions.
    Please visit http://www.swi-prolog.org for details.
    
    For help, use ?- help(Topic). or ?- apropos(Word).
    
    ?-
    
  3. Compile the file parser.pl by typing [parser]. and hitting return at the Prolog prompt:
    ?- [parser].
    % parser compiled 0.00 sec, 3,492 bytes
    
    Yes
    ?- 
    
In order to use the program for parsing, you must also load a grammar and the sentences in the test set. You do this by compiling the files pcfg2.pl and test_data_trees.pl. To see how the parsing works, you can parse sentence 1 in the test set as follows:
?- parse(1).
Sentence: take the block on the green circle
Probability: 0.000162469
s
  v-take
  np
    det-the
    n-block
    pp
      p-on
      np
        det-the
        a-green
        n-circle
Probability: 0.000449915
s
  v-take
  np
    det-the
    n-block
  pp
    p-on
    np
      det-the
      a-green
      n-circle

Yes
?-
Please note that the parser prints the sentence followed by all parse trees preceded by their probability according to the grammar. In the example above, the first parse tree has a probability of approximately 0.00016 and the second a probability of 0.00045. Which is the correct parse tree? (How do you know that?)

If you want to parse sentences that are not in the test set, you can use the predicate parse_sentence:

?- parse_sentence([take,the,block]).
Sentence: take the block
Probability: 0.02898
s
  v-take
  np
    det-the
    n-block

Yes
?-

Task 1: Parsing and Evaluation

Parse all ten sentences in the test corpus and evaluate the result by checking how often the grammar assigns the highest probability to the preferred parse tree. (The preferred parse trees for the test sentences are available as trees in test_data_trees_pp.) In addition to a summary of the evaluation, you should try to analyze the errors that you observe.

Task 2: Grammar Transformation

The simple treebank grammar used so far gives highly systematic parsing errors. Try to transform the grammar to improve parsing accuracy. It should be possible to achieve 100% accuracy on the test set.

Report

Write a short report summarizing the evaluation of the grammar in task 1 and describing your grammar transformation in task 2. Deadline: 19 April