Practical Assignment 2: PCFG Parsing
This assignment is about statistical parsing. We will create a treebank
PCFG and create a simple parser using DCG in Prolog. The data for this
assignment have been generated using a probabilistic grammar for
a blocks world language and consist of sentences like
take the blue cube and
put the block on the red square.
Even if the vocabulary and grammar of this language is very
restricted, the language does in fact contain ambiguous sentences,
and the task is to investigate how often the most probable analysis
according to the treebank PCFG is in fact the preferred interpretation.
Data
The training data consists of 100 sentences and the test data of
10 sentences. These are exactly the same sentences as in the first
assignment, but they are now distributed as treebanks, i.e.,
sets of syntactically analyzed sentences:
PCFG Parsing in Prolog
A simple way of parsing with a PCFG is to use the built-in DCG in Prolog.
The only requirements are that every category symbol (term) is augmented
with an extra argument for the probability of the phrase and that
every rule is augmented with a Prolog call to multiply the probability
of the daughters with the probability of the rule itself in order to
obtain the probability of the entire phrase.
For example, the following PCFG rule with probability 0.4
S --> NP VP : 0.4
can be written
s(P0) --> np(P1), vp(P2), {P0 is P1*P2*0.4}.
Please note that variables start with uppercase letters and constants
with lowercase letters in Prolog.
When we use the grammar for parsing it is practical to augment each
category symbol (term) with an additional argument for a parse tree.
Thus:
s(P0,s(NP,VP)) --> np(P1,np(NP)), vp(P2,vp(VP)), {P0 is P1*P2*0.4}.
For this assignment, you don't need to understand how parsing is performed
in Prolog, but you need to understand the rule format.
Software
You have at your disposal a simple Prolog program that parses sentences
using a grammar in the format described above. To use this program you need to do the following:
- Download the file parser.pl and save it in a suitable working directory.
- Start the Prolog interpreter from the same working directory by running the command swipl
(which starts SWI-Prolog):
prompt$ swipl
Welcome to SWI-Prolog (Multi-threaded, Version 5.6.35)
Copyright (c) 1990-2007 University of Amsterdam.
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to redistribute it under certain conditions.
Please visit http://www.swi-prolog.org for details.
For help, use ?- help(Topic). or ?- apropos(Word).
?-
- Compile the file parser.pl by typing [parser]. and hitting return at
the Prolog prompt:
?- [parser].
% parser compiled 0.00 sec, 3,492 bytes
Yes
?-
In order to use the program for parsing, you must also load a grammar and the sentences
in the test set. You do this by compiling the files pcfg2.pl and
test_data_trees.pl. To see how the parsing works,
you can parse sentence 1 in the test set as follows:
?- parse(1).
Sentence: take the block on the green circle
Probability: 0.000162469
s
v-take
np
det-the
n-block
pp
p-on
np
det-the
a-green
n-circle
Probability: 0.000449915
s
v-take
np
det-the
n-block
pp
p-on
np
det-the
a-green
n-circle
Yes
?-
Please note that the parser prints the sentence followed by all parse
trees preceded by their probability according to the grammar.
In the example above, the first parse tree has a probability of
approximately 0.00016 and the second a probability of 0.00045.
Which is the correct parse tree? (How do you know that?)
If you want to parse sentences that are not in the test set, you
can use the predicate parse_sentence:
?- parse_sentence([take,the,block]).
Sentence: take the block
Probability: 0.02898
s
v-take
np
det-the
n-block
Yes
?-
Task 1: Parsing and Evaluation
Parse all ten sentences in the test corpus and evaluate the result by checking
how often the grammar assigns the highest probability to the preferred parse tree.
(The preferred parse trees for the test sentences are available as trees in
test_data_trees_pp.) In addition to a summary
of the evaluation, you should try to analyze the errors that you observe.
Task 2: Grammar Transformation
The simple treebank grammar used so far gives highly systematic parsing errors.
Try to transform the grammar to improve parsing accuracy. It should be possible to
achieve 100% accuracy on the test set.
Report
Write a short report summarizing the evaluation of the grammar in task 1
and describing your grammar transformation in task 2.
Deadline: 19 April