Experimental Settings: Coling 2010

This page provides additional information about the experiments reported in the Coling 2010 paper Evaluating Dependency Parsers on Unbounded Dependencies.

Training Data and Preprocessing

The standard versions of MSTParser and MaltParser were trained on Penn Treebank, Wall Street Journal, sections 2-21.

The special question models, referred to as MST-Q and Malt-Q in the paper, were trained on the same data plus all sentences in QuestionBank that do not occur in the development or test sets for the unbounded dependencies.

The data from Penn Treebank and QuestionBank were converted to Stanford basic dependencies in CoNLL format using a Java program (PennToStanford.java). The program first employs the Stanford parser to obtain the dependencies in Stanford format from the Penn trees, then converts this output to CoNLL format and adds the part-of-speech tags taken from the original trees. Detailed instructions on how to run this converter are given in a comment at the beginning of the Java code.
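For readers unfamiliar with the output representation, the following is a minimal sketch of reading one token line in the standard 10-column CoNLL-X format that the converter produces (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). The example line is illustrative, not taken from the actual data.

```python
def parse_conll_line(line):
    """Split a tab-separated CoNLL-X token line into the fields used here."""
    cols = line.rstrip("\n").split("\t")
    return {
        "id": int(cols[0]),
        "form": cols[1],
        "postag": cols[4],    # part-of-speech tag from the original trees
        "head": int(cols[6]),
        "deprel": cols[7],    # Stanford basic dependency label
    }

# Illustrative token line for "barks" as the root of its sentence.
token = parse_conll_line("2\tbarks\t_\tVBZ\tVBZ\t_\t0\troot\t_\t_")
print(token["deprel"])  # root
```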

The development and test sets for unbounded dependencies were part-of-speech tagged using SVMTool with the pre-trained model for English available on its website.


Both MaltParser models (Malt, Malt-Q) were trained with MaltParser version 1.3.2, using the following parameters:

java -jar -Xmx1024m malt.jar -c model -m learn -i training.conll -a stackproj -F coling10.xml -d POSTAG -s Stack[0] -T 1000 
This requires the feature specification file coling10.xml. For more information about parameters, see the MaltParser documentation.


Both versions of MSTParser (MST, MST-Q) were trained using MSTParser, version 0.4.3b. Complete instructions and scripts for training the parsers can be found in the archive mst.tar.gz. (Consult the README file.)


The output of all four parsers (MST, MST-Q, Malt, Malt-Q) was post-processed using the Perl script basic2propagated.pl, which infers dependencies licensed by relative clauses and coordination. The usage is as follows:

perl basic2propagated.pl input.conll > output.stanford
NB1: After the experiments reported in the paper, the post-processing was refined and reimplemented; the old version is published here only in the interest of replicability.
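The kind of inference performed by the post-processing can be illustrated with coordination: in Stanford basic dependencies, only the first conjunct bears the governing relation, so a token attached by conj can inherit its head's incoming relation. The following Python sketch shows this single rule on a toy example ("dogs and cats bark"); it is an illustration of the idea, not the actual Perl script, which also handles relative clauses.

```python
def propagate_conj(tokens):
    """tokens: list of dicts with 'id', 'head', 'deprel'.
    Returns extra (head, dependent, label) triples inferred from coordination."""
    by_id = {t["id"]: t for t in tokens}
    extra = []
    for t in tokens:
        if t["deprel"] == "conj":
            first = by_id[t["head"]]  # the first conjunct
            # the second conjunct inherits the first conjunct's relation
            extra.append((first["head"], t["id"], first["deprel"]))
    return extra

tokens = [
    {"id": 1, "head": 4, "deprel": "nsubj"},  # dogs
    {"id": 2, "head": 3, "deprel": "cc"},     # and
    {"id": 3, "head": 1, "deprel": "conj"},   # cats
    {"id": 4, "head": 0, "deprel": "root"},   # bark
]
print(propagate_conj(tokens))  # [(4, 3, 'nsubj')]
```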


We considered a dependency correctly recovered if the gold-standard head and dependent were correctly identified and the label was an "acceptable match" to the gold-standard label. To be an acceptable match, the label had to indicate the grammatical function of the extracted element at least to the level of distinguishing active subjects, passive subjects, objects, and adjuncts. For example, we did not distinguish between dobj (direct object), pobj (prepositional object), and iobj (indirect object) where the gold-standard label was any sort of object, but none of these would have been accepted if the gold-standard label was nsubj (subject). We also did not accept the generic dep (dependency) label, since it is underspecified for the grammatical role of the extracted element. In some cases we allowed the correct dependency to be inferred from a path of dependencies.
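The matching criterion described above can be sketched as a small label-equivalence check. Note that the label groupings below are an illustrative approximation of the criterion, not the exact sets used in the manual evaluation.

```python
# Object labels that were treated as interchangeable (illustrative set).
OBJECT_LABELS = {"dobj", "iobj", "pobj"}

def acceptable_match(gold, predicted):
    """Labels match if they agree on the coarse grammatical function."""
    if predicted == "dep":             # generic labels are never accepted
        return False
    if gold in OBJECT_LABELS:          # any object label matches any other
        return predicted in OBJECT_LABELS
    return predicted == gold           # e.g. nsubj, nsubjpass, adjuncts

print(acceptable_match("dobj", "iobj"))   # True
print(acceptable_match("nsubj", "dobj"))  # False
```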

All the files used for the evaluation and their scoring can be found in the archive evaluation.tar.gz. We include three files for each combination of parser and construction:

*.conll      Raw parser output in CoNLL format
*.stanford   Parser output after post-processing in Stanford format
*.scored     Scoring of parser output after post-processing

The scored files show [1] for a correctly recovered dependency and [0] when the gold-standard dependency was not recovered. A few of the decisions in the development files are commented (prefaced by #). The file difficult_cases.txt gives examples of the reasoning used in the evaluation for some of the more difficult cases in the test data.
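Given this marking scheme, a per-construction recall can be tallied from a scored file as sketched below. The sketch assumes that each scored line begins with its [1] or [0] marker and that comment lines start with #; the example lines are invented for illustration.

```python
def recall_from_scored(lines):
    """Fraction of gold-standard dependencies marked as recovered ([1])."""
    hits = sum(1 for l in lines if l.startswith("[1]"))
    misses = sum(1 for l in lines if l.startswith("[0]"))
    total = hits + misses
    return hits / total if total else 0.0

# Hypothetical scored lines, including a commented decision.
lines = [
    "[1] nsubj(eat-4, who-1)",
    "[0] dobj(saw-3, what-1)",
    "# borderline case, see difficult_cases.txt",
    "[1] nsubjpass(built-5, which-2)",
]
print(round(recall_from_scored(lines), 2))  # 0.67
```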

NB2: After the experiments reported in the paper, the manual evaluation has been replaced by a fully automatic procedure, which will be used in the future.

NB3: There are three errors in Table 3 in the published paper:

  1. MST on ObRed: The total number of errors should be 10 (not 9).
  2. Malt on ObRed: The total number of errors should be 14 (not 13).
  3. Malt on Free: The number of Arg errors should be 4 (not 5) and the total number of errors should be 5 (not 6).