Download MaltParser 0.1

MaltParser 0.1 can be downloaded as binaries and is available on three platforms. The software can be used freely for non-commercial research and educational purposes. It comes with no warranty, but we welcome all comments, bug reports, and suggestions for improvements.

MaltParser 0.1 uses libTimbl, part of TiMBL (Tilburg Memory-Based Learner), version 5.0, in order to learn parsing models from treebanks, and we gratefully acknowledge the use of this excellent software package. However, MaltParser 0.1 is a standalone application, so there is no need to install TiMBL separately.

User Guide for MaltParser 0.1

This is a short user guide for MaltParser 0.1, a data-driven trainable parser that uses dependency-based syntactic representations and memory-based learning for predicting parser actions. More information about models and algorithms can be found in the following papers:

Running MaltParser

MaltParser is run by executing the following command at the command line prompt:

> ./malt -f file

where file is the name of an option file, specifying all the parameters needed. The parser can be run in two basic modes, learning (inducing a parsing model from a treebank) and parsing (using the parsing model to parse new data). In the current version of the parser, new data must be tokenized and part-of-speech tagged in the Malt-TAB format. The option file, which also specifies the parser mode, is described in detail below.
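The invocation above can also be scripted. The following is a minimal sketch, assuming the malt binary sits in the current directory as in the command shown; the helper names build_malt_command and run_malt and the file name options.txt are illustrative and not part of MaltParser.

```python
import subprocess
from pathlib import Path


def build_malt_command(option_file, binary="./malt"):
    """Return the argument list for `./malt -f <option_file>`."""
    return [binary, "-f", str(option_file)]


def run_malt(option_file, binary="./malt"):
    """Run MaltParser with the given option file and return its exit code.

    This guide does not say whether MaltParser signals failure through
    its exit code, so the code is returned to the caller, not checked.
    """
    if not Path(option_file).is_file():
        raise FileNotFoundError(f"option file not found: {option_file}")
    return subprocess.run(build_malt_command(option_file, binary)).returncode


if __name__ == "__main__":
    # options.txt is a hypothetical option file name.
    print(build_malt_command("options.txt"))
```

Since the parser mode is set inside the option file, the same call serves both learning and parsing; only the option file changes.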

Option File

The option file contains a sequence of parameter specifications with the following simple syntax:

$PARAMETER$
VALUE

In addition, the option file may contain comment lines starting with "--". The following table lists all the available parameters with their permissible values. Default values are marked with "*". Parameters that lack a default value must be specified in the option file (if they are required by the particular configuration of modules invoked). An example option file can be found here.
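To make the syntax concrete, here is a minimal sketch of a reader for option files, written in Python. It is not MaltParser's own parser. It assumes that blank lines are ignored and that each $PARAMETER$ line is followed by exactly one value line; it also treats any line starting with "--" as a comment, so a value beginning with "--" would be skipped. The function name parse_option_file and the file names in the example are hypothetical.

```python
def parse_option_file(text):
    """Parse option-file text into a dict mapping parameter names to values.

    Assumptions (not stated in the guide): blank lines are ignored, and
    each $PARAMETER$ line is followed by exactly one value line.
    """
    options = {}
    pending = None  # parameter name still waiting for its value line
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("--"):
            continue  # skip blank lines and "--" comment lines
        if pending is None:
            if not (len(line) > 2 and line.startswith("$") and line.endswith("$")):
                raise ValueError(f"expected $PARAMETER$, got: {line!r}")
            pending = line[1:-1]
        else:
            options[pending] = line
            pending = None
    if pending is not None:
        raise ValueError(f"parameter {pending} has no value")
    return options


# Hypothetical option file for learning; all file names are made up.
EXAMPLE = """\
-- learn a model from a treebank
$PARSERMODE$
LEARN
$INFILE$
train.tab
$MODELFILE$
model.ins
$POSSET$
pos.txt
$DEPSET$
dep.txt
"""

if __name__ == "__main__":
    print(parse_option_file(EXAMPLE))
```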

I/O Parameters

INFILE: Input file
    Values: Filename
    The input (for both learning and parsing) must be in the Malt-TAB format. During learning the four columns form, postag, head, deprel are required; during parsing only the first two (form, postag) are required. An example input file can be found here.

OUTFILE: Output file
    Values: Filename

OUTFORMAT: Output data format
    TAB          Malt-TAB
    (MALT)XML*   Malt-XML

VERBOSE: Output to terminal
    YES*
    NO

Tagset Parameters

POSSET: Part-of-speech tagset
    Values: Filename
    The part-of-speech tagset must be specified in a text file with one tag per line (and no blank lines). An example file can be found here.

DEPSET: Dependency type tagset
    Values: Filename
    The dependency type tagset must be specified in a text file with one tag per line (and no blank lines). The first tag must be the tag assigned to root nodes. An example file can be found here.

Parser Parameters

PARSERMODE: Parser mode (learning or parsing)
    PARSE*   Parsing (using a memory-based model)
    LEARN    Learning (building an instance base for memory-based learning)

MODELTYPE: Model type (feature set) for memory-based learning, described in detail below
    MBL2     CoNLL 2004: Non-lexical
    MBL3*    CoNLL 2004: Lexical
    MBL4     Coling 2004: Model 2
    MBL6     Coling 2004: Model 1

PROJECTIVE: Enforce projectivity or not
    YES
    NO*
    Under the NO condition, no check is made to ensure that REDUCE actions are legal. Under the YES condition, illegal REDUCE actions are replaced by SHIFT actions.

MODELFILE: Model file
    Values: Filename
    The model file contains the instance base for memory-based learning. This is an output file during learning and a required input file during parsing. A (small) example file can be found here.

COMMAND: TiMBL command
    Values: String
    These are the command-line options sent to the TiMBL server for memory-based learning. The default value is "-m M -k 5 -d ID -L 3" (see TiMBL Reference Guide).
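The file requirements for INFILE, POSSET and DEPSET can be checked before running the parser. The sketch below is one way to do that in Python. It assumes that Malt-TAB columns are separated by tab characters and that a blank line ends a sentence; neither detail is spelled out in this guide. The function names, the sample sentences and the tags in them are hypothetical.

```python
COLUMNS = ("form", "postag", "head", "deprel")


def read_malt_tab(text, mode="LEARN"):
    """Split Malt-TAB text into sentences, each a list of token dicts.

    Assumptions: columns are tab-separated and a blank line ends a sentence.
    Learning needs all four columns; parsing needs only form and postag.
    """
    required = 4 if mode == "LEARN" else 2
    sentences, current = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            if current:  # a blank line closes the current sentence
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) < required:
            raise ValueError(
                f"line {number}: need {required} columns, got {len(fields)}"
            )
        # zip stops at the shorter sequence, so parse-mode input with two
        # columns yields only the form and postag keys.
        current.append(dict(zip(COLUMNS, fields)))
    if current:
        sentences.append(current)
    return sentences


def load_tagset(text):
    """Read a tagset: one tag per line, no blank lines allowed."""
    tags = text.splitlines()
    if any(not tag.strip() for tag in tags):
        raise ValueError("tagset files must not contain blank lines")
    return tags


# Hypothetical sample data; the tags are invented for illustration.
SAMPLE = (
    "The\tDT\t2\tDET\n"
    "cat\tNN\t3\tSUB\n"
    "sleeps\tVB\t0\tROOT\n"
    "\n"
    "Dogs\tNNS\t2\tSUB\n"
    "bark\tVB\t0\tROOT\n"
)

if __name__ == "__main__":
    for sentence in read_malt_tab(SAMPLE):
        print([token["form"] for token in sentence])
```

For a DEPSET file, the first element of the list returned by load_tagset is the tag assigned to root nodes.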

Parsing Models

The features used in the different parsing models are depicted below. Top is the token on top of the stack, Next is the next input token, and L1, L2, L3 are (the parts of speech of) the three tokens following Next.

The dependency arcs represent dependencies that may or may not be present at decision time, where TL and TR represent (the parts of speech of) the leftmost and rightmost dependents of Top, and NL the leftmost dependent of Next (in case there are several dependents).

Red features are lexical features (word forms); blue features are part-of-speech features (PoS tags); and green features are dependency features (dependency types). Note that for the words Top and Next, there are both lexical and part-of-speech features.

The following table shows which features are used in the different parsing models.

Models Top Next T N TH TL TR NL TH TL TR NL L1 L2 L3
MBL2 +++++++
MBL3 +++++++++
MBL4 +++++++++++
MBL6 +++++++++++++++

NB: The "missing" models MBL1 and MBL5 are only of historical interest and are not available in the current release.