Malt-XML, Malt-TAB and MaltConverter

Malt-XML

Malt-XML is an XML-based representation format for dependency treebanks. The representation assumes that each word has at most one head. By convention, word ids start at 1, and a root word has head="0" and deprel="ROOT". A dependency tree for the Swedish sentence "Genom skattereformen införs individuell beskattning (särbeskattning) av arbetsinkomster." can be represented as follows:
<sentence id="2" user="malt" date="">
  <word id="1" form="Genom" postag="pp" head="3" deprel="ADV"/>
  <word id="2" form="skattereformen" postag="nn.utr.sin.def.nom" head="1" deprel="PR"/>
  <word id="3" form="införs" postag="vb.prs.sfo" head="0" deprel="ROOT"/>
  <word id="4" form="individuell" postag="jj.pos.utr.sin.ind.nom" head="5" deprel="ATT"/>
  <word id="5" form="beskattning" postag="nn.utr.sin.ind.nom" head="3" deprel="SUB"/>
  <word id="6" form="(" postag="pad" head="5" deprel="IP"/>
  <word id="7" form="särbeskattning" postag="nn.utr.sin.ind.nom" head="5" deprel="APP"/>
  <word id="8" form=")" postag="pad" head="5" deprel="IP"/>
  <word id="9" form="av" postag="pp" head="5" deprel="ATT"/>
  <word id="10" form="arbetsinkomster" postag="nn.utr.plu.ind.nom" head="9" deprel="PR"/>
  <word id="11" form="." postag="mad" head="3" deprel="IP"/>
</sentence>
The tagsets used for parts-of-speech and dependency relations must be specified in the header of the XML document. An example document can be found here. An XML schema for Malt-XML treebanks can be found here.
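The conventions above (ids starting at 1, at most one head per word, head="0" for a root) can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the function names read_heads and roots are hypothetical and not part of any Malt tool, and the sample is a shortened three-word version of the sentence above.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the Malt-XML sentence format shown above.
SENTENCE = """<sentence id="2" user="malt" date="">
  <word id="1" form="Genom" postag="pp" head="3" deprel="ADV"/>
  <word id="2" form="skattereformen" postag="nn.utr.sin.def.nom" head="1" deprel="PR"/>
  <word id="3" form="införs" postag="vb.prs.sfo" head="0" deprel="ROOT"/>
</sentence>"""


def read_heads(sentence_xml):
    """Return {word id: (head id, deprel)} for one <sentence> element."""
    sentence = ET.fromstring(sentence_xml)
    heads = {}
    for word in sentence.findall("word"):
        # Each word carries exactly one head attribute, so one entry per id.
        heads[int(word.get("id"))] = (int(word.get("head")), word.get("deprel"))
    return heads


def roots(heads):
    """Ids of the words whose head is 0, i.e. the root words."""
    return sorted(i for i, (head, _) in heads.items() if head == 0)


heads = read_heads(SENTENCE)
print(heads[1], roots(heads))
```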

Malt-TAB

Malt-TAB is a text-based representation, which is mainly used by MaltParser. Malt-TAB contains a subset of the features in Malt-XML, and attributes are implicitly defined by their position. Each word is represented on one line, with attribute values being separated by tabs. The required order of attributes is as follows:

form (required) < postag (required) < head (optional) < deprel (optional)

Although head and deprel are optional, they must either both be included or both be omitted. (Normally, all four columns are present in the input when training the parser and in the output when parsing, while only form and postag are present in the input when parsing.) Please note also that the id attribute is not represented explicitly at all. Words in a sentence are separated by one newline; sentences are separated by one additional newline. A dependency tree for the Swedish sentence "Genom skattereformen införs individuell beskattning (särbeskattning) av arbetsinkomster." can be represented as follows:

Genom	pp	3	ADV
skattereformen	nn.utr.sin.def.nom	1	PR
införs	vb.prs.sfo	0	ROOT
individuell	jj.pos.utr.sin.ind.nom	5	ATT
beskattning	nn.utr.sin.ind.nom	3	SUB
(	pad	5	IP
särbeskattning	nn.utr.sin.ind.nom	5	APP
)	pad	5	IP
av	pp	5	ATT
arbetsinkomster	nn.utr.plu.ind.nom	9	PR
.	mad	3	IP

An example document can be found here.
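Reading Malt-TAB as described above can be sketched in Python as follows. The sketch assumes exactly one tab between columns and a blank line between sentences; the function name parse_malt_tab is hypothetical, and the word id is reconstructed from the line position because the format does not store it.

```python
def parse_malt_tab(text):
    """Split Malt-TAB text into sentences; each word is a dict with an implicit id."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        # head and deprel must be either both present or both absent.
        if len(fields) not in (2, 4):
            raise ValueError(f"expected 2 or 4 columns, got {len(fields)}: {line!r}")
        word = {"id": len(current) + 1, "form": fields[0], "postag": fields[1]}
        if len(fields) == 4:
            word["head"] = int(fields[2])
            word["deprel"] = fields[3]
        current.append(word)
    if current:
        sentences.append(current)
    return sentences


# Illustrative input: one annotated sentence and one unannotated one-word sentence.
sample = (
    "Genom\tpp\t3\tADV\n"
    "skattereformen\tnn.utr.sin.def.nom\t1\tPR\n"
    "införs\tvb.prs.sfo\t0\tROOT\n"
    "\n"
    ".\tmad\n"
)
sentences = parse_malt_tab(sample)
```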

Malt-XML (Malt-TAB) <--> TIGER-XML

For interchange purposes we have defined a conversion from Malt-XML to Nordic Treebank Network TIGER-XML. The above sentence will get the following representation in NTN TIGER-XML:

<s id="s2">
  <graph root="p2_3">
    <terminals>
      <t id="w2_1" form="Genom" postag="pp"/>
      <t id="w2_2" form="skattereformen" postag="nn.utr.sin.def.nom"/>
      <t id="w2_3" form="införs" postag="vb.prs.sfo"/>
      <t id="w2_4" form="individuell" postag="jj.pos.utr.sin.ind.nom"/>
      <t id="w2_5" form="beskattning" postag="nn.utr.sin.ind.nom"/>
      <t id="w2_6" form="(" postag="pad"/>
      <t id="w2_7" form="särbeskattning" postag="nn.utr.sin.ind.nom"/>
      <t id="w2_8" form=")" postag="pad"/>
      <t id="w2_9" form="av" postag="pp"/>
      <t id="w2_10" form="arbetsinkomster" postag="nn.utr.plu.ind.nom"/>
      <t id="w2_11" form="." postag="mad"/>
    </terminals>
    <nonterminals>
      <nt id="p2_1" form="Genom" postag="pp" >
        <edge idref="w2_1" label="--"/>
        <edge idref="p2_2" label="PR"/>
      </nt>
      <nt id="p2_2" form="skattereformen" postag="nn.utr.sin.def.nom" >
        <edge idref="w2_2" label="--"/>
      </nt>
      <nt id="p2_3" form="införs" postag="vb.prs.sfo" >
        <edge idref="w2_3" label="--"/>
        <edge idref="p2_1" label="ADV"/>
        <edge idref="p2_5" label="SUB"/>
        <edge idref="p2_11" label="IP"/>
      </nt>
      <nt id="p2_4" form="individuell" postag="jj.pos.utr.sin.ind.nom" >
        <edge idref="w2_4" label="--"/>
      </nt>
      <nt id="p2_5" form="beskattning" postag="nn.utr.sin.ind.nom" >
        <edge idref="w2_5" label="--"/>
        <edge idref="p2_4" label="ATT"/>
        <edge idref="p2_6" label="IP"/>
        <edge idref="p2_7" label="APP"/>
        <edge idref="p2_8" label="IP"/>
        <edge idref="p2_9" label="ATT"/>
      </nt>
      <nt id="p2_6" form="(" postag="pad" >
        <edge idref="w2_6" label="--"/>
      </nt>
      <nt id="p2_7" form="särbeskattning" postag="nn.utr.sin.ind.nom" >
        <edge idref="w2_7" label="--"/>
      </nt>
      <nt id="p2_8" form=")" postag="pad" >
        <edge idref="w2_8" label="--"/>
      </nt>
      <nt id="p2_9" form="av" postag="pp" >
        <edge idref="w2_9" label="--"/>
        <edge idref="p2_10" label="PR"/>
      </nt>
      <nt id="p2_10" form="arbetsinkomster" postag="nn.utr.plu.ind.nom" >
        <edge idref="w2_10" label="--"/>
      </nt>
      <nt id="p2_11" form="." postag="mad" >
        <edge idref="w2_11" label="--"/>
      </nt>
    </nonterminals>
  </graph>
</s>

An example document can be found here.
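The pattern visible in the example above is that every word becomes a terminal w<sentence>_<id> and a nonterminal p<sentence>_<id>, and each nonterminal has an unlabeled ("--") edge to its own terminal followed by one labeled edge per dependent. The Python sketch below reproduces that pattern. It is inferred from the example only and is not MaltConverter's actual implementation; the function name malt_to_tiger and the dict-based input are assumptions.

```python
import xml.etree.ElementTree as ET


def malt_to_tiger(sentence_id, words):
    """Build an <s> element from words given as dicts with id, form, postag, head, deprel."""
    s = ET.Element("s", id=f"s{sentence_id}")
    # The graph root is the nonterminal of the word whose head is 0.
    root_id = next(w["id"] for w in words if w["head"] == 0)
    graph = ET.SubElement(s, "graph", root=f"p{sentence_id}_{root_id}")
    terminals = ET.SubElement(graph, "terminals")
    nonterminals = ET.SubElement(graph, "nonterminals")
    for w in words:
        ET.SubElement(terminals, "t", id=f"w{sentence_id}_{w['id']}",
                      form=w["form"], postag=w["postag"])
        nt = ET.SubElement(nonterminals, "nt", id=f"p{sentence_id}_{w['id']}",
                           form=w["form"], postag=w["postag"])
        # First edge: the word's own terminal, with the empty label "--".
        ET.SubElement(nt, "edge", idref=f"w{sentence_id}_{w['id']}", label="--")
        # Then one edge per dependent, in word order, labeled with its deprel.
        for dep in words:
            if dep["head"] == w["id"]:
                ET.SubElement(nt, "edge", idref=f"p{sentence_id}_{dep['id']}",
                              label=dep["deprel"])
    return s


words = [
    {"id": 1, "form": "Genom", "postag": "pp", "head": 3, "deprel": "ADV"},
    {"id": 2, "form": "skattereformen", "postag": "nn.utr.sin.def.nom", "head": 1, "deprel": "PR"},
    {"id": 3, "form": "införs", "postag": "vb.prs.sfo", "head": 0, "deprel": "ROOT"},
]
tiger = malt_to_tiger(2, words)
print(ET.tostring(tiger, encoding="unicode"))
```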

MaltConverter 0.1

MaltConverter is a terminal-based program for converting between the dependency treebank representation formats Malt-XML, Malt-TAB and TIGER-XML (NTN). It can also map attribute names and tagsets.

To run MaltConverter you need a Java VM (tested with JRE 1.4.1).
Usage: java -jar MaltConverter.jar <conversion> <mapfile> <infile> <outfile>

Parameter    Description
conversion   Specifies the conversion (e.g. malt2tiger). See table below for available conversions.
mapfile      Path to the XML document which describes the mapping of attribute names and tagsets. See example below.
infile       The path to the source file which will be converted.
outfile      The path to the destination file where the output will be saved.

The table below lists the available conversions. In the table, malt stands for Malt-XML and tab for Malt-TAB. Note that it is possible to convert from Malt-XML to Malt-XML with malt2malt, which allows mapping of tagsets and removal of attributes. When converting to Malt-TAB, tagset files will be created which can be used by MaltParser.

Parameter    From        To
tiger2malt   TIGER-XML   Malt-XML
tiger2tab    TIGER-XML   Malt-TAB
malt2tiger   Malt-XML    TIGER-XML
malt2tab     Malt-XML    Malt-TAB
malt2malt    Malt-XML    Malt-XML
tab2malt     Malt-TAB    Malt-XML
tab2tiger    Malt-TAB    TIGER-XML
tab2tab      Malt-TAB    Malt-TAB
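When MaltConverter is called from a script, it can help to check the conversion name against the table above before starting Java. The Python sketch below only assembles the argument list from the usage line; the helper name build_command and the file names are hypothetical, and the jar is assumed to be in the current directory.

```python
# Conversions listed in the table above (note that tiger2tiger is not among them).
CONVERSIONS = {
    "tiger2malt", "tiger2tab",
    "malt2tiger", "malt2tab", "malt2malt",
    "tab2malt", "tab2tiger", "tab2tab",
}


def build_command(conversion, mapfile, infile, outfile, jar="MaltConverter.jar"):
    """Return the argument list for: java -jar MaltConverter.jar <conversion> <mapfile> <infile> <outfile>."""
    if conversion not in CONVERSIONS:
        raise ValueError(f"unknown conversion: {conversion}")
    return ["java", "-jar", jar, conversion, mapfile, infile, outfile]


# Hypothetical file names, for illustration only.
command = build_command("malt2tiger", "map.xml", "in.xml", "out.xml")
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) to perform the conversion.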

A mapping file must be specified. A mapping file can be represented as follows:

<?xml version="1.0" encoding="ISO-8859-1"?>
<mapping id="Talbanken">
  <annotation>
    <feature from="form" to="form"/>
    <feature from="postag" to="postag">
      <value from="ab"/>
      <value from="ab.kom"/>
      <value from="ab.pos"/>
      ...
      <value from="vb.sup.akt.mod"/>
      <value from="vb.sup.sfo"/>
    </feature>
    <feature from="head" to=""/>
    <feature from="deprel" to="edgelabel">
      <value from="ROOT" to="--"/>
      <value from="ADV"/>
      ...
      <value from="XX"/>
    </feature>
  </annotation>
</mapping>

The mapping file consists of a sequence of features (or attributes), each of which maps a source attribute name (from) to a target attribute name (to). If the attribute name should stay the same, the to value should be identical to the from value. If the to value is the empty string, the attribute will be suppressed in the output.

For tagset values, the identity map can be achieved by simply excluding the to attribute (but the empty string is not allowed as a value of the to attribute).
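The mapping rules can be illustrated with a short Python sketch that loads a mapping file and applies it to the attributes of one word. It is not MaltConverter's implementation: the function names load_mapping and apply_mapping are hypothetical, every feature is assumed to carry a to attribute, and the handling of attributes or values that the mapping does not list (dropped and passed through unchanged, respectively) is an assumption, since the text above does not specify it.

```python
import xml.etree.ElementTree as ET

# Small mapping in the format shown above (illustrative values only).
MAPPING = """<mapping id="example">
  <annotation>
    <feature from="form" to="form"/>
    <feature from="postag" to="postag">
      <value from="pp"/>
    </feature>
    <feature from="head" to=""/>
    <feature from="deprel" to="edgelabel">
      <value from="ROOT" to="--"/>
      <value from="ADV"/>
    </feature>
  </annotation>
</mapping>"""


def load_mapping(xml_text):
    """Return {source attribute: (target attribute, {source value: target value})}."""
    features = {}
    for feature in ET.fromstring(xml_text).iter("feature"):
        # A <value> without a to attribute is the identity map for that value.
        values = {v.get("from"): v.get("to", v.get("from")) for v in feature.iter("value")}
        features[feature.get("from")] = (feature.get("to"), values)
    return features


def apply_mapping(features, word):
    """Rename attributes and map tagset values for one word given as a dict."""
    out = {}
    for name, value in word.items():
        if name not in features:
            continue  # assumption: attributes not listed in the mapping are dropped
        new_name, values = features[name]
        if new_name == "":
            continue  # an empty to value suppresses the attribute
        # Assumption: values not listed in the mapping pass through unchanged.
        out[new_name] = values.get(value, value)
    return out


features = load_mapping(MAPPING)
mapped = apply_mapping(features, {"form": "Genom", "postag": "pp", "head": "3", "deprel": "ADV"})
mapped_root = apply_mapping(features, {"form": "införs", "postag": "vb.prs.sfo", "head": "0", "deprel": "ROOT"})
```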