UPPSALA UNIVERSITET : Inst. för lingvistik och filologi : STP


Language Technology Project (5LN706), 7.5hp

Course Syllabus: 5LN706


(Preliminary) Schedule

The overall goal of this course is to independently carry out work related to a scientific research project. Students are advised to participate actively in the project and to interact with its members. No lectures are scheduled for this course, but the following meetings with the responsible teachers and relevant project supervisors are planned:

2012-03-13 10-12 9-2029 Introduction and Motivation
2012-03-29 15-16 9-2029 Progress meeting
2012-04-17 10-12 9-2029 Progress meeting
2012-05-08 10-12 9-2029 Progress meeting
2012-05-31 10-12 9-2029 Seminar with project presentations
2012-06-17 Deadline for project reports

Furthermore, there will be meetings with supervisors on a regular basis.

Intended Learning Outcomes

In order to pass the course, a student must be able to do the following in relation to an existing scientific research project:
  1. independently carry out work related to the goals of the overall project,
  2. independently and creatively identify and formulate research questions and issues related to the project;
    plan, carry out, and evaluate a chosen sub-project in a timely manner using adequate and sound methods,
    thus contributing to the scientific development of the project goals,
  3. give an overview of the research the project touches on, describe the current state of the art in this subject, and identify the issues that are most relevant for future developments (according to the research community),
  4. present and discuss the goals, contributions, and motivations of the project.

Examination and Grading Criteria

The course is examined by means of three assignments:
  1. Project report: A detailed scientific report describing the contributions to the project
  2. Popular science report: A report describing the outcome in a way that is understandable by a wider audience
  3. Presentation: A presentation describing the project, including a popular science introduction/overview

Project Proposals

Course projects this year will be related to the OPUS project. Individual projects should address one of the following three tasks:
  1. Identification and correction of OCR-related errors in OpenSubtitles
    • Tasks:
      • develop methods for identifying possible OCR-errors
      • develop methods for correcting errors
      • support various languages (completely language-independent)
    • Examples:
      confused characters (i, l, I):
              Alright, I'il count to three.
              I'il get you a new set.
              THE BALTlC STATES 1919/20
      missing token boundaries:
              Ijust call it believin' in myself
              I squeezedyou, and I heldyou
              Tincque qualificar aquests exàmens.
    • Challenges: Some misspellings are intentional:
              Jåg vill hå dig.
              Ni hår bådå boxåts i Philådelphiå.
              Ni kån reglernå. lngå lågå slåg.
      ... but not the "l" for "I" in "lngå" in:
              Se upp med huvudenå. lngå stångningår.
  2. Visualization and annotation of parallel treebanks
    • Task: develop a graphical tool for visualization and correction of aligned and syntactically annotated parallel corpora (sentence/word-aligned and dependency trees)
    • Challenges:
      • various formats (parse information, sentence alignment, word alignment)
      • graphical representation (a prototype exists)
      • user interface, management, etc.
  3. Mining parallel data from WikiSource
    • Task: Develop tools to mine parallel data from open content (WikiSource)
    • Challenges:
      • identify parallel documents
      • remove extra content (non-parallel parts)
      • convert and align (using existing tools)
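
As a rough illustration of task 1, the character confusions shown in the examples above could be caught with a few substitution rules. This is only a minimal sketch in Python, not a proposed solution: the RULES table and the suggest_corrections function are hypothetical names, the patterns are assumptions derived from the listed examples, and they are tied to Latin script, so they do not meet the language-independence goal.

```python
import re

# Hypothetical (pattern, replacement) rules, derived only from the
# examples listed under task 1. They are assumptions, not a tested method.
RULES = [
    # Lowercase "l" between two capitals, e.g. "BALTlC" -> "BALTIC".
    (re.compile(r"(?<=[A-Z])l(?=[A-Z])"), "I"),
    # "I'il" where "I'll" was almost certainly intended.
    (re.compile(r"\bI'il\b"), "I'll"),
    # Word-initial "ln" followed by a letter, e.g. "lngå" -> "Ingå".
    # Assumes that words rarely start with "ln".
    (re.compile(r"\bl(?=n[^\W\d_])"), "I"),
]


def suggest_corrections(line: str) -> str:
    """Apply every rule in order and return the suggested line."""
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line


if __name__ == "__main__":
    for sample in [
        "Alright, I'il count to three.",
        "THE BALTlC STATES 1919/20",
        "Se upp med huvudenå. lngå stångningår.",
        "Jåg vill hå dig.",
    ]:
        print(suggest_corrections(sample))
```

The rules touch only the suspected OCR confusions, so the intentional "å" spellings in the last two samples are left as they are. A real solution would more likely learn such confusions from data than list them by hand.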


