Improving Language Technology with Fortuitous Data

Željko Agić, Anders Johannsen, and Barbara Plank

  • Area: LaCo
  • Level: A
  • Week: 1
  • Time: 17:00 – 18:30
  • Room: D1.01

Abstract

Current successful approaches to natural language processing (NLP) are for the most part based on supervised learning. In turn, supervised learning critically depends on the availability of annotated data. Such data is generally not plentiful, as it requires time and expertise to develop annotated resources. This is the problem of data sparsity. At the same time, available annotated data is usually a sample of a particular domain or language. Thus, even if some annotated data is available, it is often not a clear fit for the problem at hand. This is the problem of data bias.

In this course, we present approaches that facilitate NLP development when supervision is sparse or even absent, and when the annotated samples of language data that do exist are biased. Using part-of-speech tagging and syntactic dependency parsing as running examples, we outline modern approaches to augmenting supervised techniques for top-level performance. The approaches include semi-supervised and unsupervised techniques, domain adaptation and cross-lingual learning. We place particular emphasis on leveraging the various sources of fortuitous data that may be available even in the most severely under-resourced domains of natural language. We argue that fortuitous data often provides the ‘secret sauce’ that makes approaches based on limited supervision work.

Outline

  • Day 1: Introduction
  • Day 2: Structured input and output
  • Day 3: Representation sharing and multi-task learning
  • Day 4: Fortuitous recipes + hands-on
  • Day 5: Cross-lingual learning

Slides

The course material can be found on the fortuitous data homepage.

Additional References

See the slides on the course material website.