Incremental Speech and Language Processing for Interactive Systems

Timo Baumann and Arne Köhn

  • Area: LoCo
  • Level: A
  • Week: 1
  • Time: 14:00 – 15:30
  • Room: C3.06


Incremental processing, the processing of partial linguistic material as it happens, has become a highly relevant research area. It allows for faster system reactions (with processing times folded between modules), for more natural behaviour based on partial understanding (e. g. in interactive situations), and for interactions that are shaped collaboratively by the system and its interlocutor (e. g. using less rigid turn-taking schemes); it is also psycholinguistically appealing. We introduce a model of incrementality, in particular establishing the need to be able to re-analyze a component’s hypothesis about the state of affairs. We describe strategies for solving incremental processing tasks in speech and language using speech recognition, syntactic parsing, and speech synthesis as examples. We discuss how to evaluate the various aspects of incremental quality (such as timing, stability, and correctness) for each of the examples. We close with a discussion of what happens when individual processors are combined into partial or full systems.
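The need for re-analysis can be illustrated with a minimal sketch. It assumes that a component's hypothesis is a plain list of words and that changes between successive hypotheses are passed on as "add" and "revoke" edits; the function name `diff_hypotheses` and the toy hypotheses are hypothetical and are not taken from the course software.

```python
from typing import List, Tuple

Edit = Tuple[str, str]  # ("add" | "revoke", word)


def diff_hypotheses(previous: List[str], current: List[str]) -> List[Edit]:
    """Return the edits that turn `previous` into `current`.

    Assumption for this sketch: a hypothesis is a flat word sequence, and
    everything after the first mismatch is revoked and then re-added.
    """
    # Length of the prefix shared by both hypotheses.
    common = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        common += 1
    # Revoke outdated words (last word first), then add the new ones.
    edits: List[Edit] = [("revoke", w) for w in reversed(previous[common:])]
    edits += [("add", w) for w in current[common:]]
    return edits


# Toy sequence of partial hypotheses while "forty two" is being spoken.
hypotheses = [["for"], ["forty"], ["forty", "two"]]

previous: List[str] = []
edit_log: List[List[Edit]] = []
for hyp in hypotheses:
    edit_log.append(diff_hypotheses(previous, hyp))
    previous = hyp
```

In this toy run the second hypothesis revokes "for" before adding "forty": a consumer that had already acted on "for" needs to be able to revise its own state accordingly.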

Incremental processing has in some parts made the move from a “research challenge and opportunity” with “prospects for commercial use” (as identified in a survey reported in Williams 2009) to something that can be deployed and be beneficial for users, as shown in a comparison of Apple’s Siri, which does not do incremental speech recognition, with Google Voice, which does.

Beyond speeding up the presentation of results of voice search, incremental processing offers the potential for creating systems with much more natural behaviour with respect to turn-taking or the production and understanding of feedback utterances, and for taking action (Baumann et al. 2013) or expressing utterances based on partial understanding (Baumann and Schlangen 2013) in situated environments such as human-robot interaction. Finally, incrementality is also appealing psycholinguistically (Levelt 1989; Clark 1996). Implementing incremental processing, however, requires a reconceptualization of the architecture and all the components of a speech and language processing system (Baumann 2013).

Although incremental processing has made a move towards research applications, the required reconceptualization and rebuilding of processing models still represents a high “barrier of entry” for researchers interested in this processing paradigm, or in “just” building incremental applications, as compared to non-incremental processing. We hope to lower this barrier with our course.

By detailing multiple examples of incremental processing modes, we aim to give students a feeling for the work necessary to “incrementalize” their own (or others’) NLP components, along with strategies and tools for doing so, and to help them integrate these components into existing systems or build fully incremental systems, whether as a research goal of its own or as a tool for researching, e. g., responsive interaction.

After laying out the grounds of incremental processing (Guhe 2007; Schlangen and Skantze 2009; Schlangen and Skantze 2011) and its evaluation (Baumann, Buß, and Schlangen 2011), we will teach incrementality by describing three examples of incremental processing tasks:

  • Speech recognition is used as a simple example of incremental processing (the core algorithm is incremental as-is). We show how hypotheses evolve with more input becoming available, how output is unstable as a result of changing hypotheses, and we explain why more stable hypotheses can be achieved by adding delays. We further examine the timeliness/stability trade-off and show several methods that have been used to optimize this trade-off (Baumann, Atterer, and Schlangen 2009; Selfridge et al. 2011; McGraw and Gruenstein 2012).
  • Syntactic parsing features structured output (not just sequential output as in speech recognition). We will discuss how to produce such output given limited input and explain the need for prediction in order to guarantee the output of valid trees. We introduce different approaches to incremental parsing and explain how incremental gold standards can be derived and how incremental parsers can be evaluated (Demberg-Winterfors 2010; Beuck, Köhn, and Menzel 2013; Köhn and Menzel 2014). We highlight how structural prediction improves timeliness without hurting performance, and we further show that prediction and monotonicity cannot be achieved at the same time.
  • Speech synthesis is used as an example of how re-interpretation is limited once system output is being generated. We present a system that is able to cope (to some extent) with this limitation by flexibly and naturally adjusting its output with very little delay (Baumann and Schlangen 2012; Astrinaki et al. 2012; Baumann 2014). This system also serves as an example of more complex structured input into an incremental processor, with input chunks of various sizes (i. e. a mixed granularity of input).
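The timeliness/stability trade-off mentioned for speech recognition can be sketched with a toy simulation. The sketch assumes that hypotheses are plain word lists, that stability is measured simply by counting add and revoke operations, and that delay is realized by holding back the most recent words of each hypothesis; the hypothesis sequence and the names `count_edits` and `hold_back` are made up for illustration and do not reproduce the exact methods of the cited work.

```python
from typing import List

Hypothesis = List[str]


def count_edits(outputs: List[Hypothesis]) -> int:
    """Count add and revoke operations over a sequence of outputs."""
    previous: Hypothesis = []
    total = 0
    for current in outputs:
        # Length of the prefix shared with the previous output.
        common = 0
        for old, new in zip(previous, current):
            if old != new:
                break
            common += 1
        # Words after the shared prefix are revoked resp. added.
        total += (len(previous) - common) + (len(current) - common)
        previous = current
    return total


def hold_back(hypotheses: List[Hypothesis], delay: int) -> List[Hypothesis]:
    """Withhold the last `delay` words of every hypothesis.

    The final hypothesis is flushed in full at the end of the utterance,
    so that all words are eventually output (assumption of this sketch).
    """
    delayed = [hyp[:max(0, len(hyp) - delay)] for hyp in hypotheses]
    return delayed + [hypotheses[-1]]


# Made-up recognizer output, one hypothesis per time step.
hypotheses = [
    ["i"],
    ["i", "wreck"],
    ["i", "recognize"],
    ["i", "recognize", "peach"],
    ["i", "recognize", "speech"],
]

edits_without_delay = count_edits(hold_back(hypotheses, 0))
edits_with_delay = count_edits(hold_back(hypotheses, 1))
```

For this toy sequence, holding back one word reduces the edit count from 7 to 3, the minimum for a three-word result, at the price of each word becoming available one word later.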

We thoroughly compare the (largely complementary) example tasks and try to work out their uniting and differentiating factors so that students can generalize from these examples to their own future application domains. We will also discuss our experience with combining individual modules into complex incremental systems and, as a glimpse into future work, the prospect of developing fully integrated bi-directional rather than pipeline-based incremental systems.

We provide voluntary exercises to help students deepen and operationalize the concepts taught in the lectures. Our software implementations for the example domains are available as open source and are ready for students to use after taking our course.

A background in natural speech and language processing is helpful (e. g. on the level of Jurafsky and Martin 2009) and basic CS knowledge is beneficial but not a requirement for fruitful participation.


slides for day 1
slides for day 2
slides for day 3
slides for day 4
slides for day 5

Additional References

Hands-on material will in part be designed specifically for the course and in part be based on the InproTK programming tutorial (