Distributional Semantics – A Practical Introduction

Stefan Evert

  • Area: LaCo
  • Level: I
  • Week: 1
  • Time: 14:00 – 15:30
  • Room: D1.02

News: slides/handout for day 2 now available with additional code examples

Abstract

Distributional semantic models (DSMs) – also known as “word space” or “distributional similarity” models – are based on the assumption that the meaning of a word can (at least to a certain extent) be inferred from its usage, i.e. its distribution in text. These models therefore build semantic representations of words or other linguistic units in the form of high-dimensional vector spaces, based on a statistical analysis of their distribution across documents, their collocational profiles, their syntactic dependency relations, and other contextual features. DSMs are a promising technique for solving the lexical acquisition bottleneck by unsupervised learning, and their distributed representation provides a cognitively plausible, robust and flexible architecture for the organisation and processing of semantic information.
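
To make the idea of a word vector concrete, here is a minimal sketch in base R. The co-occurrence counts are invented for illustration, and the helper cosine_sim is a hypothetical name, not part of the course software. Each row is the distributional vector of a target word, and cosine similarity compares the direction of two such vectors.

```r
# Toy co-occurrence counts: target words (rows) x context words (columns).
# All numbers are invented for illustration only.
M <- matrix(c(10, 8, 0, 1,
               9, 7, 1, 0,
               0, 1, 9, 8),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("cat", "dog", "car"),
                            c("pet", "feed", "drive", "road")))

# Cosine similarity between all pairs of row vectors (hypothetical helper)
cosine_sim <- function(M) {
  norms <- sqrt(rowSums(M^2))         # Euclidean length of each row vector
  (M %*% t(M)) / (norms %o% norms)    # dot products divided by the lengths
}

sim <- cosine_sim(M)
print(round(sim, 2))
```

With these toy counts, "cat" and "dog" share most of their contexts and therefore come out as more similar to each other than either is to "car".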

This course aims to equip participants with the background knowledge and skills needed to build different kinds of DSM representations and apply them to a wide range of tasks. It will

  • introduce the most common DSM architectures and their parameters, as well as prototypical applications;
  • equip participants with a basic knowledge of the mathematical techniques needed for the implementation of DSMs, in particular those of matrix algebra;
  • show how DSM vectors and distances can be applied to a wide range of practical tasks;
  • convey a better understanding of the high-dimensional vector spaces based on empirical studies, visualisation techniques and mathematical arguments; and
  • provide an overview of current research on DSMs, available software, evaluation tasks and future trends.
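
As a small illustration of the matrix algebra involved, the following sketch reduces a toy count matrix to a few latent dimensions with a singular value decomposition and then computes distances in the reduced space. It uses only base R (svd and dist) on random data, so it is an assumption-laden toy example and not necessarily how the course software performs the reduction.

```r
set.seed(42)

# Hypothetical count matrix: 6 target words x 5 context features (random data)
M <- matrix(rpois(30, lambda = 3), nrow = 6,
            dimnames = list(paste0("w", 1:6), paste0("c", 1:5)))

k <- 2                                      # number of latent dimensions to keep
dec <- svd(M)                               # M = U diag(d) t(V)
M_k <- dec$u[, 1:k] %*% diag(dec$d[1:k])    # word coordinates in the reduced space
rownames(M_k) <- rownames(M)

# Euclidean distances between the words in the reduced space
D <- as.matrix(dist(M_k))
print(round(D, 2))
```

Keeping only the first k singular dimensions retains the directions of largest variation in the original matrix, which is why word distances can still be meaningful after the reduction.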

The course follows a hands-on approach, putting a strong emphasis on practical exercises with the help of a user-friendly software package for R. Various pre-compiled DSMs and other data sets will be made available to the course participants. In contrast to other recent courses targeted at compositional semantics, it focuses on the underlying representations of individual words, as an essential basis not only for compositional vector operations but also for many other applications of DSMs.

The course is targeted both at participants who are new to the field and need a comprehensive overview of DSM techniques and applications, and at those who have already worked with DSMs and want to gain a deeper understanding of their parameters and mathematical underpinnings.

Slides

Note: WordPress doesn’t allow R script files, so the exercises and worked examples have been uploaded as plain text files. Please remove the extension .txt so they will be recognized by RStudio as R script files (.R).
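
If you have downloaded several files, the renaming can also be done from within R. The sketch below assumes that the downloaded files end in ".R.txt" and sit in a single folder; the function name strip_txt and the example folder are hypothetical.

```r
# Remove a trailing ".txt" from downloaded script files, e.g. "demo.R.txt" -> "demo.R".
# Assumption: the files are named "<something>.R.txt" and live in the folder `dir`.
strip_txt <- function(dir = ".") {
  old <- list.files(dir, pattern = "\\.R\\.txt$", full.names = TRUE,
                    ignore.case = TRUE)
  if (length(old) == 0) return(invisible(character(0)))
  new <- sub("\\.txt$", "", old, ignore.case = TRUE)
  ok <- file.rename(old, new)      # TRUE for every file that was renamed
  invisible(new[ok])               # paths of the renamed files
}

# Example call (the folder name is an assumption, adjust it to your download location):
# strip_txt("~/Downloads")
```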

Software & Resources

The course will include several hands-on exercises using the statistical software R together with the wordspace package. It is recommended that you bring your own laptop to the course. You might also want to pre-install the required software packages and data sets.

Step 1

Install R version 3.3.x and the RStudio GUI.

Step 2

Use the installer built into RStudio (or the standard R GUI) to install the following packages from the CRAN archive:

  • sparsesvd
  • iotools
  • tm
  • Rcpp (needed on Linux only)
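
If you prefer the R console to the graphical installer, the same packages can be installed with install.packages. This is a minimal sketch that assumes a working internet connection and a configured CRAN mirror; it only installs what is not yet present.

```r
# CRAN packages listed above; Rcpp is only added on Linux.
pkgs <- c("sparsesvd", "iotools", "tm")
if (Sys.info()[["sysname"]] == "Linux") pkgs <- c(pkgs, "Rcpp")

# Install only the packages that are missing from the local library
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)
```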

Step 3

Install the wordspace package itself, which is not available from CRAN yet and will have to be installed from a local package archive.  You can download the source package as well as binaries for Windows and Mac OS X here: http://www.collocations.de/data/#tutorial
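
A downloaded package archive can be installed from the R console by passing its path to install.packages with repos = NULL. The file name below is a placeholder, not the real archive name; substitute the file you actually downloaded. Installing the source package will likely require a working compiler toolchain on your machine.

```r
# Placeholder file name (hypothetical): replace it with the archive you downloaded.
pkg_file <- "wordspace_x.y-z.tar.gz"

# repos = NULL tells R to install from a local file instead of a repository
install.packages(pkg_file, repos = NULL, type = "source")

# Check that the package can be loaded
library(wordspace)
```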

During the course, you will be asked to install a second package with additional evaluation tasks (wordspaceEval) from a password-protected Web page.

Step 4

Download one or more pre-compiled distributional models from http://www.collocations.de/data/ (both the full model and the SVD dimensions).
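
Once a model file has been downloaded, it has to be read into your R session. The sketch below assumes that the download is an R data file (.rda) as written by save(); both the helper load_model and the file name in the example call are hypothetical.

```r
# Load all objects stored in an R data file and return them as a named list.
# Assumption: the model was saved with save() in .rda format.
load_model <- function(path) {
  env <- new.env()
  obj_names <- load(path, envir = env)   # load() returns the names of the restored objects
  mget(obj_names, envir = env)
}

# Example call (hypothetical file name):
# models <- load_model("my_downloaded_model.rda")
# names(models)    # shows which objects the file contained
```

Loading into a separate environment avoids silently overwriting objects of the same name in your workspace.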

Some further data files will be made available during the course.

Additional References