Crowdsourcing Linguistic Datasets

Chris Biemann

  • Area: LaCo
  • Level: I
  • Week: 2
  • Time: 11:00 – 12:30
  • Room: C3.06


The course gives a thorough introduction to crowdsourcing as an instrument to quickly acquire linguistic datasets for training and evaluation purposes. Further, the course will provide step-by-step instructions on how to realize simple and complex crowdsourcing projects on Amazon Mechanical Turk and on CrowdFlower.
While crowdsourcing seems like a straightforward solution for linguistic annotation, the success of a crowdsourcing project depends critically on multiple dimensions.
In this course, emphasis is placed on understanding these dimensions by discussing practical experiences, so that participants can successfully use crowdsourcing for language-related research. This includes learning about demographics, platform mechanisms, schemes for ensuring data quality, best practices regarding the treatment of workers and, most of all, lessons learned from previous crowdsourcing projects as described in the literature and as conducted by the instructor.
The educational goal is to enable participants to set up crowdsourcing projects successfully and to avoid typical pitfalls.

The course is organized in 5 sessions of 90 minutes each.

  1. What is Crowdsourcing? History and demographics, definitions, elementary concepts, example projects.
  2. Crowdsourcing platforms, esp. Amazon MTurk and CrowdFlower. Technical possibilities, payment schemes, Do’s and Don’ts, schemes for ensuring quality.
  3. Successful design patterns for crowdsourcing projects for language tasks.
  4. Crowdsourcing projects for language tasks, lessons learned, including non-English tasks.
  5. Quality control mechanisms, ethical considerations, how to treat your crowdworkers, requester code of conduct, turker forums.

Short Bio

Chris is an assistant professor and head of the Language Technology group at TU Darmstadt in Germany. He received his Ph.D. from the University of Leipzig, and subsequently spent three years in industrial search engine research at Powerset and Microsoft Bing in San Francisco, California. He regularly publishes in journals and at top conferences in the field of Computational Linguistics.
His research is targeted towards self-learning structure from natural language text, specifically regarding semantic representations. Using big-data techniques, his group has built an open-source, scalable, language-independent framework for symbolic distributional semantics. To connect induced structures to tasks, Chris frequently uses crowdsourcing techniques for the acquisition of natural language semantics data.


LECTURE 1: What is Crowdsourcing? 1 Crowdsourcing_Aug2016_ESSLLII.pptx
History and demographics, definitions, elementary concepts, example projects

LECTURE 2: Crowdsourcing platforms 2 Crowdsourcing_Aug2016_ESSLLII.pptx
esp. Amazon MTurk and CrowdFlower. Technical possibilities, payment schemes, Do’s and Don’ts, schemes for ensuring quality

LECTURE 3: Successful design patterns 3 Crowdsourcing_Aug2016_ESSLLII.pptx
illustrated with some exemplary projects

LECTURE 4: Crowdsourcing projects for language tasks 4 Crowdsourcing_Aug2016_ESSLLII.pptx
a variety of projects, and lessons learned

LECTURE 5: Quality Control and Ethical considerations 5 Crowdsourcing_Aug2016_ESSLLII.pptx
quality control mechanisms, modelling the quality of individual workers automatically, how to treat your crowdworkers, requester code of conduct, crowdworker forums

