Natural Language Processing of Microblogs

Tatjana Scheffler and Manfred Stede

  • Area: LaCo
  • Level: I
  • Week: 2
  • Time: 14:00 – 15:30
  • Room: D1.03


Social media has become an abundant data source for NLP applications, and its analysis has gained significance as a research field of its own. In this course we introduce how to work with microblogs for linguistic theorizing as well as for developing computational models and applications. Focussing on Twitter data, we show how to collect and store social media corpora. Social media language contains many non-standard features that necessitate preprocessing or the adaptation of standard NLP tools. The course will introduce common social media processing tasks with a focus on utilizing metadata like user info, geo-tagging, or time stamps. A special feature of much social media data is its interactional nature, which we will address in a session on discourse processing for Twitter. Finally, we will apply existing tools for working with social media in a practical mini-project implementing our own linguistically inspired Twitter bots.

Motivation & Description:

Social media such as Twitter, Facebook, blogs, chats, etc. are a rich source of user-generated data for natural language processing. On the one hand, there are many advantages to working with this kind of data: a large industrial and academic interest in analyzing and automatically processing user-generated content, an abundance of textual data available through public APIs, the ability to monitor newly emerging trends (social sensors), etc. On the other hand, natural language processing of social media texts faces many particular challenges (Baldwin, 2012). Often, state-of-the-art NLP applications cannot be applied directly to social media data, and even adapted tools show a large drop in evaluation scores.

Some of the potential challenges for computational linguists are:

  • Volume and speed of the data stream
  • Representativeness / collection of corpora
  • Variability of style and content
  • Conversational data (essentially, written-down spoken-like dialog data)

The course is aimed at students who want to start working with social media data, especially from Twitter. We will present the available tools, methods, and approaches for the entire pipeline of linguistic or computational linguistic research on social media data, hoping to enable students to start their own research projects.

We will introduce data formats and discuss how to collect one’s own corpus from Twitter, given available tools and APIs. One specific, under-researched field within social media NLP is work on non-English languages, and we show how to obtain data in other languages. In addition, we show methods for collecting and working with conversational corpora, which exhibit many interesting features for linguistic and computational analysis.
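As a rough illustration of working with conversational corpora, the sketch below groups already-collected tweets into reply threads. It assumes the tweets are stored one JSON object per line and use the field names of the classic Twitter REST API tweet object (id_str, text, lang, in_reply_to_status_id_str); the sample data and the helper names load_tweets and build_threads are made up for this example and are not part of the course materials.

```python
import json
from collections import defaultdict

# Hypothetical sample corpus: one JSON object per line.
SAMPLE_LINES = [
    '{"id_str": "1", "text": "Anyone at ESSLLI?", "lang": "en", "in_reply_to_status_id_str": null}',
    '{"id_str": "2", "text": "@a Yes, in week 2!", "lang": "en", "in_reply_to_status_id_str": "1"}',
    '{"id_str": "3", "text": "@b See you there", "lang": "en", "in_reply_to_status_id_str": "2"}',
    '{"id_str": "4", "text": "Guten Morgen", "lang": "de", "in_reply_to_status_id_str": null}',
]


def load_tweets(lines):
    """Parse JSON lines into a dict keyed by tweet ID, skipping blank lines."""
    tweets = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        tweets[tweet["id_str"]] = tweet
    return tweets


def build_threads(tweets):
    """Map each root tweet ID to the IDs of all collected tweets in its thread."""

    def find_root(tweet_id):
        seen = set()
        while True:
            parent = tweets[tweet_id].get("in_reply_to_status_id_str")
            # Stop at a non-reply, at a parent we did not collect,
            # or if malformed data would send us in a circle.
            if parent is None or parent not in tweets or parent in seen:
                return tweet_id
            seen.add(tweet_id)
            tweet_id = parent

    threads = defaultdict(list)
    for tweet_id in tweets:
        threads[find_root(tweet_id)].append(tweet_id)
    return dict(threads)


tweets = load_tweets(SAMPLE_LINES)
threads = build_threads(tweets)
german = [t["text"] for t in tweets.values() if t["lang"] == "de"]
```

Here threads maps tweet "1" to the three tweets of its conversation, and the lang field gives a simple way to select non-English material. A reply whose parent was never collected becomes the root of its own thread, which is one reason conversational corpora usually need a dedicated collection step.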

Language variability is a huge issue in social media texts, due to many contributing factors: variability of authors, topics, text genres, dialects, etc. (Eisenstein, 2015). Two common approaches to dealing with variability and non-standard language are an elaborate preprocessing step for normalization (Sidarenka et al., 2014), or adaptation of standard NLP tools to social media data. In most cases, both steps are probably needed.
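To make the idea of a normalization step concrete, here is a deliberately small sketch using regular expressions. It is not the procedure of Sidarenka et al. (2014); the placeholder tokens <URL> and <USER>, the function name normalize, and the choice of rules are assumptions made for illustration only.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
# Three or more repetitions of the same word character ("sooooo").
ELONGATION_RE = re.compile(r"(\w)\1{2,}")


def normalize(text):
    """Apply a few simple, rule-based normalizations to one tweet."""
    text = URL_RE.sub("<URL>", text)          # replace links by a placeholder
    text = MENTION_RE.sub("<USER>", text)     # replace @-mentions by a placeholder
    text = HASHTAG_RE.sub(r"\1", text)        # keep the hashtag word, drop '#'
    text = ELONGATION_RE.sub(r"\1\1", text)   # cap character runs at two
    return " ".join(text.split())             # collapse whitespace


example = normalize("@anna sooooo coool!!!   see https://t.co/abc #NLProc")
```

Runs are capped at two characters rather than one because many words legitimately contain double letters; mapping "soo" back to "so" would need a lexicon lookup, which is where more elaborate normalization pipelines come in.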

After discussing corpus collection and preprocessing, we introduce state-of-the-art approaches to common microblog processing applications, such as (sentiment) classification and (topic) clustering. We put special focus on how to work with the specific metadata that distinguishes microblogs from other textual data: user information, geographical information, and time stamps.
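As one small example of using metadata, the sketch below counts tweets per hour of day from their time stamps. It assumes the created_at string format of the classic Twitter REST API (for example "Wed Aug 27 13:08:45 +0000 2008"); the sample tweets and the function name tweets_per_hour are invented for this illustration.

```python
from collections import Counter
from datetime import datetime

# Assumed time stamp layout, e.g. "Wed Aug 27 13:08:45 +0000 2008".
# Note: %a and %b expect English day and month names, so this relies
# on an English or C locale.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Hypothetical sample tweets, reduced to the one field used here.
SAMPLE_TWEETS = [
    {"created_at": "Wed Aug 27 13:08:45 +0000 2008"},
    {"created_at": "Wed Aug 27 13:45:10 +0000 2008"},
    {"created_at": "Wed Aug 27 21:30:00 +0000 2008"},
]


def tweets_per_hour(tweets):
    """Count tweets by hour of day, in the time zone given in the time stamp."""
    counts = Counter()
    for tweet in tweets:
        timestamp = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
        counts[timestamp.hour] += 1
    return counts


hourly = tweets_per_hour(SAMPLE_TWEETS)
```

The same pattern (parse a metadata field, then aggregate with a Counter) carries over to user information or geographical information, for instance counting tweets per author or per place.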

Finally, we will present existing Python tools for working with microblogs (e.g., the Tweepy package). These tools and some scripts provided by the instructors enable us to implement our own Twitter bots with a few lines of code (Waite, 2014). Simple linguistically-informed Twitter bots could include a bot that returns a translation (or parse image) of an incoming tweet, identifies its language, etc.
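The text-producing core of such a bot can be very small. The sketch below covers only that core for a language-identifying bot: it guesses the language of an incoming tweet from tiny stopword lists and composes a reply. Receiving the tweet and posting the reply (for example through Tweepy) are deliberately left out, since they need API credentials; the stopword lists and the function names guess_language and compose_reply are assumptions for illustration, not the scripts provided by the instructors.

```python
import re

# Toy stopword lists; a real bot would use a proper language identifier.
STOPWORDS = {
    "English": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "German": {"der", "die", "das", "und", "ist", "nicht", "ich", "ein"},
    "French": {"le", "la", "les", "et", "est", "de", "je", "un"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        language: sum(token in words for token in tokens)
        for language, words in STOPWORDS.items()
    }
    # On a tie, max() keeps the language listed first in STOPWORDS.
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def compose_reply(screen_name, text):
    """Build the reply text the bot would post in answer to a tweet."""
    language = guess_language(text)
    if language is None:
        return f"@{screen_name} Sorry, I could not guess the language of your tweet."
    return f"@{screen_name} Your tweet looks like {language} to me."


reply = compose_reply("anna", "Das ist nicht die Antwort")
```

Keeping the linguistic logic in plain functions like these makes it easy to test a bot offline before connecting it to a live account.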

Tentative Outline:

  • Session 1: Collecting corpora, structure of the data, preprocessing
  • Session 2: Working with metadata: users, geo-information, time stamps
  • Session 3: Classification and clustering: sentiment analysis
  • Session 4: Discourse processing of social media text
  • Session 5: Linguistic phenomena in microblogs

Level & Prerequisites:

The course is meant as an introduction to methods and available tools for undergraduate or graduate students interested in working with social media. Basic knowledge of linguistics and computational linguistics suffices. Familiarity with Python (for the last session) is a plus, but not required.


Useful Links

  • Detecting automated Twitter accounts: BotOrNot

Building Twitter Bots

Collect your ideas here!

Quick and easy Twitter bots: Make your own @HydrateBot

Python corpus based Twitter bots: Creative Twitter bots

Bots we made at ESSLLI!

@xiejiabot: a bot that generates bot ideas

@gaebot: generates ESSLLI class names

@ESSLLIbot: generates ESSLLI class names

@BreakfastAndArt: generates new painting names that sound like restaurant dishes

A bot that debeautifies Slovenian songs

@Eugeneralissimo – a city name generator with “factual” information

@TheBotOfPuns: a bot that tweets a random joke from a previously generated list of homophone-based jokes

@WhosThere_bot – responds to “knock knock jokes” (when it’s not over its API quota)

@drbotson – generates new Sherlock Holmes story titles

@pictureeveryday: Chuck Norris jokes

@millueh – locate ESSLLI participants

A bot that generates novel (cooking) recipes