Learning from Data: A Foundational Course for Linguists

Malvina Nissim and Johannes Bjerva

  • Area: LaCo
  • Level: F
  • Week: 2
  • Time: 17:00 – 18:30
  • Room: D1.02

Abstract

This course is aimed at students who have a background in linguistics and can identify and formulate research questions on language-related problems from an application perspective, but who have no knowledge of machine learning approaches to language processing. I will assume no prior knowledge of statistical language processing. The course will balance theory and practice by covering conceptual as well as implementation aspects.

This is not a theoretical course on the mathematical aspects of learning, but rather a course aimed at equipping students with the practical ability to run basic machine learning experiments, building on an introductory theoretical background. Pointers will be given for those who want to expand on this in a more substantial way. During the lectures, I will introduce the basic concepts and procedures of machine learning (learning from examples, features, training and testing, supervised vs. unsupervised learning, generative vs. discriminative models, etc.), the main algorithms one can use, feature extraction and manipulation, basic concepts in evaluation and error analysis (bias vs. variance, overfitting, learning curves, etc.), and existing tools and platforms for easily running experiments. All of this will be illustrated through theory and practice, both in class and at home: small day-to-day practical assignments will be given so that the theory can be applied and understood right away.

By the end of the course, you are expected to be able to run machine learning experiments on a given (NLP) problem in practice. You will understand the key concepts and terminology of machine learning and its training and testing procedures, and be able to use existing tools that support machine learning experiments, such as Weka, NLTK, and scikit-learn. More specifically, in setting up an experiment for a given task, you will know that you have choices to make in how to represent the problem, how to implement features for learning, and which algorithm to pick, and you will be able to interpret the results critically by understanding the evaluation metrics as well as possible sources of error.
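To give a concrete flavour of the workflow described above, here is a minimal scikit-learn sketch of a supervised text-classification experiment: unigram count features, a training/testing split, a Naive Bayes classifier, and standard evaluation metrics. The tiny dataset is invented purely for illustration and is not course material.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

texts = ["a great movie", "really enjoyable film", "wonderful acting",
         "boring and slow", "terrible plot", "awful dialogue"]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

# Feature extraction: represent each text as a bag of unigram counts.
X = CountVectorizer(ngram_range=(1, 1)).fit_transform(texts)

# Training/testing split: hold out part of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=0)

# Train a Naive Bayes classifier and evaluate it on the held-out examples.
clf = MultinomialNB().fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

The same pattern (represent, split, train, evaluate) carries over to the course scripts and tools, only with real datasets and more realistic feature sets.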


Slides

LFD lecture1

LFD lecture2

LFD lecture3

LFD lecture4

LFD lecture5


Downloading material

Approach 1: Use git (updateable, recommended if you have git)

  1. In your terminal, type: 'git clone https://github.com/bjerva/esslli-learning-from-data-students.git'
  2. Followed by: 'cd esslli-learning-from-data-students'
  3. Whenever the code is updated, type: 'git pull'

Approach 2: Download a zip archive (static: you need to be told when a new version is up)

  1. Download the zip archive from:
    https://github.com/bjerva/esslli-learning-from-data-students/archive/master.zip
  2. Whenever the code is updated, download the archive again.

Running scripts

  1. Navigate to your 'esslli-learning-from-data-students' directory (using cd in the terminal)
  2. To extract features and train a model:

python run_experiment.py --csv data/trainset-sentiment-extra.csv --nwords 1 --algorithms nb

The command above would train a Naive Bayes classifier with unigram features on the trainset-sentiment-extra dataset; a rough sketch of what such a run amounts to is given below.
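The following is only a hypothetical sketch of what a run like this boils down to, written with scikit-learn. It is not the actual run_experiment.py implementation: the CSV layout (two columns, text then label), the load_csv helper, and the use of 5-fold cross-validation for scoring are all assumptions made for illustration.

import csv
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

def load_csv(path):
    # Assumed layout: each row holds a text and its label.
    texts, labels = [], []
    with open(path, encoding="utf-8") as f:
        for text, label in csv.reader(f):
            texts.append(text)
            labels.append(label)
    return texts, labels

texts, labels = load_csv("data/trainset-sentiment-extra.csv")

# --nwords 1  ->  unigram count features
X = CountVectorizer(ngram_range=(1, 1)).fit_transform(texts)

# --algorithms nb  ->  Naive Bayes; here scored with 5-fold cross-validation
scores = cross_val_score(MultinomialNB(), X, labels, cv=5)
print("accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))

Changing the --nwords or --algorithms flags corresponds, conceptually, to swapping in different feature representations or classifiers in a sketch like this.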


Additional References

With a focus on NLP:

  • Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press. Cambridge, MA. 1999. http://nlp.stanford.edu/fsnlp/
  • Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008. http://nlp.stanford.edu/IR-book/
  • James Pustejovsky and Amber Stubbs, Natural Language Annotation for Machine Learning, O’Reilly. 2012.
  • Steven Bird, Ewan Klein, and Edward Loper, Natural Language Processing with Python, O’Reilly. 2009. http://www.nltk.org
  • Hal Daumé III, A Course in Machine Learning. http://ciml.info (incomplete manuscript; parts available online for free.)

More generally on machine learning:

  • Tom Mitchell, Machine Learning, McGraw Hill. 1997.
  • Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, The Morgan Kaufmann Series in Data Management Systems. 2011.
  • Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin, Learning from Data, AMLBook. 2012.
  • Peter Flach, Machine Learning: The Art and Science of Algorithms that Make Sense of Data, Cambridge University Press. 2012.

More specific to scikit-learn (and ML with Python):

  • Luis Pedro Coelho and Willi Richert, Building Machine Learning Systems with Python, PACKT Publishing. 2013.
  • Raúl Garreta and Guillermo Moncecchi, Learning scikit-learn: Machine Learning in Python, PACKT Publishing. 2013.