COSC 7336 Advanced Natural Language Processing

Spring 2016

Roy G. Cullen (C) 106

Instructor: Arjun Mukherjee


Overview

This is an advanced course on Natural Language Processing (NLP) and applied NLP in Web mining. The course is intended to develop advanced skills in NLP and in Web data and text mining through NLP applications. The broader goal is twofold: (1) gain a thorough understanding of statistical NLP techniques (e.g., latent variable models, graphical models in NLP) and learn to build tools for solving practical text mining problems; (2) explore recent papers in the field through presentations, talks, critiques, and defenses. Throughout the course, strong emphasis will be placed on tying NLP techniques to specific real-world applications through hands-on experience. The course covers fundamental topics in statistical machine learning and touches upon various topics in NLP for the Web.


Administrative details

Office hours

Instructor office hours: M 2-4 pm

Prerequisites

The course requires a solid background in mathematics and sufficient programming skill. Having taken and done well in one or more equivalent courses/topics such as Algorithms, Data Mining, Machine Learning, or Natural Language Processing, or having a good background in probability/statistics, will be helpful. The course, however, reviews and covers the required mathematical and statistical foundations. Sufficient experience building projects in a high-level programming language (e.g., Java) is required.

Required reference materials:

Online resources (OR) per topic as appearing in the schedule below.
Course Materials including books and lecture notes

Grading

Component             Contribution
Project               25%
Paper Presentations   55%
Critique              15%
Class Participation   5%


Rules and policies

Late Assignments: Late assignments will not, in general, be accepted. They will never be accepted unless the student has made special arrangements with me at least one day before the assignment is due, and there must be a justifiable reason owing to extenuating circumstances. If a late assignment is accepted, it is subject to a reduction in score as a late penalty.
Cheating: All submitted work (code, homework, exams, etc.) must be your own. If evidence of code sharing is found, there will be consequences affecting your grade in the course. Please refer to the student handbook for details on academic honesty.
Statute of limitations: Grading questions or complaints will, in general, not be addressed more than one week after the item in question has been returned.


Paper Reading Assignments/Project due dates

Assignments Due date
Project 4/18
Paper: Domain Adaptation with Structural Correspondence Learning [Blitzer et al., 2006] Presenter/Defender: Fan Critique: Yifan Next regular meeting
Paper: Distance Metric Learning for Large Margin Nearest Neighbor Classification [Weinberger et al., 2006] Presenter/Defender: Marjan Critique: Huijie Next regular meeting
Paper: One-Class SVMs for Document Classification [Manevitz et al., 2001] Presenter/Defender: Santosh Critique: Marjan Next regular meeting
Paper: Hinge Loss Markov Random Fields [Bach et al., 2013] Presenter/Defender: Dainis Critique: Huijie Next regular meeting
Paper: AFRAID: Fraud Detection via Active Inference in Time-evolving Social Networks [Vlasselaer et al., 2015] Presenter/Defender: Huijie Critique: Santosh Next regular meeting
Paper: Efficient Estimation of Word Representations in Vector Space [Mikolov et al., 2013]. Ref. for background: [1], [2] Presenter/Defender: Yifan Critique: Dainis Next regular meeting
Paper: Learning Latent Representations for Domain Adaptation using Supervised Word Clustering [Xiao et al., 2013] Presenter/Defender: Fan Critique: Santosh Next regular meeting
Paper: Co-Training for Domain Adaptation [Chen et al., 2011] Presenter/Defender: Marjan Critique: Fan Next regular meeting
Paper: Supervised Random Walks [Backstrom and Leskovec, 2011] Presenter/Defender: Santosh Critique: Huijie Next regular meeting
Paper: Distributed Representations of Words and Phrases and their Compositionality [Mikolov et al., 2013] Presenter/Defender: Dainis Critique: Yifan Next regular meeting
Paper: Understanding and Combating Link Farming in the Twitter Social Network [Ghosh et al., 2012] Presenter/Defender: Huijie Critique: Santosh Next regular meeting
Paper: DeepWalk: Online Learning of Social Representations [Perozzi et al., 2014] Presenter/Defender: Yifan Critique: Santosh Next regular meeting
Paper: Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach [Glorot et al., 2011] Presenter/Defender: Fan Critique: Marjan Next regular meeting
Paper: CRF Autoencoders for Unsupervised Structured Prediction [Ammar et al., 2014] Presenter/Defender: Yifan Critique: Fan Next regular meeting
Paper: Authorship Verification as a One-Class Classification Problem [Koppel and Schler, 2004] Presenter/Defender: Marjan Critique: Dainis Next regular meeting
Paper: Unsupervised Cross-Domain Word Representation Learning [Bollegala et al., 2015] Presenter/Defender: Dainis Critique: Fan Next regular meeting
Paper: Deep Semantic Frame-based Deceptive Opinion Spam Analysis [Kim et al., 2015] Presenter/Defender: Huijie Critique: Marjan Next regular meeting
Paper: Collective Opinion Spam Detection [Rayana and Akoglu, 2015] Presenter/Defender: Santosh Critique: Dainis Next regular meeting
Paper: Learning with Marginalized Corrupted Features [Maaten et al. 2013] Presenter/Defender: Fan Critique: Marjan Next regular meeting
Paper: Bidirectional LSTM-CRF Models for Sequence Tagging [Huang et al., 2015] Ref. for background: [1] Presenter/Defender: Dainis Critique: Yifan Next regular meeting
Paper: Joint Modeling of Opinion Expression Extraction and Attribute Classification [Yang and Cardie 2014] Presenter/Defender: Yifan Critique: Dainis Next regular meeting
Paper: Frustratingly Easy Domain Adaptation [Daume III, 2007] Presenter/Defender: Marjan Critique: Dainis Next regular meeting
Paper: From Word Embeddings to Document Distances [Kusner et al., 2015] Presenter/Defender: Santosh Critique: Fan Next regular meeting
Paper: BIRDNEST: Bayesian INference for Review Rating Fraud [Hooi et al., 2015] Presenter/Defender: Huijie Critique: Santosh Next regular meeting


Schedule of topics

Please note that the following is a tentative list of topics. During the course, interleaved with the lectures, time will be devoted to review questions, homework solutions, discussion of novel ideas, paper critiques, and concept review.

Topic(s) Resources: Readings, Slides, Lecture notes, Papers, Pointers to useful materials, etc.
Brief Introduction to NLP
Course administrivia, semester plan, course goals
NLP Resources
Language as a probabilistic phenomenon
Word collocations, NLP and text retrieval basics
Text categorization
Introduction to topics to be covered in the course
Required readings:
Lecture notes/slides
Chapter 1 FSNLP (Sections 1.2.3, 1.4, 1.4.1, 1.4.2, 1.4.3, 1.4.4)
Boolean retrieval slides by H.Schutze
Boolean retrieval [Manning et al., 2008] (up to Section 1.4)
F. Keller's tutorial on Naive Bayes + A. Moore's notes for the graph view (Slide 8)
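To complement the Naive Bayes and text categorization material above, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is illustrative only: the class and method names are hypothetical and it is not tied to the course toolkits or readings.

import java.util.*;

// Minimal multinomial Naive Bayes for text categorization (illustrative sketch).
// Class and method names are hypothetical; no course toolkit is assumed.
public class NaiveBayesSketch {
    private final Map<String, Integer> docCount = new HashMap<>();               // docs per class
    private final Map<String, Map<String, Integer>> wordCount = new HashMap<>();  // word counts per class
    private final Map<String, Integer> totalWords = new HashMap<>();              // total tokens per class
    private final Set<String> vocab = new HashSet<>();
    private int totalDocs = 0;

    // Add one labeled, already-tokenized document to the training counts.
    public void train(String label, List<String> tokens) {
        totalDocs++;
        docCount.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCount.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : tokens) {
            counts.merge(w, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocab.add(w);
        }
    }

    // Return the class with the highest log posterior: log P(c) + sum_w log P(w|c).
    public String classify(List<String> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCount.keySet()) {
            double score = Math.log(docCount.get(label) / (double) totalDocs);   // log prior
            Map<String, Integer> counts = wordCount.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String w : tokens) {
                int c = counts.getOrDefault(w, 0);
                // add-one (Laplace) smoothing over the vocabulary
                score += Math.log((c + 1.0) / (total + vocab.size()));
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("pos", Arrays.asList("great", "movie", "loved", "it"));
        nb.train("neg", Arrays.asList("boring", "plot", "hated", "it"));
        System.out.println(nb.classify(Arrays.asList("loved", "the", "movie")));  // prints "pos"
    }
}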
Statistical foundations I: Basics
Probability theory
Conditional probability and independence
Required Readings:
Lecture notes/slides
Chapter 2 FSNLP (Section 2.1.1 - 2.1.10), Chapter 1 SI (Selected topics covered in class and solved examples)
OR01: X.Zhu's notes on mathematical background for NLP
Slides:
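Worked example: as a quick illustration of conditional probability and Bayes' rule from this unit (all numbers below are invented for the example), suppose 20% of emails are spam, the word "free" appears in 40% of spam emails and in 5% of non-spam emails. Then

\[
P(\mathrm{spam} \mid \mathrm{free})
  = \frac{P(\mathrm{free} \mid \mathrm{spam})\,P(\mathrm{spam})}
         {P(\mathrm{free} \mid \mathrm{spam})\,P(\mathrm{spam}) + P(\mathrm{free} \mid \neg\mathrm{spam})\,P(\neg\mathrm{spam})}
  = \frac{0.4 \times 0.2}{0.4 \times 0.2 + 0.05 \times 0.8} \approx 0.67 .
\]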
Statistical foundations II: Random Variables and Distributions
Random variables, density and mass functions
Mean, Variance
Common families of distributions
Multiple random variables: joints and marginals
Required Readings:
Lecture notes/slides
Chapter 2 SI (Theorem 2.1.10, 2.2, 2.2.1, 2.2.2, 2.2.3, 2.2.5, 2.3.1, 2.3.2, 2.3.4, and topics covered in class).
Chapter 3 SI (all sections + worked-out examples up to 3.4); focus on the distributions/problems covered in class and skip other topics.
Chapter 4 SI (4.1, 4.1.1, 4.1.2, 4.1.3, 4.1.4, 4.1.5, 4.1.6, 4.1.10, 4.1.11, 4.1.12, 4.2.1, 4.2.2, 4.2.3, 4.2.4, 4.2.5).
OR02: K.Zhang's notes on common families of distributions with worked-out examples [skip the hypergeometric and negative binomial distributions and focus on the ones covered in class].

Optional Recommended reading/solved examples:
OR03: Notes on Joint, marginals, worked out examples by S.Fan
OR04: Tutorial on joints and marginals by M.Osborne [Contains NLP specific examples]
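Worked example: as a small illustration of joints and marginals covered above (the joint table is invented for illustration), let X and Y be binary with P(X=0,Y=0)=0.3, P(X=0,Y=1)=0.2, P(X=1,Y=0)=0.1, P(X=1,Y=1)=0.4. Then

\[
\begin{aligned}
P(X=1) &= \sum_{y} P(X=1, Y=y) = 0.1 + 0.4 = 0.5, \\
P(Y=1) &= 0.2 + 0.4 = 0.6, \\
P(Y=1 \mid X=1) &= \frac{P(X=1, Y=1)}{P(X=1)} = \frac{0.4}{0.5} = 0.8 ,
\end{aligned}
\]

and since P(X=1)\,P(Y=1) = 0.30 \neq 0.40 = P(X=1, Y=1), X and Y are not independent.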
Hierarchical models and mixture distributions
Parameter estimation: MLE vs MAP
Prior, posterior, conjugate priors
Binomial-Poisson hierarchy
Beta-Binomial hierarchy
Required Readings:
Lecture notes/slides + Chapter 4 SI (4.4, 4.4.1, 4.4.2, 4.4.5 - 4.4.6)
OR05: P. Robinson's notes on parameter estimation [Slides 1-35]

Optional reference:
OR06: Notes on conjugate models by P. Lam [Slides 1-49]
Conjugate priors for common families of distributions
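For reference alongside the Beta-Binomial hierarchy and MLE vs. MAP topics above, the standard conjugacy identity is: with k successes observed in n Bernoulli trials and a Beta(\alpha, \beta) prior on the success probability \theta,

\[
p(\theta \mid k) \;\propto\; \theta^{k}(1-\theta)^{\,n-k}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}
  \;=\; \theta^{\,k+\alpha-1}(1-\theta)^{\,n-k+\beta-1},
\qquad \theta \mid k \;\sim\; \mathrm{Beta}(k+\alpha,\; n-k+\beta).
\]

Hence \hat{\theta}_{\mathrm{MLE}} = k/n while \hat{\theta}_{\mathrm{MAP}} = (k+\alpha-1)/(n+\alpha+\beta-2); with the uniform prior \alpha = \beta = 1 the two coincide.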
Text Clustering: Semantic Clustering and Topic Models
Latent semantics and clustering problem
Introduction to Bayes nets and PGMs
Latent Dirichlet Allocation
Learning and evaluating Topic Models

Required Readings:
Lecture notes + Stat review: Sampling from distributions (previous slides/lecture notes)
OR13: Tutorial by D.Blei (Slides 1-17)
OR14: Gibbs sampling tutorial by M.Bahadori (Slides 1, 3-5, 7, 16-20, 22)
Gibbs sampler derivation for Latent Dirichlet Allocation.
Comprehensive explanation/derivation of LDA by Gregor Heinrich
LDA Gibbs sampler implementation [Java/Eclipse project]

Programming resources, tools, libraries for projects and homeworks:
Mallet, LingPipe
Java Topic Modeling Toolkit [with implementation of LabeledLDA]
Matlab Topic Modeling Toolkit [with implementation of Author-Topic model]
[Implementation of advanced models]
G.Heinrich's LDA and statistics base classes for sampling based algorithms in Java
Supervised Topic Models

Optional recommended reading for research/projects:
Understanding Gibbs sampling with derivation for the Naive Bayes model (unsupervised) [Resnik and Hardisty, 2010]
D.Blei's tutorial on Dirichlet priors (Slides 32-39)
LDA Gibbs Sampler derivation (Chapter 2) by Yi Wang
Author topic model [Rosen-Zvi et al., 2004]; Derivation and details.
Applications of topic models (NIPS Workshop)
Topic coherence metric for evaluating topic models [Mimno et al., 2011]
Generic Gibbs sampling for Topic Models by G. Heinrich
Supervised Topic Models
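The following sketch illustrates one sweep of a collapsed Gibbs sampler for LDA with symmetric priors, i.e., resampling each topic assignment with probability proportional to (n_{d,k} + \alpha)\,(n_{k,w} + \beta)/(n_k + V\beta), as in the derivations referenced above. It is only a sketch: the field and method names are hypothetical, the count arrays are assumed to be initialized elsewhere, and this is not the referenced Java/Eclipse implementation.

import java.util.Random;

// Illustrative sketch of one collapsed Gibbs sampling sweep for LDA
// (symmetric priors alpha, beta). Names are hypothetical; the count arrays
// (ndk, nkw, nk, z, docs) are assumed to be built and maintained elsewhere.
public class LdaGibbsSweepSketch {
    int K, V;                 // number of topics, vocabulary size
    double alpha, beta;       // symmetric Dirichlet hyperparameters
    int[][] docs;             // docs[d][i] = word id of the i-th token in doc d
    int[][] z;                // z[d][i]    = current topic of that token
    int[][] ndk;              // ndk[d][k]  = tokens in doc d assigned to topic k
    int[][] nkw;              // nkw[k][w]  = tokens of word w assigned to topic k
    int[] nk;                 // nk[k]      = total tokens assigned to topic k
    Random rng = new Random();

    void sweep() {
        double[] p = new double[K];
        for (int d = 0; d < docs.length; d++) {
            for (int i = 0; i < docs[d].length; i++) {
                int w = docs[d][i], k = z[d][i];
                // remove the current assignment from the counts
                ndk[d][k]--; nkw[k][w]--; nk[k]--;
                // p(z = t | rest) proportional to (ndk + alpha) * (nkw + beta) / (nk + V*beta)
                double sum = 0.0;
                for (int t = 0; t < K; t++) {
                    p[t] = (ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta);
                    sum += p[t];
                }
                // draw the new topic from the unnormalized distribution p
                double u = rng.nextDouble() * sum;
                int newK = 0;
                for (double acc = p[0]; acc < u && newK < K - 1; acc += p[++newK]) { }
                z[d][i] = newK;
                // add the new assignment back into the counts
                ndk[d][newK]++; nkw[newK][w]++; nk[newK]++;
            }
        }
    }
}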
Sentiment Analysis and Psycholinguistics
Aspect extraction
Deception and opinion spam
Required Readings:
Lecture notes + slides + selected topics (covered in lectures) from Chapter 11, WDM
Programming resources, tools, libraries for projects and homeworks:
Pos/Neg Sentiment Lexicon, SentiWordNet, Deep learning for sentiment analysis

Optional topics/concepts useful for research/projects
Papers on opinion spam: [Mukherjee et al., 2013], [Mukherjee et al., 2012]
Papers on topic modeling: [Blei et al., 2003], [Resnik and Hardisty, 2010]
Aspect and Sentiment Model: [Jo and Oh, 2011], slides, Accompanying data and source code
Other relevant papers: [Lin and He, 2009], [Zhao et al., 2010], [Mukherjee and Liu, 2012]
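Finally, since sentiment lexicons (the Pos/Neg lexicon, SentiWordNet) appear as resources in this unit, a minimal lexicon-based polarity scorer is sketched below. The tiny word lists and the simple negation handling are invented placeholders, not the contents of any actual lexicon, and real systems would handle scope, intensity, and domain effects far more carefully.

import java.util.*;

// Minimal lexicon-based polarity scorer (illustrative sketch; the word lists
// below are placeholders, not an actual sentiment lexicon).
public class LexiconSentimentSketch {
    private static final Set<String> POSITIVE = new HashSet<>(
            Arrays.asList("good", "great", "excellent", "love", "wonderful"));
    private static final Set<String> NEGATIVE = new HashSet<>(
            Arrays.asList("bad", "poor", "terrible", "hate", "boring"));
    private static final Set<String> NEGATORS = new HashSet<>(
            Arrays.asList("not", "never", "no"));

    // Score = (#positive - #negative) tokens, flipping polarity after a negator.
    public static int score(String text) {
        int score = 0;
        boolean negated = false;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (NEGATORS.contains(token)) { negated = true; continue; }
            int s = POSITIVE.contains(token) ? 1 : NEGATIVE.contains(token) ? -1 : 0;
            score += negated ? -s : s;
            if (s != 0) negated = false;   // negation scope ends at the next sentiment word
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score("The plot was not bad and the acting was great"));  // prints 2
    }
}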