COMP 790/590: Connecting Language to Vision and Robotics (Spring 2023)

Instructor: Mohit Bansal
Units: 3
Lectures: Wed 11:00am-1:30pm ET, Room SN-115
Office Hours: Wed 1:30pm-2:15pm ET (by appointment) (remote/zoom option)
Course Webpage:
Course Email: nlpcomp790unc -at-


This course will focus on the connections between natural language processing (NLP) and the related fields of computer vision and robotics. It will cover a wide variety of topics in multimodal NLP, such as image/video-based captioning, retrieval, QA, and dialogue; vision+language commonsense; query-based video summarization; text-to-image/video generation; robotic navigation and manipulation instruction execution and generation; and unified multimodal and embodied pretraining models (as well as ethics/bias/societal applications and issues).



Students not meeting these requirements must receive the explicit permission of the instructor to remain in this course.


Grading will (tentatively) consist of:

All submissions should be emailed to:

Reference Books

For a refresher on NLP concepts, see:

For multimodal NLP concepts, see the corresponding lectures of my previous classes, as well as the proceedings and talks of multimodal workshops at NLP, vision, and robotics conferences (e.g., RoboNLP, VQA Workshop, ALVR, LANTERN, CLVL)

Schedule (tentative)

Date    Topic    Readings    Discussion Leaders
Jan 11 Intro to the Course -- Mohit
Jan 18 Image-Text Alignment / Matching / Retrieval & Referring Expression Comprehension (1) CLIP: Learning Transferable Visual Models From Natural Language Supervision (Radford et al., ICML 2021);
(2) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (Li et al., ICML 2022);
(3) MAttNet: Modular Attention Network for Referring Expression Comprehension (Yu et al., CVPR 2018);
(4) Referring Transformer: A One-step Approach to Multi-task Visual Grounding (Li and Sigal, NeurIPS 2021);
Additional readings:
(a) Matching Words and Pictures (Barnard et al., JMLR 2003);
(b) Deep Visual-Semantic Alignments for Generating Image Descriptions (Karpathy and Fei-Fei, CVPR 2015);
(c) VSE++: Improving Visual-Semantic Embeddings with Hard Negatives (Faghri et al., BMVC 2018);
(d) Stacked Cross Attention for Image-Text Matching (Lee et al., ECCV 2018);
(e) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions (Yu et al., CVPR 2017);
(f) What are you talking about? Text-to-Image Coreference (Kong et al., CVPR 2014);
(g) Align before Fuse: Vision and Language Representation Learning with Momentum Distillation (Li et al., NeurIPS 2021);
Shoubin, Ziyang, Mohit
Jan 25 Referring Expression Generation & Image/Video Captioning (1) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions (Yu et al., CVPR 2017);
(2) Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning (Sharma et al., ACL 2018);
(3) Neural Baby Talk (Lu et al., CVPR 2018);
(4) SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning (Lin et al., CVPR 2022);
Additional readings:
(a) Deep Visual-Semantic Alignments for Generating Image Descriptions;
(b) Explain Images with Multimodal Recurrent Neural Networks;
(c) From captions to visual concepts and back;
(d) Learning a Recurrent Visual Representation for Image Caption Generation;
(e) Show, Attend and Tell: Neural Image Caption Generation with Visual Attention;
(f) Image Captioning with Semantic Attention;
(g) Self-critical Sequence Training for Image Captioning;
(h) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering;
(i) DenseCap: Fully Convolutional Localization Networks for Dense Captioning;
(j) Sequence to Sequence -- Video to Text;
(k) Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks;
(l) Reinforced Video Captioning with Entailment Rewards;
(m) MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning;
(n) VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research;
(o) ClipCap: CLIP Prefix for Image Captioning;
(p) Fine-grained Image Captioning with CLIP Reward;
Abhay, Archiki, Mohit
Feb 01 Image/Video Question Answering & Dialogue (1a) VQA v2: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering (Goyal et al., CVPR 2017);
(1b) GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering (Hudson and Manning, CVPR 2019);
(2) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering (Anderson et al., CVPR 2018);
(3) TVQA: Localized, Compositional Video Question Answering (Lei et al., EMNLP 2018);
(4) Visual Dialog (Das et al., CVPR 2017);
Additional readings:
(a) Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding;
(b) Bilinear Attention Networks;
(c) Deep Modular Co-Attention Networks for Visual Question Answering;
(d) MovieQA: Understanding Stories in Movies through Question-Answering;
(e) A Joint Sequence Fusion Model for Video Question Answering and Retrieval;
(f) Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning;
(g) Modality-Balanced Models for Visual Dialogue;
(h) An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA;
(i) NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions;
(j) Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling;
Vaidehi, Qin, Mohit
Feb 08 Query-based Video Moment Retrieval & Summary / Highlight / Saliency Prediction (1) Localizing Moments in Video with Natural Language (Hendricks et al., ICCV 2017);
(2) TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (Lei et al., ECCV 2020);
(3) QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries (Lei et al., NeurIPS 2021);
(4) TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency (Narasimhan et al., ECCV 2022);
Additional readings:
(a) TALL: Temporal Activity Localization via Language Query;
(b) Temporal Localization of Moments in Video Collections with Natural Language;
(c) Multi-task deep visual-semantic embedding for video thumbnail selection;
(d) mTVR: Multilingual Moment Retrieval in Videos;
(e) YouCook2-Retrieval: Towards Automatic Learning of Procedures from Web Instructional Videos;
(f) Ranking Domain-specific Highlights by Analyzing Edited Videos;
(g) TVSum: Summarizing Web Videos Using Titles;
(h) Query-Focused Video Summarization: Dataset, Evaluation, and a Memory Network Based Approach;
(i) CLIP-It! Language-Guided Video Summarization;
(j) TubeDETR: Spatio-Temporal Video Grounding with Transformers;
(k) UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection;
Mason, Yiyuan, Mohit
Feb 15 Vision+Language Commonsense, Reasoning
Feb 22 Text to Image & Image Sequence/Story/Video Generation
Mar 01 Vision+Language Pretraining Models (V+L, V+L as Text, V-->L, L-->V, Unified, DocAI)
Mar 08 Midterm Project Presentations
Mar 15 Spring break holiday
Mar 22 Executing NL instructions for navigation, articulation, manipulation, assembly, skill learning, etc.
Mar 29 Human-robot collaboration and dialogue: instruction generation, embodied Q&A, learning new subactions, etc.
Apr 05 Grounding and language learning/emergence via multi-agent dialogue-based and interactive games
Apr 12 Non-Verbal Human-Robot Interaction/Communication: Gestures, Turn-taking, Gaze
Apr 19 Multimodal Biases and Shortcuts; How to Write and Review Research Papers
Apr 26 Last Class: Final Project Presentations (tentative)


The professor reserves the right to make changes to the syllabus, including project due dates. These changes will be announced as early as possible.