COMP 790/590: Connecting Language to Vision and Robotics (Spring 2023)

Instructor: Mohit Bansal
Units: 3
Lectures: Wed 11:00am-1:30pm ET, Room SN-115
Office Hours: Wed 1:30pm-2:15pm ET (by appointment) (remote/zoom option)
Course Webpage:
Course Email: nlpcomp790unc -at-


This course will focus on the connections between natural language processing (NLP) and the related fields of computer vision and robotics. It will cover a wide variety of topics in multimodal NLP, such as image/video-based captioning, retrieval, QA, and dialogue; vision+language commonsense; query-based video summarization; text-to-image/video generation; robotic navigation and manipulation instruction execution and generation; and unified multimodal and embodied pretraining models (as well as ethics/bias/societal applications and issues).



Students not meeting these requirements must receive the explicit permission of the instructor to remain in this course.


Grading will (tentatively) consist of:

All submissions should be emailed to:

Reference Books

For a refresher on NLP concepts, see:

For multimodal NLP concepts, see the corresponding lectures of my previous classes, as well as the proceedings and talks of multimodal workshops at NLP, vision, and robotics conferences (e.g., RoboNLP, VQA Workshop, ALVR, LANTERN, CLVL)

Schedule (tentative)

Date    Topic    Readings    Discussion Leaders
Jan 11 Intro to the Course -- Mohit
Jan 18 Image-Text Alignment / Matching / Retrieval & Referring Expression Comprehension (1) CLIP: Learning Transferable Visual Models From Natural Language Supervision (Radford et al., ICML 2021);
(2) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (Li et al., ICML 2022);
(3) MAttNet: Modular Attention Network for Referring Expression Comprehension (Yu et al., CVPR 2018);
(4) Referring Transformer: A One-step Approach to Multi-task Visual Grounding (Li and Sigal, NeurIPS 2021);
Additional readings:
(a) Matching Words and Pictures (Barnard et al., JMLR 2003);
(b) Deep Visual-Semantic Alignments for Generating Image Descriptions (Karpathy and Fei-Fei, CVPR 2015);
(c) VSE++: Improving Visual-Semantic Embeddings with Hard Negatives (Faghri et al., BMVC 2018);
(d) Stacked Cross Attention for Image-Text Matching (Lee et al., ECCV 2018);
(e) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions (Yu et al., CVPR 2017);
(f) What are you talking about? Text-to-Image Coreference (Kong et al., CVPR 2014);
(g) Align before Fuse: Vision and Language Representation Learning with Momentum Distillation (Li et al., NeurIPS 2021);
Shoubin, Ziyang, Mohit
Jan 25 Referring Expression Generation & Image/Video Captioning (1) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions (Yu et al., CVPR 2017);
(2) Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning (Sharma et al., ACL 2018);
(3) Neural Baby Talk (Lu et al., CVPR 2018);
(4) SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning (Lin et al., CVPR 2022);
Additional readings:
(a) Deep Visual-Semantic Alignments for Generating Image Descriptions;
(b) Explain Images with Multimodal Recurrent Neural Networks;
(c) From captions to visual concepts and back;
(d) Learning a Recurrent Visual Representation for Image Caption Generation;
(e) Show, Attend and Tell: Neural Image Caption Generation with Visual Attention;
(f) Image Captioning with Semantic Attention;
(g) Self-critical Sequence Training for Image Captioning;
(h) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering;
(i) DenseCap: Fully Convolutional Localization Networks for Dense Captioning;
(j) Sequence to Sequence -- Video to Text;
(k) Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks;
(l) Reinforced Video Captioning with Entailment Rewards;
(m) MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning;
(n) VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research;
(o) ClipCap: CLIP Prefix for Image Captioning;
(p) Fine-grained Image Captioning with CLIP Reward;
Abhay, Archiki, Mohit
Feb 01 Image/Video Question Answering & Dialogue (1a) VQA v2: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering (Goyal et al., CVPR 2017);
(1b) GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering (Hudson and Manning, CVPR 2019);
(2) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering (Anderson et al., CVPR 2018);
(3) TVQA: Localized, Compositional Video Question Answering (Lei et al., EMNLP 2018);
(4) Visual Dialog (Das et al., CVPR 2017);
Additional readings:
(a) Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding;
(b) Bilinear Attention Networks;
(c) Deep Modular Co-Attention Networks for Visual Question Answering;
(d) MovieQA: Understanding Stories in Movies through Question-Answering;
(e) A Joint Sequence Fusion Model for Video Question Answering and Retrieval;
(f) Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning;
(g) Modality-Balanced Models for Visual Dialogue;
(h) An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA;
(i) NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions;
(j) Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling;
Vaidehi, Qin, Mohit
Feb 08 Query-based Video Moment Retrieval & Summary / Highlight / Saliency Prediction (1) Localizing Moments in Video with Natural Language (Hendricks et al., ICCV 2017);
(2) TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (Lei et al., ECCV 2020);
(3) QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries (Lei et al., NeurIPS 2021);
(4) TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency (Narasimhan et al., ECCV 2022);
Additional readings:
(a) TALL: Temporal Activity Localization via Language Query;
(b) Temporal Localization of Moments in Video Collections with Natural Language;
(c) Multi-task deep visual-semantic embedding for video thumbnail selection;
(d) mTVR: Multilingual Moment Retrieval in Videos;
(e) YouCook2-Retrieval: Towards Automatic Learning of Procedures from Web Instructional Videos;
(f) Ranking Domain-specific Highlights by Analyzing Edited Videos;
(g) TVSum: Summarizing Web Videos Using Titles;
(h) Query-Focused Video Summarization: Dataset, Evaluation, and a Memory Network Based Approach;
(i) CLIP-It! Language-Guided Video Summarization;
(j) TubeDETR: Spatio-Temporal Video Grounding with Transformers;
(k) UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection;
Mason, Yiyuan, Mohit
Feb 15 Vision+Language Commonsense, Reasoning
Feb 22 Text to Image & Image Sequence/Story/Video Generation
Mar 01 Vision+Language Pretraining Models (V+L, V+L as Text, V-->L, L-->V, Unified, DocAI)
Mar 08 Midterm Project Presentations
Mar 15 Spring break holiday
Mar 22 Executing NL instructions for navigation, articulation, manipulation, assembly, skill learning, etc.
Mar 29 Human-robot collaboration and dialogue: instruction generation, embodied Q&A, learning new subactions, etc.
Apr 05 Grounding and language learning/emergence via multi-agent dialogue-based and interactive games
Apr 12 Non-Verbal Human-Robot Interaction/Communication: Gestures, Turn-taking, Gaze
Apr 19 Multimodal Biases and Shortcuts; How to Write and Review Research Papers
Apr 26 Last Class: Final Project Presentations (tentative)


The professor reserves the right to make changes to the syllabus, including project due dates. These changes will be announced as early as possible.