COMP 790/590: Connecting Language to Vision and Robotics (Fall 2021)

Instructor: Mohit Bansal
Units: 3
Office: FB 244
Lectures: Wed 10:40am-1:10pm ET, Rm SN-011
Office Hours: Wed 1:10pm-2:10pm ET (by appointment) (remote/zoom option)
Course Webpage:
Course Email: nlpcomp790unc -at-



This course will focus on the connections between natural language processing (NLP) and the fields of computer vision and robotics. It will cover a wide variety of topics in multimodal NLP, such as image/video-based captioning, retrieval, QA, and dialogue; vision+language commonsense; query-based video summarization; robotic navigation and manipulation instruction execution and generation; and multimodal and embodied pretraining models (as well as their ethics/bias/societal applications and issues).

Topics (tentative)


Students not meeting these requirements must receive the explicit permission of the instructor to remain in this course.

Grading (tentative)

Grading will (tentatively) consist of:

All submissions should be emailed to:

Reference Books

For NLP concepts refresher, see:

For multimodal NLP concepts, see corresponding lectures of my previous classes and proceedings+talks of Multimodal Workshops at NLP, Vision, and Robotics conferences (e.g., RoboNLP, VQA Workshop, ALVR, LANTERN, etc.)

Schedule (tentative; slides coming soon)

Date | Topic | Readings | Discussion Leaders | To-do's
Aug 18 | Intro to the Course | -- | Mohit | --
Aug 25 | Image-Text Alignment / Matching / Retrieval & Referring Expression Comprehension
(1) Deep Visual-Semantic Alignments for Generating Image Descriptions (Karpathy and Fei-Fei, CVPR 2015);
(2) CLIP: Learning Transferable Visual Models From Natural Language Supervision (Radford et al., ICML 2021);
(3) MAttNet: Modular Attention Network for Referring Expression Comprehension (Yu et al., CVPR 2018);
Additional readings:
(a) Matching Words and Pictures (Barnard et al., JMLR 2003);
(b) VSE++: Improving Visual-Semantic Embeddings with Hard Negatives (Faghri et al., BMVC 2018);
(c) Stacked Cross Attention for Image-Text Matching (Lee et al., ECCV 2018);
(d) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions;
(e) What are you talking about? Text-to-Image Coreference (Kong et al., CVPR 2014);
Discussion leaders: Jaemin, Mohit | To-do: Instructions emailed for presenter (and everyone else as a reviewer)
Sep 01 | Referring Expression Generation & Image/Video Captioning
(1) Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (Xu et al., ICML 2015);
(2) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions (Yu et al., CVPR 2017);
(3) Sequence to Sequence -- Video to Text (Venugopalan et al., ICCV 2015);
(4) Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks (Yu et al., CVPR 2016);
Additional readings:
(a) Deep Visual-Semantic Alignments for Generating Image Descriptions;
(b) Explain Images with Multimodal Recurrent Neural Networks;
(c) From captions to visual concepts and back;
(d) Learning a Recurrent Visual Representation for Image Caption Generation;
(e) Image Captioning with Semantic Attention;
(f) Self-critical Sequence Training for Image Captioning;
(g) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering;
(h) DenseCap: Fully Convolutional Localization Networks for Dense Captioning;
(i) Reinforced Video Captioning with Entailment Rewards;
(j) MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning;
Discussion leaders: Rakesh, Muqeeth, Mohit | To-do: Instructions emailed for presenter (and everyone else as a reviewer)
Sep 08 | Image/Video Question Answering & Dialogue
(1) VQA v2: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering (Goyal et al., CVPR 2017);
(2) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering (Anderson et al., CVPR 2018);
(3) TVQA: Localized, Compositional Video Question Answering (Lei et al., EMNLP 2018);
(4) Visual Dialog (Das et al., CVPR 2017);
Additional readings:
(a) Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding;
(b) Bilinear Attention Networks;
(c) Deep Modular Co-Attention Networks for Visual Question Answering;
(d) MovieQA: Understanding Stories in Movies through Question-Answering;
(e) A Joint Sequence Fusion Model for Video Question Answering and Retrieval;
(f) Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning;
Discussion leaders: Mohaiminul, Zhuofan, Mohit | To-do: Instructions emailed for presenter (and everyone else as a reviewer)
Sep 15 | Query-based Video Moment Retrieval & Summary / Highlight / Saliency Prediction
(1) Localizing Moments in Video with Natural Language (Hendricks et al., ICCV 2017);
(2) TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (Lei et al., ECCV 2020);
(3) Query-Focused Video Summarization: Dataset, Evaluation, and a Memory Network Based Approach (Sharghi et al., CVPR 2017);
(4) QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries (Lei et al., arXiv 2021);
Additional readings:
(a) TALL: Temporal Activity Localization via Language Query;
(b) Temporal Localization of Moments in Video Collections with Natural Language;
(c) Multi-task deep visual-semantic embedding for video thumbnail selection;
(d) mTVR: Multilingual Moment Retrieval in Videos;
Discussion leaders: Haikang, David, Mohit | To-do: Instructions emailed for presenter (and everyone else as a reviewer)
Sep 22 | Vision+Language Commonsense, Reasoning; Project Brainstorming+Feedback
(1) Predicting Motivations of Actions by Leveraging Text (CVPR 2016);
(2) VCR: From Recognition to Cognition: Visual Commonsense Reasoning (CVPR 2019);
(3) VisualCOMET: Reasoning about the Dynamic Context of a Still Image (ECCV 2020);
(4) VLEP: What is More Likely to Happen Next? Video-and-Language Future Event Prediction (EMNLP 2020);
Additional readings:
(a) Visual Persuasion: Inferring Communicative Intents of Images;
(b) Don’t Just Listen, Use Your Imagination: Leveraging Visual Common Sense for Non-Visual Tasks;
(c) Learning Common Sense Through Visual Abstraction;
(d) Anticipating Visual Representations from Unlabeled Video;
(e) "What Happens If..." Learning to Predict the Effect of Forces in Images;
(f) Grounding Visual Explanations;
Discussion leaders: Eva, Blaine, Mohit | To-do: Instructions emailed for presenter (and everyone else as a reviewer)
Sep 29 | Text to Image & Image Sequence/Story/Video Generation
(1) Generative Adversarial Text to Image Synthesis (ICML 2016);
(2) DALL-E: Zero-Shot Text-to-Image Generation (ICML 2021);
(3) Video Generation From Text (AAAI 2018);
(4) Duco-StoryGAN: Improving Generation and Evaluation of Visual Stories via Semantic Consistency (NAACL 2021);
Additional readings:
(a) StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks;
(b) AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks;
(c) X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers;
(d) XMC-GAN: Cross-Modal Contrastive Learning for Text-to-Image Generation;
(e) TGAN-C: To Create What You Tell: Generating Videos from Captions;
(f) TFGAN: Conditional GAN with Discriminative Filter Generation for Text-to-Video Synthesis;
(g) Imagine This! Scripts to Compositions to Videos;
Discussion leaders: Derek, Shiv, Mohit | To-do: Instructions emailed for presenter (and everyone else as a reviewer)
Oct 06 | Multi-modal Pretraining Models (V+L, V-->L, L-->V): Strengths and Weaknesses
Oct 13 | Midterm Project Presentations (tentative) | Midterm reports due Oct 20
Oct 20 | Executing NL instructions for navigation, articulation, manipulation, assembly, skill learning, etc.
Oct 27 | Human-robot collaboration and dialogue: instruction generation, learning new subactions, mediating shared perceptual basis, referring expression generation, etc.
Nov 03 | Grounding and language learning via dialogue-based and interactive games
Nov 10 | Gesture, Turn-taking, Gaze in Human-Robot Interaction
Nov 17 | Bias, Ethics, Societal Applications; How to Write and Review Research Papers
Nov 24 | Thanksgiving Holiday
Dec 01 | Last Class: Final Project Presentations (tentative) | Final Project Write-ups Due Dec 08


The professor reserves the right to make changes to the syllabus, including project due dates. These changes will be announced as early as possible.