COMP 790/590: Connecting Language to Vision and Robotics (Fall 2021)

Instructor: Mohit Bansal
Units: 3
Office: FB 244
Lectures: Wed 10:40am-1:10pm ET, Rm SN-011
Office Hours: Wed 1:10pm-2:10pm ET (by appointment; remote/Zoom option)
Course Webpage:
Course Email: nlpcomp790unc -at-



This course covers the connections between natural language processing (NLP) and the fields of computer vision and robotics. It will cover a wide variety of topics in multimodal NLP, such as image/video-based captioning, retrieval, QA, and dialogue; vision+language commonsense; query-based video summarization; robotic navigation and manipulation instruction execution and generation; and multimodal and embodied pretraining models (as well as their ethics/bias/societal applications and issues).



Students not meeting these requirements must receive the explicit permission of the instructor to remain in this course.


Grading will (tentatively) consist of:

All submissions should be emailed to:

Reference Books

For NLP concepts refresher, see:

For multimodal NLP concepts, see corresponding lectures of my previous classes and proceedings+talks of Multimodal Workshops at NLP, Vision, and Robotics conferences (e.g., RoboNLP, VQA Workshop, ALVR, LANTERN, etc.)


Date | Topic | Readings | Discussion Leaders | Todo's
Aug 18Intro to the Course -- Mohit --
Aug 25 Image-Text Alignment / Matching / Retrieval & Referring Expression Comprehension (1) Deep Visual-Semantic Alignments for Generating Image Descriptions (Karpathy and Fei-Fei, CVPR 2015);
(2) CLIP: Learning Transferable Visual Models From Natural Language Supervision (Radford et al., ICML 2021);
(3) MAttNet: Modular Attention Network for Referring Expression Comprehension (Yu et al., CVPR 2018);
Additional readings:
(a) Matching Words and Pictures (Barnard et al., JMLR 2003);
(b) VSE++: Improving Visual-Semantic Embeddings with Hard Negatives (Faghri et al., BMVC 2018);
(c) Stacked Cross Attention for Image-Text Matching (Lee et al., ECCV 2018);
(d) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions;
(e) What are you talking about? Text-to-Image Coreference (Kong et al., CVPR 2014);
Jaemin, Mohit Instructions emailed for presenter (and everyone else as a reviewer)
Sep 01 Referring Expression Generation & Image/Video Captioning (1) Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (Xu et al., ICML 2015);
(2) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions (Yu et al., CVPR 2017);
(3) Sequence to Sequence -- Video to Text (Venugopalan et al., ICCV 2015);
(4) Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks (Yu et al., CVPR 2016);
Additional readings:
(a) Deep Visual-Semantic Alignments for Generating Image Descriptions;
(b) Explain Images with Multimodal Recurrent Neural Networks;
(c) From captions to visual concepts and back;
(d) Learning a Recurrent Visual Representation for Image Caption Generation;
(e) Image Captioning with Semantic Attention;
(f) Self-critical Sequence Training for Image Captioning;
(g) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering;
(h) DenseCap: Fully Convolutional Localization Networks for Dense Captioning;
(i) Reinforced Video Captioning with Entailment Rewards;
(j) MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning;
(k) VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research;
Rakesh, Muqeeth, Mohit Instructions emailed for presenter (and everyone else as a reviewer)
Sep 08 Image/Video Question Answering & Dialogue (1) VQA v2: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering (Goyal et al., CVPR 2017);
(2) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering (Anderson et al., CVPR 2018);
(3) TVQA: Localized, Compositional Video Question Answering (Lei et al., EMNLP 2018);
(4) Visual Dialog (Das et al., CVPR 2017);
Additional readings:
(a) Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding;
(b) Bilinear Attention Networks;
(c) Deep Modular Co-Attention Networks for Visual Question Answering;
(d) MovieQA: Understanding Stories in Movies through Question-Answering;
(e) A Joint Sequence Fusion Model for Video Question Answering and Retrieval;
(f) Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning;
Mohaiminul, Zhuofan, Mohit Instructions emailed for presenter (and everyone else as a reviewer)
Sep 15 Query-based Video Moment Retrieval & Summary / Highlight / Saliency Prediction (1) Localizing Moments in Video with Natural Language (Hendricks et al., ICCV 2017);
(2) TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (Lei et al., ECCV 2020);
(3) Query-Focused Video Summarization: Dataset, Evaluation, and a Memory Network Based Approach (Sharghi et al., CVPR 2017);
(4) QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries (Lei et al., arXiv 2021);
Additional readings:
(a) TALL: Temporal Activity Localization via Language Query;
(b) Temporal Localization of Moments in Video Collections with Natural Language;
(c) Multi-task deep visual-semantic embedding for video thumbnail selection;
(d) mTVR: Multilingual Moment Retrieval in Videos;
(e) YouCook2-Retrieval: Towards Automatic Learning of Procedures from Web Instructional Videos;
Haikang, David, Mohit Instructions emailed for presenter (and everyone else as a reviewer)
Sep 22 Vision+Language Commonsense, Reasoning;
Project Brainstorming+Feedback
(1) Predicting Motivations of Actions by Leveraging Text (CVPR 2016);
(2) VCR: From Recognition to Cognition: Visual Commonsense Reasoning (CVPR 2019);
(3) VisualCOMET: Reasoning about the Dynamic Context of a Still Image (ECCV 2020);
(4) VLEP: What is More Likely to Happen Next? Video-and-Language Future Event Prediction (EMNLP 2020);
Additional readings:
(a) Visual Persuasion: Inferring Communicative Intents of Images;
(b) Don’t Just Listen, Use Your Imagination: Leveraging Visual Common Sense for Non-Visual Tasks;
(c) Learning Common Sense Through Visual Abstraction;
(d) Anticipating Visual Representations from Unlabeled Video;
(e) "What Happens If..." Learning to Predict the Effect of Forces in Images;
(f) Grounding Visual Explanations;
(g) VIOLIN: A Large-Scale Dataset for Video-and-Language Inference;
Eva, Blaine, Mohit Instructions emailed for presenter (and everyone else as a reviewer)
Sep 29 Text to Image & Image Sequence/Story/Video Generation (1) Generative Adversarial Text to Image Synthesis (ICML 2016);
(2) DALL-E: Zero-Shot Text-to-Image Generation (ICML 2021);
(3) Video Generation From Text (AAAI 2018);
(4) DuCo-StoryGAN: Improving Generation and Evaluation of Visual Stories via Semantic Consistency (NAACL 2021);
Additional readings:
(a) StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks;
(b) AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks;
(c) X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers;
(d) XMC-GAN: Cross-Modal Contrastive Learning for Text-to-Image Generation;
(e) TGAN-C: To Create What You Tell: Generating Videos from Captions;
(f) TFGAN: Conditional GAN with Discriminative Filter Generation for Text-to-Video Synthesis;
(g) Imagine This! Scripts to Compositions to Videos;
Derek, Shiv, Mohit Instructions emailed for presenter (and everyone else as a reviewer)
Oct 06 Vision+Language Pretraining Models (V+L, V+L as Text, V-->L, L-->V) (1) LXMERT: Learning Cross-Modality Encoder Representations from Transformers (EMNLP 2019);
/ ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks (NeurIPS 2019);
(2) Unifying Vision-and-Language Tasks via Text Generation (ICML 2021);
(3) Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision (EMNLP 2020);
(4) VirTex: Learning Visual Representations from Textual Annotations (CVPR 2021);
Additional readings:
(a) Several other VLP models (e.g., UNITER, VL-BERT, VisualBERT, etc.) mentioned in this link;
(b) Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks;
(c) ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision;
(d) ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph;
(e) VideoBERT: A Joint Model for Video and Language Representation Learning;
(f) HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training;
(g) UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation;
Yi-Lin, Zineng, Mohit Instructions emailed for presenter (and everyone else as a reviewer)
Oct 13 Midterm Project Presentations Midterm reports due Oct 20
Oct 20 Executing NL instructions for navigation, articulation, manipulation, assembly, skill learning, etc. (1) Learning to Interpret Natural Language Navigation Instructions from Observations (AAAI, 2011) ;
(2) Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences (AAAI, 2016) ;
(3) Room-to-Room (R2R): Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments (CVPR, 2018);
(4) ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks (CVPR, 2020) ;
Additional readings:
(a) Weakly Supervised Learning of Semantic Parsers for Mapping Instructions to Actions ;
(b) Learning to Follow Navigational Directions ;
(c) Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation;
(d) A Natural Language Planner Interface for Mobile Manipulators;
(e) Natural Language Communication with Robots;
(f) Interpreting and Executing Recipes with a Cooking Robot;
(g) Tell Me Dave: Context-Sensitive Grounding of Natural Language to Mobile Manipulation Instructions;
(h) Robobarista: Object Part based Transfer of Manipulation Trajectories from Crowd-sourcing in 3D Pointclouds;
(i) Touchdown: Natural Language Navigation and Spatial Reasoning in Visual Street Environments;
(j) ArraMon: A Joint Navigation-Assembly Instruction Interpretation Task in Dynamic Environments;
(k) Speaker-Follower Models for Vision-and-Language Navigation;
(l) Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation;
(m) Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout;
Zheng, Jialu, Mohit
Oct 27 Human-robot collaboration and dialogue: instruction generation, embodied Q&A, learning new subactions, etc. (1) Back to the Blocks World: Learning New Actions through Situated Human-Robot Dialogue (SIGDIAL, 2014);
(2) Learning to Interpret Natural Language Commands through Human-Robot Dialog (IJCAI, 2015) ;
(3) Embodied Question Answering (CVPR, 2018) ;
(4) CVDN: Vision-and-Dialog Navigation (CoRL, 2019) ;
Additional readings:
(a) Clarifying Commands with Information-Theoretic Human-Robot Dialog ;
(b) Asking for Help Using Inverse Semantics ;
(c) Navigational Instruction Generation as Inverse Reinforcement Learning with Neural Machine Translation ;
(d) PLOW: A Collaborative Task Learning Agent ;
(e) REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments ;
(f) Multi-Target Embodied Question Answering ;
Yumeng, Shengze, Mohit
Nov 03 Grounding and language learning/emergence via multi-agent dialogue-based and interactive games (1) Learning Language Games through Interaction (ACL, 2016) ;
(2) Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols (NeurIPS, 2017) ;
(3) Emergence of Grounded Compositional Language in Multi-Agent Populations (AAAI, 2018) ;
(4) Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog (EMNLP, 2017) ;
Additional readings:
(a) Collaborative Models for Referring Expression Generation in Situated Dialogue ;
(b) Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning ;
(c) Multi-Agent Cooperation and the Emergence of (Natural) Language ;
(d) Learning to Play Guess Who? and Inventing a Grounded Language as a Consequence ;
(e) Learning to Speak and Act in a Fantasy Text Adventure Game ;
(f) Emergent Multi-Agent Communication in the Deep Learning Era ;
David, Jaemin, Mohit
Nov 10 Non-Verbal Human-Robot Interaction/Communication: Gestures, Turn-taking, Gaze (1) Generation of Nodding, Head Tilting and Eye Gazing for Human-Robot Dialogue Interaction (HRI, 2012);
(2) Conversational Gaze Aversion for Humanlike Robots (HRI, 2014);
(3) Understanding Human Gaze Communication by Spatio-Temporal Graph Reasoning (ICCV, 2019);
(4) Simon plays Simon says: The timing of turn-taking in an imitation game (RO-MAN, 2011);
Additional readings:
(a) Effects of Responding to, Initiating and Ensuring Joint Attention in Human-Robot Interaction;
(b) See You See Me: The Role of Eye Contact in Multimodal Human-Robot Interaction ;
(c) Social eye gaze in human-robot interaction: a review ;
(d) Embodiment in Socially Interactive Robots ;
(e) Understanding Teacher Gaze Patterns for Robot Learning ;
(f) Human Gaze Following for Human-Robot Interaction ;
(g) A review of eye gaze in virtual agents, social robotics and hci: Behaviour generation, user interaction and perception ;
(h) Gaze and Attention Management for Embodied Conversational Agents;
Shiv, Shengze, Mohaiminul, Muqeeth, Mohit
Nov 17 Multimodal Biases and Shortcuts; How to Write and Review Research Papers (1) Multimodal datasets: misogyny, pornography, and malignant stereotypes;
(2) RUBi: Reducing Unimodal Biases in Visual Question Answering (NeurIPS, 2019);
(3) Measuring Social Biases in Grounded Vision and Language Embeddings (NAACL, 2021) ;
(4) Diagnosing the Environment Bias in Vision-and-Language Navigation (IJCAI, 2020);
Additional readings:
(a) Shifting the Baseline: Single Modality Performance on Visual Navigation & QA ;
(b) Modality-Balanced Models for Visual Dialogue ;
(c) Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions;
(d) MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering ;
(e) Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models ;
(f) Are Gender-Neutral Queries Really Gender-Neutral? Mitigating Gender Bias in Image Search ;
(g) Language (Technology) is Power: A Critical Survey of "Bias" in NLP ;
(h) ACL Wiki: Ethics in NLP Resources/Courses;
Rakesh, Derek, Yi-Lin, Jialu, Mohit
Nov 24 Thanksgiving Holiday
Dec 01 Last Class: Final Project Presentations Final Project Write-ups Due Dec 08


The professor reserves the right to make changes to the syllabus, including project due dates. These changes will be announced as early as possible.