COMP 790/590: Connecting Language to Vision and Robotics (Spring 2023)

Instructor: Mohit Bansal
Units: 3
Lectures: Wed 11am-1:30pm ET, Room SN-115
Office Hours: Wed 1:30pm-2:15pm ET (by appointment) (remote/Zoom option)
Course Webpage:
Course Email: nlpcomp790unc -at-


This course covers the connections between natural language processing (NLP) and the fields of computer vision and robotics. It spans a wide variety of topics in multimodal NLP, such as image/video-based captioning, retrieval, QA, and dialogue; vision+language commonsense; query-based video summarization; text-to-image/video generation; robotic navigation + manipulation instruction execution and generation; and unified multimodal and embodied pretraining models (as well as ethics/bias/societal applications + issues).



Students not meeting these requirements must receive the explicit permission of the instructor to remain in this course.


Grading will (tentatively) consist of:

All submissions should be emailed to:

Reference Books

For NLP concepts refresher, see:

For multimodal NLP concepts, see corresponding lectures of my previous classes and proceedings+talks of Multimodal Workshops at NLP, Vision, and Robotics conferences (e.g., RoboNLP, VQA Workshop, ALVR, LANTERN, CLVL, etc.)

Schedule (tentative)

Date Topic Readings Discussion Leaders
Jan 11 Intro to the Course -- Mohit
Jan 18 Image-Text Alignment / Matching / Retrieval & Referring Expression Comprehension (1) CLIP: Learning Transferable Visual Models From Natural Language Supervision (Radford et al., ICML 2021);
(2) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (Li et al., ICML 2022);
(3) MAttNet: Modular Attention Network for Referring Expression Comprehension (Yu et al., CVPR 2018);
(4) Referring Transformer: A One-step Approach to Multi-task Visual Grounding (Li and Sigal, NeurIPS 2021);
Additional readings:
(a) Matching Words and Pictures (Barnard et al., JMLR 2003);
(b) Deep Visual-Semantic Alignments for Generating Image Descriptions (Karpathy and Fei-Fei, CVPR 2015);
(c) VSE++: Improving Visual-Semantic Embeddings with Hard Negatives (Faghri et al., BMVC 2018);
(d) Stacked Cross Attention for Image-Text Matching (Lee et al., ECCV 2018);
(e) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions (Yu et al., CVPR 2017);
(f) What are you talking about? Text-to-Image Coreference (Kong et al., CVPR 2014);
(g) Align before Fuse: Vision and Language Representation Learning with Momentum Distillation;
Shoubin, Ziyang, Mohit
Jan 25 Referring Expression Generation & Image/Video Captioning (1) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions (Yu et al., CVPR 2017);
(2) Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning (Sharma et al., ACL 2018);
(3) Neural Baby Talk (Lu et al., CVPR 2018);
(4) SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning (Lin et al., CVPR 2022);
Additional readings:
Deep Visual-Semantic Alignments for Generating Image Descriptions;
Explain Images with Multimodal Recurrent Neural Networks;
From captions to visual concepts and back;
Learning a Recurrent Visual Representation for Image Caption Generation;
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention;
Image Captioning with Semantic Attention;
Self-critical Sequence Training for Image Captioning;
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering;
DenseCap: Fully Convolutional Localization Networks for Dense Captioning;
Sequence to Sequence -- Video to Text;
Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks;
Reinforced Video Captioning with Entailment Rewards;
MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning;
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research;
ClipCap: CLIP Prefix for Image Captioning;
Fine-grained Image Captioning with CLIP Reward;
Abhay, Archiki, Mohit
Feb 01 Image/Video Question Answering & Dialogue (1a) VQA v2: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering (Goyal et al., CVPR 2017);
(1b) GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering (Hudson and Manning, CVPR 2019);
(2) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering (Anderson et al., CVPR 2018);
(3) TVQA: Localized, Compositional Video Question Answering (Lei et al., EMNLP 2018);
(4) Visual Dialog (Das et al., CVPR 2017);
Additional readings:
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding;
Bilinear Attention Networks;
Deep Modular Co-Attention Networks for Visual Question Answering;
MovieQA: Understanding Stories in Movies through Question-Answering;
A Joint Sequence Fusion Model for Video Question Answering and Retrieval;
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning;
Modality-Balanced Models for Visual Dialogue;
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA;
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions;
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling;
Vaidehi, Qin, Mohit
Feb 08 Query-based Video Moment Retrieval & Summary / Highlight / Saliency Prediction (1) Localizing Moments in Video with Natural Language (Hendricks et al., ICCV 2017);
(2) TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (Lei et al., ECCV 2020);
(3) QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries (Lei et al., NeurIPS 2021);
(4) TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency (Narasimhan et al., ECCV 2022);
Additional readings:
(a) TALL: Temporal Activity Localization via Language Query;
(b) Temporal Localization of Moments in Video Collections with Natural Language;
(c) Multi-task deep visual-semantic embedding for video thumbnail selection;
(d) mTVR: Multilingual Moment Retrieval in Videos;
(e) YouCook2-Retrieval: Towards Automatic Learning of Procedures from Web Instructional Videos;
(f) Ranking Domain-specific Highlights by Analyzing Edited Videos;
(g) TVSum: Summarizing Web Videos Using Titles;
(h) Query-Focused Video Summarization: Dataset, Evaluation, and a Memory Network Based Approach;
(i) CLIP-It! Language-Guided Video Summarization;
(j) TubeDETR: Spatio-Temporal Video Grounding with Transformers;
(k) UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection;
Mason, Yiyuan, Mohit
Feb 15 Vision+Language Commonsense, World Knowledge, Reasoning;
Project Brainstorming+Feedback
(1) OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge (CVPR 2019);
(2) VCR: From Recognition to Cognition: Visual Commonsense Reasoning (CVPR 2019);
(3) VisualCOMET: Reasoning about the Dynamic Context of a Still Image (ECCV 2020);
(4) VLEP: What is More Likely to Happen Next? Video-and-Language Future Event Prediction (EMNLP 2020);
Additional readings:
(a) Visual Persuasion: Inferring Communicative Intents of Images;
(b) Don't Just Listen, Use Your Imagination: Leveraging Visual Common Sense for Non-Visual Tasks;
(c) Learning Common Sense Through Visual Abstraction;
(d) Anticipating Visual Representations from Unlabeled Video;
(e) Predicting Motivations of Actions by Leveraging Text;
(f) "What Happens If..." Learning to Predict the Effect of Forces in Images;
(g) Grounding Visual Explanations;
(h) VIOLIN: A Large-Scale Dataset for Video-and-Language Inference;
(i) PIQA: Reasoning about Physical Commonsense in Natural Language;
(j) A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge;
(k) Visual Abductive Reasoning;
(l) Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning;
(m) Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding;
(n) EgoTaskQA: Understanding Human Tasks in Egocentric Videos;
(o) CoSIm: Commonsense Reasoning for Counterfactual Scene Imagination;
Brandon, Maggie, Mohit
Feb 22 Text to Image & Image Sequence/Story/Video Generation (1) DALL-E: Zero-Shot Text-to-Image Generation;
(2) DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Models;
(3) Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding;
(4) Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models;
(5) StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation;
(6) Make-A-Video: Text-to-Video Generation without Text-Video Data;
Additional readings:
(a) Generative Adversarial Text to Image Synthesis;
(b) StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks;
(c) AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks;
(d) Video Generation From Text;
(e) X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers;
(f) An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion;
(g) DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation;
(h) Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors;
(i) Muse: Text-To-Image Generation via Masked Generative Transformers;
(j) CogView: Mastering Text-to-Image Generation via Transformers;
(k) DALLE2: Hierarchical Text-Conditional Image Generation with CLIP Latents;
(l) Parti: Scaling Autoregressive Models for Content-Rich Text-to-Image Generation;
(m) NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion;
(n) Imagen Video: High Definition Video Generation with Diffusion Models;
(o) CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers;
Pierre, Long, Mohit
Mar 01 Vision+Language Pretraining Models (V+L, Unified V+L, V-->L, L-->V, Efficient VL) (1) LXMERT: Learning Cross-Modality Encoder Representations from Transformers;
/ ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks;
(2) VL-T5: Unifying Vision-and-Language Tasks via Text Generation;
/ Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks;
(3) Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision;
/ VirTex: Learning Visual Representations from Textual Annotations;
(4) Multimodal Few-Shot Learning with Frozen Language Models;
/ VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks;
Additional readings:
(a) Several other VLP models (e.g., UNITER, VL-BERT, VisualBERT, etc.) mentioned in this link;
(b) Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks;
(c) ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision;
(d) ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph;
(e) VideoBERT: A Joint Model for Video and Language Representation Learning;
(f) HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training;
(g) UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation;
(h) Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling;
(i) MERLOT: Multimodal Neural Script Knowledge Models;
(j) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation;
(k) UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling;
(l) OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework;
(m) SimVLM: Simple Visual Language Model Pretraining with Weak Supervision;
(n) CoCa: Contrastive Captioners are Image-Text Foundation Models;
(o) TVLT: Textless Vision-Language Transformer;
(p) BEiT: BERT Pre-Training of Image Transformers;
(q) Flamingo: a Visual Language Model for Few-Shot Learning;
(r) LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning;
(s) PICa: An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA;
(t) VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners;
(u) BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models;
(v) Contrastive Learning of Medical Visual Representations from Paired Images and Text;
(w) VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer;
Shoubin, Qin, Mohit
Mar 08 Midterm Project Presentations
Mar 15 Spring break holiday
Mar 22 Language instructions for robotic navigation, articulation, manipulation, assembly, skill learning, etc. (1a) DUET: Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation;
/ (1b) EnvEdit: Environment Editing for Vision-and-Language Navigation;
(2) PaLM-E: An Embodied Multimodal Language Model;
(3) Room2Room: Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments;
(4) ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks;
Additional readings:
(a) Learning to Interpret Natural Language Navigation Instructions from Observations (AAAI, 2011);
(b) Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences (AAAI, 2016);
(c) Weakly Supervised Learning of Semantic Parsers for Mapping Instructions to Actions;
(d) Learning to Follow Navigational Directions;
(e) Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation;
(f) A Natural Language Planner Interface for Mobile Manipulators;
(g) Natural Language Communication with Robots;
(h) Interpreting and Executing Recipes with a Cooking Robot;
(i) Tell Me Dave: Context-Sensitive Grounding of Natural Language to Mobile Manipulation Instructions;
(j) Robobarista: Object Part based Transfer of Manipulation Trajectories from Crowd-sourcing in 3D Pointclouds;
(k) Touchdown: Natural Language Navigation and Spatial Reasoning in Visual Street Environments;
(l) ArraMon: A Joint Navigation-Assembly Instruction Interpretation Task in Dynamic Environments;
(m) Speaker-Follower Models for Vision-and-Language Navigation;
(n) Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation;
(o) Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout;
(p) Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding;
(q) A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning;
(r) HAMT: History Aware Multimodal Transformer for Vision-and-Language Navigation;
(s) HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation;
(t) MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge;
(u) Simple but Effective: CLIP Embeddings for Embodied AI;
(v) CLIPort: What and Where Pathways for Robotic Manipulation;
Ziyang, Mason, Mohit
Mar 29 Human-robot collaboration and dialogue: instruction generation, embodied Q&A, learning new subactions, etc. (1) Back to the Blocks World: Learning New Actions through Situated Human-Robot Dialogue;
(2a) Embodied Question Answering;
/ (2b) IQA: Visual Question Answering in Interactive Environments;
(3a) CVDN: Vision-and-Dialog Navigation;
/ (3b) TEACh: Task-driven Embodied Agents that Chat;
(4) CerealBar: Executing Instructions in Situated Collaborative Interactions;
Additional readings:
(a) Learning to Interpret Natural Language Commands through Human-Robot Dialog;
(b) Clarifying Commands with Information-Theoretic Human-Robot Dialog;
(c) Asking for Help Using Inverse Semantics;
(d) Navigational Instruction Generation as Inverse Reinforcement Learning with Neural Machine Translation;
(e) PLOW: A Collaborative Task Learning Agent;
(f) REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments;
(g) Multi-Target Embodied Question Answering;
(h) AutoVLN: Learning from Unlabeled 3D Environments for Vision-and-Language Navigation;
Abhay, Vaidehi, Mohit
Apr 05 Grounded language learning/emergence via multi-agent dialogue-based and interactive games (1) Grounded Language Learning in a Simulated 3D World;
(2) Emergence of Linguistic Communication from Referential Games with Symbolic and Pixel Input;
(3) Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text;
(4) Do As I Can, Not As I Say: Grounding Language in Robotic Affordances;
Additional readings:
(a) Learning Language Games through Interaction;
(b) Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols;
(c) Emergence of Grounded Compositional Language in Multi-Agent Populations;
(d) Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog;
(e) Collaborative Models for Referring Expression Generation in Situated Dialogue;
(f) Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning;
(g) Multi-Agent Cooperation and the Emergence of (Natural) Language;
(h) Learning to Play Guess Who? and Inventing a Grounded Language as a Consequence;
(i) Learning to Speak and Act in a Fantasy Text Adventure Game;
(j) Emergent Multi-Agent Communication in the Deep Learning Era;
(k) Emergence of Linguistic Communication from Referential Games with Symbolic and Pixel Input;
(l) Emergent Communication through Negotiation;
(m) BabyAI: A Platform to Study the Sample Efficiency of Grounded Language Learning;
(n) Learning when to Communicate at Scale in Multiagent Cooperative and Competitive Tasks;
(o) On the Pitfalls of Measuring Emergent Communication;
(p) Two Body Problem: Collaborative Visual Task Completion;
(q) Emergent Communication at Scale;
Long, Pierre, Mohit
Apr 12 Non-Verbal Human-Robot Interaction/Communication: Gestures, Turn-taking, Gaze (1) Generation of Nodding, Head Tilting and Eye Gazing for Human-Robot Dialogue Interaction;
(2) Conversational Gaze Aversion for Humanlike Robots (HRI, 2014);
(3) Understanding Human Gaze Communication by Spatio-Temporal Graph Reasoning;
(4) Simon plays Simon says: The timing of turn-taking in an imitation game;
Additional readings:
(a) Effects of Responding to, Initiating and Ensuring Joint Attention in Human-Robot Interaction;
(b) See You See Me: The Role of Eye Contact in Multimodal Human-Robot Interaction;
(c) Social eye gaze in human-robot interaction: a review;
(d) Embodiment in Socially Interactive Robots;
(e) Understanding Teacher Gaze Patterns for Robot Learning;
(f) Human Gaze Following for Human-Robot Interaction;
(g) A review of eye gaze in virtual agents, social robotics and HCI: Behaviour generation, user interaction and perception;
(h) Gaze and Attention Management for Embodied Conversational Agents;
(i) Learning About Objects by Learning to Interact with Them;
(j) Pushing it out of the Way: Interactive Visual Navigation;
Maggie, Brandon, Mohit
Apr 19 Multimodal Biases and Shortcuts; How to Write and Review Research Papers (1a) RUBi: Reducing Unimodal Biases in Visual Question Answering;
/ (1b) Debiased Visual Question Answering from Feature and Sample Perspectives;
(2) Revealing Single Frame Bias for Video-and-Language Learning;
(3) Multimodal datasets: misogyny, pornography, and malignant stereotypes;
(4a) Measuring Social Biases in Grounded Vision and Language Embeddings;
/ (4b) Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations;
Additional readings:
(a) Shifting the Baseline: Single Modality Performance on Visual Navigation & QA;
(b) Modality-Balanced Models for Visual Dialogue;
(c) Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions;
(d) MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering;
(e) REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets;
(f) VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives;
(g) Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models;
(h) Diagnosing the Environment Bias in Vision-and-Language Navigation;
(i) Deconfounded Visual Grounding;
(j) Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models;
(k) Are Gender-Neutral Queries Really Gender-Neutral? Mitigating Gender Bias in Image Search;
(l) Language (Technology) is Power: A Critical Survey of "Bias" in NLP;
(m) Advances, challenges and opportunities in creating data for trustworthy AI;
(n) Data and its (dis)contents: A survey of dataset development and use in machine learning research;
(o) ACL Wiki: Ethics in NLP Resources/Courses;
Archiki, Yiyuan, Mohit
Apr 26 Last Class: Final Project Presentations (tentative)


The professor reserves the right to make changes to the syllabus, including project due dates. These changes will be announced as early as possible.