COMP 690: Multimodal AI: Connecting Language to Vision and Robotics (Spring 2026)

Instructor: Mohit Bansal
Units: 3
Lectures: Wednesdays 11:15am-1:45pm ET, Room SN-115
Office Hours: Mondays, by appointment
Course Webpage: https://www.cs.unc.edu/~mbansal/teaching/multimodal-ai-comp690-spring26.html
Course Email: nlpcomp790unc -at- gmail.com




Introduction

This course examines the connections between natural language processing (NLP) and the neighboring fields of computer vision and robotics. It will cover a wide variety of topics in multimodal NLP, such as image/video-based captioning, retrieval, QA, and dialogue; vision+language commonsense; query-based video summarization; text-to-image/video generation; any-to-any and 3D generation; robotic navigation and manipulation instruction execution and generation; and unified multimodal and embodied pretraining models (as well as ethics/bias/societal applications and issues).


Topics



Prerequisites

Students not meeting these requirements must receive the explicit permission of the instructor to remain in this course.


Grading

Grading will (tentatively) consist of:

All submissions should be emailed to: nlpcomp790unc@gmail.com.


Reference Books

For a refresher on NLP concepts, see:

For multimodal NLP concepts, see the corresponding lectures from my previous classes and the proceedings and talks of the multimodal workshops at NLP, vision, and robotics conferences (e.g., RoboNLP, VQA Workshop, ALVR, LANTERN, CLVL).


Schedule (tentative)

Date | Topic | Readings | Discussion Leaders
Jan 7 | Intro to the Course | -- | Mohit
Jan 14 | Image-Text Alignment / Matching / Retrieval & Referring Expression Comprehension (code sketch after this entry) | Readings: (1) CLIP: Learning Transferable Visual Models From Natural Language Supervision (Radford et al., ICML 2021);
(2) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (Li et al., ICML 2022);
(3) MAttNet: Modular Attention Network for Referring Expression Comprehension (Yu et al., CVPR 2018);
(4) Referring Transformer: A One-step Approach to Multi-task Visual Grounding (Li and Sigal, NeurIPS 2021);
Additional readings:
(a) Matching Words and Pictures (Barnard et al., JMLR 2003);
(b) Deep Visual-Semantic Alignments for Generating Image Descriptions (Karpathy and Fei-Fei, CVPR 2015);
(c) VSE++: Improving Visual-Semantic Embeddings with Hard Negatives (Faghri et al., BMVC 2018);
(d) Stacked Cross Attention for Image-Text Matching (Lee et al., ECCV 2018);
(e) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions (Yu et al., CVPR 2017);
(f) What are you talking about? Text-to-Image Coreference (Kong et al., CVPR 2014);
(g) Align before Fuse: Vision and Language Representation Learning with Momentum Distillation;
(h) Segment Anything;
Discussion leaders: Yidong, Atharv, Mohit
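To make the contrastive image-text alignment objective behind readings such as CLIP and ALBEF concrete, here is a minimal PyTorch sketch of a symmetric InfoNCE loss over a batch of paired image/text embeddings. The feature dimension, temperature value, and the random tensors standing in for real encoder outputs are illustrative assumptions, not the papers' released implementations.

```python
# A minimal sketch of a CLIP-style contrastive objective, written from scratch
# for illustration; the encoders and dimensions are hypothetical placeholders.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_feats, text_feats: (batch, dim) tensors from any image/text encoders.
    """
    # L2-normalize so dot products are cosine similarities.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the true pairs.
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random features standing in for encoder outputs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt).item())
```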
Jan 21 | Referring Expression Generation & Image/Video Captioning (code sketch after this entry) | Readings: (1) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions (Yu et al., CVPR 2017);
(2) Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning (Sharma et al., ACL 2018);
(3) Neural Baby Talk (Lu et al., CVPR 2018);
(4) SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning (Lin et al., CVPR 2022);
Additional readings:
Deep Visual-Semantic Alignments for Generating Image Descriptions;
Explain Images with Multimodal Recurrent Neural Networks;
From captions to visual concepts and back;
Learning a Recurrent Visual Representation for Image Caption Generation;
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention;
Image Captioning with Semantic Attention;
Self-critical Sequence Training for Image Captioning;
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering;
DenseCap: Fully Convolutional Localization Networks for Dense Captioning;
Sequence to Sequence -- Video to Text;
Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks;
Reinforced Video Captioning with Entailment Rewards;
MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning;
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research;
ClipCap: CLIP Prefix for Image Captioning;
Fine-grained Image Captioning with CLIP Reward;
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning;
Discussion leaders: Rohan, Amogh, Mohit
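As a companion to the captioning readings, here is a toy "Show and Tell"-style sketch in PyTorch: an image feature vector seeds a GRU decoder that greedily emits tokens. The vocabulary size, dimensions, special-token ids, and the random image feature are all hypothetical; the models in the readings use stronger visual encoders, attention, and beam search or RL-based decoding.

```python
# Toy image-conditioned caption decoder: image feature -> GRU initial state,
# then greedy token-by-token decoding. All sizes/ids are hypothetical.
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size=1000, img_dim=512, hid_dim=256):
        super().__init__()
        self.init_h = nn.Linear(img_dim, hid_dim)       # image feature -> initial state
        self.embed = nn.Embedding(vocab_size, hid_dim)  # token embeddings
        self.gru = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)       # next-token logits

    @torch.no_grad()
    def greedy_decode(self, img_feat, bos_id=1, eos_id=2, max_len=20):
        h = torch.tanh(self.init_h(img_feat)).unsqueeze(0)   # (1, batch, hid)
        token = torch.full((img_feat.size(0), 1), bos_id, dtype=torch.long)
        caption = []
        for _ in range(max_len):
            emb = self.embed(token)                # (batch, 1, hid)
            out, h = self.gru(emb, h)              # one decoding step
            token = self.out(out[:, -1]).argmax(-1, keepdim=True)
            caption.append(token)
            if (token == eos_id).all():
                break
        return torch.cat(caption, dim=1)           # (batch, steps) token ids

model = TinyCaptioner()
print(model.greedy_decode(torch.randn(2, 512)).shape)
```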
Jan 28 | Image/Video Question Answering & Dialogue (code sketch after this entry) | Readings: (1) VQA v2: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering (Goyal et al., CVPR 2017);
(2) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering (Anderson et al., CVPR 2018);
(3) TVQA: Localized, Compositional Video Question Answering (Lei et al., EMNLP 2018);
(4) VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos (Wang et al., CVPR 2025);
Additional readings:
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering;
Visual Dialog (Das et al., CVPR 2017);
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding;
Bilinear Attention Networks;
Deep Modular Co-Attention Networks for Visual Question Answering;
MovieQA: Understanding Stories in Movies through Question-Answering;
A Joint Sequence Fusion Model for Video Question Answering and Retrieval;
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning;
Modality-Balanced Models for Visual Dialogue;
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA;
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions;
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling;
Self-Chained Image-Language Model for Video Localization and Question Answering;
A Simple LLM Framework for Long-Range Video Question-Answering;
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding;
Discussion leaders: Adam, Alex, Mohit
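To illustrate the classification-style VQA setup in readings such as Bottom-Up and Top-Down Attention, here is a minimal PyTorch sketch: a question vector attends over image region features, the attended visual vector is fused with the question, and a classifier scores a fixed answer vocabulary. The module names, dimensions, and random inputs are hypothetical placeholders, not the paper's implementation.

```python
# Simplified VQA-as-classification sketch: question-guided attention over
# region features, element-wise fusion, answer-vocabulary classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVQA(nn.Module):
    def __init__(self, q_dim=512, v_dim=2048, hid=512, num_answers=3000):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hid)
        self.v_proj = nn.Linear(v_dim, hid)
        self.att = nn.Linear(hid, 1)                  # one attention score per region
        self.classifier = nn.Linear(hid, num_answers)

    def forward(self, q_feat, region_feats):
        # q_feat: (batch, q_dim) question encoding (e.g., from a GRU or BERT pooling).
        # region_feats: (batch, num_regions, v_dim) detected-object features.
        q = self.q_proj(q_feat)                       # (batch, hid)
        v = self.v_proj(region_feats)                 # (batch, regions, hid)
        scores = self.att(torch.tanh(v + q.unsqueeze(1)))  # (batch, regions, 1)
        alpha = F.softmax(scores, dim=1)              # attention over regions
        v_att = (alpha * v).sum(dim=1)                # (batch, hid) attended visual vector
        fused = q * v_att                             # element-wise fusion
        return self.classifier(fused)                 # answer logits

model = TinyVQA()
logits = model(torch.randn(4, 512), torch.randn(4, 36, 2048))
print(logits.shape)  # torch.Size([4, 3000])
```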
Feb 4 | Query-based Video Moment Retrieval & Summary / Highlight / Saliency Prediction (code sketch after this entry) | Readings: (1) TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (Lei et al., ECCV 2020);
(2) UniVTG: Towards Unified Video-Language Temporal Grounding (Lin et al., ICCV 2023);
(3) QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries (Lei et al., NeurIPS 2021);
(4) UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection (Liu et al., CVPR 2022);
Additional readings:
Localizing Moments in Video with Natural Language;
TALL: Temporal Activity Localization via Language Query;
Temporal Localization of Moments in Video Collections with Natural Language;
Multi-task deep visual-semantic embedding for video thumbnail selection;
mTVR: Multilingual Moment Retrieval in Videos;
YouCook2-Retrieval: Towards Automatic Learning of Procedures from Web Instructional Videos;
Ranking Domain-specific Highlights by Analyzing Edited Videos;
TVSum: Summarizing Web Videos Using Titles;
Query-Focused Video Summarization: Dataset, Evaluation, and a Memory Network Based Approach;
CLIP-It! Language-Guided Video Summarization;
TubeDETR: Spatio-Temporal Video Grounding with Transformers;
Hierarchical Video-Moment Retrieval and Step-Captioning;
VidChapters-7M: Video Chapters at Scale;
TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency;
Query-Dependent Video Representation for Moment Retrieval and Highlight Detection;
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding;
Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection;
TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection;
MomentDiff: Generative Video Moment Retrieval from Random to Real;
VTimeLLM: Empower LLM to Grasp Video Moments;
LITA: Language Instructed Temporal-Localization Assistant;
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning;
Discussion leaders: Rishab, Joy, Mohit
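The sketch below illustrates, in simplified form, the query-conditioned saliency scoring idea behind readings such as QVHighlights and UniVTG: each video clip feature receives a relevance score with respect to the text query, and the highest-scoring fixed-length window is returned as a candidate moment. The features, dimensions, and fixed window length are assumptions made for illustration; the actual models learn scoring and span prediction end to end.

```python
# Simplified query-conditioned saliency scoring and window-based moment
# localization; features and window length are hypothetical.
import torch
import torch.nn.functional as F

def score_and_localize(clip_feats, query_feat, window=5):
    """clip_feats: (num_clips, dim); query_feat: (dim,). Returns (scores, (start, end))."""
    clip_feats = F.normalize(clip_feats, dim=-1)
    query_feat = F.normalize(query_feat, dim=-1)
    scores = clip_feats @ query_feat                     # per-clip saliency, (num_clips,)

    # Slide a fixed-length window and keep the span with the highest mean saliency.
    window_scores = scores.unfold(0, window, 1).mean(dim=-1)
    start = int(window_scores.argmax())
    return scores, (start, start + window)

scores, span = score_and_localize(torch.randn(60, 512), torch.randn(512))
print(scores.shape, span)
```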
Feb 11 | Vision+Language Commonsense, World Knowledge, Reasoning; Project Brainstorming+Feedback | Readings: (1) VisualCOMET: Reasoning about the Dynamic Context of a Still Image (ECCV 2020);
(2) MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (CVPR 2024);
(3) VLEP: What is More Likely to Happen Next? Video-and-Language Future Event Prediction (EMNLP 2020);
(4a) Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models (NeurIPS 2024);
(4b) OpenAI: Thinking with images
Additional readings:
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge;
VCR: From Recognition to Cognition: Visual Commonsense Reasoning;
Visual Persuasion: Inferring Communicative Intents of Images;
Don't Just Listen, Use Your Imagination: Leveraging Visual Common Sense for Non-Visual Tasks;
Learning Common Sense Through Visual Abstraction;
Anticipating Visual Representations from Unlabeled Video;
Predicting Motivations of Actions by Leveraging Text;
"What Happens If..." Learning to Predict the Effect of Forces in Images;
Grounding Visual Explanations;
VIOLIN: A Large-Scale Dataset for Video-and-Language Inference;
PIQA: Reasoning about Physical Commonsense in Natural Language;
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge;
Visual Abductive Reasoning;
Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning;
Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding;
EgoTaskQA: Understanding Human Tasks in Egocentric Videos;
CoSIm: Commonsense Reasoning for Counterfactual Scene Imagination;
Ego4D: Around the World in 3,000 Hours of Egocentric Video;
Visual Programming: Compositional visual reasoning without training;
ViperGPT: Visual Inference via Python Execution for Reasoning;
Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning;
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning;
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding;
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos;
Visual-RFT: Visual Reinforcement Fine-Tuning;
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models;
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning;
Discussion leaders: Qi, Om, Mohit
Feb 18 | Vision+Language Pretrained Models (V+L, V-->L, L-->V, Unified, Any-to-Any) (code sketch after this entry) | Readings: (1) LXMERT: Learning Cross-Modality Encoder Representations from Transformers;
/ ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks;
(2) Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision;
/ VirTex: Learning Visual Representations from Textual Annotations;
(3) VL-T5: Unifying Vision-and-Language Tasks via Text Generation;
/ Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks;
(4) Any-to-Any Generation via Composable Diffusion;
/ CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation;
Additional readings:
Several other VLP models (e.g., UNITER, VL-BERT, VisualBERT, etc.) mentioned in this link;
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks;
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks;
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision;
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph;
VideoBERT: A Joint Model for Video and Language Representation Learning;
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training;
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation;
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling;
MERLOT: Multimodal Neural Script Knowledge Models;
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation;
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling;
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework;
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision;
CoCa: Contrastive Captioners are Image-Text Foundation Models;
TVLT: Textless Vision-Language Transformer;
BEiT: BERT Pre-Training of Image Transformers;
Flamingo: a Visual Language Model for Few-Shot Learning;
PICa: An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA;
VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners;
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models;
Contrastive Learning of Medical Visual Representations from Paired Images and Text;
VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer;
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks;
Idefics3-8B: Building and better understanding vision-language models;
PaliGemma 2: A Family of Versatile VLMs for Transfer;
InternVL 2.5: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling;
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding;
Qwen2.5-VL: Enhanced Vision-Language Capabilities in the Qwen Series;
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks;
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation;
LLaVA: Visual Instruction Tuning;
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation;
OneLLM: One Framework to Align All Modalities with Language;
Chameleon: Mixed-Modal Early-Fusion Foundation Models;
Qwen3-VL Technical Report;
BAGEL: The Open-Source Unified Multimodal Model;
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities;
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models;
Discussion leaders: Amogh, Yidong, Mohit
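As a rough illustration of the "one model, one token sequence" idea behind unified readings such as VL-T5 and Unified-IO, the sketch below projects visual features into the text embedding space and prepends them as prefix tokens, so a single transformer consumes the mixed sequence. The tiny transformer, dimensions, and random inputs are hypothetical; the actual models build on large pretrained seq2seq or LLM backbones.

```python
# Hypothetical mini example of mixing projected visual tokens with text token
# embeddings in one sequence for a single transformer encoder.
import torch
import torch.nn as nn

class TinyUnifiedEncoder(nn.Module):
    def __init__(self, vocab_size=32000, v_dim=2048, d_model=256):
        super().__init__()
        self.visual_proj = nn.Linear(v_dim, d_model)     # map region/patch features to token space
        self.embed = nn.Embedding(vocab_size, d_model)   # ordinary text token embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, visual_feats, text_ids):
        # visual_feats: (batch, num_regions, v_dim); text_ids: (batch, seq_len)
        vis_tokens = self.visual_proj(visual_feats)          # (batch, regions, d_model)
        txt_tokens = self.embed(text_ids)                    # (batch, seq_len, d_model)
        mixed = torch.cat([vis_tokens, txt_tokens], dim=1)   # one mixed-modality sequence
        return self.encoder(mixed)                           # contextualized multimodal states

model = TinyUnifiedEncoder()
out = model(torch.randn(2, 36, 2048), torch.randint(0, 32000, (2, 16)))
print(out.shape)  # torch.Size([2, 52, 256])
```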
Feb 25 | Image/Story/Video Generation; 3D Image/Video & Game Generation + World Models | -- | TBD
Mar 4 | DocAI, Efficient VL | -- | TBD
Mar 11 | MIDTERM PROJECT PRESENTATIONS | -- | TBD
Mar 18 | SPRING BREAK | -- | TBD
Mar 25 | Language instructions for robotic navigation, articulation, manipulation, assembly, skill learning, etc. | -- | TBD
Apr 1 | Embodied Pretraining Models & Vision-Language-Action (VLA) Benchmarks/Models | -- | TBD
Apr 8 | Human-robot collaboration and dialogue: instruction generation, embodied Q&A, learning new subactions, etc. | -- | TBD
Apr 15 | Grounded language learning/emergence via multi-agent dialogue-based and interactive games; Non-Verbal Human-Robot Interaction/Communication: Gestures, Turn-taking, Gaze | -- | TBD
Apr 22 | Multimodal Biases and Shortcuts; How to Write and Review Research Papers | -- | TBD
Apr 29 | FINAL PROJECT PRESENTATIONS | -- | TBD




Disclaimer

The professor reserves the right to make changes to the syllabus, including project due dates. These changes will be announced as early as possible.