Schedule (tentative)
| Date | Topic | Readings | Discussion Leaders |
| Jan 7 | Intro to the Course | -- | Mohit |
| Jan 14 | Image-Text Alignment / Matching / Retrieval & Referring Expression Comprehension |
(1) CLIP: Learning Transferable Visual Models From Natural Language Supervision (Radford et al., ICML 2021);
(2) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (Li et al., ICML 2022);
(3) MAttNet: Modular Attention Network for Referring Expression Comprehension (Yu et al., CVPR 2018);
(4) Referring Transformer: A One-step Approach to Multi-task Visual Grounding (Li and Sigal, NeurIPS 2021);
Additional readings:
(a) Matching Words and Pictures (Barnard et al., JMLR 2003);
(b) Deep Visual-Semantic Alignments for Generating Image Descriptions (Karpathy and Fei-Fei, CVPR 2015);
(c) VSE++: Improving Visual-Semantic Embeddings with Hard Negatives (Faghri et al., BMVC 2018);
(d) Stacked Cross Attention for Image-Text Matching (Lee et al., ECCV 2018);
(e) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions (Yu et al., CVPR 2017);
(f) What are you talking about? Text-to-Image Coreference (Kong et al., CVPR 2014);
(g) Align before Fuse: Vision and Language Representation Learning with Momentum Distillation;
(h) Segment Anything;
| Yidong, Atharv, Mohit |
| Jan 21 | Referring Expression Generation & Image/Video Captioning |
(1) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions (Yu et al., CVPR 2017);
(2) Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning (Sharma et al., ACL 2018);
(3) Neural Baby Talk (Lu et al., CVPR 2018);
(4) SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning (Lin et al., CVPR 2022);
Additional readings:
Deep Visual-Semantic Alignments for Generating Image Descriptions;
Explain Images with Multimodal Recurrent Neural Networks;
From captions to visual concepts and back;
Learning a Recurrent Visual Representation for Image Caption Generation;
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention;
Image Captioning with Semantic Attention;
Self-critical Sequence Training for Image Captioning;
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering;
DenseCap: Fully Convolutional Localization Networks for Dense Captioning;
Sequence to Sequence -- Video to Text;
Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks;
Reinforced Video Captioning with Entailment Rewards;
MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning;
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research;
ClipCap: CLIP Prefix for Image Captioning;
Fine-grained Image Captioning with CLIP Reward;
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning;
| Rohan, Amogh, Mohit |
| Jan 28 | Image/Video Question Answering & Dialogue |
(1) VQA v2: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering (Goyal et al., CVPR 2017);
(2) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering (Anderson et al., CVPR 2018);
(3) TVQA: Localized, Compositional Video Question Answering (Lei et al., EMNLP 2018);
(4) VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos (Wang et al., CVPR 2025);
Additional readings:
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering;
Visual Dialog (Das et al., CVPR 2017);
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding;
Bilinear Attention Networks;
Deep Modular Co-Attention Networks for Visual Question Answering;
MovieQA: Understanding Stories in Movies through Question-Answering;
A Joint Sequence Fusion Model for Video Question Answering and Retrieval;
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning;
Modality-Balanced Models for Visual Dialogue;
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA;
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions;
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling;
Self-Chained Image-Language Model for Video Localization and Question Answering;
A Simple LLM Framework for Long-Range Video Question-Answering;
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding;
| Adam, Alex, Mohit |
| Feb 4 | Query-based Video Moment Retrieval & Summary / Highlight / Saliency Prediction |
(1) TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (Lei et al., ECCV 2020);
(2) UniVTG: Towards Unified Video-Language Temporal Grounding (Lin et al., ICCV 2023);
(3) QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries (Lei et al., NeurIPS 2021);
(4) UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection (Liu et al., CVPR 2022);
Additional readings:
Localizing Moments in Video with Natural Language;
TALL: Temporal Activity Localization via Language Query;
Temporal Localization of Moments in Video Collections with Natural Language;
Multi-task deep visual-semantic embedding for video thumbnail selection;
mTVR: Multilingual Moment Retrieval in Videos;
YouCook2-Retrieval: Towards Automatic Learning of Procedures from Web Instructional Videos;
Ranking Domain-specific Highlights by Analyzing Edited Videos;
TVSum: Summarizing Web Videos Using Titles;
Query-Focused Video Summarization: Dataset, Evaluation, and a Memory Network Based Approach;
CLIP-It! Language-Guided Video Summarization;
TubeDETR: Spatio-Temporal Video Grounding with Transformers;
Hierarchical Video-Moment Retrieval and Step-Captioning;
VidChapters-7M: Video Chapters at Scale;
TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency;
Query-Dependent Video Representation for Moment Retrieval and Highlight Detection;
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding;
Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection;
TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection;
MomentDiff: Generative Video Moment Retrieval from Random to Real;
VTimeLLM: Empower LLM to Grasp Video Moments;
LITA: Language Instructed Temporal-Localization Assistant;
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning;
| Rishab, Joy, Mohit |
| Feb 11 | Vision+Language Commonsense, World Knowledge, Reasoning; Project Brainstorming+Feedback |
(1) VisualCOMET: Reasoning about the Dynamic Context of a Still Image (ECCV 2020);
(2) MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (CVPR 2024);
(3) VLEP: What is More Likely to Happen Next? Video-and-Language Future Event Prediction (EMNLP 2020);
(4a) Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models (NeurIPS 2024); (4b) OpenAI: Thinking with images;
Additional readings:
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge;
VCR: From Recognition to Cognition: Visual Commonsense Reasoning;
Visual Persuasion: Inferring Communicative Intents of Images;
Don't Just Listen, Use Your Imagination: Leveraging Visual Common Sense for Non-Visual Tasks;
Learning Common Sense Through Visual Abstraction;
Anticipating Visual Representations from Unlabeled Video;
Predicting Motivations of Actions by Leveraging Text;
"What Happens If..." Learning to Predict the Effect of Forces in Images;
Grounding Visual Explanations;
VIOLIN: A Large-Scale Dataset for Video-and-Language Inference;
PIQA: Reasoning about Physical Commonsense in Natural Language;
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge;
Visual Abductive Reasoning;
Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning;
Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding;
EgoTaskQA: Understanding Human Tasks in Egocentric Videos;
CoSIm: Commonsense Reasoning for Counterfactual Scene Imagination;
Ego4D: Around the World in 3,000 Hours of Egocentric Video;
Visual Programming: Compositional visual reasoning without training;
ViperGPT: Visual Inference via Python Execution for Reasoning;
Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning;
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning;
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding;
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos;
Visual-RFT: Visual Reinforcement Fine-Tuning;
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models;
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning;
| Qi, Om, Mohit |
| Feb 18 | Vision+Language Pretrained Models (V+L, V-->L, L-->V, Unified, Any-to-Any) |
(1) LXMERT: Learning Cross-Modality Encoder Representations from Transformers / ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks;
(2) Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision / VirTex: Learning Visual Representations from Textual Annotations;
(3) VL-T5: Unifying Vision-and-Language Tasks via Text Generation / Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks;
(4) Any-to-Any Generation via Composable Diffusion / CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation;
Additional readings:
Several other VLP models (e.g., UNITER, VL-BERT, VisualBERT, etc.) mentioned in this link;
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks;
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks;
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision;
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph;
VideoBERT: A Joint Model for Video and Language Representation Learning;
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training;
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation;
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling;
MERLOT: Multimodal Neural Script Knowledge Models;
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation;
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling;
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework;
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision;
CoCa: Contrastive Captioners are Image-Text Foundation Models;
TVLT: Textless Vision-Language Transformer;
BEiT: BERT Pre-Training of Image Transformers;
Flamingo: a Visual Language Model for Few-Shot Learning;
PICa: An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA;
VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners;
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models;
Contrastive Learning of Medical Visual Representations from Paired Images and Text;
VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer;
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks;
Idefics3-8B: Building and better understanding vision-language models;
PaliGemma 2: A Family of Versatile VLMs for Transfer;
InternVL 2.5: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling;
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding;
Qwen2.5-VL: Enhanced Vision-Language Capabilities in the Qwen Series;
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks;
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation;
LLaVA: Visual Instruction Tuning;
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation;
OneLLM: One Framework to Align All Modalities with Language;
Chameleon: Mixed-Modal Early-Fusion Foundation Models;
Qwen3-VL Technical Report;
BAGEL: The Open-Source Unified Multimodal Model;
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities;
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models;
| Amogh, Yidong, Mohit |
| Feb 25 | Image/Story/Video Generation; 3D Image/Video & Game Generation + World Models | -- | TBD |
| Mar 4 | DocAI, Efficient VL | -- | TBD |
| Mar 11 | MIDTERM PROJECT PRESENTATIONS | -- | TBD |
| Mar 18 | SPRING BREAK | -- | TBD |
| Mar 25 | Language instructions for robotic navigation, articulation, manipulation, assembly, skill learning, etc. | -- | TBD |
| Apr 1 | Embodied Pretraining Models & Vision-Language-Action (VLA) Benchmarks/Models | -- | TBD |
| Apr 8 | Human-robot collaboration and dialogue: instruction generation, embodied Q&A, learning new subactions, etc. | -- | TBD |
| Apr 15 | Grounded language learning/emergence via multi-agent dialogue-based and interactive games; Non-Verbal Human-Robot Interaction/Communication: Gestures, Turn-taking, Gaze | -- | TBD |
| Apr 22 | Multimodal Biases and Shortcuts; How to Write and Review Research Papers | -- | TBD |
| Apr 29 | FINAL PROJECT PRESENTATIONS | -- | TBD |