Schedule (tentative)
| Date | Topic | Readings | Discussion Leaders |
| --- | --- | --- | --- |
| Jan 7 | Intro to the Course | -- | Mohit |
| Jan 14 | Image-Text Alignment / Matching / Retrieval & Referring Expression Comprehension |
(1) CLIP: Learning Transferable Visual Models From Natural Language Supervision (Radford et al., ICML 2021);
(2) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (Li et al., ICML 2022);
(3) MAttNet: Modular Attention Network for Referring Expression Comprehension (Yu et al., CVPR 2018);
(4) Referring Transformer: A One-step Approach to Multi-task Visual Grounding (Li and Sigal, NeurIPS 2021);
Additional readings:
Matching Words and Pictures (Barnard et al., JMLR 2003);
Deep Visual-Semantic Alignments for Generating Image Descriptions (Karpathy and Fei-Fei, CVPR 2015);
VSE++: Improving Visual-Semantic Embeddings with Hard Negatives (Faghri et al., BMVC 2018);
Stacked Cross Attention for Image-Text Matching (Lee et al., ECCV 2018);
A Joint Speaker-Listener-Reinforcer Model for Referring Expressions (Yu et al., CVPR 2017);
What are you talking about? Text-to-Image Coreference (Kong et al., CVPR 2014);
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation;
Segment Anything;
| Yidong, Atharv, Mohit |
| Jan 21 | Referring Expression Generation & Image/Video Captioning |
(1) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions (Yu et al., CVPR 2017);
(2) Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning (Sharma et al., ACL 2018);
(3) Neural Baby Talk (Lu et al., CVPR 2018);
(4) SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning (Lin et al., CVPR 2022);
Additional readings:
Deep Visual-Semantic Alignments for Generating Image Descriptions;
Explain Images with Multimodal Recurrent Neural Networks;
From captions to visual concepts and back;
Learning a Recurrent Visual Representation for Image Caption Generation;
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention;
Image Captioning with Semantic Attention;
Self-critical Sequence Training for Image Captioning;
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering;
DenseCap: Fully Convolutional Localization Networks for Dense Captioning;
Sequence to Sequence -- Video to Text;
Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks;
Reinforced Video Captioning with Entailment Rewards;
MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning;
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research;
ClipCap: CLIP Prefix for Image Captioning;
Fine-grained Image Captioning with CLIP Reward;
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning;
| Rohan, Amogh, Mohit |
| Jan 28 | Image/Video Question Answering & Dialogue |
(1) VQA v2: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering (Goyal et al., CVPR 2017);
(2) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering (Anderson et al., CVPR 2018);
(3) TVQA: Localized, Compositional Video Question Answering (Lei et al., EMNLP 2018);
(4) VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos (Wang et al., CVPR 2025);
Additional readings:
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering;
Visual Dialog (Das et al., CVPR 2017);
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding;
Bilinear Attention Networks;
Deep Modular Co-Attention Networks for Visual Question Answering;
MovieQA: Understanding Stories in Movies through Question-Answering;
A Joint Sequence Fusion Model for Video Question Answering and Retrieval;
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning;
Modality-Balanced Models for Visual Dialogue;
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA;
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions;
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling;
Self-Chained Image-Language Model for Video Localization and Question Answering;
A Simple LLM Framework for Long-Range Video Question-Answering;
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding;
| Adam, Alex, Mohit |
| Feb 4 | Query-based Video Moment Retrieval & Summary / Highlight / Saliency Prediction |
(1) TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (Lei et al., ECCV 2020);
(2) UniVTG: Towards Unified Video-Language Temporal Grounding (Lin et al., ICCV 2023);
(3) QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries (Lei et al., NeurIPS 2021);
(4) UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection (Liu et al., CVPR 2022);
Additional readings:
Localizing Moments in Video with Natural Language;
TALL: Temporal Activity Localization via Language Query;
Temporal Localization of Moments in Video Collections with Natural Language;
Multi-task deep visual-semantic embedding for video thumbnail selection;
mTVR: Multilingual Moment Retrieval in Videos;
YouCook2-Retrieval: Towards Automatic Learning of Procedures from Web Instructional Videos;
Ranking Domain-specific Highlights by Analyzing Edited Videos;
TVSum: Summarizing Web Videos Using Titles;
Query-Focused Video Summarization: Dataset, Evaluation, and a Memory Network Based Approach;
CLIP-It! Language-Guided Video Summarization;
TubeDETR: Spatio-Temporal Video Grounding with Transformers;
Hierarchical Video-Moment Retrieval and Step-Captioning;
VidChapters-7M: Video Chapters at Scale;
TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency;
Query-Dependent Video Representation for Moment Retrieval and Highlight Detection;
TimeChat: A Time-Sensitive Multimodal Large Language Model for Long Video Understanding;
Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection;
TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection;
MomentDiff: Generative Video Moment Retrieval from Random to Real;
VTimeLLM: Empower LLM to Grasp Video Moments;
LITA: Language Instructed Temporal-Localization Assistant;
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning;
| Rishabhdev, Joy, Mohit |
| Feb 11 | Vision+Language Commonsense, World Knowledge, Reasoning; Project Brainstorming+Feedback |
(1) VisualCOMET: Reasoning about the Dynamic Context of a Still Image (ECCV 2020);
(2) MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (CVPR 2024);
(3) VLEP: What is More Likely to Happen Next? Video-and-Language Future Event Prediction (EMNLP 2020);
(4a) Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models (NeurIPS 2024); (4b) OpenAI: Thinking with images;
Additional readings:
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge;
VCR: From Recognition to Cognition: Visual Commonsense Reasoning;
Visual Persuasion: Inferring Communicative Intents of Images;
Don't Just Listen, Use Your Imagination: Leveraging Visual Common Sense for Non-Visual Tasks;
Learning Common Sense Through Visual Abstraction;
Anticipating Visual Representations from Unlabeled Video;
Predicting Motivations of Actions by Leveraging Text;
"What Happens If..." Learning to Predict the Effect of Forces in Images;
Grounding Visual Explanations;
VIOLIN: A Large-Scale Dataset for Video-and-Language Inference;
PIQA: Reasoning about Physical Commonsense in Natural Language;
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge;
Visual Abductive Reasoning;
Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning;
Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding;
EgoTaskQA: Understanding Human Tasks in Egocentric Videos;
CoSIm: Commonsense Reasoning for Counterfactual Scene Imagination;
Ego4D: Around the World in 3,000 Hours of Egocentric Video;
Visual Programming: Compositional visual reasoning without training;
ViperGPT: Visual Inference via Python Execution for Reasoning;
Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning;
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning;
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding;
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos;
Visual-RFT: Visual Reinforcement Fine-Tuning;
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models;
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning;
| Qi, Om, Mohit |
| Feb 18 | Vision+Language Pretrained Models (V+L, V-->L, L-->V, Unified, Any-to-Any, DocAI, Efficient VL) |
(1) LXMERT: Learning Cross-Modality Encoder Representations from Transformers; / ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks;
(2) Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision; / VirTex: Learning Visual Representations from Textual Annotations;
(3) VL-T5: Unifying Vision-and-Language Tasks via Text Generation; / Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks;
(4) Any-to-Any Generation via Composable Diffusion; / CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation;
Additional readings:
Several other VLP models (e.g., UNITER, VL-BERT, VisualBERT, etc.);
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks;
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks;
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision;
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph;
VideoBERT: A Joint Model for Video and Language Representation Learning;
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training;
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation;
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling;
MERLOT: Multimodal Neural Script Knowledge Models;
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation;
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling;
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework;
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision;
CoCa: Contrastive Captioners are Image-Text Foundation Models;
TVLT: Textless Vision-Language Transformer;
BEiT: BERT Pre-Training of Image Transformers;
Flamingo: a Visual Language Model for Few-Shot Learning;
PICa: An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA;
VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners;
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models;
Contrastive Learning of Medical Visual Representations from Paired Images and Text;
VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer;
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks;
Idefics3-8B: Building and better understanding vision-language models;
PaliGemma 2: A Family of Versatile VLMs for Transfer;
InternVL 2.5: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling;
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding;
Qwen2.5-VL: Enhanced Vision-Language Capabilities in the Qwen Series;
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks;
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation;
LLaVA: Visual Instruction Tuning;
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation;
OneLLM: One Framework to Align All Modalities with Language;
Chameleon: Mixed-Modal Early-Fusion Foundation Models;
Qwen3-VL Technical Report;
BAGEL: The Open-Source Unified Multimodal Model;
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities;
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models;
LayoutLM: Pre-training of Text and Layout for Document Image Understanding;
DocVQA: A Dataset for VQA on Document Images;
Multimodal Few-Shot Learning with Frozen Language Models;
OCR-free Document Understanding Transformer;
VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks;
LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning;
Unifying Vision, Text, and Layout for Universal Document Processing;
Document Understanding Dataset and Evaluation (DUDE);
DocLLM: A layout-aware generative language model for multimodal document understanding;
ColPali: Efficient Document Retrieval with Vision Language Models;
MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations;
M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding;
| Amogh, Yidong, Mohit |
| Feb 25 | Image/Story/Video Generation; 3D Image/Video & Game Generation + World Models |
(1) DiT: Scalable Diffusion Models with Transformers;
(2) Video Diffusion Models;
(3) DreamFusion: Text-to-3D using 2D Diffusion;
(4) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning;
(5) Wan: Open and Advanced Large-Scale Video Generative Models;
(6) Genie3: Experimenting with infinite, interactive worlds;
Additional readings:
Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models;
DALL-E: Zero-Shot Text-to-Image Generation;
Generative Adversarial Text to Image Synthesis;
StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks;
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks;
Video Generation From Text;
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers;
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion;
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation;
Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors;
Muse: Text-To-Image Generation via Masked Generative Transformers;
CogView: Mastering Text-to-Image Generation via Transformers;
DALL-E 2: Hierarchical Text-Conditional Image Generation with CLIP Latents;
Parti: Scaling Autoregressive Models for Content-Rich Text-to-Image Generation;
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion;
Imagen Video: High Definition Video Generation with Diffusion Models;
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers;
Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding;
StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation;
Make-A-Video: Text-to-Video Generation without Text-Video Data;
ControlNet: Adding Conditional Control to Text-to-Image Diffusion Models;
Scaling up GANs for Text-to-Image Synthesis;
VPGen: Visual Programming for Text-to-Image Generation and Evaluation;
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models;
VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning;
Unbounded: A Generative Infinite Game of Character Life Simulation;
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models;
GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment;
T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation;
Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation;
TIFA: Text-to-Image Faithfulness Evaluation with Question Answering;
GenAI Arena: An Open Evaluation Platform for Generative Models;
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models;
Cosmos: World Foundation Model Platform for Physical AI;
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction;
Zero-1-to-3: Zero-shot One Image to 3D Object;
VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation;
Metamorph: Multimodal Understanding and Generation via Instruction Tuning;
Diffusion Models Are Real-Time Game Engines;
Navigation World Models;
RAE: Diffusion Transformers with Representation Autoencoders;
Back to Basics: Let Denoising Generative Models Denoise;
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion;
| Joykirat, Rishabhdev, Mohit |
| Mar 4 | Project Discussion/Feedback | -- | Mohit |
| Mar 11 | MIDTERM PROJECT PRESENTATIONS | -- | TBD |
| Mar 18 | SPRING BREAK | -- | TBD |
| Mar 25 | Language instructions for robotic navigation, articulation, manipulation, assembly, skill learning, etc. | -- | TBD |
| Apr 1 | Embodied Pretraining Models & Vision-Language-Action (VLA) Benchmarks/Models | -- | TBD |
| Apr 8 | Human-robot collaboration and dialogue: instruction generation, embodied Q&A, learning new subactions, etc. | -- | TBD |
| Apr 15 | Grounded language learning/emergence via multi-agent dialogue-based and interactive games; Non-Verbal Human-Robot Interaction/Communication: Gestures, Turn-taking, Gaze | -- | TBD |
| Apr 22 | Multimodal Biases and Shortcuts; How to Write and Review Research Papers | -- | TBD |
| Apr 29 | FINAL PROJECT PRESENTATIONS | -- | TBD |