Schedule (tentative)
Date | Topic | Readings | Discussion Leaders |
Jan 11 | Intro to the Course | -- | Mohit |
Jan 18 | Image-Text Alignment / Matching / Retrieval & Referring Expression Comprehension |
(1) CLIP: Learning Transferable Visual Models From Natural Language Supervision (Radford et al., ICML 2021);
(2) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (Li et al., ICML 2022);
(3) MAttNet: Modular Attention Network for Referring Expression Comprehension (Yu et al., CVPR 2018);
(4) Referring Transformer: A One-step Approach to Multi-task Visual Grounding (Li and Sigal, NeurIPS 2021);
Additional readings:
(a) Matching Words and Pictures (Barnard et al., JMLR 2003);
(b) Deep Visual-Semantic Alignments for Generating Image Descriptions (Karpathy and Fei-Fei, CVPR 2015);
(c) VSE++: Improving Visual-Semantic Embeddings with Hard Negatives (Faghri et al., BMVC 2018);
(d) Stacked Cross Attention for Image-Text Matching (Lee et al., ECCV 2018);
(e) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions (Yu et al., CVPR 2017);
(f) What are you talking about? Text-to-Image Coreference (Kong et al., CVPR 2014);
(g) Align before Fuse: Vision and Language Representation Learning with Momentum Distillation (Li et al., NeurIPS 2021);
| Shoubin, Ziyang, Mohit |
Jan 25 | Referring Expression Generation & Image/Video Captioning |
(1) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions (Yu et al., CVPR 2017);
(2) Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning (Sharma et al., ACL 2018);
(3) Neural Baby Talk (Lu et al., CVPR 2018);
(4) SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning (Lin et al., CVPR 2022);
Additional readings:
(a) Deep Visual-Semantic Alignments for Generating Image Descriptions;
(b) Explain Images with Multimodal Recurrent Neural Networks;
(c) From captions to visual concepts and back;
(d) Learning a Recurrent Visual Representation for Image Caption Generation;
(e) Show, Attend and Tell: Neural Image Caption Generation with Visual Attention;
(f) Image Captioning with Semantic Attention;
(g) Self-critical Sequence Training for Image Captioning;
(h) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering;
(i) DenseCap: Fully Convolutional Localization Networks for Dense Captioning;
(j) Sequence to Sequence -- Video to Text;
(k) Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks;
(l) Reinforced Video Captioning with Entailment Rewards;
(m) MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning;
(n) VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research;
(o) ClipCap: CLIP Prefix for Image Captioning;
(p) Fine-grained Image Captioning with CLIP Reward;
| Abhay, Archiki, Mohit |
Feb 01 | Image/Video Question Answering & Dialogue |
(1a) VQA v2: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering (Goyal et al., CVPR 2017);
(1b) GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering (Hudson and Manning, CVPR 2019);
(2) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering (Anderson et al., CVPR 2018);
(3) TVQA: Localized, Compositional Video Question Answering (Lei et al., EMNLP 2018);
(4) Visual Dialog (Das et al., CVPR 2017);
Additional readings:
(a) Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding;
(b) Bilinear Attention Networks;
(c) Deep Modular Co-Attention Networks for Visual Question Answering;
(d) MovieQA: Understanding Stories in Movies through Question-Answering;
(e) A Joint Sequence Fusion Model for Video Question Answering and Retrieval;
(f) Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning;
(g) Modality-Balanced Models for Visual Dialogue;
(h) An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA;
(i) NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions;
(j) Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling;
| Vaidehi, Qin, Mohit |
Feb 08 | Query-based Video Moment Retrieval & Summary / Highlight / Saliency Prediction |
(1) Localizing Moments in Video with Natural Language (Hendricks et al., ICCV 2017);
(2) TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (Lei et al., ECCV 2020);
(3) QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries (Lei et al., NeurIPS 2021);
(4) TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency (Narasimhan et al., ECCV 2022);
Additional readings:
(a) TALL: Temporal Activity Localization via Language Query;
(b) Temporal Localization of Moments in Video Collections with Natural Language;
(c) Multi-task deep visual-semantic embedding for video thumbnail selection;
(d) mTVR: Multilingual Moment Retrieval in Videos;
(e) YouCook2-Retrieval: Towards Automatic Learning of Procedures from Web Instructional Videos;
(f) Ranking Domain-specific Highlights by Analyzing Edited Videos;
(g) TVSum: Summarizing Web Videos Using Titles;
(h) Query-Focused Video Summarization: Dataset, Evaluation, and a Memory Network Based Approach;
(i) CLIP-It! Language-Guided Video Summarization;
(j) TubeDETR: Spatio-Temporal Video Grounding with Transformers;
(k) UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection;
| Mason, Yiyuan, Mohit |
Feb 15 | Vision+Language Commonsense, World Knowledge, Reasoning; Project Brainstorming+Feedback |
(1) OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge (CVPR 2019);
(2) VCR: From Recognition to Cognition: Visual Commonsense Reasoning (CVPR 2019);
(3) VisualCOMET: Reasoning about the Dynamic Context of a Still Image (ECCV 2020);
(4) VLEP: What is More Likely to Happen Next? Video-and-Language Future Event Prediction (EMNLP 2020);
Additional readings:
(a) Visual Persuasion: Inferring Communicative Intents of Images;
(b) Don't Just Listen, Use Your Imagination: Leveraging Visual Common Sense for Non-Visual Tasks;
(c) Learning Common Sense Through Visual Abstraction;
(d) Anticipating Visual Representations from Unlabeled Video;
(e) Predicting Motivations of Actions by Leveraging Text;
(f) "What Happens If..." Learning to Predict the Effect of Forces in Images;
(g) Grounding Visual Explanations;
(h) VIOLIN: A Large-Scale Dataset for Video-and-Language Inference;
(i) PIQA: Reasoning about Physical Commonsense in Natural Language;
(j) A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge;
(k) Visual Abductive Reasoning;
(l) Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning;
(m) Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding;
(n) EgoTaskQA: Understanding Human Tasks in Egocentric Videos;
(o) CoSIm: Commonsense Reasoning for Counterfactual Scene Imagination;
| Brandon, Maggie, Mohit |
Feb 22 | Text to Image & Image Sequence/Story/Video Generation |
(1) DALL-E: Zero-Shot Text-to-Image Generation;
(2) DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Models;
(3) Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding;
(4) Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models;
(5) StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation;
(6) Make-A-Video: Text-to-Video Generation without Text-Video Data;
Additional readings:
(a) Generative Adversarial Text to Image Synthesis;
(b) StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks;
(c) AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks;
(d) Video Generation From Text;
(e) X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers;
(f) An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion;
(g) DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation;
(h) Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors;
(i) Muse: Text-To-Image Generation via Masked Generative Transformers;
(j) CogView: Mastering Text-to-Image Generation via Transformers;
(k) DALLE2: Hierarchical Text-Conditional Image Generation with CLIP Latents;
(l) Parti: Scaling Autoregressive Models for Content-Rich Text-to-Image Generation;
(m) NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion;
(n) Imagen Video: High Definition Video Generation with Diffusion Models;
(o) CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers;
| Pierre, Long, Mohit |
Mar 01 | Vision+Language Pretraining Models (V+L, Unified V+L, V-->L, L-->V, Efficient VL) |
(1) LXMERT: Learning Cross-Modality Encoder Representations from Transformers; / ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks;
(2) VL-T5: Unifying Vision-and-Language Tasks via Text Generation; / Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks;
(3) Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision; / VirTex: Learning Visual Representations from Textual Annotations;
(4) Multimodal Few-Shot Learning with Frozen Language Models; / VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks;
Additional readings:
(a) Several other VLP models (e.g., UNITER, VL-BERT, VisualBERT, etc.) mentioned in this link;
(b) Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks;
(c) ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision;
(d) ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph;
(e) VideoBERT: A Joint Model for Video and Language Representation Learning;
(f) HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training;
(g) UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation;
(h) Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling;
(i) MERLOT: Multimodal Neural Script Knowledge Models;
(j) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation;
(k) UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling;
(l) OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework;
(m) SimVLM: Simple Visual Language Model Pretraining with Weak Supervision;
(n) CoCa: Contrastive Captioners are Image-Text Foundation Models;
(o) TVLT: Textless Vision-Language Transformer;
(p) BEiT: BERT Pre-Training of Image Transformers;
(q) Flamingo: a Visual Language Model for Few-Shot Learning;
(r) LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning;
(s) PICa: An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA;
(t) VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners;
(u) BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models;
(v) Contrastive Learning of Medical Visual Representations from Paired Images and Text;
(w) VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer;
| Shoubin, Qin, Mohit |
Mar 08 | Midterm Project Presentations | | |
Mar 15 | Spring break holiday | | |
Mar 22 | Language instructions for robotic navigation, articulation, manipulation, assembly, skill learning, etc. |
(1a) DUET: Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation; /
(1b) EnvEdit: Environment Editing for Vision-and-Language Navigation;
(2) PaLM-E: An Embodied Multimodal Language Model;
(3) Room-to-Room (R2R): Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments;
(4) ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks;
Additional readings:
(a) Learning to Interpret Natural Language Navigation Instructions from Observations (AAAI 2011);
(b) Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences (AAAI 2016);
(c) Weakly Supervised Learning of Semantic Parsers for Mapping Instructions to Actions;
(d) Learning to Follow Navigational Directions; Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation;
(e) A Natural Language Planner Interface for Mobile Manipulators;
(f) Natural Language Communication with Robots;
(g) Interpreting and Executing Recipes with a Cooking Robot;
(h) Tell Me Dave: Context-Sensitive Grounding of Natural Language to Mobile Manipulation Instructions;
(i) Robobarista: Object Part based Transfer of Manipulation Trajectories from Crowd-sourcing in 3D Pointclouds;
(j) Touchdown: Natural Language Navigation and Spatial Reasoning in Visual Street Environments;
(k) ArraMon: A Joint Navigation-Assembly Instruction Interpretation Task in Dynamic Environments;
(l) Speaker-Follower Models for Vision-and-Language Navigation;
(m) Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation;
(n) Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout;
(o) Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding;
(p) A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning;
(q) HAMT: History Aware Multimodal Transformer for Vision-and-Language Navigation;
(r) HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation;
(s) MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge;
(t) Simple but Effective: CLIP Embeddings for Embodied AI;
(u) CLIPort: What and Where Pathways for Robotic Manipulation;
| Ziyang, Mason, Mohit |
Mar 29 | Human-robot collaboration and dialogue: instruction generation, embodied Q&A, learning new subactions, etc. |
(1) Back to the Blocks World: Learning New Actions through Situated Human-Robot Dialogue;
(2a) Embodied Question Answering; /
(2b) IQA: Visual Question Answering in Interactive Environments ;
(3a) CVDN: Vision-and-Dialog Navigation; /
(3b) TEACh: Task-driven Embodied Agents that Chat;
(4) CerealBar: Executing Instructions in Situated Collaborative Interactions;
Additional readings:
(a) Learning to Interpret Natural Language Commands through Human-Robot Dialog;
(b) Clarifying Commands with Information-Theoretic Human-Robot Dialog;
(c) Asking for Help Using Inverse Semantics;
(d) Navigational Instruction Generation as Inverse Reinforcement Learning with Neural Machine Translation;
(e) PLOW: A Collaborative Task Learning Agent;
(f) REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments;
(g) Multi-Target Embodied Question Answering;
(h) AutoVLN: Learning from Unlabeled 3D Environments for Vision-and-Language Navigation;
| Abhay, Vaidehi, Mohit |
Apr 05 | Grounded language learning/emergence via multi-agent dialogue-based and interactive games |
(1) Grounded Language Learning in a Simulated 3D World;
(2) Emergence of Linguistic Communication from Referential Games with Symbolic and Pixel Input;
(3) Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text;
(4) Do As I Can, Not As I Say: Grounding Language in Robotic Affordances;
Additional readings:
(a) Learning Language Games through Interaction;
(b) Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols;
(c) Emergence of Grounded Compositional Language in Multi-Agent Populations;
(d) Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog;
(e) Collaborative Models for Referring Expression Generation in Situated Dialogue;
(f) Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning;
(g) Multi-Agent Cooperation and the Emergence of (Natural) Language;
(h) Learning to Play Guess Who? and Inventing a Grounded Language as a Consequence;
(i) Learning to Speak and Act in a Fantasy Text Adventure Game;
(j) Emergent Multi-Agent Communication in the Deep Learning Era;
(k) Emergence of Linguistic Communication from Referential Games with Symbolic and Pixel Input;
(l) Emergent Communication through Negotiation;
(m) BabyAI: A Platform to Study the Sample Efficiency of Grounded Language Learning;
(n) Learning when to Communicate at Scale in Multiagent Cooperative and Competitive Tasks;
(o) On the Pitfalls of Measuring Emergent Communication;
(p) Two Body Problem: Collaborative Visual Task Completion;
(q) Emergent Communication at Scale;
| Long, Pierre, Mohit |
Apr 12 | Non-Verbal Human-Robot Interaction/Communication: Gestures, Turn-taking, Gaze |
(1) "Generation of Nodding, Head Tilting and Eye Gazing for Human-Robot Dialogue Interaction";
(2) " Conversational Gaze Aversion for Humanlike Robots (HRI, 2014)";
(3) Understanding Human Gaze Communication by Spatio-Temporal Graph Reasoning;
(4) " Simon plays Simon says: The timing of turn-taking in an imitation game";
Additional readings:
(a) " Effects of Responding to, Initiating and Ensuring Joint Attention in Human-Robot Interaction";
(b) See You See Me: The Role of Eye Contact in Multimodal Human-Robot Interaction;
(c) Social eye gaze in human-robot interaction: a review;
(d) Embodiment in Socially Interactive Robots;
(e) Understanding Teacher Gaze Patterns for Robot Learning;
(f) Human Gaze Following for Human-Robot Interaction;
(g) A review of eye gaze in virtual agents, social robotics and HCI: Behaviour generation, user interaction and perception;
(h) Gaze and Attention Management for Embodied Conversational Agents;
(i) Learning About Objects by Learning to Interact with Them;
(j) Pushing it out of the Way: Interactive Visual Navigation;
| Maggie, Brandon, Mohit |
Apr 19 | Multimodal Biases and Shortcuts; How to Write and Review Research Papers |
(1a) RUBi: Reducing Unimodal Biases in Visual Question Answering; /
(1b) Debiased Visual Question Answering from Feature and Sample Perspectives;
(2) Revealing Single Frame Bias for Video-and-Language Learning;
(3) Multimodal datasets: misogyny, pornography, and malignant stereotypes;
(4a) Measuring Social Biases in Grounded Vision and Language Embeddings; /
(4b) Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations;
Additional readings:
(a) Shifting the Baseline: Single Modality Performance on Visual Navigation & QA;
(b) Modality-Balanced Models for Visual Dialogue;
(c) Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions;
(d) MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering;
(e) REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets;
(f) VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives;
(g) Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models;
(h) Diagnosing the Environment Bias in Vision-and-Language Navigation;
(i) Deconfounded Visual Grounding;
(j) Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models;
(k) Are Gender-Neutral Queries Really Gender-Neutral? Mitigating Gender Bias in Image Search;
(l) Language (Technology) is Power: A Critical Survey of "Bias" in NLP;
(m) Advances, challenges and opportunities in creating data for trustworthy AI;
(n) Data and its (dis)contents: A survey of dataset development and use in machine learning research;
(o) ACL Wiki: Ethics in NLP Resources/Courses;
| Archiki, Yiyuan, Mohit |
Apr 26 | Last Class: Final Project Presentations (tentative) | | |