COMP 790/590: Connecting Language to Vision and Robotics (Fall 2021)

Instructor: Mohit Bansal
Units: 3
Office: FB 244
Lectures: Wed 10:40am-1:10pm ET, Rm SN-011
Office Hours: Wed 1:10pm-2:10pm ET (by appointment) (remote/zoom option)
Course Webpage:
Course Email: nlpcomp790unc -at-



This course will focus on the connections between natural language processing (NLP) and the fields of computer vision and robotics. It will cover a wide variety of topics in multimodal NLP, such as image/video-based captioning, retrieval, QA, and dialogue; vision+language commonsense; query-based video summarization; robotic navigation and manipulation instruction execution and generation; and multimodal and embodied pretraining models (as well as their ethics/bias/societal applications and issues).

Topics (tentative)


Students not meeting these requirements must receive the explicit permission of the instructor to remain in this course.

Grading (tentative)

Grading will (tentatively) consist of:

All submissions should be emailed to:

Reference Books

For NLP concepts refresher, see:

For multimodal NLP concepts, see corresponding lectures of my previous classes and proceedings+talks of Multimodal Workshops at NLP, Vision, and Robotics conferences (e.g., RoboNLP, VQA Workshop, ALVR, LANTERN, etc.)

Schedule (tentative; slides coming soon)

Date | Topic | Readings | Discussion Leaders | To-do's
Aug 18 | Intro to the Course | -- | Mohit | --
Aug 25 | Image-Text Alignment / Matching / Retrieval & Referring Expression Comprehension
(1) Deep Visual-Semantic Alignments for Generating Image Descriptions (Karpathy and Fei-Fei, CVPR 2015);
(2) CLIP: Learning Transferable Visual Models From Natural Language Supervision (Radford et al., ICML 2021);
(3) MAttNet: Modular Attention Network for Referring Expression Comprehension (Yu et al., CVPR 2018);
Additional readings:
(a) Matching Words and Pictures (Barnard et al., JMLR 2003);
(b) VSE++: Improving Visual-Semantic Embeddings with Hard Negatives (Faghri et al., BMVC 2018);
(c) Stacked Cross Attention for Image-Text Matching (Lee et al., ECCV 2018);
(d) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions;
(e) What are you talking about? Text-to-Image Coreference (Kong et al., CVPR 2014);
Discussion leaders: Jaemin, Mohit | To-do: Instructions emailed for presenter (and everyone else as a reviewer)
Sep 01 | Referring Expression Generation & Image/Video Captioning
(1) Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (Xu et al., ICML 2015);
(2) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions (Yu et al., CVPR 2017);
(3) Sequence to Sequence -- Video to Text (Venugopalan et al., ICCV 2015);
(4) Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks (Yu et al., CVPR 2016);
Additional readings:
(a) Deep Visual-Semantic Alignments for Generating Image Descriptions;
(b) Explain Images with Multimodal Recurrent Neural Networks;
(c) From captions to visual concepts and back;
(d) Learning a Recurrent Visual Representation for Image Caption Generation;
(e) Image Captioning with Semantic Attention;
(f) Self-critical Sequence Training for Image Captioning;
(g) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering;
(h) DenseCap: Fully Convolutional Localization Networks for Dense Captioning;
(i) Reinforced Video Captioning with Entailment Rewards;
(j) MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning;
Discussion leaders: Rakesh, Muqeeth, Mohit | To-do: Instructions emailed for presenter (and everyone else as a reviewer)
Sep 08 | Image/Video Question Answering & Dialogue
(1) VQA v2: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering (Goyal et al., CVPR 2017);
(2) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering (Anderson et al., CVPR 2018);
(3) TVQA: Localized, Compositional Video Question Answering (Lei et al., EMNLP 2018);
(4) Visual Dialog (Das et al., CVPR 2017);
Additional readings:
(a) Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding;
(b) Bilinear Attention Networks;
(c) Deep Modular Co-Attention Networks for Visual Question Answering;
(d) MovieQA: Understanding Stories in Movies through Question-Answering;
(e) A Joint Sequence Fusion Model for Video Question Answering and Retrieval;
(f) Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning;
Discussion leaders: Mohaiminul, Zhuofan, Mohit | To-do: Instructions emailed for presenter (and everyone else as a reviewer)
Sep 15 | Query-based Video Moment Retrieval & Summary / Highlight / Saliency Prediction
(1) Localizing Moments in Video with Natural Language (Hendricks et al., ICCV 2017);
(2) TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (Lei et al., ECCV 2020);
(3) Query-Focused Video Summarization: Dataset, Evaluation, and a Memory Network Based Approach (Sharghi et al., CVPR 2017);
(4) QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries (Lei et al., arXiv 2021);
Additional readings:
(a) TALL: Temporal Activity Localization via Language Query;
(b) Temporal Localization of Moments in Video Collections with Natural Language;
(c) Multi-task deep visual-semantic embedding for video thumbnail selection;
(d) mTVR: Multilingual Moment Retrieval in Videos;
Discussion leaders: Haikang, David, Mohit | To-do: Instructions emailed for presenter (and everyone else as a reviewer)
Sep 22 | Vision+Language Commonsense, Reasoning; Project Brainstorming+Feedback
(1) Predicting Motivations of Actions by Leveraging Text (CVPR 2016);
(2) VCR: From Recognition to Cognition: Visual Commonsense Reasoning (CVPR 2019);
(3) VisualCOMET: Reasoning about the Dynamic Context of a Still Image (ECCV 2020);
(4) VLEP: What is More Likely to Happen Next? Video-and-Language Future Event Prediction (EMNLP 2020);
Additional readings:
(a) Visual Persuasion: Inferring Communicative Intents of Images;
(b) Don’t Just Listen, Use Your Imagination: Leveraging Visual Common Sense for Non-Visual Tasks;
(c) Learning Common Sense Through Visual Abstraction;
(d) Anticipating Visual Representations from Unlabeled Video;
(e) "What Happens If..." Learning to Predict the Effect of Forces in Images;
(f) Grounding Visual Explanations;
Discussion leaders: Eva, Blaine, Mohit | To-do: Instructions emailed for presenter (and everyone else as a reviewer)
Sep 29 | Text to Image & Image Sequence/Story/Video Generation
(1) Generative Adversarial Text to Image Synthesis (ICML 2016);
(2) DALL-E: Zero-Shot Text-to-Image Generation (ICML 2021);
(3) Video Generation From Text (AAAI 2018);
(4) Duco-StoryGAN: Improving Generation and Evaluation of Visual Stories via Semantic Consistency (NAACL 2021);
Additional readings:
(a) StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks;
(b) AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks;
(c) X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers;
(d) XMC-GAN: Cross-Modal Contrastive Learning for Text-to-Image Generation;
(e) TGAN-C: To Create What You Tell: Generating Videos from Captions;
(f) TFGAN: Conditional GAN with Discriminative Filter Generation for Text-to-Video Synthesis;
(g) Imagine This! Scripts to Compositions to Videos;
Discussion leaders: Derek, Shiv, Mohit | To-do: Instructions emailed for presenter (and everyone else as a reviewer)
Oct 06 | Multi-modal Pretraining Models (V+L, V-->L, L-->V): Strengths and Weaknesses
Oct 13 | Midterm Project Presentations (tentative) | Midterm reports due Oct 20
Oct 20 | Executing NL instructions for navigation, articulation, manipulation, assembly, skill learning, etc.
Oct 27 | Human-robot collaboration and dialogue: instruction generation, learning new subactions, mediating shared perceptual basis, referring expression generation, etc.
Nov 03 | Grounding and language learning via dialogue-based and interactive games
Nov 10 | Gesture, Turn-taking, Gaze in Human-Robot Interaction
Nov 17 | Bias, Ethics, Societal Applications; How to Write and Review Research Papers
Nov 24 | Thanksgiving Holiday
Dec 01 | Last Class: Final Project Presentations (tentative) | Final Project Write-ups Due Dec 08


The professor reserves the right to make changes to the syllabus, including project due dates. These changes will be announced as early as possible.