COMP 790/590: Multimodal AI: Connecting Language to Vision and Robotics (Spring 2025)

Instructor: Mohit Bansal
Units: 3
Lectures: Mondays 11:15am-1:45pm ET, Room SN-115
Office Hours: Mondays, by appointment
Course Webpage: https://www.cs.unc.edu/~mbansal/teaching/nlp-comp790-590-spring25.html
Course Email: nlpcomp790unc -at- gmail.com




Introduction

This course focuses on the connections between natural language processing (NLP) and the closely related fields of computer vision and robotics. It will cover a wide variety of topics in multimodal NLP, such as image/video-based captioning, retrieval, QA, and dialogue; vision+language commonsense; query-based video summarization; text-to-image/video generation; any-to-any and 3D generation; robotic navigation and manipulation instruction execution and generation; and unified multimodal and embodied pretraining models (as well as ethics/bias/societal applications and issues).


Topics



Prerequisites

Students not meeting these requirements must receive the explicit permission of the instructor to remain in this course.


Grading

Grading will (tentatively) consist of:

All submissions should be emailed to: nlpcomp790unc@gmail.com.


Reference Books

For an NLP concepts refresher, see:

For multimodal NLP concepts, see the corresponding lectures of my previous classes and the proceedings and talks of multimodal workshops at NLP, vision, and robotics conferences (e.g., RoboNLP, the VQA Workshop, ALVR, LANTERN, CLVL).


Schedule (tentative)

Date  Topic  Readings  Discussion Leaders
Jan 13 Intro to the Course -- Mohit
Jan 20 MLK Day (no class)
Jan 27 Image-Text Alignment / Matching / Retrieval & Referring Expression Comprehension (1) CLIP: Learning Transferable Visual Models From Natural Language Supervision (Radford et al., ICML 2021);
(2) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (Li et al., ICML 2022);
(3) MAttNet: Modular Attention Network for Referring Expression Comprehension (Yu et al., CVPR 2018);
(4) Referring Transformer: A One-step Approach to Multi-task Visual Grounding (Li and Sigal, NeurIPS 2021);
Additional readings:
(a) Matching Words and Pictures (Barnard et al., JMLR 2003);
(b) Deep Visual-Semantic Alignments for Generating Image Descriptions (Karpathy and Fei-Fei, CVPR 2015);
(c) VSE++: Improving Visual-Semantic Embeddings with Hard Negatives (Faghri et al., BMVC 2018);
(d) Stacked Cross Attention for Image-Text Matching (Lee et al., ECCV 2018);
(e) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions (Yu et al., CVPR 2017);
(f) What are you talking about? Text-to-Image Coreference (Kong et al., CVPR 2014);
(g) Align before Fuse: Vision and Language Representation Learning with Momentum Distillation;
(h) Segment Anything;
Zun, Duy, Mohit
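
For a quick hands-on illustration of this topic, here is a minimal sketch of CLIP-style zero-shot image-text matching using the Hugging Face transformers library (the checkpoint name and image path below are assumptions for illustration, not course-provided materials):

```python
# Minimal CLIP zero-shot image-text matching sketch (illustrative; checkpoint
# name and image path are assumptions).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical local image
captions = ["a dog playing fetch", "a plate of food", "a city skyline at night"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-to-caption similarity scores (higher = better match).
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```
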
Feb 03 Referring Expression Generation & Image/Video Captioning (1) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions (Yu et al., CVPR 2017);
(2) Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning (Sharma et al., ACL 2018);
(3) Neural Baby Talk (Lu et al., CVPR 2018);
(4) SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning (Lin et al., CVPR 2022);
Additional readings:
Deep Visual-Semantic Alignments for Generating Image Descriptions;
Explain Images with Multimodal Recurrent Neural Networks;
From captions to visual concepts and back;
Learning a Recurrent Visual Representation for Image Caption Generation;
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention;
Image Captioning with Semantic Attention;
Self-critical Sequence Training for Image Captioning;
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering;
DenseCap: Fully Convolutional Localization Networks for Dense Captioning;
Sequence to Sequence -- Video to Text;
Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks;
Reinforced Video Captioning with Entailment Rewards;
MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning;
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research;
ClipCap: CLIP Prefix for Image Captioning;
Fine-grained Image Captioning with CLIP Reward;
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning;
Atin, Nithin, Mohit
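
As a small companion to the captioning readings above, here is a minimal sketch of image captioning with a pretrained BLIP checkpoint via Hugging Face transformers (the checkpoint name and image path are illustrative assumptions):

```python
# Minimal BLIP image-captioning sketch (illustrative; checkpoint name and
# image path are assumptions).
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg")  # hypothetical local image
inputs = processor(images=image, return_tensors="pt")

# Decode an autoregressively generated caption conditioned on the image.
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```
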
Feb 10 Wellness day (no class)
Feb 17 Image/Video Question Answering & Dialogue; Query-based Video Moment Retrieval & Summary / Highlight / Saliency Prediction (1) VQA v2: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering (Goyal et al., CVPR 2017);
(2) TVQA: Localized, Compositional Video Question Answering (Lei et al., EMNLP 2018);
(3) TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (Lei et al., ECCV 2020);
(4a) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering (Anderson et al., CVPR 2018);
(4b) VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners (Wang et al., NeurIPS 2022);
Additional readings:
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering;
Visual Dialog;
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding;
Bilinear Attention Networks;
Deep Modular Co-Attention Networks for Visual Question Answering;
MovieQA: Understanding Stories in Movies through Question-Answering;
A Joint Sequence Fusion Model for Video Question Answering and Retrieval;
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning;
Modality-Balanced Models for Visual Dialogue ;
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA;
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions;
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling;
Localizing Moments in Video with Natural Language;
QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries;
TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency;
TALL: Temporal Activity Localization via Language Query;
Temporal Localization of Moments in Video Collections with Natural Language;
Multi-task deep visual-semantic embedding for video thumbnail selection;
mTVR: Multilingual Moment Retrieval in Videos;
YouCook2-Retrieval: Towards Automatic Learning of Procedures from Web Instructional Videos;
Ranking Domain-specific Highlights by Analyzing Edited Videos;
TVSum: Summarizing Web Videos Using Titles;
Query-Focused Video Summarization: Dataset, Evaluation, and a Memory Network Based Approach;
CLIP-It! Language-Guided Video Summarization;
TubeDETR: Spatio-Temporal Video Grounding with Transformers;
UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection;
Self-Chained Image-Language Model for Video Localization and Question Answering ;
Hierarchical Video-Moment Retrieval and Step-Captioning ;
VidChapters-7M: Video Chapters at Scale ;
A Simple LLM Framework for Long-Range Video Question-Answering ;
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos ;
Pranav, Hanqi, Mohit
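
For a quick taste of visual question answering, here is a minimal sketch using the transformers visual-question-answering pipeline with a ViLT checkpoint (the checkpoint name, image path, and question are illustrative assumptions):

```python
# Minimal VQA sketch with the transformers pipeline (illustrative; checkpoint
# name, image path, and question are assumptions).
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
answers = vqa(image="example.jpg", question="What color is the car?", top_k=3)

# Each entry contains a candidate answer and its confidence score.
for a in answers:
    print(f'{a["score"]:.3f}  {a["answer"]}')
```
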
Feb 24 Vision+Language Commonsense, World Knowledge, Reasoning;
Project Brainstorming+Feedback
(1) OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge (CVPR 2019);
(2) VCR: From Recognition to Cognition: Visual Commonsense Reasoning (CVPR 2019);
(3) VisualCOMET: Reasoning about the Dynamic Context of a Still Image (ECCV 2020);
(4) VLEP: What is More Likely to Happen Next? Video-and-Language Future Event Prediction (EMNLP 2020);
Additional readings:
Visual Persuasion: Inferring Communicative Intents of Images;
Don't Just Listen, Use Your Imagination: Leveraging Visual Common Sense for Non-Visual Tasks;
Learning Common Sense Through Visual Abstraction;
Anticipating Visual Representations from Unlabeled Video;
Predicting Motivations of Actions by Leveraging Text;
"What Happens If..." Learning to Predict the Effect of Forces in Images;
Grounding Visual Explanations;
VIOLIN: A Large-Scale Dataset for Video-and-Language Inference;
PIQA: Reasoning about Physical Commonsense in Natural Language;
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge;
Visual Abductive Reasoning;
Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning;
Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding;
EgoTaskQA: Understanding Human Tasks in Egocentric Videos;
CoSIm: Commonsense Reasoning for Counterfactual Scene Imagination;
Ego4D: Around the World in 3,000 Hours of Egocentric Video;
Joseph, Xiangrui, Mohit
Mar 03 Image/Story/Video Generation; 3D Image/Video & Game Generation + World Models (1) DALL-E: Zero-Shot Text-to-Image Generation;
(2) Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models;
(3) Video Diffusion Models;
(4) Genie: Generative Interactive Environments;
(5) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning;
(6) Cosmos: World Foundation Model Platform for Physical AI;
(7) Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction;
Additional readings:
Generative Adversarial Text to Image Synthesis;
StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks;
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks;
Video Generation From Text;
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers;
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion;
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation;
Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors;
Muse: Text-To-Image Generation via Masked Generative Transformers;
CogView: Mastering Text-to-Image Generation via Transformers;
DALLE2: Hierarchical Text-Conditional Image Generation with CLIP Latents;
Parti: Scaling Autoregressive Models for Content-Rich Text-to-Image Generation;
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion;
Imagen Video: High Definition Video Generation with Diffusion Models;
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers;
Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding;
StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation;
Make-A-Video: Text-to-Video Generation without Text-Video Data;
ControlNet: Adding Conditional Control to Text-to-Image Diffusion Models;
Scaling up GANs for Text-to-Image Synthesis;
VPGen: Visual Programming for Text-to-Image Generation and Evaluation;
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models;
VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning;
Unbounded: A Generative Infinite Game of Character Life Simulation;
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models;
GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment;
T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation;
Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation;
TIFA: Text-to-Image Faithfulness Evaluation with Question Answering;
GenAI Arena: An Open Evaluation Platform for Generative Models;
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models;
Scott, Tianyi, Daeun, Mohit
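
As a hands-on complement to the generation readings, here is a minimal sketch of text-to-image generation with a latent diffusion model via the diffusers library (the checkpoint name, prompt, and GPU availability are illustrative assumptions):

```python
# Minimal text-to-image sketch with diffusers (illustrative; checkpoint name,
# prompt, and CUDA availability are assumptions).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # assumes a CUDA-capable GPU

prompt = "an astronaut riding a horse on the moon, photorealistic"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("astronaut.png")
```
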
Mar 10 Spring break (no class)
Mar 17 Midterm Project Presentations
Mar 24 Vision+Language Pretraining Models (V+L, V-->L, L-->V, Unified, DocAI, Any-to-Any, Efficient VL) (1) LXMERT: Learning Cross-Modality Encoder Representations from Transformers;
/ VL-T5: Unifying Vision-and-Language Tasks via Text Generation;
(2) LLaVA: Visual Instruction Tuning;
/ Show-o: One Single Transformer to Unify Multimodal Understanding and Generation;
(3) Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision;
/ VirTex: Learning Visual Representations from Textual Annotations;
(4) Multimodal Few-Shot Learning with Frozen Language Models;
/ VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks;
Additional readings:
Several other VLP models (e.g., UNITER, VL-BERT, VisualBERT, etc.) mentioned in this link;
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks;
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks;
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision;
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph;
VideoBERT: A Joint Model for Video and Language Representation Learning;
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training;
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation ;
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling;
MERLOT: Multimodal Neural Script Knowledge Models;
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation;
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling;
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework;
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision;
CoCa: Contrastive Captioners are Image-Text Foundation Models;
TVLT: Textless Vision-Language Transformer;
BEiT: BERT Pre-Training of Image Transformers;
Flamingo: a Visual Language Model for Few-Shot Learning;
LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning;
PICa: An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA;
VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners;
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models;
Contrastive Learning of Medical Visual Representations from Paired Images and Text;
VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer;
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks;
Idefics3-8B: Building and better understanding vision-language models;
PaliGemma 2: A Family of Versatile VLMs for Transfer ;
InternVL 2.5: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling ;
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding ;
Qwen2.5-VL: Enhanced Vision-Language Capabilities in the Qwen Series ;
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks ;
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation ;
Zaid, Rajeev, Mohit
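
To see a pretrained vision-language model in action, here is a minimal sketch of zero-shot prompting with BLIP-2 via Hugging Face transformers (the checkpoint name, image path, prompt, and GPU availability are illustrative assumptions):

```python
# Minimal BLIP-2 prompting sketch (illustrative; checkpoint name, image path,
# prompt, and CUDA availability are assumptions).
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")  # assumes a CUDA-capable GPU

image = Image.open("example.jpg")  # hypothetical local image
prompt = "Question: what is happening in this image? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```
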
Mar 31 Language instructions for robotic navigation, articulation, manipulation, assembly, skill learning, etc.;
Embodied Pretraining Models & Vision-Language-Action (VLA) Benchmarks/Models
(1a) Room2Room: Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments ;
/ (1b) ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks ;
(2) Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents ;
(3) Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos ;
(4a) π0: A Vision-Language-Action Flow Model for General Robot Control ;
/ (4b) Gemini Robotics: Bringing AI into the Physical World ;
Additional readings:
Learning to Interpret Natural Language Navigation Instructions from Observations (AAAI, 2011) ;
Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences (AAAI, 2016) ;
Weakly Supervised Learning of Semantic Parsers for Mapping Instructions to Actions ;
Learning to Follow Navigational Directions ;
Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation;
A Natural Language Planner Interface for Mobile Manipulators ;
Natural Language Communication with Robots ;
Interpreting and Executing Recipes with a Cooking Robot ;
Tell Me Dave: Context-Sensitive Grounding of Natural Language to Mobile Manipulation Instructions ;
Robobarista: Object Part based Transfer of Manipulation Trajectories from Crowd-sourcing in 3D Pointclouds;
Touchdown: Natural Language Navigation and Spatial Reasoning in Visual Street Environments;
ArraMon: A Joint Navigation-Assembly Instruction Interpretation Task in Dynamic Environments ;
Speaker-Follower Models for Vision-and-Language Navigation ;
Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation ;
Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout ;
Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding ;
A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning ;
HAMT: History Aware Multimodal Transformer for Vision-and-Language Navigation ;
HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation ;
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge ;
Simple but Effective: CLIP Embeddings for Embodied AI ;
CLIPort: What and Where Pathways for Robotic Manipulation ;
DUET: Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation;
EnvEdit: Environment Editing for Vision-and-Language Navigation;
PaLM-E: An Embodied Multimodal Language Model ;
GROOT: Learning to Follow Instructions by Watching Gameplay Videos ;
Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model ;
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought ;
Magma: A Foundation Model for Multimodal AI Agents ;
Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks ;
Chidaksh, Tripp, Mohit
Apr 07 Human-robot collaboration and dialogue: instruction generation, embodied Q&A, learning new subactions, etc. (1) Back to the Blocks World: Learning New Actions through Situated Human-Robot Dialogue ;
(2a) Embodied Question Answering;
(2b) OpenEQA: Embodied Question Answering in the Era of Foundation Models;
(3) IQA: Visual Question Answering in Interactive Environments ;
(4) CVDN: Vision-and-Dialog Navigation;
(5) TEACh: Task-driven Embodied Agents that Chat;
(6) CerealBar: Executing Instructions in Situated Collaborative Interactions;
Additional readings:
Learning to Interpret Natural Language Commands through Human-Robot Dialog ;
Clarifying Commands with Information-Theoretic Human-Robot Dialog ;
Asking for Help Using Inverse Semantics ;
Navigational Instruction Generation as Inverse Reinforcement Learning with Neural Machine Translation ;
PLOW: A Collaborative Task Learning Agent ;
REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments ;
Multi-Target Embodied Question Answering ;
AutoVLN: Learning from Unlabeled 3D Environments for Vision-and-Language Navigation ;
Titus, Nicholas, Griffin, Mohit
Apr 14 Grounded language learning/emergence via multi-agent dialogue-based and interactive games;
Non-Verbal Human-Robot Interaction/Communication: Gestures, Turn-taking, Gaze
(1) Grounded Language Learning in a Simulated 3D World;
(2) Emergence of Linguistic Communication from Referential Games with Symbolic and Pixel Input;
(3) Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text;
(4) Do As I Can, Not As I Say: Grounding Language in Robotic Affordances;
(5) MotionGPT: Human Motion as a Foreign Language;
(6) Including Signed Languages in Natural Language Processing;
Additional readings:
Learning Language Games through Interaction;
Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols ;
Emergence of Grounded Compositional Language in Multi-Agent Populations;
Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog ;
Collaborative Models for Referring Expression Generation in Situated Dialogue ;
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning ;
Multi-Agent Cooperation and the Emergence of (Natural) Language ;
Learning to Play Guess Who? and Inventing a Grounded Language as a Consequence ;
Learning to Speak and Act in a Fantasy Text Adventure Game ;
Emergent Multi-Agent Communication in the Deep Learning Era ;
Emergence of Linguistic Communication from Referential Games with Symbolic and Pixel Input;
Emergent Communication through Negotiation;
BabyAI: A Platform to Study the Sample Efficiency of Grounded Language Learning;
Learning when to Communicate at Scale in Multiagent Cooperative and Competitive Tasks;
On the Pitfalls of Measuring Emergent Communication;
Two Body Problem: Collaborative Visual Task Completion;
Emergent Communication at Scale;
Generation of Nodding, Head Tilting and Eye Gazing for Human-Robot Dialogue Interaction;
Conversational Gaze Aversion for Humanlike Robots (HRI, 2014);
Understanding Human Gaze Communication by Spatio-Temporal Graph Reasoning;
Simon plays Simon says: The timing of turn-taking in an imitation game;
Effects of Responding to, Initiating and Ensuring Joint Attention in Human-Robot Interaction;
See You See Me: The Role of Eye Contact in Multimodal Human-Robot Interaction ;
Social eye gaze in human-robot interaction: a review ;
Embodiment in Socially Interactive Robots ;
Understanding Teacher Gaze Patterns for Robot Learning ;
Human Gaze Following for Human-Robot Interaction ;
A review of eye gaze in virtual agents, social robotics and hci: Behaviour generation, user interaction and perception ;
Gaze and Attention Management for Embodied Conversational Agents ;
Learning About Objects by Learning to Interact with Them;
Pushing it out of the Way: Interactive Visual Navigation;
LLM Knows Body Language, Too: Translating Speech Voices into Human Gestures;
Fengli, Vedant, Annie, Mohit
Apr 21 Multimodal Biases and Shortcuts; How to Write and Review Research Papers (1) Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality;
(2a) Counterfactual VQA: A Cause-Effect Look at Language Bias; /
(2b) Revealing Single Frame Bias for Video-and-Language Learning;
(3) Multimodal datasets: misogyny, pornography, and malignant stereotypes;
(4a) Measuring Social Biases in Grounded Vision and Language Embeddings; /
(4b) Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models;
Additional readings:
RUBi: Reducing Unimodal Biases in Visual Question Answering;
Debiased Visual Question Answering from Feature and Sample Perspectives;
Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations;
Shifting the Baseline: Single Modality Performance on Visual Navigation & QA ;
Modality-Balanced Models for Visual Dialogue ;
Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions;
MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering ;
REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets ;
VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives ;
Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models ;
Diagnosing the Environment Bias in Vision-and-Language Navigation;
Deconfounded Visual Grounding ;
Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models ;
Are Gender-Neutral Queries Really Gender-Neutral? Mitigating Gender Bias in Image Search ;
Language (Technology) is Power: A Critical Survey of "Bias" in NLP ;
Advances, challenges and opportunities in creating data for trustworthy AI ;
Data and its (dis)contents: A survey of dataset development and use in machine learning research ;
ACL Wiki: Ethics in NLP Resources/Courses;
SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation;
Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models;
Pavan, Shreyas, Mohit
Apr 28 Guest lecture (tentative)
May 05 Last Class: Final Project Presentations (exam days)




Disclaimer

The professor reserves the right to make changes to the syllabus, including project due dates. These changes will be announced as early as possible.