Schedule (tentative)
| Date | Topic | Readings | Discussion Leaders |
| --- | --- | --- | --- |
| Jan 7 | Intro to the Course | -- | Mohit |
| Jan 14 | Image-Text Alignment / Matching / Retrieval & Referring Expression Comprehension |
(1) CLIP: Learning Transferable Visual Models From Natural Language Supervision (Radford et al., ICML 2021);
(2) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (Li et al., ICML 2022);
(3) MAttNet: Modular Attention Network for Referring Expression Comprehension (Yu et al., CVPR 2018);
(4) Referring Transformer: A One-step Approach to Multi-task Visual Grounding (Li and Sigal, NeurIPS 2021);
Additional readings:
Matching Words and Pictures (Barnard et al., JMLR 2003);
Deep Visual-Semantic Alignments for Generating Image Descriptions (Karpathy and Fei-Fei, CVPR 2015);
VSE++: Improving Visual-Semantic Embeddings with Hard Negatives (Faghri et al., BMVC 2018);
Stacked Cross Attention for Image-Text Matching (Lee et al., ECCV 2018);
A Joint Speaker-Listener-Reinforcer Model for Referring Expressions (Yu et al., CVPR 2017);
What are you talking about? Text-to-Image Coreference (Kong et al., CVPR 2014);
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation;
Segment Anything;
| Yidong, Atharv, Mohit |
| Jan 21 | Referring Expression Generation & Image/Video Captioning |
(1) A Joint Speaker-Listener-Reinforcer Model for Referring Expressions (Yu et al., CVPR 2017);
(2) Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning (Sharma et al., ACL 2018);
(3) Neural Baby Talk (Lu et al., CVPR 2018);
(4) SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning (Lin et al., CVPR 2022);
Additional readings:
Deep Visual-Semantic Alignments for Generating Image Descriptions;
Explain Images with Multimodal Recurrent Neural Networks;
From captions to visual concepts and back;
Learning a Recurrent Visual Representation for Image Caption Generation;
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention;
Image Captioning with Semantic Attention;
Self-critical Sequence Training for Image Captioning;
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering;
DenseCap: Fully Convolutional Localization Networks for Dense Captioning;
Sequence to Sequence -- Video to Text;
Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks;
Reinforced Video Captioning with Entailment Rewards;
MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning;
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research;
ClipCap: CLIP Prefix for Image Captioning;
Fine-grained Image Captioning with CLIP Reward;
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning;
| Rohan, Amogh, Mohit |
| Jan 28 | Image/Video Question Answering & Dialogue |
(1) VQA v2: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering (Goyal et al., CVPR 2017);
(2) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering (Anderson et al., CVPR 2018);
(3) TVQA: Localized, Compositional Video Question Answering (Lei et al., EMNLP 2018);
(4) VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos (Wang et al., CVPR 2025);
Additional readings:
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering;
Visual Dialog (Das et al., CVPR 2017);
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding;
Bilinear Attention Networks;
Deep Modular Co-Attention Networks for Visual Question Answering;
MovieQA: Understanding Stories in Movies through Question-Answering;
A Joint Sequence Fusion Model for Video Question Answering and Retrieval;
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning;
Modality-Balanced Models for Visual Dialogue;
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA;
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions;
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling;
Self-Chained Image-Language Model for Video Localization and Question Answering;
A Simple LLM Framework for Long-Range Video Question-Answering;
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding;
| Adam, Alex, Mohit |
| Feb 4 | Query-based Video Moment Retrieval & Summary / Highlight / Saliency Prediction |
(1) TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval (Lei et al., ECCV 2020);
(2) UniVTG: Towards Unified Video-Language Temporal Grounding (Lin et al., ICCV 2023);
(3) QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries (Lei et al., NeurIPS 2021);
(4) UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection (Liu et al., CVPR 2022);
Additional readings:
Localizing Moments in Video with Natural Language;
TALL: Temporal Activity Localization via Language Query;
Temporal Localization of Moments in Video Collections with Natural Language;
Multi-task deep visual-semantic embedding for video thumbnail selection;
mTVR: Multilingual Moment Retrieval in Videos;
YouCook2-Retrieval: Towards Automatic Learning of Procedures from Web Instructional Videos;
Ranking Domain-specific Highlights by Analyzing Edited Videos;
TVSum: Summarizing Web Videos Using Titles;
Query-Focused Video Summarization: Dataset, Evaluation, and a Memory Network Based Approach;
CLIP-It! Language-Guided Video Summarization;
TubeDETR: Spatio-Temporal Video Grounding with Transformers;
Hierarchical Video-Moment Retrieval and Step-Captioning;
VidChapters-7M: Video Chapters at Scale;
TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency;
Query-Dependent Video Representation for Moment Retrieval and Highlight Detection;
TimeChat: A Time-Sensitive Multimodal Large Language Model for Long Video Understanding;
Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection;
TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection;
MomentDiff: Generative Video Moment Retrieval from Random to Real;
VTimeLLM: Empower LLM to Grasp Video Moments;
LITA: Language Instructed Temporal-Localization Assistant;
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning;
| Rishabhdev, Joy, Mohit |
| Feb 11 | Vision+Language Commonsense, World Knowledge, Reasoning; Project Brainstorming+Feedback |
(1) VisualCOMET: Reasoning about the Dynamic Context of a Still Image (ECCV 2020);
(2) MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (CVPR 2024);
(3) VLEP: What is More Likely to Happen Next? Video-and-Language Future Event Prediction (EMNLP 2020);
(4a) Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models (NeurIPS 2024); (4b) OpenAI: Thinking with images;
Additional readings:
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge;
VCR: From Recognition to Cognition: Visual Commonsense Reasoning;
Visual Persuasion: Inferring Communicative Intents of Images;
Don't Just Listen, Use Your Imagination: Leveraging Visual Common Sense for Non-Visual Tasks;
Learning Common Sense Through Visual Abstraction;
Anticipating Visual Representations from Unlabeled Video;
Predicting Motivations of Actions by Leveraging Text;
"What Happens If..." Learning to Predict the Effect of Forces in Images;
Grounding Visual Explanations;
VIOLIN: A Large-Scale Dataset for Video-and-Language Inference;
PIQA: Reasoning about Physical Commonsense in Natural Language;
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge;
Visual Abductive Reasoning;
Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning;
Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding;
EgoTaskQA: Understanding Human Tasks in Egocentric Videos;
CoSIm: Commonsense Reasoning for Counterfactual Scene Imagination;
Ego4D: Around the World in 3,000 Hours of Egocentric Video;
Visual Programming: Compositional visual reasoning without training;
ViperGPT: Visual Inference via Python Execution for Reasoning;
Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning;
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning;
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding;
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos;
Visual-RFT: Visual Reinforcement Fine-Tuning;
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models;
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning;
| Qi, Om, Mohit |
| Feb 18 | Vision+Language Pretrained Models (V+L, V-->L, L-->V, Unified, Any-to-Any, DocAI, Efficient VL) |
(1) LXMERT: Learning Cross-Modality Encoder Representations from Transformers; / ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks;
(2) Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision; / VirTex: Learning Visual Representations from Textual Annotations;
(3) VL-T5: Unifying Vision-and-Language Tasks via Text Generation; / Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks;
(4) Any-to-Any Generation via Composable Diffusion; / CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation;
Additional readings:
Several other VLP models (e.g., UNITER, VL-BERT, VisualBERT, etc.);
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks;
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks;
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision;
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph;
VideoBERT: A Joint Model for Video and Language Representation Learning;
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training;
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation;
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling;
MERLOT: Multimodal Neural Script Knowledge Models;
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation;
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling;
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework;
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision;
CoCa: Contrastive Captioners are Image-Text Foundation Models;
TVLT: Textless Vision-Language Transformer;
BEiT: BERT Pre-Training of Image Transformers;
Flamingo: a Visual Language Model for Few-Shot Learning;
PICa: An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA;
VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners;
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models;
Contrastive Learning of Medical Visual Representations from Paired Images and Text;
VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer;
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks;
Idefics3-8B: Building and better understanding vision-language models;
PaliGemma 2: A Family of Versatile VLMs for Transfer;
InternVL 2.5: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling;
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding;
Qwen2.5-VL: Enhanced Vision-Language Capabilities in the Qwen Series;
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks;
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation;
LLaVA: Visual Instruction Tuning;
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation;
OneLLM: One Framework to Align All Modalities with Language;
Chameleon: Mixed-Modal Early-Fusion Foundation Models;
Qwen3-VL Technical Report;
BAGEL: The Open-Source Unified Multimodal Model;
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities;
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models;
LayoutLM: Pre-training of Text and Layout for Document Image Understanding;
DocVQA: A Dataset for VQA on Document Images;
Multimodal Few-Shot Learning with Frozen Language Models;
OCR-free Document Understanding Transformer;
VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks;
LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning;
Unifying Vision, Text, and Layout for Universal Document Processing;
Document Understanding Dataset and Evaluation (DUDE);
DocLLM: A layout-aware generative language model for multimodal document understanding;
ColPali: Efficient Document Retrieval with Vision Language Models;
MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations;
M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding;
| Amogh, Yidong, Mohit |
| Feb 25 | Image/Story/Video Generation; 3D Image/Video & Game Generation + World Models |
(1) DiT: Scalable Diffusion Models with Transformers;
(2) Video Diffusion Models;
(3) DreamFusion: Text-to-3D using 2D Diffusion;
(4) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning;
(5) Wan: Open and Advanced Large-Scale Video Generative Models;
(6) Genie3: Experimenting with infinite, interactive worlds;
Additional readings:
Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models;
DALL-E: Zero-Shot Text-to-Image Generation;
Generative Adversarial Text to Image Synthesis;
StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks;
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks;
Video Generation From Text;
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers;
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion;
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation;
Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors;
Muse: Text-To-Image Generation via Masked Generative Transformers;
CogView: Mastering Text-to-Image Generation via Transformers;
DALL-E 2: Hierarchical Text-Conditional Image Generation with CLIP Latents;
Parti: Scaling Autoregressive Models for Content-Rich Text-to-Image Generation;
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion;
Imagen Video: High Definition Video Generation with Diffusion Models;
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers;
Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding;
StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation;
Make-A-Video: Text-to-Video Generation without Text-Video Data;
ControlNet: Adding Conditional Control to Text-to-Image Diffusion Models;
Scaling up GANs for Text-to-Image Synthesis;
VPGen: Visual Programming for Text-to-Image Generation and Evaluation;
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models;
VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning;
Unbounded: A Generative Infinite Game of Character Life Simulation;
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models;
GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment;
T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation;
Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation;
TIFA: Text-to-Image Faithfulness Evaluation with Question Answering;
GenAI Arena: An Open Evaluation Platform for Generative Models;
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models;
Cosmos: World Foundation Model Platform for Physical AI;
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction;
Zero-1-to-3: Zero-shot One Image to 3D Object;
VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation;
Metamorph: Multimodal Understanding and Generation via Instruction Tuning;
Diffusion Models Are Real-Time Game Engines;
Navigation World Models;
RAE: Diffusion Transformers with Representation Autoencoders;
Back to Basics: Let Denoising Generative Models Denoise;
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion;
| Joykirat, Rishabhdev, Mohit |
| Mar 4 | Project Discussion/Feedback | -- | Mohit |
| Mar 11 | MIDTERM PROJECT PRESENTATIONS | -- | TBD |
| Mar 18 | SPRING BREAK | -- | TBD |
| Mar 25 | Language instructions for robotic navigation, articulation, manipulation, assembly, skill learning, etc. | -- | TBD |
| Apr 1 | Embodied Pretraining Models & Vision-Language-Action (VLA) Benchmarks/Models | -- | TBD |
| Apr 8 | Human-robot collaboration and dialogue: instruction generation, embodied Q&A, learning new subactions, etc. | -- | TBD |
| Apr 15 | Grounded language learning/emergence via multi-agent dialogue-based and interactive games; Non-Verbal Human-Robot Interaction/Communication: Gestures, Turn-taking, Gaze | -- | TBD |
| Apr 22 | Multimodal Biases and Shortcuts; How to Write and Review Research Papers | -- | TBD |
| Apr 29 | FINAL PROJECT PRESENTATIONS | -- | TBD |