Embodied AI with Web-Scale Video Data
3.0 credits
Embodied AI from Web-Scale Multimodal Data examines how modern agents learn perception, prediction, and control by leveraging large, unstructured internet data, especially web video, egocentric human interaction recordings, and vision-language datasets. The course builds a bottom-up understanding of the perception–action loop, focusing on how motion, 3D structure, human pose and interaction cues, and multimodal signals can be extracted and aligned from video to support embodied reasoning and decision-making. Students will study recent advances in generative video/world models, 3D vision, imitation learning, and offline reinforcement learning, with an emphasis on data curation and alignment at scale. Through paper discussions, hands-on mini-assignments, and an open-ended final project, students will learn to critically evaluate current research and to design scalable learning pipelines that connect web-supervised perception to embodied tasks such as navigation, manipulation, and wearable assistants. Required course background: machine learning or deep learning; computer vision recommended. Students may receive credit for only one of 601.460/660.