Perceiving Systems, Computer Vision

Department Talks

  • Ailing Zeng
  • Virtual

High-quality video generation—encompassing text-to-video (T2V), image-to-video (I2V), and video-to-video (V2V) generation—plays a pivotal role in content creation and world simulation. While several DiT-based models have advanced rapidly in the past year, a thorough exploration of their capabilities, limitations, and alignment with human preferences remains incomplete. In this talk, I will present recent advancements in Sora-like T2V, I2V, and V2V models and products, bridging the gap between academic research and industry applications. Through live demonstrations and comparative analyses, I will highlight key insights across four core dimensions: i) Impact on vertical-domain applications, such as human-centric animation and robotics; ii) Core capabilities, including text alignment, motion diversity, composition, and stability; iii) Performance across ten real-world scenarios, showcasing practical utility; iv) Future potential, including usage scenarios, challenges, and directions for further research. Additionally, I will discuss recent advancements in automatic evaluation methods for generated videos, leveraging multimodal large language models to better adapt to the rapid development of generative and understanding models.
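
As a rough illustration of the automatic-evaluation idea mentioned at the end of the abstract (this is a hypothetical sketch, not the speaker's system): sampled frames of a generated video are scored along the four core dimensions by a multimodal LLM. The `query_mllm` callable and the 1–5 rating protocol are assumptions made purely for illustration.

```python
# Hypothetical sketch of MLLM-based video evaluation; `query_mllm` stands in for
# whatever vision-language model API is actually used.
from typing import List

DIMENSIONS = ["text alignment", "motion diversity", "composition", "stability"]

def score_video(frames: List[bytes], prompt: str, query_mllm) -> dict:
    """Ask an MLLM to rate sampled frames of a generated video on a 1-5 scale."""
    scores = {}
    for dim in DIMENSIONS:
        question = (
            f"The following frames come from a video generated for the prompt '{prompt}'. "
            f"Rate its {dim} from 1 (poor) to 5 (excellent). Answer with a single digit."
        )
        answer = query_mllm(images=frames, text=question)  # placeholder call
        scores[dim] = int(answer.strip()[0])
    return scores
```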

Organizers: Nikos Athanasiou, Michael Black


  • Yannis Siglidis

Advances in computer vision for predicting and visualizing labels often lead us to treat the relationship between labels and images as a given. Yet the prototypical nature of coherent labels, such as the alphabet of handwritten characters, can help us question assumed families of handwritten variation. At the same time, conceptual categories such as the name of a country, if properly assigned to images, can provide a useful benchmark for state-of-the-art computer vision models. Furthermore, synthesis methods allow these datasets to be mined for hidden visual vocabularies that improve our (geographical) understanding of the data. The goal of this talk is to motivate rethinking labels in a bidirectional way, aiming to create systems that inform how humans discretize their visual world.

Organizers: Nikos Athanasiou


  • Soumava Paul
  • Virtual (Zoom)

This talk explores novel approaches to understanding and reconstructing scenes across both spatial and temporal dimensions. Extrapolating a scene from limited observations requires generative priors to synthesize 3D content in unobserved regions. The existing 3D generative literature relies on 3D-aware image or video diffusion models, which require pretraining on million-scale real and synthetic 3D datasets. To address this challenge, we present low-cost generative techniques built on 2D diffusion priors that require only small-scale fine-tuning on multiview data. These fine-tuned priors can rectify novel-view renders and depth maps by inpainting missing details and removing artifacts that arise when 3D representations are fitted to sparse inputs. Through autoregressive fusion of multiple novel views, we build multiview-consistent 3D representations that perform competitively with state-of-the-art methods for complex 360° scenes on the MipNeRF360 dataset. Building upon this foundation of static scene understanding, we extend our investigation to dynamic scenes where physical laws govern object interactions. While current video diffusion models such as OpenAI's Sora can generate visually compelling sequences, they often fail to capture the underlying physical constraints due to their purely data-driven training objectives; as a result, the generated videos often lack physical plausibility. To address this limitation, we introduce a 4D dataset with per-frame force annotations that makes explicit the physical interactions driving object motion in scenes. Our physical simulator can both animate objects in static 3D scenes and record particle-level forces at each timestep. This dataset aims to enable the development of physics-informed video diffusion priors, marking a step toward more physically accurate world simulators.
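
The autoregressive fusion loop described above can be sketched roughly as follows. `render`, `refine`, and `fuse` are hypothetical stand-ins for the renderer, the fine-tuned 2D diffusion prior, and the 3D-representation update; this is an illustrative outline, not the speaker's implementation.

```python
# Illustrative sketch of autoregressive novel-view fusion with a 2D diffusion prior.
def extrapolate_scene(representation, camera_trajectory, diffusion_prior):
    for camera in camera_trajectory:
        rgb, depth, mask = representation.render(camera)      # raw render with holes/artifacts
        rgb_fixed, depth_fixed = diffusion_prior.refine(
            rgb, depth, mask                                   # inpaint missing regions, remove artifacts
        )
        representation.fuse(rgb_fixed, depth_fixed, camera)    # fold the rectified view back into 3D
    return representation
```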

Organizers: Omid Taheri


  • Sergi Pujades
  • MPI IS Tuebingen, 3rd floor, Aquarium

Observing and modeling the human body has attracted scientific effort since early in history. In recent decades, however, several imaging modalities, such as computed tomography (CT), magnetic resonance imaging (MRI), and X-ray imaging, have provided the means to “see” inside the body. Most interestingly, there is growing evidence that the shape of the surface of the human body is highly correlated with its internal properties, for example body composition, the size of the bones, and the amount of muscle and adipose tissue (fat). In this talk I will go over the methodology used to establish the link between the shape of the surface of the body and the internal anatomic structures, based on the classical problems of segmentation, registration, statistical modeling, and inference.
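
As a toy illustration of the surface-to-anatomy link (not the methodology presented in the talk), one can regress internal measurements from surface shape coefficients. The data below are synthetic and the linear model is an assumption; the real pipeline involves segmentation, registration, and dedicated statistical body models.

```python
# Toy example: predict internal measurements (e.g., bone lengths) from surface
# shape coefficients with a linear model, on synthetic data.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
betas = rng.normal(size=(200, 10))  # surface shape coefficients (e.g., from a body-model fit)
bone_lengths = betas @ rng.normal(size=(10, 4)) + 0.1 * rng.normal(size=(200, 4))  # synthetic targets

model = Ridge(alpha=1.0).fit(betas[:150], bone_lengths[:150])  # learn surface-to-anatomy regression
print("Held-out R^2:", model.score(betas[150:], bone_lengths[150:]))
```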

Organizers: Marilyn Keller


Diffusion Models for Human Motion Synthesis

Talk
  • 14 October 2024 • 15:30—16:30
  • Guy Tevet
  • MPI-IS Tuebingen, N3.022

Character motion synthesis stands as a central challenge in computer animation and graphics. The successful adaptation of diffusion models to the field has boosted synthesis quality and provided intuitive controls such as text and music. One of the earliest and most popular methods to do so is the Motion Diffusion Model (MDM) [ICLR 2023]. In this talk, I will review how MDM incorporates domain know-how into the diffusion model and enables intuitive editing capabilities. Then, I will present two recent works, each suggesting a refreshing take on motion diffusion and extending its abilities to new animation tasks. Multi-view Ancestral Sampling (MAS) [CVPR 2024] is an inference-time algorithm that samples 3D animations from 2D keypoint diffusion models. We demonstrated it by generating 3D animations for characters and scenarios that are challenging to record with elaborate motion capture systems, yet ubiquitous in in-the-wild videos, for example horse racing and professional rhythmic gymnastics. Monkey See, Monkey Do (MoMo) [SIGGRAPH Asia 2024] explores the attention space of the motion diffusion model. A careful analysis reveals the roles of the attention's keys and queries throughout the generation process. With these findings in hand, we design a training-free method that generates motion following the distinct motifs of one motion while guided by an outline dictated by another. To conclude the talk, I will give my modest take on the open challenges in the field and on our lab's current work attempting to tackle some of them.
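
For readers unfamiliar with MDM's formulation, the sketch below shows a schematic sampling loop in which the network predicts the clean motion x0 at every step, a key design choice of the paper. The model interface, the noise schedule, and the HumanML3D-style tensor shape are assumptions for illustration, not the authors' code.

```python
# Schematic MDM-style sampling loop: the model predicts clean motion x0 each step.
import torch

@torch.no_grad()
def sample_motion(model, text_emb, T, alphas_cumprod, shape=(1, 196, 263)):
    x_t = torch.randn(shape)                                   # start from pure noise
    for t in reversed(range(T)):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        x0_hat = model(x_t, t, text_emb)                       # predict the clean motion directly
        eps = (x_t - a_t.sqrt() * x0_hat) / (1 - a_t).sqrt()   # implied noise
        x_t = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps  # deterministic DDIM-like step
    return x_t                                                 # denoised motion sequence
```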

Organizers: Omid Taheri


Reconstruction and Animation of Realistic Head Avatars

Talk
  • 10 October 2024 • 14:00—15:00
  • Egor Zakharov
  • Max-Planck-Ring 4, N3, Aquarium

Digital humans, or realistic avatars, are a centerpiece of future telepresence and special-effects systems, and human head modeling is one of their main components. These applications, however, are highly demanding in terms of avatar creation speed, realism, and controllability. This talk will focus on approaches that create controllable and detailed 3D head avatars using data from consumer-grade devices, such as smartphones, in an uncalibrated and unconstrained capture setting. We will discuss leveraging in-the-wild internet videos and synthetic data sources to achieve a high diversity of facial expressions and appearance personalization, including detailed hair modeling. We will also showcase how the resulting human-centric assets can be integrated into virtual environments for real-time telepresence and entertainment applications, illustrating the future of digital communication and gaming.

Organizers: Vanessa Sklyarova


  • Simon Donne
  • Virtual, Live stream at Max-Planck-Ring 4, N3, Aquarium

Current diffusion models only generate RGB images. If we want to make progress towards graphics-ready 3D content generation, we need a PBR foundation model, but there is not enough PBR data available to train such a model from scratch. We introduce Collaborative Control, which tightly links a new PBR diffusion model to a pre-trained RGB model. We show that this dual architecture avoids catastrophic forgetting, produces high-quality PBR images, and generalizes well beyond the PBR training dataset. Furthermore, the frozen base model remains compatible with techniques such as IP-Adapter.
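
One plausible reading of the dual architecture, sketched below purely for illustration: a trainable PBR branch consumes features from the frozen, pre-trained RGB branch at each denoising step. The module interfaces and the linear linking layer are assumptions, not the actual Collaborative Control implementation.

```python
# Hedged sketch of a frozen RGB diffusion branch feeding a trainable PBR branch.
import torch
import torch.nn as nn

class CollaborativeDenoiser(nn.Module):
    def __init__(self, rgb_unet: nn.Module, pbr_unet: nn.Module, feat_dim: int):
        super().__init__()
        self.rgb_unet = rgb_unet.eval()            # frozen pre-trained RGB model
        for p in self.rgb_unet.parameters():
            p.requires_grad_(False)
        self.pbr_unet = pbr_unet                   # new, trainable PBR model
        self.link = nn.Linear(feat_dim, feat_dim)  # learned cross-branch link (assumed form)

    def forward(self, rgb_noisy, pbr_noisy, t, text_emb):
        # Assumed interfaces: the RGB branch returns (output, features) and the PBR
        # branch accepts the linked features as an extra input.
        rgb_out, rgb_feats = self.rgb_unet(rgb_noisy, t, text_emb)
        pbr_out = self.pbr_unet(pbr_noisy, t, text_emb,
                                extra_feats=self.link(rgb_feats))
        return rgb_out, pbr_out
```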

Organizers: Soubhik Sanyal


  • Slava Elizarov
  • Virtual, Live stream at Max-Planck-Ring 4, N3, Aquarium

In this talk, I will present Geometry Image Diffusion (GIMDiffusion), a novel method designed to generate 3D objects from text prompts efficiently. GIMDiffusion uses geometry images, a 2D representation of 3D shapes, which allows the use of existing image-based architectures instead of complex 3D-aware models. This approach reduces computational costs and simplifies the model design. By incorporating Collaborative Control, the method exploits the rich priors of pretrained text-to-image models such as Stable Diffusion, enabling strong generalization even with limited 3D training data. GIMDiffusion produces 3D objects with semantically meaningful, separable parts and internal structures, which makes them easier to manipulate and edit.
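
To illustrate why geometry images make 3D generation amenable to image-based architectures (this is background, not code from the talk): each pixel of a geometry image stores a 3D surface point, so the regular pixel grid converts directly into a triangle mesh.

```python
# Background illustration: turning an (H, W, 3) geometry image into a mesh.
import numpy as np

def geometry_image_to_mesh(gim: np.ndarray):
    """gim: (H, W, 3) array of 3D coordinates. Returns (vertices, faces)."""
    h, w, _ = gim.shape
    vertices = gim.reshape(-1, 3)
    idx = np.arange(h * w).reshape(h, w)
    # two triangles per grid cell
    tl, tr, bl, br = idx[:-1, :-1], idx[:-1, 1:], idx[1:, :-1], idx[1:, 1:]
    faces = np.concatenate([
        np.stack([tl, bl, tr], axis=-1).reshape(-1, 3),
        np.stack([tr, bl, br], axis=-1).reshape(-1, 3),
    ])
    return vertices, faces
```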

Organizers: Soubhik Sanyal


Advancements in 3D Facial Expression Reconstruction

Talk
  • 23 September 2024 • 12:00—13:00
  • Panagiotis Filntisis and George Retsinas
  • Hybrid

Recent advances in 3D face reconstruction from in-the-wild images and videos have excelled at capturing the overall facial shape associated with a person's identity. However, they often struggle to accurately represent the perceptual realism of facial expressions, especially subtle, extreme, or rarely observed ones. In this talk, we will present two contributions focused on improving 3D facial expression reconstruction. The first part introduces SPECTRE—"Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from Videos"—which offers a method for precise 3D reconstruction of mouth movements linked to speech articulation. This is achieved using a novel "lipread" loss function that enhances perceptual realism. The second part covers SMIRK—"3D Facial Expressions through Analysis-by-Neural-Synthesis"—where we explore how neural rendering techniques can overcome the limitations of differentiable rendering. This approach provides better gradients for 3D reconstruction and allows us to augment training data with diverse expressions for improved generalization. Together, these methods set new standards in accurately reconstructing facial expressions.
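
As a rough sketch of the "lipread" loss idea (not the authors' code): mouth crops from the rendered and the real video are passed through a pre-trained lip-reading network, and the distance between their features is penalized. `crop_mouth` and `lipread_net` are hypothetical stand-ins for the cropping step and the lip-reading feature extractor.

```python
# Hedged sketch of a perceptual lip-reading loss between rendered and real video.
import torch
import torch.nn.functional as F

def lipread_loss(rendered_frames, real_frames, crop_mouth, lipread_net):
    """Both inputs: (T, 3, H, W) video tensors of the same face."""
    feats_rendered = lipread_net(crop_mouth(rendered_frames))  # features sensitive to speech articulation
    feats_real = lipread_net(crop_mouth(real_frames))
    return 1.0 - F.cosine_similarity(feats_rendered, feats_real, dim=-1).mean()
```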

Organizers: Victoria Fernandez Abrevaya


Generalizable Object-aware Human Motion Synthesis

Talk
  • 12 September 2024 • 14:00—15:00
  • Wanyue Zhang
  • Max-Planck-Ring 4, N3, Aquarium

Data-driven virtual 3D character animation has recently witnessed remarkable progress. The realism of virtual characters is a core contributing factor to the quality of computer animations and the user experience in immersive applications such as games, movies, and VR/AR. However, existing automatic approaches for 3D virtual character motion synthesis that support scene interactions do not generalize well to new objects outside the training distribution, even when trained on extensive motion capture datasets with diverse objects and annotated interactions. In this talk, I will present ROAM, an alternative framework that generalizes to unseen objects of the same category without relying on a large dataset of human-object animations. In addition, I will share some preliminary findings from an ongoing project on hand motion interaction with articulated objects.

Organizers: Nikos Athanasiou