Perceiving Systems, Computer Vision

Department Talks

  • Siheng Chen
  • N3

With the rapid growth of AI techniques, we may witness AI agents entering our lives, reminiscent of a new species. Ensuring that these agents integrate well into human life will be a profound challenge: we need them to be highly performant, safe, and well aligned with human values. However, directly training and testing AI agents in real-world environments to guarantee their performance and safety is costly and can disrupt everyday life. We therefore explore a simulation-based approach to incubating these AI agents. In this talk, we will highlight the role of simulation in two key scenarios: large language models (LLMs) and autonomous driving. Through these two studies, I will demonstrate how simulation can effectively facilitate the development of LLM agents and driving agents, ensuring they are both powerful and safe for human use.

Organizers: Yao Feng

Creating High-End Visuals with Real-Time Technology

Talk
  • 08 July 2024 • 11:00—12:00
  • Yafes Sahin
  • Max Planck Ring 4, N3

Creating captivating 3D visuals, particularly photorealistic CGI, demands a diverse range of tools, techniques, and expertise, from concept design to the creation of entire 3D worlds. Linear content generation represents the highest standard of visual quality and has long been a source of inspiration for game developers. In this talk, we will explore the advancements in techniques that have contributed to the rise of real-time technologies in movies and game cinematics. We will delve into projects created with Unreal Engine, such as The Matrix Awakens, Vaulted Halls Entombed (Netflix series: Love, Death & Robots), and Rise of Hydra (Captain America and Black Panther). We will also look at examples from movies created using traditional workflows to compare the two approaches and understand their differences.

Organizers: Yao Feng


Text-Driven 3D Modeling of Avatars

Talk
  • 04 July 2024 • 10:00—11:00
  • Pranav Manu
  • Hybrid

Generating 3D objects poses notable challenges due to the limited availability of annotated 3D datasets, unlike their 2D counterparts. Current approaches often resort to models trained on 2D data, resulting in prolonged optimization phases. Conversely, models trained on 3D datasets enable inference without optimization but suffer from limited dataset diversity. This talk explores methodologies for generative 3D modelling of human heads and garments, both pivotal for human avatar creation. First, we introduce "Clip-Head," a text-driven model that generates textured NPHM head models directly from text prompts, bypassing expensive per-instance optimization. Second, we briefly discuss "WordRobe," a text-to-garment generation framework that learns a latent space of garments. WordRobe produces open-surface garments with consistent texture maps, ready for use in graphics pipelines. This approach paves the way for text-driven garment design and virtual try-on applications.
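The core idea behind a feed-forward, text-driven head generator can be summarized as mapping a text embedding to the latent codes of a parametric head model. The sketch below is a hypothetical, minimal stand-in for such a mapping network; the layer sizes, code dimensions, and the omitted texture branch are assumptions for illustration and do not reflect the actual Clip-Head architecture.

```python
import torch
import torch.nn as nn

class TextToHeadCodes(nn.Module):
    """Map a text embedding to parametric head-model latent codes.

    Hypothetical sketch: dimensions and architecture are assumptions,
    not the Clip-Head implementation described in the talk.
    """
    def __init__(self, text_dim=512, id_dim=256, expr_dim=100):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, 512), nn.ReLU(),
            nn.Linear(512, id_dim + expr_dim),
        )
        self.id_dim = id_dim

    def forward(self, text_emb):
        codes = self.mlp(text_emb)
        # Split into identity and expression codes for the head model.
        return codes[..., :self.id_dim], codes[..., self.id_dim:]

mapper = TextToHeadCodes()
text_emb = torch.randn(1, 512)                      # e.g. a CLIP text embedding
identity_code, expression_code = mapper(text_emb)   # would condition the NPHM decoder
```

Because the mapping is feed-forward, generation amounts to a single network evaluation rather than a per-prompt optimization loop.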

Organizers: Victoria Fernandez Abrevaya


  • Shixiang Tang
  • Hybrid

Recent years have witnessed great research interest in human-centric visual computing, spanning tasks such as person re-identification in surveillance, mesh recovery for the Metaverse, and pedestrian detection in autonomous driving. The recent development of large models offers the opportunity to unify these human-centric tasks and achieve improved performance by merging public datasets from different tasks. This talk will present our recent work on developing human-centric unified models for 2D vision, 3D vision, skeleton-based, and vision-language tasks. We hope our model can be integrated into current large language models to build an intelligent human world model.

Organizers: Yandong Wen


Generative Rendering and Beyond

Talk
  • 02 May 2024 • 17:00—18:00
  • Shengqu Cai
  • Hybrid

Traditional 3D content creation tools empower users to bring their imagination to life by giving them direct control over a scene's geometry, appearance, motion, and camera path. Creating computer-generated videos, however, is a tedious manual process, which emerging text-to-video diffusion models (e.g., Sora) promise to automate. Despite this promise, video diffusion models are difficult to control, hindering users from applying their own creativity rather than amplifying it. In this talk, we present Generative Rendering, a novel approach that combines the controllability of dynamic 3D meshes with the expressivity and editability of emerging diffusion models. Our approach takes an animated, low-fidelity rendered mesh as input and injects the ground-truth correspondence information obtained from the dynamic mesh into various stages of a pre-trained text-to-image generation model to output high-quality, temporally consistent frames. Going beyond this, we will discuss the challenges and goals on the path to controllable video diffusion models, and conclude with a preview of our ongoing consensus video generation efforts.
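To make the injection of mesh correspondences concrete, the toy sketch below shows one way pixel-to-texel correspondences from a rasterized animated mesh could tie features together across frames: features are averaged over shared texels and written back, so the same surface point receives the same feature in every frame. This is only an illustration of the underlying idea; where and how Generative Rendering actually injects this information into the diffusion model's stages is more involved.

```python
import numpy as np

def unify_features_via_uv(frame_feats, uv_ids, num_texels):
    """Average per-frame features over shared mesh texels and write them back.

    frame_feats: (T, H, W, C) features, e.g. from a diffusion model's layers.
    uv_ids:      (T, H, W) texel id per pixel, provided by the rasterizer.
    """
    T, H, W, C = frame_feats.shape
    texel_sum = np.zeros((num_texels, C))
    texel_cnt = np.zeros(num_texels)
    for t in range(T):
        ids = uv_ids[t].reshape(-1)                          # texel id per pixel
        np.add.at(texel_sum, ids, frame_feats[t].reshape(-1, C))
        np.add.at(texel_cnt, ids, 1)
    texel_feat = texel_sum / np.maximum(texel_cnt[:, None], 1)
    # Write the averaged texel feature back to every frame: same texel -> same feature.
    return np.stack([texel_feat[uv_ids[t]] for t in range(T)])

T, H, W, C, num_texels = 4, 16, 16, 8, 64
feats = np.random.randn(T, H, W, C)
uv_ids = np.random.randint(0, num_texels, (T, H, W))
consistent = unify_features_via_uv(feats, uv_ids, num_texels)
```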

Organizers: Shrisha Bharadwaj, Michael Black


Modeling and Reconstructing Garments with Sewing Patterns

Talk
  • 04 April 2024 • 14:00—15:00
  • Maria Korosteleva
  • N3.022

The problems of creating new garments (modeling) and reproducing existing ones (reconstruction) arise in various fields, from fashion production to digital human modeling for the metaverse. This talk introduces approaches to a novel garment creation paradigm: programming-based parametric sewing pattern construction and its application to generating rich synthetic datasets of garments with sewing patterns. We will then discuss how the availability of ground-truth sewing patterns allows the learning-based garment reconstruction problem to be posed as sewing pattern recovery. This reformulation enables obtaining high-quality 3D garment models from sparse point clouds with effective generalization across designs, while simultaneously providing a designer-friendly garment representation for further use in traditional garment processing pipelines.
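As a rough illustration of what programming-based parametric sewing pattern construction can look like, the sketch below defines a pattern as a set of 2D panels plus stitches that pair panel edges, generated by a parametric template function. The data structures and the rectangle-skirt template are hypothetical and far simpler than the representation discussed in the talk.

```python
from dataclasses import dataclass, field

@dataclass
class Panel:
    """A 2D pattern piece defined by its outline vertices (in cm)."""
    name: str
    vertices: list          # [(x, y), ...] in panel-local coordinates

@dataclass
class SewingPattern:
    """A programmatic sewing pattern: panels plus edge-to-edge stitches.

    Minimal, hypothetical data structure; the actual parametrization and
    stitch format used in the talk may differ.
    """
    panels: dict = field(default_factory=dict)
    stitches: list = field(default_factory=list)   # [((panel, edge_id), (panel, edge_id))]

def rectangle_skirt(waist: float, length: float) -> SewingPattern:
    """Parametric template: changing waist/length regenerates the whole pattern."""
    half = waist / 2
    front = Panel("front", [(0, 0), (half, 0), (half, length), (0, length)])
    back = Panel("back", [(0, 0), (half, 0), (half, length), (0, length)])
    pattern = SewingPattern(panels={"front": front, "back": back})
    # Side seams: join the right edge of the front to the left edge of the back, and vice versa.
    pattern.stitches = [(("front", 1), ("back", 3)), (("front", 3), ("back", 1))]
    return pattern

pattern = rectangle_skirt(waist=70.0, length=50.0)
```

Because the pattern is produced by code, sampling the template's parameters immediately yields a labeled synthetic dataset of garment and sewing-pattern pairs.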

Organizers: Yao Feng, Michael Black


Geometric Regularizations for 3D Shape Generation

Talk
  • 13 March 2024 • 15:00—16:00
  • Qixing Huang
  • N3.022

Generative models, which map a latent parameter space to instances in an ambient space, enjoy various applications in 3D vision and related domains. A standard scheme for these models is probabilistic: the ambient distribution induced by pushing a prior distribution on the latent space through the generator is aligned with the empirical distribution of the training instances. While this paradigm has proven quite successful on images, its current applications to 3D generation face fundamental challenges due to limited training data and poor generalization behavior. The key difference between image generation and shape generation is that 3D shapes possess various priors in geometry, topology, and physical properties. Existing probabilistic 3D generative approaches do not preserve these desired properties, resulting in synthesized shapes with various types of distortions. In this talk, I will discuss recent work that seeks to establish a novel geometric framework for learning shape generators. The key idea is to model various geometric, physical, and topological priors of 3D shapes as suitable regularization losses by developing computational tools in differential geometry and computational topology. We will discuss applications in deformable shape generation, latent space design, joint shape matching, and 3D man-made shape generation.
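To ground the idea of turning a shape prior into a regularization loss, here is a deliberately simple example: a differentiable penalty that keeps the edge lengths of a generated mesh close to those of a template, i.e., an approximate isometry prior. It is a generic stand-in, not one of the specific regularizers developed in the talk; the vertices, template, and connectivity below are placeholder data.

```python
import torch

def edge_length_regularizer(verts, template_verts, edges):
    """Penalize deviation of generated edge lengths from a template's lengths.

    A generic geometric prior expressed as a differentiable loss; the talk's
    actual geometric/topological/physical regularizers are more sophisticated.
    """
    def lengths(v):
        return (v[edges[:, 0]] - v[edges[:, 1]]).norm(dim=-1)
    return ((lengths(verts) - lengths(template_verts)) ** 2).mean()

# Placeholder data standing in for a generator's output and a template mesh.
verts = torch.randn(100, 3, requires_grad=True)                  # generated vertices
template = torch.randn(100, 3)                                   # template vertices
edges = torch.stack([torch.arange(99), torch.arange(1, 100)], 1) # chain connectivity
reg = edge_length_regularizer(verts, template, edges)
reg.backward()   # gradients would flow back into the shape generator
```

In training, such a term would simply be added, with a weight, to whatever generative objective is used.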

Organizers: Yuliang Xiu


Mining Visual Knowledge from Large Pre-trained Models

Talk
  • 18 January 2024 • 15:00—16:00
  • Luming Tang
  • N3.022

Computer vision has made huge progress in the past decade under the dominant supervised learning paradigm, that is, training large-scale neural networks for each task on ever larger datasets. However, in many cases, collecting data or annotations at scale is intractable. In contrast, humans can easily adapt to new vision tasks with very little data or few labels. To bridge this gap, we found that rich visual knowledge already exists in large pre-trained models, i.e., models trained on internet-scale images with either self-supervised or generative objectives. We propose different techniques to extract this implicit knowledge and use it to accomplish specific downstream tasks where data is constrained, including recognition, dense prediction, and generation. Specifically, I will present three works. First, I will introduce an efficient and effective way to adapt pre-trained vision transformers to a variety of low-shot downstream tasks while tuning less than 1 percent of the model parameters. Second, I will show that accurate visual correspondences emerge from a strong generative model (i.e., diffusion models) without any supervision. Finally, I will demonstrate that an adapted diffusion model can complete a photo with true scene contents using only a few casually captured reference images.
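For the first work, the underlying recipe is the familiar parameter-efficient adaptation pattern: freeze the pre-trained backbone and train only a tiny module on top of it. The sketch below illustrates that pattern with a generic low-rank adapter on a single linear layer; it is not the specific method presented in the talk, and the rank and dimensions are assumptions chosen so the trainable fraction lands near 1 percent.

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Frozen pretrained linear layer plus a small trainable low-rank update.

    Illustrative only: this is the generic 'freeze the backbone, tune a tiny
    module' recipe, not the adaptation method from the talk.
    """
    def __init__(self, linear: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = linear
        for p in self.base.parameters():            # freeze pretrained weights
            p.requires_grad = False
        d_out, d_in = linear.weight.shape
        self.down = nn.Linear(d_in, rank, bias=False)   # trainable
        self.up = nn.Linear(rank, d_out, bias=False)    # trainable
        nn.init.zeros_(self.up.weight)              # start as an identity update

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

layer = LowRankAdapter(nn.Linear(768, 768), rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.3%}")   # ~1% for this layer at rank 4
```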

Organizers: Yuliang Xiu, Yandong Wen


  • Partha Ghosh
  • N3.022 Aquarium and Zoom

We present a novel unconditional video generative model designed to address long-term spatial and temporal dependencies. To capture these dependencies, our approach incorporates a hybrid explicit-implicit tri-plane representation, inspired by 3D-aware generative frameworks developed for three-dimensional object representation, and employs a single latent code to model an entire video sequence. Individual video frames are then synthesized from an intermediate tri-plane representation, which itself is derived from the primary latent code. This strategy reduces computational complexity by a factor of 2 as measured in FLOPs. Consequently, our approach facilitates the efficient and temporally coherent generation of videos. Moreover, our joint frame modeling approach, in contrast to autoregressive methods, mitigates the generation of visual artifacts. We further enhance the model's capabilities by integrating an optical-flow-based module within our Generative Adversarial Network (GAN) based generator architecture, thereby compensating for the constraints imposed by a smaller generator size. As a result, our model is capable of synthesizing high-fidelity video clips at a resolution of 256×256 pixels, with durations extending to more than 5 seconds at a frame rate of 30 fps. The efficacy and versatility of our approach are empirically validated through qualitative and quantitative assessments across three different datasets comprising both synthetic and real video clips.
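A tri-plane representation for video can be pictured as three feature planes spanning the (x, y), (x, t), and (y, t) axes, all decoded from one latent code; a space-time point gathers its feature by indexing each plane. The sketch below is a toy nearest-neighbour version of that lookup, with random planes standing in for the ones a generator would produce; the actual model uses bilinear sampling and a learned decoder on top, so treat this purely as an illustration of the representation.

```python
import numpy as np

def sample_triplane(planes, x, y, t):
    """Gather and sum features for a space-time point (x, y, t).

    Toy nearest-neighbour lookup; assumes planes of equal resolution R
    and integer coordinates in [0, R).
    """
    xy, xt, yt = planes                       # each plane: (R, R, C)
    return xy[x, y] + xt[x, t] + yt[y, t]

R, C = 32, 16
planes = [np.random.randn(R, R, C) for _ in range(3)]  # would be decoded from one latent code
feat = sample_triplane(planes, x=5, y=10, t=20)         # a decoder maps feat -> pixel color
```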

Organizers: Yandong Wen


  • Weiyang Liu
  • N3.022 Aquarium and Zoom

Large foundation models are becoming ubiquitous, but training them from scratch is prohibitively expensive. Thus, efficiently adapting these powerful models to downstream tasks is increasingly important. In this talk, we study a principled finetuning paradigm -- Orthogonal Finetuning (OFT) -- for downstream task adaptation. Despite demonstrating good generalizability, OFT still uses a fairly large number of trainable parameters due to the high dimensionality of orthogonal matrices. To address this, we start by examining OFT from an information transmission perspective and then identify a few key desiderata that enable better parameter efficiency. Inspired by how the Cooley-Tukey fast Fourier transform algorithm enables efficient information transmission, we propose an efficient orthogonal parameterization using butterfly structures. Applying this parameterization to OFT yields a novel parameter-efficient finetuning method called Orthogonal Butterfly (BOFT). By subsuming OFT as a special case, BOFT introduces a generalized orthogonal finetuning framework. Finally, we conduct an extensive empirical study of adapting large vision transformers, large language models, and text-to-image diffusion models to various downstream tasks in computer vision and natural language processing. The results validate the effectiveness of BOFT as a generic finetuning method.
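The butterfly idea can be illustrated with a small numpy sketch: an orthogonal matrix is composed from log2(d) sparse factors, each made of 2x2 rotations acting on index pairs at a given stride (as in the FFT), so only d/2 * log2(d) angles are trainable instead of O(d^2) matrix entries. This is a minimal illustration of the parameterization, not the BOFT implementation; the final step, where the learned orthogonal matrix rotates a frozen pretrained weight, is likewise a simplified sketch of orthogonal finetuning.

```python
import numpy as np

def butterfly_factor(d, stride, angles):
    """One orthogonal butterfly factor: 2x2 rotations on index pairs (i, i + stride)."""
    B = np.eye(d)
    k = 0
    for block in range(0, d, 2 * stride):
        for off in range(stride):
            i, j = block + off, block + off + stride
            c, s = np.cos(angles[k]), np.sin(angles[k])
            B[i, i], B[i, j] = c, -s
            B[j, i], B[j, j] = s, c
            k += 1
    return B

def butterfly_orthogonal(d, all_angles):
    """Compose log2(d) butterfly factors into a dense orthogonal matrix.

    Only d/2 * log2(d) angles are trainable, versus O(d^2) for a full
    orthogonal matrix -- the source of the parameter efficiency.
    """
    R = np.eye(d)
    for level in range(int(np.log2(d))):
        R = butterfly_factor(d, 2 ** level, all_angles[level]) @ R
    return R

# Finetuning sketch: rotate a frozen pretrained weight with the learned
# orthogonal matrix instead of updating the weight directly.
d = 8
rng = np.random.default_rng(0)
angles = rng.normal(size=(int(np.log2(d)), d // 2)) * 0.01
W_pretrained = rng.normal(size=(d, d))               # frozen weights
R = butterfly_orthogonal(d, angles)                  # trainable, orthogonal by construction
W_finetuned = R @ W_pretrained                       # norm- and angle-preserving update
assert np.allclose(R @ R.T, np.eye(d), atol=1e-8)    # orthogonality check
```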

Organizers: Yandong Wen


Ghost on the Shell: An Expressive Representation of General 3D Shapes

Talk
  • 12 October 2023 • 10:00—11:00
  • Zhen Liu
  • Hybrid

The creation of photorealistic virtual worlds requires the accurate modeling of 3D surface geometry for a wide range of objects. For this, meshes are appealing since they 1) enable fast physics-based rendering with realistic material and lighting, 2) support physical simulation, and 3) are memory-efficient for modern graphics pipelines. Recent work on reconstructing and statistically modeling 3D shape, however, has critiqued meshes as being topologically inflexible. To capture a wide range of object shapes, a 3D representation must be able to model solid, watertight shapes as well as thin, open surfaces. Recent work has focused on the former, and methods for reconstructing open surfaces do not support fast reconstruction with material and lighting or unconditional generative modelling. Inspired by the observation that open surfaces can be seen as islands floating on watertight surfaces, we parametrize open surfaces by defining a manifold signed distance field on watertight templates. With this parametrization, we further develop a grid-based and differentiable representation that handles both watertight and non-watertight meshes of arbitrary topology. Our new representation, called Ghost-on-the-Shell (G-Shell), enables two important applications: differentiable rasterization-based reconstruction from multiview images and generative modelling of non-watertight meshes. We empirically demonstrate that G-Shell achieves state-of-the-art performance on non-watertight mesh reconstruction and generation tasks, while also performing effectively for watertight meshes.
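The "islands floating on a watertight template" intuition can be shown with a crude discrete sketch: a scalar field on the template's vertices marks where the open surface exists, and faces whose field values are all positive are kept. The actual G-Shell representation extracts the open boundary continuously and differentiably rather than by discarding whole faces, so the code below is only an assumption-laden caricature of the idea.

```python
import numpy as np

def extract_open_surface(faces, vertex_msdf):
    """Keep template faces whose vertices all have a positive manifold SDF value.

    Crude illustration of carving an open 'island' out of a watertight
    template; the real method is continuous and differentiable.
    """
    keep = (vertex_msdf[faces] > 0).all(axis=1)
    return faces[keep]

faces = np.array([[0, 1, 2], [1, 2, 3], [2, 3, 4]])   # watertight template faces
msdf = np.array([0.5, 0.2, 0.1, -0.3, -0.4])          # negative values cut the surface open
open_faces = extract_open_surface(faces, msdf)         # -> only the face away from the cut
```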

Organizers: Yandong Wen