I will present three recent projects within the 3D Deep Learning research line from my team at Google Research: (1) a deep network for reconstructing the 3D shape of multiple objects appearing in a single RGB image (ECCV'20); (2) a new conditioning scheme for normalizing flow models, which enables several applications such as reconstructing an object's 3D point cloud from an image, or the converse problem of rendering an image given a 3D point cloud, both within the same modeling framework (CVPR'20); (3) a neural rendering framework that maps a voxelized object into a high-quality image. It renders highly textured objects and illumination effects such as reflections and shadows realistically, and it allows controllable rendering: geometric and appearance modifications in the input are accurately represented in the final rendering (CVPR'20).
Game development requires a vast array of tools, techniques, and expertise, ranging from game design and artistic content creation to data management and low-level engine programming. Yet all of these domains have one kind of task in common: the transformation of one kind of data into another. Meanwhile, advances in Machine Learning have resulted in a fundamental change in how we think about these kinds of data transformations, allowing for accurate and scalable function approximation and the ability to train such approximations on virtually unlimited amounts of data. In this talk I will present how these two fundamental changes in Computer Science affect game development, how they can be used to improve game technology as well as the way games are built, and the exciting new possibilities and challenges they bring along the way.
Organizers: Abhinanda Ranjit Punnakkal
Accurate 3D human pose estimation has been a longstanding goal in computer vision. However, until now, it has had only limited success, mostly in easy scenarios such as studios with little occlusion. In this talk, I will present two of our works that aim to address the occlusion problem in realistic scenarios. In the first work, we present an approach to recover the absolute 3D pose of a single person from multi-view images by incorporating multi-view geometric priors in our model. It consists of two separate steps: (1) estimating the 2D poses in the multi-view images and (2) recovering the 3D poses from the multi-view 2D poses. First, we introduce a cross-view fusion scheme into a CNN to jointly estimate 2D poses for multiple views. Consequently, the 2D pose estimation for each view already benefits from the other views. Second, we present a recursive Pictorial Structure Model to recover the 3D pose from the multi-view 2D poses. It gradually improves the accuracy of the 3D pose at an affordable computational cost. In the second work, we present a 3D pose estimator that allows us to reliably estimate and track people in crowded scenes. In contrast to previous efforts, which require establishing cross-view correspondences based on noisy and incomplete 2D pose estimates, we present an end-to-end solution that operates directly in 3D space and therefore avoids making incorrect hard decisions in 2D space. To achieve this goal, the features in all camera views are warped and aggregated in a common 3D space, and fed to a Cuboid Proposal Network (CPN) to coarsely localize all people. Then we propose a Pose Regression Network (PRN) to estimate a detailed 3D pose for each proposal. The approach is robust to occlusion, which occurs frequently in practice. Without bells and whistles, it significantly outperforms the state of the art on the benchmark datasets.
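To make the idea of aggregating per-view features in a common 3D space more concrete, here is a minimal sketch, not the method from the talk. It assumes simple pinhole cameras with identity rotation, plain 2D heatmaps stored as nested lists, nearest-neighbour sampling, and averaging across views; the names `Camera`, `project`, `sample` and `aggregate` are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple, Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]
Heatmap = List[List[float]]  # heatmap[row][col], one per camera view


class Camera(NamedTuple):
    """Hypothetical pinhole camera looking along +z with identity rotation."""
    focal: float
    cx: float
    cy: float
    position: Point3D


def project(cam: Camera, point: Point3D) -> Optional[Tuple[float, float]]:
    """Project a 3D point to pixel coordinates; None if it lies behind the camera."""
    x = point[0] - cam.position[0]
    y = point[1] - cam.position[1]
    z = point[2] - cam.position[2]
    if z <= 0:
        return None
    return (cam.focal * x / z + cam.cx, cam.focal * y / z + cam.cy)


def sample(heatmap: Heatmap, u: float, v: float) -> float:
    """Nearest-neighbour lookup; returns 0.0 outside the image."""
    col, row = int(round(u)), int(round(v))
    if 0 <= row < len(heatmap) and 0 <= col < len(heatmap[0]):
        return heatmap[row][col]
    return 0.0


def aggregate(point: Point3D, cameras: Sequence[Camera],
              heatmaps: Sequence[Heatmap]) -> float:
    """Average the heatmap responses that one 3D location receives from all views."""
    values = []
    for cam, heatmap in zip(cameras, heatmaps):
        uv = project(cam, point)
        values.append(0.0 if uv is None else sample(heatmap, uv[0], uv[1]))
    return sum(values) / len(values)
```

Evaluating `aggregate` at the centre of every cell of a voxel grid would yield a 3D volume in which locations supported by several views score highly, so no hard 2D cross-view matching decision is needed in this sketch.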
Organizers: Chun-Hao Paul Huang
In this talk I will present an overview of our recent works that learn deep geometric models for the 3D face from large datasets of scans. Priors for the 3D face are crucial for many applications: to constrain ill-posed problems such as 3D reconstruction from monocular input, for efficient generation and animation of 3D virtual avatars, or even in medical domains such as recognition of craniofacial disorders. Generative models of the face have been widely used for this task, as well as deep learning approaches that have recently emerged as a robust alternative. Barring a few exceptions, most of these data-driven approaches were built either from a relatively limited number of samples (in the case of linear models of the shape) or by synthetic data augmentation (for deep-learning-based approaches), mainly due to the difficulty of obtaining large-scale and accurate 3D scans of the face. Yet there is a substantial amount of 3D information that can be gathered from publicly available datasets that have been captured over the last decade. I will discuss here our works that tackle the challenges of building rich geometric models out of these large and varied datasets, with the goal of modeling the facial shape, expression (i.e. motion) or geometric details. Concretely, I will talk about (1) an efficient and fully automatic approach for registration of large datasets of 3D faces in motion; (2) deep learning methods for modeling the facial geometry that can disentangle the shape and expression aspects of the face; and (3) a multi-modal learning approach for capturing geometric details from in-the-wild images, by simultaneously encoding both facial surface normal and natural image information.
Organizers: Jinlong Yang
Biological motion is fascinating in almost every aspect. Locomotion in particular plays a crucial part in the evolution of life. Structures such as bones connected by joints, soft and connective tissues, and contracting proteins in a muscle-tendon unit enable and prescribe each species' specific locomotion pattern. Most importantly, biological motion is autonomously learned; it is untethered, as there is no external energy supply; and, as is typical for vertebrates, it is muscle-driven. This talk focuses on human motion. Digital models and biologically inspired robots are presented, built for a better understanding of biology’s complexity. Modeling musculoskeletal systems reveals that the mapping from muscle stimulations to movement dynamics is highly nonlinear and complex, which makes it difficult to control those systems with classical techniques. However, experiments on a simulated musculoskeletal model of a human arm and leg and on real biomimetic muscle-driven robots show that it is possible to learn an accurate controller despite high redundancy and nonlinearity, while retaining sample efficiency. More examples of active muscle-driven motion will be given.
Organizers: Ahmed Osman
In this talk, I will present the most recent advances in data-driven character animation and control using neural networks. Creating key-framed animations by hand is typically very time-consuming and requires a lot of artistic expertise and training. Recent work applying deep learning to character animation was the first to compete with, or even outperform, the quality achieved by professional animators for biped locomotion, and thus caused a lot of excitement in both academia and industry. Shortly after, follow-up research also demonstrated its applicability to quadruped locomotion control, which has been considered one of the key unsolved challenges in character animation due to the highly complex footfall patterns of quadruped characters. Addressing the next challenges beyond character locomotion, this year at SIGGRAPH Asia we presented the Neural State Machine, an improved version of these previous systems that learns from motion capture data to make human characters interact naturally with objects and the environment. Generally, the difficulty in such tasks lies in the complex planning of periodic and aperiodic movements that react to the scene geometry in order to precisely position and orient the character, and in adapting to variations in the type, size and shape of such objects. We demonstrate the versatility of this framework with various scene interaction tasks, such as sitting on a chair, avoiding obstacles, opening and entering through a door, and picking up and carrying objects, all generated in real time from a single model.
The body is one of the most relevant aspects of our self, and we shape it through our eating behavior and physical activity. As a psychologist and neuroscientist, I seek to disentangle the mutual interactions between how we represent our own body, what we eat and how much we exercise. In the talk, I will give a scoping overview of this approach and present the studies I am conducting as a guest scientist at PS.
Organizers: Ahmed Osman
Computation has fundamentally changed the way we study nature. New data collection technologies, such as GPS, high-definition cameras, UAVs, genotyping, and crowdsourcing, are generating data about wild populations that are orders of magnitude richer than any previously collected. Unfortunately, in this domain as in many others, our ability to analyze data lags substantially behind our ability to collect it. In this talk I will show how computational approaches can be part of every stage of the scientific process of understanding animal sociality, from intelligent data collection (crowdsourcing photographs and identifying individual animals from photographs by stripes and spots - Wildbook.org) to hypothesis formulation (by designing a novel computational framework for analysis of dynamic social networks), and how they provide scientific insight into the collective behavior of zebras, baboons, and other social animals.
Organizers: Aamir Ahmad
Endowing robots with human-like physical reasoning abilities remains challenging. We argue that existing methods often disregard spatio-temporal relations, and that by using Graph Neural Networks (GNNs), which incorporate a relational inductive bias, we can shift the learning process towards exploiting relations. In this work, we learn action-conditional forward dynamics models of a simulated manipulation task from visual observations involving cluttered and irregularly shaped objects. We investigate two GNN approaches and empirically assess their capability to generalize to scenarios with novel objects and an increasing number of objects. The first approach, based on Graph Networks (GN), considers explicitly defined edge attributes. Not only does it consistently underperform an auto-encoder baseline that we modified to predict future states, but our results also indicate that different edge attributes can significantly influence the predictions. Consequently, we develop the Auto-Predictor, which does not rely on explicitly defined edge attributes. It outperforms the baseline and the GN-based models. Overall, our results show the sensitivity of GNN-based approaches to the task representation and the efficacy of relational inductive biases, and they advocate choosing lightweight approaches that implicitly reason about relations over ones that leave these decisions to human designers.
Organizers: Siyu Tang
In the first part of the talk, I am going to present our work on human pose estimation in the wild, that is, in unconstrained images and videos containing an a priori unknown number of people, often occluded and exhibiting a wide range of articulations and appearances. Unlike conventional top-down approaches that first detect humans with an off-the-shelf object detector and then estimate poses independently per bounding box, our formulation performs joint detection and pose estimation. In the first stage we indiscriminately localise the body parts of every person in the image with a state-of-the-art ConvNet-based keypoint detector. In the second stage we assign keypoints to people using a graph partitioning approach that solves an integer linear program under a set of constraints, with the vertex and edge costs computed by our ConvNet. Our method naturally generalises to articulated tracking of multiple humans in video sequences. Next, I will discuss our work on learning accurate 3D object shape and camera pose from a collection of unlabeled category-specific images. We train a convolutional network to predict both the shape and the pose from a single image by minimizing the reprojection error: given several views of an object, the projections of the predicted shapes to the predicted camera poses should match the provided views. To deal with pose ambiguity, we introduce an ensemble of pose predictors that we then distill into a single "student" model. To allow for efficient learning of high-fidelity shapes, we represent the shapes by point clouds and devise a formulation allowing for differentiable projection of these. Finally, I will talk about how to reconstruct the appearance of three-dimensional objects, namely a method for generating a 3D human avatar from an image. Our model predicts a full texture map, clothing segmentation and displacement map.
The learning is done in the UV-space of the SMPL model, which turns the hard 3D inference problem into an image-to-image translation task, where we can use deep neural networks to encode appearance, geometry and clothing layout. Our model is trained on a dataset of over 4000 3D scans of humans in diverse clothing.
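The reprojection error mentioned in this abstract can be illustrated with a small sketch. It is not the formulation from the talk: it assumes an orthographic camera whose pose is a single rotation about the vertical axis, and it assumes known point-to-point correspondences between predicted 3D points and observed 2D points, whereas the abstract describes a differentiable projection of point clouds. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def rotate_y(point: Point3D, yaw: float) -> Point3D:
    """Rotate a point about the y axis by `yaw` radians."""
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x + s * z, y, -s * x + c * z)


def project_orthographic(point: Point3D) -> Point2D:
    """Orthographic projection: drop the depth coordinate."""
    return (point[0], point[1])


def reprojection_error(shape: Sequence[Point3D], yaw: float,
                       observed: Sequence[Point2D]) -> float:
    """Mean squared distance between projected shape points and observed 2D points.

    Assumes shape[i] corresponds to observed[i] (a simplification for this sketch).
    """
    total = 0.0
    for point, (u, v) in zip(shape, observed):
        pu, pv = project_orthographic(rotate_y(point, yaw))
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total / len(shape)
```

In a training setup along these lines, the shape and the pose would both be network outputs, and the error would be summed over all available views of the same object, so that a shape is only rewarded if it explains every view under its predicted pose.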
Conversational agents in the form of virtual agents or social robots are rapidly becoming widespread. Humans use non-verbal behaviors to signal their intent, emotions and attitudes in human-human interactions. Conversational agents therefore need this ability as well in order to make an interaction pleasant and efficient. An important part of non-verbal communication is gesticulation: gestures communicate a large share of non-verbal content. Previous systems for gesture production were typically rule-based and could not represent the full range of human gestures. Recently the gesture generation field has shifted to data-driven approaches. We follow this line of research by extending a state-of-the-art deep-learning-based model. Our model leverages representation learning to enhance the speech-gesture mapping. We analyze different representations for the input (speech) and the output (motion) of the network using both objective and subjective evaluations. We also analyze the importance of smoothing the produced motion and emphasize how challenging it is to evaluate gesture quality. In the future we plan to enrich the input signal with semantic context (text transcription), make the model probabilistic, and evaluate our system on the social robot NAO.
Current solutions to discriminative and generative tasks in computer vision exist separately and often lack interpretability and explainability. Using faces as our application domain, here we present an architecture that is based around two core ideas that address these issues: first, our framework learns an unsupervised, low-dimensional embedding of faces using an adversarial autoencoder that is able to synthesize high-quality face images. Second, a supervised disentanglement splits the low-dimensional embedding vector into four sub-vectors, each of which contains separated information about one of four major face attributes (pose, identity, expression, and style) that can be used both for discriminative tasks and for manipulating all four attributes in an explicit manner. The resulting architecture achieves state-of-the-art image quality, good discrimination and face retrieval results on each of the four attributes, and supports various face editing tasks using a face representation of only 99 dimensions. Finally, we apply the architecture's robust image synthesis capabilities to visually debug label-quality issues in an existing face dataset.
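As a rough illustration of splitting one embedding into four attribute sub-vectors and editing a single attribute, consider the sketch below. The abstract states only the total of 99 dimensions; the individual sub-vector sizes used here are hypothetical, as are the function names, and the real system learns this split with supervision rather than slicing by position.

```python
from typing import Dict, List

# Hypothetical sizes that sum to 99; the abstract does not give the actual split.
SUBVECTOR_SIZES: Dict[str, int] = {
    "pose": 24,
    "identity": 25,
    "expression": 25,
    "style": 25,
}


def split_embedding(z: List[float]) -> Dict[str, List[float]]:
    """Slice a flat embedding into named attribute sub-vectors."""
    if len(z) != sum(SUBVECTOR_SIZES.values()):
        raise ValueError("embedding has unexpected length")
    parts, start = {}, 0
    for name, size in SUBVECTOR_SIZES.items():
        parts[name] = z[start:start + size]
        start += size
    return parts


def swap_attribute(z_a: List[float], z_b: List[float], name: str) -> List[float]:
    """Return a copy of z_a whose `name` sub-vector is taken from z_b."""
    parts_a, parts_b = split_embedding(z_a), split_embedding(z_b)
    parts_a[name] = parts_b[name]
    return [value for key in SUBVECTOR_SIZES for value in parts_a[key]]
```

Decoding the output of `swap_attribute(z_a, z_b, "expression")` with the autoencoder's decoder would, under these assumptions, be expected to show face A with the expression of face B, while the same sub-vectors could serve as features for per-attribute retrieval.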
Organizers: Timo Bolkart