Endowing robots with human-like physical reasoning abilities remains challenging. We argue that existing methods often disregard spatio-temporal relations and by using Graph Neural Networks (GNNs) that incorporate a relational inductive bias, we can shift the learning process towards exploiting relations. In this work, we learn action-conditional forward dynamics models of a simulated manipulation task from visual observations involving cluttered and irregularly shaped objects. We investigate two GNN approaches and empirically assess their capability to generalize to scenarios with novel and an increasing number of objects. The first, Graph Networks (GN) based approach, considers explicitly defined edge attributes and not only does it consistently underperform an auto-encoder baseline that we modified to predict future states, our results indicate how different edge attributes can significantly influence the predictions. Consequently, we develop the Auto-Predictor that does not rely on explicitly defined edge attributes. It outperforms the baseline and the GN-based models. Overall, our results show the sensitivity of GNN-based approaches to the task representation, the efficacy of relational inductive biases and advocate choosing lightweight approaches that implicitly reason about relations over ones that leave these decisions to human designers.
Organizers: Siyu Tang
In the first part of the talk, I am going to present our work on human pose estimation in the Wild, capturing unconstrained images and videos containing an a priori unknown number of people, often occluded and exhibiting a wide range of articulations and appearances. Unlike conventional top-down approaches that first detect humans with the off-the-shelf object detector and then estimate poses independently per bounding box, our formulation performs joint detection and pose estimation. In the first stage we indiscriminately localise body parts of every person in the image with the state-of-the-art ConvNet-based keypoint detector. In the second stage we perform assignment of keypoints to people based on a graph partitioning approach, that minimizes an integer linear program under a set of contraints and with the vertex and edge costs computed by our ConvNet. Our method naturally generalises to articulated tracking of multiple humans in video sequences. Next, I will discuss our work on learning accurate 3D object shape and camera pose from a collection of unlabeled category-specific images. We train a convolutional network to predict both the shape and the pose from a single image by minimizing the reprojection error: given several views of an object, the projections of the predicted shapes to the predicted camera poses should match the provided views. To deal with pose ambiguity, we introduce an ensemble of pose predictors that we then distill it to a single "student" model. To allow for efficient learning of high-fidelity shapes, we represent the shapes by point clouds and devise a formulation allowing for differentiable projection of these. Finally, I will talk about how to reconstruct an appearance of three-dimensional objects, namely a method for generating a 3D human avatar from an image. Our model predicts a full texture map, clothing segmentation and displacement map. The learning is done in the UV-space of the SMPL model, which turns the hard 3D inference problem into image-to-image translation task, where we can use deep neural networks to encode appearance, geometry and clothing layout. Our model is trained on a dataset of over 4000 3D scans of humans in diverse clothing.
Conversational agents in the form of virtual agents or social robots are rapidly becoming wide-spread. Humans use non-verbal behaviors to signal their intent, emotions and attitudes in human-human interactions. Conversational agents therefore need this ability as well in order to make an interaction pleasant and efficient. An important part of non-verbal communication is gesticulation: gestures communicate a large share of non-verbal content. Previous systems for gesture production were typically rule-based and could not represent the range of human gestures. Recently the gesture generation field has shifted to data-driven approaches. We follow this line of research by extending the state-of-the-art deep-learning based model. Our model leverages representation learning to enhance speech-gesture mapping. We provide analysis of different representations for the input (speech) and the output (motion) of the network by both objective and subjective evaluations. We also analyze the importance of smoothing of the produced motion and emphasize how challenging it is to evaluate gesture quality. In the future we plan to enrich input signal by taking semantic context (text transcription) as well, make the model probabilistic and evaluate our system on the social robot NAO.
Current solutions to discriminative and generative tasks in computer vision exist separately and often lack interpretability and explainability. Using faces as our application domain, here we present an architecture that is based around two core ideas that address these issues: first, our framework learns an unsupervised, low-dimensional embedding of faces using an adversarial autoencoder that is able to synthesize high-quality face images. Second, a supervised disentanglement splits the low-dimensional embedding vector into four sub-vectors, each of which contains separated information about one of four major face attributes (pose, identity, expression, and style) that can be used both for discriminative tasks and for manipulating all four attributes in an explicit manner. The resulting architecture achieves state-of-the-art image quality, good discrimination and face retrieval results on each of the four attributes, and supports various face editing tasks using a face representation of only 99 dimensions. Finally, we apply the architecture's robust image synthesis capabilities to visually debug label-quality issues in an existing face dataset.
Organizers: Timo Bolkart
Relighting of human images has various applications in image synthesis. For relighting, we must infer albedo, shape, and illumination from a human portrait. Previous techniques rely on human faces for this inference, based on spherical harmonics (SH) lighting. However, because they often ignore light occlusion, inferred shapes are biased and relit images are unnaturally bright particularly at hollowed regions such as armpits, crotches, or garment wrinkles. This paper introduces the first attempt to infer light occlusion in the SH formulation directly. Based on supervised learning using convolutional neural networks (CNNs), we infer not only an albedo map, illumination but also a light transport map that encodes occlusion as nine SH coefficients per pixel. The main difficulty in this inference is the lack of training datasets compared to unlimited variations of human portraits. Surprisingly, geometric information including occlusion can be inferred plausibly even with a small dataset of synthesized human figures, by carefully preparing the dataset so that the CNNs can exploit the data coherency. Our method accomplishes more realistic relighting than the occlusion-ignored formulation.
Deep learning has significantly advanced state-of-the-art for 3D hand pose estimation, of which accuracy can be improved with increased amounts of labelled data. However, acquiring 3D hand pose labels can be extremely difficult. In this talk, I will present our recent two works on leveraging self-supervised learning techniques for hand pose estimation from depth map. In both works, we incorporate differentiable renderer to the network and formulate training loss as model fitting error to update network parameters. In first part of the talk, I will present our earlier work which approximates hand surface with a set of spheres. We then model the pose prior as a variational lower bound with variational auto-encoder(VAE). In second part, I will present our latest work on regressing the vertex coordinates of a hand mesh model with 2D fully convolutional network(FCN) in a single forward pass. In the first stage, the network estimates a dense correspondence field for every pixel on the image grid to the mesh grid. In the second stage, we design a differentiable operator to map features learned from the previous stage and regress a 3D coordinate map on the mesh grid. Finally, we sample from the mesh grid to recover the mesh vertices, and fit it an articulated template mesh in closed form. Without any human annotation, both works can perform competitively with strongly supervised methods. The later work will also be later extended to be compatible with MANO model.
Organizers: Dimitrios Tzionas
Realistic digital avatars are increasingly important in digital media with potential to revolutionize 3D face-to-face communication and social interactions through compelling digital embodiment of ourselves. My goal is to efficiently create high-fidelity 3D avatars from a single image input, captured in an unconstrained environment. These avatars must be close in quality to those created by professional capture systems, yet require minimal computation and no special expertise from the user. These requirements pose several significant technical challenges. A single photograph provides only partial information due to occlusions, and intricate variations in shape and appearance may prevent us from applying traditional template-based approaches. In this talk, I will present our recent work on clothed human reconstruction from a single image. We demonstrate that a careful choice of data representation that can be easily handled by machine learning algorithms is the key to robust and high-fidelity synthesis and inference for human digitization.
Organizers: Timo Bolkart
Over the past century, abdominal surgery has seen a rapid transition from open procedures to less invasive methods such as laparoscopy and robot-assisted minimally invasive surgery (R-A MIS), as they involve reduced blood loss, postoperative morbidity and length of hospital stay. Furthermore, R-A MIS has offered refined accuracy and more ergonomic instruments for surgeons, further minimising trauma to the patient. However, training surgeons in MIS procedures is becoming increasingly long and arduous, while commercially available robotic systems adopt a design similar to conventional laparoscopic instruments with limited novelty. Do these systems satisfy their users? What is the role and importance of haptics? Taking into account the input of end-users as well as examining the high intricacy and dexterity of the human hand can help to bridge the gap between R-A MIS and open surgery. By adopting designs inspired by the human hand, robotic tele-operated systems could become more accessible not only in the surgical domain but, beyond, in areas that benefit from user-centred design such as stroke rehabilitation, as well as in areas where safety issues prevent use of autonomous robots, such as assistive technologies and nuclear industry.
Organizers: Dimitrios Tzionas
In the past few years, significant progress has been made on shape modeling of human body, face, and hands. Yet clothing shape is currently not well presented. Modeling clothing using physics-based simulation can sometimes involve tedious manual work and heavy computation. Therefore, a data-driven learning approach has emerged in the community. In this talk, I will present a stream of work that targeted to learn the shape of clothed human from captured data. It involves 3D body estimation, clothing surface registration and clothing deformation modeling. I will conclude this talk by outlining the current challenges and some promising research directions in this field.
Organizers: Timo Bolkart
Since the release of the Kinect, RGB-D cameras have been used in several consumer devices, including smartphones. In this talk, I will present two challenging uses of this technology. With multiple RGB-D cameras, it is possible to reconstruct a 3D scene and visualize it from any point of view. In the first part of the talk, I will show how such a scene can be streamed and rendered as a point cloud in a compelling way and its appearance improved by the use of external cinema cameras. In the second part of the talk, I will present my work on how an RGB-D camera can be used for enabling real-walking in virtual reality by making the user aware of the surrounding obstacles. I present a pipeline to create an occupancy map from a point cloud on the fly on a mobile phone used as a virtual reality headset. This occupancy map can then be used to prevent the user from hitting physical obstacles when walking in the virtual scene.
Organizers: Sergi Pujades