Deep learning has significantly advanced state-of-the-art for 3D hand pose estimation, of which accuracy can be improved with increased amounts of labelled data. However, acquiring 3D hand pose labels can be extremely difficult. In this talk, I will present our recent two works on leveraging self-supervised learning techniques for hand pose estimation from depth map. In both works, we incorporate differentiable renderer to the network and formulate training loss as model fitting error to update network parameters. In first part of the talk, I will present our earlier work which approximates hand surface with a set of spheres. We then model the pose prior as a variational lower bound with variational auto-encoder(VAE). In second part, I will present our latest work on regressing the vertex coordinates of a hand mesh model with 2D fully convolutional network(FCN) in a single forward pass. In the first stage, the network estimates a dense correspondence field for every pixel on the image grid to the mesh grid. In the second stage, we design a differentiable operator to map features learned from the previous stage and regress a 3D coordinate map on the mesh grid. Finally, we sample from the mesh grid to recover the mesh vertices, and fit it an articulated template mesh in closed form. Without any human annotation, both works can perform competitively with strongly supervised methods. The later work will also be later extended to be compatible with MANO model.
Organizers: Dimitrios Tzionas
Realistic digital avatars are increasingly important in digital media with potential to revolutionize 3D face-to-face communication and social interactions through compelling digital embodiment of ourselves. My goal is to efficiently create high-fidelity 3D avatars from a single image input, captured in an unconstrained environment. These avatars must be close in quality to those created by professional capture systems, yet require minimal computation and no special expertise from the user. These requirements pose several significant technical challenges. A single photograph provides only partial information due to occlusions, and intricate variations in shape and appearance may prevent us from applying traditional template-based approaches. In this talk, I will present our recent work on clothed human reconstruction from a single image. We demonstrate that a careful choice of data representation that can be easily handled by machine learning algorithms is the key to robust and high-fidelity synthesis and inference for human digitization.
Organizers: Timo Bolkart
Over the past century, abdominal surgery has seen a rapid transition from open procedures to less invasive methods such as laparoscopy and robot-assisted minimally invasive surgery (R-A MIS), as they involve reduced blood loss, postoperative morbidity and length of hospital stay. Furthermore, R-A MIS has offered refined accuracy and more ergonomic instruments for surgeons, further minimising trauma to the patient. However, training surgeons in MIS procedures is becoming increasingly long and arduous, while commercially available robotic systems adopt a design similar to conventional laparoscopic instruments with limited novelty. Do these systems satisfy their users? What is the role and importance of haptics? Taking into account the input of end-users as well as examining the high intricacy and dexterity of the human hand can help to bridge the gap between R-A MIS and open surgery. By adopting designs inspired by the human hand, robotic tele-operated systems could become more accessible not only in the surgical domain but, beyond, in areas that benefit from user-centred design such as stroke rehabilitation, as well as in areas where safety issues prevent use of autonomous robots, such as assistive technologies and nuclear industry.
Organizers: Dimitrios Tzionas
In the past few years, significant progress has been made on shape modeling of human body, face, and hands. Yet clothing shape is currently not well presented. Modeling clothing using physics-based simulation can sometimes involve tedious manual work and heavy computation. Therefore, a data-driven learning approach has emerged in the community. In this talk, I will present a stream of work that targeted to learn the shape of clothed human from captured data. It involves 3D body estimation, clothing surface registration and clothing deformation modeling. I will conclude this talk by outlining the current challenges and some promising research directions in this field.
Organizers: Timo Bolkart
Since the release of the Kinect, RGB-D cameras have been used in several consumer devices, including smartphones. In this talk, I will present two challenging uses of this technology. With multiple RGB-D cameras, it is possible to reconstruct a 3D scene and visualize it from any point of view. In the first part of the talk, I will show how such a scene can be streamed and rendered as a point cloud in a compelling way and its appearance improved by the use of external cinema cameras. In the second part of the talk, I will present my work on how an RGB-D camera can be used for enabling real-walking in virtual reality by making the user aware of the surrounding obstacles. I present a pipeline to create an occupancy map from a point cloud on the fly on a mobile phone used as a virtual reality headset. This occupancy map can then be used to prevent the user from hitting physical obstacles when walking in the virtual scene.
Organizers: Sergi Pujades
First, a short analysis of the key components of my participation in SemEval 2018, an emotion analysis contest from tweets. Namely, a transfer learning approach used for emotion classification and a context-aware attention mechanism. In my second paper, I explore how brain information can improve word representations. Neural activation models that have been proposed in the literature use a set of example words for which fMRI measurements are available in order to find a mapping between word semantics and localized neural activations. I use such models to predict neural activations on a full word lexicon. Then, I propose a cognitive computational model that estimates semantic similarity in the neural activation space and investigates the relative performance of this model for various natural language processing tasks. Finally, in my most recent work I explore cross-topic word representations. In traditional Distributional Semantic Models -like word2vec- the multiple senses of a polysemous word are conflated into a single vector space representation. In my work, I propose a DSM that learns multiple distributional representations of a word based on different topics. Moreover, we project the different topic representations in a common space and apply a smoothing technique to group redundant topic vectors.
Organizers: Soubhik Sanyal
Since Hubel and Wiesel's seminal findings in the primary visual cortex (V1) more than 50 years ago, progress in vision science has been very limited along previous frameworks and schools of thoughts on understanding vision. Have we been asking the right questions? I will show observations motivating the new path. First, a drastic information bottleneck forces the brain to process only a tiny fraction of the massive visual input information; this selection is called the attentional selection, how to select this tiny fraction is critical. Second, a large body of evidence has been accumulating to suggest that the primary visual cortex (V1) is where this selection starts, suggesting that the visual cortical areas along the visual pathway beyond V1 must be investigated in light of this selection in V1. Placing attentional selection as the center stage, a new path to understanding vision is proposed (articulated in my book "Understanding vision: theory, models, and data", Oxford University Press 2014). I will show a first example of using this new path, which aims to ask new questions and make fresh progresses. I will relate our insights to artificial vision systems to discuss issues like top-down feedbacks in hierachical processing, analysis-by-synthesis, and image understanding.
Multi-person articulated pose tracking is an important while challenging problem in human behavior understanding. In this talk, going along the road of top-down approaches, I will introduce a decent and efficient pose tracker based on pose flows. This approach can achieve real-time pose tracking without loss of accuracy. Besides, to better understand human activities in visual contents, clothes texture and geometric details also play indispensable roles. However, extrapolating them from a single image is much more difficult than rigid objects due to its large variations in pose, shape, and cloth. I will present a two-stage pipeline to predict human bodies and synthesize human novel views from one single-view image.
Organizers: Siyu Tang
Much existing work in reinforcement learning involves environments that are either intentionally neutral, lacking a role for cooperation and competition, or intentionally simple, when agents need imagine nothing more than that they are playing versions of themselves. Richer game theoretic notions become important as these constraints are relaxed. For humans, this encompasses issues that concern utility, such as envy and guilt, and that concern inference, such as recursive modeling of other players, I will discuss studies treating a paradigmatic game of trust as an interactive partially-observable Markov decision process, and will illustrate the solution concepts with evidence from interactions between various groups of subjects, including those diagnosed with borderline and anti-social personality disorders.
In this talk, I will present my understanding on 3D face reconstruction, modelling and applications from a deep learning perspective. In the first part of my talk, I will discuss the relationship between representations (point clouds, meshes, etc) and network layers (CNN, GCN, etc) on face reconstruction task, then present my ECCV work PRN which proposed a new representation to help achieve state-of-the-art performance on face reconstruction and dense alignment tasks. I will also introduce my open source project face3d that provides examples for generating different 3D face representations. In the second part of the talk, I will talk some publications in integrating 3D techniques into deep networks, then introduce my upcoming work which implements this. In the third part, I will present how related tasks could promote each other in deep learning, including face recognition for face reconstruction task and face reconstruction for face anti-spoofing task. Finally, with such understanding of these three parts, I will present my plans on 3D face modelling and applications.
Organizers: Timo Bolkart
The past few years with the advent of Deep Convolutional Neural Networks (DCNNs), as well as the availability of visual data it was shown that it is possible to produce excellent results in very challenging tasks, such as visual object recognition, detection, tracking etc. Nevertheless, in certain tasks such as fine-grain object recognition (e.g., face recognition) it is very difficult to collect the amount of data that are needed. In this talk, I will show how, using DCNNs, we can generate highly realistic faces and heads and use them for training algorithms such as face and facial expression recognition. Next, I will reverse the problem and demonstrate how by having trained a very powerful face recognition network it can be used to perform very accurate 3D shape and texture reconstruction of faces from a single image. Finally, I will demonstrate how to create very lightweight networks for representing 3D face texture and shape structure by capitalising upon intrinsic mesh convolutions.
Organizers: Dimitrios Tzionas