Deep learning has significantly advanced the state of the art in 3D hand pose estimation, and accuracy can be further improved with larger amounts of labelled data. However, acquiring 3D hand pose labels can be extremely difficult. In this talk, I will present our two recent works on leveraging self-supervised learning techniques for hand pose estimation from depth maps. In both works, we incorporate a differentiable renderer into the network and formulate the training loss as a model-fitting error to update the network parameters. In the first part of the talk, I will present our earlier work, which approximates the hand surface with a set of spheres. We then model the pose prior as a variational lower bound with a variational auto-encoder (VAE). In the second part, I will present our latest work on regressing the vertex coordinates of a hand mesh model with a 2D fully convolutional network (FCN) in a single forward pass. In the first stage, the network estimates a dense correspondence field from every pixel on the image grid to the mesh grid. In the second stage, we design a differentiable operator to map the features learned in the previous stage and regress a 3D coordinate map on the mesh grid. Finally, we sample from the mesh grid to recover the mesh vertices and fit an articulated template mesh to them in closed form. Without any human annotation, both works perform competitively with strongly supervised methods. I will also show how the latter work can be extended to be compatible with the MANO model.
Biography: Chengde Wan is a Ph.D. student at the Computer Vision Laboratory under Prof. Luc Van Gool at ETH Zürich, Switzerland. His main research focus is hand pose estimation. Before joining the lab, he worked as a research assistant at the ROSE Lab of Nanyang Technological University (Singapore) in 2014. He also interned at Facebook Reality Lab in 2018. He received his MSc and BSc, both in Electrical Engineering, from Beijing University of Posts and Telecommunications (China) in 2014 and 2011, respectively.