(Left) We use a dataset of 3D hand scans to learn MANO, a statistical model of 3D hand shape. We combine MANO with our SMPL body model to build the holistic SMPL+H model. We register SMPL+H (pink) to 4D scans (white); the results look natural even for missing data or finger webbing in scans. (Middle) We train ObMan, a deep network with a MANO layer, to estimate 3D hand and object meshes from an RGB image of grasping, while encouraging contact and discouraging penetrations. (Right) We capture GRAB, a dataset of real whole-body grasps (blue, yellow), i.e. of people interacting with objects using their body, hands and face. We use GRAB to train GrabNet, a network that generates grasping hands (gray) for unseen objects (yellow).
Hands allow humans to interact with, and use, physical objects, but capturing hand motion is a challenging computer-vision task. To tackle this, we learn a statistical model of the human hand [ ], called MANO, that is trained using many 3D scans of human hands and represents the 3D shape variation across a human population. We combine MANO with the SMPL body model and FLAME face model to obtain the expressive SMPL-X model, which allows us to reconstruct realistic bodies and hands using our 4D scanner, mocap data, or monocular video.
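To give a rough sense of what a statistical hand model like MANO does, the sketch below applies linear shape blend shapes to a template mesh. Only the 778-vertex count is taken from MANO; the arrays themselves are placeholders, not the real model files.

```python
import numpy as np

# Minimal sketch of a MANO-style linear shape space (placeholder arrays, not the released model).
N_VERTS, N_SHAPE = 778, 10                      # MANO meshes have 778 vertices; 10 shape components assumed

v_template = np.zeros((N_VERTS, 3))             # mean hand mesh (placeholder)
shape_dirs = np.random.randn(N_VERTS, 3, N_SHAPE) * 1e-3   # PCA shape directions (placeholder)

def shaped_hand(betas):
    """Apply linear shape blend shapes: v = v_template + S @ betas."""
    return v_template + shape_dirs @ betas      # (778, 3)

betas = np.zeros(N_SHAPE)                       # zero coefficients give the mean hand
print(shaped_hand(betas).shape)                 # (778, 3)
```

The full model additionally adds pose-dependent corrective blend shapes and poses the shaped mesh with linear blend skinning, which this sketch omits.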
MANO can be fitted to noisy input data to reconstruct hands and/or objects [ ] from a monocular RGB-D or multi-view RGB sequence. Motion during interaction also helps to recover the unknown kinematic skeleton of objects [ ].
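Fitting such a model to noisy observations typically amounts to optimizing its pose and shape parameters against the data. The sketch below illustrates this with a toy differentiable forward function standing in for a real MANO layer (e.g. from the smplx package); the tensors, dimensions, and regularization weight are assumptions for illustration.

```python
import torch

# Toy model-fitting sketch: optimize pose/shape parameters so predicted 3D joints
# match noisy observations. `toy_forward` is a stand-in for a real MANO layer.
torch.manual_seed(0)
W = torch.randn(21 * 3, 10 + 45)                # fake linear "model": parameters -> 21 joints

def toy_forward(betas, pose):
    return (W @ torch.cat([betas, pose])).view(21, 3)

target = torch.randn(21, 3)                     # noisy observed 3D joints (placeholder)
betas = torch.zeros(10, requires_grad=True)     # shape coefficients
pose = torch.zeros(45, requires_grad=True)      # hand pose parameters
opt = torch.optim.Adam([betas, pose], lr=0.05)

for _ in range(200):
    opt.zero_grad()
    loss = ((toy_forward(betas, pose) - target) ** 2).mean() + 1e-3 * (betas ** 2).sum()
    loss.backward()
    opt.step()
print(float(loss))
```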
To directly regress hands and objects, we developed ObMan [ ], a deep-learning model that integrates MANO as a network layer, to estimate the 3D hand and object meshes from an RGB image of grasping. For training data, we use MANO and ShapeNet objects to generate synthetic images of hand-object grasps. ObMan's joint hand-object reconstruction allows the network to encourage contact and discourage interpenetration.
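To illustrate how joint hand-object reconstruction lets a network reason about interaction, here is a hedged sketch of two simple terms in the spirit of ObMan's contact and penetration objectives. The distance heuristics, vertex counts, and the crude spherical hand proxy are illustrative, not the paper's actual losses.

```python
import torch

def contact_loss(hand_verts, obj_verts):        # (Nh, 3), (No, 3)
    # Pull each hand vertex toward its nearest object vertex.
    d = torch.cdist(hand_verts, obj_verts)      # pairwise distances (Nh, No)
    return d.min(dim=1).values.mean()

def penetration_loss(obj_verts, hand_center, hand_radius=0.05):
    # Crude stand-in: treat the hand as a sphere and penalize object points inside it.
    d = (obj_verts - hand_center).norm(dim=1)
    return torch.clamp(hand_radius - d, min=0).mean()

hand = torch.rand(778, 3) * 0.1                 # predicted hand vertices (placeholder)
obj = torch.rand(2000, 3) * 0.1                 # predicted object vertices (placeholder)
print(contact_loss(hand, obj), penetration_loss(obj, hand.mean(0)))
```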
Hand-object distance is central to grasping. To model this, we learn a Grasping Field [ ] that characterizes every point in 3D space by its signed distances to the surfaces of the hand and the object. The hand, the object, and the contact area are represented by implicit surfaces in a common space. The Grasping Field is parameterized by a deep neural network trained on ObMan's synthetic data.
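A minimal sketch of how such a field can be parameterized: an MLP takes a 3D query point plus a latent code for the hand-object pair and predicts two signed distances, one to the hand surface and one to the object surface. The layer sizes and latent dimension below are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GraspingFieldMLP(nn.Module):
    """Maps (3D point, latent code) -> (signed dist to hand, signed dist to object)."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + latent_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 2),                  # two signed distances per query point
        )

    def forward(self, points, latent):          # points: (B, N, 3), latent: (B, latent_dim)
        z = latent.unsqueeze(1).expand(-1, points.shape[1], -1)
        return self.net(torch.cat([points, z], dim=-1))

net = GraspingFieldMLP()
sdf = net(torch.rand(4, 1024, 3), torch.rand(4, 256))
print(sdf.shape)                                # (4, 1024, 2)
```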
ObMan's dataset contains hand grasps synthesized by robotics software. However, real human grasps look more varied and natural. Moreover, humans use not only their hands but also their body and face during interactions. We therefore capture GRAB [ ], a dataset of real whole-body human grasps of objects. Using a high-end MoCap system, we capture 10 subjects interacting with 51 objects and reconstruct 3D SMPL-X [ ] human meshes interacting with 3D object meshes, including dynamic poses and in-hand manipulation. We use GRAB to train GrabNet, a deep network that generates 3D hand grasps for unseen 3D objects.
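Conceptually, such a grasp generator can be a conditional decoder that maps an encoding of the object shape and a sampled latent code to MANO hand parameters; sampling different latent codes yields different grasps for the same object. The sketch below is an assumption-laden illustration of that idea, not GrabNet's actual architecture or dimensions.

```python
import torch
import torch.nn as nn

class GraspDecoder(nn.Module):
    """Illustrative conditional grasp decoder: object features + latent -> MANO parameters."""
    def __init__(self, obj_feat_dim=1024, latent_dim=16, mano_params=3 + 3 + 45):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obj_feat_dim + latent_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, mano_params),        # global orientation + translation + hand pose
        )

    def forward(self, obj_feat, z):
        return self.net(torch.cat([obj_feat, z], dim=-1))

decoder = GraspDecoder()
obj_feat = torch.rand(8, 1024)                  # encoding of an unseen object (placeholder)
z = torch.randn(8, 16)                          # latent samples -> diverse grasps per object
mano = decoder(obj_feat, z)                     # (8, 51); a MANO layer would turn these into meshes
print(mano.shape)
```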
GRAB (ECCV 2020) dataset (link)
A dataset of 3D whole-body grasps during human-object interaction.
The dataset contains 1,622,459 frames in total. Each frame has (a conceptual sketch follows this list):
- an expressive 3D SMPL-X human mesh (shaped and posed),
- a 3D rigid object mesh (posed), and
- contact annotations (wherever applicable).
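As a purely illustrative sketch of what one GRAB frame conceptually contains, the record below uses hypothetical field names and shapes (only SMPL-X's 10,475-vertex mesh is a known quantity); see the dataset page for the actual file layout.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GrabFrame:
    smplx_verts: np.ndarray    # (10475, 3) shaped and posed SMPL-X human mesh
    object_verts: np.ndarray   # (N, 3) posed rigid object mesh
    contact: np.ndarray        # per-vertex contact annotations, where applicable
```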
ObMan (CVPR 2019) dataset (link)
Synthetic dataset with:
- rendered RGB images
- fully annotated ground truth (meshes, model parameters, etc.) for both the hand and the object
SIGGRAPH-Asia 2017 (TOG) models/dataset (link)
Models, alignments, and scans for:
- hand-only (MANO)
- body+hand (SMPL+H)
ECCVw 2016 dataset (link)
RGB-D dataset of an object under manipulation. The dataset also contains input 3D template meshes for each object and output articulated models.
IJCV 2016 dataset (link)
Annotated RGB-D and multi-camera RGB dataset of one or two hands interacting with each other and/or with a rigid or articulated object
ICCV 2015 dataset (link)
RGB-D dataset of a hand rotating a rigid object for 3D scanning
GCPR 2014 dataset (link)
Annotated RGB-D dataset of one or two hands interacting with each other
GCPR 2013 dataset (link)
Synthetic dataset of two hands interacting with each other