Humans live within a 3D space and constantly interact with it to perform tasks. Such interactions involve physical contact between surfaces that is semantically meaningful. Our goal is to learn how humans interact with scenes and to leverage this to enable virtual characters to do the same. This is challenging for a computer because it requires that (1) the generated human bodies are semantically plausible within the 3D environment (e.g., people sitting on a sofa or cooking near a stove), and (2) the generated human-scene interaction is physically feasible, such that the human body and scene do not interpenetrate while body-scene contact still supports the interaction.
While prior work focused on the body as a stick figure, we place full 3D SMPL-X bodies in scenes. The body surface is critical to establishing appropriate semantic and physical interactions. To create training data, we use the PROX dataset [ ], which includes 3D SMPL-X bodies fit to real 3D scenes with ground truth contact information.
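To make the body representation concrete, the sketch below shows how a full SMPL-X body surface can be obtained from shape and pose parameters using the open-source smplx package. The model path and the zero-valued parameters are placeholders; the actual PROX fits come from optimizing these parameters against RGB-D observations, which is not shown here.

```python
# Minimal sketch, assuming the open-source `smplx` package and a local
# directory of SMPL-X model files; the path and parameters are placeholders.
import torch
import smplx

# Load a neutral SMPL-X body model.
body_model = smplx.create("models/", model_type="smplx", gender="neutral")

# Placeholder shape and pose parameters; a real pipeline would use
# parameters fitted to the PROX recordings instead of zeros.
betas = torch.zeros(1, 10)        # body shape coefficients
body_pose = torch.zeros(1, 63)    # 21 body joints x 3 axis-angle values

output = body_model(betas=betas, body_pose=body_pose, return_verts=True)
vertices = output.vertices        # (1, 10475, 3) full body surface mesh
print(vertices.shape)
```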
Our first work, PSI [ ], uses a conditional variational autoencoder to predict semantically plausible 3D human poses conditioned on latent scene representations. We then refine the generated 3D bodies using scene constraints to enforce feasible physical interaction.
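A minimal sketch of such a scene-conditioned VAE is shown below. The layer sizes, feature dimensions, and module names are illustrative assumptions rather than the exact PSI architecture; the point is the structure: encode a body together with a scene feature into a latent code, and decode new bodies for a scene by sampling that code from the prior.

```python
# Hedged sketch of a scene-conditioned VAE in the spirit of PSI;
# dimensions and layers are illustrative assumptions, not the paper's.
import torch
import torch.nn as nn

class SceneConditionedVAE(nn.Module):
    def __init__(self, body_dim=75, scene_dim=512, latent_dim=32):
        super().__init__()
        self.latent_dim = latent_dim
        self.encoder = nn.Sequential(
            nn.Linear(body_dim + scene_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * latent_dim),        # mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + scene_dim, 256), nn.ReLU(),
            nn.Linear(256, body_dim),              # body parameters
        )

    def forward(self, body, scene_feat):
        h = self.encoder(torch.cat([body, scene_feat], dim=-1))
        mu, logvar = h.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        recon = self.decoder(torch.cat([z, scene_feat], dim=-1))
        return recon, mu, logvar

    def sample(self, scene_feat):
        # At test time, sample bodies for a new scene from the prior.
        z = torch.randn(scene_feat.shape[0], self.latent_dim)
        return self.decoder(torch.cat([z, scene_feat], dim=-1))
```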
To synthesize realistic human-scene interactions, it is essential to represent the physical contact and proximity between the body and the world. With PLACE [ ], we explicitly model the proximity between the human body and the 3D scene around it. Specifically, given a set of basis points on a scene mesh, we train a conditional VAE to synthesize the distances from the basis points to the human body surface.
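The sketch below illustrates the kind of body-to-scene proximity feature this amounts to: for every basis point sampled from the scene, the distance to the nearest point on the body surface. How the basis points are sampled, and reducing the body surface to its vertices, are simplifying assumptions for illustration.

```python
# Hedged sketch of a basis-point proximity feature in the spirit of PLACE;
# basis-point sampling and the nearest-vertex distance are assumptions.
import torch

def proximity_feature(basis_points, body_vertices):
    """basis_points: (B, N, 3) points sampled on/near the scene mesh.
    body_vertices: (B, V, 3) SMPL-X body surface vertices.
    Returns (B, N) distances from each basis point to the nearest body vertex;
    a conditional VAE can be trained to generate such distance vectors."""
    d = torch.cdist(basis_points, body_vertices)   # (B, N, V) pairwise distances
    return d.min(dim=-1).values                    # (B, N)

# Toy example: 1024 basis points, 10475 SMPL-X vertices.
feat = proximity_feature(torch.rand(1, 1024, 3), torch.rand(1, 10475, 3))
print(feat.shape)  # torch.Size([1, 1024])
```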
POSA [ ] flips this around and models human-scene interaction in a body-centric representation, which lets it generalize to new scenes. POSA augments SMPL-X such that, for every mesh vertex, it encodes (a) the contact probability with the scene surface and (b) the corresponding semantic scene label. We train POSA as a VAE conditioned on the SMPL-X vertices, using the PROX dataset.
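The sketch below illustrates how such a per-vertex feature map might be assembled from a posed body and a semantically labeled scene. The contact threshold, the number of semantic classes, and the nearest-neighbor lookup are illustrative assumptions, not POSA's exact formulation.

```python
# Hedged sketch of a body-centric, per-vertex interaction feature map in the
# spirit of POSA; threshold, class count, and lookup are assumptions.
import torch
import torch.nn.functional as F

def per_vertex_features(body_vertices, scene_points, scene_labels,
                        num_classes=42, contact_thresh=0.05):
    """body_vertices: (V, 3), scene_points: (S, 3), scene_labels: (S,) int.
    Returns (V, 1 + num_classes): [contact flag | semantic label one-hot]."""
    d = torch.cdist(body_vertices, scene_points)          # (V, S)
    nearest_dist, nearest_idx = d.min(dim=-1)
    contact = (nearest_dist < contact_thresh).float().unsqueeze(-1)  # (V, 1)
    semantics = F.one_hot(scene_labels[nearest_idx].long(),
                          num_classes).float()                       # (V, C)
    # Vertices not in contact carry no semantic label.
    return torch.cat([contact, contact * semantics], dim=-1)
```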
While the above methods produce static poses, SAMP [ ] generates goal-directed human movement in novel scenes. Given a task like "sit on the sofa", SAMP uses a GoalNet to extract the affordances of the sofa. A MotionNet generates sequences of poses to achieve the goal, while an A* algorithm plans a collision-free path through the scene.
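To make the planning step concrete, here is a minimal A* search over a 2D occupancy grid of the walkable floor, as a stand-in for the path-planning module; the grid representation and unit step costs are illustrative assumptions, and GoalNet and MotionNet are not modeled here.

```python
# Hedged sketch of A* path planning on an occupancy grid; the grid and
# costs are assumptions, standing in for the planner used alongside SAMP.
import heapq

def astar(grid, start, goal):
    """grid: 2D list, 0 = free, 1 = occupied; start/goal: (row, col) cells.
    Returns a list of cells from start to goal, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    h = lambda a: abs(a[0] - goal[0]) + abs(a[1] - goal[1])  # Manhattan heuristic
    open_set = [(h(start), 0, start, [start])]               # (f, g, node, path)
    visited = set()
    while open_set:
        _, g, node, path = heapq.heappop(open_set)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = node[0] + dr, node[1] + dc
            if 0 <= r < rows and 0 <= c < cols and grid[r][c] == 0:
                heapq.heappush(open_set,
                               (g + 1 + h((r, c)), g + 1, (r, c), path + [(r, c)]))
    return None
```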
These methods are just the beginning but provide a path for creating digital humans that can behave autonomously in 3D worlds.