Our goal is to extract detailed 3D information about people, their facial expressions, and their interactions with each other and the world. To do so, we train deep neural networks to regress SMPL-X body parameters (shape, body pose, hand pose, and facial expression) directly from image pixels.
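The idea of regressing SMPL-X parameters directly from pixels can be sketched as follows. This is a toy stand-in, not the actual networks described here: a single linear layer replaces the CNN backbone, and the parameter-group sizes are illustrative assumptions (real SMPL-X dimensions depend on the model configuration).

```python
import numpy as np

# Illustrative SMPL-X parameter-group sizes (assumptions for this sketch;
# actual dimensions depend on the model configuration).
PARAM_DIMS = {"shape": 10, "body_pose": 63, "hand_pose": 90, "expression": 10}

def regress_smplx_params(image, weights):
    """Toy 'network': map flattened pixels to one parameter vector,
    then split it into named SMPL-X parameter groups."""
    features = image.reshape(-1)
    theta = weights @ features  # one linear layer stands in for a deep CNN
    params, offset = {}, 0
    for name, dim in PARAM_DIMS.items():
        params[name] = theta[offset:offset + dim]
        offset += dim
    return params

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))
W = rng.standard_normal((sum(PARAM_DIMS.values()), img.size)) * 0.01
out = regress_smplx_params(img, W)
```

The point of the sketch is the output structure: one forward pass yields every parameter group at once, which is what makes single-image, full-body capture possible.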
As humans, we influence the world through our bodies. We express our emotions through our facial expressions and body posture. We manipulate and change the world with our hands. For computers to be full partners with humans, they have to see us and understand our behavior. They have to recognize our facial expressions, our gestures, our movements and our actions. This means that we need robust algorithms and expressive representations that can capture human pose, motion, and behavior.
There is a long history of research on this topic, but in Perceiving Systems our focus has always been on developing 3D representations of the body, extracting these from images and video, and using them as the foundation for human behavior analysis.
Representing and extracting 3D body shape and pose has not historically been the dominant paradigm in the field, but this is now changing, due in part to the introduction of our SMPL body model. SMPL is accurate, easy to use, compatible with game engines, and differentiable, and it is now widely used in both research and industry. It can be easily fit to image data "top down" or integrated into the end-to-end training of neural networks.
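The "top down" fitting mentioned above can be illustrated with a minimal optimization loop in the spirit of SMPLify: adjust pose parameters so that the model's projected joints match detected 2D keypoints. Everything here is a stand-in assumption for the sketch — the linearized joint model below replaces the real (nonlinear) SMPL joint function, and plain gradient descent replaces the actual optimizer and priors.

```python
import numpy as np

def joints_2d(pose, basis, mean_joints):
    """Linearized stand-in for the SMPL joint function:
    mean 2D joint locations plus pose-dependent offsets."""
    return mean_joints + basis @ pose

def fit_pose(target, basis, mean_joints, lr=0.05, steps=300):
    """Minimize 0.5 * ||joints_2d(pose) - target||^2 by gradient descent,
    mimicking the reprojection term of a top-down fit."""
    pose = np.zeros(basis.shape[1])
    for _ in range(steps):
        residual = joints_2d(pose, basis, mean_joints) - target
        pose -= lr * (basis.T @ residual)  # gradient of the squared error
    return pose

rng = np.random.default_rng(1)
basis = rng.standard_normal((24, 6)) * 0.5   # 12 joints x 2 coords, 6 pose dofs
mean_joints = rng.standard_normal(24)
true_pose = rng.standard_normal(6)
target = joints_2d(true_pose, basis, mean_joints)
fitted = fit_pose(target, basis, mean_joints)
```

Because the joint model is differentiable, the same reprojection objective can either be minimized at test time (top down) or used as a training loss for an end-to-end regressor — that duality is exactly what methods like SPIN exploit.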
Over the last six years we have shown how to fit SMPL and SMPL-X to image data and how to train deep networks end-to-end to extract full-body shape, pose, and facial expressions from single images or videos. This includes the following methods, which provide foundational tools for capturing and analyzing human motion in natural settings:
- SMPLify [ ]
- Unite the People [ ]
- Human Mesh Recovery (HMR) [ ]
- Neural Body Fitting [ ]
- SMPLify-X [ ]
- SPIN [ ]
- PROX [ ]
- VIBE [ ]
- ExPose [ ]
- TUCH [ ]
- PARE [ ]
- SPEC [ ]
- ROMP [ ]
- PIXIE [ ]
We are often interested in the interaction between multiple people and between people and objects. Consequently, we have developed methods for detecting and tracking people in crowded scenes, for recognizing hand-object interactions, and for tracking 3D human motion in the wild.
Our ultimate goal is to understand behavior. To do so, we first want to capture it at scale. This means robustly and efficiently tracking human behavior in natural settings and relating that behavior to the 3D world around the person.