Human inference via optimization. (Left) SMPLify estimates configurations of the SMPL body model from 2D body joints detected in images. (Middle) SMPLify-X estimates SMPL-X from whole-body 2D landmarks; note the expressive face and fingers. (Right) Bodies estimated by SMPLify-X (yellow) penetrate 3D objects; PROX (gray) extends SMPLify-X with a 3D scene scan to encourage contact between bodies and objects while discouraging interpenetration.
While data-driven methods for directly regressing 3D humans from 2D images are widely popular, optimization-based methods continue to play an important role. Although typically slower than regression methods, optimization approaches require no training data, can be quickly adapted to new problems, and produce image-aligned results. In our view, the two approaches are not competing but complementary.
Optimization-based approaches directly fit a 3D body model like SMPL to image observations (e.g., detected joint locations, edges, silhouettes, or semantic segmentations). We introduced the first such method, SMPLify [ ], which optimizes SMPL pose and shape to minimize the 2D error between detected joints and projected SMPL joints. Because estimating 3D from 2D is inherently ambiguous, SMPLify introduced a pose prior trained on mocap data and a term that discourages self-penetration.
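The fitting idea above can be sketched on a toy problem. The example below is only an illustration under stated assumptions, not the SMPLify implementation: a hypothetical planar kinematic chain (`toy_joints_3d`) stands in for SMPL, the pose prior is a plain quadratic penalty on joint angles rather than a prior learned from mocap, there is no self-penetration term, and a damped Gauss-Newton solver with a numerical Jacobian stands in for whatever optimizer a real system would use. All function names are made up for this sketch.

```python
import numpy as np

def toy_joints_3d(theta, bone_lengths, root):
    """Hypothetical stand-in for a body model: a planar chain of bones
    whose joints are placed by cumulative joint angles (this is not SMPL)."""
    joints = [np.asarray(root, dtype=float)]
    angle = 0.0
    for t, length in zip(theta, bone_lengths):
        angle += t
        direction = np.array([np.cos(angle), np.sin(angle), 0.0])
        joints.append(joints[-1] + length * direction)
    return np.stack(joints)

def project(joints_3d, focal=500.0):
    """Pinhole projection with the camera at the origin looking down +z."""
    return focal * joints_3d[:, :2] / joints_3d[:, 2:3]

def residuals(theta, joints_2d, conf, bone_lengths, root, prior_weight):
    """Confidence-weighted 2D joint residuals stacked with prior residuals."""
    reproj = project(toy_joints_3d(theta, bone_lengths, root)) - joints_2d
    data = (np.sqrt(conf)[:, None] * reproj).ravel()
    prior = np.sqrt(prior_weight) * theta  # assumed quadratic pose prior
    return np.concatenate([data, prior])

def cost(theta, *args):
    """Sum of squared residuals: 2D joint error plus prior penalty."""
    return float(np.sum(residuals(theta, *args) ** 2))

def numerical_jacobian(theta, *args, eps=1e-6):
    """Central-difference Jacobian of the residual vector."""
    n_res = residuals(theta, *args).size
    jac = np.zeros((n_res, theta.size))
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        jac[:, i] = (residuals(theta + d, *args)
                     - residuals(theta - d, *args)) / (2 * eps)
    return jac

def fit(joints_2d, conf, bone_lengths, root, prior_weight=1e-3, steps=50):
    """Damped Gauss-Newton (Levenberg-Marquardt style) pose fitting."""
    args = (joints_2d, conf, bone_lengths, root, prior_weight)
    theta = np.zeros(len(bone_lengths))
    damping = 1e-3
    for _ in range(steps):
        r = residuals(theta, *args)
        jac = numerical_jacobian(theta, *args)
        lhs = jac.T @ jac + damping * np.eye(theta.size)
        delta = np.linalg.solve(lhs, -jac.T @ r)
        if cost(theta + delta, *args) < cost(theta, *args):
            theta, damping = theta + delta, damping * 0.1  # accept the step
        else:
            damping *= 10.0  # reject the step and damp more strongly
    return theta
```

Calling `fit(joints_2d, conf, bone_lengths, root)` with 2D joints projected from a known pose returns joint angles whose projection matches the detections. The per-joint confidence weights play the role of detector confidences, so unreliable detections contribute less to the data term.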
With SMPLify-X [ ], we extended this concept to estimate the expressive SMPL-X model by fitting it to 2D landmarks from OpenPose. SMPLify-X introduced several improvements, including a gender classifier so that the estimated body shape better matches the image. We also introduced VPoser, a better VAE-based pose prior trained on AMASS, and we improved the interpenetration detection.
Because images with ground-truth human pose and shape are hard to obtain, these optimization methods provide critical pseudo ground truth for training deep regression networks. For example, we use SMPLify-X to obtain SMPL-X fits to images and use these to train ExPose [ ]. With SPIN [ ], we showed that an even tighter integration of regression and optimization is valuable and synergistic. SPIN uses a regressor to initialize SMPLify, which is then run for a few optimization steps, improving the fit. These improved fits are then used to retrain the regressor. By doing this in a loop, we incrementally obtain better training data and a better regressor. This training approach is now widely used.
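The regression-optimization loop described above can be illustrated with a deliberately simplified sketch. Everything here is an assumption made for illustration and is not the SPIN implementation: the "body model" is a known linear map `A` from parameters to observations, the "regressor" is a linear least-squares model rather than a deep network, and the "optimizer" is a few gradient steps on the fitting error. Only the structure of the loop (regress, refine, retrain) mirrors the text.

```python
import numpy as np

def refine(params, obs, A, steps=3):
    """A few gradient steps on 0.5 * ||A p - obs||^2, starting from `params`.
    Rows of `params` and `obs` are independent samples."""
    step_size = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / largest squared singular value
    for _ in range(steps):
        residual = params @ A.T - obs
        params = params - step_size * residual @ A
    return params

def fit_error(params, obs, A):
    """Mean squared fitting error between modeled and observed data."""
    return float(np.mean(np.sum((params @ A.T - obs) ** 2, axis=1)))

def train_regressor(obs, pseudo_gt):
    """Least-squares 'regressor' mapping observations to parameters."""
    W, *_ = np.linalg.lstsq(obs, pseudo_gt, rcond=None)
    return W

def spin_style_loop(obs, A, rounds=5):
    """Alternate between regression, short optimization, and retraining."""
    W = np.zeros((obs.shape[1], A.shape[1]))  # untrained regressor
    errors = []
    for _ in range(rounds):
        init = obs @ W                      # regressor initializes the fit
        fits = refine(init, obs, A)         # a few optimization steps
        errors.append(fit_error(fits, obs, A))
        W = train_regressor(obs, fits)      # refined fits act as pseudo ground truth
    return W, errors

# Synthetic data: the true parameters are never shown to the regressor.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 4))
true_params = rng.normal(size=(200, 4))
obs = true_params @ A.T + 0.01 * rng.normal(size=(200, 8))
W, errors = spin_style_loop(obs, A)
```

In this toy setting the fitting error recorded in `errors` shrinks from round to round, because each retrained regressor hands the optimizer a better starting point than the previous one. Note that no ground-truth parameters enter training; the supervision comes entirely from the refined fits.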
The basic SMPLify(-X) approach is easily adapted to new problems, making it a foundational tool in our research. For example, we extended it to perform multi-view fitting and use silhouettes [ ], which we exploited to create the AGORA [ ] and SPEC-MTP [ ] datasets. We use it with aerial vehicles to simultaneously solve for camera extrinsics and body pose in multi-view images [ ]. We adapted it to RGB-D images by including a depth loss and scene contact constraints in the objective function, enabling the creation of the PROX dataset [ ]. We added constraints related to self-contact and exploited this to create the training and test data for TUCH [ ].
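To give a feel for how scene constraints can enter such an objective, here is a minimal sketch of a penetration term and a contact term. It rests on several assumptions that are not taken from the PROX work: the scene is a single plane instead of a scanned 3D mesh, the set of candidate contact vertices (`contact_idx`) is given by hand, and the contact distance is robustified with a Geman-McClure function chosen only for illustration. The function names are hypothetical.

```python
import numpy as np

def signed_distance_to_plane(points, normal, offset):
    """Signed distance of each point to the plane {x : n . x = offset};
    negative values lie below the surface (inside the scene)."""
    normal = np.asarray(normal, dtype=float)
    normal = normal / np.linalg.norm(normal)
    return np.asarray(points, dtype=float) @ normal - offset

def scene_terms(vertices, contact_idx, normal, offset, sigma=0.05):
    """Return (penetration, contact) penalties for body vertices.

    penetration: squared depth of every vertex that is below the surface.
    contact: robustified distance of candidate contact vertices to the
             surface, which pulls them toward it (assumed Geman-McClure).
    """
    sdf = signed_distance_to_plane(vertices, normal, offset)
    penetration = float(np.sum(np.minimum(sdf, 0.0) ** 2))
    d2 = sdf[contact_idx] ** 2
    contact = float(np.sum(d2 / (d2 + sigma ** 2)))
    return penetration, contact
```

In a full objective, both values would be added to the data and prior terms with their own weights, so that the optimizer trades image evidence against physical plausibility. A scanned scene would replace the plane with a signed distance field computed from the scan.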