My research interest is animal and human pose estimation from images and videos. My main contribution to the field is the exploration of realistic 2D and 3D models of human body shape and 3D models of animals for estimating pose and articulated motion in unconstrained images, videos, and from 3D data.
I graduated in Electronic Engineering at the University of Bologna. After graduation I worked for some time in the industry, where I contributed to realize a system for the forecast of the tide in Venice (tide forecast). Then, I joined for a short time the Istituti Ortopedici Rizzoli, a orthopaedic hospital in Bologna (Italy), where I worked on a method for the estimation of knee prostheses kinematics. Before starting my PhD in 2009 I worked as research scientist at ITC-CNR (Milan, Italy), where my research interest was colour imaging, specifically multispectral imaging and reproduction, and readability of coloured text on Web pages. I am currently a research scientist at IMATI-CNR.
In International Conference on Computer Vision, October 2019 (inproceedings)
We present the first method to perform automatic 3D pose, shape and texture capture of animals from images acquired in-the-wild. In particular, we focus on the problem of capturing 3D information about Grevy's zebras from a collection of images. The Grevy's zebra is one of the most endangered species in Africa, with only a few thousand individuals left. Capturing the shape and pose of these animals can provide biologists and conservationists with information about animal health and behavior. In contrast to research on human pose, shape and texture estimation, training data for endangered species is limited, the animals are in complex natural scenes with occlusion, they are naturally camouflaged, travel in herds, and look similar to each other. To overcome these challenges, we integrate the recent SMAL animal model into a network-based regression pipeline, which we train end-to-end on synthetically generated images with pose, shape, and background variation. Going beyond state-of-the-art methods for human shape and pose estimation, our method learns a shape space for zebras during training.
Learning such a shape space from images using only a photometric loss is novel, and the approach can be used to learn shape in other settings with limited 3D supervision.
Moreover, we couple 3D pose and shape prediction with the task of texture synthesis, obtaining a full texture map of the animal from a single image.
We show that the predicted texture map allows a novel per-instance unsupervised optimization over the network features. This method, SMALST (SMAL with learned Shape and Texture) goes beyond previous work, which assumed manual keypoints and/or segmentation, to regress directly from pixels to 3D animal shape, pose and texture.
Code and data are available at https://github.com/silviazuffi/smalst
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, 2018 (inproceedings)
Animals are widespread in nature and the analysis of their shape and motion is important in many fields and industries. Modeling 3D animal shape, however, is difficult because the 3D scanning methods used to capture human shape are not applicable to wild animals or natural settings. Consequently, we propose a method to capture the detailed 3D shape of animals from images alone. The articulated and deformable nature of animals makes this problem extremely challenging, particularly in unconstrained environments with moving and uncalibrated cameras. To make this possible, we use a strong prior model of articulated animal shape that we fit to the image data. We then deform the animal shape in a canonical reference pose such that it matches image evidence when articulated and projected into multiple images. Our method extracts significantly more 3D shape detail than previous methods and is able to model new species, including the shape of an extinct animal, using only a few video frames. Additionally, the projected 3D shapes are accurate enough to facilitate the extraction of a realistic texture map from multiple frames.
A method for generating a body shape, comprising the steps: - receiving one or more linguistic descriptors related to the body shape; - retrieving an association between the one or more linguistic descriptors and a body shape; and - generating the body shape, based on the association.
In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, pages: 5524-5532, IEEE, Piscataway, NJ, USA, July 2017 (inproceedings)
There has been significant work on learning realistic, articulated, 3D models of the human body. In contrast, there are few such models of animals, despite many applications. The main challenge is that animals are much less cooperative than humans. The best human body models are learned from thousands of 3D scans of people in specific poses, which is infeasible with live animals. Consequently,
we learn our model from a small set of 3D scans of toy figurines in arbitrary poses. We employ a novel part-based shape model to compute an initial registration to the scans. We then normalize their pose, learn a statistical shape model, and refine the registrations and the model together. In this way, we accurately align animal scans from different quadruped families with very different shapes and poses. With the registration to a common template we learn a shape space representing animals including lions, cats, dogs, horses, cows and hippos. Animal shapes can be sampled from the model, posed, animated, and fit to data. We demonstrate generalization by fitting it to images of real animals including species not seen in training.
ACM Trans. Graph. (Proc. SIGGRAPH), 35(4):54:1-54:14, July 2016 (article)
Realistic, metrically accurate, 3D human avatars are useful for games, shopping, virtual reality, and health applications. Such avatars are not in wide use because solutions for creating them from high-end scanners, low-cost range cameras, and tailoring measurements all have limitations. Here we propose a simple solution and show that it is surprisingly accurate. We use crowdsourcing to generate attribute ratings of 3D body shapes corresponding to standard linguistic descriptions of 3D shape. We then learn a linear function relating these ratings to 3D human shape parameters. Given an image of a new body, we again turn to the crowd for ratings of the body shape. The collection of linguistic ratings of a photograph provides remarkably strong constraints on the metric 3D shape. We call the process crowdshaping and show that our Body Talk system produces shapes that are perceptually indistinguishable from bodies created from high-resolution scans and that the metric accuracy is sufficient for many tasks. This makes body “scanning” practical without a scanner, opening up new applications including database search, visualization, and extracting avatars from books.
In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2015), pages: 3537-3546, June 2015 (inproceedings)
We propose a new 3D model of the human body that is both realistic and part-based. The body is represented by
a graphical model in which nodes of the graph correspond to body parts that can independently translate and rotate
in 3D as well as deform to capture pose-dependent shape variations. Pairwise potentials define a “stitching cost” for
pulling the limbs apart, giving rise to the stitched puppet model (SPM). Unlike existing realistic 3D body models, the
distributed representation facilitates inference by allowing the model to more effectively explore the space of poses,
much like existing 2D pictorial structures models. We infer pose and body shape using a form of particle-based max-product belief propagation. This gives the SPM the realism of recent 3D body models with the computational advantages
of part-based models. We apply the SPM to two challenging problems involving estimating human shape and
pose from 3D data. The first is the FAUST mesh alignment challenge (http://faust.is.tue.mpg.de/), where ours is the first method to successfully align all 3D meshes. The second involves estimating pose and shape from crude visual hull representations of complex
In this thesis we address the problem of building shape models of the human body,
in 2D and 3D, which are realistic and efficient to use. We focus our efforts on the
human body, which is highly articulated and has interesting shape variations, but
the approaches we present here can be applied to generic deformable and articulated
objects. To address efficiency, we constrain our models to be part-based and have a
tree-structured representation with pairwise relationships between connected parts.
This allows the application of methods for distributed inference based on message
passing. To address realism, we exploit recent advances in computer graphics that
represent the human body with statistical shape models learned from 3D scans.
We introduce two articulated body models, a 2D model, named Deformable Structures
(DS), which is a contour-based model parameterized for 2D pose and projected
shape, and a 3D model, named Stitchable Puppet (SP), which is a mesh-based model
parameterized for 3D pose, pose-dependent deformations and intrinsic body shape.
We have successfully applied the models to interesting and challenging problems in
computer vision and computer graphics, namely pose estimation from static images,
pose estimation from video sequences, pose and shape estimation from 3D scan data.
This advances the state of the art in human pose and shape estimation and suggests
that carefully dened realistic models can be important for computer vision.
More work at the intersection of vision and graphics is thus encouraged.
In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 32(1):1152-1160, J. Machine Learning Research Workshop and Conf. and Proc., Beijing, China, June 2014 (inproceedings)
In applications of graphical models arising in domains such as computer vision and signal processing,
we often seek the most likely configurations of high-dimensional, continuous variables. We develop a particle-based max-product algorithm which maintains a diverse set of posterior mode hypotheses, and is robust to initialization.
At each iteration, the set of hypotheses at each node is augmented via stochastic proposals, and then reduced via an efficient selection algorithm. The integer program underlying our optimization-based particle selection minimizes
errors in subsequent max-product message updates. This objective automatically encourages diversity in the maintained hypotheses, without requiring tuning of application-specific distances among hypotheses. By avoiding the stochastic resampling steps underlying particle sum-product algorithms, we also avoid common degeneracies where particles collapse onto a single hypothesis. Our approach significantly outperforms previous particle-based algorithms in experiments focusing on the estimation of human pose from single images.
In IEEE International Conference on Computer Vision (ICCV), pages: 3192-3199, IEEE, Sydney, Australia, December 2013 (inproceedings)
Although action recognition in videos is widely studied, current methods often fail on real-world datasets. Many recent approaches improve accuracy and robustness to cope with challenging video sequences, but it is often unclear
what affects the results most. This paper attempts to provide insights based on a systematic performance evaluation
using thoroughly-annotated data of human actions. We annotate human Joints for the HMDB dataset (J-HMDB). This annotation can be used to derive ground truth optical flow and segmentation. We evaluate current methods using
this dataset and systematically replace the output of various algorithms with ground truth. This enables us to discover what is important – for example, should we work on improving flow algorithms, estimating human bounding boxes, or enabling pose estimation? In summary, we find that highlevel pose features greatly outperform low/mid level features; in particular, pose over time is critical, but current pose estimation algorithms are not yet reliable enough to provide this information. We also find that the accuracy of a top-performing action recognition framework can be greatly increased by refining the underlying low/mid level features; this suggests it is important to improve optical flow and human detection algorithms. Our analysis and JHMDB dataset should facilitate a deeper understanding of action recognition algorithms.
(7), Max Planck Institute for Intelligent Systems, October 2013 (techreport)
We introduce Puppet Flow (PF), a layered model describing the optical flow of a person in a video sequence. We consider video frames composed by two layers: a foreground layer corresponding to a person, and background.
We model the background as an affine flow field. The foreground layer, being a moving person, requires reasoning about the articulated nature of the human body. We thus represent the foreground layer with the Deformable Structures model (DS), a parametrized 2D part-based human body representation. We call the motion field defined through articulated motion and deformation of the DS model, a Puppet Flow. By exploiting the DS representation, Puppet Flow is a parametrized optical flow field, where parameters are the person's pose, gender and body shape.
In IEEE International Conference on Computer Vision (ICCV), pages: 3312-3319, 2013 (inproceedings)
We address the problem of upper-body human pose estimation in uncontrolled monocular video sequences, without manual initialization. Most current methods focus on isolated video frames and often fail to correctly localize arms and hands. Inferring pose over a video sequence is advantageous because poses of people in adjacent frames exhibit properties of smooth variation due to the nature of human and camera motion. To exploit this, previous methods have used prior knowledge about distinctive actions or generic temporal priors combined with static image likelihoods to track people in motion. Here we take a different approach based on a simple observation: Information about how a person moves from frame to frame is present in the optical flow field. We develop an approach for tracking articulated motions that "links" articulated shape models of people in adjacent frames trough the dense optical flow. Key to this approach is a 2D shape model of the body that we use to compute how the body moves over time. The resulting "flowing puppets" provide a way of integrating image evidence across frames to improve pose inference. We apply our method on a challenging dataset of TV video sequences and show state-of-the-art performance.
In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages: 3546-3553, IEEE, June 2012 (inproceedings)
Pictorial Structures (PS) define a probabilistic model of 2D articulated objects in images. Typical PS models assume an object can be represented by a set of rigid parts connected with pairwise constraints that define the prior probability of part configurations. These models are widely used to represent non-rigid articulated objects such as humans and animals despite the fact that such objects have parts that deform non-rigidly. Here we define a new Deformable Structures (DS) model that is a natural extension of previous PS models and that captures the non-rigid shape deformation of the parts. Each part in a DS model is represented by a low-dimensional shape deformation space and pairwise potentials between parts capture how the shape varies with pose and the shape of neighboring parts. A key advantage of such a model is that it more accurately models object boundaries. This enables image likelihood models that are more discriminative than previous PS likelihoods. This likelihood is learned using training imagery annotated using a DS “puppet.” We focus on a human DS model learned from 2D projections of a realistic 3D human body model and use it to infer human poses in images using a form of non-parametric belief propagation.
In IEEE Conf. on Computer Vision and Pattern Recognition, (CVPR), pages: 639-646, IEEE, June 2010 (inproceedings)
We define a new “contour person” model of the human body that has the expressive power of a detailed 3D model and the computational benefits of a simple 2D part-based model. The contour person (CP) model is learned from a 3D SCAPE model of the human body that captures natural shape and pose variations; the projected contours of this model, along with their segmentation into parts forms the training set. The CP model factors deformations of the body into three components: shape variation, viewpoint change and part rotation. This latter model also incorporates a learned non-rigid deformation model. The result is a 2D articulated model that is compact to represent, simple to compute with and more expressive than previous models. We demonstrate the value of such a model in 2D pose estimation and segmentation. Given an initial pose from a standard pictorial-structures method, we refine the pose and shape using an objective function that segments the scene into foreground and background regions. The result is a parametric, human-specific, image segmentation.
Our goal is to understand the principles of Perception, Action and Learning in autonomous systems that successfully interact with complex environments and to use this understanding to design future systems