Perceiving Systems, Computer Vision


2024


PuzzleAvatar: Assembling 3D Avatars from Personal Albums

Xiu, Y., Liu, Z., Tzionas, D., Black, M. J.

ACM Transactions on Graphics, 43(6), ACM, December 2024 (article) To be published

Abstract
Generating personalized 3D avatars is crucial for AR/VR. However, recent text-to-3D methods, which generate avatars of celebrities or fictional characters, struggle with everyday people, while methods for faithful reconstruction typically require full-body images captured in controlled settings. What if a user could just upload their personal "OOTD" (Outfit Of The Day) photo collection and get a faithful avatar in return? The challenge is that such casual photo collections contain diverse poses, challenging viewpoints, cropped views, and occlusion (albeit with a consistent outfit, accessories, and hairstyle). We address this novel "Album2Human" task by developing PuzzleAvatar, a novel model that generates a faithful 3D avatar (in a canonical pose) from a personal OOTD album, while bypassing the challenging estimation of body and camera pose. To this end, we fine-tune a foundational vision-language model (VLM) on such photos, encoding a person's appearance, identity, garments, hairstyle, and accessories into separate learned tokens and instilling these cues into the VLM. In effect, we exploit the learned tokens as "puzzle pieces" from which we assemble a faithful, personalized 3D avatar. Importantly, we can customize avatars by simply interchanging tokens. As a benchmark for this new task, we collect a new dataset, called PuzzleIOI, with 41 subjects in a total of nearly 1K OOTD configurations, in challenging partial photos with paired ground-truth 3D bodies. Evaluation shows that PuzzleAvatar not only achieves high reconstruction accuracy, outperforming TeCH and MVDreamBooth, but also scales uniquely to album photos and is highly robust. Our code and data are publicly available for research purposes.

Page Code Video DOI [BibTex]




StableNormal: Reducing Diffusion Variance for Stable and Sharp Normal

Ye, C., Qiu, L., Gu, X., Zuo, Q., Wu, Y., Dong, Z., Bo, L., Xiu, Y., Han, X.

ACM Transactions on Graphics, 43(6), ACM, December 2024 (article) To be published

Abstract
This work addresses the challenge of high-quality surface normal estimation from monocular colored inputs (i.e., images and videos), a field that has recently been revolutionized by repurposing diffusion priors. However, previous attempts still struggle with stochastic inference, which conflicts with the deterministic nature of the Image2Normal task, and with a costly ensembling step that slows down estimation. Our method, StableNormal, mitigates the stochasticity of the diffusion process by reducing inference variance, producing "Stable-and-Sharp" normal estimates without any additional ensembling. StableNormal works robustly under challenging imaging conditions, such as extreme lighting, blur, and low quality, and is also robust to transparent and reflective surfaces, as well as cluttered scenes with numerous objects. Specifically, StableNormal employs a coarse-to-fine strategy: a one-step normal estimator (YOSO) first derives an initial normal guess that is relatively coarse but reliable, and a semantic-guided refinement process (SG-DRN) then refines the normals to recover geometric details. The effectiveness of StableNormal is demonstrated through competitive performance on standard datasets such as DIODE-indoor, iBims, ScanNetV2, and NYUv2, and on various downstream tasks, such as surface reconstruction and normal enhancement. These results show that StableNormal retains both the "stability" and "sharpness" needed for accurate normal estimation. StableNormal is an early attempt to repurpose diffusion priors for deterministic estimation. Code and models are publicly available.

Page Huggingface Demo Code Video DOI [BibTex]



Re-Thinking Inverse Graphics with Large Language Models

Kulits, P., Feng, H., Liu, W., Abrevaya, V., Black, M. J.

Transactions on Machine Learning Research, August 2024 (article)

Abstract
Inverse graphics -- the task of inverting an image into physical variables that, when rendered, enable reproduction of the observed scene -- is a fundamental challenge in computer vision and graphics. Successfully disentangling an image into its constituent elements, such as the shape, color, and material properties of the objects of the 3D scene that produced it, requires a comprehensive understanding of the environment. This complexity limits the ability of existing carefully engineered approaches to generalize across domains. Inspired by the zero-shot ability of large language models (LLMs) to generalize to novel contexts, we investigate the possibility of leveraging the broad world knowledge encoded in such models to solve inverse-graphics problems. To this end, we propose the Inverse-Graphics Large Language Model (IG-LLM), an inverse-graphics framework centered around an LLM that autoregressively decodes a visual embedding into a structured, compositional 3D-scene representation. We incorporate a frozen pre-trained visual encoder and a continuous numeric head to enable end-to-end training. Through our investigation, we demonstrate the potential of LLMs to facilitate inverse graphics through next-token prediction, without the application of image-space supervision. Our analysis opens up new possibilities for precise spatial reasoning about images that exploit the visual knowledge of LLMs. We release our code and data at https://ig-llm.is.tue.mpg.de/ to ensure the reproducibility of our investigation and to facilitate future research.

link (url) [BibTex]



Modelling Dynamic 3D Human-Object Interactions: From Capture to Synthesis

Taheri, O.

University of Tübingen, July 2024 (phdthesis) To be published

Abstract
Modeling digital humans that move and interact realistically with virtual 3D worlds has recently emerged as an essential research area, with significant applications in computer graphics, virtual and augmented reality, telepresence, the Metaverse, and assistive technologies. In particular, human-object interaction, encompassing full-body motion, hand-object grasping, and object manipulation, lies at the core of how humans execute tasks and represents the complex and diverse nature of human behavior. Therefore, accurate modeling of these interactions would enable us to simulate avatars that perform tasks, enhance animation realism, and develop applications that better perceive and respond to human behavior. Despite its importance, this remains a challenging problem, due to several factors such as the complexity of human motion, the task-dependent variability of interactions, and the lack of rich datasets capturing the complexity of real-world interactions. Prior methods have made progress, but limitations persist, as they often focus on individual aspects of interaction, such as body, hand, or object motion, without considering the holistic interplay among these components. This Ph.D. thesis addresses these challenges and contributes to the advancement of human-object interaction modeling through the development of novel datasets, methods, and algorithms.

[BibTex]



Exploring Weight Bias and Negative Self-Evaluation in Patients with Mood Disorders: Insights from the BodyTalk Project

Meneguzzo, P., Behrens, S. C., Pavan, C., Toffanin, T., Quiros-Ramirez, M. A., Black, M. J., Giel, K., Tenconi, E., Favaro, A.

Frontiers in Psychiatry, 15, Sec. Psychopathology, May 2024 (article)

Abstract
Background: Negative body image and adverse body self-evaluation represent key psychological constructs within the realm of weight bias (WB), potentially intertwined with the negative self-evaluation characteristic of depressive symptomatology. Although WB encapsulates an implicit form of self-critical assessment, its exploration among people with mood disorders (MD) has been under-investigated. Our primary goal is to comprehensively assess both explicit and implicit WB, seeking to reveal specific dimensions that could interconnect with the symptoms of MDs. Methods: A cohort comprising 25 MD patients and 35 demographically matched healthy peers (with 83% female representation) participated in a series of tasks designed to evaluate the congruence between various computer-generated body representations and a spectrum of descriptive adjectives. Our analysis delved into multiple facets of body image evaluation, scrutinizing the associations between different body sizes and emotionally charged adjectives (e.g., active, apple-shaped, attractive). Results: No discernible differences emerged concerning body dissatisfaction or the correspondence of different body sizes with varying adjectives. Interestingly, MD patients exhibited a markedly higher tendency to overestimate their body weight (p = 0.011). Explicit WB did not show significant variance between the two groups, but MD participants demonstrated a notable implicit WB within a specific weight rating task for BMI between 18.5 and 25 kg/m2 (p = 0.012). Conclusions: Despite the striking similarities in the assessment of participants’ body weight, our investigation revealed an implicit WB among individuals grappling with MD. This bias potentially assumes a role in fostering self-directed negative evaluations, shedding light on a previously unexplored facet of the interplay between WB and mood disorders.

paper link (url) DOI [BibTex]



The Poses for Equine Research Dataset (PFERD)

Li, C., Mellbin, Y., Krogager, J., Polikovsky, S., Holmberg, M., Ghorbani, N., Black, M. J., Kjellström, H., Zuffi, S., Hernlund, E.

Nature Scientific Data, 11, May 2024 (article)

Abstract
Studies of quadruped animal motion help us to identify diseases, understand behavior, and unravel the mechanics behind animal gaits. The horse is likely the best-studied animal in this respect, but data capture is challenging and time-consuming. Computer vision techniques improve animal motion extraction, but development relies on reference datasets, which are scarce, not open-access, and often provide data from only a few anatomical landmarks. Addressing this data gap, we introduce PFERD, a video and 3D marker motion dataset of horses, captured with a full-body set-up of over 100 densely placed skin-attached markers and synchronized videos from ten camera angles. Five horses of diverse conformations provide data for various motions, from basic movements (e.g., walking, trotting) to advanced ones (e.g., rearing, kicking). We further express the 3D motions with current techniques and a 3D parameterized model, the hSMAL model, establishing a baseline for markerless 3D horse motion capture. PFERD enables advanced biomechanical studies and provides a resource of ground-truth data for the methodological development of markerless motion capture.

paper [BibTex]



InterCap: Joint Markerless 3D Tracking of Humans and Objects in Interaction from Multi-view RGB-D Images

Huang, Y., Taheri, O., Black, M. J., Tzionas, D.

International Journal of Computer Vision (IJCV), 2024 (article)

Abstract
Humans constantly interact with objects to accomplish tasks. To understand such interactions, computers need to reconstruct them in 3D from images of whole bodies manipulating objects, e.g., grasping, moving, and using them. This involves key challenges, such as occlusion between the body and objects, motion blur, depth ambiguities, and the low image resolution of hands and graspable object parts. To make the problem tractable, the community has followed a divide-and-conquer approach, focusing either only on interacting hands, ignoring the body, or on interacting bodies, ignoring the hands. However, these are only parts of the problem; recent work instead tackles it as a whole. The GRAB dataset addresses whole-body interaction with dexterous hands but captures motion via markers and lacks video, while the BEHAVE dataset captures video of body-object interaction but lacks hand detail. We address the limitations of prior work with InterCap, a novel method that reconstructs interacting whole bodies and objects from multi-view RGB-D data, using the parametric whole-body SMPL-X model and known object meshes. To tackle the above challenges, InterCap uses two key observations: (i) contact between the body and object can be used to improve the pose estimation of both; (ii) consumer-level Azure Kinect cameras let us set up a simple and flexible multi-view RGB-D system that reduces occlusions, with spatially calibrated and temporally synchronized cameras. With our InterCap method we capture the InterCap dataset, which contains 10 subjects (5 male and 5 female) interacting with 10 daily objects of various sizes and affordances, including contact with the hands or feet. To this end, we introduce a new data-driven hand motion prior, and we explore simple ways to detect contact automatically from 2D and 3D cues.
In total, InterCap has 223 RGB-D videos, resulting in 67,357 multi-view frames, each containing 6 RGB-D images, paired with pseudo ground-truth 3D body and object meshes. Our InterCap method and dataset fill an important gap in the literature and support many research directions. Data and code are available at https://intercap.is.tue.mpg.de.
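Observation (i), using body-object contact to improve both pose estimates, is typically realized as an extra term in the fitting objective that pulls nearby surfaces together. Below is a minimal NumPy sketch with hypothetical vertex arrays and an illustrative contact threshold; it is not InterCap's actual implementation.

```python
import numpy as np

def contact_term(body_verts, object_verts, threshold=0.02):
    """Sketch of a contact objective for joint body-object fitting.

    Body vertices that come within `threshold` meters of the object are
    treated as "in contact" and penalized by their squared distance to
    the nearest object vertex, pulling the two estimates together.
    The names and the 2 cm threshold are illustrative assumptions."""
    # all pairwise distances between body and object vertices
    d = np.linalg.norm(body_verts[:, None, :] - object_verts[None, :, :], axis=-1)
    nearest = d.min(axis=1)        # closest object point per body vertex
    active = nearest < threshold   # vertices considered in contact
    return float((nearest[active] ** 2).sum()) if active.any() else 0.0
```

In a full fitting pipeline a term like this would be summed with the usual data terms (2D keypoints, depth) when optimizing body and object pose jointly.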

Paper link (url) DOI [BibTex]


Self- and Interpersonal Contact in 3D Human Mesh Reconstruction

Müller, L.

University of Tübingen, Tübingen, 2024 (phdthesis)

Abstract
The ability to perceive tactile stimuli is of substantial importance for human beings in establishing a connection with the surrounding world. Humans rely on the sense of touch to navigate their environment and to engage in interactions with both themselves and other people. The field of computer vision has made great progress in estimating a person’s body pose and shape from an image, however, the investigation of self- and interpersonal contact has received little attention despite its considerable significance. Estimating contact from images is a challenging endeavor because it necessitates methodologies capable of predicting the full 3D human body surface, i.e. an individual’s pose and shape. The limitations of current methods become evident when considering the two primary datasets and labels employed within the community to supervise the task of human pose and shape estimation. First, the widely used 2D joint locations lack crucial information for representing the entire 3D body surface. Second, in datasets of 3D human bodies, e.g. collected from motion capture systems or body scanners, contact is usually avoided, since it naturally leads to occlusion which complicates data cleaning and can break the data processing pipelines. In this thesis, we first address the problem of estimating contact that humans make with themselves from RGB images. To do this, we introduce two novel methods that we use to create new datasets tailored for the task of human mesh estimation for poses with self-contact. We create (1) 3DCP, a dataset of 3D body scan and motion capture data of humans in poses with self-contact and (2) MTP, a dataset of images taken in the wild with accurate 3D reference data using pose mimicking. Next, we observe that 2D joint locations can be readily labeled at scale given an image, however, an equivalent label for self-contact does not exist. 
Consequently, we introduce (3) discrete self-contact (DSC) annotations indicating the pairwise contact of discrete regions on the human body. We annotate three existing image datasets with discrete self-contact and use these labels during mesh optimization to bring body parts supposed to touch into contact. Then we train TUCH, a human mesh regressor, on our new datasets. When evaluated on the task of human body pose and shape estimation on public benchmarks, our results show that knowing about self-contact not only improves mesh estimates for poses with self-contact, but also for poses without self-contact. Next, we study contact humans make with other individuals during close social interaction. Reconstructing these interactions in 3D is a significant challenge due to mutual occlusion. Furthermore, the existing datasets of images taken in the wild with ground-truth contact labels are of insufficient size to facilitate the training of a robust human mesh regressor. In this work, we employ a generative model, BUDDI, to learn the joint distribution of 3D pose and shape of two individuals during their interaction and use this model as a prior during an optimization routine. To construct training data we leverage pre-existing datasets, i.e. motion capture data and Flickr images with discrete contact annotations. Similar to discrete self-contact labels, we utilize discrete human-human contact to jointly fit two meshes to detected 2D joint locations. The majority of methods for generating 3D humans focus on the motion of a single person and operate on 3D joint locations. While these methods can effectively generate motion, their representation of 3D humans is not sufficient for physical contact since they do not model the body surface. Our approach, in contrast, acts on the pose and shape parameters of a human body model, which enables us to sample 3D meshes of two people.
We further demonstrate how the knowledge of human proxemics, incorporated in our model, can be used to guide an optimization routine. For this, in each optimization iteration, BUDDI takes the current mesh and proposes a refinement that we subsequently consider in the objective function. This procedure enables us to go beyond the state of the art by forgoing ground-truth discrete human-human contact labels during optimization. Self- and interpersonal contact happen on the surface of the human body, yet the majority of existing methods tend to predict bodies with similar, “average” body shape. This is due to a lack of training data pairing images taken in the wild with ground-truth 3D body shape, and because 2D joint locations are not sufficient to explain body shape. The most apparent solution would be to collect body scans of people together with their photos. This is, however, a time-consuming and cost-intensive process that lacks scalability. Instead, we leverage the vocabulary humans use to describe body shape. First, we ask annotators to label how much a word like “tall” or “long legs” applies to a human body. We gather these ratings for rendered meshes of various body shapes, for which we have ground-truth body model shape parameters, and for images collected from model agency websites. Using this data, we learn a shape-to-attribute (A2S) model that predicts body shape ratings from body shape parameters. Then we train a human mesh regressor, SHAPY, on the model agency images, wherein we supervise body shape via attribute annotations using A2S. Since no suitable test set of diverse 3D ground-truth body shape with images taken in natural settings exists, we introduce Human Bodies in the Wild (HBW). This novel dataset contains photographs of individuals together with their body scan. Our model predicts more realistic body shapes from an image and quantitatively improves body shape estimation on this new benchmark.
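The shape-to-attribute idea can be sketched as a simple regression from body shape parameters to attribute ratings. All data below is synthetic, and the linear least-squares form is an assumption for illustration; the actual A2S model may be parameterized differently.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data: 100 bodies with 10 shape parameters ("betas")
# and ratings for 5 attributes (e.g. "tall", "long legs"). The linear
# ground-truth mapping and the noise level are invented for this sketch.
betas = rng.standard_normal((100, 10))
true_map = rng.standard_normal((10, 5))
ratings = betas @ true_map + 0.01 * rng.standard_normal((100, 5))

# Fit the shape-to-attribute model as ordinary least squares: one weight
# vector per attribute, mapping shape parameters to a rating.
A2S, *_ = np.linalg.lstsq(betas, ratings, rcond=None)

predicted = betas @ A2S  # attribute ratings predicted from shape
```

Once such a model exists, attribute annotations on in-the-wild images can supervise body shape indirectly, which is the role A2S plays in training SHAPY.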
In summary, we present novel datasets, optimization methods, a generative model, and regressors to advance the field of 3D human pose and shape estimation. Taken together, these methods open up ways to obtain more accurate and realistic 3D mesh estimates from images with multiple people in self- and mutual contact poses and with diverse body shapes. This line of research also enables generative approaches to create more natural, human-like avatars. We believe that knowing about self- and human-human contact through computer vision has wide-ranging implications in other fields as for example robotics, fitness, or behavioral science.

[BibTex]



Natural Language Control for 3D Human Motion Synthesis

Petrovich, M.

LIGM, Ecole des Ponts, Univ Gustave Eiffel, CNRS, 2024 (phdthesis)

Abstract
3D human motions are at the core of many applications in the film industry, healthcare, augmented reality, virtual reality and video games. However, these applications often rely on expensive and time-consuming motion capture data. The goal of this thesis is to explore generative models as an alternative route to obtain 3D human motions. More specifically, our aim is to allow a natural language interface as a means to control the generation process. To this end, we develop a series of models that synthesize realistic and diverse motions following the semantic inputs. In our first contribution, described in Chapter 3, we address the challenge of generating human motion sequences conditioned on specific action categories. We introduce ACTOR, a conditional variational autoencoder (VAE) that learns an action-aware latent representation for human motions. We show significant gains over existing methods thanks to our new Transformer-based VAE formulation, encoding and decoding SMPL pose sequences through a single motion-level embedding. In our second contribution, described in Chapter 4, we go beyond categorical actions, and dive into the task of synthesizing diverse 3D human motions from textual descriptions allowing a larger vocabulary and potentially more fine-grained control. Our work stands out from previous research by not deterministically generating a single motion sequence, but by synthesizing multiple, varied sequences from a given text. We propose TEMOS, building on our VAE-based ACTOR architecture, but this time integrating a pretrained text encoder to handle large-vocabulary natural language inputs. In our third contribution, described in Chapter 5, we address the adjacent task of text-to-3D human motion retrieval, where the goal is to search in a motion collection by querying via text. 
We introduce a simple yet effective approach, named TMR, building on our earlier model TEMOS, by integrating a contrastive loss to enhance the structure of the cross-modal latent space. Our findings emphasize the importance of retaining the motion generation loss in conjunction with contrastive training for improved results. We establish a new evaluation benchmark and conduct analyses on several protocols. In our fourth contribution, described in Chapter 6, we introduce a new problem termed as “multi-track timeline control” for text-driven 3D human motion synthesis. Instead of a single textual prompt, users can organize multiple prompts in temporal intervals that may overlap. We introduce STMC, a test-time denoising method that can be integrated with any pre-trained motion diffusion model. Our evaluations demonstrate that our method generates motions that closely match the semantic and temporal aspects of the input timelines. In summary, our contributions in this thesis are as follows: (i) we develop a generative variational autoencoder, ACTOR, for action-conditioned generation of human motion sequences, (ii) we introduce TEMOS, a text-conditioned generative model that synthesizes diverse human motions from textual descriptions, (iii) we present TMR, a new approach for text-to-3D human motion retrieval, (iv) we propose STMC, a method for timeline control in text-driven motion synthesis, enabling the generation of detailed and complex motions.
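The contrastive term added in TMR can be sketched as a symmetric InfoNCE loss over paired text and motion embeddings, as is standard for cross-modal retrieval. This NumPy version is illustrative: the temperature and exact formulation are assumptions, and in TMR the motion generation loss is kept alongside this term.

```python
import numpy as np

def infonce_loss(text_emb, motion_emb, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired text/motion embeddings.

    Matching pairs sit on the diagonal of the similarity matrix; the loss
    pushes each text toward its own motion (and vice versa) relative to
    all other items in the batch, structuring the cross-modal space."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    logits = t @ m.T / temperature  # scaled cosine similarities
    n = logits.shape[0]

    def ce(lg):
        # cross-entropy with the matching pair as the "correct class"
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    return 0.5 * (ce(logits) + ce(logits.T))  # text-to-motion + motion-to-text
```

Aligned batches (each text embedding matching its motion embedding) yield a near-zero loss, while mismatched pairings yield a large one.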

Thesis [BibTex]



HMP: Hand Motion Priors for Pose and Shape Estimation from Video

Duran, E., Kocabas, M., Choutas, V., Fan, Z., Black, M. J.

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024 (article)

Abstract
Understanding how humans interact with the world necessitates accurate 3D hand pose estimation, a task complicated by the hand's high degree of articulation, frequent occlusions, self-occlusions, and rapid motions. While most existing methods rely on single-image inputs, videos offer useful cues for addressing these issues. However, existing video-based 3D hand datasets are insufficient for training feedforward models that generalize to in-the-wild scenarios. On the other hand, we have access to large human motion capture datasets that also include hand motions, e.g. AMASS. Therefore, we develop a generative motion prior specific to hands, trained on the AMASS dataset, which features diverse and high-quality hand motions. This motion prior is then employed for video-based 3D hand motion estimation via latent optimization. Integrating a robust motion prior significantly enhances performance, especially in occluded scenarios, and produces stable, temporally consistent results that surpass conventional single-frame methods. We demonstrate our method's efficacy through qualitative and quantitative evaluations on the HO3D and DexYCB datasets, with special emphasis on an occlusion-focused subset of HO3D.
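Latent optimization against a learned prior can be sketched with a toy linear decoder standing in for the trained motion prior: we search for the latent code whose decoded motion best explains the observations, regularized toward the prior. The decoder, dimensions, step size, and prior weight below are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear "decoder" standing in for the learned hand motion
# prior: maps an 8-D latent code to a 32-D motion vector.
W = rng.standard_normal((32, 8))

def decode(z):
    return W @ z

def fit_latent(target, steps=200, lr=0.01, prior_weight=1e-3):
    """Latent optimization sketch: gradient descent on a data term
    (match the observed motion) plus a Gaussian prior term on z."""
    z = np.zeros(8)
    for _ in range(steps):
        residual = decode(z) - target
        grad = W.T @ residual + prior_weight * z  # data term + N(0, I) prior
        z -= lr * grad
    return z
```

In the real setting the decoder is a trained network, the data term compares against per-frame pose estimates or image evidence, and the prior keeps occluded frames plausible.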

webpage pdf code [BibTex]


2023


FLARE: Fast Learning of Animatable and Relightable Mesh Avatars

Bharadwaj, S., Zheng, Y., Hilliges, O., Black, M. J., Abrevaya, V. F.

ACM Transactions on Graphics, 42(6):204:1-204:15, December 2023 (article) Accepted

Abstract
Our goal is to efficiently learn personalized animatable 3D head avatars from videos that are geometrically accurate, realistic, relightable, and compatible with current rendering systems. While 3D meshes enable efficient processing and are highly portable, they lack realism in terms of shape and appearance. Neural representations, on the other hand, are realistic but lack compatibility and are slow to train and render. Our key insight is that it is possible to efficiently learn high-fidelity 3D mesh representations via differentiable rendering by exploiting highly-optimized methods from traditional computer graphics and approximating some of the components with neural networks. To that end, we introduce FLARE, a technique that enables the creation of animatable and relightable mesh avatars from a single monocular video. First, we learn a canonical geometry using a mesh representation, enabling efficient differentiable rasterization and straightforward animation via learned blendshapes and linear blend skinning weights. Second, we follow physically-based rendering and factor observed colors into intrinsic albedo, roughness, and a neural representation of the illumination, allowing the learned avatars to be relit in novel scenes. Since our input videos are captured on a single device with a narrow field of view, modeling the surrounding environment light is non-trivial. Based on the split-sum approximation for modeling specular reflections, we address this by approximating the pre-filtered environment map with a multi-layer perceptron (MLP) modulated by the surface roughness, eliminating the need to explicitly model the light. We demonstrate that our mesh-based avatar formulation, combined with learned deformation, material, and lighting MLPs, produces avatars with high-quality geometry and appearance, while also being efficient to train and render compared to existing approaches.

Paper Project Page Code DOI [BibTex]




From Skin to Skeleton: Towards Biomechanically Accurate 3D Digital Humans

(Honorable Mention for Best Paper)

Keller, M., Werling, K., Shin, S., Delp, S., Pujades, S., Liu, C. K., Black, M. J.

ACM Transactions on Graphics (TOG), 42(6):253:1-253:15, December 2023 (article)

Abstract
Great progress has been made in estimating 3D human pose and shape from images and video by training neural networks to directly regress the parameters of parametric human models like SMPL. However, existing body models have simplified kinematic structures that do not correspond to the true joint locations and articulations in the human skeletal system, limiting their potential use in biomechanics. On the other hand, methods for estimating biomechanically accurate skeletal motion typically rely on complex motion capture systems and expensive optimization methods. What is needed is a parametric 3D human model with a biomechanically accurate skeletal structure that can be easily posed. To that end, we develop SKEL, which re-rigs the SMPL body model with a biomechanics skeleton. To enable this, we need training data of skeletons inside SMPL meshes in diverse poses. We build such a dataset by optimizing biomechanically accurate skeletons inside SMPL meshes from AMASS sequences. We then learn a regressor from SMPL mesh vertices to the optimized joint locations and bone rotations. Finally, we re-parametrize the SMPL mesh with the new kinematic parameters. The resulting SKEL model is animatable like SMPL but with fewer, and biomechanically-realistic, degrees of freedom. We show that SKEL has more biomechanically accurate joint locations than SMPL, and the bones fit inside the body surface better than previous methods. By fitting SKEL to SMPL meshes we are able to “upgrade” existing human pose and shape datasets to include biomechanical parameters. SKEL provides a new tool to enable biomechanics in the wild, while also providing vision and graphics researchers with a better constrained

Project Page Paper DOI [BibTex]



BARC: Breed-Augmented Regression Using Classification for 3D Dog Reconstruction from Images

Rueegg, N., Zuffi, S., Schindler, K., Black, M. J.

International Journal of Computer Vision (IJCV), 131(8):1964–1979, August 2023 (article)

Abstract
The goal of this work is to reconstruct 3D dogs from monocular images. We take a model-based approach, where we estimate the shape and pose parameters of a 3D articulated shape model for dogs. We consider dogs as they constitute a challenging problem: they are highly articulated and come in a variety of shapes and appearances. Recent work has considered a similar task using the multi-animal SMAL model, with additional limb scale parameters, obtaining reconstructions that are limited in terms of realism. Like previous work, we observe that the original SMAL model is not expressive enough to represent dogs of many different breeds. Moreover, we hypothesize that the supervision signal used to train the network, that is, 2D keypoints and silhouettes, is not sufficient to learn a regressor that can distinguish between the large variety of dog breeds. We therefore go beyond previous work in two important ways. First, we modify the SMAL shape space to be more appropriate for representing dog shape. Second, we formulate novel losses that exploit information about dog breeds. In particular, we exploit the fact that dogs of the same breed have similar body shapes. We formulate a novel breed similarity loss consisting of two parts: the first is a triplet loss that encourages the shapes of dogs of the same breed to be more similar than those of dogs of different breeds; the second is a breed classification loss. With our approach we obtain 3D dogs that, compared to previous work, are quantitatively better in terms of 2D reconstruction, and significantly better according to subjective and quantitative 3D evaluations. Our work shows that a-priori side information about similarity of shape and appearance, as provided by breed labels, can help to compensate for the lack of 3D training data. This concept may be applicable to other animal species or groups of species. We call our method BARC (Breed-Augmented Regression using Classification).
Our code is publicly available for research purposes at https://barc.is.tue.mpg.de/.
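The breed similarity loss described in the abstract can be illustrated with a small sketch. This is not the BARC implementation; the function name, distance metric, and margin value are illustrative assumptions — only the idea (same-breed shapes pulled closer than different-breed shapes, up to a margin) comes from the abstract.

```python
# Hedged sketch of a breed-similarity triplet loss in shape space.
# `anchor`, `positive` (same breed), and `negative` (different breed) are
# low-dimensional shape parameter vectors; names and margin are hypothetical.

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Penalize cases where a same-breed dog is not closer in shape space
    than a different-breed dog, by at least `margin`."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)
```

In training, this term would be combined with the breed classification loss mentioned in the abstract; the weighting between the two is not specified here.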

On-line DOI [BibTex]


Virtual Reality Exposure to a Healthy Weight Body Is a Promising Adjunct Treatment for Anorexia Nervosa

Behrens, S. C., Tesch, J., Sun, P. J., Starke, S., Black, M. J., Schneider, H., Pruccoli, J., Zipfel, S., Giel, K. E.

Psychotherapy and Psychosomatics, 92(3):170-179, June 2023 (article)

Abstract
Introduction/Objective: Treatment results of anorexia nervosa (AN) are modest, with fear of weight gain being a strong predictor of treatment outcome and relapse. Here, we present a virtual reality (VR) setup for exposure to healthy weight and evaluate its potential as an adjunct treatment for AN. Methods: In two studies, we investigate VR experience and clinical effects of VR exposure to higher weight in 20 women with high weight concern or shape concern and in 20 women with AN. Results: In study 1, 90% of participants (18/20) reported symptoms of high arousal but verbalized low to medium levels of fear. Study 2 demonstrated that VR exposure to healthy weight induced high arousal in patients with AN and yielded a trend that four sessions of exposure improved fear of weight gain. Explorative analyses revealed three clusters of individual reactions to exposure, which need further exploration. Conclusions: VR exposure is a well-accepted and powerful tool for evoking fear of weight gain in patients with AN. We observed a statistical trend that repeated virtual exposure to healthy weight improved fear of weight gain with large effect sizes. Further studies are needed to determine the mechanisms and differential effects.

on-line DOI [BibTex]


Fast-SNARF: A Fast Deformer for Articulated Neural Fields

Chen, X., Jiang, T., Song, J., Rietmann, M., Geiger, A., Black, M. J., Hilliges, O.

IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), pages: 1-15, April 2023 (article)

Abstract
Neural fields have revolutionized the area of 3D reconstruction and novel view synthesis of rigid scenes. A key challenge in making such methods applicable to articulated objects, such as the human body, is to model the deformation of 3D locations between the rest pose (a canonical space) and the deformed space. We propose a new articulation module for neural fields, Fast-SNARF, which finds accurate correspondences between canonical space and posed space via iterative root finding. Fast-SNARF is a functional drop-in replacement for our previous work, SNARF, while significantly improving its computational efficiency. We contribute several algorithmic and implementation improvements over SNARF, yielding a speed-up of 150×. These improvements include voxel-based correspondence search, pre-computing the linear blend skinning function, and an efficient software implementation with CUDA kernels. Fast-SNARF enables efficient and simultaneous optimization of shape and skinning weights given deformed observations without correspondences (e.g. 3D meshes). Because learning of deformation maps is a crucial component in many 3D human avatar methods and since Fast-SNARF provides a computationally efficient solution, we believe that this work represents a significant step towards the practical creation of 3D virtual humans.
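The "correspondence via iterative root finding" idea can be sketched in one dimension. This toy is not Fast-SNARF's code (which operates on 3D points under linear blend skinning with CUDA kernels); the warp, its derivative, and all names here are hypothetical stand-ins that only illustrate solving warp(x_c) = x_d for the canonical point x_c.

```python
# Toy sketch: recover the canonical location x_c whose deformation lands on a
# given deformed location x_d, by Newton iterations on f(x) = warp(x) - x_d.

def find_canonical(warp, dwarp, x_d, x0=0.0, iters=20):
    x = x0
    for _ in range(iters):
        x -= (warp(x) - x_d) / dwarp(x)  # Newton step toward the root
    return x

# A simple differentiable 1D "deformation" standing in for blend skinning.
warp = lambda x: 1.5 * x + 0.2
dwarp = lambda x: 1.5
x_c = find_canonical(warp, dwarp, x_d=3.2)
```

In the actual method the search runs per 3D point and is accelerated with a voxelized representation of the skinning field; that machinery is omitted here.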

pdf publisher site code DOI [BibTex]


SmartMocap: Joint Estimation of Human and Camera Motion Using Uncalibrated RGB Cameras

Saini, N., Huang, C. P., Black, M. J., Ahmad, A.

IEEE Robotics and Automation Letters, 8(6):3206-3213, 2023 (article)

Abstract
Markerless human motion capture (mocap) from multiple RGB cameras is a widely studied problem. Existing methods either need calibrated cameras or calibrate them relative to a static camera, which acts as the reference frame for the mocap system. The calibration step has to be done a priori for every capture session, which is a tedious process, and re-calibration is required whenever cameras are intentionally or accidentally moved. In this letter, we propose a mocap method which uses multiple static and moving extrinsically uncalibrated RGB cameras. The key components of our method are as follows. First, since the cameras and the subject can move freely, we select the ground plane as a common reference to represent both the body and the camera motions unlike existing methods which represent bodies in the camera coordinate system. Second, we learn a probability distribution of short human motion sequences (~1sec) relative to the ground plane and leverage it to disambiguate between the camera and human motion. Third, we use this distribution as a motion prior in a novel multi-stage optimization approach to fit the SMPL human body model and the camera poses to the human body keypoints on the images. Finally, we show that our method can work on a variety of datasets ranging from aerial cameras to smartphones. It also gives more accurate results compared to the state-of-the-art on the task of monocular human mocap with a static camera.

publisher site DOI [BibTex]


Viewpoint-Driven Formation Control of Airships for Cooperative Target Tracking

Price, E., Black, M. J., Ahmad, A.

IEEE Robotics and Automation Letters, 8(6):3653-3660, 2023 (article)

Abstract
For tracking and motion capture (MoCap) of animals in their natural habitat, a formation of safe and silent aerial platforms, such as airships with on-board cameras, is well suited. In our prior work we derived formation properties for optimal MoCap, which include maintaining constant angular separation between observers w.r.t. the subject, a threshold distance to it, and keeping it centered in the camera view. Unlike multi-rotors, airships have non-holonomic constraints and are affected by ambient wind. Their orientation and flight direction are also tightly coupled. Therefore, a control scheme for multicopters that assumes independence of motion direction and orientation is not applicable. In this letter, we address this problem by first exploiting a periodic relationship between the airspeed of an airship and its distance to the subject. We use it to derive analytical and numeric solutions that satisfy the formation properties for optimal MoCap. Based on this, we develop an MPC-based formation controller. We provide a theoretical analysis of our solution and the boundary conditions of its applicability, along with extensive simulation experiments and a real-world demonstration of our control method with an unmanned airship.

publisher's site DOI [BibTex]

2022


Reconstructing Expressive 3D Humans from RGB Images

Choutas, V.

ETH Zurich, Max Planck Institute for Intelligent Systems and ETH Zurich, December 2022 (phdthesis)

Abstract
To interact with our environment, we need to adapt our body posture and grasp objects with our hands. During a conversation our facial expressions and hand gestures convey important non-verbal cues about our emotional state and intentions towards our fellow speakers. Thus, modeling and capturing 3D full-body shape and pose, hand articulation and facial expressions are necessary to create realistic human avatars for augmented and virtual reality. This is a complex task, due to the large number of degrees of freedom for articulation, body shape variance, occlusions from objects and self-occlusions from body parts, e.g. crossing our hands, and subject appearance. The community has thus far relied on expensive and cumbersome equipment, such as multi-view cameras or motion capture markers, to capture the 3D human body. While this approach is effective, it is limited to a small number of subjects and indoor scenarios. Using monocular RGB cameras would greatly simplify the avatar creation process, thanks to their lower cost and ease of use. These advantages come at a price though, since RGB capture methods need to deal with occlusions, perspective ambiguity and large variations in subject appearance, in addition to all the challenges posed by full-body capture. In an attempt to simplify the problem, researchers generally adopt a divide-and-conquer strategy, estimating the body, face and hands with distinct methods using part-specific datasets and benchmarks. However, the hands and face constrain the body and vice versa, e.g. the position of the wrist depends on the elbow, shoulder, etc.; the divide-and-conquer approach cannot utilize this constraint. In this thesis, we aim to reconstruct the full 3D human body, using only readily accessible monocular RGB images. In a first step, we introduce a parametric 3D body model, called SMPL-X, that can represent full-body shape and pose, hand articulation and facial expression.
Next, we present an iterative optimization method, named SMPLify-X, that fits SMPL-X to 2D image keypoints. While SMPLify-X can produce plausible results if the 2D observations are sufficiently reliable, it is slow and sensitive to initialization. To overcome these limitations, we introduce ExPose, a neural network regressor that predicts SMPL-X parameters from an image using body-driven attention, i.e. by zooming in on the hands and face after predicting the body. From the zoomed-in part images, dedicated part networks predict the hand and face parameters. ExPose combines the independent body, hand, and face estimates by trusting them equally. This approach, though, does not fully exploit the correlation between parts and fails in the presence of challenges such as occlusion or motion blur. Thus, we need a better mechanism to aggregate information from the full body and part images. Our next method, PIXIE, uses neural networks called moderators that learn to fuse information from these two image sets before predicting the final part parameters. Overall, the addition of the hands and face leads to noticeably more natural and expressive reconstructions. Creating high-fidelity avatars from RGB images requires accurate estimation of 3D body shape. Although existing methods are effective at predicting body pose, they struggle with body shape. We identify the lack of proper training data as the cause. To overcome this obstacle, we propose to collect internet images from fashion model websites, together with anthropometric measurements. At the same time, we ask human annotators to rate images and meshes according to a pre-defined set of linguistic attributes. We then define mappings between measurements, linguistic shape attributes and 3D body shape. Equipped with these mappings, we train a neural network regressor, SHAPY, that predicts accurate 3D body shapes from a single RGB image. We observe that existing 3D shape benchmarks lack subject variety and/or ground-truth shape.
Thus, we introduce a new benchmark, Human Bodies in the Wild (HBW), which contains images of humans and their corresponding 3D ground-truth body shape. SHAPY shows how we can overcome the lack of in-the-wild images with 3D shape annotations through easy-to-obtain anthropometric measurements and linguistic shape attributes. Regressors that estimate 3D model parameters are robust and accurate, but often fail to tightly fit the observations. Optimization-based approaches tightly fit the data by minimizing an energy function composed of a data term that penalizes deviations from the observations and priors that encode our knowledge of the problem. Finding the balance between these terms and implementing a performant version of the solver is a time-consuming and non-trivial task. Machine-learned continuous optimizers combine the benefits of both regression and optimization approaches. They learn the priors directly from data, avoiding the need for hand-crafted heuristics and loss term balancing, and benefit from optimized neural network frameworks for fast inference. Inspired by the classic Levenberg-Marquardt algorithm, we propose a neural optimizer that outperforms classic optimization, regression and hybrid optimization-regression approaches. Our proposed update rule uses a weighted combination of gradient descent and a network-predicted update. To show the versatility of the proposed method, we apply it to three other problems, namely full-body estimation from (i) 2D keypoints and (ii) head and hand locations from a head-mounted device, and (iii) face tracking from dense 2D landmarks. Our method can easily be applied to new model fitting problems and offers a competitive alternative to well-tuned traditional model fitting pipelines, both in terms of accuracy and speed. To summarize, we propose a new and richer representation of the human body, SMPL-X, that is able to jointly model the 3D human body pose and shape, facial expressions and hand articulation.
We propose methods, SMPLify-X, ExPose and PIXIE that estimate SMPL-X parameters from monocular RGB images, progressively improving the accuracy and realism of the predictions. To further improve reconstruction fidelity, we demonstrate how we can use easy-to-collect internet data and human annotations to overcome the lack of 3D shape data and train a model, SHAPY, that predicts accurate 3D body shape from a single RGB image. Finally, we propose a flexible learnable update rule for parametric human model fitting that outperforms both classic optimization and neural network approaches. This approach is easily applicable to a variety of problems, unlocking new applications in AR/VR scenarios.
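The hybrid update rule summarized above (a weighted combination of gradient descent and a network-predicted update) can be sketched on a toy problem. This is not the thesis's implementation; the function names, the blending weight, and the stand-in "network" (here just a lambda stepping toward the target) are all illustrative assumptions.

```python
# Hedged sketch of a learned-optimizer update: blend a gradient-descent step
# with a data-driven predicted step at every iteration. Names are hypothetical.

def fit(theta, grad_fn, predict_update, weight=0.5, lr=0.1, steps=50):
    for _ in range(steps):
        gd_step = [-lr * g for g in grad_fn(theta)]     # classic descent step
        net_step = predict_update(theta)                # "network" prediction
        theta = [t + weight * g + (1 - weight) * n
                 for t, g, n in zip(theta, gd_step, net_step)]
    return theta

# Toy fitting problem: minimise ||theta - target||^2, with a stand-in
# "network" that predicts a step partway toward the target.
target = [1.0, -2.0]
grad_fn = lambda th: [2 * (t - g) for t, g in zip(th, target)]
predict_update = lambda th: [0.3 * (g - t) for t, g in zip(th, target)]
theta = fit([0.0, 0.0], grad_fn, predict_update)
```

In the actual work the predicted step comes from a trained network and the parameters are those of a body model being fit to observations; this toy only shows the shape of the update rule.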

pdf [BibTex]



How immersive virtual reality can become a key tool to advance research and psychotherapy of eating and weight disorders

Behrens, S. C., Streuber, S., Keizer, A., Giel, K. E.

Frontiers in Psychiatry, 13, pages: 1011620, November 2022 (article)

Abstract
Immersive virtual reality (VR) technology has yet to see wide dissemination in research and psychotherapy of eating and weight disorders. Given the comparatively high effort of producing a VR setup, we argue that the technology’s breakthrough requires tailored exploitation of specific features of VR and user-centered design of setups. In this paper, we introduce VR hardware and review the specific properties of immersive VR versus real-world setups, providing examples of how they have improved existing setups. We then summarize current approaches to making VR a tool for psychotherapy of eating and weight disorders and introduce user-centered design of VR environments as a solution to support their further development. Overall, we argue that exploitation of the specific properties of VR can substantially improve existing approaches for research and therapy of eating and weight disorders. To produce more than pilot setups, iterative development of VR setups within a user-centered design approach is needed.

DOI [BibTex]


iRotate: Active visual SLAM for omnidirectional robots

Bonetto, E., Goldschmid, P., Pabst, M., Black, M. J., Ahmad, A.

Robotics and Autonomous Systems, 154, pages: 104102, Elsevier, August 2022 (article)

Abstract
In this paper, we present an active visual SLAM approach for omnidirectional robots. The goal is to generate control commands that allow such a robot to simultaneously localize itself and map an unknown environment while maximizing the amount of information gained and consuming as little energy as possible. Leveraging the robot’s independent translation and rotation control, we introduce a multi-layered approach for active V-SLAM. The top layer decides on informative goal locations and generates highly informative paths to them. The second and third layers actively re-plan and execute the path, exploiting the continuously updated map and local feature information. Moreover, we introduce two utility formulations to account for the presence of obstacles in the field of view and the robot’s location. Through rigorous simulations, real robot experiments, and comparisons with state-of-the-art methods, we demonstrate that our approach achieves similar coverage results with lower overall map entropy. This is obtained while keeping the traversed distance up to 39% shorter than the other methods and without increasing the wheels’ total rotation amount. Code and implementation details are provided as open-source and all the generated data is available online for consultation.

Code Data iRotate Data Independent Camera experiments link (url) DOI [BibTex]


AirPose: Multi-View Fusion Network for Aerial 3D Human Pose and Shape Estimation

Saini, N., Bonetto, E., Price, E., Ahmad, A., Black, M. J.

IEEE Robotics and Automation Letters, 7(2):4805-4812, IEEE, April 2022, Also accepted and presented in the 2022 IEEE International Conference on Robotics and Automation (ICRA) (article)

Abstract
In this letter, we present a novel markerless 3D human motion capture (MoCap) system for unstructured, outdoor environments that uses a team of autonomous unmanned aerial vehicles (UAVs) with on-board RGB cameras and computation. Existing methods are limited by calibrated cameras and off-line processing. Thus, we present the first method (AirPose) to estimate human pose and shape using images captured by multiple extrinsically uncalibrated flying cameras. AirPose calibrates the cameras relative to the person, rather than in the classical way. It uses distributed neural networks running on each UAV that communicate viewpoint-independent information with each other about the person (i.e., their 3D shape and articulated pose). The person's shape and pose are parameterized using the SMPL-X body model, resulting in a compact representation that minimizes communication between the UAVs. The network is trained using synthetic images of realistic virtual environments, and fine-tuned on a small set of real images. We also introduce an optimization-based post-processing method (AirPose+) for offline applications that require higher mocap quality. We make code and data available for research at https://github.com/robot-perception-group/AirPose. A video describing the approach and results is available at https://youtu.be/Ss48ICeqvnQ.

paper code video pdf DOI [BibTex]


Physical activity improves body image of sedentary adults. Exploring the roles of interoception and affective response

Srismith, D., Dierkes, K., Zipfel, S., Thiel, A., Sudeck, G., Giel, K. E., Behrens, S. C.

Current Psychology, Springer, 2022 (article) Accepted

Abstract
To reduce the number of sedentary people, an improved understanding of the effects of exercise in this specific group is needed. The present project investigates the impact of regular aerobic exercise uptake on body image, and how this effect is associated with differences in interoceptive abilities and affective response to exercise. Participants were 29 sedentary adults who underwent a 12-week aerobic physical activity intervention comprising 30–36 sessions. Body image improved with large effect sizes. Correlations were observed between affective response to physical activity and body image improvement, but not between body image improvement and interoceptive abilities. Explorative mediation models suggest a negligible role of a priori interoceptive abilities. Instead, body image improvement was achieved when positive valence was assigned to interoceptive cues experienced during exercise.

DOI [BibTex]

2021


The neural coding of face and body orientation in occipitotemporal cortex

Foster, C., Zhao, M., Bolkart, T., Black, M. J., Bartels, A., Bülthoff, I.

NeuroImage, 246, pages: 118783, December 2021 (article)

Abstract
Face and body orientation convey important information for us to understand other people's actions, intentions and social interactions. It has been shown that several occipitotemporal areas respond differently to faces or bodies of different orientations. However, whether face and body orientation are processed by partially overlapping or completely separate brain networks remains unclear, as the neural coding of face and body orientation is often investigated separately. Here, we recorded participants’ brain activity using fMRI while they viewed faces and bodies shown from three different orientations, while attending to either orientation or identity information. Using multivoxel pattern analysis we investigated which brain regions process face and body orientation respectively, and which regions encode both face and body orientation in a stimulus-independent manner. We found that patterns of neural responses evoked by different stimulus orientations in the occipital face area, extrastriate body area, lateral occipital complex and right early visual cortex could generalise across faces and bodies, suggesting a stimulus-independent encoding of person orientation in occipitotemporal cortex. This finding was consistent across functionally defined regions of interest and a whole-brain searchlight approach. The fusiform face area responded to face but not body orientation, suggesting that orientation responses in this area are face-specific. Moreover, neural responses to orientation were remarkably consistent regardless of whether participants attended to the orientation of faces and bodies or not. Together, these results demonstrate that face and body orientation are processed in a partially overlapping brain network, with a stimulus-independent neural code for face and body orientation in occipitotemporal cortex.

paper DOI [BibTex]


A pose-independent method for accurate and precise body composition from 3D optical scans

Wong, M. C., Ng, B. K., Tian, I., Sobhiyeh, S., Pagano, I., Dechenaud, M., Kennedy, S. F., Liu, Y. E., Kelly, N., Chow, D., Garber, A. K., Maskarinec, G., Pujades, S., Black, M. J., Curless, B., Heymsfield, S. B., Shepherd, J. A.

Obesity, 29(11):1835-1847, Wiley, November 2021 (article)

Abstract
Objective: The aim of this study was to investigate whether digitally re-posing three-dimensional optical (3DO) whole-body scans to a standardized pose would improve body composition accuracy and precision regardless of the initial pose. Methods: Healthy adults (n = 540), stratified by sex, BMI, and age, completed whole-body 3DO and dual-energy X-ray absorptiometry (DXA) scans in the Shape Up! Adults study. The 3DO mesh vertices were represented with standardized templates and a low-dimensional space by principal component analysis (stratified by sex). The total sample was split into a training (80%) and test (20%) set for both males and females. Stepwise linear regression was used to build prediction models for body composition and anthropometry outputs using 3DO principal components (PCs). Results: The analysis included 472 participants after exclusions. After re-posing, three PCs described 95% of the shape variance in the male and female training sets. 3DO body composition accuracy compared with DXA was as follows: fat mass R2 = 0.91 male, 0.94 female; fat-free mass R2 = 0.95 male, 0.92 female; visceral fat mass R2 = 0.77 male, 0.79 female. Conclusions: Re-posed 3DO body shape PCs produced more accurate and precise body composition models that may be used in clinical or nonclinical settings when DXA is unavailable or when frequent ionizing radiation exposure is unwanted.
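The pipeline described in the Methods (represent re-posed mesh vertices in a low-dimensional PCA space, then regress body composition outputs from the principal components) can be sketched with synthetic data. This is not the study's code: the data are random, the component count and regression form are simplifications of the stepwise procedure, and all variable names are hypothetical.

```python
import numpy as np

# Hedged sketch: PCA shape features -> linear body-composition model.
rng = np.random.default_rng(0)
verts = rng.normal(size=(100, 30))           # 100 subjects x flattened vertices
fat_mass = verts @ rng.normal(size=30) + 5.0  # synthetic DXA-style target

# PCA via SVD on mean-centred vertex data; keep the first 3 components.
mu = verts.mean(axis=0)
_, _, Vt = np.linalg.svd(verts - mu, full_matrices=False)
pcs = (verts - mu) @ Vt[:3].T                # per-subject principal components

# Least-squares linear regression (with intercept) from PCs to the output.
A = np.c_[pcs, np.ones(len(pcs))]
coef, *_ = np.linalg.lstsq(A, fat_mass, rcond=None)
pred = A @ coef
```

The study additionally stratifies by sex, uses stepwise selection over the components, and validates against DXA on a held-out test set; those steps are omitted here.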

Wiley online address Author Version DOI Project Page [BibTex]


Separated and overlapping neural coding of face and body identity

Foster, C., Zhao, M., Bolkart, T., Black, M. J., Bartels, A., Bülthoff, I.

Human Brain Mapping, 42(13):4242-4260, September 2021 (article)

Abstract
Recognising a person's identity often relies on face and body information, and is tolerant to changes in low-level visual input (e.g., viewpoint changes). Previous studies have suggested that face identity is disentangled from low-level visual input in the anterior face-responsive regions. It remains unclear which regions disentangle body identity from variations in viewpoint, and whether face and body identity are encoded separately or combined into a coherent person identity representation. We trained participants to recognise three identities, and then recorded their brain activity using fMRI while they viewed face and body images of these three identities from different viewpoints. Participants' task was to respond to either the stimulus identity or viewpoint. We found consistent decoding of body identity across viewpoint in the fusiform body area, right anterior temporal cortex, middle frontal gyrus and right insula. This finding demonstrates a similar function of fusiform and anterior temporal cortex for bodies as has previously been shown for faces, suggesting these regions may play a general role in extracting high-level identity information. Moreover, we could decode identity across fMRI activity evoked by faces and bodies in the early visual cortex, right inferior occipital cortex, right parahippocampal cortex and right superior parietal cortex, revealing a distributed network that encodes person identity abstractly. Lastly, identity decoding was consistently better when participants attended to identity, indicating that attention to identity enhances its neural representation. These results offer new insights into how the brain develops an abstract neural coding of person identity, shared by faces and bodies.

on-line DOI Project Page [BibTex]


Skinned multi-infant linear body model

Hesse, N., Pujades, S., Romero, J., Black, M.

(US Patent 11,127,163, 2021), September 2021 (patent)

Abstract
A computer-implemented method for automatically obtaining pose and shape parameters of a human body. The method includes obtaining a sequence of digital 3D images of the body, recorded by at least one depth camera; automatically obtaining pose and shape parameters of the body, based on images of the sequence and a statistical body model; and outputting the pose and shape parameters. The body may be an infant body.

[BibTex]


The role of sexual orientation in the relationships between body perception, body weight dissatisfaction, physical comparison, and eating psychopathology in the cisgender population

Meneguzzo, P., Collantoni, E., Bonello, E., Vergine, M., Behrens, S. C., Tenconi, E., Favaro, A.

Eating and Weight Disorders - Studies on Anorexia, Bulimia and Obesity, 26(6):1985-2000, Springer, August 2021 (article)

Abstract
Purpose: Body weight dissatisfaction (BWD) and visual body perception are specific aspects that can influence one's body image, and that can contribute to the development or maintenance of specific psychopathological dimensions of different psychiatric disorders. Sexual orientation is a fundamental but understudied aspect in this field and, for this reason, the purpose of this study is to improve knowledge about the relationships among BWD, visual body size-perception, and sexual orientation. Methods: A total of 1033 individuals participated in an online survey. Physical comparison, depression, and self-esteem were evaluated, as well as sexual orientation and the presence of an eating disorder. A Figure Rating Scale was used to assess different valences of body weight, and mediation analyses were performed to investigate specific relationships between psychological aspects. Results: Bisexual women and gay men reported significantly higher BWD than other groups (p < 0.001); instead, higher body misperception was present in gay men (p = 0.001). Physical appearance comparison mediated the effect of sexual orientation on both BWD and perceptual distortion. No difference emerged between women with a history of eating disorders and those without, as regards the value of body weight attributed to attractiveness, health, and presence on social media. Conclusion: This study contributes to understanding the relationship between sexual orientation and body image representation and evaluation. Physical appearance comparisons should be considered critical psychological factors that can affect well-being. The impact on subjects with high levels of eating concerns is also discussed. Level of evidence: Level III: case–control analytic study.

DOI Project Page [BibTex]


Learning an Animatable Detailed 3D Face Model from In-the-Wild Images

Feng, Y., Feng, H., Black, M. J., Bolkart, T.

ACM Transactions on Graphics, 40(4):88:1-88:13, August 2021 (article)

Abstract
While current monocular 3D face reconstruction methods can recover fine geometric details, they suffer several limitations. Some methods produce faces that cannot be realistically animated because they do not model how wrinkles vary with expression. Other methods are trained on high-quality face scans and do not generalize well to in-the-wild images. We present the first approach that regresses 3D face shape and animatable details that are specific to an individual but change with expression. Our model, DECA (Detailed Expression Capture and Animation), is trained to robustly produce a UV displacement map from a low-dimensional latent representation that consists of person-specific detail parameters and generic expression parameters, while a regressor is trained to predict detail, shape, albedo, expression, pose and illumination parameters from a single image. To enable this, we introduce a novel detail-consistency loss that disentangles person-specific details from expression-dependent wrinkles. This disentanglement allows us to synthesize realistic person-specific wrinkles by controlling expression parameters while keeping person-specific details unchanged. DECA is learned from in-the-wild images with no paired 3D supervision and achieves state-of-the-art shape reconstruction accuracy on two benchmarks. Qualitative results on in-the-wild data demonstrate DECA's robustness and its ability to disentangle identity- and expression-dependent details enabling animation of reconstructed faces. The model and code are publicly available at https://deca.is.tue.mpg.de.

pdf Sup Mat code video talk DOI Project Page [BibTex]


Red shape, blue shape: Political ideology influences the social perception of body shape

Quiros-Ramirez, M. A., Streuber, S., Black, M. J.

Nature Humanities and Social Sciences Communications, 8, pages: 148, June 2021 (article)

Abstract
Political elections have a profound impact on individuals and societies. Optimal voting is thought to be based on informed and deliberate decisions; yet it has been demonstrated that the outcomes of political elections are biased by the perception of candidates’ facial features and the stereotypical traits voters attribute to these. Interestingly, political identification changes the attribution of stereotypical traits from facial features. This study explores whether the perception of body shape elicits similar effects on political trait attribution and whether these associations can be visualized. In Experiment 1, ratings of 3D body shapes were used to model the relationship between the perception of 3D body shape and the attribution of political traits such as ‘Republican’, ‘Democrat’, or ‘Leader’. This allowed analyzing and visualizing the mental representations of stereotypical 3D body shapes associated with each political trait. Experiment 2 was designed to test whether the political identification of the raters affected the attribution of political traits to different types of body shapes. The results show that humans attribute political traits to the same body shapes differently depending on their own political preference. These findings show that our judgments of others are influenced by their body shape and our own political views. Such judgments have potential political and societal implications.

pdf on-line sup. mat. sup. figure author pdf DOI Project Page [BibTex]


Weight bias and linguistic body representation in anorexia nervosa: Findings from the BodyTalk project

(Top Cited Article 2021-2022)

Behrens, S. C., Meneguzzo, P., Favaro, A., Teufel, M., Skoda, E., Lindner, M., Walder, L., Quiros-Ramirez, M. A., Zipfel, S., Mohler, B., Black, M., Giel, K. E.

European Eating Disorders Review, 29(2):204-215, Wiley, March 2021 (article)

Abstract
Objective: This study provides a comprehensive assessment of own body representation and linguistic representation of bodies in general in women with typical and atypical anorexia nervosa (AN). Methods: In a series of desktop experiments, participants rated a set of adjectives according to their match with a series of computer generated bodies varying in body mass index, and generated prototypic body shapes for the same set of adjectives. We analysed how body mass index of the bodies was associated with positive or negative valence of the adjectives in the different groups. Further, body image and own body perception were assessed. Results: In a German‐Italian sample comprising 39 women with AN, 20 women with atypical AN and 40 age matched control participants, we observed effects indicative of weight stigmatization, but no significant differences between the groups. Generally, positive adjectives were associated with lean bodies, whereas negative adjectives were associated with obese bodies. Discussion: Our observations suggest that patients with both typical and atypical AN affectively and visually represent body descriptions not differently from healthy women. We conclude that overvaluation of low body weight and fear of weight gain cannot be explained by generally distorted perception or cognition, but require individual consideration.

on-line pdf DOI Project Page [BibTex]



Analyzing the Direction of Emotional Influence in Nonverbal Dyadic Communication: A Facial-Expression Study

Shadaydeh, M., Müller, L., Schneider, D., Thuemmel, M., Kessler, T., Denzler, J.

IEEE Access, 9, pages: 73780-73790, IEEE, 2021 (article)

Abstract
Identifying the direction of emotional influence in a dyadic dialogue is of increasing interest in the psychological sciences with applications in psychotherapy, analysis of political interactions, or interpersonal conflict behavior. Facial expressions are widely described as being automatic and thus hard to be overtly influenced. As such, they are a perfect measure for a better understanding of unintentional behavior cues about socio-emotional cognitive processes. With this view, this study is concerned with the analysis of the direction of emotional influence in dyadic dialogues based on facial expressions only. We exploit computer vision capabilities along with causal inference theory for quantitative verification of hypotheses on the direction of emotional influence, i.e., cause-effect relationships, in dyadic dialogues. We address two main issues. First, in a dyadic dialogue, emotional influence occurs over transient time intervals and with intensity and direction that are variant over time. To this end, we propose a relevant interval selection approach that we use prior to causal inference to identify those transient intervals where causal inference should be applied. Second, we propose to use fine-grained facial expressions that are present when strong distinct facial emotions are not visible. To specify the direction of influence, we apply the concept of Granger causality to the time-series of facial expressions over selected relevant intervals. We tested our approach on newly, experimentally obtained data. Based on quantitative verification of hypotheses on the direction of emotional influence, we were able to show that the proposed approach is promising to reveal the cause-effect pattern in various instructed interaction conditions.
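The Granger-causality idea used above can be sketched in a few lines (an illustrative toy, not the paper's implementation, which operates on fine-grained facial-expression time series over selected relevant intervals): x is said to "Granger-cause" y if x's past improves the prediction of y beyond what y's own past achieves.

```python
def granger_gain(y, x):
    """Fraction by which lag-1 history of x reduces the squared
    prediction error of y beyond y's own lag-1 history. A larger
    gain suggests x 'Granger-causes' y (toy, lag-1, no F-test)."""
    yt = y[1:]   # targets
    yl = y[:-1]  # lagged y
    xl = x[:-1]  # lagged x
    # Restricted model: y_t = a * y_{t-1}, closed-form least squares.
    a = sum(t * l for t, l in zip(yt, yl)) / sum(l * l for l in yl)
    rss_r = sum((t - a * l) ** 2 for t, l in zip(yt, yl))
    # Full model: y_t = a * y_{t-1} + b * x_{t-1} (2x2 normal equations).
    syy = sum(l * l for l in yl)
    sxx = sum(m * m for m in xl)
    sxy = sum(l * m for l, m in zip(yl, xl))
    sy = sum(t * l for t, l in zip(yt, yl))
    sx = sum(t * m for t, m in zip(yt, xl))
    det = syy * sxx - sxy * sxy
    a2 = (sy * sxx - sx * sxy) / det
    b2 = (sx * syy - sy * sxy) / det
    rss_f = sum((t - a2 * l - b2 * m) ** 2
                for t, l, m in zip(yt, yl, xl))
    return (rss_r - rss_f) / rss_r

# Toy example: x strictly drives y with a one-step delay.
x = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
y = [0.0] + x[:-1]
```

In this toy case the gain for y given x is maximal, matching the intuition that the direction of influence runs from x to y; the actual study additionally tests significance and restricts the analysis to relevant intervals.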

DOI [BibTex]



Scientific Report 2016 - 2021
2021 (mpi_year_book)

Abstract
This report presents research done at the Max Planck Institute for Intelligent Systems from January 2016 to November 2021. It is our fourth report since the founding of the institute in 2011. Due to the fact that the upcoming evaluation is an extended one, the report covers a longer reporting period. This scientific report is organized as follows: we begin with an overview of the institute, including an outline of its structure, an introduction of our latest research departments, and a presentation of our main collaborative initiatives and activities (Chapter 1). The central part of the scientific report consists of chapters on the research conducted by the institute’s departments (Chapters 2 to 6) and its independent research groups (Chapters 7 to 24), as well as the work of the institute’s central scientific facilities (Chapter 25). For entities founded after January 2016, the respective report sections cover work done from the date of the establishment of the department, group, or facility. These chapters are followed by a summary of selected outreach activities and scientific events hosted by the institute (Chapter 26). The scientific publications of the featured departments and research groups published during the 6-year review period complete this scientific report.

Scientific Report 2016 - 2021 [BibTex]


Body Image Disturbances and Weight Bias After Obesity Surgery: Semantic and Visual Evaluation in a Controlled Study, Findings from the BodyTalk Project

Meneguzzo, P., Behrens, S. C., Favaro, A., Tenconi, E., Vindigni, V., Teufel, M., Skoda, E., Lindner, M., Quiros-Ramirez, M. A., Mohler, B., Black, M., Zipfel, S., Giel, K. E., Pavan, C.

Obesity Surgery, 31(4):1625-1634, 2021 (article)

Abstract
Purpose: Body image has a significant impact on the outcome of obesity surgery. This study aims to perform a semantic evaluation of body shapes in obesity surgery patients and a group of controls. Materials and Methods: Thirty-four obesity surgery (OS) subjects, stable after weight loss (average 48.03 ± 18.60 kg), and 35 overweight/obese controls (HC), were enrolled in this study. Body dissatisfaction, self-esteem, and body perception were evaluated with self-reported tests, and semantic evaluation of body shapes was performed with three specific tasks constructed with realistic human body stimuli. Results: The OS showed a more positive body image compared to HC (p < 0.001), higher levels of depression (p < 0.019), and lower self-esteem (p < 0.000). OS patients and HC showed no difference in weight bias, but OS used a higher BMI than HC in the visualization of positive adjectives (p = 0.011). Both groups showed a mental underestimation of their body shapes. Conclusion: OS patients are more psychologically burdened and have more difficulties in judging their bodies than overweight/obese peers. Their mental body representations seem not to be linked to their own BMI. Our findings provide helpful insight for the design of specific interventions in body image in obese and overweight people, as well as in OS.

paper pdf DOI Project Page [BibTex]


2020


Occlusion Boundary: A Formal Definition & Its Detection via Deep Exploration of Context

Wang, C., Fu, H., Tao, D., Black, M. J.

IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5):2641-2656, November 2020 (article)

Abstract
Occlusion boundaries contain rich perceptual information about the underlying scene structure and provide important cues in many visual perception-related tasks such as object recognition, segmentation, motion estimation, scene understanding, and autonomous navigation. However, there is no formal definition of occlusion boundaries in the literature, and state-of-the-art occlusion boundary detection is still suboptimal. With this in mind, in this paper we propose a formal definition of occlusion boundaries for related studies. Further, based on a novel idea, we develop two concrete approaches with different characteristics to detect occlusion boundaries in video sequences via enhanced exploration of contextual information (e.g., local structural boundary patterns, observations from surrounding regions, and temporal context) with deep models and conditional random fields. Experimental evaluations of our methods on two challenging occlusion boundary benchmarks (CMU and VSB100) demonstrate that our detectors significantly outperform the current state-of-the-art. Finally, we empirically assess the roles of several important components of the proposed detectors to validate the rationale behind these approaches.

official version DOI [BibTex]



3D Morphable Face Models - Past, Present and Future

Egger, B., Smith, W. A. P., Tewari, A., Wuhrer, S., Zollhoefer, M., Beeler, T., Bernard, F., Bolkart, T., Kortylewski, A., Romdhani, S., Theobalt, C., Blanz, V., Vetter, T.

ACM Transactions on Graphics, 39(5):157, October 2020 (article)

Abstract
In this paper, we provide a detailed survey of 3D Morphable Face Models over the 20 years since they were first proposed. The challenges in building and applying these models, namely capture, modeling, image formation, and image analysis, are still active research topics, and we review the state-of-the-art in each of these areas. We also look ahead, identifying unsolved challenges, proposing directions for future research and highlighting the broad range of current and future applications.

project page pdf preprint DOI Project Page [BibTex]



AirCapRL: Autonomous Aerial Human Motion Capture Using Deep Reinforcement Learning

Tallamraju, R., Saini, N., Bonetto, E., Pabst, M., Liu, Y. T., Black, M., Ahmad, A.

IEEE Robotics and Automation Letters, 5(4):6678-6685, IEEE, October 2020, Also accepted and presented in the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). (article)

Abstract
In this letter, we introduce a deep reinforcement learning (DRL) based multi-robot formation controller for the task of autonomous aerial human motion capture (MoCap). We focus on vision-based MoCap, where the objective is to estimate the trajectory, body pose, and shape of a single moving person using multiple micro aerial vehicles. State-of-the-art solutions to this problem are based on classical control methods, which depend on hand-crafted system and observation models. Such models are difficult to derive and to generalize across different systems. Moreover, the non-linearities and non-convexities of these models lead to sub-optimal controls. In our work, we formulate this problem as a sequential decision-making task to achieve the vision-based motion capture objectives, and solve it using a deep neural network-based RL method. We leverage proximal policy optimization (PPO) to train a stochastic decentralized control policy for formation control. The neural network is trained in a parallelized setup in synthetic environments. We performed extensive simulation experiments to validate our approach. Finally, real-robot experiments demonstrate that our policies generalize to real-world conditions.

link (url) DOI Project Page [BibTex]



Analysis of motor development within the first year of life: 3-D motion tracking without markers for early detection of developmental disorders

Parisi, C., Hesse, N., Tacke, U., Rocamora, S. P., Blaschek, A., Hadders-Algra, M., Black, M. J., Heinen, F., Müller-Felber, W., Schroeder, A. S.

Bundesgesundheitsblatt - Gesundheitsforschung - Gesundheitsschutz, 63(7):881–890, July 2020 (article)

Abstract
Children with motor development disorders benefit greatly from early interventions. An early diagnosis in pediatric preventive care (U2–U5) can be improved by automated screening. Current approaches to automated motion analysis, however, are expensive, require lots of technical support, and cannot be used in broad clinical application. Here we present an inexpensive, marker-free video analysis tool (KineMAT) for infants, which digitizes 3‑D movements of the entire body over time allowing automated analysis in the future. Three-minute video sequences of spontaneously moving infants were recorded with a commercially available depth-imaging camera and aligned with a virtual infant body model (SMIL model). The virtual image generated allows any measurements to be carried out in 3‑D with high precision. We demonstrate seven infants with different diagnoses. A selection of possible movement parameters was quantified and aligned with diagnosis-specific movement characteristics. KineMAT and the SMIL model allow reliable, three-dimensional measurements of spontaneous activity in infants with a very low error rate. Based on machine-learning algorithms, KineMAT can be trained to automatically recognize pathological spontaneous motor skills. It is inexpensive and easy to use and can be developed into a screening tool for preventive care for children.

pdf on-line w/ sup mat DOI Project Page [BibTex]



Learning and Tracking the 3D Body Shape of Freely Moving Infants from RGB-D sequences

Hesse, N., Pujades, S., Black, M., Arens, M., Hofmann, U., Schroeder, S.

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 42(10):2540-2551, 2020 (article)

Abstract
Statistical models of the human body surface are generally learned from thousands of high-quality 3D scans in predefined poses to cover the wide variety of human body shapes and articulations. Acquisition of such data requires expensive equipment, calibration procedures, and is limited to cooperative subjects who can understand and follow instructions, such as adults. We present a method for learning a statistical 3D Skinned Multi-Infant Linear body model (SMIL) from incomplete, low-quality RGB-D sequences of freely moving infants. Quantitative experiments show that SMIL faithfully represents the RGB-D data and properly factorizes the shape and pose of the infants. To demonstrate the applicability of SMIL, we fit the model to RGB-D sequences of freely moving infants and show, with a case study, that our method captures enough motion detail for General Movements Assessment (GMA), a method used in clinical practice for early detection of neurodevelopmental disorders in infants. SMIL provides a new tool for analyzing infant shape and movement and is a step towards an automated system for GMA.

pdf Journal DOI Project Page [BibTex]



Machine learning systems and methods of estimating body shape from images

Black, M., Rachlin, E., Heron, N., Loper, M., Weiss, A., Hu, K., Hinkle, T., Kristiansen, M.

(US Patent 10,679,046), June 2020 (patent)

Abstract
Disclosed is a method including receiving an input image including a human, predicting, based on a convolutional neural network that is trained using examples consisting of pairs of sensor data, a corresponding body shape of the human and utilizing the corresponding body shape predicted from the convolutional neural network as input to another convolutional neural network to predict additional body shape metrics.

[BibTex]



General Movement Assessment from videos of computed 3D infant body models is equally effective compared to conventional RGB Video rating

Schroeder, S., Hesse, N., Weinberger, R., Tacke, U., Gerstl, L., Hilgendorff, A., Heinen, F., Arens, M., Bodensteiner, C., Dijkstra, L. J., Pujades, S., Black, M., Hadders-Algra, M.

Early Human Development, 144, pages: 104967, May 2020 (article)

Abstract
Background: General Movement Assessment (GMA) is a powerful tool to predict Cerebral Palsy (CP). Yet, GMA requires substantial training hampering its implementation in clinical routine. This inspired a world-wide quest for automated GMA. Aim: To test whether a low-cost, marker-less system for three-dimensional motion capture from RGB depth sequences using a whole body infant model may serve as the basis for automated GMA. Study design: Clinical case study at an academic neurodevelopmental outpatient clinic. Subjects: Twenty-nine high-risk infants were recruited and assessed at their clinical follow-up at 2-4 month corrected age (CA). Their neurodevelopmental outcome was assessed regularly up to 12-31 months CA. Outcome measures: GMA according to Hadders-Algra by a masked GMA-expert of conventional and computed 3D body model (“SMIL motion”) videos of the same GMs. Agreement between both GMAs was assessed, and sensitivity and specificity of both methods to predict CP at ≥12 months CA. Results: The agreement of the two GMA ratings was substantial, with κ=0.66 for the classification of definitely abnormal (DA) GMs and an ICC of 0.887 (95% CI 0.762;0.947) for a more detailed GM-scoring. Five children were diagnosed with CP (four bilateral, one unilateral CP). The GMs of the child with unilateral CP were twice rated as mildly abnormal. DA-ratings of both videos predicted bilateral CP well: sensitivity 75% and 100%, specificity 88% and 92% for conventional and SMIL motion videos, respectively. Conclusions: Our computed infant 3D full body model is an attractive starting point for automated GMA in infants at risk of CP.
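The agreement statistic reported above (κ=0.66 for the definitely-abnormal classification) is Cohen's kappa, which corrects raw agreement for agreement expected by chance. A minimal stdlib sketch of the computation (our illustration, not the study's statistical software):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two categorical ratings
    of the same items: (observed - expected) / (1 - expected)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    # Chance agreement: product of each rater's marginal frequencies.
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)
```

Values near 0 indicate chance-level agreement and 1.0 indicates perfect agreement; κ=0.66 is conventionally read as "substantial", as the abstract states.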

DOI Project Page [BibTex]



Real Time Trajectory Prediction Using Deep Conditional Generative Models

Gomez-Gonzalez, S., Prokudin, S., Schölkopf, B., Peters, J.

IEEE Robotics and Automation Letters, 5(2):970-976, IEEE, April 2020 (article)

arXiv DOI [BibTex]


Learning Multi-Human Optical Flow

Ranjan, A., Hoffmann, D. T., Tzionas, D., Tang, S., Romero, J., Black, M. J.

International Journal of Computer Vision (IJCV), 128(4):873-890, April 2020 (article)

Abstract
The optical flow of humans is well known to be useful for the analysis of human action. Recent optical flow methods focus on training deep networks to approach the problem. However, the training data used by them does not cover the domain of human motion. Therefore, we develop a dataset of multi-human optical flow and train optical flow networks on this dataset. We use a 3D model of the human body and motion capture data to synthesize realistic flow fields in both single- and multi-person images. We then train optical flow networks to estimate human flow fields from pairs of images. We demonstrate that our trained networks are more accurate than a wide range of top methods on held-out test data and that they can generalize well to real image sequences. The code, trained models and the dataset are available for research.
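Optical-flow accuracy of the kind compared above is conventionally measured with the average endpoint error (EPE). A minimal sketch of that metric (our illustration; the paper's evaluation pipeline is not specified here):

```python
def average_epe(flow_pred, flow_gt):
    """Average endpoint error: mean Euclidean distance between
    predicted and ground-truth 2D flow vectors, the standard
    accuracy metric for optical flow. Each flow is a list of
    (u, v) displacement pairs, one per pixel."""
    assert len(flow_pred) == len(flow_gt)
    total = 0.0
    for (u_p, v_p), (u_g, v_g) in zip(flow_pred, flow_gt):
        total += ((u_p - u_g) ** 2 + (v_p - v_g) ** 2) ** 0.5
    return total / len(flow_pred)
```

Lower is better; a perfect prediction yields an EPE of 0.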

pdf DOI poster link (url) Project Page [BibTex]


The iReAct study - A biopsychosocial analysis of the individual response to physical activity

Thiel, A., Sudeck, G., Gropper, H., Maturana, F. M., Schubert, T., Srismith, D., Widmann, M., Behrens, S., Martus, P., Munz, B., Giel, K., Zipfel, S., Niess, A. M.

Contemporary Clinical Trials Communications, 17, pages: 100508, March 2020 (article)

Abstract
Background Physical activity is a substantial promoter for health and well-being. Yet, while an increasing number of studies shows that the responsiveness to physical activity is highly individual, most studies focus this issue from only one perspective and neglect other contributing aspects. In reference to a biopsychosocial framework, the goal of our study is to examine how physically inactive individuals respond to two distinct standardized endurance trainings on various levels. Based on an assessment of activity- and health-related biographical experiences across the life course, our mixed-method study analyzes the responsiveness to physical activity in the form of a transdisciplinary approach, considering physiological, epigenetic, motivational, affective, and body image-related aspects. Methods Participants are randomly assigned to two different training programs (High Intensity Interval Training vs. Moderate Intensity Continuous Training) for six weeks. After this first training period, participants switch training modes according to a two-period sequential-training-intervention (STI) design and train for another six weeks. In order to analyse baseline characteristics as well as acute and adaptive biopsychosocial responses, three extensive mixed-methods diagnostic blocks take place at the beginning (t0) of the study and after the first (t1) and the second (t2) training period resulting in a net follow-up time of 15 weeks. The study is divided into five modules in order to cover a wide array of perspectives. Discussion The study's transdisciplinary mixed-method design allows to interlace a multitude of subjective and objective data and therefore to draw an integrated picture of the biopsychosocial efficacy of two distinct physical activity programs. The results of our study can be expected to contribute to the development and design of individualised training programs for the promotion of physical activity.

DOI [BibTex]



Machine learning systems and methods for augmenting images

Black, M., Rachlin, E., Lee, E., Heron, N., Loper, M., Weiss, A., Smith, D.

(US Patent 10,529,137 B1), January 2020 (patent)

Abstract
Disclosed is a method including receiving visual input comprising a human within a scene, detecting a pose associated with the human using a trained machine learning model that detects human poses to yield a first output, estimating a shape (and optionally a motion) associated with the human using a trained machine learning model associated that detects shape (and optionally motion) to yield a second output, recognizing the scene associated with the visual input using a trained convolutional neural network which determines information about the human and other objects in the scene to yield a third output, and augmenting reality within the scene by leveraging one or more of the first output, the second output, and the third output to place 2D and/or 3D graphics in the scene.

[BibTex]



Influence of Physical Activity Interventions on Body Representation: A Systematic Review

Srismith, D., Wider, L., Wong, H. Y., Zipfel, S., Thiel, A., Giel, K. E., Behrens, S. C.

Frontiers in Psychiatry, 11, pages: 99, 2020 (article)

Abstract
Distorted representation of one's own body is a diagnostic criterion and core psychopathology of disorders such as anorexia nervosa and body dysmorphic disorder. Previous literature has raised the possibility of utilising physical activity intervention (PI) as a treatment option for individuals suffering from poor body satisfaction, which is traditionally regarded as a systematic distortion in “body image.” In this systematic review, conducted according to the PRISMA statement, the evidence on effectiveness of PI on body representation outcomes is synthesised. We provide an update of 34 longitudinal studies evaluating the effectiveness of different types of PIs on body representation. No systematic risk of bias within or across studies was identified. The reviewed studies show that the implementation of structured PIs may be efficacious in increasing individuals’ satisfaction with their own body, and thus improving their subjective body image related assessments. However, there is no clear evidence regarding an additional or interactive effect of PI when implemented in conjunction with established treatments for clinical populations. We argue for theoretically sound, mechanism-oriented, multimethod approaches to future investigations on body image disturbance. Specifically, we highlight the need to consider expanding the theoretical framework for the investigation of body representation disturbances to include further body representations besides body image.

DOI [BibTex]


2019


Towards Geometric Understanding of Motion

Ranjan, A.

University of Tübingen, December 2019 (phdthesis)

Abstract

The motion of the world is inherently dependent on the spatial structure of the world and its geometry. Therefore, classical optical flow methods try to model this geometry to solve for the motion. However, recent deep learning methods take a completely different approach. They try to predict optical flow by learning from labelled data. Although deep networks have shown state-of-the-art performance on classification problems in computer vision, they have not been as effective in solving optical flow. The key reason is that deep learning methods do not explicitly model the structure of the world in a neural network, and instead expect the network to learn about the structure from data. We hypothesize that it is difficult for a network to learn about motion without any constraint on the structure of the world. Therefore, we explore several approaches to explicitly model the geometry of the world and its spatial structure in deep neural networks.

The spatial structure in images can be captured by representing it at multiple scales. To represent multiple scales of images in deep neural nets, we introduce a Spatial Pyramid Network (SPyNet). Such a network can leverage global information for estimating large motions and local information for estimating small motions. We show that SPyNet significantly improves over previous optical flow networks while also being the smallest and fastest neural network for motion estimation. SPyNet achieves a 97% reduction in model parameters over previous methods and is more accurate.
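The multi-scale representation at the heart of this approach can be illustrated with a minimal image-pyramid sketch (plain Python, our illustration only; SPyNet itself estimates a residual flow with a small convnet at each pyramid level):

```python
def downsample(img):
    """Average-pool a 2D grid (list of rows) by a factor of 2."""
    h = len(img) // 2 * 2
    w = len(img[0]) // 2 * 2
    return [[(img[y][x] + img[y][x + 1]
              + img[y + 1][x] + img[y + 1][x + 1]) / 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

def pyramid(img, levels):
    """Coarse-to-fine list of images, coarsest first. In a
    SPyNet-style scheme, flow estimated at a coarse level is
    upsampled and refined with a residual at the next level:
    large motions are captured coarsely, small motions finely."""
    out = [img]
    for _ in range(levels - 1):
        out.append(downsample(out[-1]))
    return out[::-1]
```

Processing the coarsest level first shrinks large displacements to a size a small network can handle, which is one reason such pyramid models need far fewer parameters.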

The spatial structure of the world extends to people and their motion. Humans have a very well-defined structure, and this information is useful in estimating optical flow for humans. To leverage this information, we create a synthetic dataset for human optical flow using a statistical human body model and motion capture sequences. We use this dataset to train deep networks and see significant improvement in the ability of the networks to estimate human optical flow.

The structure and geometry of the world affects the motion. Therefore, learning about the structure of the scene together with the motion can benefit both problems. To facilitate this, we introduce Competitive Collaboration, where several neural networks are constrained by geometry and can jointly learn about structure and motion in the scene without any labels. To this end, we show that jointly learning single view depth prediction, camera motion, optical flow and motion segmentation using Competitive Collaboration achieves state-of-the-art results among unsupervised approaches.

Our findings provide support for our hypothesis that explicit constraints on structure and geometry of the world lead to better methods for motion estimation.

PhD Thesis Project Page [BibTex]



Decoding subcategories of human bodies from both body- and face-responsive cortical regions

Foster, C., Zhao, M., Romero, J., Black, M. J., Mohler, B. J., Bartels, A., Bülthoff, I.

NeuroImage, 202(15):116085, November 2019 (article)

Abstract
Our visual system can easily categorize objects (e.g. faces vs. bodies) and further differentiate them into subcategories (e.g. male vs. female). This ability is particularly important for objects of social significance, such as human faces and bodies. While many studies have demonstrated category selectivity to faces and bodies in the brain, how subcategories of faces and bodies are represented remains unclear. Here, we investigated how the brain encodes two prominent subcategories shared by both faces and bodies, sex and weight, and whether neural responses to these subcategories rely on low-level visual, high-level visual or semantic similarity. We recorded brain activity with fMRI while participants viewed faces and bodies that varied in sex, weight, and image size. The results showed that the sex of bodies can be decoded from both body- and face-responsive brain areas, with the former exhibiting more consistent size-invariant decoding than the latter. Body weight could also be decoded in face-responsive areas and in distributed body-responsive areas, and this decoding was also invariant to image size. The weight of faces could be decoded from the fusiform body area (FBA), and weight could be decoded across face and body stimuli in the extrastriate body area (EBA) and a distributed body-responsive area. The sex of well-controlled faces (e.g. excluding hairstyles) could not be decoded from face- or body-responsive regions. These results demonstrate that both face- and body-responsive brain regions encode information that can distinguish the sex and weight of bodies. Moreover, the neural patterns corresponding to sex and weight were invariant to image size and could sometimes generalize across face and body stimuli, suggesting that such subcategorical information is encoded with a high-level visual or semantic code.

paper pdf DOI Project Page [BibTex]


Active Perception based Formation Control for Multiple Aerial Vehicles

Tallamraju, R., Price, E., Ludwig, R., Karlapalem, K., Bülthoff, H. H., Black, M. J., Ahmad, A.

IEEE Robotics and Automation Letters, 4(4):4491-4498, IEEE, October 2019 (article)

Abstract
We present a novel robotic front-end for autonomous aerial motion-capture (mocap) in outdoor environments. In previous work, we presented an approach for cooperative detection and tracking (CDT) of a subject using multiple micro-aerial vehicles (MAVs). However, it did not ensure optimal view-point configurations of the MAVs to minimize the uncertainty in the person's cooperatively tracked 3D position estimate. In this article, we introduce an active approach for CDT. In contrast to cooperatively tracking only the 3D positions of the person, the MAVs can actively compute optimal local motion plans, resulting in optimal view-point configurations, which minimize the uncertainty in the tracked estimate. We achieve this by decoupling the goal of active tracking into a quadratic objective and non-convex constraints corresponding to angular configurations of the MAVs w.r.t. the person. We derive this decoupling using Gaussian observation model assumptions within the CDT algorithm. We preserve convexity in optimization by embedding all the non-convex constraints, including those for dynamic obstacle avoidance, as external control inputs in the MPC dynamics. Multiple real robot experiments and comparisons involving 3 MAVs in several challenging scenarios are presented.

pdf DOI Project Page [BibTex]



Method for providing a three dimensional body model

Loper, M., Mahmood, N., Black, M.

September 2019, U.S. Patent 10,417,818 (patent)

Abstract
A method for providing a three-dimensional body model which may be applied for an animation, based on a moving body, wherein the method comprises providing a parametric three-dimensional body model, which allows shape and pose variations; applying a standard set of body markers; optimizing the set of body markers by generating an additional set of body markers and applying the same for providing 3D coordinate marker signals for capturing shape and pose of the body and dynamics of soft tissue; and automatically providing an animation by processing the 3D coordinate marker signals in order to provide a personalized three-dimensional body model, based on estimated shape and an estimated pose of the body by means of predicted marker locations.

MoSh Project pdf [BibTex]


Decoding the Viewpoint and Identity of Faces and Bodies

Foster, C., Zhao, M., Bolkart, T., Black, M., Bartels, A., Bülthoff, I.

Journal of Vision, 19(10): 54c, pages: 54-55, Arvo Journals, September 2019 (article)


link (url) DOI Project Page [BibTex]
