I graduated and joined the startup Bodylabs in September 2017, which was acquired by Amazon. In March 2019, I joined the Facebook Reality Labs in Sausalito.
05/2015 to 07/2017
Ph.D. student, Max-Planck-Institute for Intelligent Systems and Bernstein Center for Computational Neuroscience, Tuebingen (Germany).
Advisor: Dr. Peter Gehler
06/2012 to 04/2015
Ph.D. student, Multimedia Computing and Computer Vision Lab, University of Augsburg (Germany).
Advisor: Prof. Dr. Rainer Lienhart.
Diploma in Computer Science, University of Augsburg (Germany). Thesis: "An Analysis of Successful Approaches to Human Pose Estimation".
Advanced degree (equiv. to M.Sc.) in Computer Science from a German university.
In Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), Febuary 2020 (inproceedings)
The goal of many computer vision systems is to transform image pixels into 3D representations. Recent popular models use neural networks to regress directly from pixels to 3D object parameters. Such an approach works well when supervision is available, but in problems like human pose and shape estimation, it is difficult to obtain natural images with 3D ground truth. To go one step further, we propose a new architecture that facilitates unsupervised, or lightly supervised, learning. The idea is to break the problem into a series of transformations between increasingly abstract representations. Each step involves a cycle designed to be learnable without annotated training data, and the chain of cycles delivers the final solution. Specifically, we use 2D body part segments as an intermediate representation that contains enough information to be lifted to 3D, and at the same time is simple enough to be learned in an unsupervised way. We demonstrate the method by learning 3D human pose and shape from un-paired and un-annotated images. We also explore varying amounts of paired data and show that cycling greatly alleviates the need for paired data. While we present results for modeling humans, our formulation is general and can be applied to other vision problems.
International Conference on Computer Vision, pages: 4332-4341, October 2019 (conference)
With an increased availability of 3D scanning technology, point clouds are moving into the focus of computer vision as a rich representation of everyday scenes. However, they are hard to handle for machine learning algorithms due to the unordered structure. One common approach is to apply voxelization, which dramatically increases the amount of data stored and at the same time loses details through discretization. Recently, deep learning models with hand-tailored architectures were proposed to handle point clouds directly and achieve input permutation invariance. However, these architectures use an increased number of parameters and are computationally inefficient. In this work we propose basis point sets as a highly efficient and fully general way to process point clouds with machine learning algorithms. Basis point sets are a residual representation that can be computed efficiently and can be used with standard neural network architectures. Using the proposed representation as the input to a relatively simple network allows us to match the performance of PointNet on a shape classification task while using three order of magnitudes less floating point operations. In a second experiment, we show how proposed representation can be used for obtaining high resolution meshes from noisy 3D scans. Here, our network achieves performance comparable to the state-of-the-art computationally intense multi-step frameworks, in one network pass that can be done in less than 1ms.
Direct prediction of 3D body pose and shape remains a challenge even for highly parameterized deep learning models. Mapping from the 2D image space to the prediction space is difficult: perspective ambiguities make the loss function noisy and training data is scarce. In this paper, we propose a novel approach (Neural Body Fitting (NBF)). It integrates a statistical body model within a CNN, leveraging reliable bottom-up semantic body part segmentation and robust top-down body model constraints. NBF is fully differentiable and can be trained using 2D and 3D annotations. In detailed experiments, we analyze how the components of our model affect performance, especially the use of part segmentations as an explicit intermediate representation, and present a robust, efficiently trainable framework for 3D human pose estimation from 2D images with competitive results on standard benchmarks. Code is available at https://github.com/mohomran/neural_body_fitting
In Proceedings IEEE International Conference on Computer Vision (ICCV), IEEE, Piscataway, NJ, USA, October 2017 (inproceedings)
We present the first image-based generative model of people in clothing in a full-body setting. We sidestep the commonly used complex graphics rendering pipeline and the need for high-quality 3D scans of dressed people. Instead, we learn generative models from a large image database. The main challenge is to cope with the high variance in human pose, shape and appearance. For this reason, pure image-based approaches have not been considered so far. We show that this challenge can be overcome by splitting the generating process in two parts. First, we learn to generate a semantic segmentation of the body and clothing. Second, we learn a conditional model on the resulting segments that creates realistic images. The full model is differentiable and can be conditioned on pose, shape or color. The result are samples of people in different clothing items and styles. The proposed model can generate entirely new people with realistic clothing. In several experiments we present encouraging results that suggest an entirely data-driven approach to people generation is possible.
In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, IEEE, Piscataway, NJ, USA, July 2017 (inproceedings)
3D models provide a common ground for different representations of human bodies. In turn, robust 2D estimation has proven to be a powerful tool to obtain 3D fits “in-the-wild”. However, depending on the level of detail, it can be hard to impossible to acquire labeled data for training 2D estimators on large scale. We propose a hybrid approach to this problem: with an extended version of the recently introduced SMPLify method, we obtain high quality 3D body model fits for multiple human pose datasets. Human annotators solely sort good and bad fits. This procedure leads to an initial dataset, UP-3D, with rich annotations. With a comprehensive set of experiments, we show how this data can be used to train discriminative models that produce results with an unprecedented level of detail: our models predict 31 segments and 91 landmark locations on the body. Using the 91 landmark pose estimator, we present state-of-the art results for 3D human pose and shape estimation using an order of magnitude less training data and without assumptions about gender or pose in the fitting procedure. We show that UP-3D can be enhanced with these improved fits to grow in quantity and quality, which makes the system deployable on large scale. The data, code and models are available for research purposes.
Early stopping is a widely used technique to prevent poor generalization performance when training an over-expressive model by means of gradient-based optimization. To find a good point to halt the optimizer, a common practice is to split the dataset into a training and a smaller validation set to obtain an ongoing estimate of the generalization performance. In this paper we propose a novel early stopping criterion which is based on fast-to-compute, local statistics of the computed gradients and entirely removes the need for a held-out validation set. Our experiments show that this is a viable approach in the setting of least-squares and logistic regression as well as neural networks.
In International Conference on 3D Vision (3DV), pages: 421-430, 2017 (inproceedings)
Existing markerless motion capture methods often assume known backgrounds, static cameras, and sequence specific motion priors, limiting their application scenarios. Here we present a fully automatic method that, given multiview videos, estimates 3D human pose and body shape. We take the recently proposed SMPLify method  as the base method and extend it in several ways. First we fit a 3D human body model to 2D features detected in multi-view images. Second, we use a CNN method to segment the person in each image and fit the 3D body model to the contours, further improving accuracy. Third we utilize a generic and robust DCT temporal prior to handle the left and right side swapping issue sometimes introduced by the 2D pose estimator. Validation on standard benchmarks shows our results are comparable to the state of the art and also provide a realistic 3D shape avatar. We also demonstrate accurate results on HumanEva and on challenging monocular sequences of dancing from YouTube.
In Computer Vision – ECCV 2016, pages: 561-578, Lecture Notes in Computer Science, Springer International Publishing, October 2016 (inproceedings)
We describe the first method to automatically estimate the 3D pose of the human body as well as its 3D shape from a single unconstrained image. We estimate a full 3D mesh and show that 2D joints alone carry a surprising amount of information about body shape. The problem is challenging because of the complexity of the human body, articulation, occlusion, clothing, lighting, and the inherent ambiguity in inferring 3D from 2D. To solve this, we first use a recently published CNN-based method, DeepCut, to predict (bottom-up) the 2D body joint locations. We then fit (top-down) a recently published statistical body shape model, called SMPL, to the 2D joints. We do so by minimizing an objective function that penalizes the error between the projected 3D model joints and detected 2D joints. Because SMPL captures correlations in human shape across the population, we are able to robustly fit it to very little data. We further leverage the 3D model to prevent solutions that cause interpenetration. We evaluate our method, SMPLify, on the Leeds Sports, HumanEva, and Human3.6M datasets, showing superior
pose accuracy with respect to the state of the art.
In ACM Multimedia Open Source Software Competition, October 2016 (inproceedings)
The caffe framework is one of the leading deep learning toolboxes in the machine learning and computer vision community. While it offers efficiency and configurability, it falls short of a full interface to Python. With increasingly involved procedures for training deep networks and reaching depths of hundreds of layers, creating configuration files and keeping them consistent becomes an error prone process.
We introduce the barrista framework, offering full, pythonic control over caffe. It separates responsibilities and offers code to solve frequently occurring tasks for pre-processing, training and model inspection. It is compatible to all caffe versions since mid 2015 and can import and export .prototxt files.
Examples are included, e.g., a deep residual network implemented in only 172 lines (for arbitrary depths), comparing to 2320 lines in the official implementation for the equivalent model.
In ACM Transactions on Multimedia (ACMMM) Open-source Software Competition, October 2015 (inproceedings)
Since the introduction of Random Forests in the 80's they have been a frequently used statistical tool for a variety of machine learning tasks. Many different training algorithms and model adaptions demonstrate the versatility of the forests. This variety resulted in a fragmentation of research and code, since each adaption requires its own algorithms and representations.
In 2011, Criminisi and Shotton developed a unifying Decision Forest model for many tasks. By identifying the reusable parts and specifying clear interfaces, we extend this approach to an object oriented representation and implementation. This has the great advantage that research on specific parts of the Decision Forest model can be done `locally' by reusing well-tested and high-performance components.
Our fertilized forests library is open source and easy to extend. It provides components allowing for parallelization up to node optimization level to exploit modern many core architectures. Additionally, the library provides consistent and easy-to-maintain interfaces to C++, Python and Matlab and offers cross-platform and cross-interface persistence.
Schiendorfer, A., Lassner, C., Anders, G., Reif, W., Lienhart, R.
In International Conference on Self-adaptive and Self-organizing Systems (SASO), September 2015 (inproceedings)
Many large-scale systems benefit from an organizational structure to provide for problem decomposition. A pivotal problem solving setting is given by hierarchical control systems familiar from hierarchical task networks. If these structures can be modified autonomously by, e.g., coalition formation and reconfiguration, adequate decisions on higher levels require a faithful abstracted model of a collective of agents. An illustrative example is found in calculating schedules for a set of power plants organized in a hierarchy of Autonomous Virtual Power Plants. Functional dependencies over the combinatorial domain, such as the joint costs or rates of change of power production, are approximated by repeatedly sampling input-output pairs and substituting the actual functions by piecewise linear functions. However, if the sampled data points are weakly informative, the resulting abstracted high-level optimization introduces severe errors. Furthermore, obtaining additional point labels amounts to solving computationally hard optimization problems. Building on prior work, we propose to apply techniques from active learning to maximize the information gained by each additional point. Our results show that significantly better allocations in terms of cost-efficiency (up to 33.7 % reduction in costs in our case study) can be found with fewer but carefully selected sampling points using Decision Forests.
Schiendorfer, A., Lassner, C., Anders, G., Reif, W., Lienhart, R.
In 3rd Workshop on Self-optimisation in Organic and Autonomic Computing Systems (SAOS), March 2015 (inproceedings)
Organizational structures such as hierarchies provide an effective means to deal with the increasing complexity found in large-scale energy systems. In hierarchical systems, the concrete functions describing the subsystems can be replaced by abstract piecewise linear functions to speed up the optimization process. However, if the data points are weakly informative the resulting abstracted optimization problem introduces severe errors and exhibits bad runtime performance. Furthermore, obtaining additional point labels amounts to solving computationally hard optimization problems. Therefore, we propose to apply methods from active learning to search for informative inputs. We present first results experimenting with Decision Forests and Gaussian Processes that motivate further research. Using points selected by Decision Forests, we could reduce the average mean-squared error of the abstract piecewise linear function by one third.
IEEE Winter Conference on Applications of Computer Vision (WACV), January 2015 (conference)
The entropy measurement function is a central element of
decision forest induction. The Shannon entropy and other
generalized entropies such as the Renyi and Tsallis entropy are designed to fulfill the Khinchin-Shannon axioms. Whereas these axioms are appropriate for physical systems,
they do not necessarily model well the artificial system of
decision forest induction.
In this paper, we show that when omitting two of the four
axioms, every norm induces an entropy function. The remaining two axioms are sufficient to describe the requirements for an entropy function in the decision forest context.
Furthermore, we introduce and analyze the
entropy, show relations to existing entropies and the relation
to various heuristics that are commonly used for decision
In experiments with classification, regression and the recently introduced Hough forests, we show how the discrete
and differential form of the new entropy can be used for
forest induction and how the functions can simply be fine-tuned. The experiments indicate that the impact of the entropy function is limited, however can be a simple and useful
post-processing step for optimizing decision forests for high
In IEEE International Conference on Machine Learning and Applications (ICMLA), December 2013 (inproceedings)
Feature learning has the aim to take away the hassle of hand-designing features for machine learning tasks. Since the feature design process is tedious and requires a lot of experience,
an automated solution is of great interest. However, an important problem in this field is that usually no objective values are available to fit a feature learning function to.
Artificial Neural Networks are a sufficiently flexible tool for function approximation to be able to avoid this problem. We show how the error function of an ANN can be modified such that it works solely with objective distances instead of objective values. We derive the adjusted rules for backpropagation through networks with arbitrary depths and include practical considera-
tions that must be taken into account to apply difference based learning successfully.
On all three benchmark datasets we use, linear SVMs trained on automatically learned ANN features outperform RBF kernel SVMs trained on the raw data. This can be achieved in a feature space with up to only a tenth of dimensions of the number of original data dimensions. We conclude our work with two experiments on distance based ANN training in two further fields: data visualization and outlier detection.
An Analysis of Successful Approaches to Human Pose Estimation, University of Augsburg, University of Augsburg, May 2012 (mastersthesis)
The field of Human Pose Estimation is developing fast and lately leaped forward
with the release of the Kinect system. That system reaches a very good perfor-
mance for pose estimation using 3D scene information, however pose estimation
from 2D color images is not solved reliably yet. There is a vast amount of pub-
lications trying to reach this aim, but no compilation of important methods and
solution strategies. The aim of this thesis is to fill this gap: it gives an introductory
overview over important techniques by analyzing four current (2012) publications
in detail. They are chosen such, that during their analysis many frequently used
techniques for Human Pose Estimation can be explained. The thesis includes two
introductory chapters with a definition of Human Pose Estimation and exploration
of the main difficulties, as well as a detailed explanation of frequently used methods.
A final chapter presents some ideas on how parts of the analyzed approaches can
be recombined and shows some open questions that can be tackled in future work.
The thesis is therefore a good entry point to the field of Human Pose Estimation
and enables the reader to get an impression of the current state-of-the-art.
Our goal is to understand the principles of Perception, Action and Learning in autonomous systems that successfully interact with complex environments and to use this understanding to design future systems