Figure: Top left: synthesizing realistic 3D face motion from speech signals. Top right: parsing human motions using hierarchical dynamic clustering. Middle: synthesizing human motions for different body shapes using physics. Bottom: a generative model of motions conditioned on actions.
Much of our work focuses on capturing or estimating human movement. For this we seek metrically accurate 3D movement with increasing levels of detail. We are interested, however, in more than the movement of the joints, the facial muscles, and the fingers. What we really seek is what lies behind human movement: the goals, motives, emotions, and plans that drive it.
A first step towards understanding human motion is being able to predict it. To that end, we train deep neural networks on motion capture data to generate realistic human movements. These networks either predict future motion from past observations or generate motion conditioned on linguistic descriptions.
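As a toy illustration of this kind of model, here is a minimal autoregressive predictor in PyTorch. The GRU architecture, the 72-D flattened SMPL-style pose representation, and all dimensions are assumptions made for the sketch, not the networks from our papers.

```python
# A minimal sketch of autoregressive motion prediction. Poses are assumed to
# be flattened SMPL-style rotation vectors (72-D per frame); this is a toy
# illustration, not the architecture used in our work.
import torch
import torch.nn as nn

class MotionPredictor(nn.Module):
    def __init__(self, pose_dim=72, hidden=256):
        super().__init__()
        self.gru = nn.GRU(pose_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, pose_dim)

    def forward(self, past):            # past: (batch, frames, pose_dim)
        feats, _ = self.gru(past)       # encode the observed motion
        return self.head(feats[:, -1])  # predict the next frame's pose

model = MotionPredictor()
past = torch.randn(8, 30, 72)           # 8 clips, 30 observed frames each
next_pose = model(past)                  # (8, 72) predicted next poses
loss = nn.functional.mse_loss(next_pose, torch.randn(8, 72))
loss.backward()                          # train by regressing the next frame
```

At test time such a model is rolled out autoregressively: the predicted frame is appended to the history and fed back in to generate arbitrarily long motions.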
To train neural networks to generate human motion, we need training data. We therefore used MoSh to transform multiple datasets of mocap markers into a consistent body representation (SMPL), resulting in the AMASS dataset. This gives us sufficient training data for deep networks to be effective.
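For concreteness, here is a minimal sketch of reading one AMASS sequence. The keys follow the published AMASS .npz layout; the file path is a hypothetical example.

```python
# A minimal sketch of loading one AMASS sequence. AMASS distributes motions
# as .npz archives of SMPL parameters; the keys below follow the documented
# data format, but the path is hypothetical.
import numpy as np

seq = np.load("CMU/01/01_01_poses.npz")   # hypothetical example file

poses = seq["poses"]           # (T, 156) SMPL-H pose parameters per frame
betas = seq["betas"]           # (16,) body shape, shared across the sequence
trans = seq["trans"]           # (T, 3) root translation per frame
fps = float(seq["mocap_framerate"])

root_orient = poses[:, :3]     # global root orientation (axis-angle)
body_pose = poses[:, 3:66]     # 21 body joints x 3 axis-angle parameters
print(f"{poses.shape[0]} frames at {fps:.0f} fps")
```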
While data-driven methods are interesting, human movement is, to some extent, governed by the physics of the world. For example, a large person and a small person will do jumping jacks differently due to their mass and its distribution. To efficiently capture such variation, we turn to physics-based models of human movement. We envision solutions that combine the best of data-driven and physics-based methods.
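A back-of-the-envelope point-mass model makes this intuition concrete: with the same leg force, a heavier body reaches a lower jump height. The force, push-off distance, and masses below are invented purely for illustration.

```python
# A toy illustration of why body mass changes movement: the same leg force
# produces different jump heights for different masses. All numbers are
# made up for the example.
g = 9.81  # m/s^2

def jump_height(mass, leg_force, push_off_dist=0.4):
    """Point-mass model: net work during push-off becomes kinetic energy."""
    net_work = (leg_force - mass * g) * push_off_dist
    if net_work <= 0:
        return 0.0                      # force too weak to leave the ground
    v_takeoff = (2.0 * net_work / mass) ** 0.5
    return v_takeoff ** 2 / (2.0 * g)   # ballistic peak above takeoff

for mass in (60.0, 90.0):               # kg; hypothetical small/large person
    print(mass, "kg ->", round(jump_height(mass, leg_force=1500.0), 2), "m")
```

Running this prints roughly 0.62 m for the 60 kg body and 0.28 m for the 90 kg body, which is exactly the kind of shape- and mass-dependent variation a physics-based model captures for free, while a purely data-driven model must learn it from examples.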
Our interest goes deeper than prediction and physics to the causes of movement. To study this, we track human movement together with speech so that we can relate intent and goals to behavior. We use these data to model how speech drives facial motion and to relate human movement to scripts in movies, working at both the pixel level and the level of 3D movement.
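As a sketch of the speech-to-face direction, the toy regressor below maps mel-spectrogram frames to per-frame facial expression coefficients (e.g., FLAME-style). The convolutional architecture and all dimensions are assumptions, not the model from any specific paper.

```python
# A minimal sketch of regressing facial motion from speech. Audio is assumed
# to arrive as mel-spectrogram frames, and the face is parameterized by a
# small set of expression coefficients; this is a toy model.
import torch
import torch.nn as nn

class Speech2Face(nn.Module):
    def __init__(self, n_mels=80, n_expr=50):
        super().__init__()
        # temporal convolutions pool acoustic context around each frame
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, n_expr, kernel_size=1),
        )

    def forward(self, mel):             # mel: (batch, n_mels, frames)
        return self.net(mel)            # (batch, n_expr, frames)

audio = torch.randn(4, 80, 100)         # 4 clips, 100 audio frames each
expr = Speech2Face()(audio)             # per-frame expression coefficients
```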
We argue that the capture, modeling, and synthesis of human motion form a virtuous cycle. If we can synthesize avatars behaving realistically in virtual worlds, then we must have modeled essential elements of human behavior. If we can build human and animal avatars that have goals, can see, and can act, then we can generate an effectively unlimited amount of training data that will let us better analyze the behavior of real people.