Flexible Motion In-betweening with Diffusion Models

Motion in-betweening, a fundamental task in character animation, consists of generating motion sequences that plausibly interpolate user-provided keyframe constraints. It has long been recognized as a labor-intensive and challenging process. We investigate the potential of diffusion models in generating diverse human motions guided by keyframes. Unlike previous in-betweening methods, we propose a simple unified model capable of generating precise and diverse motions that conform to a flexible range of user-specified spatial constraints, as well as text conditioning. To this end, we propose Conditional Motion Diffusion In-betweening (CondMDI), which allows for arbitrary dense-or-sparse keyframe placement and partial keyframe constraints while generating high-quality motions that are diverse and coherent with the given keyframes. We evaluate the performance of CondMDI on the text-conditioned HumanML3D dataset and demonstrate the versatility and efficacy of diffusion models for keyframe in-betweening. We further explore the use of guidance and imputation-based approaches for inference-time keyframing and compare CondMDI against these methods.

When performing motion keyframing, our goal is to produce realistic motions that adhere to a set of spatio-temporally sparse input keyframes while maintaining coherence between these observed keyframes and the entirety of the generated motion sequence. We use a diffusion model for generating the motions. To condition the model on keyframes, we use a simple and effective approach: replace the noisy input motion with the given keyframes over the provided frames, and concatenate the observation mask to further inform the model of the keyframe locations. Our model is then trained on randomly sampled keyframes with randomly sampled joints, together with a mask that indicates the observed keyframes and features. This offers significant flexibility in the number of keyframes and their placement in time, and also supports partial keyframes, i.e., keyframes that provide information for only a subset of the joints.
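The conditioning step described above can be illustrated with a minimal NumPy sketch. It is not the released CondMDI code: the function names `sample_keyframe_mask` and `build_model_input`, the `(frames, features)` array layout, the `max_keyframes` parameter, and the uniform sampling of frames and features are all assumptions made for illustration.

```python
import numpy as np


def sample_keyframe_mask(num_frames, num_features, rng, max_keyframes=20):
    """Sample a random binary observation mask of shape (num_frames, num_features).

    Hypothetical sampling scheme: pick a random number of keyframes, then
    observe a random non-empty subset of features on each one (partial keyframes).
    """
    mask = np.zeros((num_frames, num_features), dtype=np.float32)
    num_keyframes = int(min(rng.integers(1, max_keyframes + 1), num_frames))
    frames = rng.choice(num_frames, size=num_keyframes, replace=False)
    for t in frames:
        num_observed = int(rng.integers(1, num_features + 1))
        features = rng.choice(num_features, size=num_observed, replace=False)
        mask[t, features] = 1.0
    return mask


def build_model_input(noisy_motion, clean_motion, mask):
    """Overwrite observed entries with keyframe values, then append the mask."""
    # Observed entries come from the clean keyframes; the rest stay noisy.
    conditioned = mask * clean_motion + (1.0 - mask) * noisy_motion
    # Concatenate the mask along the feature axis so the model sees
    # which entries are keyframe observations.
    return np.concatenate([conditioned, mask], axis=-1)


rng = np.random.default_rng(0)
clean = rng.standard_normal((60, 8)).astype(np.float32)  # toy motion clip
noisy = rng.standard_normal((60, 8)).astype(np.float32)  # stand-in for a noised sample
mask = sample_keyframe_mask(60, 8, rng)
model_input = build_model_input(noisy, clean, mask)
```

In this sketch the denoiser would receive an input with twice the feature dimension of the motion: the first half holds the partially overwritten motion and the second half holds the mask.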

Generated motions are high-quality given sparse keyframes placed 5, 10, or 20 frames apart.

Generated results are still high-quality when keyframes are placed more sparsely.

CondMDI supports partial keyframes, allowing for joint-level control.

CondMDI allows for text prompts to be used as an additional condition to control the generated motions. As an example, we show the results of conditioning on the same root joint trajectory given different text prompts.

Ablations

Here, we compare CondMDI against inference-time imputation and reconstruction guidance on the task of keyframe in-betweening. With imputation alone, the generated motion largely ignores the keyframes, so the results exhibit large jumps at the keyframes. Reconstruction guidance improves on imputation by introducing cohesion between the generated motion and the keyframes, but still exhibits some jumps at the keyframes. CondMDI outperforms both methods, generating high-quality, smooth motions.
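To make the jump artifact concrete, the toy sketch below overwrites a model prediction with a keyframe value that the prediction did not account for, then measures the largest frame-to-frame change. This is only an illustration of why overwriting alone can produce discontinuities. The helper names `impute_prediction` and `max_frame_jump` are hypothetical, and the all-zero "prediction" stands in for a denoiser output that ignored the keyframe.

```python
import numpy as np


def impute_prediction(predicted_motion, keyframes, mask):
    """Overwrite observed entries of a prediction with the keyframe values."""
    return np.where(mask > 0, keyframes, predicted_motion)


def max_frame_jump(motion):
    """Largest absolute change between consecutive frames."""
    return float(np.abs(np.diff(motion, axis=0)).max())


# Toy setup: 10 frames, 1 feature. The prediction is flat at zero,
# while a single keyframe at frame 5 asks for the value 1.
predicted = np.zeros((10, 1), dtype=np.float32)
keyframes = np.zeros((10, 1), dtype=np.float32)
mask = np.zeros((10, 1), dtype=np.float32)
keyframes[5, 0] = 1.0
mask[5, 0] = 1.0

imputed = impute_prediction(predicted, keyframes, mask)
jump_before = max_frame_jump(predicted)  # 0.0: the raw prediction is smooth
jump_after = max_frame_jump(imputed)     # 1.0: discontinuity at the keyframe
```

The unobserved frames around the keyframe are untouched by the overwrite, which is why the surrounding motion has to be made consistent with the keyframes by some other means, such as guidance or a model trained with keyframe conditioning.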

Other Methods

Root Joint Control. Here, we compare CondMDI against state-of-the-art diffusion-based methods on the root joint control task. Although CondMDI is designed for sparse keyframe in-betweening, it tracks the observed trajectory well and with little foot skating.
