Flexible Motion In-betweening with Diffusion Models

1University of British Columbia, 2Tel-Aviv University, 3Simon Fraser University, 4NVIDIA

We present a simple unified model capable of generating precise and diverse motions that conform to a flexible range of user-specified spatial constraints, as well as text descriptions. From left to right: motion conditioned on sparse keyframes; motion conditioned on root trajectory and a "throwing" prompt; diverse motions generated for the same keyframes.


Motion in-betweening, a fundamental task in character animation, consists of generating motion sequences that plausibly interpolate user-provided keyframe constraints. It has long been recognized as a labor-intensive and challenging process. We investigate the potential of diffusion models in generating diverse human motions guided by keyframes. Unlike previous in-betweening methods, we propose a simple unified model capable of generating precise and diverse motions that conform to a flexible range of user-specified spatial constraints, as well as text conditioning. To this end, we propose Conditional Motion Diffusion In-betweening (CondMDI), which allows for arbitrary dense-or-sparse keyframe placement and partial keyframe constraints while generating high-quality motions that are diverse and coherent with the given keyframes. We evaluate the performance of CondMDI on the text-conditioned HumanML3D dataset and demonstrate the versatility and efficacy of diffusion models for keyframe in-betweening. We further explore the use of guidance and imputation-based approaches for inference-time keyframing and compare CondMDI against these methods.

CondMDI Overview

When performing motion keyframing, our goal is to produce realistic motions that adhere to a set of spatio-temporally sparse input keyframes while maintaining coherence between these observed keyframes and the rest of the generated motion sequence. We use a diffusion model to generate the motions. To condition the model on keyframes, we use a simple and effective approach: at the observed frames, we replace the noisy input motion with the given keyframes, and we concatenate the observation mask to the input to further inform the model of the keyframe locations. The model is trained on randomly sampled keyframes with randomly sampled joints, together with a mask that indicates the observed keyframes and features. This offers significant flexibility in the number of keyframes and their placement in time, and it also supports partial keyframes, i.e., keyframes that provide information for only a subset of the joints.
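The conditioning step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the per-feature sampling probability, and the use of plain nested lists (frames by features) are assumptions made for illustration, and the exact way keyframes and joints are sampled during training is not specified here.

```python
import random


def sample_keyframe_mask(num_frames, num_features, num_keyframes, feature_keep_prob, rng):
    """Pick random keyframes, then randomly mark features as observed in each.

    Assumption: features are sampled independently per keyframe.
    """
    mask = [[0] * num_features for _ in range(num_frames)]
    for t in rng.sample(range(num_frames), num_keyframes):
        for j in range(num_features):
            if rng.random() < feature_keep_prob:
                mask[t][j] = 1
    return mask


def build_model_input(noisy_motion, clean_motion, mask):
    """Overwrite observed entries with keyframe values and append the mask as extra channels."""
    model_input = []
    for noisy_row, clean_row, mask_row in zip(noisy_motion, clean_motion, mask):
        mixed = [c if m else n for n, c, m in zip(noisy_row, clean_row, mask_row)]
        model_input.append(mixed + [float(m) for m in mask_row])
    return model_input


rng = random.Random(0)
T, D = 8, 3  # toy sizes: 8 frames, 3 features per frame
clean = [[float(t)] * D for t in range(T)]  # stand-in for a ground-truth motion
noisy = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]  # stand-in for the noised motion
mask = sample_keyframe_mask(T, D, num_keyframes=3, feature_keep_prob=0.5, rng=rng)
x_in = build_model_input(noisy, clean, mask)  # T rows, each with 2 * D values
```

Each row of `x_in` holds the mixed motion features followed by the mask for that frame, so the network sees both the keyframe values and where they are located.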

framework overview

Sparse Keyframes

Generated motions are high-quality given sparse keyframes placed 5, 10, or 20 frames apart.

Keyframes every 5 frames

Keyframes every 10 frames

Keyframes every 20 frames

Generated results are still high-quality when keyframes are placed more sparsely.

Keyframes every 40 frames

Keyframes randomly placed at frames 35-40-80-140

Partial Keyframes

CondMDI supports partial keyframes, allowing control over individual joints.

Conditioned on the root joint trajectory

Conditioned on the trajectory of the head and two wrists

Conditioned on the right wrist trajectory
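The sparse and partial keyframe settings shown in these examples differ only in which entries of the observation mask are set. The sketch below builds such masks at the joint level. The helper name, the sequence length, the joint count, and the root joint index are hypothetical choices for illustration and are not taken from the model's actual motion representation.

```python
def make_observation_mask(num_frames, num_joints, keyframes, joints):
    """Return a num_frames x num_joints 0/1 mask: 1 where a joint is observed at a keyframe."""
    frame_set = set(keyframes)
    joint_set = set(joints)
    return [
        [1 if (t in frame_set and j in joint_set) else 0 for j in range(num_joints)]
        for t in range(num_frames)
    ]


NUM_FRAMES, NUM_JOINTS = 160, 22  # illustrative sizes (assumption)
ALL_JOINTS = range(NUM_JOINTS)
ROOT = 0  # hypothetical index of the root joint

# Sparse full-pose keyframes placed every 20 frames.
every_20 = make_observation_mask(NUM_FRAMES, NUM_JOINTS, range(0, NUM_FRAMES, 20), ALL_JOINTS)

# Sparse full-pose keyframes at irregular frames.
irregular = make_observation_mask(NUM_FRAMES, NUM_JOINTS, [35, 40, 80, 140], ALL_JOINTS)

# Partial keyframes: the root joint is observed at every frame (a dense trajectory).
root_trajectory = make_observation_mask(NUM_FRAMES, NUM_JOINTS, range(NUM_FRAMES), [ROOT])
```

Conditioning on the head and two wrists would follow the same pattern, passing the three corresponding joint indices instead of `[ROOT]`.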

Text Conditioning

CondMDI allows for text prompts to be used as an additional condition to control the generated motions. As an example, we show the results of conditioning on the same root joint trajectory given different text prompts.

"a person is walking"

"a person is dancing"

"a person is exercising"

"a person is waving"



Here, we compare CondMDI with two inference-time methods, imputation and reconstruction guidance, on the task of keyframe in-betweening. With imputation alone, the model largely ignores the keyframes, so the results show large jumps at the keyframes. Reconstruction guidance improves on imputation by introducing cohesion between the generated motion and the keyframes, but it still produces some jumps at the keyframes. CondMDI outperforms both methods by generating high-quality, smooth motions.
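For readers unfamiliar with inference-time imputation, the following is a simplified sketch of the idea: after each denoising step, the observed entries of the sample are overwritten with the keyframe values. The `denoise_step` callable, the `stop_at` parameter, and the toy denoiser are hypothetical stand-ins, not the setup used in the comparison. Implementations also differ on whether the keyframes are noised to the current noise level before being written in; that detail is omitted here.

```python
import random


def impute(sample, keyframes, mask):
    """Overwrite the observed entries of `sample` with the keyframe values."""
    return [
        [k if m else s for s, k, m in zip(s_row, k_row, m_row)]
        for s_row, k_row, m_row in zip(sample, keyframes, mask)
    ]


def sample_with_imputation(denoise_step, keyframes, mask, num_steps, stop_at, rng):
    """Run the reverse loop from step num_steps - 1 down to 0.

    Keyframes are imputed after every step whose index is >= stop_at.
    """
    x = [[rng.gauss(0.0, 1.0) for _ in row] for row in keyframes]  # start from noise
    for step in reversed(range(num_steps)):
        x = denoise_step(x, step)
        if step >= stop_at:
            x = impute(x, keyframes, mask)
    return x


def toy_denoiser(x, step):
    """Stand-in for a trained model: shrinks every value toward zero."""
    return [[0.9 * v for v in row] for row in x]


rng = random.Random(0)
keyframes = [[1.0, 2.0], [0.0, 0.0], [3.0, 4.0]]
mask = [[1, 1], [0, 0], [1, 0]]  # frame 0 fully observed, frame 2 partially observed

# Replace until the last denoising step: observed entries match the keyframes exactly.
until_last = sample_with_imputation(toy_denoiser, keyframes, mask, num_steps=10, stop_at=0, rng=rng)

# Stop replacing before the final step: the model has the last word on observed entries.
stop_early = sample_with_imputation(toy_denoiser, keyframes, mask, num_steps=10, stop_at=1, rng=rng)
```

Because imputation only edits the sample from outside, nothing forces the unobserved frames to agree with the overwritten ones, which is consistent with the jumps described above.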

Imputation - stop replacement at denoising step one

Imputation - replacement until last denoising step

Reconstruction Guidance


Other Methods

Root Joint Control. Here, we compare CondMDI against state-of-the-art diffusion-based methods on the root joint control task. Although CondMDI is designed for sparse keyframe in-betweening, it tracks the observed trajectory well and with little foot skating.






Sparse Keyframe In-betweening. Here, we compare CondMDI against OmniControl on the sparse keyframe in-betweening task. OmniControl exhibits foot skating and adheres less closely to the keyframes than our method.