Motion in-betweening, a fundamental task in character animation, consists of generating motion sequences that plausibly interpolate user-provided keyframe constraints. It has long been recognized as a labor-intensive and challenging process. We investigate the potential of diffusion models for generating diverse human motions guided by keyframes. Unlike previous in-betweening methods, we propose a simple unified model capable of generating precise and diverse motions that conform to a flexible range of user-specified spatial constraints, as well as text conditioning. To this end, we propose Conditional Motion Diffusion In-betweening (CondMDI), which allows for arbitrary dense-or-sparse keyframe placement and partial keyframe constraints while generating high-quality motions that are diverse and coherent with the given keyframes. We evaluate the performance of CondMDI on the text-conditioned HumanML3D dataset and demonstrate the versatility and efficacy of diffusion models for keyframe in-betweening. We further explore guidance- and imputation-based approaches for inference-time keyframing and compare CondMDI against these methods.
When performing motion keyframing, our goal is to produce realistic motions that adhere to a set of spatio-temporally sparse input keyframes while maintaining coherence between these observed keyframes and the entirety of the generated motion sequence. We use a diffusion model to generate the motions. To condition the model on keyframes, we use a simple and effective approach: replace the noisy input motion with the given keyframes at the observed frames, and concatenate the observation mask to further inform the model of the keyframe locations. The model is then trained on randomly sampled keyframes with randomly sampled joints, together with a mask indicating the observed frames and features. This offers significant flexibility in the number of keyframes and their placement in time, and it supports partial keyframes, i.e., keyframes that specify only a subset of the joints.
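For concreteness, the conditioning step can be sketched roughly as follows in PyTorch. This is an illustrative sketch under our own assumptions, not the exact implementation: the tensor shapes, the 50% feature-observation rate, and the helper names (`sample_observation_mask`, `conditioned_input`) are all hypothetical.

```python
import torch

def sample_observation_mask(n_frames: int, n_feats: int) -> torch.Tensor:
    """Hypothetical training-time mask sampler: pick a random set of
    keyframes and a random subset of joint features (partial keyframes)."""
    n_key = torch.randint(1, n_frames + 1, (1,)).item()
    frame_mask = torch.zeros(n_frames, 1)
    frame_mask[torch.randperm(n_frames)[:n_key]] = 1.0
    # Assumed 50% chance of observing each feature at the keyframes.
    feat_mask = (torch.rand(1, n_feats) < 0.5).float()
    return frame_mask * feat_mask  # (n_frames, n_feats)

def conditioned_input(x_t: torch.Tensor, x_obs: torch.Tensor,
                      mask: torch.Tensor) -> torch.Tensor:
    """Overwrite the noisy motion x_t with the clean keyframe values x_obs
    on observed entries, then concatenate the mask along the feature axis
    so the denoiser also sees where the keyframes are."""
    x_in = mask * x_obs + (1.0 - mask) * x_t
    return torch.cat([x_in, mask.expand_as(x_in)], dim=-1)  # (..., 2 * n_feats)
```

During training, a fresh mask is drawn for every sample, so the model is exposed to dense, sparse, and partial keyframe patterns alike; at inference time, the mask simply encodes the user-specified keyframes.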
Generated motions are high-quality given sparse keyframes placed 5, 10, or 20 frames apart.
Keyframes every 5 frames
Keyframes every 10 frames
Keyframes every 20 frames
Generated results are still high-quality when keyframes are placed more sparsely.
Keyframes every 40 frames
Keyframes randomly placed at frames 35, 40, 80, and 140
CondMDI supports partial keyframes, allowing for control over individual joints.
Conditioned on the root joint trajectory
Conditioned on the trajectories of the head and both wrists
Conditioned on the right wrist trajectory
CondMDI allows text prompts to be used as an additional condition to control the generated motions. As an example, we show the results of conditioning on the same root joint trajectory with different text prompts; a sketch of how the two conditions combine at sampling time follows the prompts below.
"a person is walking"
"a person is dancing"
"a person is exercising"
"a person is waving"
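To illustrate how the two conditions combine, the following hypothetical loop feeds the keyframes through the replacement-and-mask channel from the sketch above and the text prompt through an embedding passed to the denoiser. The DDIM-style update, the x0-predicting denoiser, and all argument names are assumptions made for the example, not our exact sampler.

```python
import torch

@torch.no_grad()
def sample(denoiser, x_obs, mask, text_emb, alphas_cumprod):
    """Hypothetical sampling loop combining keyframe and text conditions.
    `denoiser` is assumed to predict the clean motion x0."""
    x_t = torch.randn_like(x_obs)
    for t in reversed(range(len(alphas_cumprod))):
        x_in = conditioned_input(x_t, x_obs, mask)  # sketch above
        x0_hat = denoiser(x_in, t, text_emb)        # text enters here
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        # Deterministic DDIM-style step from the x0 prediction.
        eps = (x_t - a_t.sqrt() * x0_hat) / (1.0 - a_t).sqrt()
        x_t = a_prev.sqrt() * x0_hat + (1.0 - a_prev).sqrt() * eps
    return x_t
```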
Here, we compare CondMDI with inference-time imputation and reconstruction guidance on the keyframe in-betweening task. With imputation alone, the model effectively ignores the keyframes, and the resulting motions show abrupt jumps at the keyframes. Reconstruction guidance improves on imputation by encouraging coherence between the generated motion and the keyframes, but some jumps at the keyframes remain. CondMDI outperforms both methods, generating high-quality, smooth motions. (Both baselines are sketched in code after the clips below.)
Imputation - stop replacement at denoising step one
Imputation - replacement until last denoising step
Reconstruction Guidance
Ours
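For reference, the two inference-time baselines can be sketched as follows. The helper names (`q_sample` for forward-diffusion noising, an x0-predicting `denoiser`) and the simplified guidance weight are assumptions for the example; the reconstruction-guidance step follows the standard recipe of guiding the clean-motion estimate with the gradient of the masked reconstruction error.

```python
import torch

def imputation_step(x_t, x_obs, mask, t, q_sample):
    """Replacement-based imputation: at each denoising step, overwrite the
    observed entries of the current sample with keyframes noised to the
    same level t. (`q_sample` is a hypothetical forward-diffusion sampler.)"""
    return mask * q_sample(x_obs, t) + (1.0 - mask) * x_t

def guided_x0(denoiser, x_t, x_obs, mask, t, weight=1.0):
    """Reconstruction guidance: pull the predicted clean motion toward the
    keyframes by descending the gradient of the masked reconstruction
    error with respect to the noisy input x_t."""
    x_t = x_t.detach().requires_grad_(True)
    with torch.enable_grad():
        x0_hat = denoiser(x_t, t)
        err = (mask * (x_obs - x0_hat)).pow(2).sum()
    grad, = torch.autograd.grad(err, x_t)
    return x0_hat.detach() - 0.5 * weight * grad
```

The two imputation clips above differ only in whether the replacement is applied through the final denoising step or stopped earlier.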
Root Joint Control. Here, we compare CondMDI against state-of-the-art diffusion-based methods on the root joint control task. Although CondMDI is designed for sparse keyframe in-betweening, it tracks the observed trajectory well and with little foot skating.
MDM
PriorMDM
GMD
OmniControl
Ours
Sparse Keyframe In-betweening. Here, we compare CondMDI against OmniControl on the sparse keyframe in-betweening task. OmniControl exhibits foot skating and weaker adherence to the keyframes than our method.
OmniControl
Ours
OmniControl
Ours