Diffusion Trajectory-guided Policy for Long-horizon Robot Manipulation

Shichao Fan1, Quantao Yang4, Yajie Liu2, Kun Wu3, Zhengping Che3, Qingjie Liu2,*, Min Wan1
1School of Mechanical Engineering and Automation, Beihang University, China
2School of Computer Science and Engineering, Beihang University, China
3Beijing Innovation Center of Humanoid Robotics, China
4Division of Robotics, Perception and Learning (RPL), KTH Royal Institute of Technology, Sweden

Accepted by IEEE Robotics and Automation Letters (RA-L), 2025

*Indicates the corresponding author
Code (coming soon) · arXiv

Overview

Teaser image

System overview. (a) and (b) present a task instruction together with the initial task observation, from which our Diffusion Trajectory Model predicts the complete future 2D particle trajectories; (c) illustrates the Diffusion Trajectory-guided pipeline, showing how the predicted trajectories guide the manipulation policy.

Abstract

Vision-Language-Action (VLA) models have recently advanced robot imitation learning, but high data collection costs and limited demonstrations hinder generalization, and current imitation learning methods struggle in out-of-distribution scenarios, especially for long-horizon tasks. A key challenge is mitigating compounding errors in imitation learning, which lead to cascading failures over extended trajectories. To address these challenges, we propose the Diffusion Trajectory-guided Policy (DTP) framework, which generates 2D particle trajectories with a diffusion model to guide policy learning for long-horizon tasks. By leveraging task-relevant trajectories, DTP provides trajectory-level guidance that reduces error accumulation. Our two-stage approach first trains a generative vision-language model to produce diffusion-based trajectories and then uses them to refine the imitation policy. Experiments on the CALVIN benchmark show that DTP outperforms state-of-the-art baselines by 25% in success rate, starting from scratch without external pretraining. Moreover, DTP significantly improves real-world robot performance.

Method

Architecture figure

System architecture for learning language-conditioned policies. (a) shows the input modalities: vision, language, and proprioception. (b) describes the Diffusion Trajectory Model, detailing how vision and language inputs generate diffusion particle trajectories. (c) explains how these trajectories guide the training of the robot policy, focusing on the learning of the Diffusion Trajectory-guided Policy. The masked learnable tokens represent the particle-trajectory prediction token, the action token, and the video prediction token, respectively.
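
To make the two-stage recipe concrete, the sketch below outlines both training objectives in PyTorch. Everything in it is an illustrative assumption made for exposition: the module names (DiffusionTrajectoryModel, TrajectoryGuidedPolicy), the MLP denoiser, the toy cosine noise schedule, and all dimensions are placeholders, not the released implementation (code is coming soon).

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative dimensions (assumptions, not the paper's hyperparameters).
HORIZON, N_PARTICLES, COND_DIM, ACT_DIM = 16, 8, 512, 7
TRAJ_DIM = HORIZON * N_PARTICLES * 2  # flattened (x, y) particle positions

class DiffusionTrajectoryModel(nn.Module):
    """Stage 1 (assumed interface): predict the noise added to 2D particle
    trajectories, conditioned on fused vision-language features."""
    def __init__(self):
        super().__init__()
        self.denoiser = nn.Sequential(
            nn.Linear(TRAJ_DIM + COND_DIM + 1, 512), nn.ReLU(),
            nn.Linear(512, TRAJ_DIM),
        )

    def forward(self, noisy_traj, t, cond):
        # noisy_traj: (B, HORIZON, N_PARTICLES, 2); t: (B,); cond: (B, COND_DIM)
        x = torch.cat([noisy_traj.flatten(1), cond, t.float().unsqueeze(1)], dim=-1)
        return self.denoiser(x).view_as(noisy_traj)

class TrajectoryGuidedPolicy(nn.Module):
    """Stage 2 (assumed interface): map observation features plus a guiding
    trajectory to a robot action (behavior cloning)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(COND_DIM + TRAJ_DIM, 512), nn.ReLU(),
            nn.Linear(512, ACT_DIM),
        )

    def forward(self, obs_feat, traj):
        return self.net(torch.cat([obs_feat, traj.flatten(1)], dim=-1))

B, T_MAX = 4, 1000
clean_traj = torch.randn(B, HORIZON, N_PARTICLES, 2)  # ground-truth 2D tracks
cond = torch.randn(B, COND_DIM)                       # vision-language features
obs_feat = torch.randn(B, COND_DIM)                   # per-step observation features
expert_action = torch.randn(B, ACT_DIM)               # demonstrated action

# Stage 1: standard DDPM-style noise-prediction loss on the trajectories.
dtm = DiffusionTrajectoryModel()
t = torch.randint(0, T_MAX, (B,))
alpha_bar = (torch.cos(t.float() / T_MAX * torch.pi / 2) ** 2).view(B, 1, 1, 1)
noise = torch.randn_like(clean_traj)
noisy_traj = alpha_bar.sqrt() * clean_traj + (1 - alpha_bar).sqrt() * noise
loss_stage1 = F.mse_loss(dtm(noisy_traj, t, cond), noise)

# Stage 2: behavior cloning guided by trajectories. During training the
# ground-truth tracks can serve as guidance; at test time the DTM's
# sampled trajectories take their place.
policy = TrajectoryGuidedPolicy()
loss_stage2 = F.mse_loss(policy(obs_feat, clean_traj), expert_action)

The point worth noting in this sketch is that Stage 2 treats the trajectory purely as conditioning, so any trajectory source, ground truth during training or DTM samples at test time, can plug into the same policy interface.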

Experiments Setup

Shown here are examples of the hardware setups used for data collection, which are also the setups used in our real-world experiments. For the Franka Emika Panda robot, cameras at the top, left, and right viewpoints record visual information along the task trajectories. For the AgileX and Tien Kung robots, we use their built-in cameras. For the UR robots, we use an external top-mounted camera. All demonstrations are collected through high-quality human teleoperation and stored on a unified intelligence platform.

Real-world Long-horizon Experiments

Our method's performance on a long-horizon sequence (divided into five subtasks). The small window on the left side of the video shows the diffusion trajectory view produced in Stage 1 by the DTM (Diffusion Trajectory Model); the overall procedure is executed by the Stage 2 DTP (Diffusion Trajectory-guided Policy).
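
As a rough illustration of this Stage 1 / Stage 2 orchestration, the snippet below sketches one episode: the DTM first samples a guiding trajectory by reverse diffusion, then the policy is rolled out conditioned on it. The toy sampler, the env and encode_obs interfaces, and the step counts are hypothetical placeholders consistent with the training sketch above, not the authors' code.

@torch.no_grad()
def run_episode(dtm, policy, encode_obs, env, max_steps=300, T=50):
    """Hypothetical rollout: Stage 1 (DTM sampling), then Stage 2 (DTP).
    `env` and `encode_obs` are assumed interfaces, not from the paper."""
    obs = env.reset()
    cond = encode_obs(obs)  # (1, COND_DIM) vision-language features

    # Stage 1: toy DDPM-style reverse diffusion over 2D particle trajectories.
    traj = torch.randn(1, HORIZON, N_PARTICLES, 2)
    for t in reversed(range(T)):
        tt = torch.full((1,), t)
        eps = dtm(traj, tt, cond)
        a_bar = (torch.cos(tt.float() / T * torch.pi / 2) ** 2).clamp(min=1e-4)
        traj = (traj - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()  # x0 estimate
        if t > 0:  # re-noise the x0 estimate to the previous noise level
            a_prev = torch.cos(torch.tensor([(t - 1) / T]) * torch.pi / 2) ** 2
            traj = a_prev.sqrt() * traj + (1 - a_prev).sqrt() * torch.randn_like(traj)

    # Stage 2: closed-loop execution conditioned on the sampled trajectory.
    for _ in range(max_steps):
        action = policy(encode_obs(obs), traj)
        obs, done = env.step(action)
        if done:
            break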

Video Presentation

BibTeX

@article{fan2025diffusion,
  title={Diffusion trajectory-guided policy for long-horizon robot manipulation},
  author={Fan, Shichao and Yang, Quantao and Liu, Yajie and Wu, Kun and Che, Zhengping and Liu, Qingjie and Wan, Min},
  journal={IEEE Robotics and Automation Letters},
  year={2025}
}