1. Overview.
Denoising-based models, such as diffusion and flow matching, have become a critical component of robotic manipulation thanks to their strong distribution-fitting ability and scaling capacity. Concurrently, several works have demonstrated that simple learning objectives, such as L1 regression, can match denoising-based methods on certain tasks while offering faster convergence and inference. In this paper, we focus on combining the advantages of these two paradigms: retaining the ability of denoising models to capture multi-modal distributions and avoid mode collapse, while achieving the efficiency of the L1 regression objective.
To this end, we reformulate the original v-prediction flow matching into sample prediction with an L1 training objective. We empirically show that multi-modality can be expressed within a single ODE step. Accordingly, we propose L1 Flow, a two-step sampling schedule that first generates a suboptimal action sequence via a single integration step and then reconstructs the precise action sequence through a single direct prediction (see the worked equations below). The proposed method largely retains the advantages of flow matching while reducing the number of neural function evaluations to just two, and it mitigates the performance degradation associated with direct sample regression.
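To make the schedule concrete, here is a worked version of the two steps, writing \(f_\theta(x_t,t)\) for the sample-prediction network derived in Sec. 3. The intermediate time \(s\in(0,1)\) is our notation for illustration; the text above does not fix where the integration step lands. Starting from noise \(x_0\), one Euler step of the sample-prediction ODE (whose velocity at \(t=0\) is \(f_\theta(x_0,0)-x_0\)) reaches time \(s\), after which the model's prediction is taken directly as the output:
\begin{equation*}
x_s = x_0 + s\,\big(f_\theta(x_0,0)-x_0\big) \;\;\text{(one integration step)},\qquad
\hat{x}_1 = f_\theta(x_s,s) \;\;\text{(one direct prediction)}.
\end{equation*}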
We evaluate our method against a range of baselines and benchmarks, including 8 tasks in MimicGen, 5 tasks across RoboMimic and the PushT benchmark, and one real-world task. The results demonstrate the advantages of the proposed method in training efficiency, inference speed, and overall performance.
2. One-step integration effectively captures multi-modality.
3. Pseudocode.
We consider a variant of flow matching in which the model directly predicts the terminal sample rather than the instantaneous velocity. With the linear interpolation path \(x_t=(1-t)\,x_0+t\,x_1\), standard flow matching learns the velocity field \(\frac{dx_t}{dt}=x_1-x_0\). Instead, our model \(f_\theta(x_t,t)\) predicts the corresponding terminal sample \(x_1\) conditioned on the intermediate state \(x_t\) and time \(t\):
\begin{equation}
f_\theta(x_t,t)\approx x_1.
\end{equation}
The instantaneous velocity can then be implicitly recovered as
\begin{equation}
v=x_1-x_0=\frac{f_\theta(x_t,t)-x_t}{1-t},
\end{equation}
which yields a sample-prediction flow whose dynamics are defined by the ODE
\begin{equation}
\frac{dx_t}{dt}=\frac{f_\theta(x_t,t)-x_t}{1-t}.
\end{equation}
At training time, the model is optimized to minimize the discrepancy between the predicted terminal sample and the true target sample \(x_1\). We employ an L1 loss to supervise the target samples:
\begin{equation}
\mathcal{L}=\mathbb{E}_{x_0\sim\mathcal{N}(0,I),\,x_1\sim p_{\mathrm{data}},\,t}\;\big\|f_\theta(x_t,t)-x_1\big\|_1.
\label{eq:x_loss_L1}
\end{equation}
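For concreteness, a minimal PyTorch-style sketch of this training objective follows. The model signature, batch conventions, and the uniform time schedule are illustrative assumptions rather than the paper's exact implementation:

```python
import torch

def l1_flow_loss(model, x1):
    """One training step of the sample-prediction L1 objective.

    model: callable (x_t, t) -> predicted terminal sample  [assumed signature]
    x1:    batch of ground-truth action sequences, shape (B, ...)
    """
    x0 = torch.randn_like(x1)                      # noise sample x_0 ~ N(0, I)
    t = torch.rand(x1.shape[0], device=x1.device)  # t ~ U(0, 1) [assumed schedule]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast t over sample dims
    xt = (1 - t_) * x0 + t_ * x1                   # interpolation x_t = (1-t) x_0 + t x_1
    pred = model(xt, t)                            # sample prediction f_theta(x_t, t)
    return (pred - x1).abs().mean()                # L1 loss against the terminal sample
```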
During inference, we introduce a two-step denoising schedule that combines continuous flow integration with direct sample prediction; details are given in the pseudocode.
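The pseudocode itself is not reproduced in this excerpt. As a stand-in, the following is a minimal PyTorch-style sketch of what the two-step schedule could look like; the function name, model signature, and the intermediate time `s` are illustrative assumptions, not the paper's exact procedure:

```python
import torch

@torch.no_grad()
def l1_flow_sample(model, shape, s=0.5, device="cpu"):
    """Two-step sampling: one Euler step of the sample-prediction ODE,
    then one direct sample prediction (two NFEs in total).

    s: intermediate time reached by the Euler step [assumed value]
    """
    x0 = torch.randn(shape, device=device)      # start from Gaussian noise at t = 0
    t0 = torch.zeros(shape[0], device=device)
    # Step 1: Euler step of dx/dt = (f_theta(x_t, t) - x_t) / (1 - t) from t=0 to t=s;
    # at t = 0 this velocity reduces to f_theta(x_0, 0) - x_0.
    v = model(x0, t0) - x0
    xs = x0 + s * v                             # suboptimal sample at t = s
    # Step 2: reconstruct the final action sequence with a single direct prediction.
    ts = torch.full((shape[0],), s, device=device)
    return model(xs, ts)                        # x_1_hat = f_theta(x_s, s)
```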
4. Experiments in MimicGen.
We first compare the proposed two-step sampling strategy with two standard denoising-based methods, i.e., DDPM (100 steps) and Flow Matching (10 steps), as well as L1 regression, on 8 tasks of the MimicGen benchmark.
5. Experiments in RoboMimic.
Then, as an efficiency-oriented paradigm, L1 Flow is compared with distillation-based methods, i.e., Consistency Policy (CP) and OneDP, on 5 tasks in total from RoboMimic and the PushT benchmark.
6. Citation.
@misc{song2025l1sampleflowefficient,
title={L1 Sample Flow for Efficient Visuomotor Learning},
author={Weixi Song and Zhetao Chen and Tao Xu and Xianchao Zeng and Xinyu Zhou and Lixin Yang and Donglin Wang and Cewu Lu and Yong-Lu Li},
year={2025},
eprint={2511.17898},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2511.17898},
}