In this paper, we introduce a vector-quantization-based action tokenizer built on the largest-scale action trajectory dataset to date, leveraging over 100 times more data than previous approaches. This extensive dataset enables our tokenizer to capture rich spatiotemporal dynamics, yielding a model that not only accelerates inference but also produces smoother and more coherent action outputs. Once trained, the tokenizer can be seamlessly adapted to a wide range of downstream tasks in a zero-shot manner, from short-horizon reactive behaviors to long-horizon planning. A key finding of our work is that the domain gap between synthetic and real action trajectories is marginal, allowing us to leverage a vast amount of synthetic data during training without compromising real-world performance. To validate our approach, we conducted extensive experiments in both simulated environments and on real robotic platforms. The results demonstrate that as the volume of synthetic trajectory data increases, the performance of our tokenizer on downstream tasks improves significantly, most notably achieving up to a 30% higher success rate on two real-world long-horizon tasks. These findings highlight the potential of our action tokenizer as a robust and scalable solution for real-time embodied intelligence systems, paving the way for more efficient and reliable robotic control across diverse application domains.
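To make the core operation concrete, the snippet below sketches the vector-quantization step of such an action tokenizer: each continuous latent from an encoded action chunk is snapped to its nearest entry in a learned codebook, producing discrete action tokens. The codebook size (512) and latent dimension (16) are illustrative assumptions, not the paper's actual hyperparameters, and the function is a minimal NumPy stand-in for the trained VQ-VAE.

```python
import numpy as np

def quantize(latents: np.ndarray, codebook: np.ndarray):
    """Map each latent vector to its nearest codebook entry.

    latents:  (T, D) encoder outputs for one action chunk
    codebook: (K, D) learned embedding table
    returns:  (token indices of shape (T,), quantized vectors of shape (T, D))
    """
    # Squared Euclidean distance between every latent and every code.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = dists.argmin(axis=1)     # discrete action tokens
    return idx, codebook[idx]      # tokens + their embeddings

# Illustrative sizes only (assumed, not from the paper):
rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 16))  # K=512 codes, D=16
latents = rng.normal(size=(8, 16))     # 8 latents from one chunk
tokens, quantized = quantize(latents, codebook)
```

In a full VQ-VAE these tokens would be fed to the VLA backbone, while training uses a straight-through estimator plus codebook and commitment losses to keep the encoder and codebook aligned.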
For real-robot experiments, we use a Franka Research 3 arm with a fixed RealSense D435 camera to capture environmental observations. We evaluate on six manipulation tasks (four short-horizon, two long-horizon) that span varying levels of task complexity. For each task, we collect 50 demonstrations and evaluate performance over 20 trials.
Adding synthetic trajectories substantially boosts the action tokenizer's performance, raising the average success rate from 23% to 46.25%, with notable gains on tasks demanding precision and dynamic control, such as "Flip the pot upright" (+30%) and "Pull out a tissue paper" (from 5% to over 20%). While the limited LIBERO dataset had minimal impact on short-horizon tasks, the much larger ManiSkill dataset led to substantial improvements. On long-horizon tasks, VQ-VLA, especially the VQO+L+M variant, dramatically outperforms the baselines, achieving success rates of 50% and 30% on complex tasks where the baselines struggle. We attribute this largely to the VQ-VAE's ability to predict multiple actions per inference step, which reduces error accumulation and improves efficiency in long, sequential tasks.
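The error-accumulation argument can be sketched with simple arithmetic: decoding several actions per inference means fewer slow model calls (and fewer opportunities for compounding prediction errors) over a fixed horizon. The chunk size of 5 below mirrors the compression ratio reported later; the counting function itself is illustrative, not part of the model.

```python
def rollout_inferences(horizon: int, actions_per_inference: int) -> int:
    """Number of model inference calls needed to execute `horizon` actions,
    when each call emits a chunk of `actions_per_inference` actions."""
    # Ceiling division: a final partial chunk still costs one inference.
    return -(-horizon // actions_per_inference)

# For a hypothetical 100-step long-horizon task:
baseline_calls = rollout_inferences(100, 1)  # one action per call
chunked_calls = rollout_inferences(100, 5)   # 5 actions per call (ratio 5)
```

With a compression ratio of 5, a 100-step task drops from 100 autoregressive inference calls to 20, so each rollout exposes the policy to far fewer prediction-feedback cycles.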
During the real-world experiments, we measured the action execution frequency of VQ-VLA and compared it with the original OpenVLA. As shown in the table, with a VQ-VAE compression ratio of 5, inference is nearly three times faster. This improvement substantially facilitates real-time performance in practical applications.
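A back-of-envelope model shows why a compression ratio of 5 yields roughly a 3x (rather than 5x) speedup: each forward pass now emits 5 actions, but per-pass latency also grows somewhat (e.g. from token decoding). The latencies below are assumed for illustration only; they are not the paper's measurements.

```python
def control_hz(t_pass_s: float, actions_per_pass: int) -> float:
    """Effective action execution frequency: actions emitted per second
    when one model forward pass takes `t_pass_s` seconds."""
    return actions_per_pass / t_pass_s

# Hypothetical latencies (assumptions, not measured values):
baseline_hz = control_hz(0.25, 1)  # one action per pass
vq_hz = control_hz(0.40, 5)        # 5 actions per (somewhat slower) pass
speedup = vq_hz / baseline_hz      # ~3x, consistent with the reported trend
```

Under these assumed numbers the effective frequency rises from 4 Hz to 12.5 Hz, illustrating how chunked decoding can triple throughput even when the per-pass cost increases.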