WholeBodyVLA Overview

Overview of WholeBodyVLA. WholeBodyVLA is a humanoid system that runs on the Agibot X2 robot and, for the first time, performs end-to-end humanoid loco–manipulation in large spaces. The system completes consecutive tasks autonomously, including (a-c) basic bimanual grasping, sidestepping toward the box, and squatting to place it; (d-e) squatting to grasp and lift the box, then turning to place it onto the cart; (f-h) grasping the cart handle, pushing the cart forward, and pushing a load of more than 50 kg.

Method Overview

WholeBodyVLA Method

Pipeline of WholeBodyVLA. LAM is pretrained on manipulation and manipulation-aware locomotion videos, yielding unified latent supervision for the VLM. Meanwhile, the LMO RL policy is trained for precise and stable locomotion under disturbances. At runtime, egocentric images and language instructions are encoded by the VLM into latent action tokens, which are decoded (∼10 Hz) into (i) dual-arm joint actions and (ii) locomotion commands executed by LMO at 50 Hz, enabling robust whole-body loco–manipulation.
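The two rates in the pipeline can be illustrated with a dual-rate control loop: a slow high-level policy refreshes the command roughly every fifth tick, and a fast low-level controller reuses the most recent command in between. The sketch below is only an illustration under stated assumptions. The names `Command`, `run_dual_rate`, `fake_high_level`, and `fake_low_level` are hypothetical, the stand-in policies do no real inference, and the actual system may schedule the two loops differently (for example, asynchronously).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

HIGH_LEVEL_HZ = 10  # approximate decoding rate stated in the text
LOW_LEVEL_HZ = 50   # LMO execution rate stated in the text
TICKS_PER_DECODE = LOW_LEVEL_HZ // HIGH_LEVEL_HZ  # low-level ticks per decode


@dataclass
class Command:
    """Hypothetical container for one decoded high-level output."""
    arm_joints: Tuple[float, ...]  # placeholder dual-arm joint targets
    locomotion: str                # placeholder locomotion command label


def run_dual_rate(
    high_level: Callable[[int], Command],
    low_level: Callable[[int, Command], str],
    num_ticks: int,
) -> List[str]:
    """Run num_ticks low-level steps, refreshing the command on a fixed stride."""
    log: List[str] = []
    command: Optional[Command] = None
    for tick in range(num_ticks):
        # Refresh the command only on high-level ticks (every 5th tick here).
        if tick % TICKS_PER_DECODE == 0:
            command = high_level(tick)
        assert command is not None  # tick 0 always triggers a decode
        # The low-level controller runs every tick with the latest command.
        log.append(low_level(tick, command))
    return log


def fake_high_level(tick: int) -> Command:
    # Stand-in for VLM encoding plus action decoding (assumption, no real model).
    return Command(arm_joints=(0.0, 0.0), locomotion=f"cmd@{tick}")


def fake_low_level(tick: int, command: Command) -> str:
    # Stand-in for the locomotion policy; it just reports the command it tracks.
    return command.locomotion


trace = run_dual_rate(fake_high_level, fake_low_level, num_ticks=10)
print(trace)
```

In this toy run each decoded command is held for five consecutive low-level ticks, which is the ratio implied by the 10 Hz and 50 Hz figures.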

Real-world Demos

WholeBodyVLA Performance in Complex Tasks

Task 1: Bag Packing

Our Success Cases

WholeBodyVLA (ours)
WholeBodyVLA under visual variation

Failure Cases of Baseline Methods

❌ Stumble to stop
❌ Lose balance and kick the box

Task 2: Box Loading

Our Success Cases

WholeBodyVLA (ours)
WholeBodyVLA with an unseen object

Failure Cases of Baseline Methods

❌ Stumble to stop
❌ Lose balance and deviate greatly from the intended direction

Task 3: Cart Pushing

Our Success Cases

WholeBodyVLA (ours)
WholeBodyVLA under an unseen heavy load

Failure Cases of Baseline Methods

❌ Deviate from the right direction
❌ Stop too late

Experiments on Robot Generalization and Capability Showcases

Adaptability & Scalability

Generalization Experiments

1. Object Generalization

Demonstrate WholeBodyVLA's robustness to variations in object appearance and position, scene layout, and table color, in response to Reviewer C1fR (W2 Q2 S2) and Reviewer 76f3 (W1).

2. Start-Pose Generalization

Showcase WholeBodyVLA's ability to compose forward walking, sidestepping, turning, and squatting to handle diverse start poses (X/Y offsets, orientations, and table heights), in response to Reviewer C1fR (W2 Q2 S2) and Reviewer N1iB (W1 Q1).

Distance X-Axis

X-axis Distance Generalization Experiment 1
X-axis Distance Generalization Experiment 2 (w/ unseen table color)

Distance Y-Axis

Y-axis Distance Generalization Experiment 1
Y-axis Distance Generalization Experiment 2 (w/ unseen table color)

Orientation

Orientation Generalization Experiment 1
Orientation Generalization Experiment 2 (w/ unseen table color)

Height

Height Generalization Experiment

3. Terrain Generalization

Demonstrate WholeBodyVLA's ability to traverse uneven terrain, in response to Reviewer C1fR (W2 Q2 S2), Reviewer N1iB (Q4), and Reviewer 1asV (W2).

Long-Horizon Bimanual Manipulation

Demonstrate WholeBodyVLA's competence on long-horizon sequences that involve loco-manipulation and coordinated whole-body actions, in response to Reviewer C1fR (W2 Q2 S2), Reviewer N1iB (Q4), and Reviewer 1asV (W2 W3).

Long-Horizon Bimanual Manipulation with Coordination

What's More

Showcase WholeBodyVLA's scalability to more complex everyday loco-manipulation tasks (e.g., wiping and vacuum cleaning), in response to Reviewer C1fR (W2 Q2 S2), Reviewer N1iB (Q4), Reviewer 76f3 (W1 W2), and Reviewer 1asV (W2).