VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model
Paper: arXiv:2509.09372
VLA-Adapter-Lite is a small, trainable BridgeAttention policy head that maps vision + language + state → action for the NVIDIA GR00T Teleop G1 humanoid dataset (43-D state and actions).
The vision (SigLIP) and language (Qwen) backbones are frozen; only this adapter is trained.
This repo contains only the policy-head weights and code. At inference, you load the frozen backbones from their own model hubs.

Repository contents:
- adapter.pt / adapter.safetensors – PyTorch state dict for the policy head
- policy_definition.py – the BridgeAttentionPolicy class
- config.json – dimensions & training config (base-model IDs, dims, etc.)

Backbones (frozen at inference & training):
- google/siglip-base-patch16-224
- Qwen/Qwen2.5-0.5B-Instruct

Target (GR00T G1):
- nvidia/PhysicalAI-Robotics-GR00T-Teleop-G1
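The real policy head ships in policy_definition.py. The sketch below is not that implementation; it only illustrates the interface and tensor shapes the quick-start example assumes (a list of the last four vision hidden-state tensors, a list of the last four language hidden-state tensors, and a [B, 43] state vector mapped to a [B, 43] action). Every layer choice here is an assumption for illustration.

```python
# Illustrative stand-in for the interface only; use policy_definition.BridgeAttentionPolicy in practice.
import torch
import torch.nn as nn

class BridgeAttentionPolicySketch(nn.Module):
    """Maps multi-layer vision/text features plus proprioceptive state to one action vector."""
    def __init__(self, v_hidden, t_hidden, state_dim, policy_dim,
                 n_heads, n_layers, n_queries, action_dim, dropout=0.0):
        super().__init__()
        self.v_proj = nn.Linear(v_hidden, policy_dim)   # project vision tokens
        self.t_proj = nn.Linear(t_hidden, policy_dim)   # project language tokens
        self.s_proj = nn.Linear(state_dim, policy_dim)  # project robot state
        self.queries = nn.Parameter(torch.randn(n_queries, policy_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(policy_dim, n_heads, dim_feedforward=4 * policy_dim,
                                           dropout=dropout, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)  # queries cross-attend to fused tokens
        self.head = nn.Linear(policy_dim, action_dim)

    def forward(self, v_feats_layers, t_feats_layers, state):
        # v_feats_layers: list of [B, N_patches, v_hidden]; t_feats_layers: list of [B, T, t_hidden]
        v = torch.cat([self.v_proj(x) for x in v_feats_layers], dim=1)
        t = torch.cat([self.t_proj(x) for x in t_feats_layers], dim=1)
        s = self.s_proj(state).unsqueeze(1)                     # [B, 1, policy_dim]
        memory = torch.cat([v, t, s], dim=1)                    # fused conditioning tokens
        q = self.queries.unsqueeze(0).expand(state.size(0), -1, -1)
        out = self.decoder(q, memory)                           # [B, n_queries, policy_dim]
        return self.head(out.mean(dim=1))                       # [B, action_dim]
```

The full quick-start example below loads the real head and runs one forward pass.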
```python
import json
import torch
from PIL import Image
from transformers import SiglipVisionModel, SiglipImageProcessor, AutoTokenizer, AutoModelForCausalLM
from policy_definition import BridgeAttentionPolicy

# ---- Load config & frozen backbones ----
with open("config.json") as f:
    cfg = json.load(f)
vision_model_id = cfg["vision_model_id"]
text_model_id = cfg["text_model_id"]

image_processor = SiglipImageProcessor.from_pretrained(vision_model_id)
vision = SiglipVisionModel.from_pretrained(vision_model_id, output_hidden_states=True).eval()

tokenizer = AutoTokenizer.from_pretrained(text_model_id, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
text = AutoModelForCausalLM.from_pretrained(text_model_id, output_hidden_states=True).eval()

# ---- Build policy head and load weights ----
v_hidden = vision.config.hidden_size
t_hidden = text.config.hidden_size
policy = BridgeAttentionPolicy(
    v_hidden=v_hidden, t_hidden=t_hidden,
    state_dim=cfg["state_dim"], policy_dim=cfg["policy_dim"],
    n_heads=cfg["n_heads"], n_layers=cfg["policy_layers"],
    n_queries=cfg["num_action_queries"], action_dim=cfg["action_dim"],
    dropout=cfg["dropout"],
).eval()
sd = torch.load("adapter.pt", map_location="cpu")
policy.load_state_dict(sd, strict=True)

# ---- Example forward (single sample) ----
instruction = "Pick the apple from the table and place it into the basket."
state = torch.zeros(1, cfg["state_dim"])  # shape [1, 43]; replace with real proprioception

# Vision: last 4 hidden states (drop the first token), as a list of tensors
img = Image.new("RGB", (256, 256), color=(200, 230, 255))  # replace with a real camera frame
v_inputs = image_processor(images=[img], return_tensors="pt")
with torch.no_grad():
    v_out = vision(**v_inputs, output_hidden_states=True)
v_feats_layers = [t[:, 1:, :].contiguous() if t.shape[1] >= 2 else t.contiguous()
                  for t in v_out.hidden_states[-4:]]

# Language: last 4 hidden states
t_inputs = tokenizer([instruction], return_tensors="pt", padding=True, truncation=True, max_length=64)
with torch.no_grad():
    t_out = text(**t_inputs, output_hidden_states=True)
t_feats_layers = [t.contiguous() for t in t_out.hidden_states[-4:]]

with torch.no_grad():
    action = policy(v_feats_layers, t_feats_layers, state)  # [1, 43]
print("Pred action:", action.shape)
```
Evaluation: nvidia/PhysicalAI-Robotics-GR00T-Teleop-G1 (768 frames total)
Overall per-joint-group error
| Segment | MSE | MAE |
|---|---|---|
| left_leg | 0.0040 | 0.049 |
| right_leg | 0.0055 | 0.047 |
| waist | 0.0002 | 0.013 |
| left_arm | 0.0455 | 0.157 |
| left_hand | 0.1253 | 0.156 |
| right_arm | 0.0878 | 0.184 |
| right_hand | 0.1154 | 0.143 |
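The per-segment numbers are plain MSE/MAE over held-out frames. A minimal sketch of that computation, assuming predictions and ground truth as [N, 43] tensors; the index ranges below are placeholders for illustration only, since the real joint ordering comes from the dataset's modality configuration:

```python
import torch

def segment_errors(pred, target, segments):
    """pred, target: [N, 43] action tensors; segments: name -> slice of action indices."""
    out = {}
    for name, idx in segments.items():
        diff = pred[:, idx] - target[:, idx]
        out[name] = {"MSE": diff.pow(2).mean().item(), "MAE": diff.abs().mean().item()}
    return out

# Placeholder index ranges for illustration; take the real split from the dataset's modality config.
segments = {"left_leg": slice(0, 6), "right_leg": slice(6, 12), "waist": slice(12, 15),
            "left_arm": slice(15, 22), "right_arm": slice(22, 29),
            "left_hand": slice(29, 36), "right_hand": slice(36, 43)}

pred, target = torch.randn(768, 43), torch.randn(768, 43)  # replace with real rollouts and labels
print(segment_errors(pred, target, segments))
```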
| Dataset | Samples | MSE | MAE | Arms MSE | Hands MSE |
|---|---|---|---|---|---|
| g1-pick-apple | 192 | 0.0399 | 0.087 | 0.0362 | 0.0850 |
| g1-pick-pear | 192 | 0.0817 | 0.146 | 0.0645 | 0.1808 |
| g1-pick-grapes | 192 | 0.0801 | 0.136 | 0.1249 | 0.1175 |
| g1-pick-starfruit | 192 | 0.0473 | 0.105 | 0.0411 | 0.0981 |
g1-pick-apple segment error
| Segment | MSE | MAE |
|---|---|---|
| left_leg | 0.0011 | 0.027 |
| right_leg | 0.0016 | 0.028 |
| waist | 0.0002 | 0.012 |
| left_arm | 0.0610 | 0.177 |
| left_hand | 0.1664 | 0.202 |
| right_arm | 0.0113 | 0.083 |
| right_hand | 0.0037 | 0.020 |
g1-pick-pear segment error
| Segment | MSE | MAE |
|---|---|---|
| left_leg | 0.0069 | 0.071 |
| right_leg | 0.0061 | 0.057 |
| waist | 0.0001 | 0.010 |
| left_arm | 0.0374 | 0.153 |
| left_hand | 0.1331 | 0.165 |
| right_arm | 0.0915 | 0.203 |
| right_hand | 0.2285 | 0.262 |
g1-pick-grapes segment error
| Segment | MSE | MAE |
|---|---|---|
| left_leg | 0.0030 | 0.045 |
| right_leg | 0.0052 | 0.045 |
| waist | 0.0002 | 0.012 |
| left_arm | 0.0251 | 0.123 |
| left_hand | 0.0058 | 0.022 |
| right_arm | 0.2246 | 0.335 |
| right_hand | 0.2292 | 0.273 |
g1-pick-starfruit segment error
| Segment | MSE | MAE |
|---|---|---|
| left_leg | 0.0051 | 0.053 |
| right_leg | 0.0092 | 0.058 |
| waist | 0.0004 | 0.019 |
| left_arm | 0.0584 | 0.177 |
| left_hand | 0.1959 | 0.235 |
| right_arm | 0.0238 | 0.114 |
| right_hand | 0.0003 | 0.014 |
Citations:
- Core: VLA-Adapter, OFT, OpenVLA
- Backbones & Dataset: SigLIP, Qwen2.5, PhysicalAI-Robotics-GR00T-Teleop-G1
- Related Benchmarks / Corpora: LIBERO, BridgeData V2
@article{wang2025vlaadapter,
title={VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model},
author={Wang, Yihao and Ding, Pengxiang and Li, Lingxiao and Cui, Can and Ge, Zirui and Tong, Xinyang and Song, Wenxuan and Zhao, Han and Zhao, Wei and Hou, Pengxu and Huang, Siteng and Tang, Yifan and Wang, Wenhui and Zhang, Ru and Liu, Jianyi and Wang, Donglin},
journal={arXiv preprint arXiv:2509.09372},
year={2025}
}
@article{kim2025oft,
title={Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success},
author={Kim, Moo Jin and Finn, Chelsea and Liang, Percy},
journal={arXiv preprint arXiv:2502.19645},
year={2025}
}
@article{kim2024openvla,
title={OpenVLA: An Open-Source Vision-Language-Action Model},
author={Kim, Moo Jin and others},
journal={arXiv preprint arXiv:2406.09246},
year={2024}
}
@article{zhai2023siglip,
title={Sigmoid Loss for Language-Image Pre-Training},
author={Zhai, Xiaohua and others},
journal={arXiv preprint arXiv:2303.15343},
year={2023}
}
@article{yang2024qwen25,
title={Qwen2.5 Technical Report},
author={Yang, An and others},
journal={arXiv preprint arXiv:2412.15115},
year={2024}
}
@dataset{nvidia2025gr00t,
title={PhysicalAI-Robotics-GR00T-Teleop-G1},
author={NVIDIA Physical AI},
year={2025},
howpublished={Hugging Face dataset card},
url={https://huggingface.co/datasets/nvidia/PhysicalAI-Robotics-GR00T-Teleop-G1}
}
@article{liu2023libero,
title={LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning},
  author={Liu, Bo and others},
journal={arXiv preprint arXiv:2306.03310},
year={2023}
}
@article{walke2023bridgedatav2,
title={BridgeData V2: A Dataset for Robot Learning at Scale},
author={Walke, Homer and others},
journal={arXiv preprint arXiv:2308.12952},
year={2023}
}