SpaRRTA Model README

This document explains the probe models used by the SpaRRTA Space demo and how they are organized on the Hugging Face Model Hub.

What these models are

SpaRRTA uses:

A frozen DINOv3 vision backbone loaded through timm (identified by hf_model_id in the manifest).
A lightweight learned probe head (EfficientProbing), with one head per combination of:
- backbone,
- perspective (camera or human),
- triplet (for example bridge_2_trashbin_bike).

The probe head predicts one of 4 classes:

Front
Back
Left
Right

Available backbones

The available backbones, as listed in space/artifacts/manifest.yaml:

vit_small_patch16_dinov3.lvd1689m (expected_feat_dim: 384)
vit_base_patch16_dinov3.lvd1689m (expected_feat_dim: 768)
vit_large_patch16_dinov3.lvd1689m (expected_feat_dim: 1024)
vit_huge_plus_patch16_dinov3.lvd1689m (expected_feat_dim: 1280)
vit_7b_patch16_dinov3.lvd1689m (expected_feat_dim: 4096, experimental)
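
Presumably expected_feat_dim is the feature width that the probe heads for each backbone consume, so a mismatch points to a wrong backbone/checkpoint pairing. The sketch below shows a hypothetical guard, check_feature_dim, that could run before a head is loaded. The dictionary copies the values listed above; in practice they would be read from the manifest rather than hard-coded.

```python
from typing import Dict

# expected_feat_dim values as listed above (from space/artifacts/manifest.yaml).
EXPECTED_FEAT_DIM: Dict[str, int] = {
    "vit_small_patch16_dinov3.lvd1689m": 384,
    "vit_base_patch16_dinov3.lvd1689m": 768,
    "vit_large_patch16_dinov3.lvd1689m": 1024,
    "vit_huge_plus_patch16_dinov3.lvd1689m": 1280,
    "vit_7b_patch16_dinov3.lvd1689m": 4096,  # experimental
}


def check_feature_dim(model_key: str, feat_dim: int) -> None:
    """Raise if a backbone's feature width differs from what its probe heads expect.

    Hypothetical helper for illustration; not part of the SpaRRTA app.
    """
    try:
        expected = EXPECTED_FEAT_DIM[model_key]
    except KeyError:
        raise KeyError(f"unknown backbone: {model_key}") from None
    if feat_dim != expected:
        raise ValueError(f"{model_key}: got feature dim {feat_dim}, expected {expected}")
```

With timm, the width of a created model is available as its num_features attribute, which is the natural value to pass as feat_dim.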

Perspectives

Each triplet has separate heads for:

camera view
human view

The app selects the corresponding checkpoint at inference time.

Triplets

A triplet is one spatial-relation setup (reference, target, human) tied to a single scene id. Example triplet ids:

bridge_tree_truck
bridge_2_trashbin_bike
city_2_hydrant_taxi
winter_town_2_snowman_husky
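
The example ids above appear to follow a <scene id>_<object>_<object> pattern, where the scene id may itself contain underscores (winter_town_2). The sketch below is a HEURISTIC parser built only on that observation: it assumes the last two underscore-separated tokens are single-word object names. It does not say which object is the reference and which is the target, and parse_triplet_id and TripletId are hypothetical names.

```python
from typing import NamedTuple


class TripletId(NamedTuple):
    scene_id: str
    object_a: str
    object_b: str


def parse_triplet_id(triplet_id: str) -> TripletId:
    """Split a triplet id into a scene id and two object names.

    HEURISTIC (assumption): the last two tokens are single-word object names and
    everything before them is the scene id. Reference/target roles are not inferred.
    """
    parts = triplet_id.split("_")
    if len(parts) < 3:
        raise ValueError(f"not a triplet id: {triplet_id!r}")
    return TripletId("_".join(parts[:-2]), parts[-2], parts[-1])


for tid in ["bridge_tree_truck", "bridge_2_trashbin_bike", "winter_town_2_snowman_husky"]:
    print(parse_triplet_id(tid))
```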

Each (backbone, perspective, triplet) combination maps to one .pt checkpoint.

Checkpoint format and path contract

Checkpoint paths in the manifest follow the Hub-style layout:

checkpoints/<model_key>/<perspective>/<triplet_id>.pt

Example:

checkpoints/vit_small_patch16_dinov3.lvd1689m/camera/bridge_2_trashbin_bike.pt
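
The path contract can be expressed as a small function. This is a sketch of the template above, not the app's actual helper: checkpoint_path and PERSPECTIVES are hypothetical names. It joins with "/" explicitly because Hub paths always use forward slashes, regardless of the local operating system.

```python
PERSPECTIVES = ("camera", "human")


def checkpoint_path(model_key: str, perspective: str, triplet_id: str) -> str:
    """Build 'checkpoints/<model_key>/<perspective>/<triplet_id>.pt'.

    Uses '/' rather than os.path.join, which would produce backslashes on Windows.
    """
    if perspective not in PERSPECTIVES:
        raise ValueError(f"perspective must be one of {PERSPECTIVES}, got {perspective!r}")
    return "/".join(("checkpoints", model_key, perspective, f"{triplet_id}.pt"))


print(checkpoint_path("vit_small_patch16_dinov3.lvd1689m", "camera", "bridge_2_trashbin_bike"))
```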

Legacy paths (artifacts/checkpoints/...) are still accepted; the app normalizes them to the Hub-style form automatically.
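
One plausible form of this normalization is sketched below. It ASSUMES the only difference between the two forms is the leading "artifacts/" segment; the app's actual rule may handle more cases. normalize_checkpoint_path is a hypothetical name.

```python
LEGACY_PREFIX = "artifacts/checkpoints/"
HUB_PREFIX = "checkpoints/"


def normalize_checkpoint_path(path: str) -> str:
    """Rewrite a legacy 'artifacts/checkpoints/...' path to 'checkpoints/...'.

    ASSUMPTION: only the leading 'artifacts/' segment differs between the forms.
    Paths already in Hub-style form are returned unchanged.
    """
    if path.startswith(LEGACY_PREFIX):
        return HUB_PREFIX + path[len(LEGACY_PREFIX):]
    return path
```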

Model Hub repository structure

Recommended layout, a single model repository:

<hf-username>/sparrta-probes

Structure:

checkpoints/
  vit_small_patch16_dinov3.lvd1689m/
    camera/*.pt
    human/*.pt
  vit_base_patch16_dinov3.lvd1689m/
    camera/*.pt
    human/*.pt
  vit_large_patch16_dinov3.lvd1689m/
    camera/*.pt
    human/*.pt
  vit_huge_plus_patch16_dinov3.lvd1689m/
    camera/*.pt
    human/*.pt
  vit_7b_patch16_dinov3.lvd1689m/
    camera/*.pt
    human/*.pt
manifest.yaml
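
To check a repository against this layout, a flat file listing can be grouped by backbone and perspective. The listing could come from huggingface_hub.list_repo_files(repo_id), which returns the repository's file paths as strings; the sketch below takes such a list as input so it runs offline. index_checkpoints is a hypothetical helper, and the file list in the example is illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def index_checkpoints(repo_files: Iterable[str]) -> Dict[str, Dict[str, List[str]]]:
    """Group 'checkpoints/<model_key>/<perspective>/<triplet_id>.pt' paths.

    Returns {model_key: {perspective: [triplet_id, ...]}}; other files are skipped.
    """
    index: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for path in repo_files:
        parts = path.split("/")
        if len(parts) != 4 or parts[0] != "checkpoints" or not parts[3].endswith(".pt"):
            continue  # e.g. manifest.yaml, README.md, .gitattributes
        _, model_key, perspective, filename = parts
        index[model_key][perspective].append(filename[: -len(".pt")])
    # Convert to plain dicts with sorted triplet ids for stable output.
    return {m: {p: sorted(ids) for p, ids in per.items()} for m, per in index.items()}


files = [
    "manifest.yaml",
    "checkpoints/vit_small_patch16_dinov3.lvd1689m/camera/bridge_2_trashbin_bike.pt",
    "checkpoints/vit_small_patch16_dinov3.lvd1689m/human/bridge_2_trashbin_bike.pt",
    "checkpoints/vit_small_patch16_dinov3.lvd1689m/camera/bridge_tree_truck.pt",
]
print(index_checkpoints(files))
```

Comparing the camera and human lists for a backbone is a quick way to spot a triplet that is missing one of its two heads (bridge_tree_truck in the example above).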

How Space loads model checkpoints

The Space app determines which model repository to load checkpoints from using:

the SPARRTA_MODEL_REPO_ID environment variable (preferred), or
the model_repo_id field in space/artifacts/manifest.yaml.

At runtime, the app fetches checkpoint files with hf_hub_download(...), which stores them in the local Hugging Face cache so later requests reuse the downloaded copy.
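
The resolution order can be sketched as follows. This is a minimal illustration, not the app's code: resolve_model_repo_id and fetch_checkpoint are hypothetical names, and the manifest is ASSUMED to be already parsed into a dict (for example with yaml.safe_load) with model_repo_id as a top-level key. hf_hub_download is the real huggingface_hub function and returns the local path of the cached file.

```python
import os
from typing import Mapping, Optional


def resolve_model_repo_id(manifest: Mapping[str, object],
                          env: Optional[Mapping[str, str]] = None) -> str:
    """Prefer SPARRTA_MODEL_REPO_ID; fall back to the manifest's model_repo_id."""
    env = os.environ if env is None else env
    repo_id = env.get("SPARRTA_MODEL_REPO_ID") or manifest.get("model_repo_id")
    if not repo_id:
        raise ValueError("set SPARRTA_MODEL_REPO_ID or model_repo_id in manifest.yaml")
    return str(repo_id)


def fetch_checkpoint(repo_id: str, filename: str) -> str:
    """Download one checkpoint (or reuse the cached copy) and return its local path."""
    from huggingface_hub import hf_hub_download  # imported lazily; needs huggingface_hub
    return hf_hub_download(repo_id=repo_id, filename=filename)
```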

Optional:

SPARRTA_MODEL_REVISION pins a specific tag or commit. If it is unset, the app uses the latest revision on the main branch.
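
Pinning amounts to forwarding the variable as the revision argument of hf_hub_download, where revision=None means the main branch. The sketch below is hypothetical (resolve_revision and fetch_pinned_checkpoint are illustrative names) and ASSUMES that an empty value should count as unset.

```python
import os
from typing import Mapping, Optional


def resolve_revision(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the pinned revision, or None to follow the main branch."""
    env = os.environ if env is None else env
    return env.get("SPARRTA_MODEL_REVISION") or None  # ASSUMPTION: "" counts as unset


def fetch_pinned_checkpoint(repo_id: str, filename: str) -> str:
    """Download a checkpoint at the pinned revision, if any."""
    from huggingface_hub import hf_hub_download  # imported lazily; needs huggingface_hub
    return hf_hub_download(repo_id=repo_id, filename=filename, revision=resolve_revision())
```

A full commit hash is the most reproducible choice, since a branch always moves to new commits and a tag can in principle be deleted and recreated.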

Citation

If you find this research useful, please consider citing:

@misc{kargin2026sparrta,
  title={SpaRRTa: A Synthetic Benchmark for Evaluating Spatial Intelligence in Visual Foundation Models},
  author={Turhan Can Kargin and Wojciech Jasiński and Adam Pardyl and Bartosz Zieliński and Marcin Przewięźlikowski},
  year={2026},
  eprint={2601.11729},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.11729}
}