Agents · Medium impact · For Dev · arXiv Agents · June 11, 2026

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

SpatialClaw introduces a flexible code-based action interface for vision-language agents to perform complex 3D/4D spatial reasoning, significantly improving accuracy across diverse benchmarks.
Signal strength: 3.4/5 · arXiv Agents

What happened

The paper proposes SpatialClaw, a training-free framework in which an agent backed by a vision-language model iteratively writes executable Python code cells to carry out spatial reasoning. The cells run in a stateful kernel that exposes perception and geometry primitives. The approach outperforms prior spatial agents by 11.2 percentage points in average accuracy across 20 varied spatial reasoning tasks, without model-specific or benchmark-specific tuning.
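To make the idea of a stateful kernel concrete, here is a minimal sketch of how code cells can share state across steps. It is not SpatialClaw's actual API: the `StatefulKernel` class, the `run_cell` method, and the `distance_3d` primitive are hypothetical names chosen for illustration, and the hard-coded object positions stand in for what a real perception primitive would return.

```python
import contextlib
import io
import math


def distance_3d(a, b):
    """Hypothetical geometry primitive: Euclidean distance between two 3D points."""
    return math.dist(a, b)


class StatefulKernel:
    """Illustrative kernel: every cell runs in one shared namespace,
    so variables defined by an earlier cell are visible to later cells."""

    def __init__(self, primitives):
        # Primitives are pre-loaded so that cell code can call them by name.
        self.namespace = dict(primitives)

    def run_cell(self, source):
        # Capture printed output; it serves as the observation returned to the agent.
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)
        return buffer.getvalue()


kernel = StatefulKernel({"distance_3d": distance_3d})

# Cell 1: stand-in for a perception step (positions are assumed, not detected).
kernel.run_cell("chair = (0.0, 0.0, 0.0)\ntable = (3.0, 4.0, 0.0)")

# Cell 2: reuses the variables from cell 1 and calls a geometry primitive.
observation = kernel.run_cell("gap = distance_3d(chair, table)\nprint(gap)")
```

In an agent loop of this shape, the model would read `observation` and decide what cell to write next. A real system would also need sandboxing and error handling, which this sketch omits.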

Why it matters

Spatial reasoning remains difficult for AI systems, and for vision-language models in particular. SpatialClaw’s code-based action interface lets an agent compose spatial analyses adaptively, which improves its ability to reason about complex 3D/4D environments.
