Target-Bench:
Can World Models Achieve Mapless Path Planning with Semantic Targets?
Dingrui Wang1,2,* Hongyuan Ye1,* Zhihao Liang1,* Zhexiao Sun1,*
Zhaowei Lu1 Yuchen Zhang1 Yuyu Zhao1 Yuan Gao1 Marvin Seegert1 Finn Schäfer1
Haotong Qin3 Wei Li4 Luigi Palmieri2 Felix Jahncke1 Mattia Piccinini1 Johannes Betz1
1TUM 2Bosch AI Center 3ETH 4NJU
* Equal contribution; author order settled via Mario Kart.

Overview

Target-Bench Framework. The pipeline comprises three stages: 1) World Model Inference generates video predictions conditioned on a semantic target; 2) Spatial-temporal Reconstruction recovers camera poses from generated frames; 3) Evaluation compares the reconstructed path against ground truth.

Target-Bench Framework
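The three stages can be read as a simple function composition. The following is a minimal sketch under stated assumptions, not the benchmark's actual code: `run_pipeline`, the type aliases, and the three callables are hypothetical names introduced here for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

Frame = object                      # hypothetical placeholder for an image type
Pose = Tuple[float, float, float]   # hypothetical: camera position (x, y, z)
Path = List[Pose]


def run_pipeline(
    current_frame: Frame,
    semantic_target: str,
    world_model: Callable[[Frame, str], Sequence[Frame]],
    world_decoder: Callable[[Sequence[Frame]], Path],
    evaluate: Callable[[Path, Path], float],
    ground_truth: Path,
) -> float:
    """Chain the three stages: generate a video, recover a path, score it."""
    # Stage 1: the world model predicts future frames from the current
    # frame and the semantic target.
    frames = world_model(current_frame, semantic_target)
    # Stage 2: spatial-temporal reconstruction recovers a camera path
    # from the generated frames.
    predicted_path = world_decoder(frames)
    # Stage 3: the reconstructed path is compared against ground truth.
    return evaluate(predicted_path, ground_truth)
```

Passing the stages in as callables keeps the sketch agnostic to which video generation model or reconstruction tool is plugged in.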

Hardware Configuration. The data collection platform is a Unitree Go1 quadruped robot. Key sensors include a Livox MID-360 LiDAR for mapping and a RealSense D435i stereo camera for visual data. A Jetson AGX Orin handles onboard processing, and a network switch manages the connections between components.

Robot Hardware Setup

Dataset Interactive Demo

Compare ground-truth robot navigation videos with the outputs of different video generation models.

Evaluation Pipeline

Evaluation Pipeline Overview

1. Current Frame & Semantic Task

Task: Move next to the person and stop at the side of the sofa

Current Frame

Ground Truth Video

2. World Model

3. Spatial-temporal Reconstruction

4. Extrinsic Extraction & Path Builder

5. Evaluation with Metrics
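Step 4 turns per-frame camera extrinsics into a path. A minimal sketch of one way to do this is shown below, assuming each extrinsic is a world-to-camera transform (x_cam = R x_world + t), so that the camera center is C = -Rᵀt. The function names `camera_center` and `build_path` are hypothetical, and the convention used by a given reconstruction tool should be checked before reusing this.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]


def camera_center(R: Mat3, t: Sequence[float]) -> Vec3:
    """Camera position in world coordinates, assuming x_cam = R @ x_world + t.

    Solving R @ C + t = 0 gives C = -R^T @ t.
    """
    return tuple(-sum(R[k][i] * t[k] for k in range(3)) for i in range(3))


def build_path(extrinsics: Sequence[Tuple[Mat3, Sequence[float]]]) -> List[Vec3]:
    """Camera centers expressed in the first camera's frame.

    The first waypoint is therefore the origin, which makes paths from
    different videos start at the same place.
    """
    R0, t0 = extrinsics[0]
    path = []
    for R, t in extrinsics:
        c = camera_center(R, t)
        # Map the world-frame center into the first camera's frame: R0 @ c + t0.
        path.append(
            tuple(sum(R0[i][k] * c[k] for k in range(3)) + t0[i] for i in range(3))
        )
    return path
```

Expressing the path relative to the first frame is a design choice made here for illustration; it removes the arbitrary world origin chosen by the reconstruction tool but does not resolve scale ambiguity.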

Quantitative Results

World Models Comparison. We benchmark state-of-the-art video generation models (Sora, Veo, and the Wan series). To quantify their planning capability, we introduce a Weighted Overall (WO) Score that aggregates five metrics: Average and Final Displacement Error (ADE/FDE) for accuracy, Miss Rate (MR) for reliability, and Soft Endpoint (SE) together with Approach Consistency (AC) for semantic goal adherence.
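For readers unfamiliar with the displacement metrics, the sketch below uses their standard trajectory-prediction definitions. It is illustrative only: the miss threshold, the weights, and the normalization in `weighted_overall` are hypothetical parameters rather than the benchmark's values, and SE and AC are omitted because their definitions are not given here.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]


def ade(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Average Displacement Error: mean distance over paired waypoints."""
    if len(pred) != len(gt):
        raise ValueError("paths must be resampled to the same length first")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def fde(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Final Displacement Error: distance between the two endpoints."""
    return math.dist(pred[-1], gt[-1])


def miss_rate(fdes: Sequence[float], threshold: float) -> float:
    """Fraction of samples whose endpoint error exceeds a threshold.

    The threshold is an assumption supplied by the caller.
    """
    return sum(e > threshold for e in fdes) / len(fdes)


def weighted_overall(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-metric scores.

    Assumes the scores were already normalized to a common
    higher-is-better scale; the weights are hypothetical inputs.
    """
    total = sum(weights.values())
    return sum(weights[k] * scores[k] for k in weights) / total
```

A short usage example with made-up paths:

```python
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
gt = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
print(ade(pred, gt), fde(pred, gt))  # 1.0 2.0
```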

Model Comparison VGGT

World model performance comparison, using VGGT as the world decoder for spatio-temporal reconstruction.


World Decoders Comparison. We employ VGGT, SpaTracker, and ViPE as world decoders to recover 3D camera trajectories from the generated videos, enabling direct comparison with ground truth paths.

Overall Score Comparison

Overall score comparison between different spatio-temporal reconstruction tools.