INSIGHT:
Inference-time Sequence Introspection for Help-Triggering in VLAs


Abstract

Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating failures and requesting help from a human supervisor. We present INSIGHT, a learning framework for leveraging token-level uncertainty signals to predict when a VLA should request help. Using π0-FAST as the underlying model, we extract per-token entropy, log-probability, and Dirichlet-based estimates of aleatoric and epistemic uncertainty, and train compact transformer classifiers to map these sequences to help triggers. We explore strong and weak supervision regimes and extensively compare them across in-distribution and out-of-distribution tasks. Our results show a trade-off: strong labels enable models to capture fine-grained uncertainty dynamics for reliable help detection, while weak labels, though noisier, still support competitive introspection when training and evaluation are aligned, offering a scalable path when dense annotation is impractical. Crucially, we find that modeling the temporal evolution of token-level uncertainty signals with transformers provides far greater predictive power than static sequence-level scores. This study provides the first systematic evaluation of uncertainty-based introspection in VLAs, opening future avenues for active learning and for real-time error mitigation through selective human intervention.

Problem

Inference in VLAs

Our work builds upon π0-FAST, a VLA model that takes as input a natural language instruction, RGB images of the environment, and the robot’s state. The model processes these inputs in a forward pass and auto-regressively generates a sequence of tokens. These tokens are then decoded into continuous joint actions, which the robot executes in the environment.

Inference process of π0-FAST

Importantly, each token is drawn from its own probability distribution over the action vocabulary. This exposes token-level uncertainty signals, such as entropy, negative log-likelihood, or Dirichlet-based measures. Our central research question is therefore: Can token-level uncertainty reliably indicate when the robot should request human help?

Inference process GIF showing token-level probability distributions
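
To make these signals concrete, the sketch below shows how the four per-token features could be read off the decoder logits of a single inference. This is a minimal illustration: the evidential construction α = exp(logits) + 1 is our assumption for the Dirichlet-based terms, and the paper's exact parameterization may differ.

    import torch
    import torch.nn.functional as F

    def token_uncertainties(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        """Per-token uncertainty features from one inference.

        logits:    (n, V) raw decoder scores over the action-token vocabulary.
        token_ids: (n,)   long tensor of the tokens actually decoded.
        Returns an (n, 4) feature matrix: [entropy, -log p, AU, EU].
        """
        log_p = F.log_softmax(logits, dim=-1)                    # (n, V)
        p = log_p.exp()

        # Shannon entropy of each token's categorical distribution.
        entropy = -(p * log_p).sum(-1)

        # Negative log-likelihood of the decoded token.
        nll = -log_p.gather(-1, token_ids[:, None]).squeeze(-1)

        # Dirichlet evidence (ASSUMED construction: alpha = exp(logits) + 1;
        # the paper's exact parameterization may differ).
        alpha = logits.exp() + 1.0
        alpha0 = alpha.sum(-1, keepdim=True)
        p_bar = alpha / alpha0                                   # expected categorical

        # Aleatoric: expected entropy under the Dirichlet.
        au = -(p_bar * (torch.digamma(alpha + 1.0) - torch.digamma(alpha0 + 1.0))).sum(-1)
        # Epistemic: mutual information = entropy of the mean - expected entropy.
        total = -(p_bar * p_bar.clamp_min(1e-12).log()).sum(-1)
        eu = total - au

        return torch.stack([entropy, nll, au, eu], dim=-1)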
Granularity of signals

Episode vs Step vs Token

With the inference loop in mind, we next clarify the three levels of granularity—episode, step, and token—used throughout our analysis.

Episode, Step, Token levels and where uncertainty lives.
INSIGHT taps uncertainty at the token level from the VLA, then reasons over temporal windows to trigger help.

Token is the smallest unit produced during inference in VLAs. A VLA generates n tokens, each drawn from its own probability distribution over the vocabulary.

Action denotes the chunk of robot actions decoded from the n tokens; for π0-FAST, n is not fixed across inferences.

Step consists of one cycle of collecting an observation, performing inference, decoding the resulting token sequence into an action chunk, and executing it.

Episode refers to one complete rollout (success or failure), which consists of multiple inferences by the VLA. Formally, an episode includes all K steps:

$E = \left(a^{1}_{1:H_1},\; a^{2}_{1:H_2},\; \dots,\; a^{K}_{1:H_K}\right)$, where $a^{k}_{1:H_k}$ is the action chunk of length $H_k$ decoded and executed at step $k$.
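
For concreteness, one plausible way to organize the three granularities in code; all field names here are illustrative, not the paper's actual data schema.

    from dataclasses import dataclass
    from typing import List, Optional
    import numpy as np

    @dataclass
    class Step:
        """One cycle: observe -> infer -> decode -> execute."""
        token_features: np.ndarray        # (n_k, 4): entropy, -log p, AU, EU per token
        action_chunk: np.ndarray          # (H_k, dof): decoded continuous joint actions
        help_label: Optional[int] = None  # strong per-step label, if annotated

    @dataclass
    class Episode:
        """One complete rollout of K steps with an outcome label."""
        steps: List[Step]
        success: bool                     # weak episode-level label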

Key idea: model the temporal evolution of token-level uncertainty with a compact transformer → more reliable help triggers.
Modeling

Training Compact Transformers for Introspection

Our approach, INSIGHT, operates on token-level signals over time. We implement this by training a lightweight transformer that predicts when to request help.

Compact transformer diagram
A lightweight transformer over uncertainty sequences.
  • Inputs: token-level features per step: entropy, −log p, Dirichlet AU/EU.
  • Backbone: 1–2 layers, small hidden size; positional encodings for time.
  • Objective: predict help/no-help for the window’s end step.
Why a small transformer? Token features are informative—the introspector transformer can stay compact for fast, online triggering.
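
A minimal PyTorch sketch of such an introspector, taking the four token-level features above as input; layer sizes, head count, and the learned positional embedding are illustrative choices rather than the paper's exact configuration.

    import torch
    import torch.nn as nn

    class IntrospectionTransformer(nn.Module):
        """Compact introspector over token-level uncertainty sequences."""

        def __init__(self, feat_dim=4, d_model=64, n_heads=4, n_layers=2, max_len=512):
            super().__init__()
            self.proj = nn.Linear(feat_dim, d_model)
            self.pos = nn.Embedding(max_len, d_model)   # positional encoding for time
            layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=n_heads, dim_feedforward=128,
                dropout=0.1, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.head = nn.Linear(d_model, 1)           # help / no-help logit

        def forward(self, x):                           # x: (B, L, feat_dim)
            t = torch.arange(x.size(1), device=x.device)
            h = self.encoder(self.proj(x) + self.pos(t)[None])
            return self.head(h[:, -1, :]).squeeze(-1)   # logit for the window's end step

At run time, a sliding window of recent features is passed through the model and the end-step logit is thresholded to decide whether to request help.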
Supervision

Strong vs Weak Labels

How do we train this model? The effectiveness of introspection depends on how the compact transformer is supervised. We compare dense step-level (strong) and scalable episode-level (weak) labels.

Strong (step labels)

Dense, per-step “help/no-help” supervision. Captures the onset and decay of uncertainty around challenging events.

Strong label example GIF
Each step in an episode is labeled as needing help or not
  • Pros: precise timing, best early detection.
  • Cons: costly to annotate densely.

Weak (episode labels / MIL)

Only episode-level success/failure labels are available. We use multiple-instance learning (MIL) to supervise the transformer; a minimal sketch follows the list below.

Weak label example GIF
Each episode is labeled as success or failure
  • Pros: scalable, easy to collect.
  • Cons: noisier; results in delayed help triggers.
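
The sketch referenced above: one standard MIL instantiation that max-pools per-step logits into a bag score (the paper's exact pooling, e.g. noisy-OR or attention pooling, may differ).

    import torch
    import torch.nn.functional as F

    def mil_loss(step_logits: torch.Tensor, episode_failed: torch.Tensor) -> torch.Tensor:
        """MIL objective from episode-level labels only.

        step_logits:    (B, K) per-step help logits from the introspector.
        episode_failed: (B,)   1.0 if the episode failed, else 0.0.

        A failed episode should contain at least one help-worthy step, so the
        per-step logits are pooled into a bag score and supervised directly.
        """
        bag_logit = step_logits.max(dim=1).values  # max-pooling: one standard MIL choice
        return F.binary_cross_entropy_with_logits(bag_logit, episode_failed.float())

Because per-step logits are supervised only through the pooled score, weak supervision tends to localize help later than dense step labels, consistent with the delayed triggers noted above.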
Datasets

Evaluation Settings

We evaluate across three settings to probe generalization and robustness:

In-Distribution

Lift and Pick-Place variants in the Kitchen (real-world) environment. Objects and backgrounds match those seen during training.

Distribution Shift

Evaluation in the Kitchen with distribution shifts: new distractor objects, changes in shape, color, location, and orientation. Tests robustness to visual variation while staying in the same environment.

Simulated OOD

Cross-domain evaluation between real Kitchen and simulated LIBERO-10. Objects and tasks differ completely, and the underlying π0-FAST checkpoints differ as well (Kitchen-trained vs. LIBERO-trained). This setting represents true out-of-distribution generalization.

Rollouts

If you would like to explore examples from our datasets, this interactive section allows you to browse rollout videos filtered by dataset (In-Distribution, Distribution Shift, or Simulation-OOD) and by outcome (Success or Failure). These examples illustrate how fine-tuned π0-FAST behaves across different settings and why uncertainty-aware help detection is necessary.

Results

Results for the transformer (INSIGHT) and Conformal Prediction based on entropy (CP-E) and perplexity (CP-P). Boxplots show mean (dashed) and median (solid) across folds; error bars denote ±1 SD. Paired Wilcoxon significance: * p<0.05, ** p<0.01. Higher is better unless noted.
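
For reference, the paired Wilcoxon comparisons across folds can be reproduced with scipy; the per-fold numbers below are illustrative placeholders, not the paper's results.

    from scipy.stats import wilcoxon

    # Per-fold F1 for two methods. These values are ILLUSTRATIVE PLACEHOLDERS,
    # not numbers from the paper -- substitute the real per-fold scores.
    insight_f1 = [0.81, 0.78, 0.83, 0.80, 0.79, 0.82, 0.77, 0.84, 0.80, 0.81]
    cp_e_f1    = [0.68, 0.71, 0.66, 0.70, 0.69, 0.72, 0.65, 0.70, 0.68, 0.71]

    stat, p = wilcoxon(insight_f1, cp_e_f1)  # paired, non-parametric
    print(f"W={stat:.1f}, p={p:.4f}")        # p<0.05 -> '*', p<0.01 -> '**'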

In-Distribution Performance

Setting. Training, calibration, and testing all come from the same distribution, so Conformal Prediction's coverage guarantees formally hold.

Do uncertainty metrics provide predictive power for requesting help?

Yes. Across all experiments, token-level uncertainty signals (entropy, log-probability, aleatoric and epistemic uncertainty) offered predictive power beyond random guessing (accuracy/F1 > 0.5).

Boxplots of accuracy and F1 across 10 folds.

Signal exists. Token-level uncertainty (entropy, log-p, AU, EU) carries predictive power beyond random (accuracy/F1 > 0.5).

Baselines vs. INSIGHT. CP—though sequence-level—sometimes reaches ~0.7 under weak-label evaluation when calibration and evaluation align, but temporal transformers remain stronger overall.
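
For context, here is a minimal split-conformal sketch of how such a sequence-level baseline could calibrate its trigger threshold; the exact nonconformity scores and calibration protocol used for CP-E/CP-P may differ in detail.

    import numpy as np

    def cp_threshold(cal_scores: np.ndarray, alpha: float = 0.1) -> float:
        """Split-conformal trigger threshold from calibration scores.

        cal_scores: sequence-level nonconformity scores (e.g., mean token
        entropy for CP-E, perplexity for CP-P) from calibration episodes.
        Uses the standard finite-sample-corrected quantile.
        """
        n = len(cal_scores)
        q = np.ceil((n + 1) * (1.0 - alpha)) / n
        return float(np.quantile(cal_scores, min(q, 1.0), method="higher"))

    def cp_trigger(score: float, threshold: float) -> bool:
        """Request help when the episode's score is nonconforming."""
        return score > threshold

Under distribution shift, calibration and test episodes are no longer exchangeable, which is why the coverage guarantee discussed in the next subsection breaks.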

Distribution Shift Performance

Setting. Object positions/orientations and unseen objects differ from training; Conformal Prediction’s exchangeability assumption breaks, so results are diagnostic rather than guaranteed.

How transferable are models under distribution shift?

All models degraded under distribution shift. Strongly supervised models degraded least.

Boxplots under distribution shift.

Label/regime mismatch. The weak-training / strong-testing case can drop below F1=0.5 due to noisy supervision evaluated with strict labels under shift.

Large In-Distribution Performance

Setting. In-Distribution and Distribution Shift datasets are combined to increase training diversity while maintaining exchangeability, letting us assess whether scaling within distribution helps.

Does adding more data help?

Adding shifted data to increase training size did not consistently improve robustness. In fact, strongly supervised transformers sometimes saw small drops in performance. Weakly supervised models benefited more when evaluated with weak labels, but overall, label quality proved more important than sheer dataset size.

Large-scale in-distribution results.

Takeaway. Modeling the temporal evolution of token-level uncertainty with transformers is more predictive than static sequence-level scores, supporting sequential models for help detection.

Simulation OOD

Setting. Train on real-world π0-FAST, test on a different π0-FAST fine-tuned on LIBERO → shift in tasks and policy behavior; a highly OOD stress test.

How do models transfer across simulation OOD?

Despite being trained on data from the Kitchen-tuned (real-world) π0-FAST, INSIGHT transferred surprisingly well to tests on the simulation-based, LIBERO-tuned π0-FAST. Strongly supervised models, especially the jumbo variant, reached accuracy and F1 approaching that of the sim-only benchmark. This shows that token-level uncertainty features are robust across environments and policy checkpoints, suggesting that strong-label introspection modules can generalize without retraining or re-annotation.

Simulation OOD results.

Robust transfer. Strongly supervised INSIGHT generalizes to simulation, with jumbo models nearing sim-only performance.

Help Timing & Frequency

What we measure. Time-to-First-Help (TTFH), trigger counts on success vs. failure, and per-step trigger rates characterize when and how often interventions occur.
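
One plausible operationalization of these metrics from a per-step binary trigger sequence (our reading of the definitions; e.g., the paper may measure TTFH in seconds rather than steps):

    import numpy as np

    def help_timing_metrics(triggers: np.ndarray, failed: bool) -> dict:
        """Timing/frequency metrics from a per-step 0/1 trigger sequence.

        triggers: (K,) array with 1 where the introspector requested help.
        TTFH is measured in steps here; the paper may use a different unit.
        """
        idx = np.flatnonzero(triggers)
        ttfh = int(idx[0]) if idx.size else None  # steps until first trigger
        return {
            "ttfh": ttfh,
            "count": int(triggers.sum()),
            "rate": float(triggers.mean()),       # per-step trigger rate
            "failed": failed,
        }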

When and how often do models trigger help?

The strongly supervised model triggered help earliest and most frequently, maximizing failure coverage but risking over-intervention. Weakly supervised models were conservative, rarely interrupting successes but sometimes missing failures.

Method          | TTFH (fail) ↓ | Triggers (succ) | Triggers (fail, ≥ 1 ok) | Trigger Rate (success) ↓ | Trigger Rate (fail) ↑
CP-W (Entropy)  | 6.891 ± 2.257 | 0.457 ± 0.302   | 1.721 ± 0.739           | 0.031 ± 0.020            | 0.118 ± 0.050
Strong Superv.  | 5.597 ± 0.809 | 0.710 ± 0.440   | 7.062 ± 1.225           | 0.047 ± 0.029            | 0.472 ± 0.081
Weak Superv.    | 7.929 ± 1.867 | 0.122 ± 0.172   | 1.566 ± 1.025           | 0.008 ± 0.011            | 0.105 ± 0.069

Early vs. conservative. Strong supervision fires earlier on failing episodes (lower TTFH) and more often during fails (higher fail-rate), at the cost of more triggers on successes; CP-W and Weak are more conservative (fewer success triggers) but react later/less on fails.

Takeaways

What INSIGHT Shows

Token-level introspection works. Modeling temporal uncertainty with a compact transformer yields earlier and more reliable help triggers than static sequence-level scores, demonstrating the need to capture sequence dynamics rather than treating uncertainty as a single snapshot.

Supervision quality matters more than scale. Strong step-level labels consistently provide the most precise timing and highest F1 scores. Weak labels remain competitive when training and testing align, but dataset size was less important than label fidelity for robust introspection.

Weak supervision remains viable for scalability. Episode-level labels are easy and cheap to collect and can support deployment scenarios where dense step annotations are infeasible, though they inevitably trade off some fidelity compared to strong supervision.

Help triggering is a design choice. Strongly supervised models intervene earlier and more frequently, maximizing safety coverage but risking over-intervention. Weakly supervised and CP-based models are more conservative, favoring unobtrusiveness at the cost of missed failures.

Generalization is challenging but tractable. All models degrade under distribution shift, but strongly supervised transformers degrade least and maintain more balanced precision and recall. Moreover, models trained on real-world data transferred surprisingly well to simulation OOD, suggesting stability of token-level uncertainty features across environments and checkpoints.

Conformal prediction provides a baseline, not a substitute. CP can align with weak-label evaluation when calibration and testing match, but its reliance on aggregated sequence scores limits robustness compared to temporal modeling.

Future Directions

Active learning is the next frontier. INSIGHT’s uncertainty signals create natural hooks for prioritizing annotation, adaptively setting thresholds, and guiding human-in-the-loop recovery policies. Extending introspection into active data collection loops offers a path toward scalable, real-time learning.

Toward adaptive deployment. Future work should explore how help-triggering policies can adapt dynamically to different environments, user preferences, and task criticalities—balancing safety, efficiency, and user experience.

BibTeX

If you found this useful, please cite INSIGHT using the entry below.

    @misc{karli2025insight,
      title={INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models},
      author={Ulas Berk Karli and Ziyao Shangguan and Tesca Fitzgerald},
      year={2025},
      eprint={2510.01389},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2510.01389},
    }
