Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack
introspective mechanisms for anticipating failures and requesting help from a human supervisor. We
present INSIGHT, a learning framework for leveraging token-level uncertainty signals to predict
when a VLA should request help. Using π0-FAST as the underlying model, we extract per-token
entropy, log-probability, and Dirichlet-based estimates of aleatoric and epistemic
uncertainty, and train compact transformer classifiers to map these sequences to help triggers.
We explore strong and weak supervision regimes, and extensively compare them across
in-distribution and out-of-distribution tasks. Our results show a trade-off: strong labels enable models
to capture fine-grained uncertainty dynamics for reliable help detection, while weak labels, though
noisier, still support competitive introspection when training and evaluation are aligned, offering a
scalable path when dense annotation is impractical. Crucially, we find that modeling the temporal
evolution of token-level uncertainty signals with transformers provides far greater predictive power
than static sequence-level scores. This study provides the first systematic evaluation of
uncertainty-based introspection in VLAs, opening future avenues for active learning and for real-time
error mitigation through selective human intervention.
Problem
Inference in VLAs
Our work builds upon π0-FAST, a VLA model that takes as input a natural language instruction,
RGB images of the environment, and the robot’s state. The model processes these inputs in a forward pass and auto-regressively generates
a sequence of tokens. These tokens are then decoded into continuous joint actions, which the robot executes in the environment.
Importantly, each token is drawn from its own probability distribution over the action vocabulary. This exposes token-level uncertainty signals,
such as entropy, negative log-likelihood, or Dirichlet-based measures. Our central research question is therefore:
Can token-level uncertainty reliably indicate when the robot should request human help?
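To make the signal extraction concrete, the sketch below shows how per-token entropy, negative log-probability, and one common Dirichlet-style split into aleatoric/epistemic uncertainty could be computed from a step's decoder logits. It is a minimal illustration under stated assumptions, not the π0-FAST implementation: the function name, the direct access to per-token logits, and the choice of Dirichlet concentration parameters are ours.

```python
# Minimal sketch: per-token uncertainty features from one inference step.
# Assumptions: we can read the decoder's per-token logits; the Dirichlet
# construction (concentrations proportional to the predictive probabilities,
# as in evidential deep learning) is illustrative, not the paper's recipe.
import torch
import torch.nn.functional as F

def token_uncertainty_features(token_logits: torch.Tensor,
                               sampled_ids: torch.Tensor) -> torch.Tensor:
    """token_logits: (n_tokens, vocab_size); sampled_ids: (n_tokens,)."""
    log_probs = F.log_softmax(token_logits, dim=-1)            # (n, V)
    probs = log_probs.exp()

    # Entropy of each token's distribution over the action vocabulary.
    entropy = -(probs * log_probs).sum(dim=-1)                  # (n,)

    # Negative log-probability of the token that was actually generated.
    nll = -log_probs.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)

    # Dirichlet-style decomposition: aleatoric = expected entropy under
    # Dir(alpha), epistemic = mutual information (total minus aleatoric).
    alpha = probs * token_logits.shape[-1] + 1.0                # (n, V)
    alpha0 = alpha.sum(dim=-1, keepdim=True)
    mean_p = alpha / alpha0
    total = -(mean_p * mean_p.clamp_min(1e-12).log()).sum(dim=-1)
    aleatoric = -(mean_p * (torch.digamma(alpha + 1.0)
                            - torch.digamma(alpha0 + 1.0))).sum(dim=-1)
    epistemic = total - aleatoric

    # One 4-dimensional feature vector per generated token.
    return torch.stack([entropy, nll, aleatoric, epistemic], dim=-1)
```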
Granularity of signals
Episode vs Step vs Token
With the inference loop in mind, we next clarify the three levels of granularity—episode, step, and token—used throughout our analysis.
INSIGHT taps uncertainty at the token level from the VLA,
then reasons over temporal windows to trigger help.
Token is the smallest unit produced during inference in VLAs.
A VLA generates n tokens, each drawn from its own probability distribution over the vocabulary.
Action is the chunk of robot actions decoded from the n tokens; for π0-FAST, the number of tokens
per inference is not fixed.
Step consists of one cycle of collecting an observation, performing inference,
decoding the resulting token sequence into an action chunk, and executing it.
Episode refers to one complete rollout (success or failure), which consists of multiple inferences by the VLA.
Formally, an episode includes all K steps:
E = (a_{1:H_1}, a_{1:H_2}, …, a_{1:H_K}),
where a_{1:H_k} denotes the action chunk produced at step k.
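As a concrete picture of these granularities, the sketch below shows one way the data could be organized; the field names and the 4-dimensional token features are illustrative assumptions, not the paper's data format.

```python
# Sketch of the episode / step / token hierarchy as a data container.
# Field names are illustrative; H_k (tokens per step) varies for π0-FAST.
from dataclasses import dataclass
from typing import List, Optional
import torch

@dataclass
class Step:
    token_features: torch.Tensor      # (H_k, 4): entropy, -log p, AU, EU per token
    help_label: Optional[int] = None  # strong (per-step) label, if available

@dataclass
class Episode:
    steps: List[Step]                 # K steps: E = (a_{1:H_1}, ..., a_{1:H_K})
    success: bool                     # weak (episode-level) label
```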
Key idea: model the temporal evolution of token-level uncertainty with
a compact transformer → more reliable help triggers.
Modeling
Training Compact Transformers for Introspection
Our approach, INSIGHT, operates on token-level signals over time. We implement this by training a lightweight transformer that predicts when to request help (sketched below).
A lightweight transformer over uncertainty sequences.
Inputs: token-level features per step: entropy, −log p, Dirichlet AU/EU.
Backbone: 1–2 layers, small hidden size; positional encodings for time.
Objective: predict help/no-help for the window’s end step.
Why a small transformer? Token features are informative—the introspector transformer can stay
compact for fast, online triggering.
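A minimal PyTorch sketch of such an introspection head is shown below; the hyperparameters (feature dimension, model width, window handling) are illustrative assumptions rather than the configuration used in the paper.

```python
# Minimal sketch of the introspection head: a small transformer encoder over a
# window of token-level uncertainty features, emitting a help/no-help logit for
# the window's final step. Hyperparameters are illustrative, not the paper's.
import torch
import torch.nn as nn
from typing import Optional

class IntrospectionTransformer(nn.Module):
    def __init__(self, n_features: int = 4, d_model: int = 64,
                 n_heads: int = 4, n_layers: int = 2, max_len: int = 512):
        super().__init__()
        self.input_proj = nn.Linear(n_features, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)      # learned positions (time)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)                  # help / no-help logit

    def forward(self, feats: torch.Tensor,
                pad_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # feats: (batch, seq_len, n_features) token-level uncertainty features.
        B, T, _ = feats.shape
        pos = torch.arange(T, device=feats.device).unsqueeze(0).expand(B, T)
        x = self.input_proj(feats) + self.pos_emb(pos)
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        return self.head(x[:, -1]).squeeze(-1)             # logit for the end step
```

Keeping the encoder to one or two layers, as noted above, keeps per-window inference cheap enough for online triggering.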
Supervision
Strong vs Weak Labels
How do we train this model? The effectiveness of introspection depends on how the compact transformer is supervised. We compare dense step-level (strong) and scalable episode-level (weak) labels.
Strong (step labels)
Dense, per-step “help/no-help” supervision. Captures the onset and decay of uncertainty around
challenging events.
Each step in an episode is labeled as needing help or not
Pros: precise timing, best early detection.
Cons: costly to annotate densely.
Weak (episode labels / MIL)
Only episode-level success/fail labels are available. We use multiple-instance learning (MIL) to supervise the transformer (see the sketch below).
Each episode is labeled as success or failure
Pros: scalable, easy to collect.
Cons: noisier; results in delayed help triggers.
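The sketch below illustrates one common MIL objective, under the assumption that an episode is treated as a bag of steps and the bag prediction is the max over per-step help logits; the exact pooling and loss used for INSIGHT may differ. Strong supervision, by contrast, applies the same binary cross-entropy densely per step.

```python
# Sketch of a multiple-instance learning (MIL) loss for weak, episode-level
# labels. Assumption: max-pooling per-step help logits into a bag prediction;
# the exact pooling/loss used in the paper may differ.
import torch
import torch.nn.functional as F

def weak_mil_loss(step_logits: torch.Tensor, episode_failed: torch.Tensor) -> torch.Tensor:
    """step_logits: (batch, K) help logits, one per step of each episode.
    episode_failed: (batch,) 1.0 if the episode failed (help was warranted)."""
    bag_logits = step_logits.max(dim=1).values   # "at least one step needs help"
    return F.binary_cross_entropy_with_logits(bag_logits, episode_failed)

def strong_step_loss(step_logits: torch.Tensor, step_labels: torch.Tensor) -> torch.Tensor:
    """Dense per-step supervision: step_labels has the same (batch, K) shape."""
    return F.binary_cross_entropy_with_logits(step_logits, step_labels)
```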
Datasets
Evaluation Settings
We evaluate across three settings to probe generalization and robustness:
In-Distribution
Lift and Pick-Place variants in the Kitchen (real-world) environment.
Objects and backgrounds match those seen during training.
Distribution Shift
Evaluation in the Kitchen with distribution shifts: new distractor objects,
changes in shape, color, location, and orientation. Tests robustness to visual variation
while staying in the same environment.
Simulated OOD
Cross-domain evaluation between real Kitchen and simulated LIBERO-10.
Objects and tasks differ completely, and the underlying π0-FAST checkpoints differ as well
(Kitchen-trained vs. LIBERO-trained). This setting represents true out-of-distribution generalization.
Rollouts
If you would like to explore examples from our datasets, this interactive section allows you to
browse rollout videos filtered by dataset (In-Distribution, Distribution Shift, or Simulation-OOD)
and by outcome (Success or Failure). These examples illustrate how fine-tuned π0-FAST
behaves across different settings and why uncertainty-aware help detection is necessary.
Results
Results for the transformer (INSIGHT) and Conformal Prediction based on entropy (CP-E) and
perplexity (CP-P). Boxplots show mean (dashed) and median (solid) across folds; error bars denote ±1 SD.
Paired Wilcoxon significance: * p<0.05, ** p<0.01. Higher is better unless noted.
In-Distribution Performance
Setting. Calibration and test data come from the same distribution,
so Conformal Prediction's coverage guarantees formally hold.
Do uncertainty metrics provide predictive power for requesting help?
Yes. Across all experiments, token-level uncertainty signals (entropy, log-probability, aleatoric and epistemic uncertainty)
offered predictive power beyond random guessing (accuracy/F1 > 0.5).
Signal exists. Token-level uncertainty (entropy, log-p, AU, EU) carries predictive
power beyond random (accuracy/F1 > 0.5).
Baselines vs. INSIGHT. CP—though sequence-level—sometimes reaches ~0.7 under
weak-label evaluation when calibration and evaluation align, but temporal transformers
remain stronger overall.
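For reference, the sketch below shows a split-conformal style threshold in the spirit of the CP-E/CP-P baselines; treating the mean token entropy of successful calibration episodes as the nonconformity score, and the specific quantile correction, are assumptions on our part.

```python
# Sketch of a split-conformal threshold on a sequence-level uncertainty score
# (e.g., mean token entropy), in the spirit of CP-E / CP-P. The score choice
# and calibration protocol are assumptions; the coverage guarantee (at most
# ~alpha triggers on successful episodes) only holds under exchangeability,
# i.e., when calibration and test data are in-distribution.
import numpy as np

def calibrate_threshold(calib_scores: np.ndarray, alpha: float = 0.1) -> float:
    """calib_scores: sequence-level scores of successful calibration episodes."""
    n = len(calib_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)   # finite-sample correction
    return float(np.quantile(calib_scores, q, method="higher"))

def trigger_help(test_score: float, threshold: float) -> bool:
    # Request help when the episode-level uncertainty exceeds the threshold.
    return test_score > threshold
```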
Distribution Shift Performance
Setting. Object positions/orientations and unseen objects differ from training;
Conformal Prediction’s exchangeability assumption breaks, so results are diagnostic rather than guaranteed.
How transferable are models under distribution shift?
All models degraded under distribution shift. Strongly supervised models degraded least.
Label/regime mismatch. The weak-training / strong-testing case can
drop below F1=0.5 due to noisy supervision evaluated with strict labels under shift.
Large In-Distribution Performance
Setting. In-Distribution and Distribution Shift datasets are combined to increase training diversity while
maintaining exchangeability, letting us assess whether scaling within
distribution helps.
Does adding more data help?
Adding shifted data to increase training size did not consistently improve robustness. In fact, strongly supervised
transformers sometimes saw small drops in performance. Weakly supervised models benefited more when evaluated with
weak labels, but overall, label quality proved more important than sheer dataset size.
Takeaway. Modeling the temporal evolution of token-level uncertainty with
transformers is more predictive than static sequence-level scores, supporting sequential
models for help detection.
Simulation OOD
Setting. Train on real-world π0-FAST, test on a different π0-FAST
fine-tuned on LIBERO → shift in tasks and policy behavior; a highly OOD stress test.
How do models transfer across simulation OOD?
Despite being trained on data from the real-world, Kitchen-tuned π0-FAST, INSIGHT transferred surprisingly well to
tests on the simulation-based, LIBERO-tuned π0-FAST. Strongly supervised models, especially the jumbo variant,
reached accuracy and F1 scores approaching those of the sim-only benchmark. This shows that token-level uncertainty features are
robust across environments and policy checkpoints, suggesting that strong-label introspection modules can generalize without
retraining or re-annotation.
Robust transfer. Strongly supervised INSIGHT generalizes to simulation,
with jumbo models nearing sim-only performance.
Help Timing & Frequency
What we measure. Time-to-First-Help (TTFH), trigger counts on success vs. failure,
and per-step trigger rates characterize when and how often interventions occur.
When and how often do models trigger help?
The strongly supervised model triggered help earliest and most frequently, maximizing failure coverage but risking
over-intervention. Weakly supervised models were conservative, rarely interrupting successes but sometimes missing
failures.
| Method | TTFH (fail) ↓ | Triggers (succ) ↓ | Triggers (fail) (≥ 1 ok) | Trigger Rate (success) ↓ | Trigger Rate (fail) ↑ |
|---|---|---|---|---|---|
| CP-W (Entropy) | 6.891 ± 2.257 | 0.457 ± 0.302 | 1.721 ± 0.739 | 0.031 ± 0.020 | 0.118 ± 0.050 |
| Strong Superv. | 5.597 ± 0.809 | 0.710 ± 0.440 | 7.062 ± 1.225 | 0.047 ± 0.029 | 0.472 ± 0.081 |
| Weak Superv. | 7.929 ± 1.867 | 0.122 ± 0.172 | 1.566 ± 1.025 | 0.008 ± 0.011 | 0.105 ± 0.069 |
Early vs. conservative. Strong supervision fires earlier on failing episodes (lower TTFH)
and more often during fails (higher fail-rate), at the cost of more triggers on successes;
CP-W and Weak are more conservative (fewer success triggers) but react later/less on fails.
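For concreteness, the sketch below shows one way the timing and frequency metrics above could be computed from per-step trigger decisions; the exact conventions (e.g., 1-indexed step counts for TTFH) are our assumptions.

```python
# Sketch: timing/frequency metrics from a binary per-step trigger sequence.
# Conventions (1-indexed steps, rate as fraction of steps) are assumptions.
from typing import List, Optional

def time_to_first_help(triggers: List[bool]) -> Optional[int]:
    """Steps until the first help trigger in an episode; None if never fired."""
    for step, fired in enumerate(triggers, start=1):
        if fired:
            return step
    return None

def trigger_rate(triggers: List[bool]) -> float:
    """Fraction of steps on which help was requested."""
    return sum(triggers) / len(triggers) if triggers else 0.0
```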
Takeaways
What INSIGHT Shows
Token-level introspection works. Modeling temporal uncertainty with a compact transformer yields earlier and more reliable help triggers than static sequence-level scores, demonstrating the need to capture sequence dynamics rather than treating uncertainty as a single snapshot.
Supervision quality matters more than scale. Strong step-level labels consistently provide the most precise timing and highest F1 scores. Weak labels remain competitive when training and testing align, but dataset size was less important than label fidelity for robust introspection.
Weak supervision remains viable for scalability. Episode-level labels are easy and cheap to collect and can support deployment scenarios where dense step annotations are infeasible, though they inevitably trade off some fidelity compared to strong supervision.
Help triggering is a design choice. Strongly supervised models intervene earlier and more frequently, maximizing safety coverage but risking over-intervention. Weakly supervised and CP-based models are more conservative, favoring unobtrusiveness at the cost of missed failures.
Generalization is challenging but tractable. All models degrade under distribution shift, but strongly supervised transformers degrade least and maintain more balanced precision and recall. Moreover, models trained on real-world data transferred surprisingly well to simulation OOD, suggesting stability of token-level uncertainty features across environments and checkpoints.
Conformal prediction provides a baseline, not a substitute. CP can align with weak-label evaluation when calibration and testing match, but its reliance on aggregated sequence scores limits robustness compared to temporal modeling.
Future Directions
Active learning is the next frontier. INSIGHT’s uncertainty signals create natural hooks for prioritizing annotation, adaptively setting thresholds, and guiding human-in-the-loop recovery policies. Extending introspection into active data collection loops offers a path toward scalable, real-time learning.
Toward adaptive deployment. Future work should explore how help-triggering policies can adapt dynamically to different environments, user preferences, and task criticalities—balancing safety, efficiency, and user experience.
BibTeX
If you found this useful, please cite INSIGHT using the entry below.
@misc{karli2025insight,
title={INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models},
author={Ulas Berk Karli and Ziyao Shangguan and Tesca Fitzgerald},
year={2025},
eprint={2510.01389},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2510.01389},
}