Title: VIEW-DISAGREEMENT ESCALATION ROBUST WEB-AGENT TRAJECTORY JUDGES
PDF: view-disagreement-escalation-trajectory-judges.pdf
Score: 4.8
Verdict: Reject
Confidence: 0.60
Elapsed: 46.3s

Strengths:
1. Clear mechanistic hypothesis with empirical validation: The 2.15× disagreement enrichment on attacked failures (16.4%) vs unmodified failures (7.6%) directly supports the claim that CoT manipulation asymmetrically affects the two views (Figure 2, Section 4.3). This is a well-grounded mechanistic argument, not just a post-hoc rationalization.
2. Practical robustness improvement with minimal accuracy cost: The method achieves a 63% relative reduction in attack sensitivity (∆-FPR: 4.31% vs 11.71%) while maintaining F1 within 0.8 points of the best baseline (72.13% vs 72.93%) and achieving the highest recall (77.63%) (Table 1, Section 4.2). This is a favorable robustness-accuracy trade-off.
3. Well-designed ablation study: The Random Escalation baseline (same escalation rate but random selection) shows ∆-FPR of 11.34% vs 4.31% for the proposed method, confirming that the disagreement signal, not merely escalation, provides the robustness. The Matched-Cost Control (two View2 calls at different temperatures) shows that CoT-agnostic approaches sacrifice F1 (68.97%) (Table 2, Section 4.4). These ablations cleanly isolate the contribution of the counterfactual view contrast.
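
The headline ratios quoted in the strengths above can be re-derived from the rounded figures in this review. The sketch below is only an arithmetic check; the helper names are illustrative and do not come from the paper. The enrichment computed from the rounded rates (16.4% and 7.6%) comes out at about 2.16×, so the reported 2.15× presumably reflects unrounded rates.

```python
# Arithmetic check of the ratios quoted in this review.
# Helper names are hypothetical; inputs are the rounded values cited above.

def relative_reduction(baseline: float, method: float) -> float:
    """Fraction by which `method` reduces a metric relative to `baseline`."""
    return (baseline - method) / baseline

def enrichment(attacked_rate: float, clean_rate: float) -> float:
    """Ratio of the disagreement rate on attacked vs. unmodified failures."""
    return attacked_rate / clean_rate

# Delta-FPR: 11.71% (baseline) vs 4.31% (proposed method)
print(round(100 * relative_reduction(11.71, 4.31), 1))  # 63.2

# Disagreement rate: 16.4% (attacked) vs 7.6% (unmodified)
print(round(enrichment(16.4, 7.6), 2))  # 2.16

# F1 gap to the best baseline, in points
print(round(72.93 - 72.13, 2))  # 0.8
```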

Weaknesses:
1. Single attack type evaluated: Only the Progress Fabrication attack from Khalifa et al. (2026) is tested (Section 4.1). The paper does not evaluate robustness against other plausible CoT manipulation strategies (e.g., action-justification attacks, partial truth mixing, or attacks that also modify observations). This limits the generalizability of the claimed robustness. The method's effectiveness may be specific to attacks that fabricate progress in CoT while leaving actions unchanged.
2. Limited evaluation scope and no statistical significance: Results are reported on a single benchmark (AgentRewardBench) with a single judge model (Llama-3.3-70B-Instruct). No confidence intervals, error bars, or significance tests are provided for any metric (Tables 1 and 2). With 811 failure trajectories, small absolute differences in some metrics (e.g., F1: 72.13 vs 72.93) could be within sampling noise. Without multi-model or multi-benchmark validation, it is unclear whether the results generalize beyond this setup.
3. 2.1× inference cost with unclear cost-benefit for real deployment: The method requires running the judge twice on every trajectory (View1 + View2) plus a strict evaluation on ~10.4% of cases, totaling ~2.1× calls per trajectory (Section 5). The paper acknowledges this as a limitation but does not analyze whether the marginal robustness gain justifies the cost in practical settings (e.g., rejection sampling at scale). No wall-clock time or token cost comparisons are provided.
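
The ~2.1× figure in weakness 3 follows from two calls on every trajectory plus one strict call on the escalated fraction. The sketch below reproduces that count under stated assumptions: a single-view baseline costs one judge call per trajectory, each escalation adds exactly one strict call, and call count stands in for cost (token lengths may differ between views, which this ignores). Function and parameter names are illustrative, not the paper's.

```python
# Expected judge calls per trajectory for a two-view judge with escalation.
# Assumptions (not confirmed by the paper): one strict call per escalation,
# and call count used as a proxy for inference cost.

def expected_judge_calls(escalation_rate: float,
                         base_views: int = 2,
                         strict_calls: int = 1) -> float:
    """Average number of judge calls per trajectory."""
    return base_views + escalation_rate * strict_calls

def total_judge_calls(n_trajectories: int, escalation_rate: float) -> float:
    """Total calls needed to judge `n_trajectories` trajectories."""
    return n_trajectories * expected_judge_calls(escalation_rate)

# ~10.4% of trajectories are escalated, per Section 5 of the paper.
print(round(expected_judge_calls(0.104), 3))  # 2.104
```

At rejection-sampling scale, `total_judge_calls(1_000_000, 0.104)` gives roughly 2.1 million calls against 1 million for a single-view judge. That is the quantity a cost-benefit analysis would need to weigh against the ∆-FPR reduction.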

Must Fix Items:
1. Add statistical significance tests (e.g., bootstrap confidence intervals on F1, ∆-FPR, recall) or at minimum report results across multiple random seeds to establish that reported improvements are not due to chance.
2. Evaluate on at least one additional attack type (beyond Progress Fabrication) to demonstrate that the view-disagreement signal generalizes to other CoT manipulation strategies.
3. Report per-environment breakdowns on AgentRewardBench (WebArena, VisualWebArena, etc.) to assess whether robustness gains are consistent across environments or driven by a subset.
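
For must-fix item 1, a paired percentile bootstrap over trajectories is one common way to attach a confidence interval to an F1 difference between two judges. The sketch below is a minimal illustration using only the standard library. It assumes binary labels (1 = success) and per-trajectory binary predictions from both judges on the same trajectories; the function names, resample count, and seed are all hypothetical choices, not taken from the paper.

```python
import random

def f1_score(labels, preds):
    """Binary F1 with 1 as the positive class; returns 0.0 if undefined."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def paired_bootstrap_f1_diff(labels, preds_a, preds_b,
                             n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for F1(judge A) - F1(judge B).

    Trajectories are resampled with replacement, and the same indices
    are used for both judges so the comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        a = [preds_a[i] for i in idx]
        b = [preds_b[i] for i in idx]
        diffs.append(f1_score(y, a) - f1_score(y, b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

If the resulting interval for the 72.13 vs 72.93 comparison contains zero, the F1 gap cannot be distinguished from resampling noise. The same resampling loop could be reused for recall or ∆-FPR by swapping the metric function.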

Runs:
- run=1 score=4.8 verdict=Reject confidence=0.6 error=None