Title: ACTION-SUPPORT LIKELIHOOD AUDITS PREDICT
PDF: cr-coverage-audit.pdf
Score: 3.5
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 42.6s

Strengths:
1. The paper identifies a practically relevant problem—predicting when world model rollouts fail upon real-environment transfer (W2R problem)—and proposes a simple, training-free diagnostic (Enhanced Support-NLL) that achieves AUROC=0.831, substantially outperforming the world model's own observation likelihood baseline (0.587) as shown in Table 1.
2. The ablation study (Table 2) is well-structured, demonstrating that each of the three sub-scores (Verb NLL AUROC=0.629, Transition NLL AUROC=0.747, Repetition Rate AUROC=0.731) captures a distinct failure signal and that the combination (0.831) strictly dominates any individual component (+0.084 over the best single sub-score).
3. The method is computationally efficient—requiring only frequency counting from training data and O(T) lookups per rollout, with no neural network inference (Section 3.4)—making it practical as a cheap pre-filter before expensive real-environment replay.
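
To make the cost claim in Strength 3 concrete, the following is a minimal sketch of how count-based sub-scores of this kind might be computed. It is not the paper's implementation: the verb extraction rule (first token), the add-alpha smoothing, the per-step averaging, and the definition of a repetition as an exact consecutive duplicate are all assumptions made for illustration, and the names `verb_of`, `fit_counts`, and `sub_scores` are hypothetical.

```python
import math
from collections import Counter


def verb_of(action):
    """Treat the first token of an action string as its verb (an assumption)."""
    tokens = action.split()
    return tokens[0] if tokens else ""


def fit_counts(training_episodes):
    """Count verbs, verb-to-verb transitions, and transition contexts."""
    unigrams, bigrams, contexts = Counter(), Counter(), Counter()
    for actions in training_episodes:
        verbs = [verb_of(a) for a in actions]
        unigrams.update(verbs)
        for prev, cur in zip(verbs, verbs[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return unigrams, bigrams, contexts


def sub_scores(actions, unigrams, bigrams, contexts, alpha=1.0):
    """Return (verb NLL, transition NLL, repetition rate) for one rollout.

    Uses add-alpha smoothing (assumed, not taken from the paper) so that
    unseen verbs and transitions get a finite but large penalty.
    """
    verbs = [verb_of(a) for a in actions]
    vocab = len(unigrams) + 1  # one extra slot reserved for unseen verbs
    total = sum(unigrams.values())

    # Mean unigram NLL of the verbs in the rollout.
    verb_nll = sum(
        -math.log((unigrams[v] + alpha) / (total + alpha * vocab)) for v in verbs
    ) / max(len(verbs), 1)

    # Mean bigram NLL of consecutive verb pairs.
    pairs = list(zip(verbs, verbs[1:]))
    trans_nll = sum(
        -math.log((bigrams[p] + alpha) / (contexts[p[0]] + alpha * vocab))
        for p in pairs
    ) / max(len(pairs), 1)

    # Fraction of steps that exactly repeat the previous action.
    repeats = sum(prev == cur for prev, cur in zip(actions, actions[1:]))
    rep_rate = repeats / max(len(actions) - 1, 1)
    return verb_nll, trans_nll, rep_rate
```

Each step of a rollout costs a constant number of dictionary lookups, which is consistent with the O(T) claim; the only training-time work is one counting pass over the data.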

Weaknesses:
1. Extremely limited experimental scope: only a single environment (TextWorld), a single world model (fine-tuned Qwen2.5-7B), and a single acting agent (GPT-4o-mini). Generalizability to other text-based or non-text environments, other world models, or other agents is untested. The 177-episode evaluation set is also small, and the split of 41 W2R failures vs 136 successes is imbalanced, so AUROC alone can overstate practical utility and the estimate rests on few positive examples (Section 4.1).
2. The three sub-score weights (w_verb=0.2, w_trans=0.3, w_rep=0.5) in Equation 6 appear to be hand-tuned with no systematic justification or sensitivity analysis. The paper does not report how performance varies with different weight combinations, raising concerns about overfitting these weights to the specific evaluation set. This undermines the claim of a generally applicable diagnostic.
3. The novelty is incremental: the core idea—measuring action support via NLL against a training distribution—is a direct application of well-established OOD detection concepts from offline RL (CQL, BEAR, etc., cited in Section 2). The three sub-scores (unigram NLL, bigram NLL, repetition rate) are standard NLP features. The paper packages these familiar components under the 'Enhanced Support-NLL' branding without introducing a new conceptual insight beyond 'rare actions predict failures,' which is intuitive and largely expected.
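
The sensitivity analysis missing under Weakness 2 could be as simple as a grid sweep over the weight simplex. The sketch below is one possible form, under stated assumptions: the three sub-scores are already on comparable scales (the review does not establish how the paper normalizes them), the combination is a plain weighted sum, and higher combined scores indicate predicted failure. The helper names `auroc`, `weight_grid`, and `sweep` are hypothetical.

```python
from itertools import product


def auroc(scores, labels):
    """Probability that a random failure (label 1) outscores a random
    success (label 0); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weight_grid(step=0.1):
    """Yield (w_verb, w_trans, w_rep) triples that sum to 1 on a regular grid."""
    n = round(1 / step)
    for i, j in product(range(n + 1), repeat=2):
        if i + j <= n:
            yield (i / n, j / n, (n - i - j) / n)


def sweep(rows, labels, step=0.1):
    """Return [(weights, AUROC)] sorted from best to worst.

    rows holds one (verb_nll, trans_nll, rep_rate) triple per episode.
    """
    results = []
    for w in weight_grid(step):
        combined = [sum(wk * sk for wk, sk in zip(w, row)) for row in rows]
        results.append((w, auroc(combined, labels)))
    return sorted(results, key=lambda r: r[1], reverse=True)
```

Reporting the full range of AUROC values across the grid, and where (0.2, 0.3, 0.5) falls within it, would show whether the headline number depends on the chosen weights. To address the overfitting concern, the weights would need to be selected on a split that is disjoint from the episodes used for the reported AUROC.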

Must Fix Items:
1. Conduct a weight sensitivity analysis for the three sub-scores (Equation 6) and report AUROC across a range of weight combinations to demonstrate robustness, or else acknowledge that the current weights were tuned on the test set.
2. Evaluate on at least one additional environment, world model, or agent to provide some evidence of generalizability beyond the single TextWorld/Qwen2.5-7B/GPT-4o-mini setup.
3. Report precision, recall, and F1 at operationally meaningful thresholds (not just AUROC) to assess practical utility, especially given the class imbalance (41 failures vs 136 successes).
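
For Must Fix Item 3, threshold-level metrics are cheap to compute once per-episode scores exist. The following is a minimal sketch, assuming that a rollout is flagged as a predicted W2R failure when its score is at or above the threshold and that label 1 denotes an actual failure; the function name `prf_at_threshold` is hypothetical.

```python
def prf_at_threshold(scores, labels, threshold):
    """Precision, recall, and F1 when rollouts with score >= threshold
    are flagged as predicted failures (label 1 = actual failure)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1
```

As a reference point for reading such numbers on this evaluation set: flagging every rollout gives recall 1.0 at precision 41/177 ≈ 0.23, so any reported operating point should be compared against that floor.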

Runs:
- run=1 score=3.5 verdict=Strong Reject confidence=0.6 error=None