Title: SEARCH-ANCHORED HYBRID ROLLOUTS FOR TEXT-BASED WORLD MODELS FARS Analemma
PDF: search-anchored-rollouts.pdf
Score: 4.0
Verdict: Reject
Confidence: 0.60
Elapsed: 90.2s

Strengths:
1. Clear root-cause identification: The paper provides a precise and compelling diagnosis that 100% of first divergences occur at search-result observations (Figure 2, Section 4.3), with a 99.8% per-step divergence rate for search results. This is a concrete, quantifiable finding rather than a vague claim about 'world model hallucination.'
2. Elegant minimal-intervention design: Search-anchored hybrid rollouts ground only search observations while leaving other observations simulated, embodying a principled targeted-fix approach. The method is simple to implement (intercept search actions, substitute with cached BM25 results) and does not require retraining or improving the world model itself (Section 3.3).
3. Strong ablation evidence for compounding errors: The comparison between first-search anchored (CR=0.598) and all-search anchored (CR=0.819) cleanly demonstrates that compounding search errors—not just the initial search—drive drift (Table 1, Section 4.2). The stratified analysis by search count (Table 2, Figure 3) further confirms this, with CR improvements scaling from +0.111 (1 search) to +0.247 (≥3 searches).
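
The interception described in strength 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `search[query]` action format, the `world_model` and `real_search` callables, and the dictionary cache are all assumptions introduced here.

```python
from typing import Callable, Dict, List


def hybrid_step(
    action: str,
    world_model: Callable[[str], str],
    real_search: Callable[[str], List[str]],
    cache: Dict[str, List[str]],
) -> str:
    """Return the observation for one action in a search-anchored rollout.

    Search actions are answered by the real (cached) search engine;
    every other action is still simulated by the world model.
    """
    # Assumed action format: "search[<query>]" (hypothetical for this sketch).
    if action.startswith("search[") and action.endswith("]"):
        query = action[len("search["):-1]
        if query not in cache:
            # A deterministic engine makes caching safe: same query, same results.
            cache[query] = real_search(query)
        return "\n".join(cache[query])
    # Non-search observations remain simulated.
    return world_model(action)
```

The caching step relies on the search engine being deterministic, which is exactly the property that weakness 1 notes may not hold outside WebShop.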

Weaknesses:
1. Extremely narrow evaluation scope: The entire paper evaluates on a single benchmark (WebShop) with a single world model (Qwen2.5-7B fine-tuned) and a single acting agent (Gemini-2.5-Flash). WebShop has a deterministic BM25 search engine, making search caching trivial. The authors themselves acknowledge this limitation (Section 5), but it severely undermines generalizability claims. Whether this finding transfers to WebArena, Mind2Web, or any environment with non-deterministic/non-cacheable retrieval is entirely unknown.
2. The core 'insight' is somewhat expected and the method is nearly trivial: It is unsurprising that an LLM cannot accurately simulate a search engine over 1.18M products—this is essentially an open-book retrieval problem that LLMs are known to fail at. The solution (replace search results with real search results) is almost definitional: 'the world model is bad at X, so we use the real X instead.' While the compounding-error analysis adds value, the method itself has very low algorithmic novelty.
3. Unfair/self-referential baseline comparison and metric interpretation issues: The 'Pure WM' baseline (CR=0.594) is measured with a basic prompt, while the best result (CR=0.824) combines search anchoring AND ReAct prompting. The standalone contribution of all-search anchoring under the basic prompt is CR=0.819 vs. Pure WM's 0.594, a comparison in which the Real success rate is the same (22.8%); under ReAct, however, Real rises to 30.3%. Since CR = W2R/Real, ReAct changes the denominator as well as the numerator, so CR values are not directly comparable across prompting conditions, and the ReAct+anchoring combination may look better than it is. Additionally, CR values >1.0 appear in Table 2 (all-search anchored, 1 search: CR=1.034), which is methodologically problematic and suggests CR may not be a well-calibrated measure.
4. Missing statistical rigor: With only 200 test episodes and 3 seeds, no p-values or confidence interval overlap analysis is reported. Some stratified cells have very small sample sizes (e.g., n=21 for Pure WM with 1 search, n=37 for all-search with 1 search), making those CR estimates unreliable. The standard deviations in Table 1 (e.g., Pure WM: 0.099, All-search basic: 0.130) are large enough that some comparisons may not be statistically significant.
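
For weakness 3, the relation CR = W2R/Real can be inverted to recover the W2R success rate implied by each reported pair, which makes the role of the denominator explicit. The sketch below uses only the CR and Real values quoted above. It assumes CR is a ratio of aggregate success rates; if the paper averages per-seed ratios instead, the implied values are only approximate. The function names are hypothetical.

```python
def cr_from_rates(w2r_rate: float, real_rate: float) -> float:
    """CR = W2R / Real, as defined in the review."""
    if real_rate <= 0:
        raise ValueError("real_rate must be positive")
    return w2r_rate / real_rate


def implied_w2r(cr: float, real_rate: float) -> float:
    """Invert CR = W2R / Real to get the implied W2R success rate."""
    return cr * real_rate


# (CR, Real success rate) pairs quoted in weakness 3.
reported = {
    "Pure WM (basic)": (0.594, 0.228),
    "All-search anchored (basic)": (0.819, 0.228),
    "All-search anchored + ReAct": (0.824, 0.303),
}

implied = {name: implied_w2r(cr, real) for name, (cr, real) in reported.items()}
for name, rate in implied.items():
    print(f"{name}: implied W2R success rate = {rate:.3f}")
```

Under this definition CR exceeds 1.0 whenever the W2R success rate in a stratum is higher than the Real success rate, which is easy to trigger by chance in cells as small as n=21 or n=37.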

Must Fix Items:
1. Add statistical significance tests (e.g., bootstrap confidence intervals on CR, or permutation tests) for all claimed improvements, especially the key comparisons (Pure WM vs. All-search anchored, First-search vs. All-search).
2. Address the CR>1.0 anomaly in Table 2 and discuss what this means for the metric's validity—can CR exceed 1.0, and if so, how should it be interpreted?
3. Evaluate on at least one additional environment (WebArena or Mind2Web) or provide a thorough discussion of why the finding may or may not transfer, beyond the one-sentence acknowledgment in the conclusion.
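
One way to address must-fix item 1 is a percentile bootstrap over episodes. The sketch below is an assumption-laden illustration rather than a prescribed procedure: it assumes per-episode binary success indicators are available for both the W2R and Real conditions on the same set of tasks, resamples episodes in pairs, and computes CR as the ratio of success counts. The function name and defaults are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_cr_ci(
    w2r: Sequence[int],
    real: Sequence[int],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for CR = sum(w2r) / sum(real).

    w2r[i] and real[i] are 0/1 success flags for the same episode i
    (paired resampling is an assumption of this sketch).
    """
    if len(w2r) != len(real) or len(w2r) == 0:
        raise ValueError("w2r and real must be non-empty and equally long")
    rng = random.Random(seed)
    n = len(w2r)
    stats: List[float] = []
    for _ in range(n_boot):
        # Resample episode indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        real_hits = sum(real[i] for i in idx)
        if real_hits == 0:
            continue  # CR is undefined for this resample; skip it.
        stats.append(sum(w2r[i] for i in idx) / real_hits)
    if not stats:
        raise ValueError("no resample had a Real success; CR is undefined")
    stats.sort()
    lo = stats[int((alpha / 2) * (len(stats) - 1))]
    hi = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lo, hi
```

Applying the same routine within each search-count stratum would show directly how wide the intervals are for the small cells flagged in weakness 4, and whether the CR=1.034 value is distinguishable from 1.0.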

Runs:
- run=1 score=4 verdict=Reject confidence=0.6 error=None