Title: TIMEOUT BOOTSTRAPPING FOR LONG-COT RLVR: PROMISE AND PITFALLS FARS Analemma
PDF: 36b17d4e-ccf0-42c5-9e36-fc9d71b86bb2.pdf
Score: 4.8
Verdict: Reject
Confidence: 0.78
Elapsed: 101.1s

Strengths:
1. Pre-registered evaluation design: The paper commits to explicit success criteria before running experiments (Section 4.1), including a bootstrap CI test and a collapse incidence criterion. This is rare in the RL-for-LLM literature and significantly increases the credibility of the negative finding. The honesty in reporting failure against pre-registered criteria — rather than cherry-picking the favorable seed-42 result — is commendable.
2. Mechanistic diagnosis of failure mode: The critic value collapse analysis (Figure 2, Table 2) goes beyond surface-level performance comparison. Demonstrating that V(truncated) → −1.0 within 15–20 steps (Figure 2c) and that the critic exhibits negative correlation with correctness (Pearson r = −0.211, AUROC = 0.314 at 90% prefix, Table 2) provides a concrete, falsifiable explanation for *why* timeout bootstrapping fails, rather than merely reporting that it fails.
3. Sound theoretical grounding: The formalization of three truncation strategies (Eqs. 2–4) and the connection to Pardo et al. (2017) time-limit theory (Eq. 5) is clean and well-motivated. The distinction between timeout and terminal states is precisely the right RL concept to import, and the stop-gradient + clip design choices in Eq. 4 show awareness of practical pitfalls.

Weaknesses:
1. Severely underpowered experimental design: Only 2 seeds per condition and 100 training steps (Section 4.1). The pre-registered collapse criteria require 200–500 consecutive steps, so with a 100-step budget they could not formally trigger (Section 5, Limitations). The long subset contains only 125 problems, evaluated with 4 samples each; the 95% CI for the key comparison is [−4.70, +2.20]pp (Section 4.2), which spans zero and is 6.90pp wide, too wide to distinguish a modest gain from a modest loss. No significance test is reported for the per-seed +2.20pp claim that the authors highlight as 'promise.' This is a HF_NO_SIGNIFICANCE concern.
2. The 'promise' narrative is misleading given the data: The paper repeatedly emphasizes that timeout bootstrapping 'shows promise' on the stable seed (+2.20pp, Sections 1/4.2/5/6). But this is a single-seed comparison on 125 problems with no statistical test. By the same logic, seed-137 baseline A (55.20%) outperforms seed-42 timeout bootstrap (54.60%), so baseline A also 'shows promise' by this evidentiary standard. Stripped of the framing, the result is a method that fails its own pre-registered criteria, with one seed showing a modest, non-significant gain and the other diverging. Calling this 'promise' overstates the evidence.
3. Critical design choices are unexplained or contradictory: (1) The delayed bootstrapping (10-step warmup, Section 3.4) contradicts the theoretical argument — if truncation is truly a timeout and not a failure, assigning R = −1 for 10 steps teaches the critic exactly the wrong signal on the states where it will later need to bootstrap. (2) Truncated rollouts are excluded from critic training to avoid 'self-fulfilling targets' (Section 5), but then the critic has *zero* supervision on truncation states, making value collapse predictable rather than surprising. (3) The entropy bonus (0.001) is added without justification for this specific value, and the authors themselves note it causes divergence in one seed — this is an uncontrolled confound that makes the Condition C results uninterpretable relative to the core research question about bootstrapping vs. truncation-as-failure.
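
A rough back-of-the-envelope check on the power concern in Weakness 1 can be sketched as follows. It assumes an unpaired comparison of two conditions, independent samples (optimistic, since the 4 samples per problem share a prompt), and an accuracy near 55%, in line with the per-seed figures quoted above. The function name `diff_ci_half_width` is illustrative and does not come from the paper; a paired analysis would typically give a narrower interval.

```python
import math


def diff_ci_half_width(p, n_per_arm, z=1.96):
    """Approximate 95% CI half-width for a difference of two accuracies.

    Assumes both arms have accuracy close to p and n_per_arm independent
    Bernoulli samples each (normal approximation, unpaired).
    """
    standard_error = math.sqrt(2.0 * p * (1.0 - p) / n_per_arm)
    return z * standard_error


# One seed on the long subset: 125 problems x 4 samples per condition.
samples_per_arm = 125 * 4
half_width = diff_ci_half_width(0.55, samples_per_arm)
print(f"approx. 95% half-width: {100 * half_width:.2f} pp")
```

Under these assumptions the half-width comes out at roughly 6pp, so a single-seed difference of +2.20pp sits well inside the noise band of such a comparison.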

Must Fix Items:
1. Add statistical significance tests for all pairwise comparisons, especially the per-seed long-subset claims. Report exact p-values and effect sizes. Without this, the +2.20pp 'promise' claim is unsupported.
2. Run additional seeds (minimum 5 per condition) to determine whether seed-42's result is replicable or a statistical fluke. The current 2-seed design cannot distinguish signal from noise on 125 problems.
3. Resolve the confound between the entropy bonus and timeout bootstrapping: either run Condition C without the entropy bonus (to isolate the bootstrapping effect) or add an A+entropy-bonus control condition. Currently it is impossible to attribute the seed-42 gain to bootstrapping versus the entropy bonus.
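
One way to satisfy Must Fix Item 1 is a paired sign-flip permutation test on per-problem accuracies, sketched below. This is an illustrative suggestion and not the authors' procedure; the function name `paired_permutation_test` and the input layout (one accuracy per long-subset problem, e.g. the fraction of its 4 samples that are correct, in the same problem order for both conditions) are assumptions.

```python
import random


def paired_permutation_test(acc_a, acc_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-problem accuracies.

    Returns (mean difference b - a, p-value). The mean difference doubles
    as a raw effect size in accuracy units.
    """
    if len(acc_a) != len(acc_b):
        raise ValueError("both conditions must cover the same problems")
    diffs = [b - a for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference
        # is equally likely to be positive or negative.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(flipped) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value
```

Calling `paired_permutation_test(baseline_acc, bootstrap_acc)` once per seed, and once on accuracies pooled across seeds, would give the exact p-values and effect sizes requested above. Pairing by problem removes between-problem difficulty variance, which matters on a 125-problem subset.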

Runs:
- run=1 score=4.8 verdict=Reject confidence=0.78 error=None