Title: DRAFT-AND-CONTINUE SELF-CONSISTENCY: EMPIRICAL STUDY OF TWO-STAGE BRANCH BUDGETING FOR LLM REASONING
PDF: ff82e1b8-cd38-4767-9608-2082f78a9989.pdf
Score: 3.5
Verdict: Reject
Confidence: 0.7
Elapsed: 410.1s

Strengths:
1. Negative result is honestly reported with transparent diagnostic analysis: the paper does not cherry-pick favorable conditions and explicitly shows DCS is Pareto-dominated by CGES-LNS (Table 1: 76.8% accuracy at 4114.5 tokens for DCS vs 76.8% at 1963.0 tokens for CGES-LNS, i.e. roughly 2.1x the tokens for the same accuracy). This honesty is valuable for the community.
2. Clear root-cause analysis of failure mode: Table 2 demonstrates that 96.3% of drafts complete within Tdraft=1024, making the continuation mechanism effectively vacuous. The fallback rate of only 3.73% directly explains why DCS degenerates into standard SC with overhead. This diagnostic depth is commendable.
3. Well-structured experimental comparison with three baselines (Greedy CoT, Uniform SC, CGES-LNS) and multiple metrics (accuracy, tokens, API calls, efficiency), providing a complete picture of the accuracy-efficiency trade-off frontier (Table 1). Multi-seed evaluation is present, although 3 seeds is the bare minimum.

Weaknesses:
1. Core mechanism is trivial once the packaging is stripped away: DCS reduces to 'sample B drafts with limited tokens, count votes, continue top-k vote branches.' This is a straightforward two-phase resource allocation with a basic hedge heuristic. There is no theoretical justification for why vote histograms on partial solutions should be predictive of final answer quality, nor any analysis of the information-theoretic properties of interim answers vs. complete solutions (Section 2.2).
2. Single benchmark, single model, and a single, miscalibrated hyperparameter configuration: The entire evaluation uses MATH-500 with Qwen2.5-Math-7B-Instruct only (Section 3.1). More critically, Tdraft=1024 was chosen such that 96% of drafts already complete, so the method was never tested under conditions where its core mechanism (continuation of incomplete drafts) would meaningfully activate. No sweep over Tdraft values (e.g., 128, 256, 512) is provided to show whether DCS works when drafts are genuinely truncated. A negative result is only informative if the method was given a fair chance to succeed.
3. No statistical significance tests: With only 3 seeds and accuracies of 76.0%–77.8% (Table 2), the ±0.75 std is large relative to the ~1.2% gaps being discussed. No t-test, bootstrap, or McNemar test is reported to establish whether the accuracy differences are statistically meaningful. The variance difference (DCS 0.75 vs baselines 0.28–0.41) is asserted as meaningful without any formal test (Section 3.2).
4. Method was never shown to work under any configuration: A negative-result paper should demonstrate that the proposed method was tested across a reasonable range of conditions and still fails. Here, DCS was only tested at one Tdraft value where 96% of drafts complete, which amounts to testing the method in a degenerate regime. Without showing that DCS also fails at shorter Tdraft (where continuation would actually activate), the paper cannot conclude that 'simpler early-stopping methods dominate two-stage branch budgeting' (Abstract). The conclusion overgeneralizes from a single, unrepresentative configuration.
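
To make Weakness 1 concrete, the one-line summary of DCS ('sample B drafts with limited tokens, count votes, continue top-k vote branches') can be sketched as below. This is only the reviewer's reading of the mechanism, not the paper's algorithm: `sample_draft`, `continue_draft`, the `Draft` tuple, and the rule of pruning every branch outside the top-k voted answers are all hypothetical stand-ins introduced for illustration.

```python
from collections import Counter
from typing import Callable, Optional, Tuple

# Hypothetical draft record: (draft text, interim answer or None, finished?)
Draft = Tuple[str, Optional[str], bool]


def draft_and_continue(
    sample_draft: Callable[[int], Draft],
    continue_draft: Callable[[str], str],
    num_drafts: int,
    t_draft: int,
    top_k: int,
) -> Optional[str]:
    """Two-stage vote: draft under a token cap, then continue top-k branches."""
    # Stage 1: sample B drafts, each limited to t_draft tokens.
    drafts = [sample_draft(t_draft) for _ in range(num_drafts)]

    # Count votes over the interim answers that drafts managed to produce.
    votes = Counter(ans for _, ans, _ in drafts if ans is not None)
    if not votes:
        return None

    # Keep only branches whose interim answer is among the top-k voted ones.
    keep = {ans for ans, _ in votes.most_common(top_k)}

    # Stage 2: finished drafts vote as-is; unfinished kept drafts are continued.
    final = Counter()
    for text, ans, finished in drafts:
        if ans not in keep:
            continue  # branch pruned (assumed pruning rule)
        final[ans if finished else continue_draft(text)] += 1
    return final.most_common(1)[0][0]
```

Under this reading, when almost every draft is already finished at the cap (as Table 2 reports for Tdraft=1024), `continue_draft` is almost never called and the procedure collapses to a plain majority vote over the drafts.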

Must Fix Items:
1. Add experiments with shorter Tdraft values (e.g., 128, 256, 512) to test DCS in the regime where drafts are genuinely incomplete and the continuation mechanism activates. Without this, the core claim about two-stage budgeting being inferior is unsupported, since the result may simply reflect a poor hyperparameter choice.
2. Report statistical significance tests (paired bootstrap or McNemar on per-problem correctness) for accuracy comparisons in Table 1. The current ±std reporting with 3 seeds is insufficient to draw conclusions about whether DCS truly matches or differs from baselines.
3. Evaluate on at least one additional benchmark (e.g., GSM8K, AIME) and/or one additional model to assess generalizability of the negative finding.
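
The tests requested in item 2 need nothing beyond per-problem correctness vectors for the two methods being compared. A minimal sketch using only the Python standard library follows; the function names, the default of 2000 resamples, and the fixed seed are assumptions chosen for illustration, and the inputs are assumed to be aligned lists of 0/1 (or bool) values over the same problems.

```python
import math
import random


def mcnemar_exact(a_correct, b_correct):
    """Two-sided exact McNemar p-value on paired per-problem correctness."""
    # Discordant pairs: problems where exactly one of the two methods is right.
    n01 = sum(1 for a, b in zip(a_correct, b_correct) if not a and b)
    n10 = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    n = n01 + n10
    if n == 0:
        return 1.0  # the methods never disagree
    k = min(n01, n10)
    # Under H0 the discordant pairs split as Binomial(n, 0.5).
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_bootstrap_ci(a_correct, b_correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the accuracy difference (A minus B)."""
    rng = random.Random(seed)
    # Resample problems, keeping each problem's pair of outcomes together.
    diffs = [int(a) - int(b) for a, b in zip(a_correct, b_correct)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

With three seeds, one option is to run the test per seed and report all three results; pooling seeds treats repeated runs on the same problem as independent, so it should be flagged as an approximation if used.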

Runs:
- run=1 score=3.5 verdict=Reject confidence=0.7 error=None