Title: EXECUTION-SIGNATURE RECYCLING: DEDUPLICATING UNIT-TEST FAILURE FEEDBACK FOR TEST-TIME CODE SCALING FARS
PDF: f0dbc9b7-cda1-4c32-85ae-97abdb2123d4.pdf
Score: 4.2
Verdict: Reject
Confidence: 0.72
Elapsed: 541.6s

Strengths:
1. Honest negative-result reporting with a pre-registered decision rule and a paired bootstrap CI (Section 3.2): the 95% CI [−0.81, 2.24] includes zero, and the authors correctly conclude that there is no significant improvement over Self-Debug, avoiding the common temptation to spin a non-significant trend as a positive finding.
2. Transparent mechanism activation analysis (Table 2): reporting that ~68% of tasks yield empty failure banks (all 8 Round-1 candidates pass) is crucial context that honestly reveals the proposed mechanism is inactive for the majority of the benchmark, which strengthens the negative-result interpretation.
3. Statistically principled evaluation protocol (Section 3.1): paired bootstrap over 492 task-seed pairs (164 tasks × 3 seeds) is a more appropriate significance test than unpaired comparisons, and the pre-registered decision rule prevents p-hacking across the three method comparisons.
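
For readers who want to check the interval in Strength 1, the following is a minimal sketch of a paired percentile bootstrap over per-pair score differences. It is not the authors' code: the function name, the number of resamples, and the choice to resample task-seed pairs independently (rather than clustering by task) are assumptions made for illustration.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile CI for the mean paired difference (a - b).

    scores_a / scores_b: per-pair outcomes for two methods, aligned so that
    index i refers to the same task-seed pair in both lists (assumption).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired inputs must have equal length")
    # Pairing: work on per-pair differences, never on the two samples separately.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample pairs with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If the three seeds of a task are strongly correlated, resampling whole tasks instead of individual task-seed pairs would be the more conservative variant; the review does not state which one the paper uses.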

Weaknesses:
1. Trivial core contribution once the packaging is stripped away: Execution-Signature Recycling reduces to grouping candidates by the tuple (failing test IDs, error types), essentially a GROUP BY operation on a 2-column table, and then appending the top-3 groups to the prompt. There is no learning, no adaptive clustering, and no intelligent summarization; the 'deduplicated failure bank' is just a sorted frequency list (Section 2.2–2.3, Eq. 1).
2. Negative-result claim is under-evidenced: the paper concludes 'cross-sample feedback aggregation does not provide reliable benefits,' but the mechanism only fires on ~53 tasks (Table 2, ~32% of 164). With n=53 × 3 seeds = 159 effective observations, the study is underpowered to detect even moderate effect sizes; a non-significant result in an underpowered experiment is weak evidence for the null, not evidence against the mechanism (Section 3.3).
3. Single benchmark, single model, no ablation on key hyperparameters: only HumanEval+ with Qwen2.5-Coder-7B-Instruct is tested; no evaluation on MBPP, LiveCodeBench, or weaker/stronger models despite the conclusion explicitly suggesting results may differ for weaker models (Section 5). Hyperparameters M=3, K=2, token budget=900, 8+8 split are fixed with no sensitivity analysis (Section 2.3).
4. Higher variance undermines practical utility: ESR's inter-seed std (1.27) is roughly 2.5× that of Best-of-16 (0.50) and 1.8× that of Self-Debug (0.70), with per-seed results ranging from a +2.44pp gain to a −0.61pp loss vs Self-Debug (Table 1). A method that adds complexity but increases variance is practically worse than a simpler, more stable baseline.
5. Self-Debug baseline may be weakly configured: the Self-Debug implementation produces only 1 debug revision per candidate with feedback from up to 3 failing tests (Section 3.1), whereas the original Chen et al. (2023) allows multiple debug rounds. A stronger Self-Debug with iterative refinement could narrow or reverse the gap, making ESR's marginal mean advantage even more fragile.
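
To make Weakness 1 concrete, the grouping step as described there can be sketched in a few lines. This is an illustration of the reviewer's reading, not the paper's implementation: the `failures` field, the `build_failure_bank` name, and the tie-breaking order are hypothetical, and Eq. 1 in the paper may define the signature differently.

```python
from collections import Counter


def build_failure_bank(candidates, top_m=3):
    """Group failing candidates by execution signature and keep the top_m groups.

    candidates: list of dicts, each with a 'failures' list of
    (test_id, error_type) tuples (hypothetical schema).
    Returns a list of (signature, count) ordered by descending count.
    """
    signatures = Counter()
    for cand in candidates:
        failures = cand["failures"]
        if not failures:
            continue  # passing candidates contribute nothing to the bank
        # Signature = (sorted failing test IDs, sorted error types).
        sig = (
            tuple(sorted({test_id for test_id, _ in failures})),
            tuple(sorted({err for _, err in failures})),
        )
        signatures[sig] += 1
    # Equivalent to GROUP BY signature ORDER BY count DESC LIMIT top_m.
    return signatures.most_common(top_m)
```

When every Round-1 candidate passes, the function returns an empty list, which corresponds to the empty failure bank reported for ~68% of tasks.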

Must Fix Items:
1. Report per-task analysis restricted to the ~53 tasks where ESR's failure bank is non-empty; the aggregate comparison is diluted by the 68% of tasks where ESR degenerates to Best-of-8+Best-of-8, making the negative claim misleading without this stratified view.
2. Add at least one additional benchmark (MBPP or LiveCodeBench) or one additional model (a weaker model where the mechanism is more likely to fire on >50% of tasks) to test whether the negative result generalizes beyond a single high-accuracy regime.
3. Conduct ablation on M, K, and the 8+8 budget split to show whether ESR's design choices are reasonable or whether the negative result is an artifact of poor hyperparameter selection.
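
The stratified view requested in Must Fix Item 1 could be produced along the following lines. This is a sketch under assumed inputs: each record is taken to carry a flag for whether the failure bank was non-empty plus the two methods' pass outcomes for that task-seed pair, and the record layout and function name are hypothetical.

```python
def stratified_mean_diff(records):
    """Mean (ESR - baseline) difference, split by mechanism activation.

    records: iterable of (bank_nonempty, esr_score, baseline_score) tuples,
    one per task-seed pair (hypothetical layout).
    Returns {'active': (n, mean_diff), 'inactive': (n, mean_diff)};
    mean_diff is None for an empty stratum.
    """
    strata = {"active": [], "inactive": []}
    for bank_nonempty, esr_score, baseline_score in records:
        key = "active" if bank_nonempty else "inactive"
        strata[key].append(esr_score - baseline_score)
    return {
        key: (len(diffs), sum(diffs) / len(diffs) if diffs else None)
        for key, diffs in strata.items()
    }
```

Reporting the 'active' stratum with its own confidence interval would show whether the aggregate result hides an effect on the tasks where the mechanism actually fires.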

Runs:
- run=1 score=4.2 verdict=Reject confidence=0.72 error=None