Title: COMPUTE-MATCHED EVALUATION OF TRANSFORM-AUGMENTED GRPO FOR MATHEMATICAL REASON-FARS Analemma
PDF: compute-matched-ta-grpo.pdf
Score: 3.5
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 54.1s

Strengths:
1. The paper addresses a genuinely important confound in prior work: the original comparison between TA-GRPO and GRPO was unfair because TA-GRPO used 4× more rollouts. The compute-matched evaluation (Section 2.1, Equation 1), which equalizes total rollouts at ~725K, is a meaningful methodological contribution of a kind the community needs more of. Evidence: Section 2.1–2.2, Table 1.
2. The ablation design (Condition C: TA-GRPO Unpooled) cleanly disentangles data augmentation from pooled advantage normalization, revealing that 87% of the gain comes from augmentation and only 13% from pooling. This is a valuable empirical insight for practitioners. Evidence: Table 2, Section 3.2.
3. The Pass@k scaling analysis (Figure 2, Section 3.3) showing that TA-GRPO's advantage grows from +1.07pp at k=1 to +2.02pp at k=32 provides interpretable evidence that semantic transforms improve solution diversity, not just greedy accuracy.
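For readers checking the Pass@k numbers, the quantity can be sketched as follows. This is a minimal sketch, assuming the paper uses the widely used unbiased estimator 1 - C(n-c, k)/C(n, k) computed from n samples per problem of which c are correct; the review does not confirm which estimator the paper uses, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled solutions, c: number of correct ones, k: budget.
    Assumption: this is the standard estimator, not necessarily the paper's.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k
    # subset of the n samples contains no correct solution.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(counts: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)
```

A benchmark-level score is the mean of the per-problem values, which is why per-problem pairing is possible when comparing two training conditions.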

Weaknesses:
1. The reported improvement is extremely small: +2.02pp on mean Pass@32 (49.47% vs 47.45%). No statistical significance tests (e.g., bootstrap confidence intervals, paired t-tests across problems) are reported. With only 2 random seeds and benchmark sizes as small as 30 problems (AIME24/AIME25), this difference could easily arise from noise. This is a critical gap for a paper whose central claim is that the improvement is 'genuine.' Evidence: Section 2.3 (evaluation protocol), Table 1, no significance section anywhere.
2. Conditions A and B/C use different hyperparameters (Section 2.2: A uses lr=1e-6, β=0.01; B/C use lr=5e-6, β=0.001), i.e., a 5× higher learning rate and a 10× lower KL coefficient for B/C. This introduces a second confound that compute-matching does not remove. The higher learning rate and lower KL penalty in B/C could independently improve performance, making it impossible to attribute the +2.02pp gain solely to semantic transformations. The paper does not discuss or ablate this. Evidence: Section 2.2.
3. The paper is extremely thin on content for a conference submission: 5 pages of main text with no theoretical analysis, no exploration of transformation quality/diversity, no analysis of which transformation types matter most, no scaling study across model sizes, and only one base model (Qwen3-1.7B-Base). The contribution is essentially one compute-matched experiment and one ablation, which is incremental. Evidence: entire paper structure, Section 2.2 (single model), no transformation-type ablation.
4. The per-benchmark results in Table 1 are inconsistent: TA-GRPO Pooled drops on AMC12 (82.5% vs 86.3% for GRPO-Long) and ties on AIME24 (18.3%). The 'consistent across most benchmarks' claim in Section 3.1 is therefore misleading: on 2 of 5 benchmarks, TA-GRPO Pooled does not outperform GRPO-Long. Evidence: Table 1, Section 3.1.
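To put the noise concern in Weakness 1 in concrete terms, the resolution of a 30-problem benchmark can be worked out directly. The sketch below is a rough back-of-the-envelope check, not an analysis from the paper: it treats each problem as an independent pass/fail trial (a simplifying assumption, since Pass@32 averaged over seeds is not strictly binomial), and the function names are hypothetical.

```python
from math import sqrt


def one_problem_pp(n_problems: int) -> float:
    """Score change, in percentage points, from flipping a single problem."""
    return 100.0 / n_problems


def binomial_se_pp(p: float, n_problems: int) -> float:
    """Binomial standard error of an accuracy p, in percentage points.

    Assumption: problems are independent Bernoulli trials.
    """
    return 100.0 * sqrt(p * (1.0 - p) / n_problems)


# One problem on a 30-problem benchmark such as AIME24/AIME25.
aime_step = one_problem_pp(30)
# Standard error at the reported AIME24 accuracy of 18.3%.
aime_se = binomial_se_pp(0.183, 30)
```

Under these assumptions a single problem on AIME24 or AIME25 is worth about 3.33pp, which is larger than the +2.02pp mean gain being claimed.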

Must Fix Items:
1. Add statistical significance tests (e.g., bootstrap CIs on per-problem Pass@k, or McNemar-style paired tests) to support the claim that +2.02pp is a genuine improvement rather than noise, especially given small benchmark sizes and only 2 seeds.
2. Either use identical hyperparameters across all conditions, or explicitly ablate the hyperparameter difference to show it does not account for the observed gain. The current design confounds hyperparameter tuning with the semantic transformation mechanism.
3. Report per-seed results with variance, not just averages over 2 seeds, so readers can assess stability.
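One way to address Must Fix item 1 is a paired bootstrap over problems. The following is a minimal sketch under stated assumptions: `scores_a` and `scores_b` are hypothetical per-problem Pass@k values for two conditions on the same problems in the same order, and the percentile interval is one of several reasonable choices rather than a method prescribed by the paper.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=10_000,
                        alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_b - scores_a).

    Resamples problems with replacement, keeping each problem's two
    scores paired so that problem difficulty cancels out.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed for reproducibility
    means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]
```

If the resulting interval for TA-GRPO minus GRPO-Long excludes zero on the pooled problem set, the 'genuine improvement' claim would be better supported; with only 2 seeds, the seeds themselves should be reported separately rather than resampled.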

Runs:
- run=1 score=3.5 verdict=Strong Reject confidence=0.6 error=None