Title: RAZORSFT: ON-POLICY SUPERVISED FINE-TUNING WITH KL-MINIMAL TARGET SELECTION FOR CONTINUAL LEARNING
FARS PDF: razorsft-continual-onpolicy-sft.pdf
Score: 3.8
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 54.8s

Strengths:
1. Clean ablation design isolating on-policy data vs KL-minimal selection contributions (Table 2): OnPolicy-SFT alone accounts for 76% of the FM improvement over Offline SFT (FM: -0.179 vs -0.642), providing concrete evidence that on-policy data is the primary driver, not reward optimization. This is a meaningful empirical contribution to the understanding of why RL mitigates forgetting.
2. Strong forgetting mitigation with a simple method: RazorSFT achieves FM of -0.039 compared to Offline SFT's -0.643 (Table 1), a 60.4pp improvement that is practically significant. The method is genuinely simpler than GRPO: no reward model, no policy gradient estimation, just sample-filter-select within standard SFT.
3. RazorSFT outperforms GRPO on average accuracy (0.616 vs 0.515) and especially on Countdown task adaptation (0.628 vs 0.261, Table 1), demonstrating that simplicity does not come at the cost of task learning capability. The 2.4× improvement on Countdown is particularly notable.

Weaknesses:
1. Extremely limited experimental scope: only a 3-stage curriculum with 3 tasks on a single 7B model (Qwen2.5-7B-Instruct). No evaluation on larger models, different model families, longer task sequences (4+ stages), or more diverse task types. The generalizability of the findings is unknown. Table 1 results may not hold for 70B+ models where generation costs and distributional properties differ significantly.
2. The 'verifier' requirement is a significant practical limitation that is under-discussed. The method requires a task-specific verifier Vt(x,y) ∈ {0,1} for each task (Section 3.2, Equation 1). IFEval uses rule-based verification, MMLU uses exact-match, Countdown uses programmatic checking, all conveniently verifiable. For open-ended generation, summarization, creative writing, or dialogue tasks, constructing reliable verifiers is itself a major unsolved problem. The paper does not acknowledge this scope limitation.
3. Missing critical baselines and unfair comparison framing: (a) No comparison with established continual learning methods like EWC, LwF, or replay-based approaches mentioned in Related Work. (b) No comparison with DPO or other simpler RL alternatives. (c) GRPO's poor Countdown performance (0.261) is suspicious, since it falls below the zero-shot baseline (0.408 from Table 1 Best-of-8 context), suggesting possible implementation issues or unfair hyperparameter tuning. If GRPO is badly tuned, the comparison is meaningless. The paper does not provide GRPO hyperparameters or tuning details.
4. Statistical significance concerns: The reported standard deviations in Table 1 show GRPO has very high variance on Countdown (0.261 ± 0.092), while RazorSFT has very low variance (0.628 ± 0.010). With only 3 seeds and such disparate variances, it is unclear whether the GRPO vs RazorSFT comparison is statistically significant. No significance tests are reported (potential HF_NO_SIGNIFICANCE).
5. The paper is generated by an automated research system (stated in abstract), and the 'Analemma' author affiliation with email fars@analemma.ai raises concerns about the depth of human oversight in experimental design, analysis, and interpretation. The related work section is standard boilerplate and the experimental design lacks the nuance typical of human-driven research.

Must Fix Items:
1. Add established continual learning baselines (EWC, LwF, replay) to demonstrate that RazorSFT outperforms not just naive SFT but also dedicated anti-forgetting methods.
2. Report GRPO hyperparameters and tuning procedure, and explain why GRPO fails on Countdown (0.261 < zero-shot 0.408). If GRPO is poorly configured, the comparison is invalid.
3. Conduct statistical significance tests (e.g., paired t-test or bootstrap) across the 3 seeds for all key comparisons, especially RazorSFT vs GRPO.
4. Discuss the verifier requirement as a scope limitation: acknowledge that many practical LLM fine-tuning scenarios lack reliable verifiers.

Runs:
- run=1 score=3.8 verdict=Strong Reject confidence=0.6 error=None
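
Illustration for Must Fix item 3: the review quotes only means and standard deviations for the Countdown comparison (RazorSFT 0.628 ± 0.010, GRPO 0.261 ± 0.092, 3 seeds each), so the paired t-test or bootstrap it asks for cannot be run without the per-seed scores. As a rough stand-in, the sketch below swaps in Welch's unequal-variance t-test, which works from summary statistics alone. It assumes the reported ± values are sample standard deviations across seeds and that the two methods' runs are independent; neither assumption is confirmed by the paper. The function name `welch_from_summary` is hypothetical.

```python
import math


def welch_from_summary(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom,
    computed from each group's mean, sample standard deviation, and size."""
    # Squared standard error of each group mean.
    se2_a = std_a ** 2 / n_a
    se2_b = std_b ** 2 / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Countdown numbers as quoted in Weakness 4 (assumed: sample std over 3 seeds).
t_stat, dof = welch_from_summary(0.628, 0.010, 3, 0.261, 0.092, 3)
print(f"t = {t_stat:.2f}, dof = {dof:.2f}")
```

The statistic would then be compared against a t distribution with the returned degrees of freedom. With 3 seeds per method the degrees of freedom are very small, so a bootstrap or paired test on the per-seed scores, as requested above, would still be the more informative check.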