Title: TINY-LR PROXY SFT FOR DATASET RANKING: AN EMPIRICAL INVESTIGATION
FARS Analemma
PDF: 86c42b4e-98ee-40b3-b8db-8a04295d4ff1.pdf
Score: 3.8
Verdict: Reject
Confidence: 0.72
Elapsed: 66.8s

Strengths:
1. Clear and honest negative-result reporting: The paper directly refutes its own hypothesis (that Tiny-LR improves transfer), reporting Tiny-LR PDA=0.500 (chance level) vs Standard-LR PDA=0.712 rather than spinning the result. This is commendable practice (Section 4.1, Table 1).
2. Well-structured experimental comparison: The four conditions include a Training-Free NLL baseline, which gives meaningful context for the Standard-LR Proxy's performance (Table 1, Figure 1).
3. Informative per-benchmark analysis: Table 2 reveals a MATH-500 vs GSM8K divergence. Tiny-LR works on MATH-500 (PDA=0.818, ρ=0.846, p<0.001) but fails on GSM8K (PDA=0.515). This decomposition helps future work understand *where* Tiny-LR might or might not apply.

Weaknesses:
1. Trivial core contribution once the packaging is stripped away: The paper tests exactly one hyperparameter change (learning rate lowered from 5e-5 to 1e-5), borrowed from the pretraining literature and applied to SFT. The 'hypothesis' is a direct port from Wang et al. (2025), and the negative result is unsurprising, since pretraining and SFT have fundamentally different loss landscapes and data regimes. The contribution reduces to 'a pretraining trick does not transfer to SFT,' which is a minimal finding (Section 3.2).
2. Confounded comparison between Standard-LR and Tiny-LR: Standard-LR runs for 500 steps while Tiny-LR runs for 1000 steps (Section 3.2, Appendix B.3). Different step counts mean a different number of parameter updates, which makes it impossible to separate the learning-rate effect from the training-duration effect. The Tiny-LR condition could fail simply because 1000 steps at 1e-5 produce a different cumulative update magnitude than 500 steps at 5e-5, or because longer training induces overfitting in the proxy model. This confound undermines the central claim.
3. Extremely narrow experimental scope, which threatens generalizability: Single model family (Qwen2.5), single domain (math), single proxy-target pair (1.5B→7B), only two LR values, and only two evaluation benchmarks. With 12 datasets producing 66 pairs, the PDA metric has high variance. Each per-benchmark split in Table 2 involves only 66 pairs, and no multiple-comparison correction is applied even though MATH-500 and GSM8K are tested separately (Section 3.4, Table 2).
4. No significance test on the key comparison (Standard-LR vs Tiny-LR): The paper reports p=0.042 for Standard-LR vs random (Table 1), but never tests directly whether Standard-LR is significantly better than Tiny-LR. With overlapping confidence intervals ([0.606, 0.818] vs [0.379, 0.621]), the difference may not be statistically significant despite the narrative. Additionally, the Tiny-LR PDA=0.500 is suspiciously exact for a bootstrap-derived metric on 66 pairs (Table 1).
5. Mechanistic explanation is purely post-hoc and correlational: The claim that 'reduced learning rates amplify sensitivity to surface-level features' (Section 4.3, Section 5.1) is supported only by the observation that Tiny-LR ranks verbose-CoT datasets higher on GSM8K. No direct evidence (e.g., feature importance analysis, probing, or a controlled experiment varying response length) is provided. The response-length explanation is speculation, not evidence.

Must Fix Items:
1. Equalize training steps or compute between the Standard-LR and Tiny-LR conditions to remove the step-count confound. At minimum, report the cumulative learning-rate budget (lr × steps, a rough proxy for total update magnitude) for each condition and discuss its implications.
2. Report a direct statistical test comparing Standard-LR vs Tiny-LR PDA (not just each vs random), and apply a multiple-comparison correction to the per-benchmark analysis in Table 2.
3. Provide controlled evidence for the surface-level feature hypothesis (e.g., an ablation on response length, probing for formatting sensitivity) rather than relying solely on correlational observation.

Runs:
- run=1 score=3.8 verdict=Reject confidence=0.72 error=None
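
Worked example for Must Fix item 1: the lr × steps quantity can be computed directly from the settings quoted in Weakness 2. The sketch below assumes a constant learning-rate schedule, which the review does not confirm; with warmup or decay the per-step rates would need to be summed instead. The names `conditions`, `lr_budget`, and `matched_steps` are illustrative only.

```python
# Settings as quoted in the review (Section 3.2, Appendix B.3 of the paper).
conditions = {
    "Standard-LR": {"lr": 5e-5, "steps": 500},
    "Tiny-LR": {"lr": 1e-5, "steps": 1000},
}


def lr_budget(lr, steps):
    """Sum of per-step learning rates, ASSUMING a constant schedule."""
    return lr * steps


budgets = {name: lr_budget(**cfg) for name, cfg in conditions.items()}

# How much larger the Standard-LR budget is than the Tiny-LR budget.
ratio = budgets["Standard-LR"] / budgets["Tiny-LR"]

# Steps Tiny-LR would need to match the Standard-LR budget at lr = 1e-5.
matched_steps = round(budgets["Standard-LR"] / conditions["Tiny-LR"]["lr"])

for name, value in budgets.items():
    print(f"{name}: lr x steps = {value:.3f}")
print(f"budget ratio (Standard / Tiny) = {ratio:.2f}")
print(f"Tiny-LR steps needed to match = {matched_steps}")
```

Under that assumption the budgets are about 0.025 for Standard-LR and 0.010 for Tiny-LR, so the Tiny-LR run receives roughly 2.5 times less cumulative learning rate despite running twice as many steps. Matching the budget would take 2500 Tiny-LR steps.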
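
Sketch for Must Fix item 2: one possible form for the direct Standard-LR vs Tiny-LR comparison is an exact McNemar test on per-pair correctness, followed by a Holm correction across the two benchmarks. This is an illustration, not the paper's method. It assumes each of the 66 dataset pairs can be marked 1 (ordered correctly by the proxy) or 0 for both conditions. The indicator lists are toy values, not results from the paper. Because the 66 pairs are built from only 12 datasets they are not independent, so a bootstrap over datasets may be the more defensible choice.

```python
from math import comb


def exact_mcnemar(correct_a, correct_b):
    """Two-sided exact McNemar test on paired 0/1 outcomes."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both conditions must cover the same pairs")
    # Only discordant pairs (one condition right, the other wrong) carry signal.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * p_values[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


n_pairs = comb(12, 2)  # 12 datasets give 66 unordered pairs, as the review notes

# Toy indicators (hypothetical, 8 pairs only) to show the call pattern.
standard_ok = [1, 1, 1, 1, 1, 1, 0, 0]
tiny_ok = [1, 0, 0, 0, 0, 1, 0, 1]
p_direct = exact_mcnemar(standard_ok, tiny_ok)

# Hypothetical raw per-benchmark p-values, adjusted jointly.
p_adjusted = holm([0.01, 0.04])
print(n_pairs, p_direct, p_adjusted)
```

The exact form is used here because the number of discordant pairs is likely to be small with 66 pairs. Holm is chosen over plain Bonferroni since it is never less powerful and needs no extra assumptions.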