Title: EMA-KPO: SIMPLIFYING KALMAN POLICY
PDF: a0385c4c-c4a4-4966-be09-34a0d1502771.pdf
Score: 4.8
Verdict: Reject
Confidence: 0.72
Elapsed: 313.2s

Strengths:
1. The core observation is correct and non-trivial: with fixed noise parameters Q=1e-6 and V=1, the Kalman gain recursion (Eqs. 4-7) does not depend on the observations, so KPO's 'adaptive' filter is in fact a deterministic gain schedule. This is a legitimate analytical insight that strips unnecessary complexity from KPO (Section 3.2, Equations 4-7 and 8).
2. The mathematical equivalence is verified numerically: the MSE between the Kalman filter and the scheduled EMA is 3.89e-15 on 198 rollout sequences, with a maximum absolute error of 1.34e-7 (Section 3.3, paragraph 2). The maximum error is on the order of single-precision machine epsilon (about 1.2e-7), so the two filters agree to round-off and the equivalence is exact in practice rather than approximate.
3. The ablation study (Table 2) is informative. It shows that smoothing strength is critical: α=0.0001 (10× stronger smoothing than K∞) causes complete training collapse, α=0.01 (10× weaker) degrades accuracy by 1-3pp, and the scheduled αt=Kt achieves the best overall results. This demonstrates that the gain schedule matters and is not arbitrary (Table 2, Section 4.4).

Weaknesses:
1. Trivial core contribution: that a Kalman filter with fixed parameters yields a deterministic gain schedule is a basic property of Kalman filtering, well known from signal processing textbooks. The paper repackages this textbook fact as a 'discovery' and proposes replacing the Kalman recursion with a precomputed lookup table, which is the same algorithm with a trivial implementation optimization. The conceptual contribution is near zero (Section 3.2, Equations 4-5).
2. Extremely narrow experimental evaluation: a single base model (Qwen3-4B-Base), a single training dataset (DAPO-Math-17k), a single domain (mathematical reasoning), and three benchmarks, two of which (AIME'24, AIME'25) have only 30 problems each. No significance tests are reported. The AIME'24 scores of the two KPO variants are identical (12.29%) by construction (deterministic equivalence). AIME'25 shows a 1.67pp gap (9.79 vs 11.46) that is dismissed as 'sampling noise' without any statistical test, and MATH-500 shows a 1.45pp improvement that cannot be distinguished from noise without confidence intervals (Section 4.2, Table 1).
3. EMA-KPO underperforms KPO-clipped on AIME'25 (9.79% vs 11.46%, a -1.67pp gap) and underperforms GRPO on AIME'24 (12.29% vs 14.37%). The paper's 'equivalent performance' claim is selectively stated: on 2 of 3 benchmarks, EMA-KPO is not the best method. GRPO, which suffers entropy collapse, still achieves higher AIME'24 accuracy than both KPO variants, which raises questions about the practical value of the smoothing approach (Table 1).
4. No significance tests anywhere: the paper reports point estimates from avg@16 sampling but provides no confidence intervals, standard errors, or statistical tests. With only 30 problems in each AIME benchmark, the sampling variance is large. The claims of 'equivalent performance' and 'within sampling noise' are asserted without evidence (Table 1, Section 4.2).
5. Entropy recovery is worse for EMA-KPO than for KPO-clipped: KPO-clipped recovers to 59% of initial entropy (0.89), while EMA-KPO recovers to only 47% (0.67). This 12pp gap contradicts the 'equivalent' claim, since it indicates that the scheduled EMA does not fully preserve KPO's training dynamics (Section 4.3, Figure 2 description).

Must Fix Items:
1. Add statistical significance tests (bootstrap confidence intervals or paired tests) for all benchmark comparisons; the current 'within sampling noise' assertion for AIME'25 is unsubstantiated.
2. Explain the entropy recovery gap (59% vs 47%): if EMA-KPO is mathematically equivalent to KPO's filter, why does entropy recovery differ by 12 percentage points? The gap suggests the implementations are not equivalent in practice, which undermines the core claim.
3. Evaluate on at least one additional model (different size or family) and one additional domain beyond mathematical reasoning to establish generalizability.

Runs:
- run=1 score=4.8 verdict=Reject confidence=0.72 error=None
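
To make Strength 1 and Weakness 1 concrete, here is a minimal sketch of the observation-independence argument. It assumes a scalar random-walk state model with Q=1e-6 and V=1 as quoted above. The initial variance P0, the initial state x0, the function names, and the synthetic observations are illustrative assumptions, not taken from the paper, whose Eqs. 4-7 may differ in detail.

```python
import numpy as np

def kalman_gain_schedule(T, Q=1e-6, V=1.0, P0=1.0):
    """Gains K_t of a scalar random-walk Kalman filter (P0 is an assumed prior variance)."""
    gains = np.empty(T)
    P = P0
    for t in range(T):
        P_pred = P + Q                 # predict: variance grows by the process noise
        K = P_pred / (P_pred + V)      # gain uses only Q, V and P, never the data
        P = (1.0 - K) * P_pred         # posterior variance after the update
        gains[t] = K
    return gains

def kalman_filter(z, Q=1e-6, V=1.0, P0=1.0, x0=0.0):
    """Full filter: runs the gain recursion and the state update together."""
    x, P = x0, P0
    out = np.empty(len(z))
    for t, z_t in enumerate(z):
        P_pred = P + Q
        K = P_pred / (P_pred + V)
        x = x + K * (z_t - x)          # state update driven by the innovation
        P = (1.0 - K) * P_pred
        out[t] = x
    return out

def scheduled_ema(z, gains, x0=0.0):
    """EMA whose smoothing factor at step t is read from a precomputed table."""
    x = x0
    out = np.empty(len(z))
    for t, z_t in enumerate(z):
        x = x + gains[t] * (z_t - x)
        out[t] = x
    return out

# Synthetic observations stand in for the rollout sequences (assumption).
rng = np.random.default_rng(0)
z = rng.normal(size=5000)

gains = kalman_gain_schedule(len(z))
max_err = np.max(np.abs(kalman_filter(z) - scheduled_ema(z, gains)))
k_last = gains[-1]                     # approaches the steady-state gain K_inf

print(f"max |kalman - ema| = {max_err:.3e}, last gain = {k_last:.6f}")
```

Because `kalman_gain_schedule` takes no observations at all, the schedule can be computed once before training. Under these assumptions the last gain settles near sqrt(Q/V) = 1e-3, which is consistent with the ablation describing α=0.0001 and α=0.01 as 10× away from K∞.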
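
For Must Fix Item 1, one possible form of the requested analysis is a paired percentile bootstrap over problems. The sketch below is an illustration under stated assumptions: the function name `paired_bootstrap_ci` is hypothetical, and the per-problem accuracies are synthetic placeholders, since the paper reports only aggregate avg@16 scores.

```python
import numpy as np

def paired_bootstrap_ci(acc_a, acc_b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(acc_a) - mean(acc_b), resampling problems.

    acc_a, acc_b: per-problem accuracies of two methods on the SAME problems.
    Returns (observed difference, lower bound, upper bound).
    """
    acc_a = np.asarray(acc_a, dtype=float)
    acc_b = np.asarray(acc_b, dtype=float)
    if acc_a.shape != acc_b.shape:
        raise ValueError("both methods must be scored on the same problems")
    rng = np.random.default_rng(seed)
    n = len(acc_a)
    # Each row is one bootstrap replicate: n problem indices drawn with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    diffs = (acc_a[idx] - acc_b[idx]).mean(axis=1)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return float(acc_a.mean() - acc_b.mean()), float(lo), float(hi)

# Synthetic placeholder data (assumption): 30 problems, 16 samples per problem,
# with both methods sharing the same hypothetical per-problem solve rate.
rng = np.random.default_rng(1)
solve_rate = rng.beta(0.3, 2.0, size=30)
acc_ema = rng.binomial(16, solve_rate) / 16
acc_kpo = rng.binomial(16, solve_rate) / 16

diff, lo, hi = paired_bootstrap_ci(acc_ema, acc_kpo)
print(f"difference = {diff:+.4f}, 95% CI = [{lo:+.4f}, {hi:+.4f}]")
```

An interval that contains zero would support the 'within sampling noise' reading of the AIME'25 gap, while one that excludes zero would contradict it. Resampling whole problems keeps the pairing between methods, which matters because problem difficulty dominates the variance on a 30-problem benchmark.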