Title: EXPONENTIAL INTEGRATOR FOR DIAGONAL-DECAY DELTA ATTENTION: A NEGATIVE RESULT ON LENGTH EXTRAPOLATION FARS
PDF: 59af4b6d-78d3-4f27-90bf-2d9c319a0093.pdf
Score: 4.5
Verdict: Reject
Confidence: 0.7
Elapsed: 496.6s

Strengths:
1. Honest negative-result reporting: The paper clearly states that the exponential integrator does not improve length extrapolation accuracy, with a pre-registered success criterion (≥5 pp improvement on ≥2/3 tasks) that is not met (0/3 tasks show improvement). This level of transparency is commendable and rare. Evidence: Section 3.2, 'The primary success criterion of ≥5 pp improvement on ≥2/3 tasks is not met (0/3 tasks show improvement).'
2. Clean ablation design with three conditions (C1/C2/C3) that properly isolate the integrator effect (C2 vs C1) from the L2 normalization removal effect (C3 vs C2). This allows unambiguous attribution of observed differences. Evidence: Section 3.1 experimental setup, in which C2 vs C1 isolates the integrator and C3 vs C2 isolates normalization removal.
3. Numerical stability claim is well-evidenced: 0 NaN/divergence across all 27 runs (3 conditions × 3 tasks × 3 seeds), directly demonstrating that the bounded coefficient property enables stable training without L2 normalization. Evidence: Table 1 (all cells populated with valid numbers), Section 3.2 'All 27 experimental runs completed without NaN or divergence failures.'

Weaknesses:
1. Core derivation is a direct application of an existing result, not a novel contribution. Equation (3) is exactly the rank-1 matrix exponential from Lei et al. (2025) EFLA: setting λ_t = ‖k_t‖² in the known formula exp(−β k k^T) = I − ((1 − e^{−βλ})/λ) k k^T, which follows from (k k^T)^n = λ^{n−1} k k^T for n ≥ 1, yields the proposed coefficient trivially. Equation (5) is simply (1 − exp(−β‖k‖²))/‖k‖², which follows immediately. The paper applies this known result to KDA's delta substep, but this is an engineering substitution, not a methodological advance. Evidence: Section 2.2 acknowledges 'the matrix exponential exp(−β_t k_t k_t^T) admits a closed-form solution due to the rank-1 structure (Lei et al., 2025; Chen et al., 2018).'
2. Evaluation is limited to synthetic tasks only with no real-world or downstream validation. The three tasks (Palindrome, MQAR, Stack) are minimal synthetic probes with a 2-layer, 2-head model (d=128). MQAR shows a ceiling effect (~100% for all conditions), making it useless for discriminating methods. Palindrome accuracy is near-random at extrapolation lengths (1-2%), suggesting the task may be fundamentally beyond the model's capacity rather than informative about the method. Evidence: Table 1 (MQAR all ~99.95-100%; Palindrome L=4096 all ≤1.81%), Section 3.1 (2-layer, 2-head, dk=dv=128).
3. No statistical significance testing despite small sample size (n=3 seeds). The reported differences are small relative to the seed-to-seed spread: e.g., on Stack L=4096, C1=87.99±1.71 vs C3=85.01±6.83, so the 2.98 pp gap in means is smaller than the C3 standard deviation. The large variance in C3 (driven by seed 42 at 75.43%) means no conclusion about accuracy differences is warranted. The paper nevertheless draws strong conclusions ('neither provides accuracy benefits') without statistical tests. Evidence: Table 1 (all std values), Section 3.2 ablation analysis, Section 3.2 'increased variance' noting C3 std=6.83 vs C1 std=1.71 on Stack L=4096.
4. The hypothesis that 'key-norm information could serve as a signal-strength channel for improved length extrapolation' is tested but never theoretically motivated. Why would key norms carry length-extrapolation-relevant information? The paper provides no mechanistic argument for this hypothesis, making the negative result unsurprising rather than illuminating. Evidence: Section 1 'We hypothesize that L2 normalization may discard useful key-norm information that could serve as a signal-strength channel' — no theoretical or empirical justification provided for why this should be true.
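
For readers who want to check the identity cited in Weakness 1, the following is a minimal sketch (not the paper's code) that compares the closed form against a truncated Taylor series of the matrix exponential. The key vector `k`, the step size `beta`, and all helper names are illustrative assumptions; it also assumes k is nonzero so that the division by ‖k‖² is defined.

```python
import math

def outer(k):
    """Outer product k k^T as a list of rows."""
    return [[a * b for b in k] for a in k]

def matmul(A, B):
    """Plain square matrix product."""
    n = len(A)
    return [[sum(A[i][m] * B[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def expm_taylor(A, terms=40):
    """exp(A) via a truncated Taylor series (adequate for small, well-scaled A)."""
    n = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # holds A^p / p!
    for p in range(1, terms):
        term = [[x / p for x in row] for row in matmul(term, A)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

def rank1_coeff(k, beta):
    """(1 - exp(-beta * ||k||^2)) / ||k||^2, assuming k != 0."""
    lam = sum(x * x for x in k)
    return -math.expm1(-beta * lam) / lam

def rank1_exp_closed_form(k, beta):
    """I - coeff * k k^T, the closed form quoted in the review."""
    c = rank1_coeff(k, beta)
    kk = outer(k)
    n = len(k)
    return [[(1.0 if i == j else 0.0) - c * kk[i][j] for j in range(n)]
            for i in range(n)]

# Illustrative values only (assumptions, not taken from the paper).
k = [0.5, -1.2, 2.0]
beta = 0.7

A = [[-beta * x for x in row] for row in outer(k)]  # -beta * k k^T
numeric = expm_taylor(A)
closed = rank1_exp_closed_form(k, beta)
max_err = max(abs(numeric[i][j] - closed[i][j])
              for i in range(3) for j in range(3))

# coeff * ||k||^2 = 1 - exp(-beta * ||k||^2), which lies in (0, 1) for beta > 0.
scaled = rank1_coeff(k, beta) * sum(x * x for x in k)
print(f"max abs error: {max_err:.2e}, coeff * ||k||^2 = {scaled:.4f}")
```

The last quantity illustrates why the coefficient stays bounded regardless of key norm, which is the property the stability claim in Strength 3 relies on.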

Must Fix Items:
1. Add statistical significance tests (e.g., paired t-test or bootstrap confidence intervals) for all accuracy comparisons; with n=3 the current differences are not distinguishable from noise, especially given the Stack L=4096 variance explosion.
2. Replace or supplement the MQAR task with a task that does not exhibit a ceiling effect; currently MQAR provides zero discriminative power (all conditions ~100%).
3. Provide theoretical justification for the 'signal-strength channel' hypothesis; without it, the negative result is uninformative — one cannot learn from the failure of an unmotivated hypothesis.
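
As an illustration of Must Fix item 1, here is a minimal sketch of a paired t-test across seeds using only the standard library. The per-seed accuracies below are HYPOTHETICAL placeholders, not values from the paper (only means and standard deviations are reported in Table 1); the function name `paired_t_n3` is likewise an assumption. With n=3 pairs there are 2 degrees of freedom, for which the Student-t CDF has the closed form F(t) = 1/2 + t / (2·sqrt(t² + 2)).

```python
import math
import statistics

def paired_t_n3(a, b):
    """Paired t-test for exactly three paired observations (df = 2).

    Returns (t, two_sided_p). Uses the closed-form Student-t CDF for df = 2,
    so the two-sided p-value is 1 - |t| / sqrt(t^2 + 2).
    Assumes the paired differences are not all identical (nonzero std).
    """
    if len(a) != 3 or len(b) != 3:
        raise ValueError("this sketch only handles n = 3 pairs")
    diffs = [y - x for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample std, n - 1 denominator
    t = mean_d / (sd_d / math.sqrt(3))
    p = 1.0 - abs(t) / math.sqrt(t * t + 2.0)
    return t, p

# HYPOTHETICAL per-seed accuracies (%), paired by seed; placeholders only.
c1 = [88.0, 86.5, 89.5]
c3 = [75.0, 90.0, 90.0]

t_stat, p_value = paired_t_n3(c1, c3)
print(f"t = {t_stat:.3f}, two-sided p = {p_value:.3f}")
```

With df = 2 the two-sided 5% critical value is |t| ≈ 4.303, so a single outlying seed can easily leave a multi-point difference in means far from significance; reporting the test (or running more seeds) would make that explicit.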

Runs:
- run=1 score=4.5 verdict=Reject confidence=0.7 error=None