Title: GAUGEFIX-LRM: FUNCTION-PRESERVING GAUGE FIXING FOR LEARNABLE MULTIPLIERS IN LANGUAGE MODEL TRAINING
PDF: gauge-fixed-learnable-multipliers.pdf
Score: 3.8
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 51.6s

Strengths:
1. The paper identifies a genuine and well-defined problem: Q/K multiplicative gauge symmetry in LRM causes scale drift under low-precision arithmetic, and weight decay on Q/K multipliers conflates two distinct functions (symmetry control vs magnitude regularization). This dual-purpose insight is the most valuable contribution, as it clarifies why naively replacing weight decay with gauge-fixing is insufficient (Section 4.4, Discussion).
2. The proposed GaugeFix projection (Eq. 4-6) is mathematically clean, function-preserving by construction ((rQ/g) · (rK·g) = rQ·rK), and computationally lightweight at O(h·dk). The theoretical guarantee that attention output is unchanged is correct and clearly stated (Section 3.3).
3. Honest and transparent reporting of negative results: the paper openly discloses that per-step GaugeFix causes late-training instability in 2/3 seeds (Table 1, Seed 123 diverges from 3.1173 to 3.1904), and that the proposed method does not straightforwardly outperform the baseline. This level of candor is commendable and scientifically valuable.
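The function-preserving identity in Strength 2 can be checked numerically. The sketch below is illustrative only: the choice of g (balancing the norms of one head's Q and K multipliers) and the names gauge_fix, r_q, r_k are assumptions made for this example, and the paper's Eq. 4-6 may define g differently.

```python
import math


def gauge_fix(r_q, r_k):
    """Rescale one head's Q/K multipliers so that their norms match.

    Hypothetical choice of g (balanced norms), used only for illustration;
    it is not taken from the paper's Eq. 4-6.
    """
    norm_q = math.sqrt(sum(x * x for x in r_q))
    norm_k = math.sqrt(sum(x * x for x in r_k))
    g = math.sqrt(norm_q / norm_k)
    # Divide Q by g and multiply K by g, so every product r_q[i] * r_k[i] is kept.
    return [x / g for x in r_q], [x * g for x in r_k], g


# Made-up multipliers for a single head with d_k = 3.
r_q = [4.0, 8.0, 2.0]
r_k = [0.5, 0.25, 1.0]
new_q, new_k, g = gauge_fix(r_q, r_k)

before = [a * b for a, b in zip(r_q, r_k)]
after = [a * b for a, b in zip(new_q, new_k)]
# Largest change in any elementwise product; floating-point rounding only.
max_err = max(abs(a - b) for a, b in zip(before, after))
```

The elementwise products, and therefore anything computed only from them, agree up to rounding, while the two norms are equalized. The work is one pass over the multipliers of each head, consistent with the O(h·dk) cost noted above.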

Weaknesses:
1. The main experimental result is weak and unreliable: the best reported improvement (0.0307 nats over baseline, Table 2) comes from a single seed (42) at a single frequency (every 100 steps). The frequency analysis in Table 2 uses only seed 42, which was hand-selected as the most stable seed. Across the 3 seeds in Table 1, GaugeFix shows much higher variance than the baseline (std 0.064 vs 0.013), and no statistical significance test is reported, so the improvement may not hold across seeds (HF_NO_SIGNIFICANCE concern).
2. The core insight, that gauge-fixing alone is insufficient because weight decay also provides magnitude regularization, is essentially a negative result. The proposed solution (GaugeFix) does not work reliably per-step, and the 'fix' (applying it every 100 steps) lacks principled justification. The paper itself states that the real solution should combine gauge-fixing with magnitude regularization, which is left as future work (Section 7). This makes the current contribution incomplete and incremental.
3. Limited scale and scope of evaluation: only GPT-2 124M on OpenWebText is tested. The instability may manifest differently (or worse) in larger models. No downstream task evaluation is provided; only validation perplexity is reported. The paper also does not establish that Q/K drift causes measurable degradation in practice: Condition C (no control) achieves a mean val loss of 3.0780, only 0.0192 above the baseline's 3.0588, suggesting that a drift of 0.053 may not be practically harmful at this scale.
4. The per-step GaugeFix introduces instability that the no-control ablation (Condition C) does not exhibit, suggesting that the projection itself interacts badly with Adam momentum states. This is acknowledged but not analyzed rigorously: there are no experiments isolating the Adam state staleness mechanism, no ablation with fresh optimizer states, and no theoretical analysis of how the projection distorts gradient estimates.
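To make the staleness hypothesis in Weakness 4 concrete, here is a short chain-rule sketch. It assumes that the loss L depends on the multipliers only through the product p = r_Q ⊙ r_K (the invariance cited in Strength 2) and that g is a positive scale. The symbols m and v denote Adam's first and second moment estimates. None of this is taken from the paper's own analysis.

```latex
% Assumption: the loss L sees the multipliers only through p = r_Q \odot r_K,
% so \partial L / \partial p is unchanged by the projection.
\begin{align*}
r_Q' &= r_Q / g, \qquad r_K' = g\, r_K, \qquad g > 0 \\
\frac{\partial L}{\partial r_Q'} &= \frac{\partial L}{\partial p} \odot r_K'
  = g\, \frac{\partial L}{\partial r_Q} \\
\frac{\partial L}{\partial r_K'} &= \frac{\partial L}{\partial p} \odot r_Q'
  = \frac{1}{g}\, \frac{\partial L}{\partial r_K} \\
\text{gauge-consistent Adam states:}\quad
m_Q' &= g\, m_Q, \quad v_Q' = g^2\, v_Q, \quad
m_K' = m_K / g, \quad v_K' = v_K / g^2
\end{align*}
```

Under these assumptions, leaving the optimizer states untouched after a projection makes the stored first moments differ from the new-gauge gradients by a factor of g (or 1/g) and the second moments by g^2 (or 1/g^2). An ablation comparing untouched, rescaled, and reset states would test whether this mismatch explains the instability.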

Must Fix Items:
1. Run frequency analysis (Table 2) on all 3 seeds, not just seed 42, and report mean±std. A single-seed result for the paper's best configuration is insufficient for a reliable claim.
2. Add statistical significance tests (e.g., paired t-test or bootstrap) comparing GaugeFix-every-100 vs baseline across seeds. Currently no significance is reported.
3. Clarify the practical impact of Q/K drift: Condition C (no control) performs comparably to Condition A (baseline with weight decay), suggesting drift of ~0.05 may be negligible. If the drift being fixed is not actually harmful, the motivation for GaugeFix is weakened.
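As a minimal sketch of the kind of test requested in Must Fix item 2, the code below runs an exact paired sign-flip (permutation) test on per-seed validation losses. This is a swapped-in alternative to the paired t-test or bootstrap named above, chosen because it needs only the standard library. The function name and the loss values are placeholders invented for illustration; they are not results from the paper.

```python
from itertools import product


def sign_flip_pvalue(baseline, method):
    """Two-sided exact sign-flip test on paired per-seed losses.

    Under the null hypothesis each paired difference is equally likely to
    have either sign, so all 2**n sign assignments are enumerated.
    """
    diffs = [m - b for b, m in zip(baseline, method)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    hits = 0
    total = 0
    for signs in product((1.0, -1.0), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        # Small tolerance so ties with the observed statistic are counted.
        if stat >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total


# Placeholder losses for three seeds (NOT values from the paper).
placeholder_baseline = [3.00, 3.10, 3.20]
placeholder_method = [2.95, 3.04, 3.18]
p_value = sign_flip_pvalue(placeholder_baseline, placeholder_method)
```

Even when the method wins on every seed, as in the placeholder data, the smallest two-sided p-value this test can return with 3 seeds is 2/8 = 0.25. Reaching p < 0.05 requires at least 6 paired seeds (2/64 ≈ 0.031), which supports running the frequency analysis on more seeds.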

Runs:
- run=1 score=3.8 verdict=Strong Reject confidence=0.6 error=None