Title: TOEPLITZ BLOCK MIXING FOR SCALABLE MULTI-HEAD LINEAR ATTENTION FARS Analemma
PDF: toeplitz-block-mixing-mhla.pdf
Score: 3.2
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 74.1s

Strengths:
1. The discovery that Dense MHLA learned mixing matrices are approximately Toeplitz (R² > 0.995 across all layers) is a genuine empirical insight with clear structural implications. This is supported by Figure 2 showing R² values per layer, with later layers reaching R² > 0.999. This finding directly motivates the proposed method and is not a trivial observation.
2. The proposed TBM formulation (Eq. 3-6) is clean and well-motivated: the mixture-of-exponentials kernel K(δ) = Σᵣ aᵣ exp(−λᵣδ) directly exploits the Toeplitz structure, enables efficient O(MRd²) recurrent computation via Eq. 5, and naturally supports length extrapolation since the kernel is defined for any δ. The recurrence in Eq. 5 is a meaningful computational contribution.
3. The complexity reduction from O(M²d²) to O(MRd²) is theoretically sound and verified empirically in Figure 3, where log-log slopes match predictions (≈1.96 for Dense MHLA, ≈0.97 for TBM). The length extrapolation capability (Table 2, TBM @ 65536 achieving 1.28%) is a concrete advantage over the fixed M×M matrix of Dense MHLA.
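
Reviewer's illustration of Strength 2: a minimal scalar sketch of why a mixture-of-exponentials Toeplitz kernel admits a linear-time recurrence. It is not the paper's Eq. 5. It assumes causal mixing, y[i] = Σ_{j≤i} K(i−j) x[j], and replaces the d×d block states with scalars for brevity. The function names `dense_toeplitz_mix` and `recurrent_mix`, and the example values of `a` and `lam`, are hypothetical.

```python
import math

def dense_toeplitz_mix(x, a, lam):
    """O(M^2) reference: y[i] = sum over j <= i of K(i - j) * x[j],
    with K(delta) = sum_r a[r] * exp(-lam[r] * delta)."""
    def kernel(delta):
        return sum(a_r * math.exp(-l_r * delta) for a_r, l_r in zip(a, lam))
    return [sum(kernel(i - j) * x[j] for j in range(i + 1))
            for i in range(len(x))]

def recurrent_mix(x, a, lam):
    """O(M * R) version: one running state per exponential component.
    state[r] after step i equals sum over j <= i of exp(-lam[r] * (i - j)) * x[j]."""
    decay = [math.exp(-l_r) for l_r in lam]
    state = [0.0] * len(a)
    out = []
    for x_i in x:
        # Decay every component by one step, then absorb the new input.
        state = [d * s + x_i for d, s in zip(decay, state)]
        out.append(sum(a_r * s for a_r, s in zip(a, state)))
    return out

# Usage with made-up inputs (assumed values, not taken from the paper).
x = [1.0, -2.0, 0.5, 3.0, 0.25]
a, lam = [0.7, 0.3], [0.1, 1.5]
max_gap = max(abs(p - q) for p, q in
              zip(dense_toeplitz_mix(x, a, lam), recurrent_mix(x, a, lam)))
print(f"max |dense - recurrent| = {max_gap:.2e}")
```

Because the kernel is evaluated through the decay factors rather than a stored M×M matrix, the same recurrence runs unchanged for any sequence length, which is the property behind the length-extrapolation claim.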

Weaknesses:
1. The absolute MQAR accuracy is catastrophically low across all methods (~1.25% for TBM, 0.17% for Dense MHLA). The paper's headline claim of '7.3× higher accuracy' is misleading when the absolute performance is near-random for a task with 64 key-value pairs. A 7.3× improvement on near-zero accuracy is not practically meaningful. The authors acknowledge this in the Limitations section but still frame the results as a success throughout the abstract and conclusion.
2. The experimental evaluation is extremely narrow: only a single synthetic task (MQAR), a single model size (97M), a single training length (8192), and only two baselines (Dense MHLA and Frozen MHLA). No comparison with standard linear attention baselines (RetNet, GLA), SSMs (Mamba/Mamba2), or hybrid architectures. No language modeling perplexity results. No real-world benchmarks. This makes it impossible to assess whether TBM is useful in practice.
3. The scaling analysis in Section 4.4 contains a contradictory data point: at M=16384, Dense MHLA requires 96.87ms while TBM requires 699.17ms, so TBM is 7.2× slower at this scale, not faster. The paper claims 'TBM's linear scaling means it remains tractable at arbitrarily large M' and a crossover at M=4096, but the constant factor is so high that TBM is slower for all practical sequence lengths below ~262K tokens. The efficiency claim in the abstract ('1.24× throughput improvement') is misleading since it holds only at M=128.
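
Reviewer's arithmetic check for Weakness 3: a short sketch that recomputes the slowdown from the two timings quoted above and extrapolates where the two scaling curves would meet. The extrapolation assumes pure power-law scaling through the M=16384 measurement, using the log-log slopes reported for Figure 3 (≈1.96 and ≈0.97). That assumption is the reviewer's, not the paper's, and the result is sensitive to the slopes, so it is only an order-of-magnitude guide in the paper's units of M. The helper names are hypothetical.

```python
def slowdown(t_tbm_ms, t_dense_ms):
    """How many times slower TBM is than Dense MHLA at one measurement."""
    return t_tbm_ms / t_dense_ms

def extrapolated_crossover(m_ref, t_dense_ms, t_tbm_ms, slope_dense, slope_tbm):
    """M at which two power laws t(M) = t_ref * (M / m_ref) ** slope intersect.

    Setting the two curves equal gives
    (M / m_ref) ** (slope_dense - slope_tbm) = t_tbm_ms / t_dense_ms.
    """
    if slope_dense <= slope_tbm:
        raise ValueError("dense slope must exceed TBM slope for a crossover")
    ratio = t_tbm_ms / t_dense_ms
    return m_ref * ratio ** (1.0 / (slope_dense - slope_tbm))

# Timings at M=16384 and slopes as quoted in this review.
ratio = slowdown(699.17, 96.87)
m_cross = extrapolated_crossover(16384, 96.87, 699.17, 1.96, 0.97)
print(f"slowdown at M=16384: {ratio:.2f}x")
print(f"extrapolated crossover M (power-law assumption): {m_cross:.0f}")
```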

Must Fix Items:
1. Add evaluation on at least one real-world benchmark (e.g., language modeling perplexity on WikiText-103 or Pile, downstream tasks from lm-eval-harness) to demonstrate practical relevance beyond synthetic MQAR.
2. Clarify and correct the misleading efficiency framing: acknowledge that TBM is slower than Dense MHLA for most practical M values due to constant factor overhead, and reframe the '1.24× throughput' claim with proper caveats about the M range where it holds.
3. Add informative baselines: include standard linear attention (without MHLA), RetNet, GLA, and/or Mamba to contextualize whether the ~1.25% MQAR accuracy represents any meaningful retrieval capability at all.

Runs:
- run=1 score=3.2 verdict=Strong Reject confidence=0.6 error=None