Title: SINKCAST: AN EMPIRICAL STUDY OF INFERENCE-TIME CORRECTION BF16
PDF: sinkcast-bos-fp32-rope.pdf
Score: 3.5
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 50.7s

Strengths:
1. Honest reporting of negative results: The paper transparently reports that SinkCast fails to achieve its goals (the sink key accounts for only 5–8% of the shift-error, gap closure peaks at 36% against an 80% target, and the downstream change is −0.91 points). This is refreshing in a field where negative results are often suppressed. Evidence: Abstract, Sections 4.2–4.4, Tables 1–3.
2. Principled correction formula with closed-form derivation: The SinkCast correction (Equations 4–9) is mathematically sound: it removes the BF16 sink contribution, rescales, and adds the FP32-corrected contribution using log-sum-exp arithmetic. The multi-key extension (Equation 9) generalizes cleanly. Evidence: Section 3.3, Equations 4–9.
3. Clear refutation of the localization hypothesis with quantitative evidence: Table 1 shows Dlogit(0) is the smallest among all measured indices (4.48 vs 22.57 for j=1 in Llama; 3.84 vs 10.14 for j=1 in Mistral), and j0-fraction is only 5.0% and 8.5%. This directly refutes the premise that sink-focused correction could work. Evidence: Table 1, Figure 2.
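
To make the mechanics in Strength 2 concrete, the following is a minimal NumPy sketch of a generic log-sum-exp "swap one key's logit" update: remove the key's old contribution, add the corrected one, and rescale by the ratio of normalizers. It is not a reproduction of the paper's Equations 4–9, which this review does not restate. The function names (`swap_key_logit`, `attention`), the toy shapes, and the +0.25 logit perturbation standing in for a higher-precision value are all assumptions made for illustration.

```python
import numpy as np

def swap_key_logit(out, lse, old_logit, new_logit, v_key):
    """Replace one key's logit in an already-computed softmax-attention output.

    out: attention output for one query, shape (d,)
    lse: log-sum-exp of the original logits (scalar)
    old_logit / new_logit: the key's logit before and after correction
    v_key: the value vector of that key, shape (d,)
    """
    w_old = np.exp(old_logit - lse)   # weight the key had in the original softmax
    w_new = np.exp(new_logit - lse)   # unnormalized weight under the corrected logit
    z_ratio = 1.0 - w_old + w_new     # new normalizer divided by the old one
    new_out = (out + (w_new - w_old) * v_key) / z_ratio
    new_lse = lse + np.log(z_ratio)
    return new_out, new_lse

def attention(logits, values):
    """Reference single-query softmax attention; returns (output, log-sum-exp)."""
    m = logits.max()
    p = np.exp(logits - m)
    z = p.sum()
    return (p / z) @ values, m + np.log(z)

# Toy data (hypothetical): 8 keys, head dimension 4.
rng = np.random.default_rng(0)
logits = rng.normal(size=8)
values = rng.normal(size=(8, 4))
out, lse = attention(logits, values)

corrected = logits.copy()
corrected[0] += 0.25  # stand-in for a higher-precision logit at key 0

fixed_out, fixed_lse = swap_key_logit(out, lse, logits[0], corrected[0], values[0])
ref_out, ref_lse = attention(corrected, values)  # full recomputation for comparison
```

In this sketch the update touches only one key's logit and value, so its cost does not grow with sequence length. That property is what makes a sink-only correction attractive in principle, and it is also why the method cannot help when the error is spread across many keys, as Table 1 indicates.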

Weaknesses:
1. Extremely limited contribution given negative results: The paper proposes a method (SinkCast), then demonstrates that it does not work. While negative results can be valuable, the actual insight ('BF16 shift-error is distributed, not localized at sinks') is a relatively narrow empirical finding that could be established in a few pages without a full method proposal. The method itself is a dead end. Evidence: entire paper, particularly the Section 5 Discussion, which offers no path forward for the proposed approach.
2. Over-packaging of a simple empirical observation: The paper frames a failed hypothesis test as a full method contribution. The SinkCast algorithm (Sections 3.2–3.3) is presented with a figure and formal equations, but since the method demonstrably does not work, this elaborate presentation inflates the perceived contribution. The core finding (error is distributed, not localized) could be shown with Table 1 alone. Evidence: Figure 1, Equations 4–9, the mismatch between method complexity and negative outcome.
3. Insufficient evaluation scope and no statistical significance: Evaluation covers only two 7–8B models, one shift offset (Δ=4096), and one sequence length for microbenchmarks (2048). No error bars, confidence intervals, or significance tests are reported for any downstream metric in Table 3. All reported gains and drops are smaller than 2 points, which is within noise for most benchmarks, so the −0.91 'overall improvement' is unreliable as evidence. Evidence: Table 3 (all values <2 points, no error bars), Section 5 ('evaluation is limited to 7–8B parameter models').
4. Missing critical baselines and comparisons: The paper does not compare SinkCast against the obvious alternative of full FP32 RoPE (only gap closure is reported, not actual FP32 performance). There is no comparison with AnchorAttention (Wang et al., 2024) or any other correction method. The paper also does not evaluate on the PIC systems (CacheBlend, EPIC) that supposedly motivate this work, so the practical relevance of even fixing shift-error is untested. Evidence: Section 4.1 (no PIC system evaluation), Section 2 (AnchorAttention cited but not compared).

Must Fix Items:
1. Add error bars or statistical significance tests to Table 3; current sub-2-point differences may be noise.
2. Evaluate on at least one PIC system (CacheBlend/EPIC) to demonstrate practical relevance of the shift-error problem being studied.
3. Report actual FP32 baseline performance alongside SinkCast for direct comparison, not just gap closure percentages.
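
For Must Fix item 1, one inexpensive option is a paired bootstrap over benchmark examples. The sketch below assumes the authors have per-example 0/1 correctness for both configurations on the same examples, which this review cannot confirm. The function name `paired_bootstrap_ci` and the synthetic data are hypothetical and only illustrate the procedure.

```python
import numpy as np

def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the accuracy difference (A - B), in points.

    correct_a, correct_b: per-example 0/1 correctness on the SAME examples.
    Returns (point_estimate, lower, upper).
    """
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    rng = np.random.default_rng(seed)
    n = a.shape[0]
    # Resample example indices with replacement; the same indices are used for
    # both systems so the pairing between them is preserved.
    idx = rng.integers(0, n, size=(n_resamples, n))
    diffs = 100.0 * (a[idx].mean(axis=1) - b[idx].mean(axis=1))
    lower, upper = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return 100.0 * (a.mean() - b.mean()), lower, upper

# Synthetic stand-in data (NOT results from the paper): 500 examples, with the
# two systems disagreeing on a small random subset.
rng = np.random.default_rng(1)
baseline = rng.random(500) < 0.60
flips = rng.random(500) < 0.05
method = np.where(flips, ~baseline, baseline)

point, lower, upper = paired_bootstrap_ci(method, baseline)
print(f"difference = {point:+.2f} points, 95% CI [{lower:+.2f}, {upper:+.2f}]")
```

If the resulting interval for a Table 3 entry contains zero, the corresponding gain or drop should not be described as an improvement or a regression.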

Runs:
- run=1 score=3.5 verdict=Strong Reject confidence=0.6 error=None