{
  "pdf": "saap-moment-matching.pdf",
  "title": "TRAINING-FREE LINEAR ROUTING FOR SPARSE AT-TENTION VIA ATTENTION-MASS PREDICTION FARS Analemma",
  "elapsed": 52.8,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.5,
  "scores": [
    3.5
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.5,
    "presentation": 3,
    "contribution": 1.8,
    "overall_rating": 3.5,
    "confidence": 3
  },
  "strengths": [
    "The paper provides a clear and well-structured negative result: gauge-coupled whitening (GWR Geometric) achieves 50.7% recall vs symmetric k-means 51.3%, demonstrating that distribution alignment is not the right objective for routing (Section 3.3, Table 1). This negative result is valuable for the community as it rules out a seemingly promising direction.",
    "The OLS-based GWR Linear method is elegantly simple—a single closed-form matrix solve (Equation 5) with no iterative optimization—and achieves 72.6% recall@32, closing 63.6% of the gap to learned routing (Table 1). The simplicity and training-free nature is a genuine practical advantage for deployment.",
    "The budget sweep (Figure 2) shows monotonic improvement in gap closure from 44.3% to 80.3% as budget increases, providing evidence that the method scales well and that linear structure dominates at lower sparsity levels. This is a meaningful characterization of the method's behavior across operating points.",
    "Multi-head generalization across 6 heads spanning layers 2-26 (Table 2) shows consistent improvement over symmetric k-means on every tested head (mean gap closure 69.8%), with particularly strong results on L7-H0 (92.1% gap closure) where symmetric routing nearly fails at 6.7% recall."
  ],
  "weaknesses": [
    "Extremely limited experimental scope: only one model (Qwen2.5-7B), one dataset (PG19), 6 prompts, and 6 attention heads out of 784 total heads (28 layers × 28 heads). The paper reports results on only head-0 for 5 of 6 tested heads, creating a severe selection bias concern. Whether these results generalize to other models, datasets, or even the majority of attention heads is unknown. (Section 4.1, Table 2)",
    "The 'training-free' framing is misleading. GWR Linear requires calibration data (Q and Y matrices) to compute the OLS solution W* (Equation 5). This is functionally identical to training—it learns parameters from data. The distinction that it uses closed-form OLS rather than iterative SGD is an implementation detail, not a fundamental difference. The method is 'training-iteration-free' but not truly training-free, as it requires labeled calibration data with ground-truth attention masses. (Section 3.4)",
    "No end-to-end evaluation on downstream tasks. The paper exclusively evaluates recall@ℓ as a proxy metric but never measures whether the improved routing translates to maintained perplexity, generation quality, or task performance. High recall does not guarantee that the selected buckets contain the most critical attention mass for model output quality. (Section 4, throughout)",
    "The paper is essentially an ablation study of a negative result (GWR Geometric fails) plus a straightforward application of OLS regression to predict attention mass. The core contribution—using OLS to predict a target variable from features—is textbook linear regression, not a novel methodological insight. The '63.6% gap closure' metric inflates the perceived contribution by measuring relative improvement within a specific gap rather than absolute performance. (Section 3.4, Table 1)",
    "Missing critical baselines: no comparison to other training-free routing methods beyond symmetric k-means (e.g., attention-sink-based routing, random + local patterns, or other heuristic approaches). The Saap-MLP comparison uses their own re-implementation details that may not be optimized. No statistical significance tests reported despite relatively small sample sizes (6 prompts). (Table 1, Section 4.1)"
  ],
  "must_fix_items": [
    "Evaluate on at least one additional model and one additional dataset to demonstrate generalization beyond a single experimental setting.",
    "Add downstream task evaluation (perplexity or generation benchmarks) to validate that recall improvement translates to actual model quality preservation.",
    "Report results on a representative sample of all 784 attention heads, not just 6 cherry-picked heads, to address selection bias.",
    "Be transparent about the calibration data requirement—the method is not truly 'training-free' in the conventional sense; reframe as 'iteration-free' or 'closed-form calibration.'"
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.5,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "The paper provides a clear and well-structured negative result: gauge-coupled whitening (GWR Geometric) achieves 50.7% recall vs symmetric k-means 51.3%, demonstrating that distribution alignment is not the right objective for routing (Section 3.3, Table 1). This negative result is valuable for the community as it rules out a seemingly promising direction.",
        "The OLS-based GWR Linear method is elegantly simple—a single closed-form matrix solve (Equation 5) with no iterative optimization—and achieves 72.6% recall@32, closing 63.6% of the gap to learned routing (Table 1). The simplicity and training-free nature is a genuine practical advantage for deployment.",
        "The budget sweep (Figure 2) shows monotonic improvement in gap closure from 44.3% to 80.3% as budget increases, providing evidence that the method scales well and that linear structure dominates at lower sparsity levels. This is a meaningful characterization of the method's behavior across operating points.",
        "Multi-head generalization across 6 heads spanning layers 2-26 (Table 2) shows consistent improvement over symmetric k-means on every tested head (mean gap closure 69.8%), with particularly strong results on L7-H0 (92.1% gap closure) where symmetric routing nearly fails at 6.7% recall."
      ],
      "weaknesses": [
        "Extremely limited experimental scope: only one model (Qwen2.5-7B), one dataset (PG19), 6 prompts, and 6 attention heads out of 784 total heads (28 layers × 28 heads). The paper reports results on only head-0 for 5 of 6 tested heads, creating a severe selection bias concern. Whether these results generalize to other models, datasets, or even the majority of attention heads is unknown. (Section 4.1, Table 2)",
        "The 'training-free' framing is misleading. GWR Linear requires calibration data (Q and Y matrices) to compute the OLS solution W* (Equation 5). This is functionally identical to training—it learns parameters from data. The distinction that it uses closed-form OLS rather than iterative SGD is an implementation detail, not a fundamental difference. The method is 'training-iteration-free' but not truly training-free, as it requires labeled calibration data with ground-truth attention masses. (Section 3.4)",
        "No end-to-end evaluation on downstream tasks. The paper exclusively evaluates recall@ℓ as a proxy metric but never measures whether the improved routing translates to maintained perplexity, generation quality, or task performance. High recall does not guarantee that the selected buckets contain the most critical attention mass for model output quality. (Section 4, throughout)",
        "The paper is essentially an ablation study of a negative result (GWR Geometric fails) plus a straightforward application of OLS regression to predict attention mass. The core contribution—using OLS to predict a target variable from features—is textbook linear regression, not a novel methodological insight. The '63.6% gap closure' metric inflates the perceived contribution by measuring relative improvement within a specific gap rather than absolute performance. (Section 3.4, Table 1)",
        "Missing critical baselines: no comparison to other training-free routing methods beyond symmetric k-means (e.g., attention-sink-based routing, random + local patterns, or other heuristic approaches). The Saap-MLP comparison uses their own re-implementation details that may not be optimized. No statistical significance tests reported despite relatively small sample sizes (6 prompts). (Table 1, Section 4.1)"
      ],
      "must_fix_items": [
        "Evaluate on at least one additional model and one additional dataset to demonstrate generalization beyond a single experimental setting.",
        "Add downstream task evaluation (perplexity or generation benchmarks) to validate that recall improvement translates to actual model quality preservation.",
        "Report results on a representative sample of all 784 attention heads, not just 6 cherry-picked heads, to address selection bias.",
        "Be transparent about the calibration data requirement—the method is not truly 'training-free' in the conventional sense; reframe as 'iteration-free' or 'closed-form calibration.'"
      ],
      "conference_scores": {
        "soundness": 2.5,
        "presentation": 3,
        "contribution": 1.8,
        "overall_rating": 3.5,
        "confidence": 3
      }
    }
  ]
}