{
  "pdf": "8eb2aa2d-3df0-4923-bf42-a4677819026f.pdf",
  "title": "DATA-FREE TRANSITION-SPECTRUM WINSORIZA-TION FOR MAMBA LONG-CONTEXT GENERALIZATION",
  "elapsed": 250.6,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 4.2,
  "scores": [
    4.2
  ],
  "score_std": 0.0,
  "final_verdict": "Reject",
  "final_confidence": 0.78,
  "conference_scores": null,
  "strengths": [
    "The paper honestly reports a critical negative finding in Section 4.4: winsorization does NOT substantially reduce the fraction of near-1 effective eigenvalues (21.63% after winsorization vs 21.24% for base model), which directly contradicts the motivating hypothesis stated in Sections 1 and 3.2 that 'eigenvalues near 1 cause state explosion.' This transparency about mechanism failure is commendable and scientifically valuable, even though it undermines the paper's own premise.",
    "The data-free property is a genuine practical advantage: the method requires only a single pass through model weights to extract A matrices and compute per-layer percentiles (Section 3.4), with no calibration data or gradient computation needed. This makes the approach applicable in settings where representative calibration data is unavailable, which is a real use case.",
    "The percentile-based ablation study (Table 2, Figure 2) is informative and shows a clear sensitivity landscape: conservative q (<4%) provides insufficient compression (PPL@64K > 200), the sweet spot is q=5-8%, and beyond q=8% diminishing returns on long-context PPL accelerate short-context regression. The comparison between adaptive per-layer percentile bounds vs fixed global thresholds [0.1, 0.99] (PPL@64K of 11.44 vs 38.0) demonstrates that per-layer adaptation matters.",
    "The method modifies only 17.5% of channels vs 100% for constant scaling (Table 1), and achieves lower short-context regression (+6.8% vs +20.8%). This demonstrates a genuine Pareto improvement on the short-context vs long-context tradeoff compared to the constant-scaling baseline."
  ],
  "weaknesses": [
    "The core mechanism analysis in Section 4.4 fatally undermines the paper's stated motivation. The paper is framed around the hypothesis that 'eigenvalues near 1 cause state explosion' (Section 1, Section 3.2), yet the mechanism analysis shows winsorization does NOT reduce near-1 effective eigenvalue mass (21.63% post-winsorization vs 21.24% base). The paper then speculates the improvement 'may therefore arise from a different mechanism, such as improved gradient flow or regularization effects' — but these alternative mechanisms are entirely untested and unverified. The paper's title and framing promise a spectrum-based intervention, but the intervention's effect does not operate through the spectrum mechanism it claims. This is a fundamental coherence problem: the method is proposed for reason X, but analysis shows reason X is false, and no alternative mechanism is validated.",
    "Evaluation is extremely narrow: single benchmark (PG-19), single model (Mamba2-1.3B), single task (language modeling perplexity). No downstream task evaluation (e.g., long-context QA, retrieval, summarization), no other SSM architecture (original Mamba, Mamba-130M/370M/2.8B, other SSM variants), no other long-context benchmark (e.g., LongBench, L-Eval, ZeroSCROLLS). The 13% improvement over constant scaling on a single benchmark with a single model provides very limited evidence of generalizability.",
    "No statistical significance tests or variance estimates reported. All results appear to be single-run numbers. The PPL differences are small in absolute terms (9.94 vs 11.25 at 2K, 11.44 vs 13.19 at 64K) and could be within run-to-run variance, especially given bfloat16 precision is used. No standard deviations, no confidence intervals, no multiple seeds. HF_NO_SIGNIFICANCE is warranted.",
    "The comparison against the only truly competitive method — calibrated scaling (Lu et al., 2025) — shows an enormous gap: PPL@64K of 11.44 vs 4.72, a 2.4× worse perplexity. While the paper acknowledges this gap as 'expected,' it also notes calibrated scaling achieves -52.9% short-context regression (i.e., it IMPROVES short-context performance while vastly improving long-context). The 'data-free' constraint is presented as the differentiator, but the practical utility of a method that leaves PPL@64K at 11.44 (still 22.7% regression from base 9.31) is questionable — the model is still severely degraded at long context, just less catastrophically so.",
    "The '13% improvement' claim over constant scaling is misleading in context. Going from PPL@64K=13.19 to 11.44 is a 13.3% relative improvement, but both values represent severely degraded models (base PPL@2K=9.31, so PPL@64K of 11.44 still means 22.7% worse than training-length performance). The absolute gap of 1.75 PPL points is marginal given both methods produce models that are fundamentally broken at 64K context. The framing of this as a substantial improvement overstates the practical significance."
  ],
  "must_fix_items": [
    "Provide evidence for the actual mechanism by which winsorization improves perplexity. The current Section 4.4 shows the proposed mechanism (tail reduction of near-1 eigenvalues) is falsified. Without validating an alternative mechanism, the method is a black-box heuristic with a post-hoc justification gap. At minimum: test the 'gradient flow' or 'regularization' hypotheses; analyze whether winsorization's effect operates through the A-matrix modification at all by testing random A perturbations of similar magnitude.",
    "Add at least one additional evaluation benchmark beyond PG-19, and at least one additional model scale. Without this, generalizability claims are unsupported.",
    "Report variance across multiple random seeds and add significance tests for all reported comparisons. Current results could be noise."
  ],
  "runs": [
    {
      "run": 1,
      "score": 4.2,
      "verdict": "Reject",
      "confidence": 0.78,
      "strengths": [
        "The paper honestly reports a critical negative finding in Section 4.4: winsorization does NOT substantially reduce the fraction of near-1 effective eigenvalues (21.63% after winsorization vs 21.24% for base model), which directly contradicts the motivating hypothesis stated in Sections 1 and 3.2 that 'eigenvalues near 1 cause state explosion.' This transparency about mechanism failure is commendable and scientifically valuable, even though it undermines the paper's own premise.",
        "The data-free property is a genuine practical advantage: the method requires only a single pass through model weights to extract A matrices and compute per-layer percentiles (Section 3.4), with no calibration data or gradient computation needed. This makes the approach applicable in settings where representative calibration data is unavailable, which is a real use case.",
        "The percentile-based ablation study (Table 2, Figure 2) is informative and shows a clear sensitivity landscape: conservative q (<4%) provides insufficient compression (PPL@64K > 200), the sweet spot is q=5-8%, and beyond q=8% diminishing returns on long-context PPL accelerate short-context regression. The comparison between adaptive per-layer percentile bounds vs fixed global thresholds [0.1, 0.99] (PPL@64K of 11.44 vs 38.0) demonstrates that per-layer adaptation matters.",
        "The method modifies only 17.5% of channels vs 100% for constant scaling (Table 1), and achieves lower short-context regression (+6.8% vs +20.8%). This demonstrates a genuine Pareto improvement on the short-context vs long-context tradeoff compared to the constant-scaling baseline."
      ],
      "weaknesses": [
        "The core mechanism analysis in Section 4.4 fatally undermines the paper's stated motivation. The paper is framed around the hypothesis that 'eigenvalues near 1 cause state explosion' (Section 1, Section 3.2), yet the mechanism analysis shows winsorization does NOT reduce near-1 effective eigenvalue mass (21.63% post-winsorization vs 21.24% base). The paper then speculates the improvement 'may therefore arise from a different mechanism, such as improved gradient flow or regularization effects' — but these alternative mechanisms are entirely untested and unverified. The paper's title and framing promise a spectrum-based intervention, but the intervention's effect does not operate through the spectrum mechanism it claims. This is a fundamental coherence problem: the method is proposed for reason X, but analysis shows reason X is false, and no alternative mechanism is validated.",
        "Evaluation is extremely narrow: single benchmark (PG-19), single model (Mamba2-1.3B), single task (language modeling perplexity). No downstream task evaluation (e.g., long-context QA, retrieval, summarization), no other SSM architecture (original Mamba, Mamba-130M/370M/2.8B, other SSM variants), no other long-context benchmark (e.g., LongBench, L-Eval, ZeroSCROLLS). The 13% improvement over constant scaling on a single benchmark with a single model provides very limited evidence of generalizability.",
        "No statistical significance tests or variance estimates reported. All results appear to be single-run numbers. The PPL differences are small in absolute terms (9.94 vs 11.25 at 2K, 11.44 vs 13.19 at 64K) and could be within run-to-run variance, especially given bfloat16 precision is used. No standard deviations, no confidence intervals, no multiple seeds. HF_NO_SIGNIFICANCE is warranted.",
        "The comparison against the only truly competitive method — calibrated scaling (Lu et al., 2025) — shows an enormous gap: PPL@64K of 11.44 vs 4.72, a 2.4× worse perplexity. While the paper acknowledges this gap as 'expected,' it also notes calibrated scaling achieves -52.9% short-context regression (i.e., it IMPROVES short-context performance while vastly improving long-context). The 'data-free' constraint is presented as the differentiator, but the practical utility of a method that leaves PPL@64K at 11.44 (still 22.7% regression from base 9.31) is questionable — the model is still severely degraded at long context, just less catastrophically so.",
        "The '13% improvement' claim over constant scaling is misleading in context. Going from PPL@64K=13.19 to 11.44 is a 13.3% relative improvement, but both values represent severely degraded models (base PPL@2K=9.31, so PPL@64K of 11.44 still means 22.7% worse than training-length performance). The absolute gap of 1.75 PPL points is marginal given both methods produce models that are fundamentally broken at 64K context. The framing of this as a substantial improvement overstates the practical significance."
      ],
      "must_fix_items": [
        "Provide evidence for the actual mechanism by which winsorization improves perplexity. The current Section 4.4 shows the proposed mechanism (tail reduction of near-1 eigenvalues) is falsified. Without validating an alternative mechanism, the method is a black-box heuristic with a post-hoc justification gap. At minimum: test the 'gradient flow' or 'regularization' hypotheses; analyze whether winsorization's effect operates through the A-matrix modification at all by testing random A perturbations of similar magnitude.",
        "Add at least one additional evaluation benchmark beyond PG-19, and at least one additional model scale. Without this, generalizability claims are unsupported.",
        "Report variance across multiple random seeds and add significance tests for all reported comparisons. Current results could be noise."
      ],
      "conference_scores": null
    }
  ]
}