{
  "pdf": "38e7377f-32b5-44fa-acb8-1693dbc0f97c.pdf",
  "title": "BUDGET-DISTILLED ES-SSM: CROSS-BUDGET KNOWLEDGE DISTILLATION FOR ELASTIC SPECTRAL",
  "elapsed": 314.4,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 5.2,
  "scores": [
    5.2
  ],
  "score_std": 0.0,
  "final_verdict": "Reject",
  "final_confidence": 0.72,
  "conference_scores": null,
  "strengths": [
    "The paper identifies a real and important problem: ES-SSM accuracy collapses at low spectral budgets (58.06% at K=2 vs 77.45% at K=32 per Table 1), and proposes a principled solution via in-place distillation. The problem statement is clear and well-motivated (Section 1, Section 3.1).",
    "Compute-matched comparison design: both conditions use two forward passes per training step, ensuring the improvement from KL distillation is not confounded by additional compute. The anchored dual-CE baseline (Eq. 2) isolates the contribution of the KL term specifically (Section 4.1, Table 1).",
    "Substantial empirical improvement at low budgets: +22.61 pp at K=2 and near-flat accuracy curves (0.53 pp variation from K=2 to K=32) as shown in Table 1 and Figure 2. Even the worst BD-ES-SSM seed at K=2 (77.00%) exceeds the best baseline seed (65.41%), suggesting the improvement is robust despite variance (Section 4.3)."
  ],
  "weaknesses": [
    "Trivial core contribution after packaging stripping: the method is a direct application of in-place distillation from Universally Slimmable Networks (Yu & Huang, 2019) to spectral SSM truncation. The KL loss (Eq. 3), T² scaling, and stop-gradient are all standard from Hinton et al. (2015). The paper adds no new algorithmic insight beyond 'apply existing distillation to a different truncation modality' (Section 3.2, Eq. 3; related work Section 2.2).",
    "Evaluation is dangerously narrow: single benchmark (LRA Text, byte-level IMDB binary classification), single architecture (ES-SSM with dmodel=256, nlayers=8, K̄=32), single task type (binary classification). No multi-class, no generation, no other LRA tasks (ListOps, Image, PathFinder, etc.). The generalizability claim is unsupported (Section 4.1, Table 1).",
    "No statistical significance tests: with only 3 seeds and very high variance (baseline K=2 std=6.26, BD-ES-SSM K=8 std=4.64), no t-test, bootstrap CI, or Wilcoxon test is reported. The +22.61pp claim at K=2 is based on point estimates from n=3. The paper acknowledges 'moderate seed variance' but does not quantify significance (Table 1, Section 4.3).",
    "Baseline is algorithmically augmented beyond published ES-SSM: the 'anchored dual-CE baseline' (Eq. 2) adds a full-budget CE loss that the original ES-SSM (Song & Wang, 2026) does not use. This means the baseline is already stronger than the published ES-SSM. The paper's contribution is only the KL term *on top of* this augmented baseline, not the full improvement over the original ES-SSM. The title and abstract do not clarify this distinction (Section 3.2, Eq. 2 vs Section 3.1 description of ES-SSM training).",
    "Budget inconsistency mechanism claim is contradicted by own data: Section 4.3 states that 1 of 3 seeds shows *higher* inconsistency despite achieving the highest accuracy, which directly contradicts the claimed mechanism (that KL distillation works by reducing budget inconsistency). The paper speculates this is because 'highly confident predictions may diverge while still being correct,' but this undermines the theoretical motivation (Section 4.3, paragraph 3)."
  ],
  "must_fix_items": [
    "Report significance tests (paired t-test or bootstrap CI) for all budget levels, especially K=2 where variance is highest and the headline claim rests.",
    "Report comparison against the original published ES-SSM (Song & Wang 2026) without the dual-CE augmentation, to quantify the full improvement chain: original ES-SSM → dual-CE baseline → BD-ES-SSM. Currently the reader cannot tell how much the KL term adds vs. how much the dual-CE already adds.",
    "Evaluate on at least 2 additional benchmarks (other LRA tasks or a different dataset) to support generalizability claims."
  ],
  "runs": [
    {
      "run": 1,
      "score": 5.2,
      "verdict": "Reject",
      "confidence": 0.72,
      "strengths": [
        "The paper identifies a real and important problem: ES-SSM accuracy collapses at low spectral budgets (58.06% at K=2 vs 77.45% at K=32 per Table 1), and proposes a principled solution via in-place distillation. The problem statement is clear and well-motivated (Section 1, Section 3.1).",
        "Compute-matched comparison design: both conditions use two forward passes per training step, ensuring the improvement from KL distillation is not confounded by additional compute. The anchored dual-CE baseline (Eq. 2) isolates the contribution of the KL term specifically (Section 4.1, Table 1).",
        "Substantial empirical improvement at low budgets: +22.61 pp at K=2 and near-flat accuracy curves (0.53 pp variation from K=2 to K=32) as shown in Table 1 and Figure 2. Even the worst BD-ES-SSM seed at K=2 (77.00%) exceeds the best baseline seed (65.41%), suggesting the improvement is robust despite variance (Section 4.3)."
      ],
      "weaknesses": [
        "Trivial core contribution after packaging stripping: the method is a direct application of in-place distillation from Universally Slimmable Networks (Yu & Huang, 2019) to spectral SSM truncation. The KL loss (Eq. 3), T² scaling, and stop-gradient are all standard from Hinton et al. (2015). The paper adds no new algorithmic insight beyond 'apply existing distillation to a different truncation modality' (Section 3.2, Eq. 3; related work Section 2.2).",
        "Evaluation is dangerously narrow: single benchmark (LRA Text, byte-level IMDB binary classification), single architecture (ES-SSM with dmodel=256, nlayers=8, K̄=32), single task type (binary classification). No multi-class, no generation, no other LRA tasks (ListOps, Image, PathFinder, etc.). The generalizability claim is unsupported (Section 4.1, Table 1).",
        "No statistical significance tests: with only 3 seeds and very high variance (baseline K=2 std=6.26, BD-ES-SSM K=8 std=4.64), no t-test, bootstrap CI, or Wilcoxon test is reported. The +22.61pp claim at K=2 is based on point estimates from n=3. The paper acknowledges 'moderate seed variance' but does not quantify significance (Table 1, Section 4.3).",
        "Baseline is algorithmically augmented beyond published ES-SSM: the 'anchored dual-CE baseline' (Eq. 2) adds a full-budget CE loss that the original ES-SSM (Song & Wang, 2026) does not use. This means the baseline is already stronger than the published ES-SSM. The paper's contribution is only the KL term *on top of* this augmented baseline, not the full improvement over the original ES-SSM. The title and abstract do not clarify this distinction (Section 3.2, Eq. 2 vs Section 3.1 description of ES-SSM training).",
        "Budget inconsistency mechanism claim is contradicted by own data: Section 4.3 states that 1 of 3 seeds shows *higher* inconsistency despite achieving the highest accuracy, which directly contradicts the claimed mechanism (that KL distillation works by reducing budget inconsistency). The paper speculates this is because 'highly confident predictions may diverge while still being correct,' but this undermines the theoretical motivation (Section 4.3, paragraph 3)."
      ],
      "must_fix_items": [
        "Report significance tests (paired t-test or bootstrap CI) for all budget levels, especially K=2 where variance is highest and the headline claim rests.",
        "Report comparison against the original published ES-SSM (Song & Wang 2026) without the dual-CE augmentation, to quantify the full improvement chain: original ES-SSM → dual-CE baseline → BD-ES-SSM. Currently the reader cannot tell how much the KL term adds vs. how much the dual-CE already adds.",
        "Evaluate on at least 2 additional benchmarks (other LRA tasks or a different dataset) to support generalizability claims."
      ],
      "conference_scores": null
    }
  ]
}