{
  "pdf": "rocket-activation-aware-knapsack.pdf",
  "title": "OUTPUT-SPACE ALLOCATION COSTS CALIBRATION-GUIDED",
  "elapsed": 47,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.2,
  "scores": [
    3.2
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.3,
    "presentation": 2.8,
    "contribution": 1.5,
    "overall_rating": 3.2,
    "confidence": 3
  },
  "strengths": [
    "Clear and focused research question: the paper identifies a genuine design inconsistency in ROCKET (output-space factorization objective vs. weight-space allocation cost) and proposes a minimal, principled fix (Section 2.2, Equation 4). The hypothesis is well-defined and testable.",
    "Honest reporting of mixed results: the paper transparently reports the accuracy-perplexity tradeoff on Qwen3-8B (Table 1: +0.8pp accuracy but +16% perplexity), rather than cherry-picking favorable metrics. The Llama-3.2-1B secondary experiment (Table 3) further shows the tradeoff is setting-dependent, adding nuance.",
    "Insightful analysis of why the effect is modest: the >0.99 Spearman correlation between weight-space and output-space errors (Section 3.3) provides a principled explanation for why only 70/252 layers change allocation, grounding the modest gains in an observable structural property rather than hand-waving."
  ],
  "weaknesses": [
    "Extremely limited experimental scope: only 2 model-size/compression-ratio settings are tested (Qwen3-8B at 50%, Llama-3.2-1B at 20%), with the secondary setting using only a single calibration seed. No comparison against other compression baselines (e.g., ASVD, SliceGPT, SVD-LLM) is provided, making it impossible to contextualize the absolute quality of either variant. The paper only compares ROCKET-default vs. ROCKET-ActCost, which is an ablation rather than a full empirical evaluation (Tables 1-3).",
    "No statistical significance testing: the +0.8pp accuracy difference on Qwen3-8B is reported as a mean across only 2 seeds, with no error bars, confidence intervals, or significance tests. Given the small sample size and modest effect, it is unclear whether this difference is statistically meaningful (Table 1). HF_NO_SIGNIFICANCE concern applies.",
    "Modest and partially negative contribution: on the primary evaluation setting, the proposed method worsens perplexity by 16% while gaining only +0.8pp accuracy on 8 zero-shot benchmarks. The high correlation (>0.99) between the two cost functions means the proposed change is nearly redundant by construction. The contribution reduces to 'swapping one nearly-identical cost function for another in an existing method's MCKP step,' which is incremental at best (Section 3.3).",
    "Missing key analysis: the paper does not investigate why output-space cost improves accuracy but worsens perplexity on Qwen3-8B, nor does it characterize when the correlation between the two costs is lower. Without understanding the mechanism behind the tradeoff, the practical guidance for practitioners is limited to 'try both and see which metric you care about,' which is not a strong contribution (Section 3.2, Conclusion)."
  ],
  "must_fix_items": [
    "Add statistical significance tests or at minimum error bars across multiple seeds for the primary Qwen3-8B result; 2 seeds is insufficient to claim +0.8pp is meaningful.",
    "Test on at least 2-3 additional model architectures and compression ratios to determine whether the findings generalize beyond the two current settings.",
    "Provide analysis explaining the accuracy-perplexity tradeoff mechanism: which layers receive different allocations, and how do those differences selectively help task accuracy while hurting language modeling?"
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.2,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "Clear and focused research question: the paper identifies a genuine design inconsistency in ROCKET (output-space factorization objective vs. weight-space allocation cost) and proposes a minimal, principled fix (Section 2.2, Equation 4). The hypothesis is well-defined and testable.",
        "Honest reporting of mixed results: the paper transparently reports the accuracy-perplexity tradeoff on Qwen3-8B (Table 1: +0.8pp accuracy but +16% perplexity), rather than cherry-picking favorable metrics. The Llama-3.2-1B secondary experiment (Table 3) further shows the tradeoff is setting-dependent, adding nuance.",
        "Insightful analysis of why the effect is modest: the >0.99 Spearman correlation between weight-space and output-space errors (Section 3.3) provides a principled explanation for why only 70/252 layers change allocation, grounding the modest gains in an observable structural property rather than hand-waving."
      ],
      "weaknesses": [
        "Extremely limited experimental scope: only 2 model-size/compression-ratio settings are tested (Qwen3-8B at 50%, Llama-3.2-1B at 20%), with the secondary setting using only a single calibration seed. No comparison against other compression baselines (e.g., ASVD, SliceGPT, SVD-LLM) is provided, making it impossible to contextualize the absolute quality of either variant. The paper only compares ROCKET-default vs. ROCKET-ActCost, which is an ablation rather than a full empirical evaluation (Tables 1-3).",
        "No statistical significance testing: the +0.8pp accuracy difference on Qwen3-8B is reported as a mean across only 2 seeds, with no error bars, confidence intervals, or significance tests. Given the small sample size and modest effect, it is unclear whether this difference is statistically meaningful (Table 1). HF_NO_SIGNIFICANCE concern applies.",
        "Modest and partially negative contribution: on the primary evaluation setting, the proposed method worsens perplexity by 16% while gaining only +0.8pp accuracy on 8 zero-shot benchmarks. The high correlation (>0.99) between the two cost functions means the proposed change is nearly redundant by construction. The contribution reduces to 'swapping one nearly-identical cost function for another in an existing method's MCKP step,' which is incremental at best (Section 3.3).",
        "Missing key analysis: the paper does not investigate why output-space cost improves accuracy but worsens perplexity on Qwen3-8B, nor does it characterize when the correlation between the two costs is lower. Without understanding the mechanism behind the tradeoff, the practical guidance for practitioners is limited to 'try both and see which metric you care about,' which is not a strong contribution (Section 3.2, Conclusion)."
      ],
      "must_fix_items": [
        "Add statistical significance tests or at minimum error bars across multiple seeds for the primary Qwen3-8B result; 2 seeds is insufficient to claim +0.8pp is meaningful.",
        "Test on at least 2-3 additional model architectures and compression ratios to determine whether the findings generalize beyond the two current settings.",
        "Provide analysis explaining the accuracy-perplexity tradeoff mechanism: which layers receive different allocations, and how do those differences selectively help task accuracy while hurting language modeling?"
      ],
      "conference_scores": {
        "soundness": 2.3,
        "presentation": 2.8,
        "contribution": 1.5,
        "overall_rating": 3.2,
        "confidence": 3
      }
    }
  ]
}