{
  "pdf": "fcboost-dominant-fc-kv-quantization.pdf",
  "title": "FCBOOST: STATIC FREQUENCY-AWARE CHANNEL SELECTION FOR 2-BIT KV CACHE QUANTIZATION FARS",
  "elapsed": 52.6,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.8,
  "scores": [
    3.8
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.3,
    "presentation": 2.8,
    "contribution": 2,
    "overall_rating": 3.8,
    "confidence": 3
  },
  "strengths": [
    "Clear and simple method: FCBoost replaces Kitty's dynamic per-page channel selection with a static mask derived from CA scores, reducing selection complexity from O(N/P) to O(1) (Section 3.4). This is a genuine engineering simplification with practical value for long-sequence inference.",
    "Strong empirical results on AIME24/25: FCBoost achieves 71.11% average accuracy vs Kitty's 66.67% (+4.44pp) and KIVI-KV2*'s 66.11% (+5.00pp), with notably lower variance (std=1.57 vs 7-9) (Table 1). The low variance is a meaningful signal of robustness.",
    "Ablation study validates CA signal: Random masks average 64.44% vs CA mask's 71.11% (+6.67pp), and random masks even underperform the no-boost baseline (64.44% vs 66.11%), confirming that CA identifies genuinely important channels rather than any static pattern sufficing (Table 2)."
  ],
  "weaknesses": [
    "Extremely narrow experimental evaluation: Only one model (Qwen3-8B) and one task family (AIME mathematical reasoning, 30 problems per benchmark) are tested. The paper itself acknowledges this limitation (Section 5), but no results on other architectures (e.g., LLaMA, Mistral), other model sizes, or other task types (e.g., long-context retrieval, summarization, code generation) are provided. This severely limits claims of generalizability.",
    "Statistical significance is questionable with only 3 seeds on 30-problem benchmarks: AIME24/AIME25 each contain only 30 problems. With 3 random seeds, the effective sample size is very small. The reported differences (e.g., 4.44pp between FCBoost and Kitty) may not be statistically significant. No formal significance tests (e.g., paired t-test, bootstrap confidence intervals) are reported (HF_NO_SIGNIFICANCE concern). The low std for FCBoost (1.57) is suspicious given only 3 seeds—could be a sampling artifact.",
    "Over-packaging of a relatively incremental contribution: The core idea—using CA scores from FASA (Wang et al., 2026) to select channels for higher precision instead of magnitude—is a straightforward application of an existing metric to an existing problem. The paper does not provide theoretical justification for why CA-identified channels should be quantization-sensitive; the connection is hypothesized and empirically validated on one model but not deeply analyzed. The method itself is essentially: compute CA offline → select top-F pairs → apply static mask. This is a 3-step recipe with limited technical novelty beyond the observation itself.",
    "Unfair/incomplete baseline comparison: Only two baselines (KIVI-KV2* and Kitty) are compared. Other relevant mixed-precision methods like MixKVQ, KVmix, MiniKV, GEAR, RotateKV, and QuaRot are discussed in related work but not included in experiments. The paper does not explain why these are omitted or whether they are applicable at 2-bit precision. Additionally, the 'KIVI-KV2*' baseline appears to be the authors' own implementation rather than the original KIVI, raising fairness concerns.",
    "CA vs magnitude analysis is shallow: Section 4.4 reports Jaccard overlap (0.299) and Spearman correlation (0.670) but does not investigate which channels are uniquely selected by CA and why they matter more for quantization. No per-layer or per-head analysis is provided. The claim that 'CA captures structural importance...which identifies a qualitatively different (and more effective) subset' is asserted but not mechanistically explained."
  ],
  "must_fix_items": [
    "Add formal statistical significance tests (e.g., bootstrap CI or paired permutation test) for the main results in Table 1 to confirm that the 4.44pp improvement over Kitty is not due to random variation with only 3 seeds on 30-problem benchmarks.",
    "Evaluate on at least one additional model architecture and one additional task domain (e.g., long-context retrieval like Needle-in-a-Haystack or RULER) to support the generality claim that 'quantization sensitivity is structurally determined by RoPE frequencies.'",
    "Include at least one more recent mixed-precision KV cache quantization baseline (e.g., MixKVQ or RotateKV) in the experimental comparison."
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.8,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "Clear and simple method: FCBoost replaces Kitty's dynamic per-page channel selection with a static mask derived from CA scores, reducing selection complexity from O(N/P) to O(1) (Section 3.4). This is a genuine engineering simplification with practical value for long-sequence inference.",
        "Strong empirical results on AIME24/25: FCBoost achieves 71.11% average accuracy vs Kitty's 66.67% (+4.44pp) and KIVI-KV2*'s 66.11% (+5.00pp), with notably lower variance (std=1.57 vs 7-9) (Table 1). The low variance is a meaningful signal of robustness.",
        "Ablation study validates CA signal: Random masks average 64.44% vs CA mask's 71.11% (+6.67pp), and random masks even underperform the no-boost baseline (64.44% vs 66.11%), confirming that CA identifies genuinely important channels rather than any static pattern sufficing (Table 2)."
      ],
      "weaknesses": [
        "Extremely narrow experimental evaluation: Only one model (Qwen3-8B) and one task family (AIME mathematical reasoning, 30 problems per benchmark) are tested. The paper itself acknowledges this limitation (Section 5), but no results on other architectures (e.g., LLaMA, Mistral), other model sizes, or other task types (e.g., long-context retrieval, summarization, code generation) are provided. This severely limits claims of generalizability.",
        "Statistical significance is questionable with only 3 seeds on 30-problem benchmarks: AIME24/AIME25 each contain only 30 problems. With 3 random seeds, the effective sample size is very small. The reported differences (e.g., 4.44pp between FCBoost and Kitty) may not be statistically significant. No formal significance tests (e.g., paired t-test, bootstrap confidence intervals) are reported (HF_NO_SIGNIFICANCE concern). The low std for FCBoost (1.57) is suspicious given only 3 seeds—could be a sampling artifact.",
        "Over-packaging of a relatively incremental contribution: The core idea—using CA scores from FASA (Wang et al., 2026) to select channels for higher precision instead of magnitude—is a straightforward application of an existing metric to an existing problem. The paper does not provide theoretical justification for why CA-identified channels should be quantization-sensitive; the connection is hypothesized and empirically validated on one model but not deeply analyzed. The method itself is essentially: compute CA offline → select top-F pairs → apply static mask. This is a 3-step recipe with limited technical novelty beyond the observation itself.",
        "Unfair/incomplete baseline comparison: Only two baselines (KIVI-KV2* and Kitty) are compared. Other relevant mixed-precision methods like MixKVQ, KVmix, MiniKV, GEAR, RotateKV, and QuaRot are discussed in related work but not included in experiments. The paper does not explain why these are omitted or whether they are applicable at 2-bit precision. Additionally, the 'KIVI-KV2*' baseline appears to be the authors' own implementation rather than the original KIVI, raising fairness concerns.",
        "CA vs magnitude analysis is shallow: Section 4.4 reports Jaccard overlap (0.299) and Spearman correlation (0.670) but does not investigate which channels are uniquely selected by CA and why they matter more for quantization. No per-layer or per-head analysis is provided. The claim that 'CA captures structural importance...which identifies a qualitatively different (and more effective) subset' is asserted but not mechanistically explained."
      ],
      "must_fix_items": [
        "Add formal statistical significance tests (e.g., bootstrap CI or paired permutation test) for the main results in Table 1 to confirm that the 4.44pp improvement over Kitty is not due to random variation with only 3 seeds on 30-problem benchmarks.",
        "Evaluate on at least one additional model architecture and one additional task domain (e.g., long-context retrieval like Needle-in-a-Haystack or RULER) to support the generality claim that 'quantization sensitivity is structurally determined by RoPE frequencies.'",
        "Include at least one more recent mixed-precision KV cache quantization baseline (e.g., MixKVQ or RotateKV) in the experimental comparison."
      ],
      "conference_scores": {
        "soundness": 2.3,
        "presentation": 2.8,
        "contribution": 2,
        "overall_rating": 3.8,
        "confidence": 3
      }
    }
  ]
}