{
  "pdf": "nll-guided-swaa-layer-selection.pdf",
  "title": "NLL-GUIDED FULL-ATTENTION LAYER SELECTION FOR TRAINING-FREE SLIDING-WINDOW ADAPTATION FARS Analemma",
  "elapsed": 46.2,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 4.6,
  "scores": [
    4.6
  ],
  "score_std": 0,
  "final_verdict": "Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.8,
    "presentation": 3,
    "contribution": 2.5,
    "overall_rating": 4.6,
    "confidence": 3
  },
  "strengths": [
    "The NLL-guided scoring idea is simple and well-motivated: directly measuring per-layer degradation on answer tokens via teacher-forced NLL is a principled way to identify which layers need full attention, avoiding indirect heuristics. Equation 1 defines this cleanly, and the method requires no training or gradient computation (Section 3.2–3.4).",
    "Strong empirical gains over baselines under the same FA budget: NLL-Guided 1/4-FA (64.6%) outperforms the best periodic 1/4-FA by 10.4pp and LightTransfer by 26.4pp on LongMemEval (Table 1). The per-task breakdown (Table 2) shows consistent improvement across all 6 task types, with especially large gains on temporal-reasoning (+33.9pp) and single-session-user (+37.1pp).",
    "De-confounding analysis (Section 4.4, Figure 2) is a valuable sanity check: low Spearman ρ=0.306 between long-prompt and short-prompt layer rankings, only 3/9 overlap (Jaccard=0.2), and 85.6× magnitude difference confirm the signal is specific to long-range attention needs, not generic layer sensitivity."
  ],
  "weaknesses": [
    "Extremely narrow evaluation scope: only one model (Qwen3-4B-Thinking-2507), one benchmark (LongMemEval), and one SWA window size (2048). The authors acknowledge this in Section 5 but it severely limits generalizability claims. It is unknown whether NLL-guided selection works on other model families (LLaMA, Mistral, Gemma), other scales (7B, 13B, 70B), or other long-context benchmarks (RULER, InfiniteBench, Needle-in-a-Haystack).",
    "No statistical significance reporting: all accuracy numbers are presented without confidence intervals, standard errors, or p-values. With 500 samples in LongMemEval, a 0.4pp difference (NLL-Guided 64.6% vs 1/2-FA Periodic 65.0%) is well within plausible sampling variance. The claim that NLL-Guided 1/4-FA 'matches' the 1/2-FA baseline is not statistically supported (HF_NO_SIGNIFICANCE concern).",
    "Calibration data dependency and potential distribution shift: the calibration uses 64 examples from LongAlign-10k and fusang-v1-filtered (Appendix A.1), but there is no analysis of whether the selected layers are sensitive to calibration data distribution. If deployment tasks differ significantly from calibration data, the selected layers may be suboptimal. The 2.4pp accuracy drop when reducing calibration from 64 to 16 examples (Section 4.6) hints at calibration fragility, but no cross-distribution calibration analysis is provided."
  ],
  "must_fix_items": [
    "Add confidence intervals or standard errors to all accuracy numbers in Tables 1 and 2 to support the claim that NLL-Guided 1/4-FA matches the 1/2-FA periodic baseline and significantly outperforms other 1/4-FA methods.",
    "Evaluate on at least one additional model family (e.g., LLaMA-3) and one additional long-context benchmark to substantiate generalizability claims.",
    "Analyze calibration sensitivity to data distribution: show layer selection stability when calibration data comes from a different distribution than the evaluation benchmark."
  ],
  "runs": [
    {
      "run": 1,
      "score": 4.6,
      "verdict": "Reject",
      "confidence": 0.6,
      "strengths": [
        "The NLL-guided scoring idea is simple and well-motivated: directly measuring per-layer degradation on answer tokens via teacher-forced NLL is a principled way to identify which layers need full attention, avoiding indirect heuristics. Equation 1 defines this cleanly, and the method requires no training or gradient computation (Section 3.2–3.4).",
        "Strong empirical gains over baselines under the same FA budget: NLL-Guided 1/4-FA (64.6%) outperforms the best periodic 1/4-FA by 10.4pp and LightTransfer by 26.4pp on LongMemEval (Table 1). The per-task breakdown (Table 2) shows consistent improvement across all 6 task types, with especially large gains on temporal-reasoning (+33.9pp) and single-session-user (+37.1pp).",
        "De-confounding analysis (Section 4.4, Figure 2) is a valuable sanity check: low Spearman ρ=0.306 between long-prompt and short-prompt layer rankings, only 3/9 overlap (Jaccard=0.2), and 85.6× magnitude difference confirm the signal is specific to long-range attention needs, not generic layer sensitivity."
      ],
      "weaknesses": [
        "Extremely narrow evaluation scope: only one model (Qwen3-4B-Thinking-2507), one benchmark (LongMemEval), and one SWA window size (2048). The authors acknowledge this in Section 5 but it severely limits generalizability claims. It is unknown whether NLL-guided selection works on other model families (LLaMA, Mistral, Gemma), other scales (7B, 13B, 70B), or other long-context benchmarks (RULER, InfiniteBench, Needle-in-a-Haystack).",
        "No statistical significance reporting: all accuracy numbers are presented without confidence intervals, standard errors, or p-values. With 500 samples in LongMemEval, a 0.4pp difference (NLL-Guided 64.6% vs 1/2-FA Periodic 65.0%) is well within plausible sampling variance. The claim that NLL-Guided 1/4-FA 'matches' the 1/2-FA baseline is not statistically supported (HF_NO_SIGNIFICANCE concern).",
        "Calibration data dependency and potential distribution shift: the calibration uses 64 examples from LongAlign-10k and fusang-v1-filtered (Appendix A.1), but there is no analysis of whether the selected layers are sensitive to calibration data distribution. If deployment tasks differ significantly from calibration data, the selected layers may be suboptimal. The 2.4pp accuracy drop when reducing calibration from 64 to 16 examples (Section 4.6) hints at calibration fragility, but no cross-distribution calibration analysis is provided."
      ],
      "must_fix_items": [
        "Add confidence intervals or standard errors to all accuracy numbers in Tables 1 and 2 to support the claim that NLL-Guided 1/4-FA matches the 1/2-FA periodic baseline and significantly outperforms other 1/4-FA methods.",
        "Evaluate on at least one additional model family (e.g., LLaMA-3) and one additional long-context benchmark to substantiate generalizability claims.",
        "Analyze calibration sensitivity to data distribution: show layer selection stability when calibration data comes from a different distribution than the evaluation benchmark."
      ],
      "conference_scores": {
        "soundness": 2.8,
        "presentation": 3,
        "contribution": 2.5,
        "overall_rating": 4.6,
        "confidence": 3
      }
    }
  ]
}