{
  "pdf": "window-diffusion-overlap-refresh.pdf",
  "title": "OVERLAP-REFRESH: DECOUPLING WINDOW SHIFTS FROM FULL KV REFRESH IN DIFFUSION LANGUAGE MODELS FARS",
  "elapsed": 61.8,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.5,
  "scores": [
    3.5
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.3,
    "presentation": 2.8,
    "contribution": 1.8,
    "overall_rating": 3.5,
    "confidence": 3
  },
  "strengths": [
    "Clear identification of a real inefficiency: Window-Diffusion couples window shifts with full KV refresh, and the paper convincingly argues these serve different purposes (Section 3.1-3.2). The overlap observation (most tokens unchanged between shifts) is straightforward but valid and previously unexploited.",
    "Delta-prefill is a well-defined, implementable operation: computing KV only for newly entered tokens N while reusing cached KV for overlap tokens O, with a clean cost analysis showing O(|N|×C) vs O(C²) per layer (Section 3.2). The 3.3× measured cost reduction vs full refresh (Table 2: 0.194s vs 0.648s) validates the theoretical advantage.",
    "Honest and transparent reporting of limitations: The paper openly acknowledges the 8× gap between theoretical and observed speedup ratios (Section 4.3), the sensitivity to hyperparameters (Section 4.4), and the per-step overhead problem. This level of self-criticism is commendable."
  ],
  "weaknesses": [
    "Extremely narrow experimental scope: only one model (Dream-7B) and only two benchmarks (MBPP, HumanEval), both of them code generation. No evaluation on natural language generation, longer sequences, or other dLLM architectures (e.g., LLaDA). The 6.0% throughput improvement on a single benchmark is a marginal gain that may not generalize. Table 3 on HumanEval shows OR (s=32,R=64) is actually slower than Baseline A (8.07 vs 8.48 tokens/sec), contradicting the main MBPP finding.",
    "The 'optimal' configuration (s=16, R=32) has essentially the same full-refresh frequency as Baseline A, with extra delta-prefills added, meaning the method does NOT actually reduce full refresh frequency as claimed. The paper itself admits this in Section 4.4: 'The benefit of Overlap-Refresh comes from adding intermediate delta-prefills between full refreshes, not from reducing full refresh frequency.' This undermines the core narrative of 'decoupling': the best result keeps R=32 (the same as the baseline refresh cycle of 32) and merely adds intermediate shifts, so the decoupling only helps when it does not actually reduce refreshes.",
    "No statistical significance testing: All results are single-run with deterministic decoding (seed=42). The 0.6 percentage point quality gap (54.4% vs 55.0%) and the 6.0% throughput difference could easily fall within noise. No confidence intervals, no multiple seeds, no significance tests. This is a critical omission for a systems/efficiency paper claiming improvements.",
    "The paper is auto-generated (stated in abstract: 'This paper was generated by an automated research system'), which raises concerns about the depth of insight and the rigor of the experimental design. The contribution feels like an obvious engineering optimization (skip recomputation for unchanged tokens) rather than a research insight, and the experimental evaluation appears designed to show a positive result rather than thoroughly characterize the method."
  ],
  "must_fix_items": [
    "Add statistical significance testing: run multiple seeds, report confidence intervals or standard deviations for Pass@1 and throughput metrics.",
    "Evaluate on broader benchmarks: include natural language tasks, longer sequence settings, and at least one additional dLLM model to demonstrate generalizability.",
    "Reconcile the HumanEval result (Table 3) where Overlap-Refresh is slower than Baseline A, which contradicts the main narrative. Report the s=16,R=32 configuration on HumanEval as well."
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.5,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "Clear identification of a real inefficiency: Window-Diffusion couples window shifts with full KV refresh, and the paper convincingly argues these serve different purposes (Section 3.1-3.2). The overlap observation (most tokens unchanged between shifts) is straightforward but valid and previously unexploited.",
        "Delta-prefill is a well-defined, implementable operation: computing KV only for newly entered tokens N while reusing cached KV for overlap tokens O, with a clean cost analysis showing O(|N|×C) vs O(C²) per layer (Section 3.2). The 3.3× measured cost reduction vs full refresh (Table 2: 0.194s vs 0.648s) validates the theoretical advantage.",
        "Honest and transparent reporting of limitations: The paper openly acknowledges the 8× gap between theoretical and observed speedup ratios (Section 4.3), the sensitivity to hyperparameters (Section 4.4), and the per-step overhead problem. This level of self-criticism is commendable."
      ],
      "weaknesses": [
        "Extremely narrow experimental scope: only one model (Dream-7B) and only two benchmarks (MBPP, HumanEval), both of them code generation. No evaluation on natural language generation, longer sequences, or other dLLM architectures (e.g., LLaDA). The 6.0% throughput improvement on a single benchmark is a marginal gain that may not generalize. Table 3 on HumanEval shows OR (s=32,R=64) is actually slower than Baseline A (8.07 vs 8.48 tokens/sec), contradicting the main MBPP finding.",
        "The 'optimal' configuration (s=16, R=32) has essentially the same full-refresh frequency as Baseline A, with extra delta-prefills added, meaning the method does NOT actually reduce full refresh frequency as claimed. The paper itself admits this in Section 4.4: 'The benefit of Overlap-Refresh comes from adding intermediate delta-prefills between full refreshes, not from reducing full refresh frequency.' This undermines the core narrative of 'decoupling': the best result keeps R=32 (the same as the baseline refresh cycle of 32) and merely adds intermediate shifts, so the decoupling only helps when it does not actually reduce refreshes.",
        "No statistical significance testing: All results are single-run with deterministic decoding (seed=42). The 0.6 percentage point quality gap (54.4% vs 55.0%) and the 6.0% throughput difference could easily fall within noise. No confidence intervals, no multiple seeds, no significance tests. This is a critical omission for a systems/efficiency paper claiming improvements.",
        "The paper is auto-generated (stated in abstract: 'This paper was generated by an automated research system'), which raises concerns about the depth of insight and the rigor of the experimental design. The contribution feels like an obvious engineering optimization (skip recomputation for unchanged tokens) rather than a research insight, and the experimental evaluation appears designed to show a positive result rather than thoroughly characterize the method."
      ],
      "must_fix_items": [
        "Add statistical significance testing: run multiple seeds, report confidence intervals or standard deviations for Pass@1 and throughput metrics.",
        "Evaluate on broader benchmarks: include natural language tasks, longer sequence settings, and at least one additional dLLM model to demonstrate generalizability.",
        "Reconcile the HumanEval result (Table 3) where Overlap-Refresh is slower than Baseline A, which contradicts the main narrative. Report the s=16,R=32 configuration on HumanEval as well."
      ],
      "conference_scores": {
        "soundness": 2.3,
        "presentation": 2.8,
        "contribution": 1.8,
        "overall_rating": 3.5,
        "confidence": 3
      }
    }
  ]
}