{
  "pdf": "compute-matched-diffusion-planning-audit.pdf",
  "title": "COMPUTE-MATCHED EVALUATION REVEALS TASK-DEPENDENT DIFFUSION PLANNING ADVANTAGE FARS Analemma",
  "elapsed": 48.1,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 4.2,
  "scores": [
    4.2
  ],
  "score_std": 0,
  "final_verdict": "Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.5,
    "presentation": 3,
    "contribution": 2.3,
    "overall_rating": 4.2,
    "confidence": 3
  },
  "strengths": [
    "The paper identifies and addresses a genuine methodological gap: prior comparisons of diffusion vs. AR models (Dream-7B, LLaDA) evaluate against greedy AR decoding without controlling for the vastly different inference compute budgets (Section 1, Table 1). This is a real confound and the compute-matched protocol is a meaningful contribution to fair evaluation.",
    "The compute-matching protocol itself is well-designed: wall-clock calibration on held-out data, robustness checks with median vs. p75 estimators (Table 2), and multiple random seeds for best-of-k (seeds 42, 123, 456) with reported standard deviations. The k values (35 and 39) are derived from actual timing measurements rather than theoretical FLOPs, which is pragmatic and reproducible (Section 3.3, Table 2).",
    "The task-dependent finding is informative and nuanced: diffusion loses badly on Countdown (-32.5pp) but wins on Mini Sudoku (+10.4pp with 95% CI [+6.1, +14.6]). This moves beyond blanket claims about diffusion superiority and provides quantitative evidence that the advantage depends on problem structure (Section 4.2, Table 1)."
  ],
  "weaknesses": [
    "Extremely narrow scope: only 2 tasks, 1 model pair (Dream-7B vs. Qwen2.5-7B), 1 GPU type (A100). The claim that 'diffusion may provide genuine advantages for constraint-satisfaction problems' is based on a single constraint-satisfaction task (Mini Sudoku). Generalizing from one CSP instance to all CSPs is unwarranted. No additional planning domains (e.g., graph coloring, N-queens, SAT, logistics planning) are tested (Section 3.1, Section 5).",
    "The best-of-k paradigm is a very weak inference-time strategy for AR models, creating a potentially unfair comparison favoring diffusion. Best-of-k with independent sampling and no search (no backtracking, no lookahead, no tree-of-thoughts) is near the bottom of inference-time scaling methods. A more compute-efficient AR strategy (e.g., beam search with verification, tree-of-thoughts, or guided decoding with constraint propagation) could plausibly close or reverse the Mini Sudoku gap entirely. The paper acknowledges tree-of-thoughts and self-consistency in Related Work (Section 2) but never compares against them, making the 'compute-matched' label misleading—it is compute-matched only for one specific, naive AR strategy (Section 3.2, Section 4.5).",
    "The paper was generated by an automated research system (explicitly stated in the abstract). This raises concerns about depth of analysis, novelty of insight beyond surface-level pattern matching, and the adequacy of the discussion. The Discussion section (4.5) is thin—offering hypotheses about 'sequential structure' vs. 'global coherence' without any mechanistic analysis, probing experiments, or controlled ablations to test these hypotheses. No analysis of failure cases, no per-difficulty-stratum breakdown, no examination of what diffusion does differently on the same instances (Section 4.5)."
  ],
  "must_fix_items": [
    "Add at least 2-3 additional tasks per category (sequential reasoning and constraint satisfaction) to substantiate the task-dependent claim. Without this, the generalization is speculative.",
    "Compare against at least one stronger AR inference-time method (e.g., beam search with verification or tree-of-thoughts) at the same compute budget, to test whether the Mini Sudoku diffusion advantage holds against non-naive AR strategies.",
    "Provide mechanistic or ablation evidence for the 'global coherence' hypothesis: e.g., analyze diffusion intermediate denoising steps on Sudoku, compare per-instance difficulty strata, or show that diffusion succeeds on instances where all k AR samples fail for structural reasons (not just sampling luck)."
  ],
  "runs": [
    {
      "run": 1,
      "score": 4.2,
      "verdict": "Reject",
      "confidence": 0.6,
      "strengths": [
        "The paper identifies and addresses a genuine methodological gap: prior comparisons of diffusion vs. AR models (Dream-7B, LLaDA) evaluate against greedy AR decoding without controlling for the vastly different inference compute budgets (Section 1, Table 1). This is a real confound and the compute-matched protocol is a meaningful contribution to fair evaluation.",
        "The compute-matching protocol itself is well-designed: wall-clock calibration on held-out data, robustness checks with median vs. p75 estimators (Table 2), and multiple random seeds for best-of-k (seeds 42, 123, 456) with reported standard deviations. The k values (35 and 39) are derived from actual timing measurements rather than theoretical FLOPs, which is pragmatic and reproducible (Section 3.3, Table 2).",
        "The task-dependent finding is informative and nuanced: diffusion loses badly on Countdown (-32.5pp) but wins on Mini Sudoku (+10.4pp with 95% CI [+6.1, +14.6]). This moves beyond blanket claims about diffusion superiority and provides quantitative evidence that the advantage depends on problem structure (Section 4.2, Table 1)."
      ],
      "weaknesses": [
        "Extremely narrow scope: only 2 tasks, 1 model pair (Dream-7B vs. Qwen2.5-7B), 1 GPU type (A100). The claim that 'diffusion may provide genuine advantages for constraint-satisfaction problems' is based on a single constraint-satisfaction task (Mini Sudoku). Generalizing from one CSP instance to all CSPs is unwarranted. No additional planning domains (e.g., graph coloring, N-queens, SAT, logistics planning) are tested (Section 3.1, Section 5).",
        "The best-of-k paradigm is a very weak inference-time strategy for AR models, creating a potentially unfair comparison favoring diffusion. Best-of-k with independent sampling and no search (no backtracking, no lookahead, no tree-of-thoughts) is near the bottom of inference-time scaling methods. A more compute-efficient AR strategy (e.g., beam search with verification, tree-of-thoughts, or guided decoding with constraint propagation) could plausibly close or reverse the Mini Sudoku gap entirely. The paper acknowledges tree-of-thoughts and self-consistency in Related Work (Section 2) but never compares against them, making the 'compute-matched' label misleading—it is compute-matched only for one specific, naive AR strategy (Section 3.2, Section 4.5).",
        "The paper was generated by an automated research system (explicitly stated in the abstract). This raises concerns about depth of analysis, novelty of insight beyond surface-level pattern matching, and the adequacy of the discussion. The Discussion section (4.5) is thin—offering hypotheses about 'sequential structure' vs. 'global coherence' without any mechanistic analysis, probing experiments, or controlled ablations to test these hypotheses. No analysis of failure cases, no per-difficulty-stratum breakdown, no examination of what diffusion does differently on the same instances (Section 4.5)."
      ],
      "must_fix_items": [
        "Add at least 2-3 additional tasks per category (sequential reasoning and constraint satisfaction) to substantiate the task-dependent claim. Without this, the generalization is speculative.",
        "Compare against at least one stronger AR inference-time method (e.g., beam search with verification or tree-of-thoughts) at the same compute budget, to test whether the Mini Sudoku diffusion advantage holds against non-naive AR strategies.",
        "Provide mechanistic or ablation evidence for the 'global coherence' hypothesis: e.g., analyze diffusion intermediate denoising steps on Sudoku, compare per-instance difficulty strata, or show that diffusion succeeds on instances where all k AR samples fail for structural reasons (not just sampling luck)."
      ],
      "conference_scores": {
        "soundness": 2.5,
        "presentation": 3,
        "contribution": 2.3,
        "overall_rating": 4.2,
        "confidence": 3
      }
    }
  ]
}