{
  "pdf": "6bed6fc1-48fc-4175-98ce-c7eb56f963ea.pdf",
  "title": "DRAFT DE-ANCHORING DECODING DOES NOT MITIGATE CONTEXTUAL DRAG IN LLM REASONING FARS Analemma",
  "elapsed": 398.7,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 4.5,
  "scores": [
    4.5
  ],
  "score_std": 0.0,
  "final_verdict": "Reject",
  "final_confidence": 0.78,
  "conference_scores": null,
  "strengths": [
    "Pre-registered success criteria (Criterion 1: ≥+5pp wrong-draft improvement; Criterion 2: ≤1pp correct-draft loss) established before running experiments, providing methodological rigor and preventing post-hoc goalpost shifting (Section 4.1, 'Pre-registered Success Criteria').",
    "Honest and transparent negative-result reporting: D3 fails both criteria (−0.65pp instead of +5pp on wrong-draft; −2.21pp instead of ≤1pp loss on correct-draft), and the paper provides clear mechanistic analysis of why (Section 4.4). This is rare and valuable in the current literature.",
    "Well-designed control conditions: Drop-Draft (46.03%) and Filler (48.71%) baselines cleanly demonstrate that draft content—not context length or position—drives the drag effect, ruling out confounds (Table 1, Section 4.2 'Draft Content Drives the Effect')."
  ],
  "weaknesses": [
    "Single-task, single-model evaluation: All experiments use only Game of 24 with Qwen3-8B. The paper's title makes a general claim ('Does Not Mitigate Contextual Drag in LLM Reasoning') but provides no evidence for generalizability to other tasks (math, code, QA), models, or scales. The negative result may be task-specific—Game of 24 is a narrow arithmetic puzzle where draft-absent accuracy is unusually low (~46%), making logit interpolation inherently destructive (Section 4.1, Table 1).",
    "No statistical significance tests on any reported numbers. The key result, that D3 changes wrong-draft accuracy by −0.65pp relative to the 1F baseline (81.73% vs. 82.38%), is plausibly within the noise range for n=1084, and no confidence intervals or p-values are reported. The divergence rate comparison (5.57% vs. 6.10%) also lacks significance testing (Section 4.4, 'Method Cannot Distinguish Draft Quality'). HF_NO_SIGNIFICANCE applies.",
    "The core method is a straightforward application of the contrastive decoding paradigm (CAD: Xu 2023; CFG: Sanchez 2023) to contextual drag—maintaining dual KV caches and interpolating logits with adaptive weighting. The adaptive β via JSD is a minor twist on fixed-weight interpolation. The failure was predictable from the Drop-Draft baseline alone: if draft-absent logits yield ~46% accuracy, interpolating toward them cannot help. The 'fundamental flaw' analysis in Section 4.4 is post-hoc rationalization of an unsurprising outcome, not a deep insight.",
    "The paper reports both 'D3 (original)' and 'D3 (optimized)' results (Table 1: 75.74% vs. 81.73% wrong-draft accuracy) but provides no details on what 'optimized' means, how optimization was done, or whether it constitutes data snooping on the test set. The gap between original and optimized D3 (5.99pp) is larger than the effect being studied, raising concerns about hyperparameter search overfitting (Table 1)."
  ],
  "must_fix_items": [
    "Add statistical significance tests (confidence intervals or p-values) for all pairwise comparisons, especially the key D3 vs. 1F baseline comparisons and the divergence rate comparison (5.57% vs. 6.10%).",
    "Clarify what 'D3 (optimized)' means: what was optimized, over what data, and whether the optimization constitutes test-set leakage. The original vs. optimized gap (75.74%→81.73% wrong-draft) needs explicit accounting.",
    "Soften the general claim in the title and conclusion: 'Does Not Mitigate Contextual Drag in LLM Reasoning' overstates the scope. The evidence covers only one task and one model. A title like 'Draft De-Anchoring Decoding Fails on Game of 24 with Qwen3-8B' would match the evidence."
  ],
  "runs": [
    {
      "run": 1,
      "score": 4.5,
      "verdict": "Reject",
      "confidence": 0.78,
      "strengths": [
        "Pre-registered success criteria (Criterion 1: ≥+5pp wrong-draft improvement; Criterion 2: ≤1pp correct-draft loss) established before running experiments, providing methodological rigor and preventing post-hoc goalpost shifting (Section 4.1, 'Pre-registered Success Criteria').",
        "Honest and transparent negative-result reporting: D3 fails both criteria (−0.65pp instead of +5pp on wrong-draft; −2.21pp instead of ≤1pp loss on correct-draft), and the paper provides clear mechanistic analysis of why (Section 4.4). This is rare and valuable in the current literature.",
        "Well-designed control conditions: Drop-Draft (46.03%) and Filler (48.71%) baselines cleanly demonstrate that draft content—not context length or position—drives the drag effect, ruling out confounds (Table 1, Section 4.2 'Draft Content Drives the Effect')."
      ],
      "weaknesses": [
        "Single-task, single-model evaluation: All experiments use only Game of 24 with Qwen3-8B. The paper's title makes a general claim ('Does Not Mitigate Contextual Drag in LLM Reasoning') but provides no evidence for generalizability to other tasks (math, code, QA), models, or scales. The negative result may be task-specific—Game of 24 is a narrow arithmetic puzzle where draft-absent accuracy is unusually low (~46%), making logit interpolation inherently destructive (Section 4.1, Table 1).",
        "No statistical significance tests on any reported numbers. The key result, that D3 changes wrong-draft accuracy by −0.65pp relative to the 1F baseline (81.73% vs. 82.38%), is plausibly within the noise range for n=1084, and no confidence intervals or p-values are reported. The divergence rate comparison (5.57% vs. 6.10%) also lacks significance testing (Section 4.4, 'Method Cannot Distinguish Draft Quality'). HF_NO_SIGNIFICANCE applies.",
        "The core method is a straightforward application of the contrastive decoding paradigm (CAD: Xu 2023; CFG: Sanchez 2023) to contextual drag—maintaining dual KV caches and interpolating logits with adaptive weighting. The adaptive β via JSD is a minor twist on fixed-weight interpolation. The failure was predictable from the Drop-Draft baseline alone: if draft-absent logits yield ~46% accuracy, interpolating toward them cannot help. The 'fundamental flaw' analysis in Section 4.4 is post-hoc rationalization of an unsurprising outcome, not a deep insight.",
        "The paper reports both 'D3 (original)' and 'D3 (optimized)' results (Table 1: 75.74% vs. 81.73% wrong-draft accuracy) but provides no details on what 'optimized' means, how optimization was done, or whether it constitutes data snooping on the test set. The gap between original and optimized D3 (5.99pp) is larger than the effect being studied, raising concerns about hyperparameter search overfitting (Table 1)."
      ],
      "must_fix_items": [
        "Add statistical significance tests (confidence intervals or p-values) for all pairwise comparisons, especially the key D3 vs. 1F baseline comparisons and the divergence rate comparison (5.57% vs. 6.10%).",
        "Clarify what 'D3 (optimized)' means: what was optimized, over what data, and whether the optimization constitutes test-set leakage. The original vs. optimized gap (75.74%→81.73% wrong-draft) needs explicit accounting.",
        "Soften the general claim in the title and conclusion: 'Does Not Mitigate Contextual Drag in LLM Reasoning' overstates the scope. The evidence covers only one task and one model. A title like 'Draft De-Anchoring Decoding Fails on Game of 24 with Qwen3-8B' would match the evidence."
      ],
      "conference_scores": null
    }
  ]
}