{
  "pdf": "290af5cd-cf21-4aba-ba7f-83492b52674a.pdf",
  "title": "TOOL-GATED RESIDUAL DISTILLATION FOR DAT-ACHEF VERIFIER SCORING FARS",
  "elapsed": 79.5,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 4.2,
  "scores": [
    4.2
  ],
  "score_std": 0.0,
  "final_verdict": "Reject",
  "final_confidence": 0.82,
  "conference_scores": null,
  "strengths": [
    "The paper identifies an important practical problem: LLM-as-judge rubric scores may not correlate with downstream fine-tuning performance, and the anti-correlation finding (ρ = −0.405 average) is an interesting empirical observation that challenges assumptions in data curation frameworks (Section 1, Abstract).",
    "The cost reduction from 2.56M tokens to zero at inference is a clear practical benefit, and the overall framework of replacing expensive API calls with a small distilled model is a reasonable engineering direction (Section 3.2.2, Section 4.2).",
    "The ablation in Table 2 comparing 3-way factorized vs 5-way non-factorized distillation provides some evidence that rubric simplification matters, with a 1.54-point ρ swing on LiveCodeBench — though the 5-way variant's strong negative correlation raises its own questions (Table 2)."
  ],
  "weaknesses": [
    "Fatal statistical power issue: Spearman ρ is computed on only n=8 data points per task. For n=8, the critical value for p<0.05 (two-tailed) is |ρ|≥0.738. The LiveCodeBench ρ=0.667 is NOT statistically significant. No p-values, confidence intervals, or significance tests are reported anywhere in the paper. The entire contribution rests on correlations that may be noise (Table 1, Section 4.2). HF_NO_SIGNIFICANCE.",
    "The 'Tool-Gated' component in the title is effectively inert: gate coverage is 0.25%, and LLM-only and Tool+LLM produce IDENTICAL results (Table 1). The paper acknowledges this (Section 4.2) but still names the method 'Tool-Gated Residual Distillation,' which is over-packaging. The tool gating's only role is simplifying the rubric for distillation — a minor architectural convenience, not a gating mechanism in any meaningful sense.",
    "The score map assigns TASK MISMATCH→0.35 and PASS→0.05, meaning task-mismatched instances score 7× higher than passing instances. This is counterintuitive and 'optimized on a held-out validation set' (Section 3.2.3) with no details about the validation data, optimization procedure, or safeguards against overfitting. With only 2 tasks × 8 datasets total, there is virtually no independent data for this tuning, creating a severe data leakage / overfitting risk. HF_DATA_LEAK.",
    "The '1.18-point improvement' claim (ρ from −0.405 to 0.771) is mathematically misleading: averaging a negative and positive Spearman ρ across tasks is not a meaningful metric. The proper comparison should be per-task, where LiveCodeBench goes from −0.762 to 0.667 (which is not significant at n=8) and OpenFinData goes from −0.048 to 0.874 (the baseline was already near-zero, making the comparison weak) (Abstract, Section 4.2).",
    "The 44% teacher-student agreement is framed as 'imperfect distillation that corrects biases' (Section 4.4), but this is post-hoc rationalization with no controlled experiment. The student could simply be a poor learner that happens to produce better rankings by luck on n=8 points. No analysis controls for the alternative hypothesis that the student is just noisy and the correlation is spurious.",
    "Evaluation scope is extremely narrow: only 2 tasks, 1 teacher model (gpt-4.1), 1 student model (Qwen2.5-1.5B), 1 downstream model (Qwen3-1.7B), and 3 seeds with no variance reported for ground-truth DBS (Section 4.1). Zero top-1 regret on 2 tasks with n=8 is not convincing — it means the method correctly picked the best of 8 datasets twice, which could easily occur by chance."
  ],
  "must_fix_items": [
    "Report statistical significance for all correlation metrics (p-values for Spearman ρ at n=8; the LiveCodeBench result is likely not significant).",
    "Explain and justify the score map (TASK MISMATCH=0.35 > PASS=0.05), disclose the validation set used for optimization, and demonstrate it does not overlap with evaluation data. Without this, the result may be overfit.",
    "Report variance across seeds for DBS ground truth, and ideally expand to more tasks or datasets to increase statistical power beyond n=8."
  ],
  "runs": [
    {
      "run": 1,
      "score": 4.2,
      "verdict": "Reject",
      "confidence": 0.82,
      "strengths": [
        "The paper identifies an important practical problem: LLM-as-judge rubric scores may not correlate with downstream fine-tuning performance, and the anti-correlation finding (ρ = −0.405 average) is an interesting empirical observation that challenges assumptions in data curation frameworks (Section 1, Abstract).",
        "The cost reduction from 2.56M tokens to zero at inference is a clear practical benefit, and the overall framework of replacing expensive API calls with a small distilled model is a reasonable engineering direction (Section 3.2.2, Section 4.2).",
        "The ablation in Table 2 comparing 3-way factorized vs 5-way non-factorized distillation provides some evidence that rubric simplification matters, with a 1.54-point ρ swing on LiveCodeBench — though the 5-way variant's strong negative correlation raises its own questions (Table 2)."
      ],
      "weaknesses": [
        "Fatal statistical power issue: Spearman ρ is computed on only n=8 data points per task. For n=8, the critical value for p<0.05 (two-tailed) is |ρ|≥0.738. The LiveCodeBench ρ=0.667 is NOT statistically significant. No p-values, confidence intervals, or significance tests are reported anywhere in the paper. The entire contribution rests on correlations that may be noise (Table 1, Section 4.2). HF_NO_SIGNIFICANCE.",
        "The 'Tool-Gated' component in the title is effectively inert: gate coverage is 0.25%, and LLM-only and Tool+LLM produce IDENTICAL results (Table 1). The paper acknowledges this (Section 4.2) but still names the method 'Tool-Gated Residual Distillation,' which is over-packaging. The tool gating's only role is simplifying the rubric for distillation — a minor architectural convenience, not a gating mechanism in any meaningful sense.",
        "The score map assigns TASK MISMATCH→0.35 and PASS→0.05, meaning task-mismatched instances score 7× higher than passing instances. This is counterintuitive and 'optimized on a held-out validation set' (Section 3.2.3) with no details about the validation data, optimization procedure, or safeguards against overfitting. With only 2 tasks × 8 datasets total, there is virtually no independent data for this tuning, creating a severe data leakage / overfitting risk. HF_DATA_LEAK.",
        "The '1.18-point improvement' claim (ρ from −0.405 to 0.771) is mathematically misleading: averaging a negative and positive Spearman ρ across tasks is not a meaningful metric. The proper comparison should be per-task, where LiveCodeBench goes from −0.762 to 0.667 (which is not significant at n=8) and OpenFinData goes from −0.048 to 0.874 (the baseline was already near-zero, making the comparison weak) (Abstract, Section 4.2).",
        "The 44% teacher-student agreement is framed as 'imperfect distillation that corrects biases' (Section 4.4), but this is post-hoc rationalization with no controlled experiment. The student could simply be a poor learner that happens to produce better rankings by luck on n=8 points. No analysis controls for the alternative hypothesis that the student is just noisy and the correlation is spurious.",
        "Evaluation scope is extremely narrow: only 2 tasks, 1 teacher model (gpt-4.1), 1 student model (Qwen2.5-1.5B), 1 downstream model (Qwen3-1.7B), and 3 seeds with no variance reported for ground-truth DBS (Section 4.1). Zero top-1 regret on 2 tasks with n=8 is not convincing — it means the method correctly picked the best of 8 datasets twice, which could easily occur by chance."
      ],
      "must_fix_items": [
        "Report statistical significance for all correlation metrics (p-values for Spearman ρ at n=8; the LiveCodeBench result is likely not significant).",
        "Explain and justify the score map (TASK MISMATCH=0.35 > PASS=0.05), disclose the validation set used for optimization, and demonstrate it does not overlap with evaluation data. Without this, the result may be overfit.",
        "Report variance across seeds for DBS ground truth, and ideally expand to more tasks or datasets to increase statistical power beyond n=8."
      ],
      "conference_scores": null
    }
  ]
}