{
  "pdf": "selective-self-reference-judge.pdf",
  "title": "SELECTIVE SELF-REFERENCE LLM-AS-A-JUDGE:",
  "elapsed": 66.2,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 4.2,
  "scores": [
    4.2
  ],
  "score_std": 0,
  "final_verdict": "Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.5,
    "presentation": 3,
    "contribution": 2,
    "overall_rating": 4.2,
    "confidence": 3
  },
  "strengths": [
    "Clear problem identification: The error propagation problem in self-reference judging is well-defined with concrete evidence — low-agreement items yield 24.50% accuracy with always-self-reference vs. 37.75% with no-reference (Section 1, Table 2), a 13.25pp degradation that motivates the method.",
    "Simple and intuitive method: SSR-Judge applies a straightforward self-consistency gate (agreement ≥ τ) to conditionally enable self-reference. The method is easy to implement, requires no training or fine-tuning, and the agreement gate (Equation 2) is well-formulated with a clear precision-coverage tradeoff analysis (Section 3.3, Section 4.2).",
    "Thorough slice-level analysis: Table 2 and Figure 2 decompose performance into four slices (unanimous/non-unanimous × correct/wrong), demonstrating that SSR-Judge matches the better baseline in each slice — preserving self-reference benefit on unanimous+correct (+22.57pp over no-reference) and avoiding error propagation on non-unanimous+wrong (+13.25pp over always-self-reference).",
    "Calibration analysis supports the gate design: Figure 3 shows a strong monotonic relationship between agreement level and majority answer correctness (35.5% at 3/5, 57.7% at 4/5, 78.7% at 5/5), justifying the use of agreement as a confidence signal for the gate."
  ],
  "weaknesses": [
    "Marginal improvement over always-self-reference baseline: The main headline number (58.93% vs. 58.21%) is only a +0.72pp improvement over always-self-reference, which is unlikely to be statistically significant given N=1,400 items. No confidence intervals, standard errors, or significance tests are reported for the main comparison. This is a critical gap — the paper's core claim of outperforming always-self-reference rests on an improvement that could be noise. (Table 1, Section 4.2)",
    "Single benchmark, single model, limited task scope: All experiments use MMLU-Pro with Qwen2.5-72B-Instruct only, on pairwise preference judgment with artificially constructed candidate pairs (one correct, one random incorrect). No evaluation on MT-Bench, Chatbot Arena, or any open-ended generation task. No other judge model is tested. The constructed candidate pairs (Section 4.1) are a simplified and unrealistic evaluation protocol — real LLM-as-a-Judge scenarios involve nuanced quality differences, not correct-vs-random pairings. This severely limits generalizability claims. (Section 4.1)",
    "Compute overhead is substantial and under-discussed: Generating k=5 self-solve samples at temperature T=0.7 per evaluation item multiplies inference cost by 5x, plus the re-evaluation with swapped position order (Section 3.4) doubles it further — roughly 10x the cost of no-reference judging. The paper mentions this can be 'mitigated through parallel inference' (Section 5) but does not quantify the latency, throughput, or cost implications, nor compare against other methods that might achieve similar accuracy gains with less compute.",
    "The threshold τ=4/5 is chosen without systematic exploration: The paper uses ≥4/5 as the gate threshold but does not report results for other thresholds (e.g., 3/5, 5/5) in the main comparison, making it unclear whether this is optimal or arbitrarily chosen. Section 4.5 mentions domain-specific threshold tuning could help, but no ablation over τ is provided. The calibration data (Figure 3) actually suggests that 4/5 agreement yields only 57.7% majority correctness — meaning 42.3% of gate-on items still have wrong self-answers, which limits the method's reliability."
  ],
  "must_fix_items": [
    "Add statistical significance tests (e.g., McNemar's test or bootstrap confidence intervals) for the main comparison between SSR-Judge and always-self-reference. The +0.72pp improvement is the paper's central claim and must be shown to not be due to chance.",
    "Evaluate on at least one additional benchmark (e.g., MT-Bench, JudgeBench) and/or with a different judge model to demonstrate generalizability beyond MMLU-Pro + Qwen2.5-72B.",
    "Report an ablation over the agreement threshold τ (e.g., 3/5, 4/5, 5/5) showing the full precision-coverage tradeoff curve and how overall accuracy varies."
  ],
  "runs": [
    {
      "run": 1,
      "score": 4.2,
      "verdict": "Reject",
      "confidence": 0.6,
      "strengths": [
        "Clear problem identification: The error propagation problem in self-reference judging is well-defined with concrete evidence — low-agreement items yield 24.50% accuracy with always-self-reference vs. 37.75% with no-reference (Section 1, Table 2), a 13.25pp degradation that motivates the method.",
        "Simple and intuitive method: SSR-Judge applies a straightforward self-consistency gate (agreement ≥ τ) to conditionally enable self-reference. The method is easy to implement, requires no training or fine-tuning, and the agreement gate (Equation 2) is well-formulated with a clear precision-coverage tradeoff analysis (Section 3.3, Section 4.2).",
        "Thorough slice-level analysis: Table 2 and Figure 2 decompose performance into four slices (unanimous/non-unanimous × correct/wrong), demonstrating that SSR-Judge matches the better baseline in each slice — preserving self-reference benefit on unanimous+correct (+22.57pp over no-reference) and avoiding error propagation on non-unanimous+wrong (+13.25pp over always-self-reference).",
        "Calibration analysis supports the gate design: Figure 3 shows a strong monotonic relationship between agreement level and majority answer correctness (35.5% at 3/5, 57.7% at 4/5, 78.7% at 5/5), justifying the use of agreement as a confidence signal for the gate."
      ],
      "weaknesses": [
        "Marginal improvement over always-self-reference baseline: The main headline number (58.93% vs. 58.21%) is only a +0.72pp improvement over always-self-reference, which is unlikely to be statistically significant given N=1,400 items. No confidence intervals, standard errors, or significance tests are reported for the main comparison. This is a critical gap — the paper's core claim of outperforming always-self-reference rests on an improvement that could be noise. (Table 1, Section 4.2)",
        "Single benchmark, single model, limited task scope: All experiments use MMLU-Pro with Qwen2.5-72B-Instruct only, on pairwise preference judgment with artificially constructed candidate pairs (one correct, one random incorrect). No evaluation on MT-Bench, Chatbot Arena, or any open-ended generation task. No other judge model is tested. The constructed candidate pairs (Section 4.1) are a simplified and unrealistic evaluation protocol — real LLM-as-a-Judge scenarios involve nuanced quality differences, not correct-vs-random pairings. This severely limits generalizability claims. (Section 4.1)",
        "Compute overhead is substantial and under-discussed: Generating k=5 self-solve samples at temperature T=0.7 per evaluation item multiplies inference cost by 5x, plus the re-evaluation with swapped position order (Section 3.4) doubles it further — roughly 10x the cost of no-reference judging. The paper mentions this can be 'mitigated through parallel inference' (Section 5) but does not quantify the latency, throughput, or cost implications, nor compare against other methods that might achieve similar accuracy gains with less compute.",
        "The threshold τ=4/5 is chosen without systematic exploration: The paper uses ≥4/5 as the gate threshold but does not report results for other thresholds (e.g., 3/5, 5/5) in the main comparison, making it unclear whether this is optimal or arbitrarily chosen. Section 4.5 mentions domain-specific threshold tuning could help, but no ablation over τ is provided. The calibration data (Figure 3) actually suggests that 4/5 agreement yields only 57.7% majority correctness — meaning 42.3% of gate-on items still have wrong self-answers, which limits the method's reliability."
      ],
      "must_fix_items": [
        "Add statistical significance tests (e.g., McNemar's test or bootstrap confidence intervals) for the main comparison between SSR-Judge and always-self-reference. The +0.72pp improvement is the paper's central claim and must be shown to not be due to chance.",
        "Evaluate on at least one additional benchmark (e.g., MT-Bench, JudgeBench) and/or with a different judge model to demonstrate generalizability beyond MMLU-Pro + Qwen2.5-72B.",
        "Report an ablation over the agreement threshold τ (e.g., 3/5, 4/5, 5/5) showing the full precision-coverage tradeoff curve and how overall accuracy varies."
      ],
      "conference_scores": {
        "soundness": 2.5,
        "presentation": 3,
        "contribution": 2,
        "overall_rating": 4.2,
        "confidence": 3
      }
    }
  ]
}