{
  "pdf": "skewguard-polr.pdf",
  "title": "SKEWGUARD-POLR: INVESTIGATING DIRICHLET-UNCERTAINTY GATED MULTI-CLUSTER EXPANSION",
  "elapsed": 86.8,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 2.8,
  "scores": [
    2.8
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2,
    "presentation": 2.7,
    "contribution": 1.5,
    "overall_rating": 2.8,
    "confidence": 3
  },
  "strengths": [
    "Honest reporting of negative results: The paper transparently reports that its proposed method (SkewGuard-PoLR) provides no accuracy improvement over the baseline (PoLR) while incurring 17-32% additional computational cost (Table 1). This is commendable scientific integrity, as negative results that prevent the community from pursuing unproductive directions are valuable.",
    "Reproduction attempt of prior claim: The paper attempts to reproduce the key tail failure result from Jindal et al. (2026) — a reported 10pp accuracy drop of PoLR on AIME25 with QwQ-32B — and finds that PoLR actually outperforms SC by 1.11pp (78.89% vs 77.78%) rather than underperforming by 10pp. This reproduction effort, if confirmed, is a useful contribution (Section 3.2, Table 1).",
    "Clear methodological description: The Dirichlet posterior formulation (Equation 1), credible set estimation via Monte Carlo sampling (Equation 2), and the multi-cluster expansion rule are described precisely and are in principle reproducible. The hyperparameters (δ, α0, S) are clearly enumerated (Section 3.4)."
  ],
  "weaknesses": [
    "Extremely thin experimental scope for a negative-result paper: Only 2 models × 2 benchmarks, with AIME25 having only 30 problems. The confidence intervals in Table 1 (e.g., ±1.57 for AIME25) are large relative to the claimed differences. A 30-problem benchmark cannot reliably distinguish a 1.11pp difference from a 10pp difference — the variance is simply too high. The paper does not perform any formal statistical significance test (e.g., bootstrap, McNemar) to establish that the reproduced result is significantly different from the originally reported one. This constitutes HF_NO_SIGNIFICANCE.",
    "The core contribution is self-nullifying: The paper's proposed method is shown to be unnecessary by its own experiments. While negative results can be valuable, the novelty of the method itself is limited — placing a Dirichlet posterior over categorical counts and using a credible set is a textbook Bayesian technique with no adaptation specific to the PoLR setting. The only 'contribution' that remains is the failed reproduction, which is under-powered as noted above.",
    "Root cause of the reproduction discrepancy is uninvestigated: The paper lists three possible explanations for why their result differs from the original PoLR paper (model versions, prompt templates, statistical variation) but makes no effort to investigate any of them (Section 4.4). Without understanding WHY the reproduction differs, the reader cannot assess whether the original result was wrong, or whether the authors simply used a different setup. The paper references the original PoLR paper as 'Jindal et al., 2026' with ArXiv ID 2601.21494 but does not specify what model checkpoint, prompt template, or answer extraction method was used in either the original or their own experiments.",
    "The paper was generated by an automated research system (stated in the abstract), which raises concerns about the depth of analysis, the motivation for the research direction, and whether the 'negative result' framing is a post-hoc rationalization of a failed approach rather than a genuine empirical discovery."
  ],
  "must_fix_items": [
    "Add formal statistical significance testing to compare PoLR vs SC and to assess whether the reproduction result is significantly different from the originally reported -10pp gap. With N=30 problems, a McNemar or bootstrap test is straightforward and essential.",
    "Investigate and report the root cause of the reproduction discrepancy: specify exact model checkpoints, prompt templates, and answer extraction methods used; compare with the original PoLR paper's setup; attempt to reproduce the original -10pp result by matching their setup exactly.",
    "Report per-problem results or at minimum the number of problems where PoLR and SC disagree, to allow readers to assess the overlap and variance structure on the 30-problem benchmark."
  ],
  "runs": [
    {
      "run": 1,
      "score": 2.8,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "Honest reporting of negative results: The paper transparently reports that its proposed method (SkewGuard-PoLR) provides no accuracy improvement over the baseline (PoLR) while incurring 17-32% additional computational cost (Table 1). This is commendable scientific integrity, as negative results that prevent the community from pursuing unproductive directions are valuable.",
        "Reproduction attempt of prior claim: The paper attempts to reproduce the key tail failure result from Jindal et al. (2026) — a reported 10pp accuracy drop of PoLR on AIME25 with QwQ-32B — and finds that PoLR actually outperforms SC by 1.11pp (78.89% vs 77.78%) rather than underperforming by 10pp. This reproduction effort, if confirmed, is a useful contribution (Section 3.2, Table 1).",
        "Clear methodological description: The Dirichlet posterior formulation (Equation 1), credible set estimation via Monte Carlo sampling (Equation 2), and the multi-cluster expansion rule are described precisely and are in principle reproducible. The hyperparameters (δ, α0, S) are clearly enumerated (Section 3.4)."
      ],
      "weaknesses": [
        "Extremely thin experimental scope for a negative-result paper: Only 2 models × 2 benchmarks, with AIME25 having only 30 problems. The confidence intervals in Table 1 (e.g., ±1.57 for AIME25) are large relative to the claimed differences. A 30-problem benchmark cannot reliably distinguish a 1.11pp difference from a 10pp difference — the variance is simply too high. The paper does not perform any formal statistical significance test (e.g., bootstrap, McNemar) to establish that the reproduced result is significantly different from the originally reported one. This constitutes HF_NO_SIGNIFICANCE.",
        "The core contribution is self-nullifying: The paper's proposed method is shown to be unnecessary by its own experiments. While negative results can be valuable, the novelty of the method itself is limited — placing a Dirichlet posterior over categorical counts and using a credible set is a textbook Bayesian technique with no adaptation specific to the PoLR setting. The only 'contribution' that remains is the failed reproduction, which is under-powered as noted above.",
        "Root cause of the reproduction discrepancy is uninvestigated: The paper lists three possible explanations for why their result differs from the original PoLR paper (model versions, prompt templates, statistical variation) but makes no effort to investigate any of them (Section 4.4). Without understanding WHY the reproduction differs, the reader cannot assess whether the original result was wrong, or whether the authors simply used a different setup. The paper references the original PoLR paper as 'Jindal et al., 2026' with ArXiv ID 2601.21494 but does not specify what model checkpoint, prompt template, or answer extraction method was used in either the original or their own experiments.",
        "The paper was generated by an automated research system (stated in the abstract), which raises concerns about the depth of analysis, the motivation for the research direction, and whether the 'negative result' framing is a post-hoc rationalization of a failed approach rather than a genuine empirical discovery."
      ],
      "must_fix_items": [
        "Add formal statistical significance testing to compare PoLR vs SC and to assess whether the reproduction result is significantly different from the originally reported -10pp gap. With N=30 problems, a McNemar or bootstrap test is straightforward and essential.",
        "Investigate and report the root cause of the reproduction discrepancy: specify exact model checkpoints, prompt templates, and answer extraction methods used; compare with the original PoLR paper's setup; attempt to reproduce the original -10pp result by matching their setup exactly.",
        "Report per-problem results or at minimum the number of problems where PoLR and SC disagree, to allow readers to assess the overlap and variance structure on the 30-problem benchmark."
      ],
      "conference_scores": {
        "soundness": 2,
        "presentation": 2.7,
        "contribution": 1.5,
        "overall_rating": 2.8,
        "confidence": 3
      }
    }
  ]
}