{
  "pdf": "definition-unit-tests-convention-adherence.pdf",
  "title": "DEFINITION UNIT TESTS IMPROVE LLM CONVEN-TION ADHERENCE FARS Analemma",
  "elapsed": 45.5,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 4.5,
  "scores": [
    4.5
  ],
  "score_std": 0,
  "final_verdict": "Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.8,
    "presentation": 3,
    "contribution": 2.2,
    "overall_rating": 4.5,
    "confidence": 3
  },
  "strengths": [
    "Well-designed engagement-matched control (Condition B: neutral checks) that isolates the discriminative content effect from mere additional computation. The neutral checks baseline performs within 2pp of glossary-only (Table 1: Qwen 90.0% vs 90.3%, Llama 58.7% vs 56.7%), convincingly ruling out the confound that extra tokens/reasoning steps alone drive improvement. This is a strong methodological contribution to prompt engineering evaluation.",
    "Clear mechanistic evidence that DUT reduces convention confusion specifically. The alternate-convention match rate drops from 7.0% to 1.3% on Qwen and 9.3% to 2.0% on Llama (Table 1), representing ~80% relative reduction. The error analysis (Section 4.5) further confirms remaining errors shift from convention-related to computational, supporting the claimed mechanism rather than a generic accuracy boost.",
    "Practical efficiency advantage demonstrated: single-sample DUT outperforms 5-sample majority vote by large margins (+53.7pp over A@5 on Qwen, +22.7pp over A@5 on Llama, Table 1). This shows DUT provides a qualitatively different benefit from variance reduction, making it cost-effective for deployment scenarios requiring convention adherence."
  ],
  "weaknesses": [
    "Narrow evaluation scope: only one benchmark (ErdosConventionsBench) with only 3 convention families, all in mathematical notation. The paper's title and abstract claim general improvement in 'LLM convention adherence,' but no evidence is provided beyond math conventions. Domains like API specifications, legal terminology, or medical definitions—explicitly mentioned in the introduction (Section 1)—are untested. The generalizability claim is substantially over-packaged relative to the evidence.",
    "DUT requires manual, convention-specific discriminative check design. The paper acknowledges this limitation (Section 5) but does not demonstrate or even prototype automated check generation. The practical applicability is limited: for each new convention family, a domain expert must craft discriminative questions whose answers differ under alternate conventions. This is a significant deployment barrier that is under-discussed given the paper's framing as a general-purpose method.",
    "Only 2 models evaluated, both relatively small (7B and 8B parameters). No evaluation on frontier models (GPT-4, Claude, Gemini) where convention adherence may manifest differently due to stronger instruction following. The Qwen2.5-Math-7B results show only +5.0pp improvement and are near ceiling on 2/3 families (asymptotics 99%, convolution 91-95%), raising the question of whether DUT provides meaningful benefit on stronger models. The Llama results are more impressive but may reflect that model's weaker baseline instruction-following rather than a general phenomenon."
  ],
  "must_fix_items": [
    "The claim of 'improving LLM convention adherence' in general is over-stated; the title and framing should be scoped to mathematical convention adherence specifically, or evidence from additional domains must be provided.",
    "Statistical significance testing is reported only for the main DUT vs. neutral comparison (95% CIs), but not for the per-family breakdowns or k-ablation comparisons. The k=1 vs k=3 comparison for Qwen is claimed 'not significant' based on CI overlap, but proper paired tests should be reported for all comparisons.",
    "The majority-vote baselines (A@5, B@5) produce strikingly low numbers on Qwen (41.3%, 50.7%) that are far below single-sample performance. This suggests a formatting or implementation issue with the majority-vote setup that needs explanation—otherwise the comparison is unfair."
  ],
  "runs": [
    {
      "run": 1,
      "score": 4.5,
      "verdict": "Reject",
      "confidence": 0.6,
      "strengths": [
        "Well-designed engagement-matched control (Condition B: neutral checks) that isolates the discriminative content effect from mere additional computation. The neutral checks baseline performs within 2pp of glossary-only (Table 1: Qwen 90.0% vs 90.3%, Llama 58.7% vs 56.7%), convincingly ruling out the confound that extra tokens/reasoning steps alone drive improvement. This is a strong methodological contribution to prompt engineering evaluation.",
        "Clear mechanistic evidence that DUT reduces convention confusion specifically. The alternate-convention match rate drops from 7.0% to 1.3% on Qwen and 9.3% to 2.0% on Llama (Table 1), representing ~80% relative reduction. The error analysis (Section 4.5) further confirms remaining errors shift from convention-related to computational, supporting the claimed mechanism rather than a generic accuracy boost.",
        "Practical efficiency advantage demonstrated: single-sample DUT outperforms 5-sample majority vote by large margins (+53.7pp over A@5 on Qwen, +22.7pp over A@5 on Llama, Table 1). This shows DUT provides a qualitatively different benefit from variance reduction, making it cost-effective for deployment scenarios requiring convention adherence."
      ],
      "weaknesses": [
        "Narrow evaluation scope: only one benchmark (ErdosConventionsBench) with only 3 convention families, all in mathematical notation. The paper's title and abstract claim general improvement in 'LLM convention adherence,' but no evidence is provided beyond math conventions. Domains like API specifications, legal terminology, or medical definitions—explicitly mentioned in the introduction (Section 1)—are untested. The generalizability claim is substantially over-packaged relative to the evidence.",
        "DUT requires manual, convention-specific discriminative check design. The paper acknowledges this limitation (Section 5) but does not demonstrate or even prototype automated check generation. The practical applicability is limited: for each new convention family, a domain expert must craft discriminative questions whose answers differ under alternate conventions. This is a significant deployment barrier that is under-discussed given the paper's framing as a general-purpose method.",
        "Only 2 models evaluated, both relatively small (7B and 8B parameters). No evaluation on frontier models (GPT-4, Claude, Gemini) where convention adherence may manifest differently due to stronger instruction following. The Qwen2.5-Math-7B results show only +5.0pp improvement and are near ceiling on 2/3 families (asymptotics 99%, convolution 91-95%), raising the question of whether DUT provides meaningful benefit on stronger models. The Llama results are more impressive but may reflect that model's weaker baseline instruction-following rather than a general phenomenon."
      ],
      "must_fix_items": [
        "The claim of 'improving LLM convention adherence' in general is over-stated; the title and framing should be scoped to mathematical convention adherence specifically, or evidence from additional domains must be provided.",
        "Statistical significance testing is reported only for the main DUT vs. neutral comparison (95% CIs), but not for the per-family breakdowns or k-ablation comparisons. The k=1 vs k=3 comparison for Qwen is claimed 'not significant' based on CI overlap, but proper paired tests should be reported for all comparisons.",
        "The majority-vote baselines (A@5, B@5) produce strikingly low numbers on Qwen (41.3%, 50.7%) that are far below single-sample performance. This suggests a formatting or implementation issue with the majority-vote setup that needs explanation—otherwise the comparison is unfair."
      ],
      "conference_scores": {
        "soundness": 2.8,
        "presentation": 3,
        "contribution": 2.2,
        "overall_rating": 4.5,
        "confidence": 3
      }
    }
  ]
}