Title: EVIDENCE-GROUNDED CONSTRAINT SCHEMAS DO NOT IMPROVE MEDICAL LLM GUARDRAILS LIVEMEDBENCH FARS
PDF: livemedbench-contextual-constraints-guardrail.pdf
Score: 3.5
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 41.3s

Strengths:
1. Clear negative result with strong internal consistency: all 9 experimental conditions (3 main + 6 optimization variants across 3 pipeline architectures) consistently fail to improve over the single-pass baseline (Tables 1–2), which substantially strengthens confidence in the negative finding.
2. Well-controlled experimental design: Conditions B and C share identical information (constraint content + evidence quotes) and differ only in representation format (plain-text checklist vs. structured JSON schema), cleanly isolating the effect of schema structure (Section 2.2, Figure 1).
3. Insightful failure analysis identifying 'cautious bias' mechanism: the paper quantifies the asymmetry—116 positive criteria lost vs. 55 negative criteria avoided—providing a concrete explanatory mechanism rather than merely reporting negative numbers (Section 3.3).

Weaknesses:
1. Severe generalizability limitation: only one model (Qwen3-14B-Instruct) is tested on one benchmark (LiveMedBench v202601). The conclusion that 'single-pass prompts are more effective' cannot be extended to other model families, sizes, or medical domains. The paper acknowledges this in Section 5 but does nothing to mitigate it; testing even one additional model, such as GPT-4 or a smaller Qwen variant, would significantly strengthen the claim.
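2. No statistical significance testing: despite high within-condition variance (SD 0.29–0.43 reported in Section 3.3), no confidence intervals, p-values, or effect size measures are provided. With N=500 and SD≈0.35, the standard error of a difference between two condition means is roughly 0.022 if the conditions are treated as independent samples, so the observed deltas (−0.013, −0.024) are only about 0.6 and 1.1 standard errors from zero and are plausibly within noise. This is a critical omission for a paper whose core contribution is a negative empirical result.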
3. Evaluation relies entirely on an LLM-based grader (GPT-4.1) with no human validation: the constraint-focused rubric scores and positive/negative criteria counts are all computed by GPT-4.1 via LiveMedBench's automated script. No inter-annotator agreement, human calibration, or error analysis of the grader itself is provided, raising concerns about measurement reliability—especially for a paper whose deltas are small.
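
The noise argument in Weakness 2 can be made concrete with a short calculation. The sketch below is illustrative only: it assumes both conditions share the same SD (0.35, a representative value from the reported 0.29–0.43 range) and uses hypothetical values for the correlation `rho` between case-level scores of the two conditions, which the paper does not report. Because the conditions are evaluated on the same cases, the standard error of the mean delta depends strongly on that correlation.

```python
import math

def se_of_paired_delta(sd, n, rho):
    """Standard error of the mean per-case difference between two conditions.

    Assumes equal SD in both conditions, so Var(X - Y) = 2 * sd**2 * (1 - rho).
    rho = 0 reproduces the independent-samples approximation.
    """
    return sd * math.sqrt(2.0 * (1.0 - rho) / n)

N_CASES = 500               # as stated in the review
SD = 0.35                   # representative within-condition SD
DELTAS = (-0.013, -0.024)   # observed deltas quoted in the review

# Hypothetical correlations; the true value is not reported in the paper.
for rho in (0.0, 0.5, 0.9):
    se = se_of_paired_delta(SD, N_CASES, rho)
    ratios = ", ".join(f"{d / se:+.2f}" for d in DELTAS)
    print(f"rho={rho:.1f}  SE={se:.4f}  delta/SE: {ratios}")
```

Under these assumptions the deltas sit near or below one standard error when the scores are uncorrelated, but could exceed two or three standard errors if the case-level scores are highly correlated. This is why a paired test on case-level scores, rather than a comparison of aggregate means, is needed to settle the question.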

Must Fix Items:
1. Add statistical significance tests (e.g., paired bootstrap or Wilcoxon signed-rank on case-level scores) to determine whether the observed deltas are meaningful given the high variance; a negative result without significance testing is not actionable.
2. Test at least one additional model (different family or scale) to assess whether 'cautious bias' is a general phenomenon or a Qwen3-14B artifact.
3. Report inter-rater reliability or human calibration for the GPT-4.1 grader on at least a sample of cases; without this, the measurement validity of small deltas is questionable.
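
As one possible shape for Must Fix item 1, a paired bootstrap over case-level score differences could look like the following. This is a minimal sketch, not the authors' pipeline: the function name `paired_bootstrap_ci`, its parameters, and the synthetic demo scores are all hypothetical, and real use would substitute the per-case rubric scores from each condition.

```python
import random

def paired_bootstrap_ci(baseline, variant, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-case difference (variant - baseline).

    Cases are resampled with replacement as pairs, which preserves the
    within-case correlation between the two conditions.
    """
    if len(baseline) != len(variant):
        raise ValueError("scores must be paired case by case")
    diffs = [v - b for b, v in zip(baseline, variant)]
    n = len(diffs)
    rng = random.Random(seed)
    boot_means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lo = boot_means[int((alpha / 2) * n_resamples)]
    hi = boot_means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi

# Synthetic stand-in data (NOT from the paper): 500 cases, small negative shift.
demo_rng = random.Random(1)
baseline_scores = [demo_rng.gauss(0.60, 0.35) for _ in range(500)]
variant_scores = [b - 0.013 + demo_rng.gauss(0.0, 0.20) for b in baseline_scores]

delta, ci_lo, ci_hi = paired_bootstrap_ci(baseline_scores, variant_scores)
print(f"mean delta = {delta:+.4f}, 95% CI = [{ci_lo:+.4f}, {ci_hi:+.4f}]")
```

If the resulting interval contains zero, the data do not distinguish the variant from the baseline; if it lies entirely below zero, the negative result is supported at the chosen level. A Wilcoxon signed-rank test on the same per-case differences would be a reasonable non-parametric complement.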

Runs:
- run=1 score=3.5 verdict=Strong Reject confidence=0.6 error=None