Title: EVIDENCE-GROUNDED CONSTRAINT SCHEMAS DO NOT IMPROVE MEDICAL LLM GUARDRAILS LIVEMEDBENCH FARS
PDF: livemedbench-contextual-constraints-guardrail.pdf
Score: 3.5
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 41.3s

Strengths:
1. Clear negative result with strong internal consistency: all 9 experimental conditions (3 main + 6 optimization variants across 3 pipeline architectures) consistently fail to improve over the single-pass baseline (Tables 1–2), which substantially strengthens confidence in the negative finding.
2. Well-controlled experimental design: Conditions B and C share identical information (constraint content + evidence quotes) and differ only in representation format (plain-text checklist vs. structured JSON schema), cleanly isolating the effect of schema structure (Section 2.2, Figure 1).
3. Insightful failure analysis identifying a 'cautious bias' mechanism: the paper quantifies the asymmetry (116 positive criteria lost vs. 55 negative criteria avoided), providing a concrete explanatory mechanism rather than merely reporting negative numbers (Section 3.3).

Weaknesses:
1. Severe generalizability limitation: only one model (Qwen3-14B-Instruct) is tested on one benchmark (LiveMedBench v202601). The conclusion that 'single-pass prompts are more effective' cannot be extended to other model families, sizes, or medical domains. The paper itself acknowledges this in Section 5 but does nothing to mitigate it; even testing GPT-4 or a smaller Qwen variant would significantly strengthen the claim.
2. No statistical significance testing: despite high within-condition variance (SD 0.29–0.43 reported in Section 3.3), no confidence intervals, p-values, or effect size measures are provided. With N=500 and SD~0.35, the observed deltas (−0.013, −0.024) are plausibly within noise. This is a critical omission for a paper whose core contribution is a negative empirical result.
3. Evaluation relies entirely on an LLM-based grader (GPT-4.1) with no human validation: the constraint-focused rubric scores and positive/negative criteria counts are all computed by GPT-4.1 via LiveMedBench's automated script. No inter-annotator agreement, human calibration, or error analysis of the grader itself is provided, raising concerns about measurement reliability, especially for a paper whose deltas are small.

Must Fix Items:
1. Add statistical significance tests (e.g., paired bootstrap or Wilcoxon signed-rank on case-level scores) to determine whether the observed deltas are meaningful given the high variance; a negative result without significance testing is not actionable.
2. Test at least one additional model (different family or scale) to assess whether 'cautious bias' is a general phenomenon or a Qwen3-14B artifact.
3. Report inter-rater reliability or human calibration for the GPT-4.1 grader on at least a sample of cases; without this, the measurement validity of small deltas is questionable.

Runs:
- run=1 score=3.5 verdict=Strong Reject confidence=0.6 error=None
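
Note on Must Fix item 1: as a rough scale check, a single condition's mean with N=500 and SD~0.35 has a standard error of about 0.35 / sqrt(500) ≈ 0.016, the same order as the reported deltas. The standard error of the paired difference depends on how strongly case-level scores correlate across conditions, which the review does not have, so a paired test is what would settle it. The following is a minimal sketch of a paired percentile bootstrap, not the paper's method. The function name `paired_bootstrap_ci`, its parameters, and the synthetic scores in the demo are all assumptions made for illustration; real per-case scores from the paper would replace them.

```python
import random
from statistics import fmean


def paired_bootstrap_ci(baseline, variant, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-case difference (variant - baseline).

    Hypothetical helper: `baseline` and `variant` are per-case scores for the
    same cases in the same order.
    """
    if len(baseline) != len(variant):
        raise ValueError("scores must be paired case by case")
    # Work on per-case differences so that between-case variance cancels out.
    diffs = [v - b for b, v in zip(baseline, variant)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample cases with replacement and record the mean difference each time.
    boot_means = sorted(fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = boot_means[int((alpha / 2) * n_resamples)]
    hi = boot_means[int((1 - alpha / 2) * n_resamples) - 1]
    return fmean(diffs), lo, hi


if __name__ == "__main__":
    # Synthetic placeholder data, NOT results from the paper.
    rng = random.Random(1)
    base = [rng.gauss(0.60, 0.35) for _ in range(500)]
    var = [b - 0.013 + rng.gauss(0.0, 0.20) for b in base]
    delta, lo, hi = paired_bootstrap_ci(base, var)
    print(f"mean delta = {delta:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
```

If the resulting interval contains zero, the delta is not distinguishable from noise at the chosen level. A Wilcoxon signed-rank test on the same per-case differences would be a reasonable cross-check.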