Title: LIVEMEDBENCH-ASK1: EVALUATING ASK-BEFORE-ANSWER BEHAVIOR IN MEDICAL LLMS FARS Analemma
PDF: 3014699f-3da2-4a35-9202-8c1bfc118740.pdf
Score: 4.2
Verdict: Reject
Confidence: 0.78
Elapsed: 301.6s

Strengths:
1. Honest negative-result reporting: The paper transparently reports that Ask1 does not improve scores (B-A CIs span zero for both models, Table 2), resists the temptation to spin a negative result into a positive one, and correctly identifies the ceiling effect (C-A gap 0.6–0.9pp) as the root cause rather than blaming model incompetence. This is commendable scientific integrity—Section 5 Discussion and Section 4.4 both acknowledge the null findings without hedging.
2. Clean three-condition experimental design (A/B/C): The masked baseline, Ask1 protocol, and unmasked upper bound provide a logically complete framework for decomposing the value of asking questions. The C condition directly measures the maximum possible benefit from having the missing information, which makes the null result interpretable rather than ambiguous. This design choice is methodologically sound (Sections 3.2, 4.2).
3. Above-chance slot hit rates with binomial tests: Both models significantly exceed the 12.5% chance baseline for slot identification (GPT-4.1: 50.1%, Qwen3-14B: 37.8%, p<0.001, Section 4.2), demonstrating that models can at least partially identify what information is missing—a meaningful finding even if it doesn't translate to score improvement. The per-slot-type variation in hit rates (18.8%–75.0%, Figure 2c) provides useful diagnostic information.
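
The binomial test cited in Strength 3 can be reproduced with a short exact-tail calculation. The sketch below is illustrative only: the review does not state the denominator behind the hit rates, so the count of 657 trials and the derived hit count are assumptions, and `binom_sf` is a hypothetical helper name. Only the 12.5% chance level (1 of 8 slot types) and the 37.8% hit rate come from the review.

```python
import math

def binom_sf(k, n, p):
    """Exact upper tail P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    logs = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * math.log(p) + (n - i) * math.log1p(-p)
        for i in range(k, n + 1)
    ]
    m = max(logs)
    # Factor out the largest term so the sum does not underflow early.
    return math.exp(m) * sum(math.exp(x - m) for x in logs)

CHANCE = 1 / 8            # 8 slot types, so 12.5% by random guessing
N_TRIALS = 657            # ASSUMPTION: one asking opportunity per case
HITS = round(0.378 * N_TRIALS)  # ASSUMPTION: count implied by the 37.8% rate

p_value = binom_sf(HITS, N_TRIALS, CHANCE)
print(f"one-sided p-value: {p_value:.3g}")
```

Under these assumed counts the one-sided p-value falls far below 0.001, consistent with the significance level reported in Section 4.2.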

Weaknesses:
1. Trivially simple oracle whose framing overstates its sophistication: The 'deterministic slot oracle' (Section 3.3) is case-insensitive keyword matching against 8 predefined slot type names. It is not an oracle in any meaningful sense: it cannot handle paraphrases, indirect questions, or multi-slot queries. The paper's framing ('controlled mechanism,' 'eliminates confounds') presents a simple string match as if it had methodological depth. A more realistic oracle would test whether models can ask natural-language questions that a human could answer, which is the actual clinical use case. The current design tests whether models can name the right slot category, not whether they can elicit information through realistic dialogue.
2. The experiment is underpowered and the null result was largely predictable: The C-A gap of 0.6–0.9 percentage points (Table 2) caps the maximum possible improvement from the Ask1 protocol at under 1pp, i.e., less than 0.01 on the 0–1 rubric scale. Given rubric scores of ~0.36 with per-case variance, detecting a 0.6–0.9pp effect requires far more than 657 cases. The paper's own C condition shows that the experiment had almost no room to show an effect, whatever the models did. This raises the question of whether the study design was adequately justified. Section 3.1 shows that only 23.8% of LiveMedBench cases qualify for single-slot masking, so the subset construction may have inadvertently selected cases where the masked slot matters little to the rubric.
3. Per-slot-type pregnancy status finding is cherry-picked without multiple-comparison correction: Section 4.3 highlights the pregnancy status slot as showing 'the largest positive signal' (B-A = +2.0pp for GPT-4.1), but this is one of 8 slot types tested, with no Bonferroni or FDR correction. With 8 comparisons, a Bonferroni-corrected threshold would be 0.05/8 ≈ 0.006, not 0.05. The paper presents this as a 'promising direction for targeted applications' (Section 5, Future Directions) without acknowledging the multiple-testing problem, which amounts to selective reporting.
4. Only 2 models evaluated, no medical-specialized models, and a single LLM grader with no reliability assessment: The evaluation covers only GPT-4.1 and Qwen3-14B (Section 4.1), with no smaller models (which might benefit more from clarifying questions), no medical fine-tuned models (e.g., Med-PaLM, HuatuoGPT), and no domain-specific reasoning models. More critically, the rubric-based evaluation relies entirely on GPT-4.1 (temp=0) as a judge (Section 3.4), which is also one of the two evaluated models, and no inter-grader reliability is measured. For a benchmark paper where the main metric is judge-assigned rubric scores, the absence of any reliability assessment is a significant gap. It is acknowledged in the limitations but not addressed experimentally.
5. The contribution boundary is unclear—the protocol adds minimal novelty over LiveMedBench itself: The Ask1 protocol is conceptually straightforward: take LiveMedBench, mask a slot, let the model ask one question, keyword-match the response. The subset construction (Section 3.1) filters for cases where exactly one slot can be masked—useful but not a methodological innovation. The paper's three listed contributions (Section 1) are essentially: (1) the protocol design (a keyword matcher + one extra turn), (2) the 657-case subset (a filtered version of an existing benchmark), and (3) the negative empirical finding. After stripping packaging, the core contribution is the negative result documentation, which has value but does not constitute a substantial methodological advance.
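
The oracle behavior criticized in Weakness 1 can be illustrated with a minimal sketch. This is a reconstruction from the review's one-line description (case-insensitive keyword matching against slot type names), not the paper's implementation. The function name `keyword_oracle`, the example questions, and the masked value are all hypothetical; 'pregnancy status' is the only slot type named in this review.

```python
from typing import Optional

def keyword_oracle(question: str, masked_slot: str, masked_value: str) -> Optional[str]:
    """Reveal the masked value only if the slot name appears verbatim
    (ignoring case) in the model's question; otherwise reveal nothing."""
    if masked_slot.lower() in question.lower():
        return masked_value
    return None

# Hypothetical case: the masked slot and its hidden value.
slot, value = "pregnancy status", "12 weeks pregnant"

# A question that names the slot category is answered.
direct = keyword_oracle("What is the patient's Pregnancy Status?", slot, value)

# A clinically equivalent paraphrase gets no answer from a literal match.
paraphrase = keyword_oracle("Is the patient currently expecting?", slot, value)

print(direct, paraphrase)
```

The paraphrased question is one a human clinician would answer without difficulty, yet a literal matcher of this kind returns nothing, which is the gap between naming a slot category and eliciting information through dialogue.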

Must Fix Items:
1. Add multiple-comparison correction (Bonferroni or FDR) to all per-slot-type analyses in Section 4.3, including the pregnancy status claim, or clearly flag these analyses as exploratory with uncorrected p-values
2. Report statistical power analysis: given the observed C-A gap of 0.6–0.9pp and the score variance, compute the sample size needed to detect this effect at 80% power, and discuss whether the study was adequately powered
3. Add at least one inter-grader reliability measure (e.g., a second grader on a subset, or human-grader agreement on a sample) to validate the rubric-based evaluation that underpins all findings
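
Items 1 and 2 can be sketched numerically as follows. This is a rough normal-approximation calculation for a paired comparison, not an analysis of the paper's data: the standard deviation of per-case score differences (`ASSUMED_SD_DIFF`) is a placeholder that the authors would need to replace with the observed value, and both function names are hypothetical. Only the 0.6–0.9pp gap, the 657-case subset, and the 8 slot types come from the review.

```python
import math
from statistics import NormalDist

def required_n_paired(delta, sd_diff, alpha=0.05, power=0.80):
    """Approximate number of paired cases needed to detect a mean difference
    `delta` with a two-sided test, using the normal approximation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) * sd_diff / delta) ** 2)

def bonferroni_alpha(alpha, n_comparisons):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_comparisons

# ASSUMPTION: placeholder SD of per-case score differences on the 0-1 scale.
ASSUMED_SD_DIFF = 0.15

for gap in (0.006, 0.009):  # C-A gap of 0.6-0.9pp expressed on the 0-1 scale
    n = required_n_paired(gap, ASSUMED_SD_DIFF)
    print(f"gap={gap}: about {n} paired cases needed (subset has 657)")

print("Bonferroni threshold for 8 slot types:", bonferroni_alpha(0.05, 8))
```

With this placeholder variance the required sample size exceeds the 657 available cases for both ends of the gap range. The conclusion depends on the true variance of the paired differences, which is why the power analysis should be reported from the observed data.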

Runs:
- run=1 score=4.2 verdict=Reject confidence=0.78 error=None