Title: DISAGREEMENT-GATED JUDGE REUSE: TRAINING-FREE SAFETY SIGNAL FOR MULTI-AGENT
PDF: judge-disagreement-gated-kv-reuse.pdf
Score: 3.2
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 51.2s

Strengths:
1. The paper identifies a real and practical problem: KV cache reuse in multi-agent LLM judge settings introduces position bias and inconsistency (JCR as low as 61-66% per Table 1), and proposes a principled, training-free solution to detect unreliable cases via disagreement between two structurally different reuse methods. This is a reasonable engineering insight with clear practical motivation (Section 3.2).
2. The random gating control experiment is a solid methodological choice that isolates the informational value of the disagreement signal from the trivial benefit of occasional dense fallback. The 5.63pp gap over random gating with statistical significance (p < 0.05) demonstrates that disagreement is genuinely informative rather than the method merely benefiting from fallback (Table 1, Table 2).
3. The paper is commendably transparent about its limitations, explicitly acknowledging the 4.6× latency overhead in Table 4 and discussing the cost-consistency tradeoff in Section 4.5. The partition analysis (Table 5) showing no significant difference in problem characteristics between agreement/disagreement sets further supports the generality claim.
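
For reference, the gating rule and its random control as this review understands them can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names, the integer verdict encoding, and the assumption that random gating falls back at a matched rate are all hypothetical.

```python
import random
from typing import Callable, Tuple

# Assumption: a verdict is the index of the candidate the judge selects.
Verdict = int


def disagreement_gate(
    verdict_a: Verdict,
    verdict_b: Verdict,
    dense_judge: Callable[[], Verdict],
) -> Tuple[Verdict, bool]:
    """Accept the reuse verdict when both reuse methods agree.

    Returns (verdict, used_fallback). The dense judge is only called
    when the two reuse verdicts differ.
    """
    if verdict_a == verdict_b:
        return verdict_a, False
    return dense_judge(), True


def random_gate(
    verdict_a: Verdict,
    fallback_rate: float,
    dense_judge: Callable[[], Verdict],
    rng: random.Random,
) -> Tuple[Verdict, bool]:
    """Control: fall back to dense with a fixed probability.

    The decision ignores the second reuse method entirely, so any gain
    over this control is attributable to the disagreement signal.
    """
    if rng.random() < fallback_rate:
        return dense_judge(), True
    return verdict_a, False
```

Comparing the two gates at the same fallback rate is what separates the value of the signal from the value of simply running dense prefill more often.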

Weaknesses:
1. The core contribution is incremental: it is essentially an ensemble of two reuse methods with a dense fallback. Running two KV reuse methods and falling back to dense on disagreement is a straightforward engineering heuristic, not a novel algorithmic or theoretical insight. The idea that 'if two approximations disagree, fall back to the gold standard' is a well-known pattern in distributed systems (Byzantine fault detection, N-version programming), and the paper does not sufficiently contextualize or distinguish its contribution from these established paradigms (Sections 3.2-3.3).
2. The latency overhead makes the practical utility questionable. DG-JKR runs at 4.61× the latency of dense prefill (Table 4), so it is slower than simply running dense prefill alone. Since dense prefill is already 100% consistent, the method trades 4.6× more latency for only 83% coverage of the fast path. When accounting for the 17% fallback, the expected latency is still 4.61× dense; there is no net speedup. The paper does not present any scenario where DG-JKR is actually faster than dense prefill, undermining its stated motivation of accelerating judge inference (Introduction: 'KV cache reuse offers an attractive approach to accelerate judge inference').
3. Experimental scope is narrow: only HumanEval (164 problems), only Llama-3.2-3B-Instruct, only N=4 candidates. The JCR metric itself is defined relative to dense prefill under shuffle, but dense prefill also exhibits position bias (Shi et al., 2024), so consistency with a biased reference does not establish correctness. The paper conflates consistency with reliability throughout. There is no evaluation on larger models, different judge architectures, or benchmarks beyond code generation (Section 4.1).
4. The statistical analysis is weak. Only 3 random seeds are used for the multi-seed validation (Table 3), and the main results (Table 1) appear to use a single seed (seed 42). With 164 problems, the effective sample size is small. The claimed p < 0.05 for the gap over random gating is reported without specifying the test procedure or effect size. The JCR-F standard deviation of ±0.62% across 3 seeds is suspiciously low and may reflect insufficient variation in the experimental design rather than genuine stability (Section 4.4).
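
The break-even condition behind weakness 2 can be made explicit with a simple cost model. The sketch below is an assumption of this review, not something taken from the paper: it supposes the two reuse passes run sequentially, that dense prefill is invoked only on disagreement, and that all costs are expressed in units of one dense prefill. The function names are hypothetical.

```python
def expected_latency(
    cost_a: float,
    cost_b: float,
    fallback_rate: float,
    dense_cost: float = 1.0,
) -> float:
    """Expected per-item latency of a gated pipeline.

    Assumes both reuse passes always run (sequentially) and the dense
    pass runs only for the fraction of items that trigger fallback.
    """
    return cost_a + cost_b + fallback_rate * dense_cost


def break_even_reuse_cost(fallback_rate: float, dense_cost: float = 1.0) -> float:
    """Largest combined reuse cost for which gating still beats dense.

    Solves cost_a + cost_b + fallback_rate * dense_cost < dense_cost
    for the combined reuse cost cost_a + cost_b.
    """
    return (1.0 - fallback_rate) * dense_cost
```

Under this model, with the 17% fallback rate discussed above, `break_even_reuse_cost(0.17)` gives 0.83: the two reuse passes together would have to cost less than 0.83× one dense prefill for the pipeline to be faster than dense alone. A reported overall latency of 4.61× is far outside that regime.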

Must Fix Items:
1. Address the fundamental cost-benefit inversion: DG-JKR is 4.6× slower than dense prefill while achieving only 74.38% JCR. The paper must either demonstrate a regime where DG-JKR is actually faster than dense prefill, or reframe the contribution away from 'acceleration' toward 'selective consistency improvement for scenarios where dual pipelines are already available.'
2. Provide proper statistical testing details: specify the exact test used for the p < 0.05 claim, report effect sizes, and use more than 3 seeds or a more robust resampling procedure to validate the JCR-F variance claim.
3. Evaluate on at least one additional benchmark beyond HumanEval and one additional model scale to support the generality claims made in the title and abstract ('Multi-Agent LLM Systems').
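
As one example of the resampling procedure requested in item 2, a paired bootstrap over per-problem outcomes would give both an effect size and an interval for the gap over random gating. The sketch below is illustrative only: it assumes each problem yields a 0/1 consistency outcome under each gating scheme, and the function name and defaults are hypothetical rather than drawn from the paper.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_gap(
    hits_a: Sequence[int],
    hits_b: Sequence[int],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Mean gap (a - b) with a 95% percentile bootstrap interval.

    hits_a[i] and hits_b[i] are 0/1 outcomes for the same problem i
    under two gating schemes, so problems are resampled as pairs.
    """
    if len(hits_a) != len(hits_b) or not hits_a:
        raise ValueError("need two non-empty sequences of equal length")
    rng = random.Random(seed)
    n = len(hits_a)
    # Per-problem differences keep the pairing intact under resampling.
    diffs = [a - b for a, b in zip(hits_a, hits_b)]
    observed = sum(diffs) / n
    gaps = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample problems with replacement
        gaps.append(sum(sample) / n)
    gaps.sort()
    lower = gaps[int(0.025 * n_resamples)]
    upper = gaps[int(0.975 * n_resamples) - 1]
    return observed, lower, upper
```

Reporting the observed gap together with this interval (and stating whether it excludes zero) would address both the missing test procedure and the missing effect size.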

Runs:
- run=1 score=3.2 verdict=Strong Reject confidence=0.6 error=None