Title: ENTITY-ANONYMIZED CONTEXT PROMPTS FOR IMPROVING CONTEXT FAITHFULNESS IN KNOWLEDGE-CONFLICT QA FARS
PDF: 9e1d02d3-c5a2-4801-b3b6-31cc5fda3c3a.pdf
Score: 5.0
Verdict: Reject
Confidence: 0.72
Elapsed: 156.5s

Strengths:
1. Three-condition controlled experiment (A/B/C) cleanly isolates output format from anonymization in principle; using Condition B as a matched control is a methodologically sound design (Section 3.3, Table 1)
2. Honest reporting that structured output format alone hurts: Condition B (Pc=32.47%) is worse than baseline A (Pc=52.43%), and the authors report this rather than suppressing it, which strengthens scientific credibility (Table 1, Section 4.2)
3. Cross-model generalization demonstrated: both Llama-3.1-8B (+42.28 Pc) and Qwen2.5-7B (+49.20 Pc) show large improvements, suggesting entity-triggered parametric recall is a general phenomenon in instruction-tuned LLMs (Table 3, Section 4.4)
4. No-harm control experiment on non-counterfactual contexts (Table 4, EM=88.00% vs baseline 66.40%) provides a useful safety signal for deployment considerations (Section 4.5)

Weaknesses:
1. Phantom entity tagging conflated with anonymization in main results: Section 3.4 describes marking MC distractors with '[not in text]' as an 'optimization' included in Condition C but not Condition B. In a multiple-choice setting, being told an option is absent from text is a powerful answer-elimination cue — this is not a minor optimization but a fundamentally different evaluation condition. No ablation separates phantom tagging from anonymization, compromising the +42.28 Pc causal claim (Section 3.4, Table 1)
2. No statistical significance tests anywhere: all results are point estimates without confidence intervals, p-values, or bootstrapping. The complementarity analysis (Table 2, C=72.87% vs E=76.47%, Δ=3.60pp on n=1,500) and no-harm control (Table 4, n=500) are particularly vulnerable to sampling noise. HF_NO_SIGNIFICANCE applies (Tables 1-4, Section 4)
3. Context-DPO comparison is confounded by different base models: Context-DPO result (Pc=54.9%) comes from Qwen2-7B while EACP result (Pc=74.75%) is from Llama-3.1-8B. The authors acknowledge this parenthetically but still claim EACP 'outperforms' Context-DPO — this is not a valid comparison (Table 1, Section 4.2)
4. Single benchmark (ConFiQA-MC) perfectly matched to method: the benchmark is explicitly designed for entity-level counterfactual substitutions in multiple-choice format, which is exactly what EACP exploits. No evaluation on other knowledge-conflict benchmarks (FaithEval, ConFiQA open-ended), non-MC formats, or tasks like summarization/dialogue (Section 4.1)
5. No-harm control results contradict the mechanism story: if entity names trigger parametric recall that overrides context, anonymizing should be neutral or harmful when context aligns with parametric knowledge. Instead, EACP improves accuracy from 66.40% to 88.00% on non-conflict data, suggesting the mechanism involves more than just breaking entity-triggered recall — the structured inventory format itself may guide attention, independent of the conflict mechanism (Section 4.5, Table 4, Discussion)
6. Entity extraction uses gold annotations, not NER: the method's first step relies on 'entities explicitly annotated in the benchmark metadata' with no experiment using actual NER and no analysis of how extraction errors propagate. This is a significant deployment gap left unquantified (Section 3.2, Section 5)
7. Self-consistency decoding (k=8, τ=0.7) is mentioned as an optimization but it is unclear whether Conditions A and B also use it — if not, this is yet another confound in the main comparison (Section 3.4, Table 1)
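
As a rough sanity check on the sampling-noise concern in Weakness 2, the sketch below computes a normal-approximation 95% interval for the C-vs-E gap from the reported point estimates alone. It assumes n=1,500 items per condition and treats the two conditions as independent samples, which ignores the pairing across items. The function name is illustrative and does not come from the paper.

```python
import math

def diff_ci_unpaired(p1, p2, n1, n2, z=1.96):
    """Normal-approximation CI for p2 - p1, assuming independent samples."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p2 - p1
    return diff, se, (diff - z * se, diff + z * se)

# Reported point estimates from Table 2 (Condition C vs Condition E).
# ASSUMPTION: n = 1,500 per condition, independent samples.
diff, se, (lo, hi) = diff_ci_unpaired(0.7287, 0.7647, 1500, 1500)
print(f"diff={diff:.4f} se={se:.4f} ci=({lo:.4f}, {hi:.4f})")
```

Because both conditions are scored on the same items, a paired analysis on per-item correctness would be more appropriate; this approximation only indicates the order of magnitude of the noise.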

Must Fix Items:
1. Ablate phantom entity tagging separately from anonymization: run Condition C without '[not in text]' markers to isolate the anonymization effect. Without this, the +42.28 Pc claim cannot be attributed to anonymization alone
2. Add statistical significance tests (bootstrap confidence intervals or paired tests) for all key comparisons, especially B-vs-C (main claim) and C-vs-E (complementarity)
3. Run Context-DPO on the same model (Llama-3.1-8B) or run EACP on Qwen2-7B to enable fair comparison, and remove or qualify the 'outperforms' claim until apples-to-apples data exists
4. Clarify whether self-consistency decoding is applied to all conditions or only Condition C; if only C, add it to B for fair comparison
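
Must Fix item 2 could be addressed with a paired bootstrap along the following lines. This is a minimal sketch, assuming per-item 0/1 correctness vectors are available for two conditions scored on the same items. The function name, its arguments, and the synthetic data are hypothetical and are not taken from the paper.

```python
import random

def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(correct_b) - mean(correct_a).

    Items are resampled jointly, so the pairing between conditions is kept.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("conditions must be scored on the same items")
    n = len(correct_a)
    rng = random.Random(seed)
    # Per-item difference in correctness: -1, 0, or +1.
    per_item = [b - a for a, b in zip(correct_a, correct_b)]
    observed = sum(per_item) / n
    diffs = []
    for _ in range(n_resamples):
        sample = rng.choices(per_item, k=n)  # resample items with replacement
        diffs.append(sum(sample) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, (lo, hi)

# Synthetic illustration only: these vectors are NOT the paper's data.
rng = random.Random(1)
cond_b = [1 if rng.random() < 0.35 else 0 for _ in range(500)]
cond_c = [1 if rng.random() < 0.75 else 0 for _ in range(500)]
observed, (lo, hi) = paired_bootstrap_ci(cond_b, cond_c)
print(f"observed diff={observed:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

An interval that excludes zero would support the corresponding comparison; the same function applies to B-vs-C and C-vs-E once per-item predictions are released.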

Runs:
- run=1 score=5.0 verdict=Reject confidence=0.72 error=None