{
  "pdf": "9e1d02d3-c5a2-4801-b3b6-31cc5fda3c3a.pdf",
  "title": "ENTITY-ANONYMIZED CONTEXT PROMPTS FOR IMPROVING CONTEXT FAITHFULNESS IN KNOWLEDGE-CONFLICT QA FARS",
  "elapsed": 156.5,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 5.0,
  "scores": [
    5.0
  ],
  "score_std": 0.0,
  "final_verdict": "Reject",
  "final_confidence": 0.72,
  "conference_scores": null,
  "strengths": [
    "The three-condition controlled experiment (A/B/C) in principle cleanly isolates output format from anonymization; using Condition B as a matched control is a methodologically sound design (Section 3.3, Table 1)",
    "Honest reporting that structured output format alone hurts: Condition B (Pc=32.47%) is worse than baseline A (Pc=52.43%), and the authors report this rather than suppressing it, which strengthens scientific credibility (Table 1, Section 4.2)",
    "Cross-model generalization demonstrated: both Llama-3.1-8B (+42.28 Pc) and Qwen2.5-7B (+49.20 Pc) show large improvements, suggesting entity-triggered parametric recall is a general phenomenon in instruction-tuned LLMs (Table 3, Section 4.4)",
    "No-harm control experiment on non-counterfactual contexts (Table 4, EM=88.00% vs baseline 66.40%) provides a useful safety signal for deployment considerations (Section 4.5)"
  ],
  "weaknesses": [
    "Phantom entity tagging conflated with anonymization in main results: Section 3.4 describes marking MC distractors with '[not in text]' as an 'optimization' included in Condition C but not Condition B. In a multiple-choice setting, being told an option is absent from the text is a powerful answer-elimination cue, so this is not a minor optimization but a fundamentally different evaluation condition. No ablation separates phantom tagging from anonymization, compromising the +42.28 Pc causal claim (Section 3.4, Table 1)",
    "No statistical significance tests anywhere: all results are point estimates without confidence intervals, p-values, or bootstrapping. The complementarity analysis (Table 2, C=72.87% vs E=76.47%, Δ=3.60pp on n=1,500) and no-harm control (Table 4, n=500) are particularly vulnerable to sampling noise. HF_NO_SIGNIFICANCE applies (Tables 1-4, Section 4)",
    "Context-DPO comparison is confounded by different base models: Context-DPO result (Pc=54.9%) comes from Qwen2-7B while EACP result (Pc=74.75%) is from Llama-3.1-8B. The authors acknowledge this parenthetically but still claim EACP 'outperforms' Context-DPO, which is not a valid comparison (Table 1, Section 4.2)",
    "Single benchmark (ConFiQA-MC) perfectly matched to method: the benchmark is explicitly designed for entity-level counterfactual substitutions in multiple-choice format, which is exactly what EACP exploits. No evaluation on other knowledge-conflict benchmarks (FaithEval, ConFiQA open-ended), non-MC formats, or tasks like summarization/dialogue (Section 4.1)",
    "No-harm control results contradict the mechanism story: if entity names trigger parametric recall that overrides context, anonymizing should be neutral or harmful when context aligns with parametric knowledge. Instead, EACP improves accuracy from 66.40% to 88.00% on non-conflict data, suggesting the mechanism involves more than just breaking entity-triggered recall: the structured inventory format itself may guide attention, independent of the conflict mechanism (Section 4.5, Table 4, Discussion)",
    "Entity extraction uses gold annotations, not NER: the method's first step relies on 'entities explicitly annotated in the benchmark metadata' with no experiment using actual NER and no analysis of how extraction errors propagate. This is a significant deployment gap left unquantified (Section 3.2, Section 5)",
    "Self-consistency decoding (k=8, τ=0.7) is mentioned as an optimization, but it is unclear whether Conditions A and B also use it; if not, this is yet another confound in the main comparison (Section 3.4, Table 1)"
  ],
  "must_fix_items": [
    "Ablate phantom entity tagging separately from anonymization: run Condition C without '[not in text]' markers to isolate the anonymization effect. Without this, the +42.28 Pc claim cannot be attributed to anonymization alone",
    "Add statistical significance tests (bootstrap confidence intervals or paired tests) for all key comparisons, especially B-vs-C (main claim) and C-vs-E (complementarity)",
    "Run Context-DPO on the same model (Llama-3.1-8B) or run EACP on Qwen2-7B to enable fair comparison, and remove or qualify the 'outperforms' claim until apples-to-apples data exists",
    "Clarify whether self-consistency decoding is applied to all conditions or only Condition C; if only C, add it to B for fair comparison"
  ],
  "runs": [
    {
      "run": 1,
      "score": 5.0,
      "verdict": "Reject",
      "confidence": 0.72,
      "strengths": [
        "The three-condition controlled experiment (A/B/C) in principle cleanly isolates output format from anonymization; using Condition B as a matched control is a methodologically sound design (Section 3.3, Table 1)",
        "Honest reporting that structured output format alone hurts: Condition B (Pc=32.47%) is worse than baseline A (Pc=52.43%), and the authors report this rather than suppressing it, which strengthens scientific credibility (Table 1, Section 4.2)",
        "Cross-model generalization demonstrated: both Llama-3.1-8B (+42.28 Pc) and Qwen2.5-7B (+49.20 Pc) show large improvements, suggesting entity-triggered parametric recall is a general phenomenon in instruction-tuned LLMs (Table 3, Section 4.4)",
        "No-harm control experiment on non-counterfactual contexts (Table 4, EM=88.00% vs baseline 66.40%) provides a useful safety signal for deployment considerations (Section 4.5)"
      ],
      "weaknesses": [
        "Phantom entity tagging conflated with anonymization in main results: Section 3.4 describes marking MC distractors with '[not in text]' as an 'optimization' included in Condition C but not Condition B. In a multiple-choice setting, being told an option is absent from the text is a powerful answer-elimination cue, so this is not a minor optimization but a fundamentally different evaluation condition. No ablation separates phantom tagging from anonymization, compromising the +42.28 Pc causal claim (Section 3.4, Table 1)",
        "No statistical significance tests anywhere: all results are point estimates without confidence intervals, p-values, or bootstrapping. The complementarity analysis (Table 2, C=72.87% vs E=76.47%, Δ=3.60pp on n=1,500) and no-harm control (Table 4, n=500) are particularly vulnerable to sampling noise. HF_NO_SIGNIFICANCE applies (Tables 1-4, Section 4)",
        "Context-DPO comparison is confounded by different base models: Context-DPO result (Pc=54.9%) comes from Qwen2-7B while EACP result (Pc=74.75%) is from Llama-3.1-8B. The authors acknowledge this parenthetically but still claim EACP 'outperforms' Context-DPO, which is not a valid comparison (Table 1, Section 4.2)",
        "Single benchmark (ConFiQA-MC) perfectly matched to method: the benchmark is explicitly designed for entity-level counterfactual substitutions in multiple-choice format, which is exactly what EACP exploits. No evaluation on other knowledge-conflict benchmarks (FaithEval, ConFiQA open-ended), non-MC formats, or tasks like summarization/dialogue (Section 4.1)",
        "No-harm control results contradict the mechanism story: if entity names trigger parametric recall that overrides context, anonymizing should be neutral or harmful when context aligns with parametric knowledge. Instead, EACP improves accuracy from 66.40% to 88.00% on non-conflict data, suggesting the mechanism involves more than just breaking entity-triggered recall: the structured inventory format itself may guide attention, independent of the conflict mechanism (Section 4.5, Table 4, Discussion)",
        "Entity extraction uses gold annotations, not NER: the method's first step relies on 'entities explicitly annotated in the benchmark metadata' with no experiment using actual NER and no analysis of how extraction errors propagate. This is a significant deployment gap left unquantified (Section 3.2, Section 5)",
        "Self-consistency decoding (k=8, τ=0.7) is mentioned as an optimization, but it is unclear whether Conditions A and B also use it; if not, this is yet another confound in the main comparison (Section 3.4, Table 1)"
      ],
      "must_fix_items": [
        "Ablate phantom entity tagging separately from anonymization: run Condition C without '[not in text]' markers to isolate the anonymization effect. Without this, the +42.28 Pc claim cannot be attributed to anonymization alone",
        "Add statistical significance tests (bootstrap confidence intervals or paired tests) for all key comparisons, especially B-vs-C (main claim) and C-vs-E (complementarity)",
        "Run Context-DPO on the same model (Llama-3.1-8B) or run EACP on Qwen2-7B to enable fair comparison, and remove or qualify the 'outperforms' claim until apples-to-apples data exists",
        "Clarify whether self-consistency decoding is applied to all conditions or only Condition C; if only C, add it to B for fair comparison"
      ],
      "conference_scores": null
    }
  ]
}
