{
  "pdf": "counterfactual-reference-swap-verifier.pdf",
  "title": "REFSWAP: COUNTERFACTUAL REFERENCE-SWAP VERIFICATION FOR ROBUST LLM VERIFIERS",
  "elapsed": 78.4,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 4.5,
  "scores": [
    4.5
  ],
  "score_std": 0,
  "final_verdict": "Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.3,
    "presentation": 3,
    "contribution": 2.2,
    "overall_rating": 4.5,
    "confidence": 3
  },
  "strengths": [
    "Clear problem identification with quantitative evidence: The paper identifies and quantifies the master-key vulnerability in RLVR verifiers, demonstrating 25–29% false positive rates on state-of-the-art systems (xVerify and Qwen2.5-7B). Table 1 shows baseline FPRs of 25.50% (xVerify) and 29.07% (Qwen), providing concrete motivation for the work.",
    "Strong empirical results on xVerify: Multi-CF RefSwap achieves 96.8% relative FPR reduction (25.50%→0.81%) with only 2.74pp accuracy cost on xVerify-7B-I (Table 1). The near-perfect AUC of 0.991 (Section 4.3, Figure 2) and Cohen's d=1.74 demonstrate a large, well-measured effect size. Table 2 shows uniform effectiveness across all 10 master key types (punctuation, words, sentences, non-English tokens).",
    "Honest reporting of limitations: The paper transparently reports that RefSwap K=1 completely fails (0% FPR reduction on both backbones, Table 1), and critically, that the method does not work on Qwen2.5-7B-Instruct at all (Section 4.6). This honesty about boundary conditions is a strength that increases trust in the reported positive results."
  ],
  "weaknesses": [
    "Severe backbone dependency undermines generality: RefSwap works on only 1 of 2 tested backbones (xVerify but not Qwen, Section 4.6). The method's core mechanism, exploiting 'self-solving asymmetry', is architecture-dependent rather than general. The paper provides no theoretical characterization of which verifier architectures will be amenable, making it unclear how broadly applicable the method is. With only 2 backbones tested and a 50% failure rate, the generalizability claim is weak.",
    "The 'self-solving' explanation is under-analyzed and potentially contradictory: Section 3.2 claims true positives exhibit 'self-solving behavior' (verifier recognizes correctness even with mismatched references). But Section 4.6 says Qwen 'self-solves problems regardless of the reference, producing high max p_cf for both true positives and master-key false positives.' This means self-solving is not a reliable asymmetry—it can benefit master keys too. The paper does not provide a rigorous analysis of when self-solving helps vs. hurts, relying instead on empirical observation on a single favorable backbone.",
    "Limited attack model and evaluation scope: Only 10 hand-crafted master keys from prior work (Zhao et al., 2025) are tested. There is no evaluation against adaptive adversaries who could optimize keys to evade RefSwap, nor against other attack types (e.g., adversarially optimized responses that partially mimic self-solving behavior). The master-key stress test uses 956 questions, a relatively small set, and no uncertainty estimates or significance tests (e.g., confidence intervals, permutation tests) are reported for the FPR reductions."
  ],
  "must_fix_items": [
    "Report uncertainty estimates and significance tests (e.g., bootstrap confidence intervals, permutation tests) for all reported FPR reductions and accuracy differences. The current results include no uncertainty quantification.",
    "Test on at least one or two additional verifier backbones to assess generalizability beyond the 2 currently tested. The current 50% success rate (1/2 backbones) is insufficient to claim a general method.",
    "Evaluate against adaptive adversaries who know about RefSwap and attempt to craft master keys that achieve high max p_cf. Without this, the robustness claims are incomplete."
  ],
  "runs": [
    {
      "run": 1,
      "score": 4.5,
      "verdict": "Reject",
      "confidence": 0.6,
      "strengths": [
        "Clear problem identification with quantitative evidence: The paper identifies and quantifies the master-key vulnerability in RLVR verifiers, demonstrating 25–29% false positive rates on state-of-the-art systems (xVerify and Qwen2.5-7B). Table 1 shows baseline FPRs of 25.50% (xVerify) and 29.07% (Qwen), providing concrete motivation for the work.",
        "Strong empirical results on xVerify: Multi-CF RefSwap achieves 96.8% relative FPR reduction (25.50%→0.81%) with only 2.74pp accuracy cost on xVerify-7B-I (Table 1). The near-perfect AUC of 0.991 (Section 4.3, Figure 2) and Cohen's d=1.74 demonstrate a large, well-measured effect size. Table 2 shows uniform effectiveness across all 10 master key types (punctuation, words, sentences, non-English tokens).",
        "Honest reporting of limitations: The paper transparently reports that RefSwap K=1 completely fails (0% FPR reduction on both backbones, Table 1), and critically, that the method does not work on Qwen2.5-7B-Instruct at all (Section 4.6). This honesty about boundary conditions is a strength that increases trust in the reported positive results."
      ],
      "weaknesses": [
        "Severe backbone dependency undermines generality: RefSwap works on only 1 of 2 tested backbones (xVerify but not Qwen, Section 4.6). The method's core mechanism, exploiting 'self-solving asymmetry', is architecture-dependent rather than general. The paper provides no theoretical characterization of which verifier architectures will be amenable, making it unclear how broadly applicable the method is. With only 2 backbones tested and a 50% failure rate, the generalizability claim is weak.",
        "The 'self-solving' explanation is under-analyzed and potentially contradictory: Section 3.2 claims true positives exhibit 'self-solving behavior' (verifier recognizes correctness even with mismatched references). But Section 4.6 says Qwen 'self-solves problems regardless of the reference, producing high max p_cf for both true positives and master-key false positives.' This means self-solving is not a reliable asymmetry—it can benefit master keys too. The paper does not provide a rigorous analysis of when self-solving helps vs. hurts, relying instead on empirical observation on a single favorable backbone.",
        "Limited attack model and evaluation scope: Only 10 hand-crafted master keys from prior work (Zhao et al., 2025) are tested. There is no evaluation against adaptive adversaries who could optimize keys to evade RefSwap, nor against other attack types (e.g., adversarially optimized responses that partially mimic self-solving behavior). The master-key stress test uses 956 questions, a relatively small set, and no uncertainty estimates or significance tests (e.g., confidence intervals, permutation tests) are reported for the FPR reductions."
      ],
      "must_fix_items": [
        "Report uncertainty estimates and significance tests (e.g., bootstrap confidence intervals, permutation tests) for all reported FPR reductions and accuracy differences. The current results include no uncertainty quantification.",
        "Test on at least one or two additional verifier backbones to assess generalizability beyond the 2 currently tested. The current 50% success rate (1/2 backbones) is insufficient to claim a general method.",
        "Evaluate against adaptive adversaries who know about RefSwap and attempt to craft master keys that achieve high max p_cf. Without this, the robustness claims are incomplete."
      ],
      "conference_scores": {
        "soundness": 2.3,
        "presentation": 3,
        "contribution": 2.2,
        "overall_rating": 4.5,
        "confidence": 3
      }
    }
  ]
}