{
  "pdf": "context-bagging-noisybench.pdf",
  "title": "CONTEXT BAGGING: INFERENCE-TIME ENSEMBLING FOR ROBUST LONG-CONTEXT QA UNDER HARD DISTRACTORS FARS",
  "elapsed": 51,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.5,
  "scores": [
    3.5
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.3,
    "presentation": 2.5,
    "contribution": 1.8,
    "overall_rating": 3.5,
    "confidence": 3
  },
  "strengths": [
    "Clear problem framing and motivated hypothesis: The paper correctly identifies that self-consistency fails under hard distractors because errors are context-driven, not decoding-driven (Section 3.2, agreement rate 0.85, wrong-answer concentration 0.82), and proposes a logically coherent alternative—perturbing context rather than decoding trajectories.",
    "Honest and revealing ablation results: The ablation in Table 2 shows that order shuffling alone (+1.44 EM) is the dominant mechanism, while subset diversity provides only marginal benefit (+0.40 EM). This is a genuinely surprising finding that the authors transparently report, even though it somewhat undermines the novelty of their full CoBag method. The fact that CoBag-Vote does NOT significantly outperform Permute-Vote (p=0.641, Table 1) is clearly stated.",
    "Controlled experimental design with distractor inclusion verification: The authors verify that the hard distractor appears in 100% of CoBag contexts (11,570/11,570), ruling out the trivial explanation that improvement comes from distractor exclusion (Section 4.2). McNemar's test is used for significance, which is appropriate for paired binary outcomes."
  ],
  "weaknesses": [
    "The core proposed method (CoBag-Vote) is NOT significantly better than the much simpler Permute-Vote baseline (p=0.641, Table 1). Permute-Vote—simply shuffling paragraph order K times and voting—is a trivially implementable procedure. CoBag adds relevance-weighted subset sampling on top, but this provides only +0.39 EM (non-significant). The paper's title and framing ('Context Bagging') suggest a novel method, but the effective mechanism is just order shuffling, which is essentially permutation self-consistency (Tang et al., 2023) applied to QA contexts. This is a severe over-packaging issue.",
    "Extremely limited experimental evaluation: Only one dataset (MuSiQue), one model (Qwen2.5-7B-Instruct), and one specific noise configuration (1 hard distractor placed at the end of the sequence) are tested. There is no evaluation on other long-context QA benchmarks (e.g., HotpotQA, 2WikiMultiHopQA), no testing with multiple distractors, no testing with distractors in other positions, and no testing with different model families or sizes. The claim of 'robust long-context QA' is vastly overgeneralized from this single experimental setting.",
    "Close prior work undercited and under-differentiated: Tang et al. (2023) proposed 'permutation self-consistency' for listwise ranking—shuffling list order and voting—which is functionally identical to the dominant mechanism (order shuffling) in CoBag. The paper cites this work in Related Work but does not adequately differentiate CoBag from it, given that CoBag's additional component (subset sampling) is shown to be marginal. RePlug (Shi et al., 2023) also ensembles across different retrieved document sets with voting, which is very similar to CoBag's subset sampling component. The novelty gap is thin."
  ],
  "must_fix_items": [
    "Acknowledge explicitly in the title/abstract that the effective mechanism is order shuffling (permutation self-consistency), not the full CoBag pipeline, since CoBag-Vote is not significantly better than Permute-Vote (p=0.641). The current framing is misleading.",
    "Evaluate on at least 2-3 additional datasets and/or model families to support the general claim of 'robust long-context QA'. A single model on a single dataset with a single noise configuration is insufficient.",
    "Provide a proper differentiation from Tang et al. (2023) permutation self-consistency and RePlug (Shi et al., 2023), explaining what CoBag adds beyond these when its dominant component is equivalent to the former."
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.5,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "Clear problem framing and motivated hypothesis: The paper correctly identifies that self-consistency fails under hard distractors because errors are context-driven, not decoding-driven (Section 3.2, agreement rate 0.85, wrong-answer concentration 0.82), and proposes a logically coherent alternative—perturbing context rather than decoding trajectories.",
        "Honest and revealing ablation results: The ablation in Table 2 shows that order shuffling alone (+1.44 EM) is the dominant mechanism, while subset diversity provides only marginal benefit (+0.40 EM). This is a genuinely surprising finding that the authors transparently report, even though it somewhat undermines the novelty of their full CoBag method. The fact that CoBag-Vote does NOT significantly outperform Permute-Vote (p=0.641, Table 1) is clearly stated.",
        "Controlled experimental design with distractor inclusion verification: The authors verify that the hard distractor appears in 100% of CoBag contexts (11,570/11,570), ruling out the trivial explanation that improvement comes from distractor exclusion (Section 4.2). McNemar's test is used for significance, which is appropriate for paired binary outcomes."
      ],
      "weaknesses": [
        "The core proposed method (CoBag-Vote) is NOT significantly better than the much simpler Permute-Vote baseline (p=0.641, Table 1). Permute-Vote—simply shuffling paragraph order K times and voting—is a trivially implementable procedure. CoBag adds relevance-weighted subset sampling on top, but this provides only +0.39 EM (non-significant). The paper's title and framing ('Context Bagging') suggest a novel method, but the effective mechanism is just order shuffling, which is essentially permutation self-consistency (Tang et al., 2023) applied to QA contexts. This is a severe over-packaging issue.",
        "Extremely limited experimental evaluation: Only one dataset (MuSiQue), one model (Qwen2.5-7B-Instruct), and one specific noise configuration (1 hard distractor placed at the end of the sequence) are tested. There is no evaluation on other long-context QA benchmarks (e.g., HotpotQA, 2WikiMultiHopQA), no testing with multiple distractors, no testing with distractors in other positions, and no testing with different model families or sizes. The claim of 'robust long-context QA' is vastly overgeneralized from this single experimental setting.",
        "Close prior work undercited and under-differentiated: Tang et al. (2023) proposed 'permutation self-consistency' for listwise ranking—shuffling list order and voting—which is functionally identical to the dominant mechanism (order shuffling) in CoBag. The paper cites this work in Related Work but does not adequately differentiate CoBag from it, given that CoBag's additional component (subset sampling) is shown to be marginal. RePlug (Shi et al., 2023) also ensembles across different retrieved document sets with voting, which is very similar to CoBag's subset sampling component. The novelty gap is thin."
      ],
      "must_fix_items": [
        "Acknowledge explicitly in the title/abstract that the effective mechanism is order shuffling (permutation self-consistency), not the full CoBag pipeline, since CoBag-Vote is not significantly better than Permute-Vote (p=0.641). The current framing is misleading.",
        "Evaluate on at least 2-3 additional datasets and/or model families to support the general claim of 'robust long-context QA'. A single model on a single dataset with a single noise configuration is insufficient.",
        "Provide a proper differentiation from Tang et al. (2023) permutation self-consistency and RePlug (Shi et al., 2023), explaining what CoBag adds beyond these when its dominant component is equivalent to the former."
      ],
      "conference_scores": {
        "soundness": 2.3,
        "presentation": 2.5,
        "contribution": 1.8,
        "overall_rating": 3.5,
        "confidence": 3
      }
    }
  ]
}