{
  "pdf": "confidence-bounded-unit-test-rewards.pdf",
  "title": "CONFIDENCE-BOUNDED UNIT-TEST REWARDS REINFORCEMENT LEARNING FROM VERIFIABLE RE-WARDS FARS",
  "elapsed": 51.3,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.8,
  "scores": [
    3.8
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.2,
    "presentation": 2.8,
    "contribution": 2,
    "overall_rating": 3.8,
    "confidence": 3
  },
  "strengths": [
    "Principled statistical formulation: The LCB reward is grounded in Bayesian statistics (Beta posterior over Bernoulli trials), providing a theoretically justified conservative estimate rather than an ad hoc heuristic. The formulation in Equations 2-4 is clean and well-motivated by the finite-sample uncertainty problem described in Section 3.2.",
    "Compute-efficiency result is notable: LCB (m=5) achieving 57.0% Pass@1 on MBPP+ vs Pass-rate (2m=10) at 54.1% (Table 1) demonstrates that principled uncertainty quantification can outperform brute-force test scaling with half the compute budget. This is a practically useful finding for resource-constrained RLVR training.",
    "Honest discussion of mechanism and limitations: Section 4.5 explicitly acknowledges that LCB does not improve solution ranking (Spearman correlation identical to pass-rate at 0.87, within-group rank correlation 0.9996) and that the benefit comes from reward value compression affecting GRPO advantage normalization. The authors also candidly disclose the hyperparameter discrepancy between LCB and baselines (5 epochs, lr=1e-5 vs 3 epochs, lr=5e-6)."
  ],
  "weaknesses": [
    "Critical confound: different training hyperparameters across methods. Section 4.5 acknowledges that LCB uses 5 epochs with lr=1e-5 while baselines use 3 epochs with lr=5e-6. This makes the main comparison in Table 1 fundamentally unfair. The 3.5 percentage point improvement on MBPP+ could be entirely or partially attributable to longer training and higher learning rate rather than the LCB reward itself. This is a HF_UNFAIR_BASELINE level concern.",
    "The claimed mechanism is contradicted by the authors' own analysis. The paper frames LCB as 'principled conservative estimate that accounts for finite-sample uncertainty' (Section 3.3, 3.4), yet Section 4.5 reveals that LCB does not actually improve ranking quality over pass-rate (Spearman 0.87 identical, rank correlation 0.9996). The benefit comes from reward compression changing gradient dynamics in GRPO, which is a qualitatively different mechanism than what the paper's narrative suggests. This is over-packaging: a statistical uncertainty story is told, but the actual working mechanism is reward scale compression.",
    "Limited experimental scale and scope: Only one small model (Qwen2.5-Coder-1.5B) is tested on two benchmarks from the same domain (Python code generation). No results on larger models (7B, 14B, 70B), no other domains (math reasoning, other languages), no statistical significance tests (variance/error bars absent from Tables 1-2). The Pass@1 numbers could be within noise given the absence of confidence intervals.",
    "Pessimistic baseline is weak and unoptimized. The Pessimistic reward (max(0, npass/m - λ/√m)) with λ=1 is a crude heuristic. The paper does not tune λ, yet δ is tuned for LCB. A fairer comparison would optimize both or use Wilson score interval lower bound (which is the frequentist analogue of LCB and arguably a more standard baseline)."
  ],
  "must_fix_items": [
    "Run baselines with the same training hyperparameters (5 epochs, lr=1e-5) as LCB, or run LCB with baseline hyperparameters (3 epochs, lr=5e-6), and report both sets of results. Without this, the main claim is unsubstantiated.",
    "Add error bars or confidence intervals to Tables 1-2. Report results across multiple random seeds to establish statistical significance of the reported improvements.",
    "Revise the narrative to accurately reflect the mechanism. If reward compression drives the benefit rather than principled uncertainty quantification, the introduction and method sections should be restructured accordingly."
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.8,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "Principled statistical formulation: The LCB reward is grounded in Bayesian statistics (Beta posterior over Bernoulli trials), providing a theoretically justified conservative estimate rather than an ad hoc heuristic. The formulation in Equations 2-4 is clean and well-motivated by the finite-sample uncertainty problem described in Section 3.2.",
        "Compute-efficiency result is notable: LCB (m=5) achieving 57.0% Pass@1 on MBPP+ vs Pass-rate (2m=10) at 54.1% (Table 1) demonstrates that principled uncertainty quantification can outperform brute-force test scaling with half the compute budget. This is a practically useful finding for resource-constrained RLVR training.",
        "Honest discussion of mechanism and limitations: Section 4.5 explicitly acknowledges that LCB does not improve solution ranking (Spearman correlation identical to pass-rate at 0.87, within-group rank correlation 0.9996) and that the benefit comes from reward value compression affecting GRPO advantage normalization. The authors also candidly disclose the hyperparameter discrepancy between LCB and baselines (5 epochs, lr=1e-5 vs 3 epochs, lr=5e-6)."
      ],
      "weaknesses": [
        "Critical confound: different training hyperparameters across methods. Section 4.5 acknowledges that LCB uses 5 epochs with lr=1e-5 while baselines use 3 epochs with lr=5e-6. This makes the main comparison in Table 1 fundamentally unfair. The 3.5 percentage point improvement on MBPP+ could be entirely or partially attributable to longer training and higher learning rate rather than the LCB reward itself. This is a HF_UNFAIR_BASELINE level concern.",
        "The claimed mechanism is contradicted by the authors' own analysis. The paper frames LCB as 'principled conservative estimate that accounts for finite-sample uncertainty' (Section 3.3, 3.4), yet Section 4.5 reveals that LCB does not actually improve ranking quality over pass-rate (Spearman 0.87 identical, rank correlation 0.9996). The benefit comes from reward compression changing gradient dynamics in GRPO, which is a qualitatively different mechanism than what the paper's narrative suggests. This is over-packaging: a statistical uncertainty story is told, but the actual working mechanism is reward scale compression.",
        "Limited experimental scale and scope: Only one small model (Qwen2.5-Coder-1.5B) is tested on two benchmarks from the same domain (Python code generation). No results on larger models (7B, 14B, 70B), no other domains (math reasoning, other languages), no statistical significance tests (variance/error bars absent from Tables 1-2). The Pass@1 numbers could be within noise given the absence of confidence intervals.",
        "Pessimistic baseline is weak and unoptimized. The Pessimistic reward (max(0, npass/m - λ/√m)) with λ=1 is a crude heuristic. The paper does not tune λ, yet δ is tuned for LCB. A fairer comparison would optimize both or use Wilson score interval lower bound (which is the frequentist analogue of LCB and arguably a more standard baseline)."
      ],
      "must_fix_items": [
        "Run baselines with the same training hyperparameters (5 epochs, lr=1e-5) as LCB, or run LCB with baseline hyperparameters (3 epochs, lr=5e-6), and report both sets of results. Without this, the main claim is unsubstantiated.",
        "Add error bars or confidence intervals to Tables 1-2. Report results across multiple random seeds to establish statistical significance of the reported improvements.",
        "Revise the narrative to accurately reflect the mechanism. If reward compression drives the benefit rather than principled uncertainty quantification, the introduction and method sections should be restructured accordingly."
      ],
      "conference_scores": {
        "soundness": 2.2,
        "presentation": 2.8,
        "contribution": 2,
        "overall_rating": 3.8,
        "confidence": 3
      }
    }
  ]
}