{
  "pdf": "adaptive-mispo-acceptance-control.pdf",
  "title": "ACCEPTANCE-CONTROLLED MIS-PO: ADAPTIVE TRAJECTORY FILTERING FOR STABLE OFF-POLICY",
  "elapsed": 63.3,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 4,
  "scores": [
    4
  ],
  "score_std": 0,
  "final_verdict": "Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.3,
    "presentation": 2.8,
    "contribution": 2,
    "overall_rating": 4,
    "confidence": 3
  },
  "strengths": [
    "The paper identifies a concrete and practically important problem: fixed trajectory bounds in MIS-PO either cause gradient explosion (too loose) or waste compute (too tight), and demonstrates this with the Fixed MIS-PO crash at step 94 under s=256 (Section 4.2, Table 1). This is a meaningful engineering insight for the off-policy RLVR community.",
    "The ablation study (Table 2) is well-designed and reveals a key finding: bound magnitude alone accounts for +12.76pp improvement (Fixed-at-Final-Bound at 31.43% vs Fixed MIS-PO at 18.67%), while the adaptive controller adds only +1.14pp. This honest decomposition helps readers understand what actually drives the gains versus what is cosmetic.",
    "The method is simple and implementable: quantile estimation + EMA smoothing is a lightweight controller (Equations 6-7) with clear hyperparameters (β=0.9, Astart=0.4, Aend=0.2, Kwarmup=200). Code is publicly available, supporting reproducibility."
  ],
  "weaknesses": [
    "The ablation (Table 2) inadvertently reveals that the core proposed method, adaptive per-step control, contributes only +1.14pp over simply using the right fixed bound (32.57% vs 31.43%). The dominant gain (+12.76pp) comes from using tighter bounds, which could be found by a simple grid search. This substantially undermines the novelty claim: the 'adaptive controller' is largely unnecessary if one knows the right bound magnitude a priori, and the paper provides no evidence that the controller discovers bounds that a practitioner could not find via a few trial runs.",
    "Experimental scope is extremely narrow: single model (Qwen3-1.7B), single staleness level (s=256), single task domain (mathematical reasoning), single dataset (DeepScaleR), 300-400 training steps. The paper acknowledges this limitation but does not provide any evidence of generalization. The staleness s=256 is an extreme and somewhat artificial setting; it is unclear whether AC-MIS-PO provides any benefit at more common staleness levels (e.g., s=8, s=16, s=32).",
    "Statistical rigor is absent: no result is reported with standard deviations, multiple random seeds, or confidence intervals. The main comparison in Table 1 appears to rest on single-run results. The Pearson correlation r=0.112 between acceptance rate and second moment (Section 4.4) is reported with p=0.052, which narrowly misses significance at α=0.05, yet the paper treats it as evidence that the two quantities 'capture different aspects.' Without multiple runs, it is impossible to determine whether the 1.14pp gap between AC-MIS-PO and Fixed-at-Final-Bound is statistically meaningful.",
    "The acceptance rate schedule (40%→20%) and its hyperparameters (Kwarmup=200) are themselves fixed and manually chosen. The paper criticizes fixed hyperparameters in prior work but introduces its own fixed schedule. No sensitivity analysis is provided for Astart, Aend, Kwarmup, or β, making it unclear how robust the method is to these choices."
  ],
  "must_fix_items": [
    "Report results with multiple random seeds and standard deviations to establish statistical significance, especially for the narrow 1.14pp gap between AC-MIS-PO and Fixed-at-Final-Bound.",
    "Provide sensitivity analysis for key hyperparameters (Astart, Aend, Kwarmup, β) to demonstrate robustness and justify the claim of 'automatic discovery without manual tuning.'",
    "Test on at least one additional staleness level or model scale to support generalization claims."
  ],
  "runs": [
    {
      "run": 1,
      "score": 4,
      "verdict": "Reject",
      "confidence": 0.6,
      "strengths": [
        "The paper identifies a concrete and practically important problem: fixed trajectory bounds in MIS-PO either cause gradient explosion (too loose) or waste compute (too tight), and demonstrates this with the Fixed MIS-PO crash at step 94 under s=256 (Section 4.2, Table 1). This is a meaningful engineering insight for the off-policy RLVR community.",
        "The ablation study (Table 2) is well-designed and reveals a key finding: bound magnitude alone accounts for +12.76pp improvement (Fixed-at-Final-Bound at 31.43% vs Fixed MIS-PO at 18.67%), while the adaptive controller adds only +1.14pp. This honest decomposition helps readers understand what actually drives the gains versus what is cosmetic.",
        "The method is simple and implementable: quantile estimation + EMA smoothing is a lightweight controller (Equations 6-7) with clear hyperparameters (β=0.9, Astart=0.4, Aend=0.2, Kwarmup=200). Code is publicly available, supporting reproducibility."
      ],
      "weaknesses": [
        "The ablation (Table 2) inadvertently reveals that the core proposed method, adaptive per-step control, contributes only +1.14pp over simply using the right fixed bound (32.57% vs 31.43%). The dominant gain (+12.76pp) comes from using tighter bounds, which could be found by a simple grid search. This substantially undermines the novelty claim: the 'adaptive controller' is largely unnecessary if one knows the right bound magnitude a priori, and the paper provides no evidence that the controller discovers bounds that a practitioner could not find via a few trial runs.",
        "Experimental scope is extremely narrow: single model (Qwen3-1.7B), single staleness level (s=256), single task domain (mathematical reasoning), single dataset (DeepScaleR), 300-400 training steps. The paper acknowledges this limitation but does not provide any evidence of generalization. The staleness s=256 is an extreme and somewhat artificial setting; it is unclear whether AC-MIS-PO provides any benefit at more common staleness levels (e.g., s=8, s=16, s=32).",
        "Statistical rigor is absent: no result is reported with standard deviations, multiple random seeds, or confidence intervals. The main comparison in Table 1 appears to rest on single-run results. The Pearson correlation r=0.112 between acceptance rate and second moment (Section 4.4) is reported with p=0.052, which narrowly misses significance at α=0.05, yet the paper treats it as evidence that the two quantities 'capture different aspects.' Without multiple runs, it is impossible to determine whether the 1.14pp gap between AC-MIS-PO and Fixed-at-Final-Bound is statistically meaningful.",
        "The acceptance rate schedule (40%→20%) and its hyperparameters (Kwarmup=200) are themselves fixed and manually chosen. The paper criticizes fixed hyperparameters in prior work but introduces its own fixed schedule. No sensitivity analysis is provided for Astart, Aend, Kwarmup, or β, making it unclear how robust the method is to these choices."
      ],
      "must_fix_items": [
        "Report results with multiple random seeds and standard deviations to establish statistical significance, especially for the narrow 1.14pp gap between AC-MIS-PO and Fixed-at-Final-Bound.",
        "Provide sensitivity analysis for key hyperparameters (Astart, Aend, Kwarmup, β) to demonstrate robustness and justify the claim of 'automatic discovery without manual tuning.'",
        "Test on at least one additional staleness level or model scale to support generalization claims."
      ],
      "conference_scores": {
        "soundness": 2.3,
        "presentation": 2.8,
        "contribution": 2,
        "overall_rating": 4,
        "confidence": 3
      }
    }
  ]
}