{
  "pdf": "ef34d872-308d-4b0f-a75d-600ee6cc7e1e.pdf",
  "title": "DOES MIS-PO NEED RATIO-BASED TRAJECTORY SE-LECTION? A RANDOM-REJECTION MECHANISM TEST FARS Analemma",
  "elapsed": 63.6,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 5.0,
  "scores": [
    5.0
  ],
  "score_std": 0.0,
  "final_verdict": "Reject",
  "final_confidence": 0.72,
  "conference_scores": null,
  "strengths": [
    "Clean three-condition experimental design that properly isolates the trajectory-level filtering mechanism: MIS-PO (full), TokenOnly (ablation removing trajectory filter), RandomTraj (control matching acceptance count). This is a well-structured ablation study (Section 3.2, Eqs. 3-5).",
    "RandomTraj control is a genuine methodological contribution — it directly tests whether ratio-based selection matters versus mere acceptance-rate reduction. The finding that RandomTraj (40.85%) outperforms MIS-PO (2.85%) by 38.0pp at identical acceptance rates is a compelling negative result (Table 1).",
    "Mechanism analysis in Section 4.3 provides interpretable explanation: MIS-PO's narrow bounds [0.996, 1.001] select trajectories with σ=0.0012 variance in log ρ(τ), while RandomTraj-accepted trajectories have 22× higher variance (σ=0.026). The KS test (statistic=0.568, p<0.001) and the finding that 61.4% of RandomTraj-accepted trajectories would be rejected by MIS-PO's criterion offer concrete evidence for the gradient-starvation hypothesis (Figure 3)."
  ],
  "weaknesses": [
    "Single benchmark (MATH-500), single model (Qwen3-1.7B-Base), single staleness (s*=256), single training configuration — no evidence that findings generalize beyond this narrow setting. The paper's title claims to answer whether MIS-PO 'needs' ratio-based trajectory selection, but the evidence covers exactly one point in the staleness space. Published GRPO baseline (64.3%) is from a different setup; direct comparison is misleading (Table 1).",
    "Strawman configuration: The narrow bounds [0.996, 1.001] are tested at s*=256 staleness, where the policy has drifted 256 gradient steps from the reference. At this extreme staleness, the ratio-based filter is almost guaranteed to collapse because nearly all trajectories will fall outside the 0.4% deviation window. The paper conflates 'trajectory filtering with these specific bounds at this staleness is harmful' with 'trajectory filtering is harmful.' No experiment tests wider trajectory bounds (e.g., [0.8, 1.2] or [0.5, 2.0] to match token-level range), which is the critical missing condition to support the paper's sweeping claims (Section 3.1, Section 5 Limitations mentions this only briefly).",
    "No statistical significance testing despite the paper being a negative-result claim. Standard error for MATH-500 is approximately 2.2pp (as the paper itself notes), but no confidence intervals, no multi-seed runs, and no hypothesis tests are reported for the main accuracy comparisons. The 56.4pp and 38.0pp gaps are large enough to likely survive significance testing, but the TokenOnly vs published-GRPO gap (5pp) and RandomTraj's 40.85% are not tested. Single seed only (Table 1). HF_NO_SIGNIFICANCE applies.",
    "Acceptance-rate collapse confound: Both MIS-PO and RandomTraj experience trajectory acceptance collapse to <5% by step 300 (Figure 2b), but RandomTraj learns productively (40.85%) while MIS-PO stagnates (2.85%). This shows ratio-based selection is worse than random at the same acceptance rate — a valid conclusion. However, TokenOnly has 100% acceptance and achieves 59.25%, so the comparison between TokenOnly and MIS-PO/RandomTraj conflates two variables: (a) whether to filter trajectories at all, and (b) the acceptance rate. The paper's causal attribution to 'gradient starvation from uninformative trajectory selection' is supported by the RandomTraj comparison but the TokenOnly comparison alone is confounded by batch-size difference."
  ],
  "must_fix_items": [
    "Add at least one additional staleness level (e.g., s*=16, s*=64) to demonstrate whether the finding is specific to extreme staleness or general. The paper's own Limitations section acknowledges this gap.",
    "Test MIS-PO with wider trajectory bounds (e.g., [0.5, 2.0] or [0.8, 1.2]) at s*=256 to distinguish 'trajectory filtering is harmful' from 'these specific narrow bounds are harmful at this staleness.' This is the most critical missing condition — without it, the paper's central claim is unsupported.",
    "Report multi-seed results with confidence intervals or at minimum standard deviations across runs. A negative-result paper making strong claims about a method's failure must demonstrate the result is not a random seed artifact."
  ],
  "runs": [
    {
      "run": 1,
      "score": 5.0,
      "verdict": "Reject",
      "confidence": 0.72,
      "strengths": [
        "Clean three-condition experimental design that properly isolates the trajectory-level filtering mechanism: MIS-PO (full), TokenOnly (ablation removing trajectory filter), RandomTraj (control matching acceptance count). This is a well-structured ablation study (Section 3.2, Eqs. 3-5).",
        "RandomTraj control is a genuine methodological contribution — it directly tests whether ratio-based selection matters versus mere acceptance-rate reduction. The finding that RandomTraj (40.85%) outperforms MIS-PO (2.85%) by 38.0pp at identical acceptance rates is a compelling negative result (Table 1).",
        "Mechanism analysis in Section 4.3 provides interpretable explanation: MIS-PO's narrow bounds [0.996, 1.001] select trajectories with σ=0.0012 variance in log ρ(τ), while RandomTraj-accepted trajectories have 22× higher variance (σ=0.026). The KS test (statistic=0.568, p<0.001) and the finding that 61.4% of RandomTraj-accepted trajectories would be rejected by MIS-PO's criterion offer concrete evidence for the gradient-starvation hypothesis (Figure 3)."
      ],
      "weaknesses": [
        "Single benchmark (MATH-500), single model (Qwen3-1.7B-Base), single staleness (s*=256), single training configuration — no evidence that findings generalize beyond this narrow setting. The paper's title claims to answer whether MIS-PO 'needs' ratio-based trajectory selection, but the evidence covers exactly one point in the staleness space. Published GRPO baseline (64.3%) is from a different setup; direct comparison is misleading (Table 1).",
        "Strawman configuration: The narrow bounds [0.996, 1.001] are tested at s*=256 staleness, where the policy has drifted 256 gradient steps from the reference. At this extreme staleness, the ratio-based filter is almost guaranteed to collapse because nearly all trajectories will fall outside the 0.4% deviation window. The paper conflates 'trajectory filtering with these specific bounds at this staleness is harmful' with 'trajectory filtering is harmful.' No experiment tests wider trajectory bounds (e.g., [0.8, 1.2] or [0.5, 2.0] to match token-level range), which is the critical missing condition to support the paper's sweeping claims (Section 3.1, Section 5 Limitations mentions this only briefly).",
        "No statistical significance testing despite the paper being a negative-result claim. Standard error for MATH-500 is approximately 2.2pp (as the paper itself notes), but no confidence intervals, no multi-seed runs, and no hypothesis tests are reported for the main accuracy comparisons. The 56.4pp and 38.0pp gaps are large enough to likely survive significance testing, but the TokenOnly vs published-GRPO gap (5pp) and RandomTraj's 40.85% are not tested. Single seed only (Table 1). HF_NO_SIGNIFICANCE applies.",
        "Acceptance-rate collapse confound: Both MIS-PO and RandomTraj experience trajectory acceptance collapse to <5% by step 300 (Figure 2b), but RandomTraj learns productively (40.85%) while MIS-PO stagnates (2.85%). This shows ratio-based selection is worse than random at the same acceptance rate — a valid conclusion. However, TokenOnly has 100% acceptance and achieves 59.25%, so the comparison between TokenOnly and MIS-PO/RandomTraj conflates two variables: (a) whether to filter trajectories at all, and (b) the acceptance rate. The paper's causal attribution to 'gradient starvation from uninformative trajectory selection' is supported by the RandomTraj comparison but the TokenOnly comparison alone is confounded by batch-size difference."
      ],
      "must_fix_items": [
        "Add at least one additional staleness level (e.g., s*=16, s*=64) to demonstrate whether the finding is specific to extreme staleness or general. The paper's own Limitations section acknowledges this gap.",
        "Test MIS-PO with wider trajectory bounds (e.g., [0.5, 2.0] or [0.8, 1.2]) at s*=256 to distinguish 'trajectory filtering is harmful' from 'these specific narrow bounds are harmful at this staleness.' This is the most critical missing condition — without it, the paper's central claim is unsupported.",
        "Report multi-seed results with confidence intervals or at minimum standard deviations across runs. A negative-result paper making strong claims about a method's failure must demonstrate the result is not a random seed artifact."
      ],
      "conference_scores": null
    }
  ]
}