{
  "pdf": "justgrpo-order-robustness.pdf",
  "title": "AR-ORDER RL POST-TRAINING REDUCES ORDER ROBUSTNESS IN DIFFUSION LANGUAGE MODELS FARS Analemma",
  "elapsed": 47.6,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.2,
  "scores": [
    3.2
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.3,
    "presentation": 2.5,
    "contribution": 1.8,
    "overall_rating": 3.2,
    "confidence": 3
  },
  "strengths": [
    "The paper identifies a genuinely important and previously unexplored tension: AR-order RL post-training (JustGRPO) erodes the order robustness advantage of diffusion LMs. This is a timely and practical concern as the community rushes to apply AR-style RL to dLLMs. Evidence: robustness ratio drops of 0.192 on ReasonOrderQA and 0.138 on GSM8K (Table 1), both exceeding the stated decision threshold of 0.10.",
    "The three-model comparison design (diffusion base → AR-order RL diffusion → pure AR anchor) is well-constructed to isolate the effect of AR-order RL training while providing an informative reference point. The '53% gap coverage' finding (Section 4.1, computed as (0.698−0.506)/(0.698−0.339)) provides a clean quantitative characterization of partial AR-internalization. Evidence: Table 1, Section 4.1.",
    "The per-difficulty analysis (Section 4.2, Figure 2) provides useful granularity, showing that degradation is concentrated at D2–D3 (multi-step reasoning) rather than uniformly distributed, with D1 remaining robust and D4 showing a floor effect. This suggests the trade-off is specifically about intermediate reasoning complexity, which is an actionable finding for future training method design."
  ],
  "weaknesses": [
    "The paper is extremely thin—essentially a single comparison table with one ablation. There is no analysis of *why* AR-order RL reduces robustness (e.g., no probing of internal representations, no analysis of which tokens/positions are most affected, no examination of the reward signal's token-level impact). The 'partial internalization of AR-style order dependence' claim (Section 4.1) is purely phenomenological with no mechanistic evidence. The entire experimental contribution could fit in a 2-page workshop note.",
    "The robustness ratio metric r = Acc(AF)/Acc(CF) is misleading when Acc(CF) changes drastically between models. JustGRPO's CF accuracy rises from 68.9% to 88.5% (+19.6pp) while AF accuracy only drops 3.3pp (48.1→44.8). A ratio metric penalizes improvements in the denominator. The paper does not discuss whether absolute AF accuracy degradation (3.3pp) is practically meaningful versus the 19.6pp CF gain. The 'fundamental trade-off' framing is overstated given the small absolute AF decline. Evidence: Table 1 raw numbers vs. ratio-based claims.",
    "The paper was generated by an automated research system (explicitly stated in the abstract). This manifests in shallow analysis, formulaic structure, and absence of deeper investigation. Only two benchmarks, only one RL method (JustGRPO), only one base model (LLaDA-8B), no comparison with alternative RL methods (dUltra, MDPO, Inpainting-Guided PO—all cited in Related Work but never tested), no variance/statistical analysis beyond 'deterministic' claims, and no exploration of potential mitigations despite mentioning 'order-agnostic RL training' in the conclusion with zero experimental support."
  ],
  "must_fix_items": [
    "Report and discuss absolute AF accuracy changes alongside ratio-based metrics; the 3.3pp AF decline on ReasonOrderQA vs. 19.6pp CF gain fundamentally reframes whether this is a 'trade-off' or a net positive with minor side effect.",
    "Test at least one alternative RL method (e.g., MDPO or dUltra) to show whether the robustness reduction is specific to JustGRPO's AR-order reward or a general property of RL post-training on dLLMs.",
    "Provide mechanistic analysis or probing experiments to support the claim of 'partial internalization of AR-style order dependence' rather than relying solely on behavioral ratio comparisons."
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.2,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "The paper identifies a genuinely important and previously unexplored tension: AR-order RL post-training (JustGRPO) erodes the order robustness advantage of diffusion LMs. This is a timely and practical concern as the community rushes to apply AR-style RL to dLLMs. Evidence: robustness ratio drops of 0.192 on ReasonOrderQA and 0.138 on GSM8K (Table 1), both exceeding the stated decision threshold of 0.10.",
        "The three-model comparison design (diffusion base → AR-order RL diffusion → pure AR anchor) is well-constructed to isolate the effect of AR-order RL training while providing an informative reference point. The '53% gap coverage' finding (Section 4.1, computed as (0.698−0.506)/(0.698−0.339)) provides a clean quantitative characterization of partial AR-internalization. Evidence: Table 1, Section 4.1.",
        "The per-difficulty analysis (Section 4.2, Figure 2) provides useful granularity, showing that degradation is concentrated at D2–D3 (multi-step reasoning) rather than uniformly distributed, with D1 remaining robust and D4 showing a floor effect. This suggests the trade-off is specifically about intermediate reasoning complexity, which is an actionable finding for future training method design."
      ],
      "weaknesses": [
        "The paper is extremely thin—essentially a single comparison table with one ablation. There is no analysis of *why* AR-order RL reduces robustness (e.g., no probing of internal representations, no analysis of which tokens/positions are most affected, no examination of the reward signal's token-level impact). The 'partial internalization of AR-style order dependence' claim (Section 4.1) is purely phenomenological with no mechanistic evidence. The entire experimental contribution could fit in a 2-page workshop note.",
        "The robustness ratio metric r = Acc(AF)/Acc(CF) is misleading when Acc(CF) changes drastically between models. JustGRPO's CF accuracy rises from 68.9% to 88.5% (+19.6pp) while AF accuracy only drops 3.3pp (48.1→44.8). A ratio metric penalizes improvements in the denominator. The paper does not discuss whether absolute AF accuracy degradation (3.3pp) is practically meaningful versus the 19.6pp CF gain. The 'fundamental trade-off' framing is overstated given the small absolute AF decline. Evidence: Table 1 raw numbers vs. ratio-based claims.",
        "The paper was generated by an automated research system (explicitly stated in the abstract). This manifests in shallow analysis, formulaic structure, and absence of deeper investigation. Only two benchmarks, only one RL method (JustGRPO), only one base model (LLaDA-8B), no comparison with alternative RL methods (dUltra, MDPO, Inpainting-Guided PO—all cited in Related Work but never tested), no variance/statistical analysis beyond 'deterministic' claims, and no exploration of potential mitigations despite mentioning 'order-agnostic RL training' in the conclusion with zero experimental support."
      ],
      "must_fix_items": [
        "Report and discuss absolute AF accuracy changes alongside ratio-based metrics; the 3.3pp AF decline on ReasonOrderQA vs. 19.6pp CF gain fundamentally reframes whether this is a 'trade-off' or a net positive with minor side effect.",
        "Test at least one alternative RL method (e.g., MDPO or dUltra) to show whether the robustness reduction is specific to JustGRPO's AR-order reward or a general property of RL post-training on dLLMs.",
        "Provide mechanistic analysis or probing experiments to support the claim of 'partial internalization of AR-style order dependence' rather than relying solely on behavioral ratio comparisons."
      ],
      "conference_scores": {
        "soundness": 2.3,
        "presentation": 2.5,
        "contribution": 1.8,
        "overall_rating": 3.2,
        "confidence": 3
      }
    }
  ]
}