{
  "pdf": "stutter-invariance-worldmodel-audit.pdf",
  "title": "STUTTER-INVARIANCE METAMORPHIC AUDITS FOR TEXT WORLD-MODEL ROLLOUTS FARS Analemma",
  "elapsed": 181.3,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.5,
  "scores": [
    3.5
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.4,
    "presentation": 2.8,
    "contribution": 1.8,
    "overall_rating": 3.5,
    "confidence": 3
  },
  "strengths": [
    "The paper honestly reports a negative result: the proposed stutter-invariance metamorphic audit (AUROC 0.767) is statistically tied with the simpler sampling consistency baseline (B3, AUROC 0.757). This transparency is commendable and avoids overselling: the authors explicitly acknowledge that domain-specific probes add no measurable benefit over generic stability testing (Section 5, Discussion; Tables 2-3).",
    "The experimental methodology is relatively rigorous: three random seeds, bootstrap confidence intervals with 1000 resamples, pre-registered success criteria (a delta AUROC ≥ 0.05 success threshold and a 0.02 tie zone), and per-seed breakdowns showing that the direction of the difference between the proposed method and B3 is inconsistent across seeds (Table 3). This pre-registration and statistical testing framework exceeds what many similar papers provide.",
    "The problem formulation—predicting World-to-Real (W2R) transfer failure from rollout properties alone—is practically motivated and clearly defined (Section 3.1). The 34.3% W2R failure rate demonstrates the problem is real and non-trivial, and the idea of pre-screening unreliable rollouts before costly real-environment replay has clear practical value."
  ],
  "weaknesses": [
    "The core contribution is an informative negative result with extremely limited scope: a single domain (TextWorld), a single world model (Qwen2.5-7B), and a single acting agent (GPT-4o-mini). The equivalence between structured and generic stability testing may be an artifact of this specific setting and may not generalize. The authors acknowledge this (Section 5, Limitations) but provide no evidence that the finding extends beyond this narrow configuration. A paper whose main finding is a negative result needs stronger generalization evidence to be impactful.",
    "Unfair baseline comparison due to post-hoc optimization: the stutter-invariance method underwent systematic search over 15 aggregation candidates (Appendix A), improving AUROC from 0.713 to 0.767, while B3 sampling consistency received no such tuning. The authors note this potential bias but do not correct for it. Had B3 received equivalent optimization (e.g., tuning temperature, aggregation strategy), it might well outperform the proposed method. This is a serious fairness concern that undermines even the negative result's reliability.",
    "The paper was generated by an automated research system (stated in the abstract), and the contribution is thin: the stutter-invariance idea is straightforward (insert look commands, measure drift), the pipeline is standard (embedding distance + weighted aggregation), and the finding that generic perturbation works as well as domain-specific perturbation is intuitively plausible and unsurprising. The metamorphic relation itself (state-preserving commands should not change state) is an obvious invariant, and confirming that violating it correlates with failure adds limited insight."
  ],
  "must_fix_items": [
    "Apply equivalent post-hoc optimization to B3 sampling consistency (tune temperature, top-p, aggregation) and report results, to ensure the negative result is not an artifact of asymmetric tuning.",
    "Test on at least one additional domain (e.g., ALFWorld or ScienceWorld) or one additional world model to provide any evidence of generalization beyond the single TextWorld + Qwen2.5-7B setting.",
    "Clarify the automated generation claim: if the paper was generated by an automated system, the novelty and intellectual contribution of the human authors should be explicitly articulated."
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.5,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "The paper honestly reports a negative result: the proposed stutter-invariance metamorphic audit (AUROC 0.767) is statistically tied with the simpler sampling consistency baseline (B3, AUROC 0.757). This transparency is commendable and avoids overselling: the authors explicitly acknowledge that domain-specific probes add no measurable benefit over generic stability testing (Section 5, Discussion; Tables 2-3).",
        "The experimental methodology is relatively rigorous: three random seeds, bootstrap confidence intervals with 1000 resamples, pre-registered success criteria (a delta AUROC ≥ 0.05 success threshold and a 0.02 tie zone), and per-seed breakdowns showing that the direction of the difference between the proposed method and B3 is inconsistent across seeds (Table 3). This pre-registration and statistical testing framework exceeds what many similar papers provide.",
        "The problem formulation—predicting World-to-Real (W2R) transfer failure from rollout properties alone—is practically motivated and clearly defined (Section 3.1). The 34.3% W2R failure rate demonstrates the problem is real and non-trivial, and the idea of pre-screening unreliable rollouts before costly real-environment replay has clear practical value."
      ],
      "weaknesses": [
        "The core contribution is an informative negative result with extremely limited scope: a single domain (TextWorld), a single world model (Qwen2.5-7B), and a single acting agent (GPT-4o-mini). The equivalence between structured and generic stability testing may be an artifact of this specific setting and may not generalize. The authors acknowledge this (Section 5, Limitations) but provide no evidence that the finding extends beyond this narrow configuration. A paper whose main finding is a negative result needs stronger generalization evidence to be impactful.",
        "Unfair baseline comparison due to post-hoc optimization: the stutter-invariance method underwent systematic search over 15 aggregation candidates (Appendix A), improving AUROC from 0.713 to 0.767, while B3 sampling consistency received no such tuning. The authors note this potential bias but do not correct for it. Had B3 received equivalent optimization (e.g., tuning temperature, aggregation strategy), it might well outperform the proposed method. This is a serious fairness concern that undermines even the negative result's reliability.",
        "The paper was generated by an automated research system (stated in the abstract), and the contribution is thin: the stutter-invariance idea is straightforward (insert look commands, measure drift), the pipeline is standard (embedding distance + weighted aggregation), and the finding that generic perturbation works as well as domain-specific perturbation is intuitively plausible and unsurprising. The metamorphic relation itself (state-preserving commands should not change state) is an obvious invariant, and confirming that violating it correlates with failure adds limited insight."
      ],
      "must_fix_items": [
        "Apply equivalent post-hoc optimization to B3 sampling consistency (tune temperature, top-p, aggregation) and report results, to ensure the negative result is not an artifact of asymmetric tuning.",
        "Test on at least one additional domain (e.g., ALFWorld or ScienceWorld) or one additional world model to provide any evidence of generalization beyond the single TextWorld + Qwen2.5-7B setting.",
        "Clarify the automated generation claim: if the paper was generated by an automated system, the novelty and intellectual contribution of the human authors should be explicitly articulated."
      ],
      "conference_scores": {
        "soundness": 2.4,
        "presentation": 2.8,
        "contribution": 1.8,
        "overall_rating": 3.5,
        "confidence": 3
      }
    }
  ]
}