Title: CONTRACTIVE RECURRENT CORES DEPTH-
PDF: contractive-rd-vla.pdf
Score: 2.8
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 48.9s

Strengths:
1. Honest reporting of null result: The paper's central hypothesis (Jacobian regularization fixes the depth boundary) turned out to be moot because the depth boundary does not manifest on LIBERO-10. The authors transparently report this rather than cherry-picking or reframing, which is scientifically valuable. (Section 4.2, Table 1: 0% overthinking for both baseline and JacReg across K=4 to K=128)
2. Two-phase training strategy is practically useful: Fine-tuning from a converged baseline with a gentle Jacobian penalty (λ=10⁻⁴, lr=10⁻⁵) improves MSE by 1.6% in only 400 steps (~10 GPU-hrs), whereas from-scratch training degrades MSE by 4.2-6.1% and costs 88-160 GPU-hrs. This is a clean ablation, and the gap is consistent with gradient competition between the action loss and the Jacobian penalty. (Table 3)
3. Adaptive stopping analysis provides actionable inference savings: The convergence analysis (Table 2) shows k*=6 at τ=0.001 achieves 100% convergence rate with 50% compute savings vs fixed K=12, and critically reveals that overly tight thresholds (τ=0.0001) cause oscillation and negative savings. This is a practically useful finding for deployment. (Table 2, Section 4.3)
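
To make the adaptive-stopping finding in item 3 concrete, here is a minimal Python sketch of threshold-based early exit from a recurrent core. It is an illustration only: the relative-update stopping rule, the function names (`iterate_with_early_stop`, `compute_savings`), and the toy contractive map are assumptions, not the paper's implementation, which this review does not reproduce.

```python
import math


def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def iterate_with_early_stop(step, h0, tau, k_max):
    """Apply `step` at most k_max times.

    Stops as soon as the relative update ||h_next - h|| / ||h|| drops
    below tau (assumed criterion). Returns (final state, iterations used).
    """
    h = list(h0)
    for k in range(1, k_max + 1):
        h_next = step(h)
        delta = l2_norm([a - b for a, b in zip(h_next, h)])
        scale = max(l2_norm(h), 1e-12)  # guard against division by zero
        h = h_next
        if delta / scale < tau:
            return h, k
    return h, k_max


def compute_savings(k_used, k_fixed):
    """Fraction of recurrent iterations saved relative to a fixed budget."""
    return 1.0 - k_used / k_fixed


def toy_step(h):
    # Hypothetical stand-in for the recurrent core: a contraction with
    # Lipschitz constant 0.5 and fixed point 2.0 in every coordinate.
    return [0.5 * x + 1.0 for x in h]


h_final, k_used = iterate_with_early_stop(toy_step, [1.0, 1.0], tau=1e-3, k_max=12)
print(k_used, compute_savings(k_used, 12))
```

With the figures quoted from Table 2, `compute_savings(6, 12)` gives the reported 50%. The toy map is a pure contraction and always converges, so it cannot reproduce the oscillation reported at τ=0.0001; the sketch only shows how the threshold trades iterations for accuracy.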

Weaknesses:
1. Single benchmark, single evaluation protocol, severely limited scope: The paper evaluates only on LIBERO-10 with offline teacher-forced evaluation. The entire motivation of the paper is addressing the 'depth boundary' from RD-VLA, yet this phenomenon does not even occur in their experimental setting. Without closed-loop evaluation, harder benchmarks, or at least one setting where the depth boundary DOES manifest, the paper cannot demonstrate that Jacobian regularization actually solves the problem it set out to solve. (Section 4.1, Limitations section)
2. 1.6% MSE improvement is marginal and is reported without statistical significance testing: The claimed 1.6% improvement (0.04099 vs 0.04167, an absolute gap of 0.00068) is extremely small. No confidence intervals, repeated runs, or statistical tests are reported. With only 20 validation episodes (5,609 timesteps), this difference could easily be within noise. The paper presents no evidence that this MSE gap translates into any meaningful improvement in task success rate. (Table 1, Table 3)
3. Over-packaging of a null-result ablation study: The paper frames itself as investigating 'contractive recurrent cores for depth scaling' with mathematical formalism (Equations 2-4), Hutchinson estimator, and equilibrium model references, but the actual contribution is: (1) the depth boundary doesn't occur on this benchmark, (2) a small fine-tuning trick gives marginal MSE improvement, (3) early stopping works. The Jacobian regularization machinery is largely unnecessary given the null result, and the paper's intellectual weight does not match its formal presentation. (Abstract, Section 3.2, Table 1)
4. No closed-loop or online evaluation: Teacher-forced MSE is a weak proxy for robotic control quality. Action-prediction MSE need not track task success, especially in long-horizon manipulation tasks where errors compound over a rollout. The paper acknowledges this limitation but does nothing to address it. (Section 4.1 evaluation protocol, Limitations)

Must Fix Items:
1. Add statistical significance tests (e.g., bootstrap confidence intervals, paired t-tests across episodes) for the claimed 1.6% MSE improvement. Without this, the improvement claim is unsupported.
2. Report results on at least one benchmark or evaluation protocol where the depth boundary actually manifests, to demonstrate whether Jacobian regularization provides any benefit in the setting it was designed for.
3. Include closed-loop evaluation (task success rate) rather than only teacher-forced MSE, to establish whether any observed MSE differences translate to meaningful robot performance differences.
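
One way to carry out the test requested in Must Fix item 1 is a paired bootstrap over episodes. The Python sketch below assumes the authors can export one MSE value per validation episode for each model; the function name `paired_bootstrap_ci`, its parameters, and the synthetic inputs are hypothetical and do not come from the paper.

```python
import random
import statistics


def paired_bootstrap_ci(baseline, treated, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (baseline - treated).

    `baseline` and `treated` hold one MSE per episode, in the same episode
    order. A positive difference means the treated model has lower MSE.
    """
    if len(baseline) != len(treated):
        raise ValueError("paired samples must have equal length")
    rng = random.Random(seed)
    diffs = [b - t for b, t in zip(baseline, treated)]
    n = len(diffs)
    boot_means = []
    for _ in range(n_boot):
        # Resample whole episodes with replacement, keeping pairs intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boot_means.append(sum(sample) / n)
    boot_means.sort()
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(diffs), lo, hi


# Synthetic per-episode MSEs for illustration only; NOT the paper's data.
_rng = random.Random(1)
baseline_mse = [0.04 + _rng.gauss(0.0, 0.01) for _ in range(20)]
treated_mse = [b - 0.0007 + _rng.gauss(0.0, 0.002) for b in baseline_mse]

mean_diff, ci_lo, ci_hi = paired_bootstrap_ci(baseline_mse, treated_mse, n_boot=2000)
print(f"mean diff = {mean_diff:.5f}, 95% CI = [{ci_lo:.5f}, {ci_hi:.5f}]")
```

Resampling episodes rather than individual timesteps is deliberate: timesteps within an episode are correlated, so the effective sample size is closer to the 20 episodes than to the 5,609 timesteps. If the resulting interval contains zero, the claimed improvement is not distinguishable from noise at the chosen level.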

Runs:
- run=1 score=2.8 verdict=Strong Reject confidence=0.6 error=None