Title: MEL-CODE: TRANSFERRING META-EXPERIENCE LEARNING TO CODE RLVR WITH UNIT-TEST RE-
PDF: mel-code-meta-experience.pdf
Score: 3.5
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 50.6s

Strengths:
1. The paper identifies a genuinely interesting finding: code RLVR produces abundant meta-experience signal (66% usable contrastive pairs vs. 5% threshold in math MEL), which is a non-obvious and potentially useful empirical observation (Section 4.2). This suggests the code domain is particularly amenable to contrastive learning from rollout pairs.
2. The ablation study is well-designed and reveals that replay validation is the most critical component, with removal dropping performance back to GRPO baseline (Table 2). This provides actionable insight about which component matters most, going beyond a monolithic comparison.
3. The paper honestly reports negative transfer results: MEL-Code matches GRPO on HumanEval+ (68.3%) and does not reach the Self-Critique baseline (73.2%), explicitly stating that meta-experiences from MBPP do not transfer across coding tasks (Section 4.3, Conclusion). This transparency is commendable.
4. Faster convergence is a practical benefit: MEL-Code peaks at step 40 vs. step 60 for the baselines, a 33% reduction in steps to peak (Figure 2a). This has real engineering value even if the final performance gains are modest.

Weaknesses:
1. The absolute performance gains are extremely small and not statistically significant. Greedy Pass@1 improves from 8.8% to 9.2% (+0.4 pp), and the authors themselves report p = 0.32 via paired bootstrap (Section 4.3). A 0.4 percentage point improvement on a single benchmark with no statistical significance is a very thin empirical basis for claiming a meaningful contribution.
2. The method is essentially a direct transfer of MEL (Huang et al., 2026) from math to code, with the primary adaptation being template-based construction instead of LLM-generated meta-experiences. The novelty is limited: the three-stage pipeline (contrastive pair construction, replay validation, NLL internalization) is inherited directly from MEL. The 'template-based' construction is described but no details or examples of the templates are provided, making it unclear what the actual adaptation entails (Section 3.2).
3. On HumanEval+ (the more rigorous and widely adopted benchmark), MEL-Code performs identically to GRPO (68.3%) and worse than the simpler Self-Critique NLL baseline (73.2%). The proposed method therefore fails to improve over a straightforward alternative on the harder, more standard evaluation. Because Self-Critique uses neither contrastive pairing nor replay validation, its advantage on the out-of-distribution benchmark raises questions about the practical value of the added complexity (Table 1).
4. Reproducibility concerns: the template-based meta-experience construction is not specified (no template examples appear in the paper or appendix), the diff-guided divergence categorization is only briefly described, and the meta-experience format (50–80 tokens) lacks concrete examples. The paper references a GitLab repository, but the methodological details should be in the paper itself. Separately, HumanEval+ performance drops from the base model (75.6% Base) after GRPO training (73.8%/68.3%); this regression is unexplained and concerning, and the paper does not discuss it (Table 1).
5. The experimental setup is limited: only one model size (7B), one training dataset (MBPP-train with only 374 tasks), and 68 training steps. The small training data size and short training schedule make it difficult to assess whether the gains would scale or persist. The paper also lacks evaluation on additional code benchmarks (e.g., LiveCodeBench, BigCodeBench) that would test generalizability beyond MBPP and HumanEval+.

Must Fix Items:
1. Provide concrete examples of the template-based meta-experience construction (templates, input/output examples) so the method can be reproduced and understood.
2. Explain the HumanEval+ regression from the base model (75.6%) to GRPO (73.8%) and MEL-Code (68.3%). This drop is unexplained and undermines the narrative.
3. Report confidence intervals or error bars across multiple runs to establish statistical reliability, given that the main claim rests on a 0.4 pp improvement with p = 0.32.
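
To make Must Fix item 3 concrete, the following is a minimal sketch of a paired bootstrap over per-task outcomes that yields both a confidence interval and a one-sided p-value. It is not the authors' procedure, which the paper does not spell out. The function name `paired_bootstrap`, its parameters, and the 0/1 per-task pass encoding are assumptions made for illustration.

```python
import random


def paired_bootstrap(base, new, n_resamples=10_000, seed=0):
    """Paired bootstrap over tasks (hypothetical helper, not from the paper).

    base, new: per-task scores (e.g. 0/1 pass indicators) for two systems,
    aligned so that index i refers to the same task in both lists.
    Returns (observed_diff, (ci_low, ci_high), p_one_sided), where the
    interval is a 95% percentile interval on mean(new) - mean(base) and
    p_one_sided is the fraction of resamples with a difference <= 0.
    """
    if len(base) != len(new):
        raise ValueError("base and new must cover the same tasks")
    rng = random.Random(seed)
    n = len(base)
    # Work on per-task differences so each resample keeps the pairing intact.
    diffs = [b_new - b_old for b_old, b_new in zip(base, new)]
    observed = sum(diffs) / n

    resampled = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()

    ci_low = resampled[int(0.025 * n_resamples)]
    ci_high = resampled[int(0.975 * n_resamples) - 1]
    p_one_sided = sum(1 for d in resampled if d <= 0) / n_resamples
    return observed, (ci_low, ci_high), p_one_sided
```

Running this once per training seed and reporting the spread of `observed` across seeds, alongside the per-run interval, would separate evaluation noise from run-to-run training variance.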

Runs:
- run=1 score=3.5 verdict=Strong Reject confidence=0.6 error=None