Title: SOURCEJS-LORA: SOURCE-REFERENCED JENSEN-SHANNON
PDF: 1787a8f9-3036-4cff-86b5-6e5b16fa9d72.pdf
Score: 5.8
Verdict: Revise
Confidence: 0.72
Elapsed: 269.7s

Strengths:
1. Clear identification of a real failure mode: entropy minimization for LoRA merge coefficients producing 'confidently wrong' predictions is convincingly demonstrated (Table 1: Entropy Coeff 77.55% vs Uniform Merge 79.76%, with catastrophic drops on CoLA 41.12% and MRPC 85.20%). This is a meaningful negative result about an existing method.
2. JS divergence as an anchoring objective is a principled choice: it is bounded (unlike KL), symmetric, and provides a natural teacher-student signal. The connection to preventing confidence collapse is well-motivated and the mathematical formulation (Equations 3-4) is clean.
3. Coefficient stability analysis (Figure 2) provides useful diagnostic insight: entropy minimization produces coefficients ranging from -3.04 to +2.50, including negative values, while SourceJS-LoRA produces coefficients in 0.06-0.50 with zero negative coefficients. This directly explains the performance degradation and is a concrete contribution to understanding merge dynamics.

Weaknesses:
1. Core idea is incremental: minimizing JS divergence between merged model and task experts is functionally knowledge distillation (Hinton et al., 2015) applied to coefficient optimization. The paper frames this as 'source-referenced optimization' but the contribution is essentially: use KD loss instead of entropy loss to learn coefficients. The novelty is marginal and the reframing as 'source-referenced JS divergence' obscures this lineage; there is no citation or discussion of the KD connection anywhere in the paper.
2. Supervised Coeff baseline at 79.79% (Table 1) is deeply suspicious and appears to be a strawman. A supervised method with labeled data (64 samples/task) that performs barely above uniform merging (79.76%) and dramatically below SourceJS (83.10%) is either poorly implemented or deliberately weak. On MNLI, Supervised Coeff gets 81.99% vs SourceJS's 83.53%, a small gap, yet on CoLA it gets 58.70% (highest among all methods) but on RTE only 58.16% and SST-2 only 82.80%. The inconsistency suggests implementation issues rather than a fair comparison. No explanation is given for why supervised learning fails so badly.
3. Missing critical baselines from the paper's own related work section: TIES-Merging, Fisher-weighted averaging, KnOTS, IterIS, and PCB-Merging are all discussed in Section 2 but none appear in Table 1. DO-Merging is positioned as SOTA but the reader cannot verify this claim against the broader landscape of merging methods the paper itself cites.
4. No statistical significance tests despite reporting 3 seeds. Table 1 reports point averages only, with no standard deviations. Given that the main claimed improvement over DO-Merging is +2.29 points on average, and that this average is heavily driven by a single task (MNLI: +15.65), it is impossible to assess whether observed differences are statistically meaningful or within noise.
5. Evaluation scope is extremely narrow: only T5-base (220M) on 8 GLUE tasks. No evaluation on larger models, decoder-only architectures, generation tasks, or any domain beyond NLU classification. The paper's own Limitations section acknowledges this but the experimental contribution is thin: 8 classification tasks on one small encoder-decoder model does not establish generalizability of the method.

Must Fix Items:
1. Add standard deviations and significance tests across the 3 seeds for all methods in Table 1. Without this, the +2.29 improvement claim over DO-Merging cannot be validated.
2. Explain or fix the Supervised Coeff baseline: 79.79% with labeled data is implausibly low and suggests either a bug or an intentionally weak implementation. At minimum, report the same hyperparameter search budget for all methods.
3. Add at least TIES-Merging and one other related-work baseline (KnOTS or IterIS) to Table 1, since these are cited in Section 2 as relevant methods but excluded from evaluation.

Runs:
- run=1 score=5.8 verdict=Revise confidence=0.72 error=None
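
Note on the significance concern (Weaknesses item 4, Must Fix item 1): the sketch below makes the point concrete. It is illustrative only, not the authors' evaluation code. It assumes the reported average is an unweighted mean over the 8 tasks, and the per-seed scores passed to `paired_t_statistic` are hypothetical placeholders, not values from the paper. The function names are ours.

```python
import math
import statistics


def share_from_single_task(avg_gain, task_gain, n_tasks):
    """Split an average gain into one task's contribution and the rest.

    Assumes the average is an unweighted mean over n_tasks tasks.
    Returns (points of the average contributed by the single task,
             mean gain over the remaining n_tasks - 1 tasks).
    """
    contribution = task_gain / n_tasks
    rest_mean = (avg_gain * n_tasks - task_gain) / (n_tasks - 1)
    return contribution, rest_mean


def paired_t_statistic(scores_a, scores_b):
    """Paired t-statistic over per-seed differences (a - b)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n))


# Figures quoted in this review: +2.29 average gain, +15.65 on MNLI, 8 tasks.
contribution, rest_mean = share_from_single_task(2.29, 15.65, 8)
print(f"MNLI contributes {contribution:.2f} of the 2.29 average points")
print(f"Mean gain on the other 7 tasks: {rest_mean:.2f}")

# HYPOTHETICAL per-seed averages, for illustration only (not from the paper).
method_a = [83.0, 83.4, 82.9]
method_b = [80.9, 80.6, 81.0]
t_stat = paired_t_statistic(method_a, method_b)
# With 3 seeds there are 2 degrees of freedom; the two-sided 5% critical
# value of the t distribution is about 4.30, so |t| must exceed that.
print(f"paired t = {t_stat:.2f} (df = {len(method_a) - 1})")
```

Under the unweighted-mean assumption, MNLI alone accounts for roughly 1.96 of the 2.29 points, leaving a mean gain of about 0.38 points on the other seven tasks. A difference of that size is exactly where per-seed standard deviations and a paired test are needed.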