{
  "pdf": "1787a8f9-3036-4cff-86b5-6e5b16fa9d72.pdf",
  "title": "SOURCEJS-LORA: SOURCE-REFERENCED JENSEN-SHANNON",
  "elapsed": 269.7,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 5.8,
  "scores": [
    5.8
  ],
  "score_std": 0.0,
  "final_verdict": "Revise",
  "final_confidence": 0.72,
  "conference_scores": null,
  "strengths": [
    "Clear identification of a real failure mode: entropy minimization for LoRA merge coefficients producing 'confidently wrong' predictions is convincingly demonstrated (Table 1: Entropy Coeff 77.55% vs Uniform Merge 79.76%, with catastrophic drops on CoLA 41.12% and MRPC 85.20%). This is a meaningful negative result about an existing method.",
    "JS divergence as an anchoring objective is a principled choice: it is bounded (unlike KL), symmetric, and provides a natural teacher-student signal. The connection to preventing confidence collapse is well-motivated and the mathematical formulation (Equations 3-4) is clean.",
    "Coefficient stability analysis (Figure 2) provides useful diagnostic insight: entropy minimization produces coefficients ranging from -3.04 to +2.50 with negative values, while SourceJS-LoRA produces coefficients in 0.06-0.50 with zero negative coefficients. This directly explains the performance degradation and is a concrete contribution to understanding merge dynamics."
  ],
  "weaknesses": [
    "Core idea is incremental — minimizing JS divergence between merged model and task experts is functionally knowledge distillation (Hinton et al., 2015) applied to coefficient optimization. The paper frames this as 'source-referenced optimization' but the contribution is essentially: use KD loss instead of entropy loss to learn coefficients. The novelty is marginal and the reframing as 'source-referenced JS divergence' obscures this lineage — no citation or discussion of the KD connection anywhere in the paper.",
    "Supervised Coeff baseline at 79.79% (Table 1) is deeply suspicious and appears to be a strawman. A supervised method with labeled data (64 samples/task) that performs barely above uniform merging (79.76%) and dramatically below SourceJS (83.10%) is either poorly implemented or deliberately weak. On MNLI, Supervised Coeff gets 81.99% vs SourceJS's 83.53% — a small gap — yet on CoLA it gets 58.70% (highest among all methods) but on RTE only 58.16% and SST-2 only 82.80%. The inconsistency suggests implementation issues rather than a fair comparison. No explanation is given for why supervised learning fails so badly.",
    "Missing critical baselines from the paper's own related work section: TIES-Merging, Fisher-weighted averaging, KnOTS, IterIS, and PCB-Merging are all discussed in Section 2 but none appear in Table 1. DO-Merging is positioned as SOTA but the reader cannot verify this claim against the broader landscape of merging methods the paper itself cites.",
    "No statistical significance tests despite reporting 3 seeds. Table 1 reports point averages only with no standard deviations. Given that the main claimed improvement over DO-Merging is +2.29 points on average — and this average is heavily driven by a single task (MNLI: +15.65) — it is impossible to assess whether observed differences are statistically meaningful or within noise.",
    "Evaluation scope is extremely narrow: only T5-base (220M) on 8 GLUE tasks. No evaluation on larger models, decoder-only architectures, generation tasks, or any domain beyond NLU classification. The paper's own Limitations section acknowledges this but the experimental contribution is thin — 8 classification tasks on one small encoder-decoder model does not establish generalizability of the method."
  ],
  "must_fix_items": [
    "Add standard deviations and significance tests across the 3 seeds for all methods in Table 1. Without this, the +2.29 improvement claim over DO-Merging cannot be validated.",
    "Explain or fix the Supervised Coeff baseline: 79.79% with labeled data is implausibly low and suggests either a bug or an intentionally weak implementation. At minimum, report the same hyperparameter search budget for all methods.",
    "Add at least TIES-Merging and one other related-work baseline (KnOTS or IterIS) to Table 1, since these are cited in Section 2 as relevant methods but excluded from evaluation."
  ],
  "runs": [
    {
      "run": 1,
      "score": 5.8,
      "verdict": "Revise",
      "confidence": 0.72,
      "strengths": [
        "Clear identification of a real failure mode: entropy minimization for LoRA merge coefficients producing 'confidently wrong' predictions is convincingly demonstrated (Table 1: Entropy Coeff 77.55% vs Uniform Merge 79.76%, with catastrophic drops on CoLA 41.12% and MRPC 85.20%). This is a meaningful negative result about an existing method.",
        "JS divergence as an anchoring objective is a principled choice: it is bounded (unlike KL), symmetric, and provides a natural teacher-student signal. The connection to preventing confidence collapse is well-motivated and the mathematical formulation (Equations 3-4) is clean.",
        "Coefficient stability analysis (Figure 2) provides useful diagnostic insight: entropy minimization produces coefficients ranging from -3.04 to +2.50 with negative values, while SourceJS-LoRA produces coefficients in 0.06-0.50 with zero negative coefficients. This directly explains the performance degradation and is a concrete contribution to understanding merge dynamics."
      ],
      "weaknesses": [
        "Core idea is incremental — minimizing JS divergence between merged model and task experts is functionally knowledge distillation (Hinton et al., 2015) applied to coefficient optimization. The paper frames this as 'source-referenced optimization' but the contribution is essentially: use KD loss instead of entropy loss to learn coefficients. The novelty is marginal and the reframing as 'source-referenced JS divergence' obscures this lineage — no citation or discussion of the KD connection anywhere in the paper.",
        "Supervised Coeff baseline at 79.79% (Table 1) is deeply suspicious and appears to be a strawman. A supervised method with labeled data (64 samples/task) that performs barely above uniform merging (79.76%) and dramatically below SourceJS (83.10%) is either poorly implemented or deliberately weak. On MNLI, Supervised Coeff gets 81.99% vs SourceJS's 83.53% — a small gap — yet on CoLA it gets 58.70% (highest among all methods) but on RTE only 58.16% and SST-2 only 82.80%. The inconsistency suggests implementation issues rather than a fair comparison. No explanation is given for why supervised learning fails so badly.",
        "Missing critical baselines from the paper's own related work section: TIES-Merging, Fisher-weighted averaging, KnOTS, IterIS, and PCB-Merging are all discussed in Section 2 but none appear in Table 1. DO-Merging is positioned as SOTA but the reader cannot verify this claim against the broader landscape of merging methods the paper itself cites.",
        "No statistical significance tests despite reporting 3 seeds. Table 1 reports point averages only with no standard deviations. Given that the main claimed improvement over DO-Merging is +2.29 points on average — and this average is heavily driven by a single task (MNLI: +15.65) — it is impossible to assess whether observed differences are statistically meaningful or within noise.",
        "Evaluation scope is extremely narrow: only T5-base (220M) on 8 GLUE tasks. No evaluation on larger models, decoder-only architectures, generation tasks, or any domain beyond NLU classification. The paper's own Limitations section acknowledges this but the experimental contribution is thin — 8 classification tasks on one small encoder-decoder model does not establish generalizability of the method."
      ],
      "must_fix_items": [
        "Add standard deviations and significance tests across the 3 seeds for all methods in Table 1. Without this, the +2.29 improvement claim over DO-Merging cannot be validated.",
        "Explain or fix the Supervised Coeff baseline: 79.79% with labeled data is implausibly low and suggests either a bug or an intentionally weak implementation. At minimum, report the same hyperparameter search budget for all methods.",
        "Add at least TIES-Merging and one other related-work baseline (KnOTS or IterIS) to Table 1, since these are cited in Section 2 as relevant methods but excluded from evaluation."
      ],
      "conference_scores": null
    }
  ]
}