{
  "pdf": "grad-ratio-edit-location.pdf",
  "title": "GRADRATIO-SELECT: GRADIENT-BASED LAYER SELECTION FOR FINE-TUNING MODEL EDITING FARS Analemma",
  "elapsed": 49.1,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 2.5,
  "scores": [
    2.5
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2,
    "presentation": 2.5,
    "contribution": 1.5,
    "overall_rating": 2.5,
    "confidence": 3
  },
  "strengths": [
    "The paper honestly reports negative/partial results: GradRatio-Select does not improve upon heuristics and degrades on LLaMA-3-8B (Table 1: Capability 39.52 vs 44.78, −5.26pp). This transparency about limitations is commendable and rare in the field.",
    "The ablation on the cosine similarity term (Section 3.5) provides useful empirical insight: cosine similarities between edit and retain gradients are near-zero (0.001–0.14), making the conflict term uninformative. This negative finding is informative for future work on gradient-based layer selection.",
    "The adaptive threshold mechanism (Equation 5) is well-motivated by the observation that early layers have artificially high GradRatio scores because their retain gradients are near zero, not because they are highly editable. The ablation confirms this: without the threshold, layer 4 is selected on LLaMA-3-8B, and capability collapses to 17.62% (Section 3.5), 27.16pp below the baseline."
  ],
  "weaknesses": [
    "The core contribution is extremely thin: GradRatio-Select is essentially ‖g_edit‖²/‖g_retain‖² with a hardcoded threshold ℓ_min = max(3, ⌈0.15L⌉). On Qwen2.5-7B it reproduces the heuristic (same layer), and on LLaMA-3-8B it fails (−5.26pp capability, −23.86pp on GSM8K). The method does not improve over the baseline it aims to replace, making the contribution marginal at best (Table 1, Table 2).",
    "The adaptive threshold (Equation 5) is itself a heuristic with no principled derivation, and the constants 3 and 0.15 appear arbitrary. The paper critiques heuristic layer selection but replaces it with a heuristic threshold plus a gradient ratio that still fails on one of two test models. This undermines the claimed automation advantage: if the threshold needs per-model tuning, the method is no more automated than the heuristics it replaces (Section 2.3).",
    "Only 2 models are evaluated, both in the 7B-8B parameter range, on a single dataset (ZsRE). The method fails on one of the two models, and there is no evidence that it generalizes to other architectures, scales, or editing datasets. The paper acknowledges that it was generated by an automated research system, which may explain the narrow scope but does not excuse it (Section 3.1, abstract WARNING).",
    "The GSM8K result on LLaMA-3-8B (28.05% with std=8.22) has extremely high variance across seeds, with one seed scoring 18.57%. This raises questions about statistical reliability. With only 3 seeds and such high variance, the reported −23.86pp gap may not be statistically significant, yet the paper presents it as a definitive finding without any statistical tests (Section 3.3)."
  ],
  "must_fix_items": [
    "Run statistical significance tests (e.g., paired t-test or bootstrap) on the LLaMA-3-8B capability results, especially GSM8K where std=8.22 with n=3 seeds. Report p-values to justify claims about capability degradation.",
    "Evaluate on more than 2 models and more than 1 dataset to demonstrate any generalizability of the method. At minimum, test on a third architecture (e.g., Mistral, Phi) and a second editing dataset.",
    "Justify or ablate the threshold constants (3 and 0.15) rather than presenting them as fixed. Show how sensitive ℓ_min is to these values, or derive them from model properties."
  ],
  "runs": [
    {
      "run": 1,
      "score": 2.5,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "The paper honestly reports negative/partial results: GradRatio-Select does not improve upon heuristics and degrades on LLaMA-3-8B (Table 1: Capability 39.52 vs 44.78, −5.26pp). This transparency about limitations is commendable and rare in the field.",
        "The ablation on the cosine similarity term (Section 3.5) provides useful empirical insight: cosine similarities between edit and retain gradients are near-zero (0.001–0.14), making the conflict term uninformative. This negative finding is informative for future work on gradient-based layer selection.",
        "The adaptive threshold mechanism (Equation 5) is well-motivated by the observation that early layers have artificially high GradRatio scores because their retain gradients are near zero, not because they are highly editable. The ablation confirms this: without the threshold, layer 4 is selected on LLaMA-3-8B, and capability collapses to 17.62% (Section 3.5), 27.16pp below the baseline."
      ],
      "weaknesses": [
        "The core contribution is extremely thin: GradRatio-Select is essentially ‖g_edit‖²/‖g_retain‖² with a hardcoded threshold ℓ_min = max(3, ⌈0.15L⌉). On Qwen2.5-7B it reproduces the heuristic (same layer), and on LLaMA-3-8B it fails (−5.26pp capability, −23.86pp on GSM8K). The method does not improve over the baseline it aims to replace, making the contribution marginal at best (Table 1, Table 2).",
        "The adaptive threshold (Equation 5) is itself a heuristic with no principled derivation, and the constants 3 and 0.15 appear arbitrary. The paper critiques heuristic layer selection but replaces it with a heuristic threshold plus a gradient ratio that still fails on one of two test models. This undermines the claimed automation advantage: if the threshold needs per-model tuning, the method is no more automated than the heuristics it replaces (Section 2.3).",
        "Only 2 models are evaluated, both in the 7B-8B parameter range, on a single dataset (ZsRE). The method fails on one of the two models, and there is no evidence that it generalizes to other architectures, scales, or editing datasets. The paper acknowledges that it was generated by an automated research system, which may explain the narrow scope but does not excuse it (Section 3.1, abstract WARNING).",
        "The GSM8K result on LLaMA-3-8B (28.05% with std=8.22) has extremely high variance across seeds, with one seed scoring 18.57%. This raises questions about statistical reliability. With only 3 seeds and such high variance, the reported −23.86pp gap may not be statistically significant, yet the paper presents it as a definitive finding without any statistical tests (Section 3.3)."
      ],
      "must_fix_items": [
        "Run statistical significance tests (e.g., paired t-test or bootstrap) on the LLaMA-3-8B capability results, especially GSM8K where std=8.22 with n=3 seeds. Report p-values to justify claims about capability degradation.",
        "Evaluate on more than 2 models and more than 1 dataset to demonstrate any generalizability of the method. At minimum, test on a third architecture (e.g., Mistral, Phi) and a second editing dataset.",
        "Justify or ablate the threshold constants (3 and 0.15) rather than presenting them as fixed. Show how sensitive ℓ_min is to these values, or derive them from model properties."
      ],
      "conference_scores": {
        "soundness": 2,
        "presentation": 2.5,
        "contribution": 1.5,
        "overall_rating": 2.5,
        "confidence": 3
      }
    }
  ]
}