Title: LOGITGATE: PROBE-GATED OUTPUT LOGIT BIAS AS A SIMPLIFICATION OF ACTIVATION STEERING FOR TOOL CALLING FARS
PDF: asa-probe-logit-bias.pdf
Score: 4.2
Verdict: Reject
Confidence: 0.60
Elapsed: 111.5s

Strengths:
1. Clear and well-motivated practical contribution: LogitGate addresses a real deployment limitation of activation steering (mid-layer hooks unavailable in production frameworks like vLLM/TensorRT-LLM) and demonstrates a viable alternative using standard logit processor interfaces (Section 1, Section 3.4).
2. Fair comparison methodology: The authors match FPR within ±1 percentage point on the development set before evaluating on the test set, ensuring that Trigger-F1 comparisons between ActGate and LogitGate are at equivalent false positive rates (Section 3.5, Table 1 shows both at FPR=0.0833).
3. Insightful K-invariance finding: The discovery that K=1 suffices and all K values produce identical results (Table 3) supports the 'boundary calibration' hypothesis, which provides mechanistic understanding of why the simplification works — steering primarily affects the first-token commitment decision rather than sustained generation (Section 4.4).
4. Strong gate ablation demonstrating safety-critical role: Table 4 shows that removing the probe-guided gate raises FPR from 0.0833 to 0.875 (a 10.5× increase), clearly demonstrating that the gate is not decorative but essential for selective intervention (Section 4.5).
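
For readers unfamiliar with the mechanism these strengths refer to, the following is a minimal sketch of a probe-gated logit bias with a K-step window and early stop. It is not the authors' implementation: the function name, the argument names, the direction of the gate (`probe_score >= tau`), and the additive form of the bias are all assumptions made for illustration. Only the ingredients (a probe gate with threshold τ, a bias strength β, a window of K steps, and an early stop once a trigger token is emitted) come from the review above.

```python
from typing import List


def apply_gated_bias(
    logits: List[float],
    probe_score: float,
    tau: float,
    beta: float,
    trigger_id: int,
    step: int,
    k: int,
    trigger_emitted: bool,
) -> List[float]:
    """Return a copy of `logits`, adding `beta` to the trigger token if the gate is open.

    Hypothetical sketch: the gate direction and additive bias are assumptions.
    """
    gate_open = probe_score >= tau   # probe-guided gate (assumed direction)
    in_window = step < k             # bias only the first k decoding steps
    biased = list(logits)            # never mutate the caller's logits
    if gate_open and in_window and not trigger_emitted:
        biased[trigger_id] += beta   # output-side bias on the trigger token
    return biased


# Gate open at step 0 with K=1: the trigger token (index 0) is boosted.
boosted = apply_gated_bias([0.0, 1.0, 2.0], 0.9, 0.5, 3.0, 0, 0, 1, False)

# Gate closed (probe score below tau): logits pass through unchanged.
untouched = apply_gated_bias([0.0, 1.0, 2.0], 0.1, 0.5, 3.0, 0, 0, 1, False)
```

Because the function only reads and rewrites the final logits, a routine of this shape could in principle run inside a standard logit processor hook, which is the deployment advantage noted in Strength 1.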

Weaknesses:
1. Extremely narrow evaluation scope: Only one model (Qwen2.5-1.5B-Instruct) and one benchmark (BFCL v4 single-turn, N=248) are tested. The paper's own conclusion acknowledges this limitation. It is unknown whether LogitGate works on larger models, different architectures, or multi-turn settings. The 1.5B model may have particularly noisy trigger decisions that are easily improved by logit bias; larger models may already have well-calibrated triggers leaving little room for improvement (Section 4.1, Section 5).
2. Tiny development split raises overfitting concerns: The probe is trained on only 50 examples (40 relevance, 10 irrelevance) (Appendix A). With 10 irrelevance examples, even a single misclassification shifts FPR by 10 percentage points. The hyperparameters (τ, β) are also tuned on this same split. This small sample size undermines confidence in the generality of the reported results, and no cross-validation or statistical significance tests are reported (Appendix A, Section 3.5).
3. No statistical significance or variance reporting: All results are from a single run with greedy decoding. No confidence intervals, standard deviations, or statistical tests are provided. Given the small test set (N=248, only 48 irrelevance examples), the observed differences (e.g., Trigger-F1 0.9774 vs 0.9825, a gap of 0.0051) could be within noise. The 80.7% recovery ratio is computed from point estimates without uncertainty quantification (Table 1, Section 4.2).
4. Modest and potentially insignificant absolute improvement: The absolute Trigger-F1 gap between LogitGate (0.9774) and prompt-only (0.9561) is 0.0213, and the gap versus ActGate is only 0.0051. With no significance testing and a small test set, it is unclear whether these differences are meaningful. The 80.7% recovery metric expresses a small absolute gap as a large percentage, which risks overselling the result (Table 1, Equation 7).
5. Overclaiming concern: the 'boundary calibration hypothesis' is overstated. The K-invariance finding (Table 3) is presented as a novel hypothesis, but it is largely a direct consequence of the early-stop mechanism: once a trigger token is emitted at step 1, the bias is no longer applied. If the model already commits to the trigger token at step 1, K>1 cannot have any effect. The invariance is therefore more plausibly an artifact of the design than a deep insight about model behavior (Section 3.3 early-stop mechanism, Section 4.4).
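
The uncertainty behind Weaknesses 2 to 4 can be made concrete with a short calculation. The sketch below computes a Wilson score interval for the reported FPR and reproduces the recovery ratio from the Table 1 point estimates. Two things are assumptions rather than facts from the paper: the count of 4 false positives out of 48 is inferred from FPR=0.0833, and the recovery formula (the share of ActGate's gain over prompt-only that LogitGate retains) is a guess at Equation 7 that happens to reproduce the reported 80.7%.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def recovery_ratio(simplified: float, full: float, baseline: float) -> float:
    """Assumed form of the recovery metric: (simplified - baseline) / (full - baseline)."""
    return (simplified - baseline) / (full - baseline)


# 4 false positives out of 48 irrelevance examples (inferred from FPR=0.0833).
fpr_lo, fpr_hi = wilson_interval(4, 48)

# Trigger-F1 values quoted in the review: LogitGate, ActGate, prompt-only.
ratio = recovery_ratio(0.9774, 0.9825, 0.9561)

print(f"FPR 95% interval: [{fpr_lo:.3f}, {fpr_hi:.3f}]")
print(f"Recovery ratio: {ratio:.3f}")
```

Under these assumptions the FPR interval spans roughly 3% to 20%, which illustrates how little 48 irrelevance examples constrain the estimate, and the recovery ratio is a quotient of two differences that are each only a few hundredths.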

Must Fix Items:
1. Add statistical significance tests or confidence intervals for the main comparison in Table 1, especially given the small test set size (N=248, 48 irrelevance examples).
2. Report results on at least one additional model scale or architecture to demonstrate generalizability beyond Qwen2.5-1.5B-Instruct.
3. Increase the development split size or use cross-validation to address overfitting concerns with the current 50-example split used for both probe training and hyperparameter tuning.
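
One way to address the first must-fix item is a paired bootstrap over test examples, sketched below. The function names and the toy labels are hypothetical, and the per-example predictions would have to come from the authors' own runs; the review has no access to them. The sketch resamples examples with replacement, recomputes F1 for both systems on each resample, and reports a percentile interval for the difference.

```python
import random
from typing import List, Tuple


def f1(y_true: List[int], y_pred: List[int]) -> float:
    """Binary F1 with the positive class encoded as 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def paired_bootstrap_f1_diff(
    y_true: List[int],
    pred_a: List[int],
    pred_b: List[int],
    n_boot: int = 2000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile 95% interval for F1(a) - F1(b), resampling examples jointly."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # same indices for both systems
        t = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(f1(t, a) - f1(t, b))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]


# Toy data only (hypothetical labels, not the paper's results).
toy_true = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1]
toy_a = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
toy_b = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
lo, hi = paired_bootstrap_f1_diff(toy_true, toy_a, toy_b)
```

If the resulting interval for the ActGate versus LogitGate difference contains zero, the 0.0051 gap cannot be distinguished from resampling noise on this test set.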

Runs:
- run=1 score=4.2 verdict=Reject confidence=0.6 error=None