Title: SUPPRESSION-CONTRAST TOKENS: EVALUATING REVERSE LAYER-CONTRAST FOR SECRET ELICITATION FARS Analemma
PDF: suppression-contrast-secret-elicitation.pdf
Score: 2.5
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 73.6s

Strengths:
1. Pre-registered success criteria demonstrate exceptional scientific rigor. The authors committed to 4 specific criteria (Section 4.2, Table 3) before running experiments, including both improvement thresholds and a negative control, which prevents post-hoc rationalization of marginal results. This is commendable and rare in ML research.
2. The DoLa-direction negative control is well-designed and informative. The near-zero performance of the DoLa-direction baseline (0.20% TR@5 vs 4.33% for logit lens, Table 1) confirms that the suppression direction (mid-minus-final) carries meaningful signal, validating Criterion 3. This provides a clean ablation that isolates the effect of contrast direction.
3. Honest reporting of negative results is valuable for the community. The paper transparently reports that SCT fails 3 of 4 pre-registered criteria (Table 3), that the suppression premise is weakly supported (~9.3% vs 30% threshold), and that SCT does not generalize to binary-attribute secrets (Table 2). This prevents other researchers from pursuing the same failed hypothesis without awareness of its limitations.

Weaknesses:
1. The core contribution is extremely incremental: SCT simply reverses the sign of DoLa's layer contrast (Equation 1: score = mid-layer log-prob minus final-layer log-prob). This is a one-line change from DoLa (log pN - log pL → log pL - log pN, where N is the final layer and L the mid layer), with no new algorithmic insight beyond the 'suppression hypothesis', which the paper's own experiments show is weakly supported (~9.3%, Table 3). The contribution is essentially a negative result on a trivially constructed variant.
2. The absolute performance levels are negligible across all methods, undermining the practical relevance. Even the best result (SCT on Taboo: 5.33% TR@5, Table 1) means the secret is recovered in only ~1 in 20 attempts. Logit lens achieves 4.33%. These rates are so low that none of these methods are useful for actual safety auditing, and the relative improvement (+23.1%) is misleading given the tiny absolute gain (+1.0pp).
3. The paper is generated by an automated research system (explicitly stated in the abstract), which raises concerns about depth of scientific reasoning. The experimental design, while rigorous in structure (pre-registration), lacks the exploratory analysis a human researcher would conduct: there is no analysis of which examples do show the suppression pattern, no visualization of layer-wise probability trajectories for the secret token, and no investigation of why the premise fails for ~90.7% of examples. These would be critical for understanding whether the hypothesis is wrong or just partially correct.
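
To make Weaknesses 1 and 2 concrete, the sketch below shows the sign flip between the two scores and the arithmetic behind the absolute and relative gains. It is an illustration only: the function names `dola_score`, `sct_score`, and `gains` are hypothetical, the toy probabilities are invented for the example, and only the 4.33% and 5.33% TR@5 figures come from Table 1 as quoted above.

```python
import math


def dola_score(final_logprob: float, mid_logprob: float) -> float:
    """DoLa-direction contrast: final-layer minus mid-layer log-prob."""
    return final_logprob - mid_logprob


def sct_score(final_logprob: float, mid_logprob: float) -> float:
    """SCT contrast as described in this review: mid minus final."""
    return mid_logprob - final_logprob


def gains(baseline_pct: float, method_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = method_pct - baseline_pct
    relative_pct = 100.0 * absolute_pp / baseline_pct
    return absolute_pp, relative_pct


# Toy token (invented values): likely at the mid layer, suppressed at the end.
mid_lp, final_lp = math.log(0.20), math.log(0.02)
print("SCT score: ", sct_score(final_lp, mid_lp))   # positive for a suppressed token
print("DoLa score:", dola_score(final_lp, mid_lp))  # exact negation of the SCT score

# TR@5 figures quoted from Table 1: logit lens 4.33%, SCT 5.33%.
abs_pp, rel_pct = gains(4.33, 5.33)
print(f"absolute gain: {abs_pp:.2f} pp, relative gain: {rel_pct:.1f}%")
```

The two scores are exact negations of each other, which is why the change amounts to a single line, and the relative figure looks large only because the baseline is small.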

Must Fix Items:
1. Provide per-example analysis of the ~9.3% of cases where the suppression premise holds: what distinguishes these from the ~90.7% where it fails? Without this, the paper cannot distinguish 'hypothesis is wrong' from 'hypothesis is correct but premise is too restrictive.'
2. Report confidence intervals for all results in Tables 1 and 2, rather than only noting in Table 3 that they 'overlap'. The reader needs to see the actual CIs to assess statistical significance independently.
3. The top-200 token extraction ceiling (Section 4.6) is a major confound. Report what SCT achieves on the subset where the secret is in the top-200, to isolate the scoring method's contribution from the candidate generation bottleneck.
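
Must Fix Items 2 and 3 could be addressed with analyses along the following lines. This is a minimal sketch, not the authors' pipeline: the Wilson score interval is one reasonable choice of interval for a proportion, the helper names and the record layout `(secret_in_top200, hit_at_5)` are hypothetical, and the counts in the usage lines are placeholders, since the evaluation set size is not stated in this review.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def conditional_hit_rate(records: list[tuple[bool, bool]]) -> float:
    """TR@5 restricted to examples whose secret is among the top-200 candidates.

    Each record is (secret_in_top200, hit_at_5); this layout is an assumption.
    """
    eligible = [hit for in_top200, hit in records if in_top200]
    if not eligible:
        raise ValueError("no example has the secret in the top-200")
    return sum(eligible) / len(eligible)


# Placeholder counts for illustration only (not taken from the paper).
lo, hi = wilson_interval(hits=16, n=300)
print(f"95% CI: [{100 * lo:.2f}%, {100 * hi:.2f}%]")

toy_records = [(True, True), (True, False), (False, False)]
print("conditional TR@5:", conditional_hit_rate(toy_records))
```

Reporting the conditional rate next to the unconditional one would separate the quality of the scoring rule from the candidate-generation ceiling.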

Runs:
- run=1 score=2.5 verdict=Strong Reject confidence=0.6 error=None