Title: ANSWER-FREE SELF-REFERENTIAL CRITICS: TRAINING SOLVE-THEN-JUDGE VLM JUDGES WITH PREFERENCE LABELS BUT WITHOUT GROUND-TRUTH ANSWERS
PDF: answer-free-self-referential-critic.pdf
Score: 2.8
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 149.4s

Strengths:
1. The paper identifies a real and practical bottleneck: Solve-Then-Judge critic training requires ground-truth answers for the self-prediction reward, but large-scale preference datasets (VLFeedback, MM-RLHF) only provide pairwise preferences. The proposed pseudo-label extraction from preferred responses (Eq. 2-3) is a straightforward and sensible approach to remove this dependency. (Section 3.2)
2. The group consistency gating mechanism (Eq. 4-6) is a reasonable design that filters self-prediction rewards based on permutation invariance, directly addressing the known shortcut behavior of VLMs overfitting to option letters. This connects to prior work on self-consistency (Wang et al., 2022) and option-permutation robustness (Huang et al., 2025). (Section 3.3)
3. The two-pass rollout architecture (Section 3.4) that separates the solve pass from the judge pass is a clean design choice that prevents information leakage from candidate responses into the self-prediction, which is important for the integrity of the self-referential training signal.
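
For readers unfamiliar with the gating idea in Strength 2, the following is a minimal sketch of one possible reading. Eq. 4-6 are not reproduced in this review, so the all-permutations-must-agree rule, the function name `gated_self_prediction_reward`, the `predict` callback, and the toy options are all assumptions for illustration, not the paper's implementation.

```python
import itertools
from typing import Callable, Sequence


def gated_self_prediction_reward(
    predict: Callable[[Sequence[str]], int],
    options: Sequence[str],
    pseudo_label: str,
    max_perms: int = 4,
) -> float:
    """Binary reward that is granted only if the model's answer is
    invariant to option order (assumed gate) AND matches the pseudo-label.

    predict(ordered_options) returns the index of the chosen option
    within the ordering it was shown.
    """
    perms = itertools.islice(
        itertools.permutations(range(len(options))), max_perms
    )
    chosen_contents = set()
    for perm in perms:
        shuffled = [options[i] for i in perm]
        chosen_contents.add(shuffled[predict(shuffled)])

    # Gate: inconsistent choices across orderings earn no reward.
    if len(chosen_contents) != 1:
        return 0.0
    return 1.0 if chosen_contents.pop() == pseudo_label else 0.0


# Toy options and predictors (hypothetical, for illustration only).
options = ["the ball rolls left", "the ball falls", "the ball stays put"]


def content_based(shown: Sequence[str]) -> int:
    # Picks the same content wherever it appears.
    return list(shown).index("the ball falls")


def position_biased(shown: Sequence[str]) -> int:
    # Always picks the first slot, i.e. overfits to the option letter.
    return 0


reward_content = gated_self_prediction_reward(content_based, options, "the ball falls")
reward_biased = gated_self_prediction_reward(position_biased, options, "the ball falls")
print(reward_content, reward_biased)
```

Under this reading, a predictor that always picks the first slot is gated out even when it happens to land on the right option in the original ordering.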

Weaknesses:
1. Critical experimental confound: The pseudo-labels achieve 100% accuracy by construction because the dataset is built so that 'the response containing the correct answer is labeled as preferred' (Section 4.1). The pseudo-labels are therefore the ground-truth answers in disguise, and AF-SRC is not 'answer-free' in any meaningful sense: it relabels ground truth as 'pseudo-labels.' The 150% recovery ratio claim is consequently misleading. The method is not recovering from missing answers; it uses the same answers under a different variable name, plus a gating mechanism. (Section 4.5, explicitly acknowledged: 'pseudo-labels derived from preferred responses achieve 100% accuracy by construction')
2. Extremely limited experimental scope: only 451 training pairs and 113 test pairs from a single benchmark (Cosmos-Reason1-Benchmark) on a single model (Qwen2.5-VL-7B) in a single domain (physical reasoning). No statistical significance tests (e.g., bootstrap confidence intervals, paired t-tests) are reported. The absolute numbers are very small (e.g., debiased accuracy differences of 2-5 percentage points on 113 test items amount to only a handful of items and could easily be due to random variation). (Table 1, Section 4.1)
3. The debiased preference accuracy numbers are extremely low across all conditions (5.31% to 13.27%), meaning the models are barely above chance on the debiased metric. An agreement rate of 22-33% (Table 1) means the model disagrees with itself on most swapped-order evaluations, suggesting fundamental reliability issues. The paper frames these very poor absolute numbers as a success story, which is a presentation concern: the absolute performance is near-random on the primary metric. (Table 1, Section 4.2)
4. The paper claims AF-SRC 'exceeds oracle performance' but the 'why' analysis (Section 4.5) is purely hypothetical ('We hypothesize that group consistency gating acts as a curriculum'). No controlled ablation isolates the gating mechanism from the pseudo-label effect. Since the pseudo-labels are identical to the ground truth in this setup, the improvement over the oracle could simply be due to the binary gating acting as a regularizer that happens to help, which is not the novel insight the paper claims. The key ablation is missing: what happens with pseudo-labels but WITHOUT group consistency gating?
5. The method is narrowly applicable to MCQ settings where answers can be extracted from preferred responses (Section 3.2: 'For multiple-choice questions, this extraction identifies the selected option'). The paper acknowledges this limitation but still frames the contribution broadly as 'enabling scalable critic training on preference-only datasets' (Abstract), when it only works for MCQ preference datasets where the correct answer is always in the preferred response.
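
To make the sample-size concern in Weakness 2 concrete, the sketch below computes 95% Wilson score intervals for the two extreme debiased accuracies. It assumes the reported 5.31% and 13.27% are fractions of the 113 test pairs, which corresponds to 6 and 15 correct items; the function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


N_TEST = 113  # test pairs reported in Section 4.1

# Assumed counts: 6/113 = 5.31% and 15/113 = 13.27%.
intervals = {}
for correct in (6, 15):
    lo, hi = wilson_interval(correct, N_TEST)
    intervals[correct] = (lo, hi)
    print(f"{correct}/{N_TEST} = {correct / N_TEST:.2%}, 95% CI [{lo:.2%}, {hi:.2%}]")
```

Even the lowest and highest conditions have overlapping intervals under this approximation. Overlap is only suggestive, not a formal test; because all conditions are scored on the same items, a paired analysis would be the appropriate check.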

Must Fix Items:
1. Add an ablation condition with pseudo-labels but WITHOUT group consistency gating to isolate the contribution of each component. Without this, the 150% recovery claim cannot be attributed to the gating mechanism.
2. Report statistical significance tests or confidence intervals for all reported metrics, especially given the very small test set (113 items).
3. Be transparent in the framing that in the current experimental setup, pseudo-labels are identical to ground-truth answers. The 'answer-free' and '150% recovery' claims are misleading without this qualification prominently stated.
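
One way to address Must Fix Item 2 is a paired percentile bootstrap over test items, sketched below. The per-item outcome vectors are hypothetical placeholders, not the paper's data, and `paired_bootstrap_ci` is an illustrative name; the authors would substitute their actual per-item correctness for each pair of conditions.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B),
    resampling test items so the pairing between systems is preserved."""
    assert len(correct_a) == len(correct_b)
    n = len(correct_a)
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical outcomes on 113 items (1 = correct), NOT taken from the paper:
# system A is right on 15 items, system B on 10, and 6 items overlap.
correct_a = [1] * 15 + [0] * 98
correct_b = [1] * 6 + [0] * 9 + [1] * 4 + [0] * 94

observed = (sum(correct_a) - sum(correct_b)) / len(correct_a)
lo, hi = paired_bootstrap_ci(correct_a, correct_b)
print(f"observed diff = {observed:.2%}, 95% CI [{lo:.2%}, {hi:.2%}]")
```

In this hypothetical, a gap of about 4.4 percentage points yields an interval that includes zero, which illustrates why differences of this size on 113 items need explicit uncertainty estimates.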

Runs:
- run=1 score=2.8 verdict=Strong Reject confidence=0.6 error=None