Title: DELTA-MAP BELIEF UPDATES FOR STABLE SPATIAL REVISION IN VISION-LANGUAGE MODELS FARS Analemma
PDF: tos-delta-map-updates.pdf
Score: 3.2
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 54.1s

Strengths:
1. Clear and well-motivated hypothesis: the paper identifies a specific failure mode (full regeneration causing transcription errors and wasteful computation) and proposes a targeted solution (sparse delta updates), with the sparse evidence premise empirically validated (~30% objects updated per step, Figure 2 left).
2. Strong improvement in false-belief identification F1: providing prior context with explicit preserve/overwrite rules yields +16.7 to +16.9 pp over scratch regeneration (0.310→0.477-0.479, Table 1), demonstrating that the interface design matters significantly for spatial belief revision.
3. Demonstrated efficiency gains with maintained performance: delta-map updates match full regeneration F1 (0.479 vs 0.477) while producing 52-63% smaller structured outputs (Figure 2 right), showing practical benefit of the sparse update strategy.

Weaknesses:
1. Minimal novelty in the core technique: delta-map updates amount to prompting the VLM to output only changed entries rather than the full map, followed by a programmatic merge (the Apply function in Eq. 2). This is essentially a prompt-engineering variation on top of the Theory of Space benchmark protocol, not a new learning algorithm or architectural contribution.
2. Belief inertia remains high and unaddressed across all conditions (positional inertia 0.52-0.61, Table 1), and the paper acknowledges this fundamental challenge without proposing any solution. The core claimed contribution (delta-map updates) does not actually solve the central problem of spatial belief revision: it only marginally affects the inertia metrics and mainly helps with identifying which beliefs have changed.
3. Unfair baseline comparison due to different temperatures: Condition A uses temperature 1.0 while Conditions B and C use temperature 0.5 (Section 4.1). The 16.7 pp F1 improvement attributed to prior context may be partially confounded by the lower temperature producing more deterministic/consistent outputs. Without controlling for temperature, the claimed improvement is unreliable.
4. Tiny differences of unestablished significance between Conditions B and C: the F1 difference is 0.479 vs 0.477 (well within the reported standard error of ±0.019), cognitive map overall accuracy differs by only 0.011 (0.236 vs 0.225, Table 2), and inertia metrics show mixed results. No statistical significance tests are reported anywhere in the paper, making it impossible to determine whether observed differences are real or due to noise.
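
To make Weakness 1 concrete: a merge of the kind described could plausibly be as small as the sketch below. This is an illustration under stated assumptions, not the paper's Eq. 2. The map representation (a dict keyed by object name), the `(x, y, facing)` entry layout, the use of `None` as a removal marker, and the name `apply_delta` are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical entry layout: (x, y, facing). The paper's actual schema may differ.
Entry = Tuple[int, int, str]


def apply_delta(
    prior: Dict[str, Entry],
    delta: Dict[str, Optional[Entry]],
) -> Dict[str, Entry]:
    """Merge a sparse delta into a prior map without mutating the prior.

    Assumed rules (not taken from the paper):
      - a key absent from the delta is preserved unchanged,
      - a key mapped to an entry overwrites (or adds) that object,
      - a key mapped to None removes that object.
    """
    merged = dict(prior)  # shallow copy so the prior belief stays intact
    for name, entry in delta.items():
        if entry is None:
            merged.pop(name, None)  # removal; ignore objects that were never known
        else:
            merged[name] = entry  # last write wins, no conflict detection
    return merged
```

Under these assumptions the merge is a last-write-wins dictionary update, which is why the review treats the conflict-resolution policy, rather than the merge itself, as the part that needs specification.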

Must Fix Items:
1. Control for temperature across all conditions or provide strong justification for different temperatures; run Condition A at temperature 0.5 to isolate the effect of prior context from temperature effects.
2. Report statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) for all claimed improvements, especially the F1 comparisons between conditions.
3. Clarify the Apply function (Eq. 2): what happens when the delta conflicts with the prior map in ambiguous ways? How are conflicts resolved programmatically? This is the core mechanism and is left unspecified.
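
For Must Fix item 2, one option is a paired bootstrap over evaluation episodes. The sketch below assumes that per-episode scores are available for both conditions on the same episodes, which the paper does not state. The function name `paired_bootstrap_ci` and its parameters are hypothetical. It also averages per-episode scores, which only approximates a corpus-level F1.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, CI lower, CI upper) for scores_b - scores_a.

    scores_a[i] and scores_b[i] must come from the same episode, so that
    resampling episodes keeps the pairing intact.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equally long score lists")

    rng = random.Random(seed)  # seeded for reproducible intervals
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)

    # Resample episodes with replacement and record the mean paired difference.
    resampled_means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(mean(sample))
    resampled_means.sort()

    # Percentile interval at level 1 - alpha.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return mean(diffs), resampled_means[lo_idx], resampled_means[hi_idx]
```

If the resulting interval for Condition C minus Condition B contains zero, the 0.479 vs 0.477 gap should be reported as a tie rather than as an improvement.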

Runs:
- run=1 score=3.2 verdict=Strong Reject confidence=0.6 error=None