Title: FIT CARDS FOR AGENTIC MARKETPLACE SEARCH: QUERY-CONDITIONED STRUCTURED METADATA TO REDUCE WELFARE LOSS AT LARGE CONSIDERATION SETS
PDF: marketplace-search-fit-cards-scaling.pdf
Score: 3.5
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 61.7s

Strengths:
1. Clear problem identification: The paper identifies a concrete information bottleneck in agentic marketplace search — truncated descriptions lose fit signals that LLM agents cannot recover through reasoning alone. This is demonstrated with baseline contacted-fit rate of only 7.8% (Table 2, condition B), providing strong motivation for platform-side intervention.
2. Strong experimental contrast: The 11.4× welfare improvement (783.78 vs 68.51, Table 1) and the roughly 9.4× improvement in contacted-fit rate (73.6% vs 7.8%, Table 2) are dramatic, and the welfare gain holds in each of the 3 seeds (882.23, 705.90, 763.21). Statistical significance is reported in Section 3.2 (p < 0.005, Welch's t-test).
3. Insightful negative results on agent-side interventions: Section 3.2 shows that prompting hurts welfare (68.51 → −35.85) and that inference scaling also hurts (−29.71) despite 4.7× more LLM calls (Table 1). This is a valuable finding that pinpoints the bottleneck as informational rather than computational.
4. Mechanism decomposition: Table 2 decomposes welfare into discovery-stage (contacted-fit rate, oracle utility) and proposal-stage (first-proposal acceptance) components, clearly showing that the welfare gain comes from discovery quality rather than negotiation behavior. This is a well-structured ablation analysis.
5. Efficiency gain: Fit Cards achieve dramatically better outcomes while using 28% fewer LLM calls (6,898 vs 9,613, Table 1), demonstrating that better information architecture is both more effective and more efficient.
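The headline figures quoted in the strengths above can be re-derived from the numbers cited in this review. The sketch below is only an arithmetic check: it assumes the three per-seed welfare values are the complete Fit Cards sample and that the reported spread is a sample standard deviation (n − 1 denominator). The variable names are illustrative and do not come from the paper.

```python
from statistics import mean, stdev

# Figures quoted in this review (Tables 1 and 2 of the paper).
fit_card_seeds = [882.23, 705.90, 763.21]      # per-seed welfare, Fit Cards
baseline_welfare = 68.51                       # mean welfare, condition B
fit_rate_cards, fit_rate_baseline = 73.6, 7.8  # contacted-fit rate, percent
calls_cards, calls_baseline = 6898, 9613       # LLM calls

welfare_mean = mean(fit_card_seeds)
welfare_sd = stdev(fit_card_seeds)  # ASSUMPTION: sample SD, n - 1 denominator

summary = {
    "welfare_mean": round(welfare_mean, 2),
    "welfare_sd": round(welfare_sd, 2),
    "cv": round(welfare_sd / welfare_mean, 2),
    "welfare_ratio": round(welfare_mean / baseline_welfare, 1),
    "fit_rate_ratio": round(fit_rate_cards / fit_rate_baseline, 1),
    "call_reduction_pct": round(100 * (1 - calls_cards / calls_baseline), 1),
}
print(summary)
```

Under these assumptions the check reproduces the 783.78 mean, the 89.95 standard deviation, the 11.4× welfare ratio, and the 28% call reduction; the contacted-fit ratio comes out at about 9.4×.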

Weaknesses:
1. Over-packaging of a simple intervention: 'Fit Cards' are structured metadata with sorting and ranking, a straightforward application of query matching against a catalog. The core idea (intersect the query requirements with each catalog item, sort by relevance, show the rank) is standard in information retrieval and recommendation systems. The paper wraps this in elaborate formalization and branding ('Fit Cards') that inflates the perceived novelty. Concretely, the intervention computes |R∩M_j| and |A*∩A_j|, sums prices, and sorts lexicographically (Section 2.2); this is basic database query matching, not a novel mechanism.
2. Single-domain, single-LLM evaluation with a simulated environment: All experiments use one domain (Mexican restaurants, 100 businesses), one LLM (claude-sonnet-4), and one simulation environment (Magentic Marketplace). The paper acknowledges this limitation (Section 5) but does nothing to address it. There is no evidence that Fit Cards transfer to other domains (e.g., product search, job matching, housing), other LLM backbones, or real-world marketplace conditions. The 100-business scale is also very small for a 'large consideration sets' claim.
3. Unfair baseline design: truncating descriptions to L=40 tokens yields an artificially weak baseline. The status quo baseline (condition B) truncates descriptions to 40 tokens for 100 results, creating an extreme information bottleneck. This is not a realistic marketplace design: real platforms show multiple information signals (ratings, prices, availability badges, category tags) in search results. A fairer baseline would include structured metadata (e.g., price range, cuisine tags) alongside descriptions, which is standard practice on platforms like Yelp or Google Maps. Because the 11.4× improvement is measured against this straw-man baseline, its magnitude says little about real-world impact.
4. No significance testing for the mechanism analysis: While Table 1 reports statistical significance, Table 2 (mechanism diagnostics) does not report comparable p-values or confidence intervals. Standard errors are shown for the contacted-fit rate, but no formal tests support the mechanism claims. Additionally, with only 3 seeds per condition, statistical power is limited. The standard deviation of 89.95 on a mean Fit Cards welfare of 783.78 is non-trivial (CV ≈ 0.11), and seed-to-seed variability of that size cannot be well characterized with n=3.
5. Welfare metric design advantages Fit Cards by construction: The welfare function W = Σ(U_ij - c_i) with binary fit F_ij = 1[R_i ⊆ M_j ∧ A*_i ⊆ A_j] and oracle utility U_ij = 2·V_i·F_ij - P_j heavily rewards exact requirement matching. Since Fit Cards directly compute the same signals used in the fit function (items hit, amenities hit, price), the metric is structurally aligned with the intervention. A welfare metric that values partial matches or considers negotiation outcomes would likely show much smaller gains.
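To make Weaknesses 1 and 5 concrete, the following minimal sketch implements the fit function and oracle utility as summarized in this review, next to a Fit Card reduced to the three signals named in Weakness 1. It is not the authors' implementation. The class names, the toy businesses, the sort direction (more hits first, then lower price), and the reading of P_j as the total price of the requested items a business offers are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Customer:
    items: frozenset      # R_i: requested menu items
    amenities: frozenset  # A*_i: required amenities
    value: float          # V_i

@dataclass
class Business:
    name: str
    menu: dict            # item -> price; the keys play the role of M_j
    amenities: frozenset  # A_j

def fit(c, b):
    """F_ij = 1[R_i ⊆ M_j and A*_i ⊆ A_j]."""
    return int(c.items <= set(b.menu) and c.amenities <= b.amenities)

def price(c, b):
    # ASSUMPTION: P_j is the total price of the requested items that b offers.
    return sum(b.menu[i] for i in sorted(c.items) if i in b.menu)

def oracle_utility(c, b):
    """U_ij = 2 * V_i * F_ij - P_j."""
    return 2 * c.value * fit(c, b) - price(c, b)

def fit_card(c, b):
    """Hypothetical card: (|R ∩ M_j|, |A* ∩ A_j|, summed price)."""
    return (len(c.items & set(b.menu)), len(c.amenities & b.amenities), price(c, b))

def fit_from_card(c, card):
    # The binary fit is recoverable from the card signals alone.
    return int(card[0] == len(c.items) and card[1] == len(c.amenities))

# Toy data, invented for illustration only.
customer = Customer(frozenset({"taco", "horchata"}), frozenset({"parking"}), 20.0)
businesses = [
    Business("B", {"taco": 7.0}, frozenset({"parking"})),
    Business("C", {"taco": 8.0, "horchata": 2.0}, frozenset()),
    Business("A", {"taco": 9.0, "horchata": 3.0}, frozenset({"parking", "wifi"})),
]

def sort_key(b):
    items_hit, amenities_hit, total = fit_card(customer, b)
    # ASSUMPTION: lexicographic order, more hits first, then cheaper.
    return (-items_hit, -amenities_hit, total)

ranked = sorted(businesses, key=sort_key)
for b in ranked:
    print(b.name, fit_card(customer, b), fit(customer, b), oracle_utility(customer, b))
```

In this toy setup `fit_from_card` agrees with `fit` for every business, which is the structural alignment described in Weakness 5: an agent that reads the card already holds the quantity the welfare metric rewards.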

Must Fix Items:
1. Add a fairer baseline that includes basic structured metadata (price, category tags, availability indicators) alongside descriptions — this is the actual status quo of real marketplace search, not 40-token text-only truncation.
2. Evaluate on at least one additional domain or dataset to demonstrate generalizability beyond Mexican restaurant search.
3. Report formal statistical tests for mechanism analysis claims (Table 2), not just for the main welfare comparison.
4. Acknowledge and analyze the alignment between the welfare metric's fit definition and the Fit Card signals — discuss whether results would hold under partial-fit or negotiation-weighted welfare metrics.
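For Must Fix item 3, the same Welch's t-test used for the welfare comparison could be applied to each Table 2 diagnostic. The sketch below computes the Welch statistic and the Welch-Satterthwaite degrees of freedom using only the Python standard library. The function name `welch_t` is illustrative, and the two input lists are toy placeholders, not values from the paper; the per-seed diagnostics would have to come from the authors.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df

# PLACEHOLDER inputs (not from the paper): substitute per-seed values
# of a Table 2 diagnostic for each of the two conditions being compared.
toy_a = [1.0, 2.0, 3.0]
toy_b = [2.0, 4.0, 6.0]
t_stat, dof = welch_t(toy_a, toy_b)
print(f"t = {t_stat:.3f}, df = {dof:.3f}")
```

A p-value then follows from the t distribution with `dof` degrees of freedom. With three seeds per condition the degrees of freedom stay small, so confidence intervals would be wide unless more seeds are run.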

Runs:
- run=1 score=3.5 verdict=Strong Reject confidence=0.6 error=None