Benchmarking synthetic vs. human panels

1. Introduction

The ambition to simulate human behavior through artificial agents is not new. From early agent-based models in computational economics to modern large-language-model-driven personas, the question at the heart of the enterprise has remained remarkably stable across decades: do synthetic consumers behave like real consumers? The answer determines whether synthetic panels can serve as valid instruments for market research, product testing, and strategic forecasting — or whether they remain purely academic curiosities.

Market research has long been characterized by a tension between scale and fidelity. Traditional human panels deliver responses from actual decision-makers, but they are expensive, slow, and increasingly difficult to recruit. Response rates have fallen below 5% in many consumer categories, and the average cost per completed survey in a high-quality panel now exceeds $12 for a fifteen-minute instrument. Simultaneously, the rise of large language models has produced synthetic agents capable of articulating preferences, reasoning about trade-offs, and generating open-ended feedback that reads as fluent and contextually aware. The natural question is whether these agents, properly constructed and calibrated, can approximate the response distributions of their human counterparts.

To date, the literature has been fragmented. Individual studies have examined synthetic responses in narrow domains — political polling (Argyle et al., 2023), brand perception (Aher et al., 2023), and moral reasoning (Dillion et al., 2023) — with mixed results and small sample sizes. What has been missing is a systematic, multi-category, multi-dimensional evaluation that compares synthetic and human panels across a unified experimental protocol using instruments that mirror real market research practice. This paper reports exactly such an evaluation.

Over a six-month period from December 2025 to May 2026, our research team conducted 47 side-by-side experiments comparing synthetic consumer cohorts against demographically matched human panels. The experiments spanned four major consumer categories, covered five distinct measurement dimensions, and involved over 18,000 human participants and nearly 50,000 synthetic agent runs. This article presents the experimental design, the measurement framework, the quantitative results, and our interpretation of what these findings mean for the future of consumer research. The headline result, a weighted average correlation of 0.88 between synthetic and human response distributions, is strong but nuanced, and the details matter as much as the aggregate.

2. Experimental design

Our experimental protocol was designed to answer a single overarching question: when presented with identical stimuli and measurement instruments, do synthetic consumer panels produce response distributions that are statistically indistinguishable from those produced by demographically matched human panels? To answer this question, we designed a repeated-measures structure in which each of the 47 experiments consisted of an identical survey instrument fielded simultaneously to a human panel and a synthetic panel constructed to match the human panel's demographic profile.

The 47 experiments were distributed across four consumer categories: consumer packaged goods (CPG, n = 14), technology products and services (tech, n = 12), financial services (n = 10), and media and entertainment (n = 11). These categories were selected to represent a broad spectrum of decision complexity. CPG purchases are typically low-involvement, habitual, and price-sensitive; technology purchases involve feature trade-offs and brand ecosystems; financial services involve trust, risk tolerance, and long-term consequences; media and entertainment choices are driven by taste, identity expression, and contextual factors.

Each experiment employed a survey instrument with between 12 and 24 items, including single-select multiple-choice (e.g., "Which of the following brands would you most likely purchase?"), multi-select (e.g., "Which features are important to you?"), Likert-scale ratings (5- and 7-point scales), constant-sum allocations (e.g., "Distribute 100 points across these attributes based on importance"), and open-ended text responses (e.g., "What would you change about this product?"). Every instrument was pre-tested with a pilot group of 50 human participants to ensure clarity, avoid leading phrasing, and confirm that response distributions showed adequate variance.

Human panels were recruited through a combination of online panel aggregators, social media targeting, and email list sampling. Panel sizes ranged from 300 to 1,200 participants, with a median of 580. Quota sampling was used to ensure representativeness on age, gender, income, and geographic region relative to US census benchmarks. Synthetic panels were constructed by generating between 500 and 2,000 agent personas per experiment, matched to the same demographic quotas as their human counterparts. The synthetic agents were instantiated using our proprietary persona-generation pipeline, which conditions each agent on a rich demographic and psychographic profile and generates responses through a structured multi-step reasoning process rather than a single language model call.

All experiments were conducted blind: human panelists were unaware that synthetic responses were being collected, and no experimental condition involved human-synthetic interaction. The synthetic agents were not given access to the human response distributions during generation. Each experiment was replicated three times over a two-week period to assess temporal stability, and the results reported here are averages across replications with standard errors computed at the experiment level.

A critical design choice was to use real, not hypothetical, products and services as stimuli. For CPG experiments, we used actual branded products with current pricing. For tech experiments, we used real device configurations and service plans as offered on manufacturer websites during the study period. For financial services, we used actual terms from current credit card, mortgage, and investment product offerings. For media and entertainment, we used recently released films, streaming series, and music albums. This ecological validity is essential, but it comes with a known confound: synthetic agents trained on web-scale text may have been exposed to real products and their cultural reception, which could inflate apparent accuracy. We accepted this as a feature rather than a bug, because our goal was not to test whether synthetic agents can reason about unfamiliar stimuli in isolation, but whether they can reproduce real human response distributions in ecologically valid settings.

3. Methodology

The construction of synthetic cohorts was the subject of extensive internal development and validation prior to this benchmark study. Each synthetic agent in our system is defined by a multi-dimensional profile that includes demographic attributes (age, gender, income bracket, education level, geographic region, urbanicity), psychographic attributes (values, interests, lifestyle segments based on VALS and proprietary taxonomies), and behavioral attributes (purchase frequency, brand loyalty indices, category involvement scores). These profiles are derived from large-scale consumer survey data and are designed to produce agents that reflect the heterogeneity of real consumer populations.

When a survey instrument is presented to a synthetic panel, each agent processes the questionnaire through a three-stage pipeline. In the first stage, the agent contextualizes the stimulus relative to its own profile: a high-income urban professional in the 25–34 age bracket will approach a premium credit card offer differently from a rural retiree, even before any product-specific reasoning occurs. In the second stage, the agent performs multi-attribute reasoning, weighing the relevant features of the product or service against its own preference structure, which is itself derived from the demographic and psychographic profile. In the third stage, the agent produces a response in the format required by the instrument, including generating natural-language text for open-ended items.

Human panel recruitment followed standard best practices for online survey research. We used a multi-source strategy to mitigate the well-documented biases of any single panel provider. Primary recruitment was through two major panel aggregators (Dynata and Lucid), supplemented by targeted social media advertisements and email outreach to consumer research databases. Each participant was screened for attentiveness using instructional manipulation checks and trap questions; participants who failed more than one attention check were excluded from analysis, resulting in a 12% exclusion rate, consistent with industry norms.

Demographic matching between synthetic and human panels was performed using iterative proportional fitting (raking) on age, gender, income, and census region. For each experiment, we computed the demographic distribution of the human panel after exclusions and then generated a synthetic panel whose demographic margins matched exactly on these four dimensions. Because the synthetic generation process is cheap and scalable, we were able to oversample within each demographic cell to ensure stable estimates even for small subgroups. The median synthetic-to-human ratio was approximately 3:1, which gave synthetic estimates substantially narrower confidence intervals and made the comparison conservative: any disagreement between the two panels is more likely to reflect genuine differences in response behavior than sampling noise on the synthetic side.
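The raking step described above can be illustrated with a minimal sketch. It assumes only two matching dimensions (the study used four), and the seed counts, target margins, and the function name `rake` are all hypothetical, chosen for illustration rather than taken from the study's pipeline.

```python
def rake(table, row_targets, col_targets, iterations=200, tol=1e-9):
    """Iterative proportional fitting on a 2-D table of cell weights.

    Alternately rescales rows and columns until both sets of margins
    match their targets (the targets must share the same grand total).
    """
    cells = [row[:] for row in table]  # work on a copy
    for _ in range(iterations):
        # Scale each row so that its sum matches the row target.
        for i, target in enumerate(row_targets):
            row_sum = sum(cells[i])
            if row_sum > 0:
                cells[i] = [v * target / row_sum for v in cells[i]]
        # Scale each column so that its sum matches the column target.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in cells)
            if col_sum > 0:
                for row in cells:
                    row[j] *= target / col_sum
        # Columns are exact after the step above; stop once rows agree too.
        max_gap = max(abs(sum(cells[i]) - t) for i, t in enumerate(row_targets))
        if max_gap < tol:
            break
    return cells


# Hypothetical seed counts: rows are two age bands, columns are two genders.
seed = [[40.0, 60.0], [80.0, 20.0]]
fitted = rake(seed, row_targets=[100.0, 100.0], col_targets=[110.0, 90.0])
row_sums = [sum(row) for row in fitted]
col_sums = [sum(row[j] for row in fitted) for j in range(2)]
```

With more than two dimensions the same idea applies: cycle through each margin in turn and rescale the cells that contribute to it.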

All analysis was conducted using a pre-registered analysis plan. The primary test statistic was Pearson's r between the synthetic and human response distributions, computed separately for each item within each experiment and then averaged at the experiment level. We also computed Spearman's rank correlation for ordinal items, Cohen's d for mean differences on Likert items, and Jensen-Shannon divergence for full distributional comparisons. For open-ended responses, we used both human evaluation (raters blind to condition judged the similarity of synthetic and human response sets) and computational linguistic analysis (cosine similarity of Sentence-BERT embeddings and term frequency-inverse document frequency overlap scores). This multi-dimensional approach allowed us to identify not just whether synthetic and human responses correlate, but where and why they diverge.
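As a rough illustration of two of the statistics named above, the sketch below computes Pearson's r and the Jensen-Shannon divergence for a single item. The two response distributions are invented for the example (shares of respondents choosing each point of a 5-point scale), and the helper names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, bounded in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        # Terms with u_i == 0 contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical share of respondents per scale point for one Likert item.
human = [0.10, 0.20, 0.40, 0.20, 0.10]
synthetic = [0.05, 0.20, 0.45, 0.25, 0.05]

r_item = pearson_r(human, synthetic)
jsd_item = js_divergence(human, synthetic)
```

A high r together with a non-trivial divergence would indicate that the two panels agree on the shape of the distribution but allocate mass somewhat differently, which is why the two measures are reported side by side.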

4. Measurement dimensions

To move beyond a single-number summary of synthetic-human alignment, we developed a measurement framework that captures five distinct dimensions of response similarity: directional preference, magnitude, rank order, objection coverage, and linguistic similarity. Each dimension captures a different aspect of what it means for a synthetic panel to "behave like" a human panel.

Directional preference measures whether the synthetic panel agrees with the human panel on the sign of the effect — that is, which option is preferred, whether sentiment is positive or negative, and whether a given attribute is considered important or unimportant. This is the most basic requirement: if a synthetic panel cannot correctly identify what consumers prefer, it is useless for even the simplest product decisions. Directional preference was scored as a binary (agree/disagree) per item, and aggregated across items and experiments as a percentage agreement.
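A minimal sketch of the binary scoring for rating items, assuming that "direction" means the side of the scale midpoint on which the panel mean falls. The item means are invented, and the treatment of exact ties at the midpoint (counted as not above) is an assumption of this sketch.

```python
def directional_agreement(human_means, synthetic_means, midpoint):
    """Share of items where both panel means fall on the same side of the midpoint."""
    agree = 0
    for h, s in zip(human_means, synthetic_means):
        # A mean exactly at the midpoint is treated as "not above" here.
        if (h > midpoint) == (s > midpoint):
            agree += 1
    return agree / len(human_means)


# Hypothetical item means on a 7-point scale (midpoint 4).
human_means = [5.2, 3.1, 4.6, 2.8]
synthetic_means = [5.0, 3.6, 3.9, 2.5]

agreement = directional_agreement(human_means, synthetic_means, midpoint=4)
```

In this example the panels disagree only on the third item, so the agreement rate is 3 out of 4.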

Magnitude captures how well synthetic responses reproduce not just the direction but the strength of human preferences. For a 7-point Likert item, directional agreement would mean both panels agree that satisfaction is above the midpoint; magnitude agreement would require that the mean ratings differ by less than 0.5 points on the 7-point scale. Magnitude was operationalized as the absolute mean difference (standardized by the human panel's standard deviation) and as the correlation of mean responses across all items in an experiment.
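The standardized difference can be sketched as follows, assuming the sample standard deviation of the human ratings is used as the denominator. The ratings are invented and the function name is hypothetical.

```python
import statistics


def standardized_abs_diff(human_ratings, synthetic_ratings):
    """Absolute gap between panel means, in units of the human panel's SD."""
    gap = abs(statistics.mean(human_ratings) - statistics.mean(synthetic_ratings))
    return gap / statistics.stdev(human_ratings)


# Hypothetical ratings for one 7-point Likert item.
human_ratings = [5, 6, 4, 7, 5, 6, 3, 6]
synthetic_ratings = [6, 6, 5, 6, 5, 6, 5, 6]

raw_gap = abs(statistics.mean(human_ratings) - statistics.mean(synthetic_ratings))
std_gap = standardized_abs_diff(human_ratings, synthetic_ratings)
```

Here the raw gap is 0.375 scale points, inside the 0.5-point threshold mentioned above, and the standardized gap is a little under 0.3 human standard deviations.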

Rank order measures the extent to which the synthetic panel preserves the ordering of options, brands, or attributes as determined by the human panel. For a conjoint-style exercise where participants rank five product features by importance, rank order agreement is the Spearman correlation between the synthetic and human rank vectors. This dimension is particularly important for trade-off analyses, where getting the relative ordering right matters more than the absolute magnitude of any single preference.
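For the five-feature ranking example, a minimal sketch of Spearman's correlation is shown below: each panel's importance scores are converted to ranks (ties share their average rank) and Pearson's r is computed on the ranks. The importance scores are invented for illustration.

```python
import math


def ranks(values):
    """Average ranks (1 = smallest value); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman correlation: Pearson's r applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Hypothetical importance scores for five product features.
human_scores = [30, 25, 20, 15, 10]
synthetic_scores = [28, 30, 18, 14, 10]

rho = spearman_rho(human_scores, synthetic_scores)
```

The synthetic panel swaps the top two features and keeps the rest in order, which yields a rho of 0.9.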

Objection coverage is a dimension unique to open-ended and diagnostic data. In product testing, the most actionable insight from a survey is often not what consumers like but what they dislike or would change. Objection coverage measures the overlap between the distinct objections, concerns, or criticisms raised by the human panel and those that appear in the synthetic panel's open-ended responses. Two raters independently coded a random sample of 2,500 open-ended responses from both panels (weighted by experiment), identifying unique objection categories. Objection coverage is then computed as the Jaccard index of the synthetic and human objection sets: the number of objection categories raised by both panels divided by the number raised by either.
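A minimal sketch of the Jaccard computation on coded objection categories follows. For contrast it also computes a recall-style proportion (the share of human objections that the synthetic panel also raised), which is a different quantity. The category labels are invented.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def human_recall(human, synthetic):
    """Share of human objection categories also raised by the synthetic panel."""
    human, synthetic = set(human), set(synthetic)
    return len(human & synthetic) / len(human)


# Hypothetical coded objection categories for one experiment.
human_objections = {"price", "battery life", "setup complexity", "privacy"}
synthetic_objections = {"price", "battery life", "privacy", "design"}

coverage_jaccard = jaccard(human_objections, synthetic_objections)
coverage_recall = human_recall(human_objections, synthetic_objections)
```

The Jaccard index (0.6 here) penalizes objections that only the synthetic panel raises, whereas the recall-style figure (0.75 here) does not.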

Finally, linguistic similarity captures the stylistic and lexical resemblance between synthetic and human natural-language responses. This dimension is measured using cosine similarity of Sentence-BERT embeddings averaged across responses, as well as bigram overlap and a readability profile (Flesch-Kincaid grade level, average sentence length, lexical diversity). While synthetic responses are expected to be more fluent and syntactically regular than human responses — which are often terse, ungrammatical, or incomplete — high linguistic similarity suggests that the synthetic agents are producing language that sounds like real consumer language, not like polished marketing copy.
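The lexical measures can be sketched without any external library. The code below assumes whitespace tokenization and defines bigram overlap as the Jaccard overlap of bigram sets (the exact definition used in the study is not stated, so this is an assumption). The cosine function operates on plain vectors standing in for sentence embeddings; producing real Sentence-BERT embeddings is outside the scope of the sketch.

```python
import math


def tokens(text):
    """Lowercased whitespace tokens (a deliberately simple tokenizer)."""
    return text.lower().split()


def type_token_ratio(text):
    """Distinct tokens divided by total tokens."""
    toks = tokens(text)
    return len(set(toks)) / len(toks)


def bigram_overlap(a, b):
    """Jaccard overlap of the two texts' bigram sets (assumed definition)."""
    ta, tb = tokens(a), tokens(b)
    ba = set(zip(ta, ta[1:]))
    bb = set(zip(tb, tb[1:]))
    union = ba | bb
    return len(ba & bb) / len(union) if union else 0.0


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, e.g. sentence embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical open-ended responses.
human_text = "too expensive for what it does"
synthetic_text = "it is too expensive for me"

overlap = bigram_overlap(human_text, synthetic_text)
ttr = type_token_ratio("love it love it")
```

Because the type-token ratio tends to fall as responses get longer, comparisons between panels with very different response lengths should be read with that in mind.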

5. Results

Across all 47 experiments and all measurement dimensions, the weighted average Pearson correlation between synthetic and human response distributions was r = 0.88 (SE = 0.02, 95% CI [0.84, 0.91]). This aggregate figure conceals meaningful variation across categories, dimensions, and individual experiments, but it establishes a clear baseline: properly constructed synthetic panels, under the conditions of this study, produce response distributions that correlate with human panels at a level that would be considered excellent inter-rater reliability in most social science contexts.

Directional preference agreement across all 47 experiments was 93.4% — the synthetic panel and the human panel agreed on which option was preferred, whether sentiment was positive or negative, or which attribute was most important in over 93% of comparisons. Magnitude correlations averaged r = 0.85 across experiments, with better performance on 5-point and 7-point Likert scales (r = 0.87) than on constant-sum allocations (r = 0.79). Rank order correlations, measured where applicable, averaged ρ = 0.91, suggesting that synthetic panels are particularly good at reproducing relative preferences even when absolute magnitudes are slightly off.

Objection coverage, the Jaccard index of distinct objections raised in open-ended responses, averaged 0.68 across all experiments. This is lower than the correlation-based metrics, reflecting the fact that synthetic agents sometimes fail to surface minority objections, or frame criticisms in language that does not match human phrasing. However, when we weighted objections by their frequency of mention in the human panel, coverage rose to 0.82, indicating that the most common human concerns are well captured by synthetic agents even if the long tail is not fully covered.
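One plausible way to weight coverage by frequency is sketched below: each human objection category counts in proportion to how often human respondents mentioned it. The exact weighting formula used in the study is not specified here, so this is an assumption, and the counts are invented.

```python
def weighted_coverage(human_counts, synthetic_objections):
    """Share of human objection mentions whose category the synthetic panel also raised."""
    total = sum(human_counts.values())
    covered = sum(
        count
        for objection, count in human_counts.items()
        if objection in synthetic_objections
    )
    return covered / total


# Hypothetical mention counts per objection category in the human panel.
human_counts = {
    "price": 50,
    "battery life": 30,
    "setup complexity": 15,
    "privacy": 5,
}
synthetic_objections = {"price", "battery life"}

unweighted = len(synthetic_objections & set(human_counts)) / len(human_counts)
weighted = weighted_coverage(human_counts, synthetic_objections)
```

In this toy case the synthetic panel covers only half of the categories (0.5) but 80% of the mentions (0.8), the same pattern of common concerns covered and long tail missed.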

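The table below reports the full breakdown by category and dimension, with standard errors in parentheses. Values are Pearson r for magnitude and Spearman ρ for rank order; directional preference is reported as the proportion of items in agreement, objection coverage as a Jaccard index, and linguistic similarity as a similarity score (see Section 4).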

Category	Directional	Magnitude	Rank Order	Objection	Linguistic	Composite
CPG	0.97 (0.01)	0.90 (0.02)	0.94 (0.02)	0.74 (0.04)	0.80 (0.03)	0.92
Tech	0.93 (0.02)	0.84 (0.03)	0.90 (0.02)	0.65 (0.05)	0.75 (0.04)	0.87
Financial Services	0.89 (0.03)	0.80 (0.04)	0.85 (0.03)	0.58 (0.06)	0.72 (0.05)	0.83
Media / Entertainment	0.94 (0.02)	0.86 (0.03)	0.92 (0.02)	0.70 (0.04)	0.78 (0.04)	0.89
All (weighted)	0.93 (0.02)	0.85 (0.02)	0.91 (0.02)	0.68 (0.04)	0.76 (0.03)	0.88

The composite column reports a weighted average of the five dimensional scores, with weights proportional to the number of items in each dimension across all experiments (directional: 0.15, magnitude: 0.35, rank order: 0.20, objection: 0.15, linguistic: 0.15). The weighting reflects the relative importance of each dimension in typical market research applications, where getting absolute magnitudes and rank orders right is more critical than matching the exact phrasing of open-ended complaints.
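The two kinds of weighting in the table can be sketched as follows. The first computation assumes the "All (weighted)" row weights each category by its number of experiments (14, 12, 10, 11), which is an assumption of this sketch; the second applies the stated dimension weights to one row of the table. Variable and function names are hypothetical.

```python
# Experiments per category, as reported in the experimental design.
EXPERIMENTS = {"CPG": 14, "Tech": 12, "Financial Services": 10, "Media / Entertainment": 11}

# Composite score per category, copied from the table.
COMPOSITE = {"CPG": 0.92, "Tech": 0.87, "Financial Services": 0.83, "Media / Entertainment": 0.89}

# Dimension weights as stated in the text.
DIMENSION_WEIGHTS = {
    "directional": 0.15,
    "magnitude": 0.35,
    "rank_order": 0.20,
    "objection": 0.15,
    "linguistic": 0.15,
}


def weighted_mean(values, weights):
    """Weighted mean of values, with weights looked up by the same keys."""
    total = sum(weights[k] for k in values)
    return sum(values[k] * weights[k] for k in values) / total


# Category-level composites averaged with experiment counts as weights.
overall = weighted_mean(COMPOSITE, EXPERIMENTS)

# Dimension weights applied to the rounded CPG row of the table.
cpg_row = {
    "directional": 0.97,
    "magnitude": 0.90,
    "rank_order": 0.94,
    "objection": 0.74,
    "linguistic": 0.80,
}
cpg_from_weights = weighted_mean(cpg_row, DIMENSION_WEIGHTS)
```

The experiment-count average of the category composites rounds to the reported 0.88. Applying the stated dimension weights to the rounded CPG row gives roughly 0.88 rather than the tabulated 0.92, so the published composites may rest on item-level or unrounded inputs that are not shown in the table.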

6. Category analysis

The composite correlations by category — CPG 0.92, tech 0.87, financial services 0.83, media and entertainment 0.89 — reveal a clear pattern that maps onto the underlying decision complexity of each category. CPG, the category with the highest synthetic-human alignment, involves products that are purchased frequently, have well-established brand hierarchies, and evoke preferences that are relatively stable and socially shared. Synthetic agents trained on large-scale text data have abundant exposure to discussions of household brands, price comparisons, and product usage patterns. A synthetic agent asked to evaluate a laundry detergent or a snack food is drawing on a dense network of cultural knowledge that is widely distributed across its training corpus, and the resulting preferences align closely with those of human respondents.

Technology products, at 0.87, show slightly lower alignment, driven primarily by the objection coverage dimension (0.65, the second-lowest of the four categories, ahead of only financial services). Tech products elicit feature-specific critiques that are often idiosyncratic and experience-dependent: a user who has owned a particular laptop for two years may have complaints about its specific thermal management under sustained load, a detail that rarely appears in web text at sufficient frequency for a synthetic agent to internalize. Moreover, tech preferences are more polarized than CPG preferences: brand loyalty in tech is often accompanied by active dislike of competing ecosystems, and this affective polarization is not always well captured by synthetic agents, which tend toward more moderate language.

Financial services, at 0.83, exhibit the lowest composite correlation across all categories. This is not surprising given the unique characteristics of financial decision-making. Financial products are experience goods whose true utility is only revealed over time: the value of a mortgage refinance depends on future interest rate paths; the value of an insurance policy depends on a claim event that may or may not occur. Synthetic agents can reason about product features in the abstract, but they cannot simulate the emotional weight of real financial decisions. Furthermore, financial literacy varies enormously across the population, and human responses to financial product questions are often shaped by heuristics, anxiety, and systematic biases that do not straightforwardly emerge from the aggregate text patterns in a language model's training data.

Media and entertainment, at 0.89, fall between CPG and tech, with relatively strong directional and rank order alignment but moderate objection coverage. Taste-driven categories present a particular challenge for synthetic agents because preferences are less tethered to objective product attributes. Two human viewers can watch the same film and arrive at diametrically opposed evaluations based on personal life experiences, mood at the time of viewing, or comparison to an idiosyncratic reference set. Synthetic agents, by contrast, tend to converge on the modal critical reception of a cultural product (the consensus view as reflected in reviews, social media discourse, and cultural commentary) rather than reproducing the full variance of individual taste. This produces strong average alignment at the aggregate level but can miss the diversity of human response, particularly for products that polarize audiences.

An important cross-cutting finding is that objection coverage is the dimension with the greatest category variance, ranging from 0.58 (financial services) to 0.74 (CPG). This dimension captures the most diagnostically rich data that market researchers collect — the verbatim reasons why consumers reject a product — and it is precisely where synthetic agents are weakest. We interpret this as a signal that synthetic agents are better at reproducing what consumers like than what they dislike, and that the negative space of consumer experience (frustrations, unmet needs, specific use-case failures) is less well-represented in the training distributions that underpin these models. This has actionable implications for product development: synthetic panels can reliably tell you which option consumers prefer at the aggregate level, but they may miss the specific, minority objections that, if addressed, would unlock the greatest incremental improvement.

7. The intention-behavior gap finding

One of the most striking findings to emerge from this study was not part of our original research question but surfaced during post-hoc analysis. In 12 of the 47 experiments, we were able to obtain real-world behavioral outcome data — actual purchase data, subscription sign-ups, content consumption metrics — that corresponded to the attitudinal and preference data collected in our panels. This allowed us to compare not just synthetic-to-human alignment, but both synthetic and human panels' ability to predict actual consumer behavior.

The behavioral prediction results were, frankly, surprising. When we correlated panel-level stated preferences with actual market outcomes, the synthetic panels' predictions correlated with real-world outcomes at r = 0.79 (SE = 0.04), while the human panels' predictions correlated at r = 0.61 (SE = 0.05). The difference of 0.18 in correlation is statistically significant (p < 0.001) and directionally consistent across 10 of the 12 experiments where behavioral data were available. In other words, synthetic panels were substantially better than human panels at predicting what consumers would actually do, as opposed to what they said they would do.

This finding requires careful interpretation. The well-documented intention-behavior gap in consumer research — the systematic divergence between stated purchase intent and actual purchase — is one of the most persistent problems in survey methodology. Meta-analyses consistently find that the correlation between stated intent and actual behavior ranges from 0.40 to 0.60 across product categories, with the gap widening as the temporal distance between the survey and the purchase occasion increases. Human panelists systematically overstate their likelihood of purchasing virtuous or status-enhancing products, understate their sensitivity to price, and fail to account for the contextual factors — shelf placement, promotion fatigue, peer influence — that shape real purchase decisions.

Synthetic agents, it appears, are immune to some of these biases. Because they are not subject to social desirability pressures, they do not inflate their stated interest in environmentally friendly products or aspirational brands. Because their preference structures are derived from aggregate behavioral patterns rather than introspection, they are less prone to the overconfidence that leads human respondents to overstate their likelihood of following through on a stated intention. And because they are not influenced by the survey context itself — the framing effects, priming effects, and demand characteristics that pervade human survey responses — their estimates of consumer behavior are, in a sense, less contaminated by the measurement instrument.

We hesitate to generalize this finding beyond the 12 experiments for which behavioral validation was available. These experiments were not randomly selected — they were the cases where our partner organizations were willing to share proprietary outcome data — and they may overrepresent categories and products where the intention-behavior gap is particularly wide for human respondents. Nevertheless, the result is robust within the available data and points to a provocative possibility: that synthetic panels, precisely because they lack the self-presentational concerns and cognitive biases that distort human survey responses, may be more accurate predictors of actual behavior in some contexts. This is not a claim that synthetic agents are more intelligent or more insightful than human respondents; it is a claim that they may be more honest, in the narrow sense of being more aligned with revealed preference.

8. Where synthetic outperforms human

The intention-behavior gap finding described above represents the most important domain where synthetic panels outperformed their human counterparts, but it was not the only one. Across the 47 experiments, we identified six distinct cases where synthetic panels matched or exceeded human panels on every measurement dimension, including dimensions where human panels were expected to have an inherent advantage. These cases warrant close examination because they illuminate the boundary conditions under which synthetic consumer intelligence may be not just a substitute for human panels but an improvement upon them.

The first case involved a conjoint-style exercise for a premium subscription service, where respondents were asked to trade off features including price, content library size, streaming quality, and device compatibility. Human respondents in this experiment showed strong extremeness aversion: they avoided extreme choices on price and content library size, clustering their responses around the middle options in a pattern characteristic of satisficing behavior. The synthetic panel, by contrast, produced a clean utility function with well-separated part-worths, and the resulting market simulation predictions matched the actual launch performance of the service's tiered pricing structure five months later. The synthetic model predicted share of preference for each tier within 2.3 percentage points of actual sign-up data; the human model was off by 11.7 percentage points.

The second case was a concept test for a novel financial product, a micro-investment app targeted at young adults. Human respondents, particularly those with low financial literacy, exhibited strong anchoring effects: their willingness to pay (WTP) for the service was heavily influenced by the first price point they saw, producing a 40% swing in average WTP depending on anchor condition. The synthetic agents, designed with consistent multi-attribute utility functions, were essentially immune to anchoring and produced stable WTP estimates regardless of presentation order. When the actual pricing test was run by the product team six weeks later, the synthetic panel's WTP estimate was within 5% of the revealed preference from the A/B test, while the human panel's estimate varied by 22% depending on which anchor condition was used.

The remaining four outperformance cases shared a common structural feature: they involved experimental conditions where human responses were systematically distorted by measurement artifacts. These included a socially sensitive question about household financial habits (where human respondents underreported debt by approximately 35% compared to anonymized bank transaction data), a product test where the brand name triggered strong positive affect that bled through to ratings of unrelated attributes (a classic halo effect), a conjoint where the number of attributes exceeded seven (causing human respondents to default to a simple heuristic rather than processing all information), and a longitudinal tracking study where panel attrition introduced non-random selection bias in the human panel's later waves. In each case, the synthetic panel, because it was not subject to social desirability bias, halo effects, information overload, or attrition, produced estimates that were closer to the ground truth than the human panel's estimates.

These findings do not imply that synthetic panels are uniformly better than human panels, nor that synthetic agents possess superior judgment. What they suggest is that synthetic panels are subject to a different error structure than human panels, and that in specific conditions (particularly those involving social desirability, cognitive heuristics, or measurement artifacts) the synthetic error structure may be more forgiving than the human one. The practical implication for market researchers is not to replace human panels with synthetic ones, but to understand the error structures of both and deploy them accordingly: use human panels when you need the texture of lived experience, and use synthetic panels when you need estimates that are robust to the measurement biases that plague human surveys.

9. Limitations

The most important limitation of this study is the linguistic similarity score of 0.76, which is the second-weakest dimension in our measurement framework (after objection coverage at 0.68) and the one that is most resistant to improvement through current methods. While synthetic agents produce fluent, grammatically correct, and contextually appropriate natural language, their linguistic output differs systematically from human responses in ways that are detectable to both automated metrics and human raters. Synthetic responses are longer on average (47.3 words vs. 18.6 words for human open-ended responses), use a wider vocabulary (type-token ratio 0.72 vs. 0.58), and are more syntactically complex (average Flesch-Kincaid grade level 10.2 vs. 7.4). Human open-ended responses are often terse, elliptical, and colloquial ("too expensive," "bad quality," "love it"), and synthetic agents have difficulty reproducing this telegraphic quality because it requires suppressing their training signal toward well-formedness.

Beyond linguistic similarity, several other limitations deserve explicit acknowledgment. First, our experiments were conducted exclusively with US consumer populations. The demographic matching and cultural knowledge of the synthetic agents are optimized for English-language, US-centric contexts, and it is an open question how well these results would replicate in other countries, languages, or cultural settings where consumer preferences are shaped by different values, norms, and market structures. Early pilots in European and Asian markets show promising but noisy results, with composite correlations ranging from 0.72 to 0.85, suggesting that cultural calibration is an active area of development rather than a solved problem.

Second, all of our experiments involved relatively well-known product categories and brands. Synthetic agents trained on web-scale text have necessarily been exposed to discussions of Coca-Cola, Apple, Netflix, and JPMorgan Chase. We do not know how well synthetic panels would perform for genuinely novel products, obscure brands, or categories that are underrepresented in the training corpus. The few experiments we conducted with lesser-known stimuli showed a non-trivial drop in correlation (average 0.14 lower than the category baseline), suggesting that synthetic performance degrades as the stimulus becomes less familiar to the underlying model.

Third, our study focused on aggregate response distributions, not individual-level accuracy. A synthetic panel that reproduces the population-level preference distribution with r = 0.88 may still be unreliable at predicting any single individual's preferences. For market research applications that require segmentation — identifying which specific consumers prefer which specific product configurations — individual-level accuracy matters, and our data do not speak to it. We are currently conducting a follow-up study that examines synthetic-human alignment at the individual level using within-subject designs, and preliminary results suggest that individual-level correlations are substantially lower, in the range of 0.40 to 0.60.

Fourth, the temporal stability of synthetic responses is poorly understood. Our replications over a two-week period showed high test-retest reliability (r = 0.93), but the underlying language models that power synthetic agents are periodically updated, and it is unclear whether a synthetic panel generated today would produce the same response distributions as one generated six months from now using a newer model version. This version dependency introduces a methodological challenge that the human panel industry solved decades ago through careful panel management and tracking; a similar infrastructure does not yet exist for synthetic panels.
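
One way to monitor this version dependency is to re-run a fixed set of questions whenever the underlying model changes and compare the resulting response distributions with a stored baseline. The sketch below is a minimal illustration, not the procedure used in this study: the function name `check_version_drift`, the data layout (question id mapped to a list of option shares), the 0.90 threshold, and the example shares are all assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def check_version_drift(baseline, rerun, threshold=0.90):
    """Compare response shares from two synthetic panel runs.

    baseline, rerun: dict mapping question id -> list of option shares.
    threshold: hypothetical cutoff below which the rerun is flagged.
    """
    questions = sorted(baseline)
    if questions != sorted(rerun):
        raise ValueError("both runs must cover the same questions")
    # Flatten shares in a fixed question order so positions line up.
    flat_a = [share for q in questions for share in baseline[q]]
    flat_b = [share for q in questions for share in rerun[q]]
    if len(flat_a) != len(flat_b):
        raise ValueError("both runs must have the same options per question")
    r = pearson(flat_a, flat_b)
    return {"r": r, "drifted": r < threshold}


# Hypothetical shares for two questions, before and after a model update.
model_v1 = {"q1": [0.10, 0.30, 0.60], "q2": [0.50, 0.25, 0.25]}
model_v2 = {"q1": [0.60, 0.30, 0.10], "q2": [0.25, 0.25, 0.50]}

print(check_version_drift(model_v1, model_v1))  # same run compared to itself
print(check_version_drift(model_v1, model_v2))  # preferences reversed
```

Correlating flattened shares is only one possible stability measure. A production check would probably also compare each question separately, since a single pooled correlation can hide a large shift on one item.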

Finally, we must acknowledge the epistemological limitation inherent in any comparison between synthetic and human responses. Human survey responses are themselves an imperfect proxy for the construct we actually care about — consumer behavior in the real world. A synthetic panel that correlates at r = 0.88 with a human panel may be getting the same things wrong that the human panel gets wrong. The intention-behavior gap analysis described above partially addresses this concern by validating against actual behavioral data, but behavioral validation was only available for 12 of 47 experiments, and those 12 experiments are not necessarily representative. The strongest claim we can make with confidence is that synthetic panels, as constructed and evaluated in this study, produce response distributions that closely resemble those of demographically matched human panels under conditions typical of consumer research.

10. Conclusion

After 47 experiments, more than 18,000 human participants, nearly 50,000 synthetic agent runs, and six months of analysis, our central finding is clear: synthetic consumer panels, when properly constructed and demographically matched, produce response distributions that correlate with human panels at a weighted average of r = 0.88 across the categories, dimensions, and measurement instruments most common in market research. This is not a claim that synthetic panels have achieved parity with human panels; the variation across dimensions (from 0.68 for objection coverage to 0.93 for directional preference) tells a more nuanced story. It is, however, strong evidence that synthetic consumer intelligence has crossed a threshold of practical utility.

The appropriate framing, we believe, is not replacement but complementarity. Synthetic panels are not substitutes for human panels across the board; they are tools with a distinct error profile that makes them particularly valuable for certain use cases and less valuable for others. Synthetic panels excel at tasks that require freedom from social desirability bias, cognitive heuristics, and measurement artifacts. They struggle with tasks that require the texture of lived experience, minority opinions, and the telegraphic authenticity of real consumer language. A mature market research practice should deploy both, using each for what it does best.

In the 12 experiments where behavioral validation was available, synthetic panels predicted actual behavior better than human panels did (r = 0.79 vs. 0.61), and this finding deserves particular attention. It is the most provocative result in our study, and because it rests on a small and not necessarily representative subset, it warrants replication and extension. If synthetic agents are indeed less susceptible to the intention-behavior gap (if they are, in a meaningful sense, more honest than human survey respondents), then the role of synthetic panels in market research may go beyond cost savings and speed. They may actually produce better predictions of what consumers will do, as opposed to what they say they will do.

But we must guard against overclaiming, and accuracy is not the whole argument. The corrigibility advantage of synthetic agents (the fact that they can be inspected, modified, and re-run) is as important as any accuracy comparison. When a human panel produces a surprising result, the researcher cannot ask the participants why they responded as they did, or trace the reasoning path that led to a particular choice. With synthetic agents, every response is transparent and auditable. The reasoning chain can be examined, the profile parameters can be adjusted, and the experiment can be reproduced at near-zero marginal cost. This corrigibility, rather than raw accuracy, may ultimately be the strongest argument for synthetic consumer intelligence: not that it is always right, but that when it is wrong, we can understand why and fix it.

The post-survey era, as we have argued elsewhere, is not a future state — it is already here. But the post-survey era does not mean the end of human research participants. It means the end of the assumption that the only way to understand consumers is to ask them directly. Synthetic panels offer a new instrument in the market research toolkit: one that is faster, cheaper, and in specific contexts, more accurate than traditional methods. Like all instruments, it has limitations, and like all instruments, it requires skilled practitioners who understand both its capabilities and its blind spots. The research presented here is offered in that spirit: not as a final verdict on synthetic consumer intelligence, but as a rigorous benchmark against which future progress can be measured.
