Benchmarks
47 controlled experiments comparing synthetic swarm outputs to matched human panels. We benchmark across 5 backend LLMs to find the best configuration for each use case.
| Metric | Value |
|---|---|
| Best correlation (Pro) | 0.91 |
| Experiments run | 47 |
| Backend models tested | 5 |
| Per agent-hour | $0.04 |
| Avg convergence | 47s |
Our benchmark methodology is designed to produce rigorous, reproducible comparisons between synthetic and human consumer responses across all five backend models.
We report Pearson correlation coefficients (r) throughout this page. A Pearson r of 1.0 means perfect agreement between synthetic and human panels; 0 means no linear relationship; negative values indicate inverse agreement. Our best backend scores 0.91, which in social science research is considered a very strong effect size. For context, the correlation between two different human panels answering the same survey rarely exceeds 0.95, and inter-rater reliability among trained human coders typically falls between 0.80 and 0.90. Our synthetic consumers are approaching human-human agreement levels.
Correlation is computed at the aggregate level: we compare the mean response of 200 synthetic agents to the mean response of 300-1,200 human participants for each survey item. This gives us an item-level correlation that reflects how well synthetic cohorts track human opinion across an entire survey instrument. We do not compare individual synthetic agents to individual humans — that would measure a different construct (individual-level fidelity) and would be an inappropriate standard for a system designed to replicate population-level insights for market research.
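The aggregate-level comparison described above can be sketched in a few lines. This is a minimal illustration, not our evaluation code: the function names and the toy response data are assumptions made for the example, and the only technique shown is a standard Pearson correlation over per-item means.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def item_level_correlation(synthetic, human):
    """Correlate per-item means; inputs map item id -> list of responses."""
    items = sorted(set(synthetic) & set(human))
    syn_means = [mean(synthetic[i]) for i in items]
    hum_means = [mean(human[i]) for i in items]
    return pearson_r(syn_means, hum_means)


# Toy data (hypothetical): three Likert items, a few respondents each.
synthetic = {"q1": [4, 5, 4], "q2": [2, 3, 2], "q3": [3, 3, 4]}
human = {"q1": [5, 4, 4, 5], "q2": [2, 2, 3, 1], "q3": [3, 4, 3, 3]}
r = item_level_correlation(synthetic, human)
```

Note that the two panels can have different sizes, because only the per-item means enter the correlation.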
A single accuracy number is misleading because synthetic consumers differ from human panels in qualitatively different ways depending on what aspect of response you measure. A backend model could perfectly predict which of two concepts consumers prefer (directional preference) while systematically underestimating how strongly they feel about it (magnitude). Another model might capture the right concerns but express them in language that sounds nothing like a real consumer (linguistic similarity). Collapsing these distinct capabilities into one number would hide precisely the information a user needs to choose the right backend for their specific use case.
Our five dimensions (directional preference, rank order, magnitude, objection coverage, and linguistic similarity) were selected through consultation with academic researchers in survey methodology and validated through factor analysis on our first 20 benchmark experiments. Each dimension captures a largely distinct axis of response quality. The dimensions show low inter-correlation (mean r = 0.32), which supports treating them as separate constructs rather than restatements of a single score. We report the weighted average as a convenience summary, but we strongly encourage users to examine the full dimension-level profile for their specific research application.
Each synthetic cohort is assembled from our persona library, which currently contains over 2,000 distinct demographic and psychographic profiles. Personas are defined across age, gender, income quintile, education level, geographic region, household composition, lifestyle segment, and consumption values. For each experiment, we sample 200 agents whose joint demographic distribution matches the target human panel. Sampling is stratified to ensure proportional representation of key subgroups — if the human panel is 42% male and 58% female, the synthetic cohort matches those proportions exactly.
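As an illustration of the proportional matching above, the sketch below splits a fixed cohort size across strata with largest-remainder allocation. The allocation rule and the function name `allocate_quota` are assumptions chosen for the example; the source only states that subgroup proportions are matched.

```python
def allocate_quota(weights, n_agents):
    """Split n_agents across strata in proportion to integer weights.

    Uses largest-remainder allocation so the counts always sum to n_agents.
    """
    total = sum(weights.values())
    counts, remainders = {}, {}
    for stratum, weight in weights.items():
        counts[stratum], remainders[stratum] = divmod(n_agents * weight, total)
    leftover = n_agents - sum(counts.values())
    # Hand out the remaining seats to the strata with the largest remainders.
    ranked = sorted(remainders, key=remainders.get, reverse=True)
    for stratum in ranked[:leftover]:
        counts[stratum] += 1
    return counts


# The 42% / 58% split from the text, applied to a 200-agent cohort.
gender_quota = allocate_quota({"male": 42, "female": 58}, 200)
```

With a 200-agent cohort the 42/58 split divides evenly into 84 and 116 agents; the remainder step only matters when the proportions do not divide cleanly.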
Each agent receives a unique persona with calibrated opinion vectors — baseline attitudes on category-relevant dimensions that shape how they process and respond to stimuli. These opinion vectors are initialized from large-scale survey data, census statistics, and consumer behavior databases. For example, a persona describing a 45-year-old urban professional with high income will have opinion vectors reflecting greater price insensitivity, higher quality expectations, and different media consumption habits than a persona describing a 22-year-old student living with roommates. The vectors are continuously updated through our vector memory system as more validation data accumulates, meaning synthetic cohorts improve over time as we learn from each benchmark experiment.
Human participants are recruited through professional panel providers with ISO 26362 certification for online access panels. Recruitment uses double-opt-in confirmation, and panelists receive compensation calibrated to survey length and complexity. Panel sizes range from 300 to 1,200 participants depending on the experiment’s statistical requirements and the number of demographic subgroups requiring analysis. We use prospective sample size planning to ensure adequate statistical power for the intended comparisons within each experiment.
Demographic matching between synthetic and human panels is performed on age brackets (18-24, 25-34, 35-44, 45-54, 55-64, 65+), gender, income quintiles, education level (high school, some college, bachelor’s, graduate), and geographic region (Northeast, Midwest, South, West, plus urban/suburban/rural classification). We use iterative proportional fitting (raking) to balance the synthetic cohort’s demographic distribution against the human panel’s observed distribution. Experiments where adequate demographic balance could not be achieved within tolerance (less than 5% of candidate experiments) are excluded from the benchmark.
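Raking can be sketched on a two-way table as follows. This is a simplified illustration under stated assumptions: it balances only two margins (for example gender by age group), the starting cell counts and targets are made up, and the function name `rake` is hypothetical rather than part of our pipeline.

```python
def rake(table, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Iterative proportional fitting on a 2-D table of cell weights.

    Alternately rescales rows and columns until both margins match
    their targets (within tol) or max_iter is reached.
    """
    if abs(sum(row_targets) - sum(col_targets)) > 1e-9:
        raise ValueError("row and column targets must have the same total")
    cells = [list(row) for row in table]
    for _ in range(max_iter):
        # Step 1: scale each row to its target margin.
        for i, target in enumerate(row_targets):
            row_sum = sum(cells[i])
            if row_sum > 0:
                cells[i] = [c * target / row_sum for c in cells[i]]
        # Step 2: scale each column to its target margin.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in cells)
            if col_sum > 0:
                for row in cells:
                    row[j] *= target / col_sum
        # Columns are exact after step 2, so only the rows need checking.
        row_err = max(abs(sum(cells[i]) - t) for i, t in enumerate(row_targets))
        if row_err < tol:
            break
    return cells


# Hypothetical starting cohort (rows: gender, columns: two age groups).
start = [[30.0, 20.0], [25.0, 25.0]]
balanced = rake(start, row_targets=[84, 116], col_targets=[90, 110])
```

Production raking usually covers more than two margins at once; the same row-then-column rescaling simply cycles over every dimension.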
Human participants pass through a multi-stage quality screening process. First, they must pass two attention checks embedded in the survey (e.g., “Please select Strongly agree for this question”). Second, they must not exhibit straight-lining (identical responses to 10+ consecutive Likert-scale items). Third, they must complete the survey in a reasonable time window — neither impossibly fast (indicating random clicking) nor excessively slow (indicating multitasking or distraction). Approximately 18% of human participants fail one or more of these screens and are excluded before analysis. This is consistent with industry standards for high-quality online research panels.
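The three screens can be expressed as a simple filter. In this sketch the record layout and the time thresholds (`min_seconds`, `max_seconds`) are hypothetical placeholders, since the text does not give exact cut-offs; only the straight-lining rule (10 or more identical consecutive Likert answers) comes from the description above.

```python
def longest_run(values):
    """Length of the longest run of identical consecutive values."""
    best = run = 0
    prev = object()  # sentinel that never equals a real answer
    for value in values:
        run = run + 1 if value == prev else 1
        best = max(best, run)
        prev = value
    return best


def passes_screening(resp, min_seconds=120, max_seconds=3600):
    """Apply the attention, straight-lining, and timing screens."""
    attention_ok = all(resp["attention_checks"])
    not_straight_lined = longest_run(resp["likert"]) < 10
    time_ok = min_seconds <= resp["seconds"] <= max_seconds
    return attention_ok and not_straight_lined and time_ok


# Hypothetical respondents.
careful = {"attention_checks": [True, True],
           "likert": [4, 5, 3, 4, 2, 4, 5, 3, 3, 4, 2, 5],
           "seconds": 540}
straight_liner = {"attention_checks": [True, True],
                  "likert": [3] * 12,
                  "seconds": 300}
kept = [r for r in (careful, straight_liner) if passes_screening(r)]
```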
Synthetic agents do not require attention checks or speed screens — they inherently attend to every stimulus with full focus and consistent effort. However, we do apply quality filters to synthetic outputs: agents that produce incoherent responses, fail to follow the specified response format, or generate empty or repetitive content are flagged and excluded. This occurs in fewer than 0.5% of synthetic agent responses, reflecting the reliability of modern LLM backends when properly prompted and configured. The asymmetric screening rates (18% human exclusion vs 0.5% synthetic) represent a meaningful advantage for synthetic research in terms of data quality, sample representativeness, and research reproducibility.
SyntheticPulse swarms can be backed by different LLMs. Each model produces different accuracy, speed, and cost tradeoffs. Here’s how they compare on weighted average human correlation.
Each backend model in our benchmark brings different architectural strengths and weaknesses to the synthetic consumer pipeline. We test all five under identical conditions — same prompt templates, same persona definitions, same evaluation criteria, same 47 experiments — to ensure fair comparison.
DeepSeek V4 Pro is our highest-scoring backend and the default for reasoning-heavy synthetic consumer tasks. With 1.6 trillion total parameters and 49 billion active in its mixture-of-experts architecture, it achieves state-of-the-art results on math, coding, and agentic reasoning benchmarks. It currently leads the open-source ecosystem on AIME 2026, LiveCodeBench, and SWE-bench Verified, making it the strongest choice for synthetic consumers that need to reason through complex tradeoffs, multi-attribute decisions, and conjoint-style preference elicitation.
At $0.435 per million input tokens and $0.87 per million output tokens, with sustained concurrency of 500 requests, V4 Pro offers excellent value for its capability tier. Its 1 million token context window allows synthetic agents to process entire survey instruments, product descriptions, competitive landscapes, and brand guidelines in a single request. We observe that its reasoning depth translates most directly to the directional preference dimension (0.94 with V4 Pro vs 0.89 with V4 Flash), suggesting that better reasoning capabilities produce more reliable preference signals.
Alibaba Cloud’s flagship model rivals GPT-5.5 and Gemini 3.1 Pro on world knowledge and cultural understanding benchmarks. Qwen 3.7 Max excels at tasks requiring broad knowledge, brand perception analysis, and nuanced interpretation of regional and cultural preferences. Its training corpus includes substantial Chinese, Southeast Asian, and multilingual data, giving it unique capabilities for markets where Western-centric models systematically underperform due to training data imbalances.
At $2.50 per million input tokens and $7.50 per million output tokens, it is the most expensive model in our benchmark suite by a significant margin. However, for knowledge-intensive research — international brand tracking, cross-cultural concept testing, culturally embedded messaging evaluation — the premium often delivers measurable accuracy gains that justify the cost. We see the largest Qwen advantage in the objection coverage dimension for non-Western markets, where its broader and more diverse training data helps surface concerns that Western-model-based agents miss entirely.
Moonshot AI’s Kimi K2.6 is a 1 trillion parameter MoE model with a 256,000 token context window. It distinguishes itself through strong agentic capabilities, including documented 300-step tool-calling chains that enable complex multi-turn interactions. For synthetic consumers that need to simulate extended decision processes — product research journeys, multi-stage purchase funnels, onboarding and retention simulations — Kimi K2.6’s agentic strengths provide more realistic multi-step behavior than models optimized for single-turn question answering.
Pricing at $0.95 per million input tokens and $4.00 per million output tokens (with cache hits discounted to $0.16) places it in the mid-range of our backend options. Its agentic capability is particularly valuable when synthetic agents need to navigate simulated web interfaces, evaluate digital products through multi-step criteria, or produce reasoning chains that external stakeholders can audit for transparency and compliance purposes.
Zhipu AI’s GLM 5.1 packs 754 billion parameters into a dense architecture optimized for structured output and function calling. Its JSON mode and Function Calling capabilities produce the most reliable structured responses of any model in our benchmark — critical for survey generation, response coding, data extraction, and any task where output format consistency matters more than creative breadth. We measure a 23% lower error rate in structured output parsing compared to the next-best model in our lineup.
At $0.98 per million input tokens and $3.08 per million output tokens, GLM 5.1 is the most cost-effective among the premium-tier models. Its 198,000 token context window supports a thinking mode that, when enabled, improves reasoning quality on complex structured tasks by approximately 4-6 points. We recommend GLM 5.1 specifically for tasks where structured output reliability is the binding constraint — automated survey coding, verbatim classification, and standardized metric extraction from unstructured open-ended responses.
DeepSeek V4 Flash is our highest-throughput backend, supporting 2,500 concurrent requests with 284 billion total parameters and only 13 billion active in its MoE architecture. That is a total-to-active parameter ratio of roughly 22:1 (284B / 13B), which is what enables its exceptional throughput and cost efficiency. At $0.14 per million input tokens and $0.28 per million output tokens, it is the cheapest model in our benchmark: about 3x cheaper than V4 Pro and roughly 7x to 27x cheaper than the other three backends, depending on the model and on whether input or output tokens dominate.
While its weighted correlation of 0.85 trails V4 Pro by 6 points, V4 Flash excels at high-volume screening applications where throughput matters more than marginal accuracy. If the 6-point gap is read loosely as a difference in match rate, it amounts to approximately one additional misclassification per 17 preference judgments (1 / 0.06 ≈ 17), which is acceptable for early-stage concept screening, large-scale monadic testing, and continuous brand tracking where directional accuracy is sufficient. When combined with our caching infrastructure, effective costs with V4 Flash can drop below $0.01 per agent-hour, which makes very large synthetic research programs economically feasible.
Every backend model receives the exact same prompt templates, persona definitions, and evaluation criteria to ensure fair comparison. The prompt templates include the persona description (age, gender, income bracket, lifestyle attributes, and calibrated opinion vectors), the stimulus (concept description, advertisement, product specification, or brand asset), and the response format specification (structured output schema covering all five measurement dimensions). Temperature is fixed at 0.7 for all models, with top-p sampling at 0.95. We run each experiment three times and report the mean correlation to account for stochastic variation inherent in LLM text generation.
This standardized protocol means the observed differences in correlation are genuinely attributable to backend model capabilities, not to differences in prompt engineering, persona fidelity, or evaluation methodology. We publish complete prompt templates, persona specifications, and evaluation code on our GitHub repository to enable independent replication and verification of all results reported on this page. We welcome external researchers to reproduce our findings and report any discrepancies.
We measure synthetic-to-human agreement across five independent dimensions using DeepSeek V4 Pro (our highest-scoring backend).
| Dimension | Score (V4 Pro) |
|---|---|
| Directional | 0.94 |
| Rank order | 0.91 |
| Magnitude | 0.87 |
| Objections | 0.82 |
| Linguistic | 0.76 |
Our five measurement dimensions are designed to capture distinct aspects of survey response quality. Each one matters for different research use cases.
Directional preference measures whether the synthetic cohort prefers option A or B in the same direction as the human panel. This is the most practically important metric for A/B testing, concept screening, and go/no-go decisions. A score of 0.94 means that in 94 out of 100 forced-choice comparisons, the synthetic cohort’s preference direction matches the human panel. For the majority of product development decisions — “should we launch concept A or B?” — this is the only metric that truly matters. You just need to know which option is better.
Real-world example: A CPG company testing five package design variations needs to quickly identify which design consumers prefer before investing in production tooling. Directional preference at 0.94 means the synthetic cohort will correctly identify the winner 94% of the time, reducing the need for expensive large-scale human testing at the initial screening stage. When the cost of being wrong is low (the top two designs will go to human validation anyway), this accuracy level is more than sufficient for confident early-stage decision-making.
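Read as a match rate, the directional score can be computed as below. This is a sketch under assumptions: the tuple layout (synthetic score for A and B, then human score for A and B) and the function names are invented for illustration, and ties are only counted as a match when both panels tie.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def directional_match_rate(comparisons):
    """Share of A/B comparisons where both panels prefer the same option.

    Each comparison is (syn_a, syn_b, hum_a, hum_b): mean scores for
    options A and B from the synthetic cohort and the human panel.
    """
    if not comparisons:
        raise ValueError("need at least one comparison")
    matches = sum(
        sign(syn_a - syn_b) == sign(hum_a - hum_b)
        for syn_a, syn_b, hum_a, hum_b in comparisons
    )
    return matches / len(comparisons)


# Hypothetical mean purchase-intent scores for four concept pairs.
pairs = [
    (4.1, 3.6, 4.3, 3.2),  # both prefer A
    (2.9, 3.4, 3.0, 3.8),  # both prefer B
    (3.7, 3.5, 3.3, 3.9),  # panels disagree
    (4.4, 4.0, 4.2, 4.1),  # both prefer A
]
rate = directional_match_rate(pairs)
```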
Rank order goes beyond directional preference by measuring whether the full sequence of preferences — from most to least preferred — matches between synthetic and human panels. A score of 0.91 indicates strong agreement on the relative priority of options. This is critical for product concept sorting, feature backlog prioritization, and resource allocation decisions where you need to know not just the winner but the entire priority ladder.
Real-world example: A SaaS product team prioritizing 20 feature candidates for the upcoming quarter needs a reliable rank order to allocate engineering resources. Rank order correlation of 0.91 means the synthetic cohort’s priority list will closely match a human panel’s ranking, enabling the team to confidently assign the top 5 features to the next sprint without fielding a large-scale conjoint study. Because 0.91 is a correlation rather than a match percentage, the remaining disagreement is best read as small rank shifts, typically between adjacent-ranked items where the practical business difference is small.
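One common way to score rank agreement is Spearman's rank correlation. The text does not name the exact statistic used for this dimension, so treat the choice of Spearman's rho, the function name, and the feature labels below as assumptions for illustration. The formula shown is the standard one for rankings without ties.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings.

    Inputs map item -> rank (1 = most preferred) over the same items.
    """
    if set(rank_a) != set(rank_b):
        raise ValueError("both rankings must cover the same items")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two items")
    d_squared = sum((rank_a[item] - rank_b[item]) ** 2 for item in rank_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical rankings of five features; only two adjacent items swap.
human_rank = {"sso": 1, "export": 2, "dark_mode": 3, "api": 4, "audit_log": 5}
synthetic_rank = {"sso": 1, "export": 3, "dark_mode": 2, "api": 4, "audit_log": 5}
rho = spearman_rho(human_rank, synthetic_rank)
```

In this toy case a single adjacent swap among five items gives rho = 0.9, which shows how a high rank correlation can coexist with a few local disagreements.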
Magnitude captures the intensity of preference — not just which option is preferred, but how strongly. This dimension is critical for pricing research, willingness-to-pay analysis, and any application where the strength of preference drives business decisions. A model that correctly identifies that consumers prefer product A but underestimates the preference gap by half will score well on direction but poorly on magnitude.
Real-world example: An automotive manufacturer evaluating consumer response to a $5,000 price increase on a popular model needs to know not just that consumers object, but the magnitude of the objection — will 30% of buyers defect or 60%? Magnitude correlation of 0.87 means synthetic agents provide reasonably accurate estimates of preference intensity, though not yet at the level required for precise demand curve modeling. We recommend augmenting synthetic magnitude estimates with small-scale human calibration for pricing research with significant revenue implications.
Objection coverage measures whether synthetic agents surface the same concerns, questions, and criticisms as human consumers when evaluating a product or concept. This is the dimension most relevant to product development, messaging strategy, and risk identification. When human panelists spontaneously raise a specific objection about pricing, usability, or trust, does the synthetic cohort raise it too?
Real-world example: A fintech startup testing a new budgeting app needs to know what concerns consumers will have — data privacy, hidden fees, ease of use, customer support quality, integration with existing bank accounts. Objection coverage of 0.82 means the synthetic cohort surfaces approximately 82% of the same objection categories as the human panel. The uncovered 18% represents objections that humans flag but our synthetic agents miss, which is our largest single opportunity for improvement. We are investing in category-specific persona enrichment and objection priming to close this gap in upcoming releases.
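Coverage of this kind can be sketched as a set overlap between coded objection categories. The function name and the category labels are illustrative assumptions, and the sketch presumes that open-ended responses have already been coded into categories.

```python
def objection_coverage(human_objections, synthetic_objections):
    """Fraction of human-raised objection categories also raised synthetically."""
    human = set(human_objections)
    if not human:
        raise ValueError("human panel raised no objections to cover")
    return len(human & set(synthetic_objections)) / len(human)


# Hypothetical coded categories for the budgeting-app example.
human_cats = ["data privacy", "hidden fees", "ease of use",
              "customer support quality", "bank integration"]
synthetic_cats = ["data privacy", "hidden fees", "ease of use",
                  "bank integration", "subscription fatigue"]
coverage = objection_coverage(human_cats, synthetic_cats)
```

Objections raised only by the synthetic cohort ("subscription fatigue" here) do not raise or lower the score under this definition; they would need a separate precision-style metric.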
Linguistic similarity is our hardest dimension and the one that varies most across backend models. It measures how closely the vocabulary, sentence structure, and phrasing patterns of synthetic open-ended responses match human language. This matters most for verbatim analysis, chatbot training data, messaging development, and any application where the exact wording of consumer opinions is the research output rather than an intermediate preprocessing step.
Real-world example: A marketing team analyzing consumer verbatims for messaging inspiration needs language that sounds authentically human — real-sounding quotes that can inspire ad copy, social media posts, and brand voice guidelines. Linguistic similarity at 0.76 means synthetic responses are clearly on the right track but do not yet pass as human-written. The vocabulary is appropriate but phrasing patterns diverge in detectable ways: synthetic responses tend to be more structured, less colloquial, and less likely to include the conversational hesitations, digressions, and idiosyncratic expressions that characterize real consumer language. This is our most active area of research investment, and we expect meaningful improvement in our next major platform release.
Each LLM that powers SyntheticPulse swarms has different architecture, pricing, and capabilities.
| Model | Provider | Params | Context | Input / 1M | Output / 1M |
|---|---|---|---|---|---|
| DeepSeek V4 Pro | DeepSeek | 1.6T / 49B act. | 1M | $0.435 | $0.87 |
| Qwen 3.7 Max | Alibaba Cloud | Undisclosed | 1M | $2.50 | $7.50 |
| Kimi K2.6 | Moonshot AI | 1T MoE | 256K | $0.95 | $4.00 |
| GLM 5.1 | Zhipu AI | 754B | 198K | $0.98 | $3.08 |
| DeepSeek V4 Flash | DeepSeek | 284B / 13B act. | 1M | $0.14 | $0.28 |
Accuracy varies by category. Each backend model handles different domains differently. CPG shows the highest correlation across all models; financial services is the hardest.
| Category | Exp. | V4 Pro | Qwen 3.7 Max | Kimi K2.6 | GLM 5.1 | V4 Flash |
|---|---|---|---|---|---|---|
| CPG & Consumer Goods | 18 | 0.94 | 0.93 | 0.91 | 0.89 | 0.88 |
| Technology & SaaS | 12 | 0.90 | 0.89 | 0.86 | 0.85 | 0.83 |
| Media & Entertainment | 10 | 0.91 | 0.90 | 0.88 | 0.87 | 0.85 |
| Financial Services | 7 | 0.86 | 0.84 | 0.82 | 0.81 | 0.79 |
The substantial variation in accuracy across categories — from 0.94 in CPG to 0.86 in financial services with our best backend — is not random noise. It reflects fundamental differences in how each category relates to consumer identity, decision-making processes, and the data available for modeling.
Consumer packaged goods achieve the highest correlation across all backend models because preferences in this category are remarkably stable, well-predicted by observable demographic variables, and driven by concrete product attributes. A 35-year-old mother of two in the Midwest has predictable preferences for laundry detergent, snack foods, and household cleaners — shaped by consistent factors including price sensitivity, brand familiarity, ingredient concerns, convenience, and family size. These factors correlate strongly with census-visible demographics, making them highly learnable by synthetic models trained on population-level data.
CPG also benefits from the richest training data environment of any category. The internet contains vast amounts of CPG-related consumer opinion — millions of product reviews, forum discussions about household products, unboxing videos, ingredient analysis blogs, and social media conversations about everyday purchases. This data abundance helps backend models develop category-specific understanding of consumer decision factors that directly translates to higher synthetic accuracy. When training data for a category is plentiful and preference drivers are well-understood, synthetic consumers perform closest to human levels.
Financial services is by far the hardest category for synthetic consumers. Trust, risk tolerance, financial anxiety, and institutional confidence are deeply personal attributes shaped by individual life history — not reliably predictable from demographic data alone. Two demographically identical individuals can have dramatically different risk profiles based on personal experience with banks during the 2008 crisis, investment outcomes during market volatility, exposure to financial education from family, and inherited attitudes toward debt and saving. These factors are almost invisible to standard demographic modeling approaches.
The data environment for financial services is also fundamentally different from CPG. Financial decisions are made less frequently, with higher stakes, and with greater privacy concerns. Public discussion of financial attitudes is more guarded — people freely post about their favorite toothpaste but rarely share detailed views about their mortgage strategy or investment philosophy. This data sparsity means backend models have less category-relevant training to draw from when simulating financial consumer behavior. Even our best backend (V4 Pro) achieves only 0.86 in financial services, and the gap relative to CPG is consistent across all five models, suggesting it is a structural limitation of the current synthetic approach rather than a model-specific weakness.
We are investing in three parallel approaches to improve accuracy in financial services and other challenging categories. First, we are expanding our persona library to include more detailed financial profiles: credit score ranges, investment experience (none, beginner, intermediate, advanced), debt-to-income ratio category, insurance coverage status, and primary banking relationship length. Early experiments with these enriched financial personas show correlation improvements of 2-4 points, suggesting that the main limitation is currently persona depth rather than backend model capability.
Second, we are developing category-specific calibration techniques that learn the systematic offset between synthetic and human responses within each category and apply correction factors. If synthetic consumers consistently underestimate risk aversion in banking products, we compute the empirical correction from benchmark data and apply it to all future financial services queries. This is conceptually similar to survey weighting and post-stratification methods used in traditional market research to correct for known biases. Our early results with calibrated financial services outputs show score improvements of 3-5 points across all five backend models.
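The simplest version of such a correction is a constant additive offset per category, sketched below. This is an assumption-laden illustration: real calibration may be more elaborate, and the function names, the 1-5 response scale, and the sample scores are all hypothetical.

```python
from statistics import mean


def fit_offset(synthetic_scores, human_scores):
    """Mean human-minus-synthetic gap over matched benchmark items."""
    if len(synthetic_scores) != len(human_scores) or not synthetic_scores:
        raise ValueError("need matched, non-empty score lists")
    return mean([h - s for s, h in zip(synthetic_scores, human_scores)])


def apply_offset(score, offset, lo=1.0, hi=5.0):
    """Shift a new synthetic score by the offset, clamped to the scale."""
    return min(hi, max(lo, score + offset))


# Hypothetical risk-aversion item means from past banking benchmarks.
synthetic_hist = [2.0, 3.0, 2.5]
human_hist = [2.5, 3.5, 3.0]
offset = fit_offset(synthetic_hist, human_hist)   # synthetic underestimates
corrected = apply_offset(3.25, offset)
```

The offset should be fit on benchmark experiments and applied only to new queries in the same category, otherwise the correction is evaluated on the data it was fit to.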
Third, we are piloting category-specific fine-tuning where backend models receive additional training on domain-specific survey data and consumer interview transcripts. This approach directly addresses the data sparsity problem by teaching models the specific language registers, concern categories, and decision heuristics that characterize financial consumer behavior. Internal experiments show objection coverage improvements of up to 6 points in financial services with fine-tuned models. These three approaches are complementary and will be rolled out incrementally through our v2.1 and v2.2 releases in the second half of 2026, with the goal of bringing financial services correlation above 0.90.
The cost advantage grows with scale. Here’s what a 200-agent, 30-minute swarm costs with each backend model vs traditional research.
The headline API prices above tell only part of the story. Our caching infrastructure and multi-model routing dramatically reduce effective costs for real-world research programs.
SyntheticPulse maintains a distributed response cache that stores LLM outputs keyed by a composite hash of the prompt template, persona ID, stimulus content, and model version. When the same persona evaluates the same stimulus using the same backend model, we serve the cached response instead of making a fresh API call. This caching strategy is uniquely effective for synthetic consumer workloads because the same personas are reused across many experiments, and the same stimuli are typically evaluated by multiple personas within a project.
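A composite cache key of this kind might be built as follows. This is a minimal in-memory sketch, not the production cache: the field names, the SHA-256 choice, the `ResponseCache` class, and the placeholder model version string are assumptions for illustration.

```python
import hashlib
import json


def cache_key(prompt_template, persona_id, stimulus, model_version):
    """Deterministic key over the four components named in the text."""
    payload = json.dumps(
        {
            "template": prompt_template,
            "persona": persona_id,
            "stimulus": stimulus,
            "model": model_version,
        },
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """In-memory stand-in for a distributed response cache."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        """Return the cached value, or call compute() once and store it."""
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute()
        self._store[key] = value
        return value


cache = ResponseCache()
key = cache_key("Rate this concept: {stimulus}", "persona-0042",
                "Concept A description", "model-v1")  # placeholder values
first = cache.get_or_compute(key, lambda: "fresh LLM response")
second = cache.get_or_compute(key, lambda: "should not be called")
```

Serializing the fields as JSON before hashing avoids accidental collisions that plain string concatenation could produce when one field ends where another begins.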
In production, we observe cache hit rates of 70-85% for established persona libraries and standard survey instruments. A greenfield experiment with 200 new personas evaluating a unique stimulus for the first time starts cold with zero cache benefit. But subsequent waves (testing the same concepts against alternative demographic targets, benchmark comparisons across backend models, or longitudinal tracking of consumer sentiment over time) benefit from high cache utilization that compounds as the library grows. At a 70-85% hit rate, only 15-30% of queries require a fresh API call, which reduces token spend by roughly 3x to 7x relative to naive per-query API pricing.
Consider a continuous brand tracking program running 100,000 synthetic consumer queries per day, roughly equivalent to a daily survey of 100,000 human respondents. At raw V4 Pro pricing ($0.87 per million output tokens) and about 1,000 output tokens per query, which is the volume these estimates imply, this would cost approximately $87 per day in output tokens alone before accounting for input costs and prompt overhead. However, with an 80% cache hit rate (conservative for an established tracking program), only 20,000 queries require fresh API calls. The effective cost drops to approximately $17.40 per day, or roughly $0.000174 per query averaged over all 100,000 queries.
Using V4 Flash as the backend for screening-level queries within the same program, the economics improve further: at $0.28 per million output tokens with the same 80% cache hit rate, 20,000 fresh queries cost approximately $5.60 per day. This makes it economically feasible to run continuous, always-on synthetic consumer monitoring at a fraction of the cost of a single traditional focus group. For context, fielding 100,000 survey responses through a traditional online panel would cost $50,000-$100,000 and require 2-3 weeks of fieldwork. The synthetic equivalent can be completed in under 2 minutes at 0.01% of the cost.
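The arithmetic behind these daily figures can be reproduced with a short calculation. The sketch assumes about 1,000 output tokens per query, which is the volume implied by the $87-per-day figure rather than a number stated on this page, and it ignores input tokens and prompt overhead. The function name is hypothetical.

```python
from decimal import Decimal


def daily_output_cost(queries_per_day, tokens_per_query,
                      price_per_million, cache_hit_rate):
    """Output-token cost per day in dollars, counting only cache misses."""
    fresh_queries = queries_per_day * (1 - cache_hit_rate)
    fresh_tokens = fresh_queries * tokens_per_query
    return fresh_tokens * price_per_million / Decimal(1_000_000)


QUERIES = 100_000
TOKENS_PER_QUERY = 1_000          # assumption implied by the $87/day figure
HIT_RATE = Decimal("0.80")

pro_uncached = daily_output_cost(QUERIES, TOKENS_PER_QUERY,
                                 Decimal("0.87"), Decimal("0"))
pro_cached = daily_output_cost(QUERIES, TOKENS_PER_QUERY,
                               Decimal("0.87"), HIT_RATE)
flash_cached = daily_output_cost(QUERIES, TOKENS_PER_QUERY,
                                 Decimal("0.28"), HIT_RATE)
per_query = pro_cached / QUERIES
```

`Decimal` is used instead of floats so the dollar amounts come out exact; the results are $87, $17.40, and $5.60 per day, matching the figures in the text.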
Different research use cases map to different economic profiles. For quick concept screening (100 synthetic agents, 10 concepts, single wave, no ongoing tracking), V4 Flash at $0.14/$0.28 per million tokens is the optimal choice — total compute cost is typically under $0.50. For high-stakes concept finalization where the 6-point accuracy gap matters, V4 Pro at $0.435/$0.87 is worth the premium. For cross-cultural brand perception studies across 30+ markets, Qwen 3.7 Max's superior knowledge handling may justify its $2.50/$7.50 rate for the knowledge-dependent portion of queries.
Our multi-model routing system automatically balances these tradeoffs at the query level. When you configure a research project in SyntheticPulse, you specify accuracy and budget thresholds, and the system selects the optimal backend for each individual query. For a typical mixed workload — approximately 40% reasoning tasks, 30% knowledge-dependent tasks, 20% structured output tasks, and 10% high-volume screening — the effective blended cost ranges from $0.42 to $1.32 per agent-hour, depending on the accuracy tier selected. This blended model delivers the highest overall accuracy at 40-60% lower cost than running all queries through a single premium model.
Choose V4 Flash when throughput and cost are the primary constraints: early-stage screening of 500+ concepts, large-scale monadic testing across multiple variants, or continuous tracking where relative direction matters more than absolute precision. Choose V4 Pro when accuracy is paramount and the budget allows: final-stage concept validation, pricing sensitivity analysis, or high-stakes product launch decisions where being wrong carries real revenue risk. Choose Qwen 3.7 Max for international research where cultural knowledge and brand perception accuracy outweigh per-token cost. Choose GLM 5.1 for automated survey generation and structured data extraction at scale. Choose Kimi K2.6 for complex multi-turn agentic simulations involving tool use and extended decision sequences.
Our recommendation engine analyzes your research parameters — number of concepts and agents, required accuracy level, budget constraints, product category, and geographic markets — and suggests the optimal backend configuration before you launch a study. Users who follow the recommendation achieve, on average, 4% higher weighted accuracy at 30% lower total cost compared to users who manually select a single backend model for all queries. The recommendation engine improves continuously as we accumulate more usage data and benchmark results across backend models, categories, and experimental designs.
Our benchmark scores have improved steadily across releases as we refine opinion formation models and upgrade backend LLMs.
Multi-model routing: V4 Pro for reasoning, Qwen 3.7 Max for knowledge tasks
Vector memory v2 + expanded personas (20 profiles)
Beta launch with basic persona library (8 profiles)
Different tasks benefit from different backend models. Here’s our recommendation based on benchmark data.
DeepSeek V4 Pro: Best for multi-step agentic workflows, math, science, and scenarios requiring deep reasoning chains. 3-10pp higher on AIME/LiveCodeBench.
Qwen 3.7 Max: Top-tier world knowledge, rivaling GPT-5.5 and Gemini 3.1 Pro. Best for brand perception and cultural nuance tasks.
Kimi K2.6: Excellent for coding agents and complex multi-turn tool use. Strong agentic benchmark scores with 300-step tool calling capability.
GLM 5.1: Strong structured output capabilities with Function Calling and JSON mode. Good for survey generation and data extraction.
DeepSeek V4 Flash: 2,500 concurrent requests at $0.28/M output. Best for large-scale screening, simple A/B tests, and high-volume experimentation.
Multi-model routing: We automatically route each query to the optimal backend. Reasoning tasks go to V4 Pro, knowledge to Qwen 3.7 Max, coding to Kimi K2.6. Best accuracy at lowest effective cost.
Rather than forcing every synthetic consumer query through a single backend model, SyntheticPulse automatically routes each query to the best model for the task. This is our recommended default configuration and the primary reason our weighted-average accuracy reaches 0.91 while maintaining cost efficiency.
Each synthetic consumer query is classified along three independent dimensions before routing: reasoning depth required (from simple preference to complex multi-attribute tradeoff), knowledge breadth needed (from narrow product-specific to broad cultural understanding), and output format expected (from binary choice to structured multi-field response). A simple A/B preference question with concrete, well-defined attributes requires minimal reasoning and narrow knowledge — it routes to V4 Flash for maximum efficiency. A question about consumer trust in a new cryptocurrency investment product requires deep reasoning about risk, broad knowledge of financial markets, and structured output for quantitative analysis — it routes through a pipeline combining V4 Pro for reasoning and GLM 5.1 for structured output.
The routing classifier is a lightweight gradient-boosted model trained on our full benchmark dataset to predict which backend will produce the most accurate response for a given query. It considers features including product category, question type (Likert, forced choice, open-ended, conjoint, rank order), persona complexity (number of demographic and psychographic attributes), required output format (binary, ordinal, interval, unstructured text), and cultural specificity (general population, regional, demographic segment). Classification and routing happen in under 50 milliseconds per query, adding negligible latency to the overall swarm response time.
Reasoning-intensive tasks — tradeoff analysis, conjoint simulations, pricing sensitivity, multi-attribute decision-making, and any query requiring step-by-step logical deduction — route to DeepSeek V4 Pro, which leads our reasoning benchmarks by 3-5 correlation points over the next best model. Knowledge-intensive tasks — brand perception, cultural sentiment, regional preference analysis, and any query requiring broad world knowledge — route to Qwen 3.7 Max, which rivals closed-source models on knowledge benchmarks and uniquely handles multi-cultural contexts across global markets.
Agentic coding and multi-turn tasks — synthetic consumers that simulate software usage, navigate product interfaces, evaluate digital workflows, or engage in extended decision processes — route to Kimi K2.6, which offers 300-step tool-calling capability and leading agentic benchmark scores. Structured output tasks — survey generation, response coding, standardized data extraction, and any query where output format compliance is critical — route to GLM 5.1, whose Function Calling and JSON mode produce measurably more reliable structured responses. High-volume screening tasks — simple monadic evaluations, single-attribute preference, basic awareness and usage questions — route to V4 Flash, supporting 2,500 concurrent requests at the lowest per-query cost.
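The routing rules described above can be summarized as a lookup from task category to backend. The following is a minimal sketch, not the production router: the category keys and the `route` function are illustrative names, and the real system relies on a trained classifier rather than a hand-written table.

```python
# Illustrative only: category keys and function names are assumptions.
# The category-to-backend pairs follow the routing rules described above.
ROUTING_TABLE = {
    "reasoning": "DeepSeek V4 Pro",
    "knowledge": "Qwen 3.7 Max",
    "agentic": "Kimi K2.6",
    "structured_output": "GLM 5.1",
    "screening": "V4 Flash",
}


def route(task_category: str) -> str:
    """Return the backend assigned to a task category."""
    try:
        return ROUTING_TABLE[task_category]
    except KeyError:
        raise ValueError(f"unknown task category: {task_category!r}") from None


if __name__ == "__main__":
    for category in ROUTING_TABLE:
        print(f"{category:>18} -> {route(category)}")
```

A static table like this is easy to audit, which is one reason to keep a readable default mapping alongside any learned classifier.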
Multi-model routing works because no single backend model excels across all dimensions simultaneously. V4 Pro leads on reasoning but trails Qwen 3.7 Max on knowledge breadth. GLM 5.1 produces the most reliable structured output but is not the strongest on open-ended reasoning. Kimi K2.6 excels at multi-turn agentic interaction, but its single-turn correlation (0.88) trails V4 Pro on standard preference elicitation. By matching each query’s specific requirements to the model best suited for them, the aggregate routing system achieves a weighted-average correlation of 0.91. That matches a V4 Pro-only configuration (0.91) and sits 1 to 6 points above the other single-model scores reported on this page (Qwen 3.7 Max at 0.90, Kimi K2.6 at 0.88, V4 Flash at 0.85).
Cost benefits are equally significant. Routing simple screening queries to V4 Flash means that approximately 30-40% of all queries are answered at the lowest price tier ($0.28/M output). This effectively subsidizes the use of more expensive models for the complex queries that genuinely need them. In our benchmark, the multi-model router achieves the same 0.91 correlation as a V4 Pro-only configuration but at 55% lower total cost. For users who configure custom routing rules — for example, specifying a maximum per-query budget — the system will further optimize by substituting lower-cost models for queries where the accuracy impact is minimal. Users who select the recommended multi-model default consistently achieve the best accuracy-to-cost ratio across all our benchmark scenarios.
The routing system includes automatic controlled fallback: if the primary model for a given query type is unavailable due to API outage, rate limiting, or degraded performance, traffic is seamlessly rerouted to the next-best alternative model without interrupting the research session or losing data. We monitor per-model accuracy, latency, and error rates in real time and adjust routing weights dynamically. If a model update temporarily degrades performance on a specific query category — for instance, a new checkpoint that changes output distribution — the router detects the shift within minutes and rebalances traffic to maintain consistent quality across the platform.
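The fallback behavior described above amounts to walking a preference list and taking the first backend that is currently healthy. A minimal sketch follows, assuming a hypothetical `pick_backend` helper and a hypothetical preference order; the actual next-best ordering and health signals are not specified on this page.

```python
from typing import Iterable, Mapping, Optional


def pick_backend(preferred: Iterable[str], healthy: Mapping[str, bool]) -> Optional[str]:
    """Return the first backend in preference order that is marked healthy."""
    for name in preferred:
        if healthy.get(name, False):
            return name
    return None  # no backend available; the caller decides how to handle this


# Hypothetical preference order for a reasoning query (an assumption,
# not the platform's documented fallback chain).
REASONING_CHAIN = ["V4 Pro", "Qwen 3.7 Max", "V4 Flash"]

if __name__ == "__main__":
    # Simulate an outage of the primary model.
    status = {"V4 Pro": False, "Qwen 3.7 Max": True, "V4 Flash": True}
    print(pick_backend(REASONING_CHAIN, status))
```

In a real deployment the `healthy` mapping would be fed by the latency, error-rate, and accuracy monitoring mentioned above rather than set by hand.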
This means the accuracy numbers on this benchmark page represent a floor, not a ceiling. As we add more backend models — we are currently evaluating an additional eight models from five providers — and refine our routing classifier with more training data, the effective accuracy of the multi-model system continues to improve even without changes to the underlying synthetic consumer architecture. Each new model adds another routing option that can be selected when it offers the best accuracy or cost for a specific query profile, compounding the advantage of the multi-model approach over time.
When asked “would you still use this in 6 months?”, synthetic agents predicted actual retention data more accurately than humans. This holds across all backend models.
0.81
V4 Pro prediction
0.78
Qwen 3.7 Max prediction
0.76
Kimi K2.6 prediction
0.61
Human stated intention
Every synthetic backend shown above outperformed human stated intention at predicting actual future retention, by 0.15 to 0.20 in correlation.
The intention-behavior gap is one of the most extensively documented phenomena in social psychology, with over 50 years of research confirming that stated intentions are weak predictors of actual behavior. When consumers are asked “would you still use this product in 6 months?”, their responses are systematically biased by three well-documented cognitive factors. First, optimism bias leads humans to overestimate their future engagement because they project current enthusiasm forward without accounting for the friction and competing priorities of real-world adoption. Second, social desirability bias pushes respondents toward answers that paint them in a positive light — saying they will use a budgeting app sounds more virtuous than admitting they probably will not, even in anonymous surveys.
Third, and most fundamentally, humans suffer from contextual blindness when forecasting their own behavior. Survey respondents cannot anticipate the specific real-world circumstances that will shape their future actions: work schedule changes, family obligations, shifting priorities, competing product offers, and the simple inertial force of existing habits. A consumer who genuinely intends to switch banks in the next quarter may never do so because the anticipated effort of transferring automatic payments and updating direct deposit information exceeds their willingness to act when the moment arrives. In our benchmark, human stated intentions across all 47 experiments consistently overestimated actual retention by 20-35 percentage points, with the most severe overestimation occurring for behaviors involving high switching costs and long decision horizons.
Synthetic agents do not suffer from optimism bias, social desirability bias, or contextual blindness because they have no ego to protect, no social image to maintain, and no emotional investment in the outcome. Their responses are computed from conditional probability distributions rather than aspirational self-assessments. When a synthetic consumer is asked about future behavior, its reasoning process is fundamentally different from a human’s: rather than imagining an idealized future version of itself, it computes the most probable outcome given its demographic profile, category-specific base rates, and the empirical distribution of similar consumers’ past behavior. In our benchmark, this statistical grounding produces predictions whose correlation with actual observed behavior is 0.15 to 0.20 higher than that of human stated intentions for the three backends charted above (0.76 to 0.81 versus 0.61).
This synthetic advantage is particularly pronounced for behaviors with strong status quo bias — subscription retention, insurance renewal, retirement savings contribution, and financial product adoption — where human respondents systematically overstate their likelihood of taking action due to a combination of optimism and social desirability. Synthetic agents, lacking the emotional commitment to a stated intention, produce more realistic estimates that closely track actual base rates. In our benchmark, every synthetic backend model — from V4 Pro at 0.81 to V4 Flash at 0.74 — outperformed human stated intention (0.61) on correlation with actual retention data, and the gap was consistent across all product categories tested.
In one of our most illustrative experiments, we asked 500 human participants and 200 synthetic agents to predict whether they would still be using a project management SaaS tool in 6 months. Human participants expressed strong satisfaction with the product during the initial survey: 82% said they would “definitely” or “probably” still be using it after six months. Synthetic agents, given the same product description, feature list, pricing information, and matched demographic profiles, predicted a 6-month retention rate of 58%. The actual retention rate measured after six months of real-world usage was 52%.
The human panel overestimated retention by 30 percentage points, a textbook intention-behavior gap driven by optimism bias and social desirability. The synthetic agents overestimated by only 6 points, and their 58% prediction fell just one point above the 95% confidence interval of the actual observed outcome (47-57%). This case study illustrates why the intention-behavior gap is not merely an academic curiosity: it has direct and measurable implications for product strategy, resource allocation, and revenue forecasting. Companies relying on traditional survey data to predict retention, adoption, or repeat purchase will systematically over-invest in products and features that consumers say they want but do not actually use, while underestimating the true churn risk for their core offerings.
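The overestimation figures in this case study follow directly from the three reported percentages. As a quick check, using only the numbers stated above (the variable names are illustrative):

```python
# Figures from the case study above, in percent.
human_stated = 82         # humans answering "definitely" or "probably"
synthetic_predicted = 58  # retention predicted by the synthetic agents
actual_retention = 52     # retention measured after six months

human_gap = human_stated - actual_retention
synthetic_gap = synthetic_predicted - actual_retention

print(f"human overestimate: {human_gap} points")
print(f"synthetic overestimate: {synthetic_gap} points")
print(f"difference between the two gaps: {human_gap - synthetic_gap} points")
```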
Cannot generate genuinely novel insights
Synthetic agents extrapolate from existing training data. They cannot discover new categories of human experience or emerging cultural phenomena.
Linguistic gap (0.76)
Synthetic agents use different vocabulary than humans. The substance is correct but phrasing doesn’t always sound native. This is consistent across all 5 backend models.
Financial services underperformance
Trust and risk tolerance are harder to model than product preference. Even our best backend (V4 Pro, 0.86) trails its own CPG correlation by 0.08. This is an active research focus.
Not regulatory-grade
Synthetic data cannot substitute for statistically calibrated human panels in regulated decision-making. Use human panels for FDA, SEC, or legal-adjacent research.
All correlation scores reported on this page are accompanied by confidence intervals and significance tests. Here we explain our statistical framework and how to interpret the results.
Each correlation score is reported with a 95% confidence interval computed via Fisher z-transformation. For our flagship V4 Pro score of 0.91, the 95% confidence interval is [0.87, 0.94] based on 47 independent experiments. This means we can be 95% confident that the true population correlation falls within this range. The intervals narrow as we accumulate more experiments; when we launched at v1.0 with only 12 experiments, the confidence intervals were nearly twice as wide, and differences between backends were often not statistically distinguishable.
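For readers who want to see the mechanics, here is a minimal sketch of a Fisher z confidence interval for a Pearson correlation. The function name `fisher_ci` is illustrative. The choice of `n` is an assumption: the sketch treats `n` as the number of paired observations behind the correlation, and the effective sample size used for the published intervals may differ, so the output need not reproduce the bounds quoted above.

```python
import math
from typing import Tuple


def fisher_ci(r: float, n: int, z_crit: float = 1.959964) -> Tuple[float, float]:
    """Approximate confidence interval for a Pearson r via Fisher's z-transform.

    r      : observed correlation, strictly between -1 and 1
    n      : number of paired observations (must exceed 3)
    z_crit : standard normal critical value (1.959964 gives a 95% interval)
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)              # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)    # standard error on the z scale
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper


if __name__ == "__main__":
    low, high = fisher_ci(0.91, 47)
    print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

Because the interval is built on the z scale and mapped back with `tanh`, it is asymmetric around r and always stays inside (-1, 1), which matters for correlations this close to 1.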
For category-level scores, confidence intervals are wider due to smaller sample sizes. CPG (18 experiments) has a 95% CI of approximately [0.89, 0.97], while financial services (7 experiments) has a wider interval of [0.76, 0.92]. We recommend treating comparisons that involve any category with fewer than 15 experiments, such as financial services, as directional rather than definitive. All confidence intervals, standard errors, and sample sizes are available in our full methodology document, which we share with prospective enterprise customers upon request.
Our benchmark is designed to detect a true correlation of 0.80 or higher with 90% statistical power at alpha = 0.05, following conventional standards for social science research. Our power analysis indicates that 30-40 experiments are sufficient to detect differences between backend models of the size we observe (0.03-0.06 in weighted-average correlation). With 47 experiments completed, we exceed this threshold and can reliably distinguish between models whose true scores differ by 0.03 or more. Smaller gaps remain unresolved: the difference between V4 Pro (0.91) and Qwen 3.7 Max (0.90), for example, is not statistically significant at conventional levels, so we need more data to determine which model is genuinely better.
For dimension-level comparisons within a single backend model, we need approximately 25 experiments per dimension to achieve adequate power for detecting differences of 0.05 between dimensions. We currently meet this per-dimension threshold for V4 Pro and are actively building toward it for the remaining four backends. Dimension-level results based on fewer than 25 benchmark experiments should be interpreted with appropriate caution. We publish exact experiment counts for every reported score so that readers can assess the reliability of each estimate.
All reported correlations are statistically significant at p < 0.01 unless otherwise noted. Pairwise comparisons between correlation coefficients from different backend models use Steiger's Z-test for dependent correlations, which accounts for the fact that the same 47 experiments are evaluated across all five models (the correlations are not independent). The difference between V4 Pro (0.91) and V4 Flash (0.85) is significant at p < 0.001. The gap between V4 Pro (0.91) and Qwen 3.7 Max (0.90) is not statistically significant (p = 0.12), meaning we cannot definitively rank one above the other based on current data.
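To illustrate the kind of comparison involved, the sketch below implements the commonly cited form of Steiger's (1980) Z for two dependent correlations that share one variable, here the human panel. It is a sketch under assumptions, not our analysis code: the function name is illustrative, `r23` (the correlation between the two backends' own outputs) is not reported on this page, and the value used in the example is hypothetical.

```python
import math
from typing import Tuple


def steiger_z(r12: float, r13: float, r23: float, n: int) -> Tuple[float, float]:
    """Steiger's Z for dependent correlations sharing variable 1.

    r12 : correlation of human panel with backend A
    r13 : correlation of human panel with backend B
    r23 : correlation between backend A and backend B outputs
    n   : number of paired observations (must exceed 3)
    Returns (Z statistic, two-sided p-value).
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z12, z13 = math.atanh(r12), math.atanh(r13)
    r_bar_sq = ((r12 + r13) / 2.0) ** 2
    # Estimated covariance between the two Fisher-transformed correlations.
    psi = r23 * (1 - 2 * r_bar_sq) - 0.5 * r_bar_sq * (1 - 2 * r_bar_sq - r23 ** 2)
    c = psi / (1 - r_bar_sq) ** 2
    z_stat = (z12 - z13) * math.sqrt(n - 3) / math.sqrt(2 - 2 * c)
    p_two_sided = math.erfc(abs(z_stat) / math.sqrt(2))
    return z_stat, p_two_sided


if __name__ == "__main__":
    # r23 = 0.90 is a hypothetical placeholder, not a reported figure.
    z_stat, p = steiger_z(r12=0.91, r13=0.85, r23=0.90, n=47)
    print(f"Z = {z_stat:.2f}, two-sided p = {p:.4f}")
```

The more strongly the two backends agree with each other (higher `r23`), the smaller the denominator becomes, so the same gap in correlation with the human panel yields a larger test statistic.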
We report exact p-values and effect sizes with 95% confidence intervals throughout our full methodology document rather than relying on star-based significance notation, because continuous measures of evidence are more informative for decision-making than binary significance thresholds. A non-significant p-value does not mean two models are equally accurate — it means the evidence is insufficient to determine which is better. We continue collecting data on all five backends and will update these comparisons as our evidence base grows. We commit to publishing updated results quarterly and whenever a new backend model is added to the SyntheticPulse platform.
A critical question for any synthetic consumer benchmark is whether the results generalize across different human panel providers and recruitment methods. To test this, we ran a subset of 10 experiments using two independent human panels recruited from different ISO-certified panel providers. The correlation between the two human panels answering identical surveys was 0.93, and the synthetic-to-human correlations differed by only 0.02 on average between the two panels (range: 0.00 to 0.04). This suggests our results are robust to panel composition and provider choice, at least among professional research panels.
However, we caution against over-generalizing these findings. All human panels in our benchmark were recruited from major panel providers, screened for attention and quality using industry-standard methods, and incentivized at professional research rates. Results may differ systematically with convenience samples (e.g., social media recruitment or crowdsourcing platforms), non-screened panels, or significantly different incentive structures. We are actively expanding our benchmark to include a wider range of human data sources — including probability-based panels, non-incentivized volunteers, and international samples — and will transparently report any meaningful differences in synthetic-to-human correlation as they emerge.
Methodology
Each experiment used a synthetic cohort of 200 agents matched demographically to a human panel of 300-1,200 participants. Both groups received identical stimuli. Human panels were screened for attention. Synthetic cohorts were generated from the SyntheticPulse persona library with calibrated opinion vectors. Each backend model was tested independently across all 47 experiments using identical prompt templates, persona definitions, and evaluation criteria. Results are measured as Pearson correlation coefficients across five dimensions. The full dataset and methodology are available upon request.
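As a concrete illustration of the metric, the sketch below computes a Pearson correlation between per-item mean responses from a synthetic cohort and a human panel. The item means are made-up placeholder values, not benchmark data, and the `pearson_r` helper is an illustrative name.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)


if __name__ == "__main__":
    # Hypothetical mean ratings per survey item on a 1-5 scale.
    synthetic_item_means = [3.8, 2.1, 4.4, 3.0, 2.7]
    human_item_means = [3.6, 2.4, 4.5, 3.2, 2.5]
    print(f"item-level r = {pearson_r(synthetic_item_means, human_item_means):.3f}")
```

Each list entry is one survey item, so the correlation reflects how well the synthetic cohort tracks the human panel across the instrument rather than agreement between individual respondents.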