
White paper

Putting synthetic data to the test: a real-world evaluation

Synthetic data makes big promises. We put them to the test and compared the results to real human responses.


Hasdeep Sethi
Group AI Lead
STRAT7


Tabita Razaila
Head of Operations
STRAT7

Introduction

With rising interest in AI-driven research techniques, the promise of synthetic data is drawing attention across the insights industry. But can it really replace real-world survey responses—especially in hard-to-reach demographics where traditional sampling falls short?

In our new white paper, “Putting Synthetic Data to the Test: A Real-World Evaluation,” STRAT7, in partnership with Dunnhumby, conducts one of the first independent assessments of synthetic data’s performance in market research. By comparing outputs from two leading providers against real respondent data, we reveal where synthetic data succeeds, where it stumbles, and what it means for researchers looking to boost sample coverage without compromising data integrity.

“With this study, we wanted to know whether synthetic data delivers on its promises. Could it give researchers robust, cost-effective insights – or are we trading reliability for speed and scale?” said Hasdeep Sethi, Group AI Lead at STRAT7.

What is synthetic data?

Unlike traditional techniques such as weighting, synthetic data doesn’t just increase the importance of existing responses – it creates brand-new ones.

Synthetic data is artificially generated information designed to mimic human responses. It can be used to create completely synthetic participants (to take part in qual or quant research) or to ‘fill in’ answers that are missing from a dataset. In the context of quantitative research, it refers to survey responses created by an algorithm to match how real respondents might answer.

These responses are generated by machine learning models trained on existing real data. Synthetic data can be generated completely artificially (e.g. using information that an AI system already holds), using real respondent data as the basis, or through some combination of the two. These models attempt to recreate the patterns and relationships found in the original dataset.
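To make this concrete, below is a deliberately naive sketch in Python of how new responses might be sampled from the patterns in a real dataset. The column names and data are hypothetical, and real providers use far more sophisticated models; this is purely illustrative.

```python
# A minimal, illustrative sketch of one naive way synthetic survey
# responses can be generated: sampling new rows from the empirical
# answer distributions observed in a real dataset. All column names
# and data here are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Hypothetical real responses from a hard-to-reach segment (18-24s).
real = pd.DataFrame({
    "brand_preference": ["A", "A", "B", "C", "A", "B"],
    "purchase_intent":  [5, 4, 3, 2, 5, 4],          # 1-5 scale
})

def synthesize(real_df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Draw n synthetic respondents, one question at a time, from each
    question's observed answer frequencies."""
    synth = {}
    for col in real_df.columns:
        freqs = real_df[col].value_counts(normalize=True)
        synth[col] = rng.choice(freqs.index, size=n, p=freqs.values)
    return pd.DataFrame(synth)

synthetic = synthesize(real, n=200)   # 6 real rows -> 200 synthetic rows
print(synthetic.head())
```

Note that sampling each question independently, as this sketch does, discards the correlations between answers – preserving those cross-question relationships is precisely what makes credible synthetic data generation hard.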

Why might we need synthetic data?

The market research industry faces ongoing challenges getting enough responses from certain demographic groups. These hard-to-reach audiences – which may include niche B2B professionals, rare patient populations, specific minority ethnic groups, or younger age cohorts (particularly 18-24s) – are both difficult and expensive to gather large samples from.

These sampling difficulties have real business implications:

Increased costs

Hard-to-reach audiences often demand premium incentives and specialised recruitment methods.

Insufficient volumes

Even with increased investment, it still may not be possible to reach the sample sizes needed for detailed analysis.

Extended timelines

Reaching quota targets for these groups can make fieldwork periods significantly longer.

Rising survey fraud

The growth in bot-based and fraudulent responses is eroding trust in traditional panel approaches – especially for niche audiences, where the higher incentives on offer attract more fraudulent participants.

Synthetic data could potentially offer a solution to these problems by providing computer-generated responses that mimic the behaviour of real-world respondents from these hard-to-reach demographics. By training AI models on the existing responses from other members of these groups, synthetic data providers claim they can help fill sample gaps without extending fieldwork timelines. The result would be the robust sample sizes needed for granular analysis, but in a shorter timeframe and at a lower cost.

What are the limitations of synthetic data?

Despite the potential benefits, synthetic data comes with several inherent limitations that can’t be ignored: 

Cannot replace real data 

Synthetic data is – at its heart – derivative. It is based on patterns that have been observed in the existing data, rather than fresh human perspectives. 

Not bias-free

If biases are present in the training data – such as underrepresented perspectives or skewed distributions, even within demographics – the synthetic data will inherit them, and may even amplify them (a toy illustration follows this list).

Model development bias

The programmers who build AI models will bring their own perspectives, assumptions, and unconscious leanings with them. These can influence the code itself, ingraining bias into the model’s logic and decision-making processes.

Cannot capture emerging trends

Synthetic data derives from historical patterns, so it can’t adapt to real-time shifts in consumer behaviour, like cultural changes, economic crises or viral trends. It will only ever be able to provide a reflection of opinions as they were, not as they are.

Cannot connect between data systems

While real survey responses can be connected to other data systems (e.g. database data) for a 360° view of customer behaviour, synthetic responses have no real individual behind them, so they cannot be linked in this way.

More data ≠ better insights

Simply generating more synthetic data won’t necessarily lead to better decision-making. In fact, over-reliance on synthetic data may lead to overfitting or false patterns that reinforce incorrect conclusions, or even mask or disrupt correct ones.
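To illustrate how bias amplification can happen, here is a toy, hypothetical example: a generative model that smooths away low-frequency answer options (crudely simulated here with a 10% cut-off) erases a minority viewpoint entirely, inflating the majority shares in the synthetic data.

```python
# Toy, hypothetical illustration of bias amplification: a model that
# drops answer options below a frequency threshold (a crude stand-in
# for model smoothing) erases a minority view from the synthetic data.
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical real data: 6% of respondents hold a minority view "C".
real_freqs = {"A": 0.64, "B": 0.30, "C": 0.06}

# Crude stand-in for smoothing: drop options under 10%, renormalise.
kept = {k: v for k, v in real_freqs.items() if v >= 0.10}
total = sum(kept.values())
model_freqs = {k: v / total for k, v in kept.items()}

synthetic = rng.choice(list(model_freqs), size=1000,
                       p=list(model_freqs.values()))
print(model_freqs)                # approx {'A': 0.68, 'B': 0.32} - "C" gone
print((synthetic == "C").mean())  # 0.0: the minority view never appears
```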


Why we wanted to create this white paper

At STRAT7, we believe that real responses from human respondents, subject to strict quality checks, are the gold standard for market research.

Nevertheless, despite our own STRAT7 Audiences network of over 47 million survey respondents across 70 markets, getting respondent data from niche and hard-to-reach audiences (e.g. some B2B audiences, patient sample groups, minority demographics) is challenging. It’s more expensive, it increases time in field and, in some cases, samples simply can’t be reached in the volumes required.

Traditionally, market researchers have addressed this challenge through a technique called “weighting.” Weighting adjusts the importance of existing responses to better represent the population we’re studying. 

For example, if we only manage to survey 50 people aged 18-24 but need 200 to ensure a representative total sample, we might give each of those 50 responses four times more weight in our calculations. While weighting helps ensure our results are representative, it doesn’t actually create more data points for detailed analysis. It simply makes existing responses count for more, which can limit the depth of insights we can extract, particularly when looking at subgroups or conducting complex statistical analyses.
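As a minimal illustration of the arithmetic above (with made-up purchase-intent scores), the sketch below up-weights 50 respondents aged 18-24 to stand in for a quota of 200:

```python
# A minimal illustration of the weighting example above, with made-up
# scores: 50 respondents aged 18-24 stand in for the 200 the quota
# required, so each receives a weight of 200 / 50 = 4.0.
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical 1-5 purchase-intent scores.
scores_18_24 = rng.integers(1, 6, size=50)    # under-sampled group
scores_other = rng.integers(1, 6, size=800)   # rest of the sample

weights = np.concatenate([
    np.full(50, 200 / 50),   # 18-24s up-weighted to hit their quota
    np.full(800, 1.0),       # everyone else counts once
])
scores = np.concatenate([scores_18_24, scores_other])

print(f"Weighted mean intent: {np.average(scores, weights=weights):.2f}")
# Weighting rebalances the average, but the 18-24 base is still only
# 50 real answers - too few for robust subgroup analysis.
```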

This is where synthetic data could potentially offer a different solution. If synthetic data proves to be an effective tool to supplement human response data and ensure complete survey volumes, this would be a huge benefit to both us and our clients. 

Unlike weighting, synthetic data would actually increase the number of data points available for analysis, potentially enabling the granular insights that weighting alone cannot provide. It could speed up results and reduce fieldwork costs while still providing the large volume of data needed for detailed analysis.

With growing excitement around AI-driven research methods, we wanted to judge whether synthetic data actually delivers on its promises. Rather than taking vendor claims at face value, we tested how synthetic data performed in the kinds of analyses that drive real business decisions. Our study is unique because we compared multiple synthetic data providers using the same baseline information – something few others have done.

Unlock the rest of the white paper

Curious which synthetic data provider comes out on top? Enter your details above to get your free copy of “Putting Synthetic Data to the Test: A Real-World Evaluation” – and discover how leading vendors compare when tested on the same baseline data, in real-world conditions.

Let us know how we can help you win at change.