Hasdeep Sethi
Group AI Lead
STRAT7
Tabita Razaila
Head of Operations
STRAT7
With rising interest in AI-driven research techniques, the promise of synthetic data is drawing attention across the insights industry. But can it really replace real-world survey responses—especially in hard-to-reach demographics where traditional sampling falls short?
In our new white paper, “Putting Synthetic Data to the Test: A Real-World Evaluation,” STRAT7, in partnership with Dunnhumby, conducts one of the first independent assessments of synthetic data’s performance in market research. By comparing outputs from two leading providers against real respondent data, we reveal where synthetic data succeeds, where it stumbles, and what it means for researchers looking to boost sample coverage without compromising data integrity.
“With this study, we wanted to know whether synthetic data delivers on its promises. Could it give researchers robust, cost-effective insights – or are we trading reliability for speed and scale?” said Hasdeep Sethi, Group AI Lead at STRAT7.
Unlike traditional techniques such as weighting, synthetic data doesn’t just increase the importance of existing responses – it creates entirely new ones.
Synthetic data is artificially generated information designed to mimic human responses. It can be used to create completely synthetic participants (to take part in qual or quant) or to ‘fill in’ answers that are missing from a dataset. In the context of quantitative research, it refers to survey responses created by an algorithm to match how real respondents might answer.
These responses are generated by machine learning models trained on existing real data. Synthetic data can be produced completely artificially (e.g. from information an AI system already holds), derived from real respondent data, or through some combination of the two. In each case, the models attempt to recreate the patterns and relationships found in the original dataset.
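To make that concrete, here is a minimal, illustrative sketch in Python of the simplest end of that spectrum. The dataset, column names and two-question survey are invented for illustration only; commercial providers use far more sophisticated generative models, not least because sampling each question independently (as below) fails to preserve the cross-question relationships described above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Toy "real" respondent data: one demographic column and two survey questions.
real = pd.DataFrame({
    "age_group": rng.choice(["18-24", "25-34"], size=200, p=[0.2, 0.8]),
    "q1_brand":  rng.choice(["love", "neutral", "dislike"], size=200),
    "q2_buy":    rng.choice(["yes", "no"], size=200),
})

def synthesise(real_df, group_col, group, n, rng):
    """Create n synthetic respondents for one demographic group by
    sampling each question from that group's observed answer shares.

    NOTE: this is a deliberate simplification - real generative models
    also try to preserve relationships *between* questions.
    """
    subset = real_df[real_df[group_col] == group]
    out = {group_col: [group] * n}
    for col in subset.columns.drop(group_col):
        shares = subset[col].value_counts(normalize=True)
        out[col] = rng.choice(shares.index.to_numpy(), size=n, p=shares.to_numpy())
    return pd.DataFrame(out)

# Top up the under-represented 18-24 group with 150 synthetic rows.
synthetic = synthesise(real, "age_group", "18-24", 150, rng)
boosted = pd.concat([real, synthetic], ignore_index=True)
print(boosted["age_group"].value_counts())
```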
The market research industry faces ongoing challenges in gathering enough responses from certain demographic groups. These hard-to-reach audiences – which may include niche B2B professionals, rare patient populations, specific minority ethnic groups, or younger age cohorts (particularly 18-24s) – are both difficult and expensive to sample at scale.
These sampling difficulties have real business implications:
Hard-to-reach audiences often demand premium incentives and specialised recruitment methods.
Even with increased investment, it may still not be possible to reach the sample sizes needed for detailed analysis.
Reaching quota targets for these groups can make fieldwork periods significantly longer.
The growth in bot-based and fraudulent responses is eroding trust in traditional panel approaches – especially when it comes to niche audiences, due to the greater incentives on offer to participants.
Synthetic data could potentially offer a solution to these problems by providing computer-generated responses that mimic the behaviour of real-world respondents from these hard-to-reach demographics. By training AI models on the existing responses from other members of these groups, synthetic data providers claim they could help fill sample gaps without extending fieldwork timelines. The result would be the robust sample sizes needed for granular analysis, but in a shorter timeframe, and for a lower cost.
Despite the potential benefits, synthetic data comes with several inherent limitations that can’t be ignored:
Synthetic data is – at its heart – derivative. It is based on patterns that have been observed in the existing data, rather than fresh human perspectives.
If biases are present in the training data – such as underrepresented perspectives or skewed distributions, even within demographics – the synthetic data will inherit them, and may even amplify them.
The programmers who build AI models will bring their own perspectives, assumptions, and unconscious leanings with them. These can influence the code itself, ingraining bias into the model’s logic and decision-making processes.
Synthetic data derives from historical patterns, so it can’t adapt to real-time shifts in consumer behaviour, like cultural changes, economic crises or viral trends. It will only ever be able to provide a reflection of opinions as they were, not as they are.
While real survey responses can be connected to other data systems (e.g. database data) for a 360-degree view of customer behaviour, synthetic data cannot be linked in this way.
Simply generating more synthetic data won’t always lead to better decision-making. In fact, relying too heavily on synthetic data may lead to overfitting or spurious patterns that reinforce incorrect conclusions, or even mask or disrupt correct ones.
At STRAT7, we believe that real responses from human respondents, subject to strict quality checks, are the gold standard for market research.
Nevertheless, even with our own STRAT7 Audiences network of over 47 million survey respondents across 70 markets, getting respondent data from niche and hard-to-reach audiences (e.g. some B2B audiences, patient sample groups, minority demographics) is challenging. It’s more expensive, it increases time in field, and in some cases samples simply can’t be reached in the volumes required.
Traditionally, market researchers have addressed this challenge through a technique called “weighting.” Weighting adjusts the importance of existing responses to better represent the population we’re studying.
For example, if we only manage to survey 50 people aged 18-24 but need 200 to ensure a representative total sample, we might give each of those 50 responses four times more weight in our calculations. While weighting helps ensure our results are representative, it doesn’t actually create more data points for detailed analysis. It simply makes existing responses count for more, which can limit the depth of insights we can extract, particularly when looking at subgroups or conducting complex statistical analyses.
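As an illustration of that arithmetic, here is a minimal sketch in Python. The age categories and counts are invented for the example, but the weight for the 18-24 group comes out at the fourfold figure described above.

```python
import pandas as pd

# Achieved sample vs. the counts needed for a representative total
# (figures are illustrative only).
sample_counts = pd.Series({"18-24": 50, "25-34": 350, "35+": 600})
target_counts = pd.Series({"18-24": 200, "25-34": 300, "35+": 500})

# Weight per respondent = target share / achieved share.
weights = (target_counts / target_counts.sum()) / (sample_counts / sample_counts.sum())
print(weights)
# 18-24    4.00  <- each of the 50 responses counts four times over,
# 25-34    0.86     but the number of underlying data points available
# 35+      0.83     for subgroup analysis does not change.
```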
This is where synthetic data could potentially offer a different solution. If synthetic data proves to be an effective tool to supplement human response data and ensure complete survey volumes, this would be a huge benefit to both us and our clients.
Unlike weighting, synthetic data would actually increase the number of data points available for analysis, potentially enabling the granular insights that weighting alone cannot provide. It could speed up results and reduce fieldwork costs while still providing the large volume of data needed for detailed analysis.
With growing excitement around AI-driven research methods, we wanted to test whether synthetic data actually delivers on its promises. Rather than taking vendor claims at face value, we examined how synthetic data performed in the kinds of analyses that drive real business decisions. Our study stands out because we compared multiple synthetic data providers against the same baseline information – something few others have done.
Curious which synthetic data provider comes out on top? Enter your details above to get your free copy of “Putting Synthetic Data to the Test: A Real-World Evaluation” – and discover how leading vendors compare when tested on the same baseline data, in real-world conditions.