Synthetic data utilizes virtual consumers for user research, leading to cost savings and increased efficiency. However, concerns arise regarding its potential inability to accurately reflect human nature and social contexts.
The surge in demand for synthetic data, particularly with the advent of ChatGPT, heightens the possibility of a widening gap between reality and data, blurring the lines between data and truth.
Therefore, synthetic data should be grounded in high-quality real-world data and developed in collaboration with social science and humanities experts to address ethical considerations.
Test your ideas and products with confidence using AI-synthesized consumers.
Synthetic Users: user research services without real users
Launched in February, the Synthetic Users service, as its name suggests, provides synthetic virtual consumers, rather than real humans, as the target audience for product-development user research. It allows interviews and surveys to be run with virtual individuals, who provide feedback on product usage experiences. It lets customers define specific target-customer scenarios (for example, a long-term European couple) and promises significant cost savings, with 100 interview datasets priced at $380. The service has drawn a wide range of reactions within communities of anthropologists, sociologists, and other social science professionals, from discomfort and a sense of crisis to outright amusement.
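The mechanics behind such a service can be sketched, speculatively, as prompting a large language model with a persona description and a set of interview questions. Synthetic Users has not published its implementation, so everything below (the `Persona` fields, the prompt wording) is a hypothetical illustration, not the actual product:

```python
from dataclasses import dataclass


@dataclass
class Persona:
    """A hypothetical target-customer profile for a simulated interview."""
    name: str
    age: int
    location: str
    context: str  # free-text background, e.g. habits, relationships, constraints


def build_interview_prompt(persona: Persona, product: str, questions: list[str]) -> str:
    """Assemble a role-playing prompt that an LLM would answer in character.

    The model's reply would then be treated as the 'interview transcript'.
    """
    lines = [
        f"You are {persona.name}, a {persona.age}-year-old living in {persona.location}.",
        f"Background: {persona.context}",
        f"Answer the following interview questions about {product} in the first person,",
        "staying in character and drawing only on this persona's perspective.",
    ]
    lines += [f"{i}. {q}" for i, q in enumerate(questions, 1)]
    return "\n".join(lines)
```

A caller would feed the returned prompt to a chat model and collect the completion; the entire "insight" therefore derives from the model's statistical priors about such a persona, which is precisely the concern the researchers quoted below raise.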
These reactions include anxiety over whether the fundamentals of human identity, purpose, pleasure, and values, the very core of qualitative research (which seeks to understand individuals themselves, not merely to 'synthesize' creative output), can really be copied and understood so easily. There are also more cynical views that the service cannot adequately capture the intricate sociopolitical situations and interpersonal relationships that shape the complex issues people face in reality.
In fact, such synthetic data is not a novel concept. It is particularly useful when datasets are difficult to obtain. For example, it has been used in virtual car simulations by automakers to mimic driver behavior, training models in a wide array of situations. It has also been used to replicate the records of over 2.7 million COVID-19 patients, creating a dataset that is statistically identical but devoid of personally identifiable information, enabling researchers worldwide to rapidly share and study it.
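The statistical idea behind the COVID-19 example can be illustrated with a deliberately simple sketch: estimate per-column distributions from real records, then sample fresh rows that match those marginals but correspond to no real person. This is a toy model of my own construction; production systems for health data use far more sophisticated generative models and formal privacy guarantees, and sampling each column independently, as here, destroys the correlations between columns:

```python
import random
import statistics


def synthesize(records: list[dict], n: int, seed: int = 0) -> list[dict]:
    """Generate n synthetic rows matching each column's marginal distribution.

    Numeric columns are modeled as Gaussians (mean, population stdev);
    other columns are resampled from their observed values.
    """
    rng = random.Random(seed)
    models = {}
    for col in records[0]:
        vals = [r[col] for r in records]
        if all(isinstance(v, (int, float)) for v in vals):
            models[col] = ("num", statistics.mean(vals), statistics.pstdev(vals))
        else:
            models[col] = ("cat", vals)
    rows = []
    for _ in range(n):
        row = {}
        for col, m in models.items():
            if m[0] == "num":
                row[col] = rng.gauss(m[1], m[2])  # draw from fitted Gaussian
            else:
                row[col] = rng.choice(m[1])  # resample observed category
        rows.append(row)
    return rows
```

Even this crude version shows the appeal: the output is "statistically similar" at the level it models, yet no synthetic row is a real patient's record. It also shows the risk the article goes on to discuss: whatever the fitted model fails to capture is silently absent from the data everyone downstream studies.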
However, the current rapid spread of ChatGPT across various services has triggered an explosive surge in demand for synthetic data, which was already on the rise. This has led to the emergence of services that claim human daily life experiences – the very source of insights – can be replaced by synthetic data.
Above all, the Synthetic Users service starkly highlights a key concern about the use of synthetic data: the 'gap between reality and data,' which calls for a redefined understanding of 'data' and 'truth.'
We are already living in an era of misinformation, and it is becoming increasingly difficult to understand the origin and biases of all the data we encounter. The upcoming flood of synthetic data will not only blur the boundaries between ‘real’ and ‘artificial’ but also make it more challenging for regular data consumers to critically evaluate the source of original data, the methods used to collect and manipulate it, and consequently, the degree of trust that should be placed in it.
Therefore, to prevent the synthetic data revolution from inadvertently creating an unintended world, it is crucial to start by focusing on ‘small data’ rather than ‘big data.’ Many companies today exhibit a tendency towards what is known as ‘data-driven decision-making,’ where decisions are made based on all available data, even if those datasets are demonstrably biased or incomplete. Thus, synthetic data should stem from the best real-world data we can find. Furthermore, this process should be accompanied by a deep contextual understanding of what is most important within that data and why, providing the highest possible quality initial dataset.
If synthetic data is not grounded in a rigorous, up-to-date understanding of fundamental human phenomena, such as the gap between what people say and what they do, or the unforeseen ways our actions reshape our lives, we risk simulating a social world that distorts reality in ways harmful to both businesses and individuals.
Synthetic data will play an increasingly significant role in our daily lives, with the potential to reshape everything from the algorithms that mediate our experience of the world to our very understanding of data and reality. Entrusting such consequential decisions solely to data scientists, however well-intentioned, is too risky; collaboration with experts in the social sciences and humanities is needed. This is not because synthetic data might be unhelpful or worse than some of our current datasets, but because of its immense potential to accomplish too much.