The future of enterprise-grade AI is facing a massive challenge: data. An AI model is only as good as the data it is trained on. Building models that yield meaningful results requires enormous volumes of high-quality data for training and refinement. Yet this kind of data often doesn’t exist, or at least not in the quantities needed. For perspective, the entire English-language Wikipedia accounts for only 3-5% of the data used to train OpenAI’s GPT-3 model. To build a sustainable future for enterprise AI, technologists must focus on closing the data gap.
Training AI requires more than just heaps of data — it demands precision, curation, and scale that few organizations can achieve. The data needs to be relevant, unbiased, and diverse enough to generalize across different use cases. Many developers are turning to synthetic data to address the limited availability of organic data.
What Is Synthetic Data?
Synthetic data is precisely what it sounds like. It is data that has been artificially created, usually through algorithms, statistical models, or generative AI, to mimic the key characteristics and shape of organic data. Gartner estimates that by 2024, 60% of the data used in AI and analytics projects will be synthetically generated.
Think of synthetic data as a cover band. It doesn’t replace the original artist but steps in when the actual act isn’t available. Just like a cover band plays familiar songs to keep the music alive, synthetic data enriches existing datasets by mimicking real-world scenarios, especially when there’s insufficient organic data to train AI models. This isn’t just about creating “fake” data; it’s about enhancing and augmenting the original dataset to help ensure AI models perform at their best.
Why Does It Matter?
Synthetic data can improve both accuracy and fairness in AI models. Organic data is often biased or unbalanced, which leads to ML models that fail to represent diverse populations accurately. With synthetic data, researchers can create datasets that better reflect the demographics they intend to serve, reducing bias and improving overall model robustness. Fundamentally, synthetic data helps AI models perform at their best and supports compliance, efficiency, inclusivity, and fairness, helping ensure that new products and features serve everyone equitably.
How Is It Used?
As Expedia’s SVP of Data and AI, I can attest that we use synthetic data to help travelers figure out the best time to book a flight, a dark art even for frequent fliers. For any flight, costs fluctuate in seemingly unpredictable patterns that make booking at a good price seem like a game of chance. There’s little fun in scanning flight prices every day or booking a flight only to check back a few days later to see that the price has dropped. We built a price tracking and predictions feature to solve this common traveler pain point. It uses AI to analyze historical flight price data and predict future trends for specific routes. With this feature, travelers can book their trip confidently, knowing they got the best value without keeping an eye on the price.
When building this feature, we understood that we would have data gaps despite our platform’s millions of daily searches. This is because flight prices change constantly, and there just isn’t enough organic data on every possible flight variant—meaning every combination of cabin class, start and end dates, airport variations, and more—for us to confidently predict future price trends, even for popular routes like London to New York. This is where synthetic search data comes in and helps train and evaluate the AI models.
We created synthetic search data through automated processes that imitate real travelers searching for flights. The synthetic data mimics their actions, like playing around with travel dates, browsing multiple routes, airports, and cities, and toggling between multi-city itineraries. By mixing organic traveler search data with the synthetic data created, we can understand how flight prices fluctuate against these variables daily.
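To make the idea concrete, here is a minimal Python sketch of generating synthetic searches by varying real ones. It is an illustration only, not our production pipeline: the `FlightSearch` fields, the `vary_search` and `build_training_set` helpers, the nearby-airport map, and the sample values are all hypothetical.

```python
import random
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class FlightSearch:
    origin: str
    destination: str
    depart: date
    return_date: date
    cabin: str
    synthetic: bool  # lets downstream code tell organic and synthetic rows apart


def vary_search(seed_search, rng, nearby_airports, cabins, max_shift_days=3):
    """Return one synthetic variant of a real search by nudging its parameters."""
    shift = rng.randint(-max_shift_days, max_shift_days)
    stay = (seed_search.return_date - seed_search.depart).days
    stay = max(1, stay + rng.randint(-2, 2))  # keep the trip at least one day long
    depart = seed_search.depart + timedelta(days=shift)
    origins = nearby_airports.get(seed_search.origin, [seed_search.origin])
    destinations = nearby_airports.get(seed_search.destination, [seed_search.destination])
    return FlightSearch(
        origin=rng.choice(origins),
        destination=rng.choice(destinations),
        depart=depart,
        return_date=depart + timedelta(days=stay),
        cabin=rng.choice(cabins),
        synthetic=True,
    )


def build_training_set(organic, n_synthetic, rng, nearby_airports, cabins):
    """Mix organic searches with synthetic variants derived from them."""
    synthetic = [
        vary_search(rng.choice(organic), rng, nearby_airports, cabins)
        for _ in range(n_synthetic)
    ]
    return organic + synthetic


# Hypothetical example data (assumed values, for illustration only).
nearby = {"LHR": ["LHR", "LGW", "STN"], "JFK": ["JFK", "EWR", "LGA"]}
cabins = ["economy", "premium_economy", "business"]
organic = [FlightSearch("LHR", "JFK", date(2025, 6, 1), date(2025, 6, 8), "economy", False)]

training_set = build_training_set(organic, 5, random.Random(0), nearby, cabins)
```

Tagging each row with a `synthetic` flag is a deliberate choice in this sketch: it keeps the two sources separable, so a model can be evaluated on organic searches alone.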
What’s the Best Place to Start?
The first step in deciding what synthetic data to create is understanding which specific use cases it needs to support: training cutting-edge ML models, rigorously testing innovative algorithms, or ensuring the robustness of data pipelines. This clarity will guide the dimensions of your synthetic data and improve its format and quality. Start by considering what data characteristics you need, whether privacy issues exist, and which essential relationships or behaviors you want to maintain.
Next, focus on building a relational database as your foundation. From there, develop a generative ML model that analyzes and learns the patterns within that data, then generates a second set of data that mimics the original. It’s important to note that while a well-built synthetic data set preserves the statistical properties of the organic data set it represents, it does not contain any actual records from the original. This allows organizations to work with realistic and privacy-compliant data, making it a valuable resource for training and testing.
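The fit-then-sample pattern described above can be sketched in a few lines of Python. As a stand-in for a full generative ML model, this example fits a simple two-column normal distribution, and a list of tuples stands in for a database table. The function names and the sample values (days before departure, price) are assumptions made up for illustration.

```python
import math
import random
import statistics


def fit_bivariate_normal(rows):
    """Estimate means, standard deviations, and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample_synthetic(model, n, rng):
    """Draw n new rows from the fitted distribution; no organic row is copied."""
    rho = model["rho"]
    # Clamp guards against tiny floating-point overshoot when |rho| is near 1.
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


# Made-up organic rows: (days before departure, price). Illustrative values only.
organic_rows = [(60, 420.0), (45, 450.0), (30, 510.0), (21, 540.0),
                (14, 610.0), (7, 720.0), (3, 800.0)]

model = fit_bivariate_normal(organic_rows)
synthetic_rows = sample_synthetic(model, 1000, random.Random(42))
refit = fit_bivariate_normal(synthetic_rows)  # should land close to the original fit
```

A real table has many more columns, mixed data types, and non-normal distributions, so a production system would need a far richer model. The principle is the same: learn the shape of the data, then sample new rows from the learned shape rather than copying existing ones.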
What To Watch Out For?
Synthetic data can be a double-edged sword. While it addresses data privacy and availability challenges, it can inadvertently carry or magnify biases embedded in the original dataset. When source data is flawed, those imperfections can cascade into the synthetic version and skew results. This is a critical concern in high-stakes domains like healthcare and finance, where precision and fairness are paramount. To counteract this, keeping a human in the loop is essential.
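One simple way to give a human reviewer something to act on is to compare how often each category appears in the organic and synthetic data, and flag large shifts. The Python sketch below assumes a single categorical column and a hypothetical 5-point tolerance. It only detects bias that the synthetic step has magnified relative to the source; it cannot detect bias already present in the source itself.

```python
from collections import Counter


def shares(values):
    """Return each category's share of the total, as a fraction between 0 and 1."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def share_drift(organic, synthetic):
    """Synthetic share minus organic share, per category."""
    o, s = shares(organic), shares(synthetic)
    return {k: s.get(k, 0.0) - o.get(k, 0.0) for k in set(o) | set(s)}


def flag_for_review(drift, tolerance=0.05):
    """List categories whose share moved more than the tolerance, for a human to inspect."""
    return sorted(k for k, d in drift.items() if abs(d) > tolerance)


# Hypothetical example: the synthetic set over-represents one group.
organic_groups = ["group_a"] * 80 + ["group_b"] * 20
synthetic_groups = ["group_a"] * 90 + ["group_b"] * 10

drift = share_drift(organic_groups, synthetic_groups)
needs_review = flag_for_review(drift)
```

The tolerance is a judgment call that depends on the domain. The point of the check is to route questionable datasets to a person, not to approve them automatically.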
It is tempting to use synthetic data to fill every gap in pursuit of better accuracy and fairness. For our price tracking and predictions feature, however, we understood that running synthetic searches for every possible flight combination worldwide could overwhelm our booking system and affect real travelers searching for flights.
Synthetic data has limitations that go beyond bias. It struggles to replicate the depth of human interactions and emotions, which is crucial for applications like emotion-AI. Nuances such as cultural differences and context-specific emotional cues often evade synthetic datasets, leading to models that can fail to deliver meaningful, human-centric outcomes.
The challenge for developers is to stay vigilant. Building diverse, context-aware datasets and embedding humans in the loop at every stage of the process makes it possible to close the data gap and uphold the ethical standards the industry aspires to achieve.