Taking the world by simulation: The rise of synthetic data in AI
Would you trust AI that has been trained on synthetic data, as opposed to real-world data? You may not know it, but you probably already do — and that’s fine, according to the findings of a newly released survey.
The scarcity of high-quality, domain-specific datasets for testing and training AI applications has left teams scrambling for alternatives. Most in-house approaches require teams to collect, compile, and annotate their own data, a DIY process that compounds the risk of bias, poor edge-case performance (i.e., weak generalization), and privacy violations.
However, a saving grace appears to be at hand: advances in synthetic data. This computer-generated yet realistic data offers solutions to practically every one of the mission-critical problems teams currently face.
That’s the gist of the introduction to “Synthetic Data: Key to Production-Ready AI in 2022.” The survey’s findings are based on responses from people working in the computer vision industry, but they hold broader interest: first, because a wide spectrum of markets depends on computer vision, including extended reality, robotics, smart vehicles, and manufacturing; and second, because the approach of generating synthetic data for AI applications could generalize beyond computer vision.
Datagen, a company that specializes in simulated synthetic data, recently commissioned Wakefield Research to conduct an online survey of 300 computer vision professionals to better understand how they obtain and use AI/ML training data for computer vision systems and applications, and how those choices affect their projects.