2022 – Is ‘fake data’ the real deal in algorithm training? | Artificial Intelligence (AI)

You are behind the wheel of your car, but you are exhausted. Your shoulders begin to sag, your neck begins to droop, and your eyelids slide shut. As your head tilts forward, you veer off the road, speed through a field and crash into a tree.

But what if your vehicle’s monitoring system detected the signs of drowsiness and instead told you to pull off the road and stop? The European Commission has set a legal requirement that from this year new vehicles must be fitted with systems that catch distracted and drowsy drivers in order to avoid accidents. Now a number of startups are training AI systems to recognize the giveaways in our facial expressions and body language.

These companies are taking a new approach to artificial intelligence. Instead of filming thousands of real drivers as they fall asleep and feeding this footage into a deep learning model so that it can “learn” the signs of sleepiness, they create millions of fake human avatars to re-enact the sleepy signals.

“Big data” defines the field of AI for a reason. To train deep learning algorithms accurately, models need a large number of data points. This creates problems for a task such as spotting a person falling asleep at the wheel, which would be difficult and time-consuming to film in thousands of cars. Instead, companies are starting to create virtual data sets.

Synthesis AI and Datagen are two companies that use 3D full-body scanning, including detailed face scans and motion data captured by sensors placed all over the body, to collect raw data from real people. This data is fed through algorithms that vary different dimensions many times over to create millions of 3D representations of people, who look like characters in a video game and exhibit different behaviors in a variety of simulations.

In the case of someone falling asleep at the wheel, they might film a human actor falling asleep and combine that with motion capture, 3D animation and other techniques used to create video games and animated movies to build the simulation they want. “You can map [the target behaviour] across thousands of different body types, different angles, different lighting, and add variability into the movement as well,” says Yashar Behzadi, CEO of Synthesis AI.
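The idea Behzadi describes, mapping one target behaviour across many combinations of body type, camera angle and lighting, can be illustrated with a minimal sketch. The parameter names and values below are hypothetical and chosen only for illustration; they are not taken from Synthesis AI's pipeline, which would render 3D scenes rather than build dictionaries.

```python
from itertools import product
import random

# Hypothetical variation axes; a real pipeline would use far richer parameters.
BODY_TYPES = ["slim", "average", "broad"]
CAMERA_ANGLES_DEG = [-30, 0, 30]
LIGHTING = ["daylight", "dusk", "night"]


def generate_variants(behaviour, jitter=0.1, seed=0):
    """Return one scene description per combination of the variation axes.

    Each variant also gets a slightly randomised motion speed, standing in
    for the "variability in the movement" mentioned in the quote.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    variants = []
    for body, angle, light in product(BODY_TYPES, CAMERA_ANGLES_DEG, LIGHTING):
        variants.append({
            "behaviour": behaviour,
            "body_type": body,
            "camera_angle_deg": angle,
            "lighting": light,
            "motion_speed": 1.0 + rng.uniform(-jitter, jitter),
        })
    return variants


variants = generate_variants("nodding_off")
print(len(variants))  # 3 body types x 3 angles x 3 lighting setups = 27
```

Because every variant is generated from known parameters, its label (here, the behaviour name) comes for free, which is why no manual labelling step appears in the sketch.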

Using synthetic data avoids much of the messiness of the traditional method of training deep learning algorithms. Usually, companies have to amass a vast collection of real-life footage, and low-paid workers painstakingly label each of the clips. These are then fed into the model, which learns to recognize the behaviors.

The big advantage of the synthetic data approach is that it is much faster and cheaper. But these companies also claim that it can help combat bias, which is a huge problem for AI developers. It is well documented that some AI facial recognition software is poor at recognizing and correctly identifying certain demographic groups. This is usually because these groups are underrepresented in the training data, which means that the software is more likely to misidentify these individuals.

Niharika Jain, a software engineer and expert on gender and racial bias in generative machine learning, highlights the infamous example of Nikon Coolpix’s “blink detection” feature, which, because the training data contained a majority of white faces, disproportionately judged Asian faces to be blinking. “A good driver monitoring system must avoid misidentifying members of a particular demographic as asleep more often than others,” she says.

The typical answer to this problem is to collect more data from underrepresented groups in real-world environments. But companies like Datagen say this is no longer necessary. The company can simply create more faces from the underrepresented groups, which means that they make up a larger percentage of the final data set. Real 3D face scan data from thousands of people is turned into millions of AI composites. “There is no bias baked into the data; you have complete control over the age, gender and ethnicity of the people you create,” says Gil Elbaz, co-founder of Datagen. The eerie faces that emerge don’t look like real people, but the company claims they are similar enough to teach AI systems how to respond to real people in similar scenarios.
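The rebalancing step can be made concrete with a small sketch. It assumes, purely for illustration, that the goal is equal representation of every group, whereas the article only says underrepresented groups end up with a larger share. The group names and counts are invented, and `synthetic_needed` is a hypothetical helper, not part of any Datagen product.

```python
def synthetic_needed(real_counts):
    """How many synthetic samples each group needs to match the largest group."""
    target = max(real_counts.values())
    return {group: target - count for group, count in real_counts.items()}


def shares(counts):
    """Fraction of the whole data set that each group makes up."""
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


# Invented example: an imbalanced set of real face scans.
real_counts = {"group_a": 7000, "group_b": 2000, "group_c": 1000}

extra = synthetic_needed(real_counts)
final_counts = {g: real_counts[g] + extra[g] for g in real_counts}

print(shares(real_counts))  # group_a dominates at 70% of the real data
print(extra)                # group_b needs 5000 more, group_c needs 6000 more
print(shares(final_counts)) # each group is one third of the final set
```

Equalising the counts only fixes representation in the data set; it does not by itself show that a trained model performs equally well on every group.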

However, there is some debate about whether synthetic data can really remove bias. Bernease Herman, a data scientist at the University of Washington eScience Institute, says that although synthetic data can improve the robustness of facial recognition models on underrepresented groups, she does not believe that synthetic data alone can close the gap between performance on these groups and on others. Although companies sometimes publish scientific papers explaining how their algorithms work, the algorithms themselves are proprietary, so researchers cannot evaluate them independently.

In areas like virtual reality and robotics where 3D mapping is important, synthetic data companies argue that it may be better to train AI in simulations, especially as 3D modeling, visual effects, and game technologies improve. “It is only a matter of time before … you can create these virtual worlds and fully train your systems in a simulation,” says Behzadi.

This kind of thinking is gaining ground in the autonomous car industry, where synthetic data is helping to teach the artificial intelligence of self-driving vehicles how to navigate the road. The traditional approach – capturing hours of driving footage and feeding it into a deep learning model – was enough to make cars relatively good at navigating the streets. But the problem that vexes the industry is how to make cars reliably handle so-called “edge cases” – events so rare that they barely appear in millions of hours of training data. Examples include a child or a dog running into the street, complicated road works, or even some unexpectedly placed traffic cones, which were enough to stump a driverless Waymo vehicle in Arizona in 2021.

Image: synthetic faces generated by Datagen.

Using synthetic data, companies can create endless variations of virtual world scenarios that rarely occur in the real world. “Instead of waiting millions more miles to accumulate more examples, they can artificially generate as many examples as they need of the edge case for training and testing,” says Phil Koopman, associate professor of electrical and computer engineering at Carnegie Mellon University.

AV companies like Waymo, Cruise, and Wayve increasingly rely on real-world data combined with simulated driving in virtual worlds. Waymo has created a simulated world using AI and sensor data collected from its self-driving vehicles, complete with artificial raindrops and sun glare. This is used to train vehicles for normal driving situations as well as the trickier edge cases. In 2021, Waymo told The Verge it had simulated 15 billion miles of driving, compared to just 20 million miles of real driving.
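To put those two figures in proportion, here is a quick calculation using only the mileage numbers quoted above; the variable names are illustrative.

```python
# Mileage figures as reported in the article (Waymo, 2021).
simulated_miles = 15_000_000_000  # 15 billion
real_miles = 20_000_000           # 20 million

# Simulated miles driven for every real mile.
ratio = simulated_miles / real_miles

# Real driving as a fraction of all miles, simulated and real combined.
real_share = real_miles / (simulated_miles + real_miles)

print(f"{ratio:.0f} simulated miles per real mile")
print(f"real driving is {real_share:.2%} of the total")
```

On these numbers, simulation outweighs road driving by a factor of 750, and real miles are roughly 0.13% of the combined total.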

An additional benefit of testing self-driving vehicles in virtual worlds first is that it reduces the likelihood of very real accidents. “One of the main reasons that autonomous driving is at the forefront of so much of the synthetic data work is fault tolerance,” Herman says. “A self-driving car that makes a mistake 1% of the time or even 0.01% of the time is probably too much.”

In 2017, Volvo’s self-driving technology, which had been trained to respond to large North American animals like deer, was baffled when it first encountered kangaroos in Australia. “If a simulator knows nothing about kangaroos, no amount of simulation will create one until it is seen in testing and the designers figure out how to add it,” says Koopman. For Aaron Roth, a professor of computer and cognitive science at the University of Pennsylvania, the challenge is to create synthetic data that is indistinguishable from real data. He thinks it’s plausible that we’re at this point with facial data, as computers can now produce realistic images of faces. “But for a lot of other things” – which may or may not include kangaroos – “I don’t think we’re there yet.”
