Synthetic Datasets: A Missing Link to More Affordable Machine Learning?

Machine Learning (ML) is a process by which a machine is trained to make decisions. A common machine learning practice is to train ML models with data that consists of both an input (i.e., an image of a long, curved, yellow object) and the expected output that is associated with that input (i.e., the label "banana"). Over time, the machine becomes able to recognize the characteristics of the input and properly assign the correct output on its own.

Of course, machine learning scenarios are often far more complex. Therefore, the more data a machine has to learn from, the better it becomes at predicting the expected outcome or answer. Unfortunately, companies hoping to leverage machine learning often underestimate the quantity and diversity of data they might need. Here are some challenges that must be considered:

Insufficient data will severely hinder the accuracy of a machine learning model
For more sophisticated algorithms, the machine will need to consider the variables of the object itself (i.e, bananas can be yellow, green or have brown spots. They can be peeled or unpeeled, whole or chopped into pieces, etc.) as well as environmental variables (i.e., the lighting, foreground or background elements, point-of-view the banana is seen from, etc.)
Collecting the necessary amounts of data can be expensive, time-consuming, labor-intensive and (at times) even dangerous or impossible.
Traditional methods of collecting and annotating this data introduce a number of opportunities for human error.

In the discussion below, I talk with Sean Doherty (CrossComm's CTO) and Mike Harris (CrossComm's Senior Immersive Technologies Developer) about how synthetic data can help businesses increase the volume of data they have to train ML models and save time, money, and effort, all while decreasing the opportunities for human error. More specifically, we discuss how the Unity game engine—which has traditionally been used to create Virtual Reality games— can convincingly replicate real-world visual elements and scenarios as visual data that can be used to train computer vision models.

Read the full transcript of the discussion below.

Transcript

Phillipe Charles
Welcome, I am Phillipe Charles, I am the content marketing manager at CrossComm. And I'm joined by Sean Doherty; he's our chief technology officer. And also, Mike Harris; he is our senior immersive app developer—so that's augmented reality and virtual reality. And what we are talking today about is synthetic datasets for machine learning, specifically those that are generated within Unity.

If you don't know what Unity is, that's alright, we're gonna get into that. If you don't know what synthetic datasets are, also okay. You're in the right place. This talk is especially for people who may be approaching or thinking about using machine learning, not sure or perhaps have had trouble with figuring out the data side of it. And so this talk is for you.

We're going to talk about how synthetic datasets may be an interesting use case that is relevant for you. We'll explain what synthetic datasets are, the value in using them and what scenarios might make sense. So, Sean is going to help us out on the machine learning side; he's a machine learning expert. Mike, like I said, augmented and virtual reality, but specifically where he's going to help us here is in with his Unity expertise.

Let's get right into it. Let's start with the very basics. So the question that I have to start off is what exactly is machine learning?

Sean Doherty
So machine learning is when you train a machine to be able to make decisions for you. So the basic process with what is today the most popular technique in machine learning (because machine learning is a branch of artificial intelligence) is called Deep Learning, which involves a technique called neural networks. And the way that neural networks work are that you take training data—so you teach a computer by showing an example (an input), and an answer (the expected output)—and you show it over and over and over again, and the system learns to be able to look at an input and determine what the correct answer should be.

So you train what's called a model, which is the result of one of these systems being trained—the basic rules that it's learned over time based on the input it's been trained with—and you can use that model to then evaluate potential inputs for answers. So, use these for classification, for recognition, a lot of problems that sometimes exist in the real world.

Well, machine learning today has a data problem. So most projects that we see come across our door when people want to use machine learning in their systems—this conversation usually stops when it comes to data. Because data is, in some ways, the programmer involved with machine learning. When people program machine learning algorithms, most of the actual rules to decide what the answers are (evaluated by the model), are done by the data. So a machine learning algorithm and machine learning process is really only as good as the data that it has.

And so when we talk to people about "I want to use machine learning, I want to use artificial intelligence," the first question we have is, "where's your data," and a lot of the conversations end there because getting large datasets are expensive. And getting quality datasets is even more expensive. Large companies have these datasets. And they love these, they love machine learning and artificial intelligence because they have large budgets and they have large datasets to be able to use to solve those problems.

For example, somebody like Facebook evaluates 2 billion images a day to be able to evaluate if they have copyright, or if they're obscene. And so, basically, if they tried to find enough people to evaluate all those images, they would run out of humans on the world pretty quickly. And so, instead, they can train a machine learning model to recognize an image (the input) given all of the training data they've used over time, and all the images they've seen, and which ones they've determined are trademarked or explicit. And so these systems grow out of this large amount of data, and they allow them to make really quick decisions at a scale that humans can't do.

And so for smaller companies, or for companies that aren't Facebook or Google or these huge companies, what it usually means is to sometimes add a level of lowering friction for users, to allow users to sort of skip a step, to make the experience nicer. Sometimes it's within recognition and being able to see that this step or this part of the procedure is done, but oftentimes we see machine learning in applications in ways that just sort of seem like magic. Right? And everyone wants magic in their applications.

But what we're talking about today is a potential solution for the little guy; the ones that aren't the Facebooks or Google that have a billion records lying around in a database somewhere. How can you create that same scale of data that's needed to train these models accurately? And how can you do it in a way that's cost-effective and not going to cost you an arm and a leg? So, we're gonna talk about a technique that involves Unity, and Mike's gonna talk more about that. But I'm really excited to see how we can bridge this gap and get more people using these really powerful techniques in their own apps.

Phillipe Charles
Yeah, so that makes a lot of sense. So I think the term synthetic dataset seems pretty self-explanatory, kind of, after you discuss what the data problem is. But Mike, could you actually explain what are synthetic datasets and specifically looking through the lens of Unity?

Mike Harris
Yeah, I remember it was several years ago that I was preparing a talk for All Things Open in Raleigh. And I was consulting with Sean about the talk because it was about these artificial intelligence agents that you can train inside of Unity, which is what we primarily work in. And when I was describing it to him, I remember his excitement about the idea of synthetic data and realizing even back then that this was a huge opportunity for the Unity game engine.

So perhaps the best place to start is to describe what the Unity game engine is. It's, as I mentioned, a game engine. This was a software platform that originally was created so that people could create video games across multiple platforms. And it has built-in functionality for what's called rendering scenes, which is taking three dimensional or two-dimensional objects and arranging them in such a way to make the output for a video game (like what you see on an actual screen) as well as some physics simulations that go on under the hood.

And the synthetic dataset part of it is that, instead of using the Unity game engine to produce video games, you could actually use it to produce the images that Sean was mentioning, that are used in training machine learning algorithms.

Now, the reason that this has become so popular lately is that game engines such as Unity have become increasingly good at making very realistic visuals and graphics. A technique called Physically Based Rendering—which allows for subtle reflections and different levels of glossiness, and more realistic shadows and ambient lighting—have really upped the amount to which video games look realistic. And because of those capabilities, it opens up the possibility of generating synthetic images that then can be used to train machine learning algorithms.

Phillipe Charles
Awesome. And so describe that process a little bit? So what's actually happening in Unity? How are we getting the images in there? What does this look like behind the scenes?

Mike Harris
Yeah, that's a great question. Say you’re a company that wants to have an app that can recognize different types of cereal boxes and give you information about that type of cereal. What you would need first is a digital representation of that cereal box, you could imagine it like a computer-generated rectangle that's the same shape as a cereal box with images that represent the front and the back of the cereal box, and the side where you could see the nutritional information and things like that. But this would be a purely digital object.

Inside of Unity, you can then take that object and you could rotate it to any angle that you want, you could change the lighting conditions, the camera resolution, you could put objects in the background and foreground. And you can generate a whole set of images from these purely digital elements that mimic what a real-world image would look like, and use those to train the machine learning algorithm.

Why this is so powerful is because, as Sean mentioned, if you imagine having to actually go out to take an actual cereal box, adjust all of the lighting, take it from several different angles, put different types of objects in the foreground and background that mimic all the different real-world situations in which the algorithm might encounter that cereal box, that's actually a pretty labor-intensive process just to collect those images. And, on top of that, there's the issue of annotation, which I'm sure we'll get to eventually.

Phillipe Charles
And so what are the top scenarios that you would consider using synthetic datasets. It seems like object recognition—so in what situations is object recognition really going to be important?

Sean Doherty
Sometimes we've seen people come across us that have products that have, maybe, complicated readouts or different scenarios that they would want to be able to recognize with a product or with a test or with some kind of physical thing. And having a person evaluate it, you have to teach them how to evaluate it, and then rely on the error of them evaluating it properly. With this, a lot of times people come to us and say, "Well, can I get the phone camera or whatever model I've trained to just look at the image that they're showing me and tell me what the answer is?"

And so some of those require a lot of time to have someone create all the different scenarios of that device and different backgrounds, as Mike was saying, and simulating all those things. And it can sort of be a deal-breaker—where, like I said, the conversation ends a lot of times, because the expense and the time of creating that dataset is a lot.

One thing that we haven't really talked about that Unity brings up, that I think is one of the key differentiators for this technology, is the fact that when you do this in the real world, where you create a dataset on your own without this technique, what you have to do is not only create the images, but you have to create the answers to train with. Right? That's the other part of the equation.

You have to have "this image produces this output." Now that output could be "this is a cereal box," or "this is a banana" or "this is a dog" or "this is a test strip." That's one type of answer. There's also answers like, "this is where that is on the picture," "this is where this object overlaps the other object," etc. And classification, and drawing, and positioning on the screen can be useful with these techniques.

Those answers are actually usually created by humans. So if you took the pictures, you had to then take the time to create the answer for each picture and match up the picture to the answer before you could ever train the model. Now Unity knows that—since it's generating the images, it can just as easily generate the answers as well—and puts it together in a place that's ready to train. Which speeds up the process a lot. So when we're talking about these images, we're talking about a much larger dataset from the same amount of real-world time, instead of 1000 images, we're talking about having a million images (as many as you want really) because the Unity engine can create whatever combinations of things.

But getting back to the question, anytime you're dealing with needing to recognize a situation that's happening in the physical world, this technique is really good. Because Unity was built to simulate the physical world. It was used to build virtual worlds and games and stuff, and so they built-in physics engines and lighting engines, and any scenario you come up with.

If you haven't seen a video game in a while, they can make grass look like grass and trees sway like trees, and the wind blow, and leaves go around and stuff. The level of sophistication is, to one of Mike's points, is getting to the point where it's kind of tough to distinguish the two. And so Unity can be a way to sort of bridge that gap. And now we're able to feed this right into models and leverage the fact that they can create real-world scenarios.

So, beyond just looking at physical things, I think anytime you have simulations that would be expensive or dangerous in the real world is another good place. So sometimes it's expensive to recreate a certain situation, maybe the thing that is done is like a one-time thing that you don't want to have to redo over and over again because it's really expensive. You wouldn't want to, you know, crash a car 1000 times, or something, to simulate what it looks like. But you would want to be able to still recreate that. And so in the Unity environment, it's cheap to recreate these scenarios. You just push a button and say "start it again." And you can do it as many times as you need to.

So the use cases all over the place. But it's really exciting that this ability exists, that people that have low amounts of data have the opportunity to still use these techniques that larger companies have been the primary beneficiaries of.

Mike Harris
Yeah, and another thing, not to toot our horn too much, but the cross-disciplinary nature of CrossComm:—so we have artificial intelligence experts like Sean there and then myself who has been working in Unity for years and primarily develop applications inside of Unity. These sorts of problems require knowledge on both sides of the equation.

So inside of Unity, yes, we can generate these synthetic datasets. In-depth knowledge of Unity's lighting and rendering system, and textures, and how all those things work is going to be crucial to making sure that the images are tweaked in the right way for training. But then there's the entire side that, once those images leave Unity, you have to understand how to analyze datasets, you have to understand how to bridge the gap between purely synthetic training and how to test that or refine it in real-world scenarios. And those sorts of things is where Sean's expertise would come in training these models.

Sean Doherty
Yeah, and don't sell yourself short, Mike; the Unity side is very difficult. It's not like picking it up off the box and taking a picture and just snapping some photos of it. It takes a lot of creativity and the ability to really understand how to make things look real to be able to really take advantage of this. And Mike, that's what you do every day with virtual worlds.

So leveraging that same sort of expertise and that same experience, but now being able to build it into an opportunity to use it for machine learning, it's really cool. And usually, machine learning is the domain of back-end data experts dealing with databases and data lakes and data warehouses and terms like that. And so it's really interesting to see that there's this thing Unity, which is this virtual world video game kind of realm area, coming into being useful for machine learning, which is kind of a whole different set of people usually. A whole different world. So yeah, it's really exciting to see those sorts of worlds collide a little bit.

Mike Harris
Yeah, and from the Unity side, keeping up with the latest developments for the engine itself—changes to the lighting system, changes to how things are rendered—is a really substantial time investment that we have made on projects for augmented and virtual reality applications. And that's what makes this time so exciting. Because as these tools are being refined and updated several times a year, and you have to learn new systems and stay on top of them, it provides more opportunity for these really realistic renders that make this all and more valuable as a training tool, for sure.

And then also speaking to the annotations that you mentioned for training, the value that Unity has (which is just such a game-changer) is that, whereas a person would normally have to draw a box of each image around where the object to be recognized is, Unity (because it's a game engine) is rendering all of these virtual objects out to the screen. So it knows down to a pixel exactly which object is which. You can add tags to them so that it could tell objects of the same type, or where one object starts and the other one ends. And that removes that human error that you mentioned and also speeds up the process because, if a computer is generating these annotations, it can do it in a matter of milliseconds as opposed to a human who has to understand, interpret the image, draw the outlines and things like that.

Sean Doherty
Or if you're Google you just make people pick which ones are stoplights. And then that's how they get their annotations.

Phillipe Charles
Which seems really fun.

So is synthetic data enough? If I have a problem or I have a product, and I need pictures taken of it, can I just make a synthetic version of it and be done? Or is there more to it than that?

Sean Doherty
Well, the short answer is, it's never enough. Machine learning usages increase the accuracy the more data they have. Right? So generally, when you deploy one of these systems, you're never done training it. Every new image that comes in, you use to train the model further. And it continues to refine and continues to get better, which is one of the reasons why it's so hard to get started in some of these things—you need a critical mass before you can make any decisions, but you need to make decisions before you can get more data to achieve that critical mass. So you're kind of in between a rock and a hard place.

And so what we've seen with studies using these synthetic datasets is that it can be a good starting place, it can be a good place to get you to a much higher accuracy than you could on your own, and then you would continue to refine it with the real-world data you'd collect later on. So whenever something's recognized and you validate from the user that it was what you thought it was, or you have people look at them afterwards, you can then put that back into the dataset to retrain the model and continue to refine it.

Like I said, larger companies have some of these datasets already, and they're already collecting their data; but for a new company (say you're a startup trying to build a product), it wasn't in your budget to spend three months creating pictures of your product to be able to use it for machine learning. So it's really nice to be able to use as a starting point.

It's never enough, though. The answer is there's not a finite number that you can get to and say, "It's done." Never done training. It can always get better and always get more useful. I usually use the sports analogy of someone playing basketball; if you take a shot and you know that it went left or right, and you can just shoot over and over again, you get that feedback and that instant understanding of whether or not it worked. But imagine the more shots you take, the better you're going to be. Right? The more opportunities you have for training, the better the result is going to be. And that's a lot of the way machine learning works as well.

Mike Harris
Yeah, and you could imagine scenarios in which this could provide substantial cost savings for a small startup or something like that. So say they're in the early stages of product design for a product that they want to do some form of image recognition on, or object recognition; being able to get that feedback by having a 3D model of the product, and seeing how easily this is recognized, how much does using this certain type of material affect how an algorithm could pick up and recognize the object before actually going to manufacturing and creating the prototypes might save a lot of costs.

Another example that comes up is the cashier-less grocery store that we've been hearing a lot about lately. So the idea is, "where is the optimal camera placement? And how large does the store have to be? And where can the cameras be placed in terms of ceiling height" and things like that. Determining that actually by renting a physical space and going through all the permutations would be cost-prohibitive in a lot of scenarios. But if you can create a virtual environment inside of Unity, a virtual store where you could test out different sensors and locations, it could open up avenues for iteration that would probably be inaccessible if you had to have a physical space.

Sean Doherty
I think one of the other things is that usually with image-based datasets, and you're dealing with the idea of recognition, you want images from different angles and different permutations of where you're looking at. Imagine if you're trying to recognize a car and you have a bunch of pictures of you looking at the car from one place that you're standing. One camera position. And you took 1000 pictures of your car. Well, that's not really going to work because the system wouldn't be smart enough to recognize your car from the side if all it had was pictures from the front. It's going to recognize your car from the front probably pretty well, but from the side, it wouldn't know at all. And if you tilt five degrees to the left, it wouldn't recognize it either.

And so, what this allows us to do is, in Unity, you can move the camera position wherever you want. And you can take, you know, 10,000 pictures, every degree, from the front of the car to the side of the car, and get all those pictures trained together in the same model. And so there's just a lot of interesting advantages in this realm, again, to get started and to give you a place where you can make these things possible from day one.

Phillipe Charles
Another thing (I think Mike sent something in Slack that was interesting) is that there's also the issue of privacy and regulatory concerns with collecting data, and how this sort of brings us around that or past that issue because we're not really collecting real-world people or real-world data that could be attributed to people.

Sean Doherty
Yeah, data privacy is a massive issue in so many different spheres. You all have seen the cookie-accepting pop-ups that came from GDPR, which is the European Union standard for data protection. And who knows what coming legislation is going to do to data privacy and data protection. We've already seen in healthcare space, HIPAA is one of those that a lot of people know about. But when you're talking about machine learning, the more data you have, the better you can do, but sometimes that data is protected—as it should be. There's sort of a conundrum that's posed in the area that talks about, "well, should all data be private?" And everyone's like, "Yeah, my data should be private." It's like, "Well, what if I could take your medical data and then cure some form of cancer? Would you give me your medical data then?" And then so the idea of privacy kind of changes somewhat. Right? It depends on how you're going to use it. Well, could I save someone's life if I gave up my medical data privacy? Maybe.

So when we talk about this synthetic thing, this kind of bypasses a lot of that. Right? Because in a space where there is protections, where there are restrictions on how you can use the data, and what data you can store, and how long, and whether or not it can be attributed to a person or not— and there's all these kind of rules and regulations around it—this opens the door to create datasets that aren't real, but are as close to real as the real thing. And so you can use the, maybe, smaller amount of data that you do have that's anonymized and properly vetted and run through all the proper channels to be secure data-wise, but then also augment that with a whole bunch of data that's pretty close to accurate and can give your model a better chance of succeeding out of the gate.

Phillipe Charles
Awesome. And one other thing I saw in what Mike sent earlier—because we talked about how this could be really interesting for companies like startups, or companies that are just trying to get started—but I thought it was interesting that (I think it was a survey or something) many enterprise companies are trying to do machine learning and a large portion of them are blowing through their budgets because of many of the problems that we're talking about here. So problems that are affecting, yes, certainly startups, but even some of these larger organizations that are, like you said, not Google and not Facebook.

Sean Doherty
That's a lot of people.

Phillipe Charles
Yeah, and they're also blowing through their budgets because they're underestimating the cost, the time, all of the things that we said that is involved in collecting this data. I thought that was really interesting in that read.

Sean Doherty
Yeah, and I think I find that everybody wants more data. Because, again, the validity of how well the system works will increase with better data—with more data. Right?

We haven't even mentioned the fact that a lot of times real-world data has problems with it. And if humans are doing all the annotations and creating the answers, well there might be mistakes, and then you have misclassifications and that's just as bad. If the algorithm is only as good as how it's trained and you train it with incorrect or invalid data, that's a problem. So, yeah, there's a lot of things.

And I don't think it's just necessarily the big companies. The larger companies have data, but even they are probably investigating these techniques because, even if they have a lot of data, more is better. And if they can have, you know, 25 million instead of 25,000, that's going to be better.

And so we're talking about cost savings, it's not necessarily just for the little guy either. Larger companies, especially ones that (like Mike was alluding to) may be prerelease, maybe the product isn't really out there—getting the pictures right now is impossible or very cost-prohibitive. Maybe you want to train something for day one of a product, but the product's still in development. Or we have a CAD model of it, but we don't have the real thing yet. Could we train a model off of synthetic data and get image recognition working on day one in time for the product to be launched?

I know that there are companies that build hardware and manufacturing of silicon that take $10 million to make the first chip. If you're manufacturing something and it costs that much to make your first chip, you're not gonna do it to train a machine learning model. You know? It's cost-prohibitive.

So there's lots of things involved with the whole situation—it's not just the big guys. I think everyone can benefit from this technique. And so that's where it's exciting for us to be consultants, to be able to come in and to hear the issues, and to hear the real-world business objectives behind what they're trying to achieve, and then trying to find something that works. And that's where the creativity and the ability for Mike to create things in virtual worlds makes all this possible. Because I don't think, again, you fire up Unity and you wouldn't just magically push a button and get all this stuff out there. Right? It takes skill and technique. But it is possible. And that's kind of what we're excited about.

Mike Harris
Yeah, you brought up a really good point about an added value of creating these synthetic datasets inside of Unity. And it's that once you've created the dataset, if you find out that there's a problem with it, if the way that that information was collected is actually not useful in training your model, if you did that in a traditional manner of getting physical images and having them hand-annotated, and then you had to go back and recollect those images: that's a substantial expense and a substantial amount of time which could be really risky and expensive for a small startup. Whereas, on our side, if it turns out that this synthetic dataset has any issues or needs to be tweaked in some ways, we could change the parameters inside of Unity, run the rendering on the cloud in the course of an evening or a day (depending on the size of the dataset) and then crank out an entirely new one with entirely new annotations with all of those tweaks and keep the business running, and also not exponentially increase the amount of cost involved.

Phillipe Charles
That's really awesome. That's amazing. Do you guys have any other thoughts or anything else you want to add before we wrap up here?

Mike Harris
I think that this is a really exciting time to be doing things in this realm just because of, as I mentioned, the increasing capabilities of these game engines to render realistic images, which is only going to increase at faster and faster rates over the coming years. Being on the cutting edge of using that technology, and marrying it with machine learning in this way, I just find it a really exciting time to be doing this type of work.

Phillipe Charles
Absolutely. So yeah, I think the proper way to close this is to say that, yes, we are excited. And if you happen to be one of those companies that is trying to think through this yourself and potentially struggling with the collection of data, and you've heard this and you think, "You know what, this might be a real solution for us" (and it very well might be), we're happy to have that conversation with you. So feel free to reach out to us. And if there's any questions that we can answer, we're happy to help with those.

But please join me in thanking Sean and thanking Mike for this conversation and certainly educating me, and hopefully educating you, as well. Thank you, guys.

Sean Doherty
Thank you.

Mike Harris
Good chatting with you.