Synthetic datasets, or why you can't train robust models with generic data

Training robust AI models can be challenging when using crowdsourced or publicly available datasets. The most useful neural networks are ones that replace expert knowledge to solve real problems. One example would be a visual inspection model for the factory floor.

Human level performance

Human performance comes in two tiers: general and expert. An expert is someone who has spent a minimum amount of time in the field or has received practical training in it.

The accuracy of your AI model depends directly on the quality of the data used to train it. When trained on general datasets, the model usually surpasses the knowledge level embedded in the training samples, but it remains well below expert-level performance. A computer is more attentive than an operator performing the same inspection task, so the trained model will probably reach an accuracy that is marginally better than that of the human-labeled dataset.

If you think it's a good idea to use services like Amazon Mechanical Turk to label your dataset, let me ask you a question. Would you hand someone a quick instruction manual and then put them to work inside the factory?

That does not sound like a great idea. But it turns out that is what every one of us is doing when leveraging services like these. The bulk of your dataset can be generated in this manner, but is it helping your model?

Practical considerations

A tricky sample may never make it into the dataset: if a labeler cannot understand your labeling instructions, the sample may get discarded. While human-level performance is achievable with large crowd-labeled datasets, the truth is that the easiest samples get labeled first.

I have seen numerous occasions in which turks simply targeted the lowest-hanging fruit by submitting no labels for the empty samples. It is a first-come, first-rewarded economy, and the difficult samples may end up unlabeled.

Model hardening

Synthetic data makes it possible to make the task at hand harder than it will be in real life.

Variance, or noise, is good for your model, and generating samples programmatically can make the model more resilient to unseen inputs. Human-labeled data is often curated and less noisy than the real inputs.
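
As a rough illustration of adding noise programmatically, here is a minimal sketch that turns one clean sample into several harder variants. It assumes NumPy and images stored as float arrays in the range [0, 1]; the function name `harden_sample` and the default noise levels are made up for this example and would need tuning for a real task.

```python
import numpy as np

def harden_sample(image, rng, noise_std=0.05, max_brightness_shift=0.2):
    """Return a noisier copy of `image` (float array with values in [0, 1])."""
    # Additive Gaussian noise is a crude stand-in for sensor noise.
    noisy = image + rng.normal(0.0, noise_std, size=image.shape)
    # One random offset for the whole image is a crude stand-in for a lighting change.
    noisy = noisy + rng.uniform(-max_brightness_shift, max_brightness_shift)
    # Keep pixel values inside the valid range.
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(seed=0)
clean = np.full((64, 64, 3), 0.5)   # placeholder for a curated sample
variants = [harden_sample(clean, rng) for _ in range(4)]
```

Raising `noise_std` or `max_brightness_shift` beyond what the factory camera would ever produce is one way to make training harder than real life.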

Crowd-labeled data may work out in the beginning, to quickly get started with your model. But believe me, you don't want a crowd-trained model to be put in production across a factory floor without verifying that its performance surpasses expert level, or at least gets as close as possible to it.

Methods for generating data

There are several methods you can leverage when generating synthetic data.

Simulator data

Use a simulator to generate new data. Custom shaders can provide RGB+D and segmentation labels. Microsoft's AirSim 3D simulator project obtained promising results, and because it is open source, a lot of companies were using it for internal purposes. Microsoft is now focusing on aerospace-grade data generation.
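
To show why simulator output comes with labels for free, here is a toy sketch. It is not AirSim and does not use any simulator API: it assumes only NumPy and draws random rectangles, with the hypothetical function `render_scene` returning a color image, a depth map, and a segmentation mask that agree pixel for pixel.

```python
import numpy as np

def render_scene(rng, size=64, max_objects=3):
    """Render random rectangles and return (rgb, depth, segmentation)."""
    rgb = np.zeros((size, size, 3), dtype=np.uint8)
    depth = np.full((size, size), np.inf)          # background is infinitely far away
    seg = np.zeros((size, size), dtype=np.int32)   # 0 = background
    n_objects = int(rng.integers(1, max_objects + 1))
    for object_id in range(1, n_objects + 1):
        x0, y0 = rng.integers(0, size - 8, size=2)
        w, h = rng.integers(8, size // 2, size=2)
        z = rng.uniform(1.0, 10.0)                 # distance from the camera
        region = np.zeros((size, size), dtype=bool)
        region[y0:y0 + h, x0:x0 + w] = True
        visible = region & (z < depth)             # simple z-buffer test
        rgb[visible] = rng.integers(0, 256, size=3)
        depth[visible] = z
        seg[visible] = object_id                   # the label is known by construction
    return rgb, depth, seg

rgb, depth, seg = render_scene(np.random.default_rng(0))
```

A real simulator replaces the rectangles with rendered 3D assets, but the principle is the same: the generator already knows which object produced each pixel, so nobody has to label it.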

Synthetic augmentation

Simple computer vision enhancements can work as well. Using OpenCV's inpaint function, the price in each image was rewritten, as in the examples shown below. A mixed approach was implemented here: the human labelers defined the polygon in which the price was contained, and the rest was done programmatically.


Development focus

Validating crowd samples takes about as much effort as developing a dataset generator.
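
To give a sense of what validating crowd samples can involve, here is a small sketch that flags samples for expert review when they have no labels or when annotators disagree. The function name `review_queue`, the 0.8 agreement threshold, and the sample data are assumptions for illustration only.

```python
from collections import Counter

def review_queue(crowd_labels, min_agreement=0.8):
    """Return ids of samples whose crowd labels are missing or disputed."""
    flagged = []
    for sample_id, labels in crowd_labels.items():
        if not labels:                        # nobody labeled this sample
            flagged.append(sample_id)
            continue
        # Share of annotators who chose the most common label.
        top_count = Counter(labels).most_common(1)[0][1]
        if top_count / len(labels) < min_agreement:
            flagged.append(sample_id)
    return flagged

crowd_labels = {
    "img_001": ["ok", "ok", "ok"],
    "img_002": ["ok", "defect", "defect"],    # annotators disagree
    "img_003": [],                            # skipped by everyone
}
flagged = review_queue(crowd_labels)
```

The flagged samples are often exactly the tricky ones, so each of them still needs an expert's attention.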

The dataset is the most essential part of AI, and a lot of people get it wrong. If the dataset is poorly curated and not enough time is spent at this stage, your project may end up failing. The dataset should be one of your main priorities: if you have enough quality data, you can train almost any model to a high standard.


The most successful models deployed in production use a mix of synthetic and real data to achieve great results.
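
One simple way to mix the two sources is to fix the share of synthetic samples in every training batch. The sketch below uses only the Python standard library; the name `mixed_batches` and the 25% synthetic share are assumptions, since the right ratio depends on the task and has to be found by experiment.

```python
import random

def mixed_batches(real, synthetic, batch_size, synthetic_fraction, rng):
    """Yield batches that draw a fixed share of samples from the synthetic pool."""
    n_synthetic = round(batch_size * synthetic_fraction)
    n_real = batch_size - n_synthetic
    while True:
        batch = rng.sample(real, n_real) + rng.sample(synthetic, n_synthetic)
        rng.shuffle(batch)   # avoid a fixed real-then-synthetic ordering
        yield batch

rng = random.Random(0)
real = [("real", i) for i in range(100)]
synthetic = [("synthetic", i) for i in range(1000)]
batches = mixed_batches(real, synthetic, batch_size=32, synthetic_fraction=0.25, rng=rng)
batch = next(batches)
```

Fixing the share per batch keeps a large synthetic pool from drowning out a small set of real samples.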

Let us know what other techniques you are using to give your model that extra edge in performance.

