Understand the potential applications and limitations of generating synthetic data using AI models.
The idea of “synthetic data,” or artificially generated information, has recently caused a stir. Data is a huge asset for businesses in this age, and knowledge often provides a decisive competitive edge. The notion of easily obtaining free data has sparked extravagant claims and controversy.
But, as your mom probably told you — if something seems too good to be true, it is.
However, the reality is a little more nuanced with synthetic data. While we certainly can't stop collecting data and simply "just ask the model," some fascinating middle-ground uses of AI-generated data exist, and judicious use of this data can help drive your business forward. There's no free lunch here, but there is at least the possibility of a complimentary side or two.
To better understand the opportunities opening up with synthetic data, I will introduce you to three primary modes you can use to generate new data. These aren't the only approaches available, but they are the most common ones today.
1. Direct querying
The first mode is the one people most commonly associate with the idea of synthetic data: direct querying. When you first used ChatGPT or another AI chatbot, there was probably a point when you said to yourself, "Wait a second. I can interview this just like I would a research respondent," tweaked the system prompt ("You are a Gen Z participant who is passionate about RPGs…") and proceeded to ask your questions.
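In code, that pattern amounts to little more than a system prompt plus a question. Here is a minimal sketch in Python using the OpenAI SDK; the persona, model name and question are illustrative placeholders, not recommendations.

```python
# A minimal sketch of "direct querying": a persona in the system prompt,
# then interview-style questions. Model name and persona are examples only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

persona = (
    "You are a Gen Z research participant who is passionate about RPGs. "
    "Answer interview questions in the first person, in your own voice."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat-capable model follows the same pattern
    messages=[
        {"role": "system", "content": persona},
        {"role": "user", "content": "What do you look for when choosing a new game?"},
    ],
    temperature=0.9,  # higher temperature for more varied "respondent" answers
)

print(response.choices[0].message.content)
```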
Working with this kind of data can quickly become problematic or uninsightful because training datasets can be old, and biased or inappropriate viewpoints can easily bubble up in responses. Additionally, a large chunk of the training data for these models comes from services like Reddit, which can have spicier takes than you'd want in your own data.
Beyond those red flags, the main issue with this kind of data is that it is boring. By its very nature, the model produces plausible answers based on the amalgam of all its training, so it tends to produce obvious answers, the very opposite of the kind of insight we are usually looking for. While direct questioning of LLMs can be interesting, large-scale generation of synthetic data in this way is likely not the best solution.
2. Data augmentation
We can move beyond direct querying with the second mode: using the models to extract new data from data you bring to them, often called data augmentation. This method still uses the reasoning and summarization power of the LLMs, but rather than basing the output solely on the model's original training data, you use the model to analyze your own data and generate perturbations of it as if they were original data.
The process looks something like this. First, you must know the data you are bringing to the table. Perhaps it's data sourced from an internal system, primary research, a trusted third-party supplier, a segmentation or an append of desirable behaviors. Once you understand the source of your data, you can use the LLM to analyze it and produce more data with compatible characteristics.
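As a rough sketch of what that can look like in practice, the example below feeds a handful of real (ideally anonymized) seed records to a model and asks for look-alike records with the same schema. The field names, model and prompt are assumptions for illustration, not a production recipe.

```python
# LLM-based data augmentation sketch: seed records in, look-alike records out.
import json
from openai import OpenAI

client = OpenAI()

seed_records = [
    {"age_band": "25-34", "region": "Midwest", "monthly_spend": 42.50, "channel": "email"},
    {"age_band": "35-44", "region": "Northeast", "monthly_spend": 61.00, "channel": "paid_social"},
]

prompt = (
    "Below are example customer records in JSON. Generate 5 additional records "
    "with the same fields and similar (but not identical) value distributions. "
    "Return ONLY a JSON array.\n\n" + json.dumps(seed_records, indent=2)
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,
)

# In practice you would parse more defensively and validate the output against
# your schema and the statistical properties of the source data before use.
synthetic_records = json.loads(response.choices[0].message.content)
print(synthetic_records)
```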
This approach is far more promising and provides you with control you cannot get from the LLMs on their own.
Many in the martech industry might be thinking, "Like look-alikes?" and they would be correct. The new models allow us to generate look-alikes in a way we never could before, augmenting or generating data that stays consistent and comparable with the known data we already have.
Often, having a volume of data like this is helpful when testing systems or exploring some of the fringes a system might need to handle. It could also be used to provide truly anonymous data for demonstrations or presentations. Avoid the circular thinking of “Let’s generate a ton of data and analyze it,” when you are better off simply analyzing the root data.
3. Data retraining
Finally, the third mode of generating synthetic data is retraining a model to represent the data we have directly. The "holy grail" approach of taking a model and doing custom fine-tuning on a data set has been around for a long time but, until recently, has simply taken too many resources and been far too expensive to be a reasonable option for most.
But technologies change. The prevalence of smaller but high-performance models (e.g., LLaMA, Orca and Mistral), together with recent revolutionary approaches to fine-tuning (e.g., Parameter-Efficient Fine-Tuning, or PEFT, and its LoRA, QLoRA and DoRA variants), means that we can effectively and efficiently produce highly customized models trained on our data. These are likely to be the techniques that truly make synthetic data shine, at least for the near future.
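To give a sense of how lightweight this has become, here is a minimal sketch of attaching a LoRA adapter to a small open model with the Hugging Face peft library. The model ID, hyperparameters and target modules are illustrative defaults, not tuned recommendations.

```python
# Sketch: wrap a base causal language model with a LoRA adapter via peft.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_id = "mistralai/Mistral-7B-v0.1"  # any small causal LM works the same way
                                             # (a 7B model needs a GPU or substantial RAM)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(base_model_id)

lora_config = LoraConfig(
    r=8,                      # rank of the low-rank update matrices
    lora_alpha=16,            # scaling factor for the adapter weights
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections commonly targeted
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters

# From here, the wrapped model trains like any other Hugging Face model
# (e.g., with transformers' Trainer) on your own, task-specific data.
```

Because only the small adapter matrices are trained, the compute and memory footprint is a fraction of full fine-tuning, which is what makes this approach practical for far more teams than it used to be.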
While there is no free lunch, and the dangers of bias, boredom and circular thinking are very real, the opportunities of synthetic data make it highly compelling. When leveraged correctly, it can create efficiencies and exponential possibilities.