In today’s data-driven investment environment, the quality, availability, and specificity of data can make or break a strategy. Nevertheless, investment professionals routinely confront limitations: historical datasets cannot capture emerging risks, alternative data is often incomplete or costly, and open-source models and datasets are skewed toward major markets and English-language content.
As firms look for more customizable and forward-looking tools, synthetic data, particularly when derived from generative AI (GenAI), is emerging as a strategic asset, offering new ways to simulate market scenarios, train machine learning models, and backtest strategies. This report investigates how GenAI-powered synthetic data is reshaping investment workflows, from simulating asset returns to improving sentiment models, and what practitioners need to know to evaluate its usefulness and limitations.
What exactly is synthetic data, how is it generated by GenAI models, and why is it becoming increasingly relevant for investment use cases?
Consider two common challenges. A portfolio manager who wants to optimize performance across different market regimes is limited by historical data, which cannot account for “what-if” scenarios that have yet to take place. Likewise, a data scientist may want to track sentiment in German-language news for small-cap stocks, but most available datasets are in English and focus on large-cap companies, which limits both coverage and relevance. In both cases, synthetic data offers a practical solution.
What distinguishes GenAI synthetic data, and why it matters now
Synthetic data refers to artificially generated datasets that replicate the statistical properties of real-world data. Although the concept is not new (techniques such as Monte Carlo simulation and bootstrapping have long supported financial analysis), what has changed is the how.
GenAI refers to a class of deep learning models capable of generating high-fidelity synthetic data across modalities such as text, tabular data, images, and time series. In contrast to traditional methods, GenAI models learn complex real-world distributions directly from data, eliminating the need for rigid assumptions about the underlying generative process. This capability opens up powerful use cases in investment management, especially in areas where real data is scarce, complex, incomplete, or limited by cost, language, or regulation.
Common GenAI models
There are several types of GenAI models. Variational autoencoders (VAEs), generative adversarial networks (GANs), diffusion-based models, and large language models (LLMs) are the most common. Each is built on neural network architectures, although they differ in size and complexity. These methods have already demonstrated the potential to improve certain data-oriented workflows within the industry. For example, VAEs have been used to create synthetic volatility surfaces to improve options trading (Bergeron et al., 2021). GANs have proven useful for portfolio optimization and risk management (Zhu, Mariani, and Li, 2020; et al., 2023). Diffusion-based models have proven useful for simulating asset return correlation matrices under various market regimes (Kubiak et al., 2024). And LLMs have proven useful for market simulations (Li et al., 2024).
Table 1. Approaches for generating synthetic data.
| Method | Types of data it generates | Example applications | Generative? |
| --- | --- | --- | --- |
| Monte Carlo | Time series | Portfolio optimization, risk management | No |
| Copula-based functions | Time series, tabular | Credit risk analysis, asset dependence models | No |
| Autoregressive models | Time series | Volatility forecasting, asset return simulation | No |
| Bootstrapping | Time series, tabular, text | Confidence interval estimation, stress testing | No |
| Variational autoencoders | Tabular, time series, audio, images | Volatility surface simulation | Yes |
| Generative adversarial networks | Tabular, time series, audio, images | Portfolio optimization, risk management, model training | Yes |
| Diffusion models | Tabular, time series, audio, images | Correlation modeling, portfolio optimization | Yes |
| Large language models | Text, tabular, images, audio | Sentiment analysis, market simulation | Yes |
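To make the contrast in Table 1 concrete, here is a minimal sketch of the simplest non-generative method listed, Monte Carlo simulation of asset prices under geometric Brownian motion. The drift, volatility, and horizon figures are illustrative assumptions, not values from this article:

```python
import numpy as np

def simulate_gbm_paths(s0, mu, sigma, n_days, n_paths, seed=0):
    """Simulate price paths under geometric Brownian motion,
    the classic Monte Carlo workhorse from Table 1."""
    rng = np.random.default_rng(seed)
    dt = 1 / 252  # daily steps, 252 trading days per year
    # Daily log returns drawn from a normal distribution
    log_returns = rng.normal(
        (mu - 0.5 * sigma**2) * dt, sigma * np.sqrt(dt), size=(n_paths, n_days)
    )
    return s0 * np.exp(np.cumsum(log_returns, axis=1))

# Illustrative parameters: 8% annual drift, 20% annualized volatility
paths = simulate_gbm_paths(s0=100.0, mu=0.08, sigma=0.20, n_days=252, n_paths=10_000)
print(paths.shape)  # (10000, 252)
```

Note the contrast with the generative methods in the table: this approach bakes in a rigid distributional assumption (normally distributed log returns) rather than learning the distribution directly from data.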
Evaluation of synthetic data quality
Synthetic data must be realistic and match the statistical properties of the real data it replicates. Existing evaluation methods fall into two categories: qualitative and quantitative.
Qualitative approaches involve visually comparing real and synthetic datasets. Examples include visualizing distributions and comparing scatter plots between pairs of variables, time series, and correlation matrices. A GAN trained to simulate asset returns for estimating value at risk, for example, must successfully reproduce the heavy tails of the return distribution. A diffusion model trained to produce synthetic correlation matrices under different market regimes should adequately capture the co-movement of assets.
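As a toy illustration of the heavy-tail check described above, one can compare summary statistics such as excess kurtosis before reaching for plots. Both samples here are simulated stand-ins (a Student-t draw for “real” returns, a normal draw for a thin-tailed “synthetic” generator); no actual GAN is involved:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Stand-in for real daily returns: Student-t with 6 degrees of freedom
# (theoretical excess kurtosis of 3, i.e. heavy tails)
real_returns = stats.t.rvs(df=6, size=10_000, random_state=rng) * 0.01
# Stand-in for a poorly trained generator: normal draws with matching scale
synthetic_returns = rng.normal(0.0, real_returns.std(), size=10_000)

# Excess kurtosis is ~0 for a normal distribution, so a large gap here
# flags that the synthetic data misses the tails of the real data
print(stats.kurtosis(real_returns))       # typically well above 0
print(stats.kurtosis(synthetic_returns))  # close to 0
```

A gap like this would show up visually as a synthetic histogram with too few extreme observations relative to the real one.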
Quantitative approaches include statistical tests that compare distributions, such as the Kolmogorov-Smirnov test, the population stability index, and the Jensen-Shannon divergence. These tests output statistics that indicate the similarity between two distributions. For example, the Kolmogorov-Smirnov test produces a p-value that, if lower than 0.05, suggests that the two distributions differ significantly. This can offer a more concrete measure of similarity between two distributions than visualizations.
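A minimal sketch of the Kolmogorov-Smirnov comparison using SciPy, again with simulated stand-ins for the real and synthetic samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
real = rng.normal(0.0, 0.01, size=5_000)        # stand-in for real daily returns
good_synth = rng.normal(0.0, 0.01, size=5_000)  # drawn from the same distribution
bad_synth = rng.normal(0.0, 0.03, size=5_000)   # wrong volatility

# Two-sample KS test: a p-value below 0.05 suggests the
# distributions differ significantly
print(stats.ks_2samp(real, good_synth).pvalue)  # expected to be large
print(stats.ks_2samp(real, bad_synth).pvalue)   # expected to be tiny
```

The same pattern applies to any real-versus-synthetic column comparison; in practice one would run the test per variable and flag any column whose p-value falls below the chosen threshold.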
Another approach is “train-on-synthetic, test-on-real,” in which a model is trained on synthetic data and tested on real data. Its performance can then be compared with that of a model trained and tested on real data. If the synthetic data successfully replicates the properties of the real data, the performance of the two models should be comparable.
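The train-on-synthetic, test-on-real check can be sketched with scikit-learn. Purely for illustration, both the “real” and “synthetic” samples below are simulated from the same known two-class distribution, so the two accuracies should come out close:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)

def make_data(n, rng):
    """Two Gaussian classes; a stand-in for real or synthetic tabular data."""
    X = np.vstack([rng.normal(-1.0, 1.0, size=(n, 2)),
                   rng.normal(+1.0, 1.0, size=(n, 2))])
    y = np.array([0] * n + [1] * n)
    return X, y

X_real_train, y_real_train = make_data(500, rng)
X_real_test, y_real_test = make_data(500, rng)
X_synth, y_synth = make_data(500, rng)  # pretend this came from a generator

# Baseline: train on real, test on real
base = LogisticRegression().fit(X_real_train, y_real_train)
base_acc = accuracy_score(y_real_test, base.predict(X_real_test))

# TSTR: train on synthetic, test on the same real test set
tstr = LogisticRegression().fit(X_synth, y_synth)
tstr_acc = accuracy_score(y_real_test, tstr.predict(X_real_test))

# Comparable accuracies suggest the synthetic data captures
# the decision-relevant structure of the real data
print(base_acc, tstr_acc)
```

A large gap between the two scores is the warning sign: it indicates the generator has missed structure that the downstream model needs.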
In action: improving financial sentiment analysis with GenAI synthetic data
To put this into practice, I fine-tuned a small open-source LLM, Qwen3-0.6B, for financial sentiment analysis using the FiQA-SA dataset[1]. The dataset consists of 822 training examples, with most sentences classified as “positive” or “negative” sentiment.
I then used GPT-4o to generate 800 synthetic training examples. The synthetic dataset generated by GPT-4o was more diverse than the original training data, covering more companies and sentiment expressions (Figure 1). Increasing the diversity of the training data gives the LLM more examples from which to learn to identify sentiment in text, which may improve model performance on unseen data.
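The generation step amounts to prompting a model such as GPT-4o for labeled headlines and parsing the reply. The prompt wording and line format below are illustrative assumptions, not the exact prompt used in this study, and the actual API call is omitted so the sketch stays self-contained:

```python
def build_prompt(n_examples, labels=("positive", "negative", "neutral")):
    """Illustrative prompt asking an LLM for labeled financial headlines."""
    return (
        f"Generate {n_examples} short, realistic financial news headlines "
        f"about a diverse set of companies. Label each with one sentiment "
        f"from {list(labels)}. Output one example per line as: headline | label"
    )

def parse_response(text):
    """Parse 'headline | label' reply lines into (sentence, class) pairs."""
    examples = []
    for line in text.splitlines():
        if "|" in line:
            sentence, label = line.rsplit("|", 1)
            examples.append((sentence.strip(), label.strip().lower()))
    return examples

# A mock reply standing in for the model's output
mock_reply = (
    "Acme Corp shares surge after record earnings. | positive\n"
    "Regulator opens probe into Beta Bank. | negative"
)
print(parse_response(mock_reply))
```

In a real pipeline the parsed pairs would be deduplicated and spot-checked for label quality before being appended to the training set.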
Figure 1. Distribution of sentiment classes for the real (left), augmented (middle), and synthetic (right) training datasets; the augmented dataset combines real and synthetic data.

Table 2. Example sentences from the real and synthetic training datasets.
| Sentence | Class | Source |
| --- | --- | --- |
| Fall in Weir leads FTSE down from record high. | Negative | Real |
| AstraZeneca wins FDA approval for important new lung cancer pill. | Positive | Real |
| Shell and BG shareholders to vote on deal at the end of January. | Neutral | Real |
| Tesla’s quarterly report shows a 15% increase in vehicle deliveries. | Positive | Synthetic |
| PepsiCo holds a press conference to address the recent recall. | Neutral | Synthetic |
| Home Depot’s CEO departs abruptly amid internal controversies. | Negative | Synthetic |
After fine-tuning a second model on a combination of real and synthetic data with the same training procedure, the weighted F1 score improved by almost 10 percentage points on the validation dataset (Table 3), with a final F1 score of 82.37% on the test dataset.
Table 3. Model performance on the FiQA-SA validation dataset.
| Model | Weighted F1 score |
| --- | --- |
| Model 1 (real) | 75.29% |
| Model 2 (real + synthetic) | 85.17% |
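The weighted F1 metric reported in Table 3 can be computed for any set of predictions with scikit-learn; the labels and predictions below are made up purely for illustration:

```python
from sklearn.metrics import f1_score

# Toy validation labels and predictions (illustrative only)
y_true = ["positive", "negative", "positive", "neutral", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "neutral", "negative", "positive"]

# 'weighted' averages per-class F1 scores by class support, which suits
# imbalanced label distributions like the one in this dataset
score = f1_score(y_true, y_pred, average="weighted")
print(f"{score:.2%}")  # 83.33%
```

The support-weighted average prevents a rare class (here, “neutral”) from dominating the headline number the way an unweighted macro average would.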
I found that increasing the share of synthetic data too far had a negative impact: there is a Goldilocks zone between too much and too little synthetic data for optimal results.
No silver bullet, but a valuable tool
Synthetic data is not a replacement for real data, but it is worth experimenting with. Choose a method, evaluate the quality of the synthetic data, and run A/B tests in a sandbox environment, comparing workflows with and without different ratios of synthetic data. You may be surprised by the findings.
You can view all code and datasets in the RPC Labs GitHub repository, and take a deeper dive into the LLM case study in the Research and Policy Center’s “Synthetic Data in Investment Management” report.
[1] The data set can be downloaded here: https://huggingface.co/datasets/thefinai/fiqa-sentiment-classification