In today’s data-driven investment environment, the quality, availability, and specificity of data can make or break a strategy. Nevertheless, investment professionals routinely confront limitations: historical datasets cannot capture emerging risks, alternative data is often incomplete or costly, and open-source models and datasets are skewed toward major markets and English-language content.
As firms look for more customizable and forward-looking tools, synthetic data, particularly when derived from generative AI (GenAI), is emerging as a strategic asset, offering new ways to simulate market scenarios, train machine learning models, and backtest strategies. This report investigates how GenAI-powered synthetic data is reshaping investment workflows, from simulating asset returns to improving sentiment models, and what practitioners need to know to evaluate its usefulness and limitations.
What exactly is synthetic data, how is it generated by GenAI models, and why is it becoming increasingly relevant for investment use cases?
Consider two common challenges. A portfolio manager who wants to optimize performance across different market regimes is limited by historical data, which cannot account for “what-if” scenarios that have yet to take place. Likewise, a data scientist may want to track sentiment in German-language news for small-cap stocks, but most available datasets are in English and focus on large-cap companies, which limits both coverage and relevance. In both cases, synthetic data offers a practical solution.
What distinguishes GenAI synthetic data, and why it matters now
Synthetic data refers to artificially generated datasets that replicate the statistical properties of real-world data. Although the concept is not new (techniques such as Monte Carlo simulation and bootstrapping have long supported financial analysis), what has changed is the how.
GenAI refers to a class of deep learning models capable of generating high-fidelity synthetic data across modalities such as text, tabular data, images, and time series. In contrast to traditional methods, GenAI models learn complex real-world distributions directly from data, eliminating the need for rigid assumptions about the underlying generative process. This capability opens up powerful use cases in investment management, especially in areas where real data is scarce, complex, incomplete, or limited by cost, language, or regulation.
Common GenAI models
There are several types of GenAI models. Variational autoencoders (VAEs), generative adversarial networks (GANs), diffusion-based models, and large language models (LLMs) are the most common. Each is built on neural network architectures, although they differ in size and complexity. These methods have already demonstrated the potential to improve certain data-oriented workflows within the industry. For example, VAEs have been used to create synthetic volatility surfaces to improve options trading (Bergeron et al., 2021). GANs have proven useful for portfolio optimization and risk management (Zhu, Mariani, and Li, 2020; et al., 2023). Diffusion-based models have proven useful for simulating asset return correlation matrices under various market regimes (Kubiak et al., 2024). And LLMs have proven useful for market simulations (Li et al., 2024).
Table 1. Approaches for generating synthetic data.
| Method | Types of data it generates | Example applications | Generative? |
| --- | --- | --- | --- |
| Monte Carlo | Time series | Portfolio optimization, risk management | No |
| Copula-based functions | Time series, tabular | Credit risk analysis, asset dependence models | No |
| Autoregressive models | Time series | Volatility forecasting, asset return simulation | No |
| Bootstrapping | Time series, tabular, text | Confidence interval estimation, stress testing | No |
| Variational autoencoders | Tabular, time series, audio, images | Volatility surface simulation | Yes |
| Generative adversarial networks | Tabular, time series, audio, images | Portfolio optimization, risk management, model training | Yes |
| Diffusion models | Tabular, time series, audio, images | Correlation modeling, portfolio optimization | Yes |
| Large language models | Text, tabular, images, audio | Sentiment analysis, market simulation | Yes |
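To make the contrast in Table 1 concrete, here is a minimal sketch of the simplest non-generative method listed, Monte Carlo simulation of asset prices under geometric Brownian motion. The drift, volatility, and horizon figures are illustrative assumptions, not values from this article:

```python
import numpy as np

def simulate_gbm_paths(s0, mu, sigma, n_days, n_paths, seed=0):
    """Simulate price paths under geometric Brownian motion,
    the classic Monte Carlo workhorse from Table 1."""
    rng = np.random.default_rng(seed)
    dt = 1 / 252  # daily steps, 252 trading days per year
    # Daily log returns drawn from a normal distribution
    log_returns = rng.normal(
        (mu - 0.5 * sigma**2) * dt, sigma * np.sqrt(dt), size=(n_paths, n_days)
    )
    return s0 * np.exp(np.cumsum(log_returns, axis=1))

# Illustrative parameters: 8% annual drift, 20% annualized volatility
paths = simulate_gbm_paths(s0=100.0, mu=0.08, sigma=0.20, n_days=252, n_paths=10_000)
print(paths.shape)  # (10000, 252)
```

Note the contrast with the generative methods in the table: this approach bakes in a rigid distributional assumption (normally distributed log returns) rather than learning the distribution directly from data.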
Evaluation of synthetic data quality
Synthetic data must be realistic and match the statistical properties of the real data it replicates. Existing evaluation methods fall into two categories: qualitative and quantitative.
Qualitative approaches involve visually comparing real and synthetic datasets. Examples include visualizing distributions and comparing scatter plots between pairs of variables, time series, and correlation matrices. A GAN trained to simulate asset returns for estimating value at risk, for example, must successfully reproduce the heavy tails of the return distribution. A diffusion model trained to produce synthetic correlation matrices under different market regimes should adequately capture the co-movement of assets.
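As a toy illustration of the heavy-tail check described above, one can compare summary statistics such as excess kurtosis before reaching for plots. Both samples here are simulated stand-ins (a Student-t draw for “real” returns, a normal draw for a thin-tailed “synthetic” generator); no actual GAN is involved:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Stand-in for real daily returns: Student-t with 6 degrees of freedom
# (theoretical excess kurtosis of 3, i.e. heavy tails)
real_returns = stats.t.rvs(df=6, size=10_000, random_state=rng) * 0.01
# Stand-in for a poorly trained generator: normal draws with matching scale
synthetic_returns = rng.normal(0.0, real_returns.std(), size=10_000)

# Excess kurtosis is ~0 for a normal distribution, so a large gap here
# flags that the synthetic data misses the tails of the real data
print(stats.kurtosis(real_returns))       # typically well above 0
print(stats.kurtosis(synthetic_returns))  # close to 0
```

A gap like this would show up visually as a synthetic histogram with too few extreme observations relative to the real one.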
Quantitative approaches include statistical tests that compare distributions, such as the Kolmogorov-Smirnov test, the population stability index, and the Jensen-Shannon divergence. These tests output statistics that indicate the similarity between two distributions. For example, the Kolmogorov-Smirnov test produces a p-value that, if lower than 0.05, suggests that the two distributions differ significantly. This can offer a more concrete measure of similarity between two distributions than visualizations.
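A minimal sketch of the Kolmogorov-Smirnov comparison using SciPy, again with simulated stand-ins for the real and synthetic samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
real = rng.normal(0.0, 0.01, size=5_000)        # stand-in for real daily returns
good_synth = rng.normal(0.0, 0.01, size=5_000)  # drawn from the same distribution
bad_synth = rng.normal(0.0, 0.03, size=5_000)   # wrong volatility

# Two-sample KS test: a p-value below 0.05 suggests the
# distributions differ significantly
print(stats.ks_2samp(real, good_synth).pvalue)  # expected to be large
print(stats.ks_2samp(real, bad_synth).pvalue)   # expected to be tiny
```

The same pattern applies to any real-versus-synthetic column comparison; in practice one would run the test per variable and flag any column whose p-value falls below the chosen threshold.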
Another approach is “train-on-synthetic, test-on-real,” in which a model is trained on synthetic data and tested on real data. Its performance can then be compared with that of a model trained and tested on real data. If the synthetic data successfully replicates the properties of the real data, the performance of the two models should be comparable.
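The train-on-synthetic, test-on-real check can be sketched with scikit-learn. Purely for illustration, both the “real” and “synthetic” samples below are simulated from the same known two-class distribution, so the two accuracies should come out close:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)

def make_data(n, rng):
    """Two Gaussian classes; a stand-in for real or synthetic tabular data."""
    X = np.vstack([rng.normal(-1.0, 1.0, size=(n, 2)),
                   rng.normal(+1.0, 1.0, size=(n, 2))])
    y = np.array([0] * n + [1] * n)
    return X, y

X_real_train, y_real_train = make_data(500, rng)
X_real_test, y_real_test = make_data(500, rng)
X_synth, y_synth = make_data(500, rng)  # pretend this came from a generator

# Baseline: train on real, test on real
base = LogisticRegression().fit(X_real_train, y_real_train)
base_acc = accuracy_score(y_real_test, base.predict(X_real_test))

# TSTR: train on synthetic, test on the same real test set
tstr = LogisticRegression().fit(X_synth, y_synth)
tstr_acc = accuracy_score(y_real_test, tstr.predict(X_real_test))

# Comparable accuracies suggest the synthetic data captures
# the decision-relevant structure of the real data
print(base_acc, tstr_acc)
```

A large gap between the two scores is the warning sign: it indicates the generator has missed structure that the downstream model needs.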
In action: improving financial sentiment analysis with GenAI synthetic data
To put this into practice, I fine-tuned a small open-source LLM, Qwen3-0.6B, for financial sentiment analysis using the FiQA-SA dataset[1]. The dataset consists of 822 training examples, with most sentences classified as “positive” or “negative” sentiment.
I then used GPT-4o to generate 800 synthetic training examples. The synthetic dataset generated by GPT-4o was more diverse than the original training data, covering more companies and sentiment expressions (Figure 1). Increasing the diversity of the training data gives the LLM more examples from which to learn to identify sentiment in text, which may improve model performance on unseen data.
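The generation step amounts to prompting a model such as GPT-4o for labeled headlines and parsing the reply. The prompt wording and line format below are illustrative assumptions, not the exact prompt used in this study, and the actual API call is omitted so the sketch stays self-contained:

```python
def build_prompt(n_examples, labels=("positive", "negative", "neutral")):
    """Illustrative prompt asking an LLM for labeled financial headlines."""
    return (
        f"Generate {n_examples} short, realistic financial news headlines "
        f"about a diverse set of companies. Label each with one sentiment "
        f"from {list(labels)}. Output one example per line as: headline | label"
    )

def parse_response(text):
    """Parse 'headline | label' reply lines into (sentence, class) pairs."""
    examples = []
    for line in text.splitlines():
        if "|" in line:
            sentence, label = line.rsplit("|", 1)
            examples.append((sentence.strip(), label.strip().lower()))
    return examples

# A mock reply standing in for the model's output
mock_reply = (
    "Acme Corp shares surge after record earnings. | positive\n"
    "Regulator opens probe into Beta Bank. | negative"
)
print(parse_response(mock_reply))
```

In a real pipeline the parsed pairs would be deduplicated and spot-checked for label quality before being appended to the training set.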
Figure 1. Distribution of sentiment classes for the real (left), augmented (middle), and synthetic (right) training datasets; the augmented dataset combines real and synthetic data.

Table 2. Example sentences from the real and synthetic training datasets.
| Sentence | Class | Source |
| --- | --- | --- |
| Fall in Weir leads FTSE down from record high. | Negative | Real |
| AstraZeneca wins FDA approval for important new lung cancer pill. | Positive | Real |
| Shell and BG shareholders to vote on deal at the end of January. | Neutral | Real |
| Tesla’s quarterly report shows a 15% increase in vehicle deliveries. | Positive | Synthetic |
| PepsiCo holds a press conference to address the recent recall. | Neutral | Synthetic |
| Home Depot’s CEO departs abruptly amid internal controversies. | Negative | Synthetic |
After fine-tuning a second model on a combination of real and synthetic data with the same training procedure, the weighted F1 score improved by almost 10 percentage points on the validation dataset (Table 3), with a final F1 score of 82.37% on the test dataset.
Table 3. Model performance on the FiQA-SA validation dataset.
| Model | Weighted F1 score |
| --- | --- |
| Model 1 (real) | 75.29% |
| Model 2 (real + synthetic) | 85.17% |
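The weighted F1 metric reported in Table 3 can be computed for any set of predictions with scikit-learn; the labels and predictions below are made up purely for illustration:

```python
from sklearn.metrics import f1_score

# Toy validation labels and predictions (illustrative only)
y_true = ["positive", "negative", "positive", "neutral", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "neutral", "negative", "positive"]

# 'weighted' averages per-class F1 scores by class support, which suits
# imbalanced label distributions like the one in this dataset
score = f1_score(y_true, y_pred, average="weighted")
print(f"{score:.2%}")  # 83.33%
```

The support-weighted average prevents a rare class (here, “neutral”) from dominating the headline number the way an unweighted macro average would.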
I found that increasing the share of synthetic data too far had a negative impact: there is a Goldilocks zone between too much and too little synthetic data for optimal results.
No silver bullet, but a valuable tool
Synthetic data is not a replacement for real data, but it is worth experimenting with. Choose a method, evaluate the quality of the synthetic data, and run A/B tests in a sandbox environment, comparing workflows with and without different ratios of synthetic data. You may be surprised by the findings.
You can view all code and datasets in the RPC Labs GitHub repository, and take a deeper dive into the LLM case study in the Research and Policy Center’s “Synthetic Data in Investment Management” report.
[1] The data set can be downloaded here: https://huggingface.co/datasets/thefinai/fiqa-sentiment-classification