
Synthetic clinical data reality check - not so fast, data replacers

  • Writer: Amanda Berra

Quick rewind: our tour of the clinical data market that ended in a cliffhanger

In Part 1 of this blog post (linked below if you missed it!), we mapped healthcare’s market for clinical data, a.k.a. “Real World Data” or RWD. We looked at different types of clinical data, including who owns usable data (providers, payers, EHR vendors, multi-source aggregators), why health systems have found what seems like a proprietary “data goldmine” so hard to monetize, and how third-party aggregators have stitched together nationwide, longitudinal data sets that life-science, tech, and payer buyers are willing to pay for. And then came the cliffhanger: we’ve reached the point where, in theory, AI can generate statistically identical, privacy-risk-free “synthetic” data sets—essentially, generated data that has all the characteristics of RWD but isn’t RWD. On paper, synthetic clinical data seems to solve a lot of the pain points associated with the real thing—HIPAA worries, consent barriers, data breach headlines, six-figure license fees—so… is the market for actual clinical data about to be ruined? Today we tackle that question head-on.

Missed part 1 on the "Real World Data" market? Catch up by clicking here.

A guided tour of the synthetic clinical data market (2025)

Initial market snapshot and growth trajectory

Here’s where one feels societal pressure to cite market-size estimates and projections. But since I don’t feel comfortable relying on them for anything beyond generating bar charts for visual interest and gravitas (a topic for another day!), let’s instead just focus on the pattern.

The market for real-world clinical data is big and growing moderately; the market for synthetic clinical data is far smaller, but also growing far faster. This, of course, makes it attractive to investors/businesses looking for growth avenues—so, suffice it to say we are not the only people peering into this world to see how it’s currently shaping up.

Let’s start with who’s selling and buying, and for what. 

Who’s selling—four supplier archetypes

There are basically four types of organizations currently in the synthetic clinical data space:

  1. Multi-source data brokers/aggregators expanding their product portfolio. Multi-source brokers—think Truveta, Datavant, HealthVerity—have begun offering synthetic modules, either through partnerships or in-house R&D. This allows them to monetize the same raw feeds twice: once as a de-identified data set and again as a privacy-sealed synthetic twin.

  2. Cloud and big tech. AWS, Microsoft Azure, and Google Cloud each now offer synthetic-data toolkits inside their healthcare AI stacks, letting any clinical data steward with sufficient raw material generate privacy-preserving synthetic data sets on demand. 

  3. Provider-generated synthetic data. Large IDNs and academic medical centers are beginning to become suppliers themselves, generating synthetic copies of their EHR data for internal research and then licensing derivatives externally.

  4. Pure-play synthetic data generators. There are now companies—for instance MDClone, Syntegra, and Syntheticus—for which synthetic data is the whole business, and they compete on the fidelity and privacy guarantees of their generative models.

Who’s buying—and why

In short: life-science R&D, digital-health start-ups, med-tech manufacturers, payers, and health systems make up today’s synthetic data buyer group. They’re in the market for synthetic clinical data because they need large, diverse, shareable data sets for AI development, and they’d love to find that resource at lower cost and with a lighter legal-review/regulatory-compliance burden—especially in a way that scales to satisfy regulators in different geographies, for instance when a project needs to pass the highest current tiers of regulatory standards: the GDPR (EU General Data Protection Regulation)—the HIPAA of Europe, commonly viewed as a much stricter standard for controlling the processing of personal data—and the CCPA (California Consumer Privacy Act)—the privacy statute that gives California residents rights over how their personal data is collected and sold.

Synthetic data sidesteps SOME of the obstacles these laws pose, such as consent and cross-border data-transfer provisions (GDPR) or opt-out obligations (CCPA). (But only some—more on this in a moment.)

A graphic showing common uses for synthetic clinical data

Today’s most common synthetic data use cases  

A few major use cases are attracting the greatest use of synthetic data in healthcare today, including:

  • AI/ML model training and validation. For instance, a digital pathology startup could augment its limited starting slide libraries with synthetic images. An NLP vendor could bulk up its library of notes to improve rare-disease classification. Interestingly—since introducing bias is considered a risk of synthetic data sets—researchers could also use synthetic clinical data to try to correct for bias in some other AI-generated data.

  • Clinical-trial design and protocol rehearsal. Sponsors can run simulations using their real eligibility criteria and recruitment timelines on synthetic twins before touching live records, which gives them a rapid assessment of the feasibility of their projects as currently designed.

  • Digital-twin and rare-disease modeling. Researchers can create artificial cohorts for a range of purposes, from novel drug discovery/new indications for existing drugs, to testing counterfactual treatment paths, to doing capacity forecasting and facilities planning.

  • Cross-enterprise data sharing.  Payers, providers, and analytics firms can exchange synthetic versions of linked data sets when regulations or corporate policy block direct sharing.

  • Staff training and innovation. Synthetic data can enable low-risk internal employee activities, such as teaching staff how to use tools that require clinical data, hosting “hack-a-thons,” or helping staffers experiment with and collaborate on ideas for new analytical models or tools.

Taken together, those channels explain why some investors see a compelling addressable market— even though actual spend is still measured in hundreds of millions.

Why synthetic clinical data adoption isn’t a slam-dunk

This is the part where the savvy reader says, “I sense a ‘however’ coming…” Yes. Here comes the ‘however’: some of the challenges that stand in the way of synthetic data displacing the market for real-world clinical data anytime soon.

1. You can’t make synthetic clinical data sets without first having access to (pricey) real-world data

High-fidelity synthetic data sets must start with—and constantly refer back to—clean, well-labeled source data. Synthetic clinical data generator companies have to train models on full EHR extracts before they can offer privacy-safe “digital twins.”  Also, to stay current, every new synthetic data set generated from that given source ideally needs to be re-anchored back to updated source data (e.g., to reflect how underlying trends are changing over time, and to keep the synthetic version from getting too far afield of the reality it’s supposed to be depicting). That adds ongoing cost and data-sharing friction for any organizations that don’t already own the raw asset.  

For end buyers, the licensing cost for that real data ends up getting factored into the price tag for the synthetic product—meaning, and this is a really important point, synthetic data is often NOT as cheap as one would think it might be.

(And that’s assuming making synthetic data from a given real-world data set is feasible at all under the license terms—some contracts may forbid creating new products from the licensed data.)
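To illustrate why generation is tethered to real source data, here’s a deliberately naive sketch in Python (hypothetical field names and toy numbers—not real patient data, and not any vendor’s actual method): it fits per-field marginal statistics from a “real” extract and samples from them independently. Production generators model joint and longitudinal structure with far more sophistication, but even this toy version cannot produce a single synthetic row until it has been fed the source data.

```python
import random
import statistics

def fit_marginals(records):
    """Learn per-field mean/stdev from 'real' source rows (dicts of numeric fields)."""
    fields = records[0].keys()
    return {f: (statistics.mean(r[f] for r in records),
                statistics.pstdev(r[f] for r in records))
            for f in fields}

def sample_synthetic(params, n, seed=0):
    """Draw n synthetic rows from the fitted marginals (naive independence assumption)."""
    rng = random.Random(seed)
    return [{f: rng.gauss(mu, sd) for f, (mu, sd) in params.items()}
            for _ in range(n)]

# Toy "real" extract -- hypothetical fields, made-up values
real = [{"age": a, "systolic_bp": b}
        for a, b in [(54, 128), (61, 140), (47, 118), (70, 150), (58, 132)]]

params = fit_marginals(real)        # no source data, no params, no product
synthetic = sample_synthetic(params, 1000)
```

Note the independence assumption: a sampler this naive would destroy the clinical correlations between fields—which is exactly the kind of defect the validation work described in the next section exists to catch.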

2. You can’t validate synthetic clinical data without investing in specialized resources

Deciding whether a synthetic cohort is “good enough” means running privacy-risk, fidelity, bias, and utility tests—often dozens of them—against the original data set. There are tools available to act as statistical scorecards, but someone still has to interpret those charts, probe failure cases, and iterate.

In practice, that means data scientists plus domain clinicians—specialists who are experts in the pertinent clinical area(s)—poring over outlier distributions, longitudinal event sequences, and rare-disease prevalence.

Bottom line: Validation is a significant workload and/or cost.
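To give a flavor of what one such fidelity test looks like, here’s a minimal two-sample Kolmogorov–Smirnov check in plain Python (toy numbers, not from any real data set): it measures the largest gap between the empirical distributions of a single field in the real vs. synthetic cohorts. A real validation suite would run dozens of tests like this across fields and joint distributions, plus the privacy and bias checks.

```python
import bisect

def ks_statistic(real_vals, synth_vals):
    """Two-sample KS statistic: the max gap between the two empirical CDFs."""
    real_s, synth_s = sorted(real_vals), sorted(synth_vals)

    def ecdf(sorted_vals, x):
        # Fraction of values <= x
        return bisect.bisect_right(sorted_vals, x) / len(sorted_vals)

    combined = sorted(set(real_s) | set(synth_s))
    return max(abs(ecdf(real_s, x) - ecdf(synth_s, x)) for x in combined)

# Toy age distributions -- made-up values for illustration
real_ages  = [54, 61, 47, 70, 58, 66, 52, 49, 63, 57]
good_synth = [55, 60, 48, 69, 59, 65, 51, 50, 62, 56]   # tracks the real cohort
bad_synth  = [20, 22, 25, 19, 24, 21, 23, 26, 18, 27]   # obviously off

print(ks_statistic(real_ages, good_synth))  # small gap = plausible fidelity
print(ks_statistic(real_ages, bad_synth))   # gap of 1.0 = fails outright
```

The scorecard part is easy to automate; the expensive part is what follows—deciding why a test failed and whether it matters for the intended use case.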

3. Synthetic data’s regulatory burden is lighter—but definitely not zero

Remember how I just said that using synthetic data is a way to sidestep burdensome regulatory requirements related to the use of real-world clinical data? Well, that’s true, BUT even synthetic data still comes with boxes to check.

Scanning across the different regulators/regulations, the specifics of synthetic data compliance vary, but important themes include:

  • Statistical fidelity: Synthetic data sets must come with validation that they preserve clinical correlations from real-world data—without introducing bias

  • Proof of excluding real PHI: Data sets should contain no identifiable patient data—and this has to be verified through automated audits

  • Re-identification safeguards: Data set producers/users have to show that they use techniques to prevent anyone from mathematically working backwards to the original data set or otherwise working out a way to identify patients

  • Labelling: Synthetic clinical data has to be clearly identified as such
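As a concrete (and hypothetical—no regulator prescribes this exact test) example of the re-identification-safeguards theme, one common technique is a nearest-neighbor distance check: flag any synthetic record that sits suspiciously close to a real record, which suggests the generator memorized rather than generalized. A minimal sketch:

```python
import math

def nearest_real_distance(synth_row, real_rows):
    """Euclidean distance from a synthetic row to its closest real row."""
    return min(math.dist(synth_row, r) for r in real_rows)

def flag_memorized(synth_rows, real_rows, threshold=1.0):
    """Return synthetic rows closer than `threshold` to any real row --
    candidates for having been copied rather than generated."""
    return [s for s in synth_rows
            if nearest_real_distance(s, real_rows) < threshold]

# Toy numeric feature vectors, e.g. (age, systolic_bp) -- made-up values
real_rows  = [(54, 128), (61, 140), (47, 118)]
synth_rows = [(55, 131), (54, 128.1), (70, 150)]  # second row is a near-copy

leaks = flag_memorized(synth_rows, real_rows)
print(leaks)  # [(54, 128.1)]
```

Real privacy audits layer several such tests (membership-inference resistance, attribute-disclosure risk, and so on), but the underlying question is the same: can anyone work backwards from a synthetic row to a real patient?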

As Union’s Eric Fontana puts it, “You need to be able to unravel how the data set was created, prove that it contains no identifiable data, AND that it is sufficiently representative of real world data. What it all adds up to is that you need to satisfy regulators that your synthetic data set is both REAL and NOT REAL.”

Good times!!

4. Uncertainty makes the synthetic clinical data regulatory burden heavier

Because synthetic clinical data is relatively new to the data market, it sometimes sits in a regulatory gray zone. Many organizations will not love taking on that kind of inherent risk and/or may default to the highest-perceived compliance standard just in case. That will, of course, reduce risk—but it won’t do the thing that in theory makes synthetic clinical data so attractive: it doesn’t do anything to drive down cost or compliance burden.

Additionally, the Trump 47 administration has introduced plenty of new sources of regulatory uncertainty, including but not limited to its deregulatory stance on the use of AI. In theory, a deregulatory attitude should reduce compliance burden—but at least in the short term, it will tend to increase it as organization leaders struggle to navigate risk tradeoffs in an uncertain environment.

5. Standards wild west: no single source of truth in statistical data quality (yet)

As mentioned above, regulators vary in what they look for when assessing synthetic data in healthcare use cases; while there are certainly common themes, there is currently no single consensus or standardized approach for systematically evaluating the privacy and utility of synthetic clinical data… yet.

Various third-party firms and academic centers are currently positioning themselves as synthetic-data auditors—think HITRUST (a commonly accepted standard for data security) but for bias prevention, privacy risk, and statistical fidelity.

At some point there may well be a trusted name whose independent attestation will satisfy IRBs, legal counsel, and would-be commercial partners—but that day is not yet here.

6. Clinician skepticism will slow uptake in some use cases

A range of recent-ish studies all show that clinicians are slowly warming up to AI in their workplaces—but also that they have nuanced views on which use cases they do and don’t feel comfortable using AI for. The closer the use case gets to “advising on clinical practice,” the more skeptical clinicians tend to be.

In any clinician-facing use case, clinicians will 100% be asking “how do we know that this data set is ‘good’?”—and we already know there will be difficulty answering that question.

That means that, unless and until there is a trusted single quality standard, clinician skepticism is going to act as a general brake on adoption of synthetic clinical data—especially for clinical guidance-related use cases.

So…what’s the bottom line here? Will synthetic clinical data ruin the proprietary-data business?

We do NOT see synthetic data ruining the market for real-world clinical data. Instead, here are three things we can see playing out:

Three trajectories we’re watching

1.    A two-tier, or two-step, market will solidify itself

The data market may separate into sandbox vs. regulatory-compliance-ready layers. Less expensive (more on that below), fast synthetic samples will satisfy algorithm tuning, early protocol design, and educational use. But verified real-world data sets will retain market share and pricing power for real-world uses, like FDA submissions, label-expansion studies, pharmacovigilance, and payer negotiations. These products may commonly come to be used in sequence, e.g., vendors will let users prototype on synthetic clinical data and later swap in the high-test real stuff for final analyses.

2.    Synthetic generation will turn into table-stakes “super-de-ID”

If/as synthetic clinical data becomes more commonly used for the above use cases, we expect more buyers will ask for it even if they are also in the market for real-world data. We can see a world where real-world data suppliers who can’t offer a synthetic-twin set as part of the product/service package may find themselves knocked off the shortlist. Over time, we can imagine the ability to deliver both formats—plus transparent validation reports—will feel like a basic capability that every vendor would be expected to have.

3.    Real-data owners will all become synthetic purveyors—probably by way of contracting with/acquiring synthetic data companies

Because you need (expensive) real-world clinical source data to generate high-quality synthetic data sets, large health systems, provider-data consortia and, for the time being, EHR vendors are structurally advantaged to offer synthetic clinical data at lower cost than other would-be players in the synthetic data market.

That means these players will all need to acquire, or contract for synthetic data generation capability. (Or, in theory, stand up their own capability but that seems less likely.)

That being the case, it’s hard to see stand-alone synthetic data generation companies competing with real-world clinical data purveyors (who can supply synthetic versions of their own data) in selling directly to end users. It’s much easier to see them as M&A targets, or contractors to real-world clinical data purveyors.

In summary...

We hereby conclude that synthetic data WILL NOT destroy the clinical-data market. Instead, real and synthetic data sets will coexist, each optimized for different users and use cases. Players who already have access to real-world clinical data will be the ones supplying synthetic data to end users; to do that, they will have to build, buy, or contract for synthetic data generation capability behind the scenes. Oh, and someone is going to win the race to be the top validator—the HITRUST of synthetic data—for healthcare. We shall see who that ends up being.

Thanks for coming along on this data adventure with us.

Next steps

  • Feel free to go back and read part 1 if you missed it.

  • Don't forget to subscribe to our blog (scroll down)

  • Union members (only): Sign up for our next strategy bootcamp on digital health! Thursday July 18, 1-2 PM ET.


