Examining MAHA’s jump into Real World Data (RWD), and what other industry players can learn from it

Eric Fontana
Sep 24, 2025

Updated: Oct 2, 2025

If you watched our Mid-year policy update in July, you’ll recall my co-presenter, Yulan Egan, and I discussed the MAHA report and the omission of any mention of research-ready data assets—it seemed like such a resource would be central to support plans for broad epidemiologic investigations into a variety of chronic diseases. Well, lo and behold, when the MAHA strategy document dropped on September 9, real-world data (RWD) was number two on a long list of strategic priorities, as detailed below:

"Real World Data Platform (RWDP): The NIH will link multiple datasets, such as claims information, electronic health records, and wearables data, into a single integrated dataset for researchers studying the causes of, and developing treatments for, the chronic disease crisis. The RWDP will eliminate redundancies from data collection, linkage, and compute infrastructures (including artificial intelligence (AI)/machine learning and high-throughput analytics) while maintaining rigorous privacy protections and consent protections. It will also dramatically reduce administrative overhead by relying on a unified set of data use and governance agreements."

On the surface, this is an encouraging development for at least a couple of reasons. For one, the agency (the NIH specifically) is planning on gaining access to a major repository of RWD of a variety of sources and flavors, taking a page directly out of the playbook of many large pharmaceutical and medical device companies that have been leveraging RWD for roughly two decades as an increasingly routine part of their evidence generation strategy.

For another, the announcement to move into RWD may accelerate its democratization across the industry. That said, with less than 3.5 years remaining for the current administration, agency leadership will need to get moving quickly if it’s to realize any meaningful use of RWD within that timeframe. And that may be easier said than done.

I’ve had the good fortune to have a front row seat for numerous life science companies’ initial forays into RWD in a prior role, and here’s the difficult reality: Onboarding RWD is slower and clunkier than many organizations initially realize. That makes the acquisition-to-insight chain of events rather challenging. When this announcement dropped, I started thinking through many of the possible considerations for the agency as it formally moves forward on its intent to build a RWD asset, and why a bumpy start might be inevitable. There is a wide range of good lessons that can be learned from the past experiences of life science companies jumping on the RWD bandwagon.

Before we do that, let’s parse what HHS has told us about what it plans to pull together, so we can understand what it’s seeking before diving into what it might be up against.

The RWD the HHS is looking for

“The NIH will link multiple datasets, such as claims information, electronic health records, and wearables data, into a single integrated dataset for researchers studying the causes of, and developing treatments for, the chronic disease crisis”

The intention appears straightforward (more on actual execution challenges later). The NIH will take multiple data assets (although the current proposed database seems to exclude genomics data, which is a little surprising) and, where individuals are able to be identified across datasets (via unique tokens or similar methods), link disparate sources of data together to provide a more complete picture of those individuals than any single dataset can. Many researchers already take similar approaches when working with RWD; greater insight may be gained by unifying data assets with different information across a common population.
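To make the token-based linkage idea concrete, here is a minimal sketch in Python. It is a toy illustration only: the key, the field choices, and the helper names (`make_token`, `link_by_token`) are assumptions made for this example, not how the NIH or any vendor actually tokenizes records.

```python
import hashlib
import hmac

# Hypothetical shared secret; real tokenization services manage keys very differently.
SECRET_KEY = b"demo-key-not-for-real-use"

def make_token(first, last, dob):
    """Derive a keyed hash from normalized identifiers (illustrative only)."""
    normalized = f"{first.strip().lower()}|{last.strip().lower()}|{dob}"
    return hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()

def link_by_token(claims, ehr):
    """Group records from both sources under their shared token."""
    linked = {}
    for source_name, records in (("claims", claims), ("ehr", ehr)):
        for rec in records:
            entry = linked.setdefault(rec["token"], {"claims": [], "ehr": []})
            entry[source_name].append(rec["payload"])
    return linked

# Made-up records: each source carries only the token plus its own content.
claims = [
    {"token": make_token("Ana", "Diaz", "1980-02-01"), "payload": "E11.9 office visit"},
    {"token": make_token("Bo", "Lee", "1975-07-15"), "payload": "I10 pharmacy fill"},
]
ehr = [
    {"token": make_token(" ana ", "DIAZ", "1980-02-01"), "payload": "HbA1c 7.9%"},
]

linked = link_by_token(claims, ehr)
both = [t for t, e in linked.items() if e["claims"] and e["ehr"]]
print(len(linked), "individuals,", len(both), "seen in both sources")
```

The point of the sketch is that only individuals whose identifiers produce the same token in both sources gain the "more complete picture"; everyone else stays visible in one dataset only, which is why overlap between assets matters so much.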

The second part of the statement is interesting, although perhaps not surprising given that chronic disease has been a focal point of MAHA since its genesis. It appears the NIH wishes to provide these data for scientists to bolster cause and treatment research on chronic disease specifically. This is an important commitment, and we’ll explore some of the nuance associated with it shortly (see section below labeled Challenge #1).

“The RWDP will eliminate redundancies from data collection, linkage, and compute infrastructures (including artificial intelligence (AI)/machine learning and high-throughput analytics) while maintaining rigorous privacy protections and consent protections. It will also dramatically reduce administrative overhead by relying on a unified set of data use and governance agreements”

The latter half of the MAHA strategy on RWD starts to get at a desired approach for working with the data. Clearly, the agency believes having a single, in-house data source will speed up its ability to run its own analytics. The MAHA strategy also states its intent to maintain rigorous privacy protections (cough…Change Healthcare); however, calling out consent protections is a little curious, given individuals within most licensable RWD assets have typically given informed consent to have their data used at the point of care. (You know that little checkbox you tick at most doctor’s offices to say “This organization can use my data?” That’s the informed consent.) Perhaps the agency is signaling its intent to take the somewhat unusual step of utilizing identifiable data or to acquire data further upstream compared to the usual data vendor/aggregator-acquired approach. Whatever the case, it's interesting verbiage, given that most of the RWD in the market today comes de-identified “out of the box.” Less surprisingly, we get confirmation that the NIH will be employing machine learning and other forms of AI to analyze the datasets—that’s consistent with this administration’s apparent enthusiasm for AI-based analytics and in line with the broader market.

OK, so we think we know what types of data MAHA is looking for. Where could it go looking?

The vast majority of the data that MAHA has stated it's seeking (clinical, claims and wearables) likely require partnerships to acquire, especially if it wants to understand the commercially insured population. This likely means it’ll turn to one or more data vendors to source elements of the data it needs in this novel asset creation effort. It should be noted that, over the past few years, under prior administrations, the NIH has increasingly laid the foundation for obtaining more data assets as part of its mission, including RFIs issued on RWD in 2023, periodic refreshes to its own strategic plan for data science, and a compilation of publicly available assets it already employs. So, the agency has been hard at work for some time on understanding the landscape and expanding available data resources, well before MAHA was a formalized effort.

Notable major assets that would effectively already be in-house are “traditional” Medicare claims data (Medicare parts A and B). CMS already facilitates licensing this data for researchers and qualified entities (QEs), who have drawn on it quite broadly over the past decade. These data have some notable limitations: The enrolled population in traditional Medicare has been rapidly shrinking over the last decade (from ~70% in 2015 to around 46% in 2024) due to the rise of Medicare Advantage. This means a big chunk of the healthcare claims for the 65+ population now fall into commercial data assets (United Healthcare, BCBS, etc.) or make their way through clearinghouses that end up being aggregated (providing data rights) and sold in the commercial market. The government presumably also has access to a wide body of Medicaid claims data that could provide insight into pediatric health encounters (unfortunately many of the links to those data asset pages are 404-ing so we can't provide them here.)

Given the likely appetite for a broader slate of data for research into conditions of interest, a reasonable supposition would be that the NIH will gauge interest from various data vendors (e.g., IQVIA, Komodo, Truveta, Optum Life Science, OMNI, just to mention a handful off the top of my head) to fold into its available resources, given these companies hold significant data that can create the ambitiously rich asset the NIH may be seeking.

RECOMMENDED READ: If you’re looking for a deep dive into the clinical data market, Amanda Berra authored a super deep 2-part write up recently: Part 1: Welcome to the Clinical Data Market and Part 2: Synthetic Clinical Data Reality Check: Not So Fast Data Replacers

And this is where things can get a little sticky operationally for any entity seeking to acquire one or multiple large RWD assets. It’s tempting to assume that there is an express lane from data licensing to insight, but that discounts several major pitfalls that exist around acquiring and operationalizing data. So, from here on out, I’ll be mixing in commentary that is far less MAHA-specific and instead applicable to a broader audience of potential licensees looking at acquiring real world data.

Let’s touch on a few of the challenges that could be in play.

Challenge #1: Knowing if RWD is the right fit for an organization’s goals.

Neat tip #1: Understand the limitations of using RWD for research:

If RWD is being employed to gain insight into various aspects of chronic disease, including prevalence, patient characteristics, care utilization patterns, treatments sought, common comorbidities, and care pathways—all effectively observational research—RWD can be tremendously valuable, depending on the source (more on that below).

However, RWD is inevitably going to be far more limited when it comes to providing insight into many of the factors named as chief contributors to the development of chronic disease that MAHA is considerably interested in. So, for example, if the agency is interested in using RWD to understand the impact of environmental exposure to toxins or level of consumption of certain food, RWD isn’t capable of providing such detail, largely because the majority of the RWD available today is derived from interactions with healthcare providers. Any data regarding potential environmental exposure, food consumption, or matters of daily life, would need to be sourced from avenues outside RWD—and then linked to it at the individual level—if that data is able to be sourced at all.

It almost goes without saying (although I’ll say it anyway, for GPT optimization) that RWD is not a replacement for robust experimental design, such as randomized controlled trials, and by extension, reviews of a wide body of high-quality experimental research that would help to more specifically understand what may be driving onset of certain types of disease. Researchers already know this; however, non-researchers may not.

Neat tip #2: Know what questions to ask vendors about their data assets so you can distinguish the marketing spiel from the operational “need-to-know:”

Another key consideration is developing a deep understanding of a specific vendor’s data source. Being able to ask incisive questions can make a big difference in what insight you get about a given dataset's composition. For example: A quick skim of descriptive detail from a variety of vendors (here, here, here, here, and here) shows subtle-to-dramatic differences in data substrate given vendors commonly draw from overlapping, but not identical, primary sources. Having said that, vendors love to trumpet the uniqueness and superiority of their assets vs. competitors', such as how many “total lives” are available in their data assets, but:

How many of those lives have a robust set of clinical and financial detail?
How longitudinal is the data and for how much of the population included in the dataset?
- For example, for how many lives do you have at least five years of continuous clinical and financial data?
- For what sites of care?
- What kinds of visibility gaps in patient care knowingly exist, and how may that impact our ability to confidently draw research conclusions?
Does the data include eligibility (insurance enrollment) controls? An important detail BTW, given frequent population churn.
What is the data lag (a.k.a. the time it takes for real world activity to appear in the dataset that our organization would be working with)?
What information does the vendor have access to versus what the licensee receives?
Is the asset growing or shrinking, and why? (Including both richness of detail and clinical lives)
What do the vendors’ relationships with the organizations supplying data look like?
- Could that change? If so, how and on what timeline? How would that impact the data asset as currently constructed?
- How many of the sources could be found in other vendors’ datasets?
What types of services are not captured in this asset and how much does that matter?
- For example, vaccines administered in community pharmacies (think COVID) likely won’t be resident in a health system’s EHR (or claims) if those pharmacies are part of a different network or patients are paying out of pocket. Fleshing out as many of the data characteristics as possible is a critical step to truly understanding whether a given asset will be right for a particular use case. Treat black box answers with a high degree of skepticism.
What enrichments (additional detail derived by the vendor designed to enhance data usability) does a given vendor apply to the data before it gets to me?
What specific QA and QC are (and aren’t) performed by the vendor prior to data release?
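As an illustration of the longitudinality and eligibility questions above, the following sketch counts how many lives have at least five years of continuous enrollment. The member IDs, enrollment spans, and the `longest_continuous_days` helper are all hypothetical; a real feasibility check would run against the vendor's actual eligibility tables and agreed gap rules.

```python
from datetime import date

def longest_continuous_days(spans, max_gap_days=0):
    """Longest run of coverage in days, merging spans separated by <= max_gap_days."""
    best = 0
    run_start = run_end = None
    for start, end in sorted(spans):
        # Back-to-back spans (end Dec 31, start Jan 1) have a gap of zero days.
        if run_end is not None and (start - run_end).days - 1 <= max_gap_days:
            run_end = max(run_end, end)
        else:
            run_start, run_end = start, end
        best = max(best, (run_end - run_start).days + 1)
    return best

# Made-up enrollment (eligibility) spans per member.
enrollment = {
    "A": [(date(2018, 1, 1), date(2020, 12, 31)), (date(2021, 1, 1), date(2023, 12, 31))],
    "B": [(date(2018, 1, 1), date(2019, 6, 30)), (date(2020, 1, 1), date(2023, 12, 31))],
    "C": [(date(2016, 1, 1), date(2023, 12, 31))],
}

FIVE_YEARS = 5 * 365  # approximate threshold, ignoring leap days
qualifying = sorted(
    member for member, spans in enrollment.items()
    if longest_continuous_days(spans) >= FIVE_YEARS
)
print(f"{len(qualifying)} of {len(enrollment)} lives meet the threshold: {qualifying}")
```

In this toy data, member B has six years of records in total but a six-month coverage gap, so a "total lives" headline would count B while a five-year continuity requirement would not.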

These are a sliver of the types of questions potential end users of the data should be asking well in advance of any data licensing and finger-to-keys analyses, as they compare sources, develop perspectives on the strengths/weaknesses of each and generally improve their own knowledge about the RWD landscape.

Challenge #2: Not doing sufficient front-end work ahead of procurement to ensure a given RWD asset adequately meets specific research needs.

Neat tip #1: Be thorough with due diligence up front to understand if a given RWD asset is likely to support your research objectives:

Value of RWD aside, whether any given data asset can support an organization’s research goals is another matter. Understanding the data asset—as detailed above—is only one component of the equation. Equally important is knowing what your organization wants to do with it, as specifically as possible, prior to licensing.

A common pitfall that many life science companies have fallen prey to is turning an RFI/RFP from an opportunity that could yield valuable insight into a given RWD asset into a less useful, surface-level accounting exercise. In short, make sure the questions you’re asking a vendor reflect your specific research objectives.

A common situation we’ve heard from data vendors is just how light some of the pre-licensing due diligence can get: “…(the potential licensee) sent me an excel sheet and asked me for fill rates of columns, along with some broad descriptive characteristics….” Shallow investigation such as this will miss the forest for the trees and likely lead to a self-inflicted case of buyer’s remorse. Ideally, licensees should attempt to gauge how well a data asset’s content can support actual research through the development of a series of specific, research-based questions. At a very high level, that should involve taking time to develop a comprehensive set of research questions (including specific metrics that will most meaningfully define success); using those metrics to refine the research questions; and then assessing whether the data asset can support key metrics of interest through a defined set of systematic feasibilities.

Neat tip #2: Have a consistent point person overseeing data procurement and make sure they’re tightly aligned with researchers (with insight into research objectives):

Any organization will be in a better position if it has a point person driving the licensing process. Ideally, people overseeing data licensing are not merely procurement jockeys who coordinate meetings and make sure that the right people are in the room. Top performing procurement processes will involve productive coordination with epidemiologists and other researchers who will be working with the data and can evaluate the vendor proposal through the lens of several key questions, as opposed to merely tying perceptions of value to price. Another common mistake that organizations make when licensing RWD is having data scientists—analytically adept but lacking sufficient subject matter expertise—leading data evaluation. This can result in wasted time and, worst case scenario, incorrect conclusions.

Challenge #3: Licensing and working with multiple external data vendors to build a single integrated dataset can get complex quickly.

Working with multiple data vendors takes some of the aforementioned complexity we’ve discussed and adds several new layers to a potential challenge. Given it seems the agency has ambitions for an incredibly robust asset, it (or any purchasing organization for that matter) should be prepared to work around some business realities of real-world data.

Neat tip #1: Don’t skimp on budget if you want best-in-class data assets:

As most life science companies know, RWD is expensive, often to the tune of tens of millions of dollars per year. Government agencies aren’t always known for allocating massive budgets to pay for RWD at the same level as pharma or device companies will. A notable exception: The FDA Biologics Effectiveness and Safety (BEST) System, which allocated major funding to a variety of stakeholders for data.

Low public sector budgets can be a problem for RWD vendors, many of them for-profit entities, who don’t like to give their often-costly-to-compile assets away for deep discounts. Sometimes, RWD vendors make exceptions through lower “academic” rates. However, if it turns out that such licensing arrangements result in portions of a vendor’s commercial clientele having access through public sector channels (especially to the extent there is significant crossover between public and private industry working on a given project) some RWD vendors may get skittish about licensing due to the perceived risk of cannibalizing future data deals. Thus, the contractual agreements come into play, specifying restrictions that may or may not be palatable to the licensees. It’s not unusual for a RWD company to walk away from government initiatives because the financial parameters of the deal didn’t make sense. And that’s just considering the “typical” RWD of clinical and claims data. If one is looking to get into more difficult to obtain, technically complex data, like genome sequencing, be ready to "back up the Brink’s truck" just a little bit further.

Neat tip #2: Have a clear understanding of whether you need to work with data in your local technical environment and specifically who will be working with it:

Does the agency desire possession of the data “on prem” so researchers can access, move, and manipulate it as they desire? Or does the RWD vendor prefer that any asset remains in their environment for security reasons? (Change Healthcare…cough).

There are pros and cons to both approaches. For example, keeping data in a vendor’s environment dramatically reduces the licensees’ liability in the event of a major breach. However, speed and control of analyses may be hampered if the vendor’s environment isn’t workable with the licensee’s research preferences (software, statistical packages etc.). These questions become more complex if multiple vendors are engaged, as would be the case with a blended asset, as different vendors may have different policies. These preferences—and upshot about where the data “lives”—should be discussed as part of the early-stage licensing conversation.

Additionally, know who on your team is likely to work with the data. While most companies have licensing options for unlimited users, that may not extend to contractors or external parties outside the organization. Also have a solid understanding about what geographic regions your organization will be accessing the data in; some data assets have strict geographic restrictions due to IP code sets. And recognize that at least some vendors will not treat data access in a forbidden territory, via a US-based VPN, as an acceptable workaround; it may still result in a breach of contract.

Neat tip #3: Know how vendors may react to certain requirements, like data release under the Freedom of Information Act (FOIA):

Here’s some wild speculation (…ready?): Any report MAHA produces employing RWD will be of at least mild interest to the public. Such interest probably increases the likelihood of such reports being “FOIA-ed,” meaning that members of the public can request access to that information, including the data used to produce an analysis. Given the likely high-profile nature of MAHA’s work, such a possibility is heightened. While FOIA rules typically prevent the release of individually identifiable information, it remains to be seen whether any vendors get skittish regarding the potential for intellectual property infringements (pertaining to the structure of their data assets or any differentiated contents within). Thus, if a vendor is reluctant to license due to strong concerns about the potential for IP violations in the case of a FOIA request, and the agency really needs that data, the agency may need to provide additional assurances of what protection it can provide (or revisit Neat tip #1 in this section).

Neat tip #4: Know how you are going to work with the data, including any analytical methods, technology and outputs to streamline contracting and aim to get clarity ahead of licensing:

As we read earlier, NIH is planning on using AI/ML methods on any RWD asset to conduct research. This is not surprising; many life science companies use RWD this way. Typically, most RWD vendors will allow their data to be used with AI/ML for observational research. However, where use cases may get a little murkier is when the AI/ML is used to produce something that borders on a device, tool, or development of new in-kind intellectual property (e.g., tools for heart failure prediction or type 2 diabetes management). In these cases, at least some RWD vendors may not be so laissez-faire given such products often go on to become businesses, powered by the licensed data. Expect that vendors may seek compensation, including additional agreements that may require equity shares or additional fees over and above a base fee, especially if such outputs are intended to be commercialized. Nailing these details down ahead of time is an important step to keep contracting from becoming an acrimonious process or, worst case, to keep non-allowed uses of data from causing a breach of contract.

Neat tip #5: Wade into the waters of blended assets carefully, because there can be some undesirable downsides for end users:

You'll sometimes hear the term "common data model" used. It's a term to describe blending different vendors' assets together (think clinical data from different health system sources that may be in different formats) into a singular, standard format that gives researchers much more data to work with. Blended data assets (as one example, data from different healthcare systems' EHRs acquired and “stacked” together into an uber asset) can present a challenge to construct. Often each vendor’s data is formatted differently and may even contain vastly different fields, depending on “enrichments” that each data vendor introduces. This can make combining different assets exceedingly complex, requiring substantial work to determine which fields match, including having to account for subtly different formats in structured data. However, the challenges don't stop there. While most vendors have expert support teams that work with licensees, many companies may decline to support blended assets, because no one entity specifically "owns" the asset, which is now fundamentally different from what they licensed. And they may not be allowed to see it given IP considerations across various properties. That may leave the licensee in a pinch if they require support. Existing documentation may not fit in the same way, and the licensee may need to do the heavy lifting of creating their own custom documentation. Such a situation can also make the question of “who to turn to?” difficult when researchers require data support or interpretation guidance, slowing the research process down significantly.
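A small sketch may help show why field matching is laborious even in a trivial case. Everything here is hypothetical: the two vendor layouts, the target field names, and the `to_common` helper are invented for illustration and do not reflect any real vendor schema or published common data model.

```python
from datetime import datetime

# Hypothetical per-vendor layouts mapped onto one shared set of field names.
VENDOR_MAPPINGS = {
    "vendor_a": {
        "fields": {"pt_id": "patient_id", "dx": "diagnosis_code", "svc_dt": "service_date"},
        "date_format": "%m/%d/%Y",
    },
    "vendor_b": {
        "fields": {"PatientKey": "patient_id", "ICD10": "diagnosis_code", "EncounterDate": "service_date"},
        "date_format": "%Y%m%d",
    },
}

def to_common(record, vendor):
    """Rename fields and normalize formats; unmapped fields (e.g. enrichments) are dropped."""
    spec = VENDOR_MAPPINGS[vendor]
    out = {"source": vendor}
    for src, dst in spec["fields"].items():
        out[dst] = record.get(src)
    # Harmonize subtly different formats in otherwise equivalent structured data.
    parsed = datetime.strptime(out["service_date"], spec["date_format"])
    out["service_date"] = parsed.date().isoformat()
    out["diagnosis_code"] = out["diagnosis_code"].replace(".", "").upper()
    return out

a = to_common({"pt_id": "t1", "dx": "e11.9", "svc_dt": "03/05/2024", "risk_score": 0.7}, "vendor_a")
b = to_common({"PatientKey": "t1", "ICD10": "E119", "EncounterDate": "20240305"}, "vendor_b")
print(a)
print(b)
```

Note that the vendor-specific `risk_score` enrichment has no home in the shared layout and is silently dropped, which is exactly the kind of trade-off (and documentation burden) that lands on the licensee once assets are blended.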

Neat tip #6: Be wary of the emerging privacy concerns when attempting to integrate too many data sources. Potential for identifiability rises quickly as the dimensions of the data expand:

Data linking is increasingly desirable to researchers for enhanced insight. For example, if a researcher wanted to take clinical data and layer socio-economic status onto it, this could provide a powerful source of insight into drivers of clinical outcomes and perhaps hint at approaches towards solutioning. A notable challenge with linking data assets is that the more data fields are combined, the easier it becomes to identify an individual, particularly with the enhanced sophistication of AI and ML techniques. As a result, data with different variables that present a risk for identification (think race, socioeconomic detail or geography detail) may be partitioned into separate assets that cannot be recombined. To ensure the most valuable fields are the ones retained, data procurement leads should work collaboratively with researchers to determine trade-offs that must be made on specific variables. Have a good sense of what trade-offs your organization would be willing to accept when these situations arise. And be aware that, as AI/ML techniques increase in sophistication, what is certified as low risk for reidentification today may not be in the future, meaning that certain valuable fields within a data asset may be stripped away over time.
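The way identifiability grows with each added field can be seen in a toy example. The eight records, the field names, and the `count_unique` helper below are fabricated purely for illustration; real reidentification risk assessment is a formal, expert-led process and far more involved than counting unique combinations.

```python
from collections import Counter

FIELDS = ["age_band", "zip3", "race", "income_band"]

# Fabricated records; no real individuals or data sources are represented.
rows = [
    ("40-49", "021", "A", "low"),
    ("40-49", "021", "A", "high"),
    ("40-49", "021", "B", "low"),
    ("40-49", "100", "A", "low"),
    ("50-59", "021", "A", "low"),
    ("50-59", "021", "B", "high"),
    ("50-59", "100", "B", "low"),
    ("50-59", "100", "B", "low"),
]
records = [dict(zip(FIELDS, row)) for row in rows]

def count_unique(records, fields):
    """Number of records whose combination of the given fields appears exactly once."""
    combos = Counter(tuple(r[f] for f in fields) for r in records)
    return sum(1 for n in combos.values() if n == 1)

# Add one field at a time and watch how many records become one of a kind.
unique_by_width = [count_unique(records, FIELDS[:k]) for k in range(1, len(FIELDS) + 1)]
for k, n in enumerate(unique_by_width, start=1):
    print(f"{k} field(s) {FIELDS[:k]}: {n} of {len(records)} records are unique")
```

With age band alone nobody stands out, but by the time all four fields are combined most of the records are one of a kind, which is the intuition behind vendors partitioning sensitive variables into separate assets.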

Challenge #4: Understand whether the organization is positioned to derive immediate value from the data.

Neat tip #1: Have a team of experts with RWD experience dedicated to making it work for the organization:

For those contemplating dipping a toe into RWD, here’s something else to think about: Almost every RWD vendor we spoke with had at least one story where data was licensed (paid for) and then sat unused on a server for an extended period of time as the organization figured out what to do. To avoid wasting time and money, start planning early. Use a team of experts, make plans for how data will be accessed, agreements on how it will be used, servers/cloud and software, etc. Once you have the relevant infrastructure and work plan in place, then move to licensing. Try to avoid entering licensing discussions without having the back-end logistics figured out.

Further, ensure you have sufficient on-staff expertise to navigate the data and coach staff with less preexisting expertise. For example, claims data is relatively “old hat” at this stage, and many people can navigate it. Clinical data, especially unstructured free text notes, requires a different set of skills, given the relatively free-form nature of this information, meaning specific expertise such as Natural Language Processing (NLP) or other text-based analytic skills may be valuable. Genomic data remains quite specialized, and the list goes on. The bottom line: Have someone with enough expertise to help your team hit the ground running rather than needing to delay and improvise.

Closing thoughts…

Yikes, that was a lot of words! And yet, we've really only scratched the surface of some of the challenges an organization may face as it embarks on data licensing, which is the earliest stage of a pathway to evidence generation. Best practices around data acquisition are critical to ensure high-quality research emerges as an end result. However, if you speak to any RWD/RWE experts in life science, you'll also hear a consistent theme that having data alone isn't nearly enough: there is ample opportunity to evolve today's evidence generation capabilities in ways that reflect the rigor of clinical trial capabilities. For RWD that would involve such hallmarks as integrated data sources, high quality, "fit-for-purpose" studies, data transparency and replicability and building an infrastructure (including multidisciplinary teams, technology, platforms, performance metrics and more) that enables rigorous evidence to emerge more quickly. All of these elements are essential to optimizing how RWD is leveraged and may be fertile ground to explore in a future post.

…and a few open questions to watch for across various players in RWD.

For the NIH: The agency has a track record of being thoughtful around data strategy, as evident in its detailed strategic planning. How is the agency thinking long term about the support mechanisms and infrastructure required to drive success with such a large data asset? Could efforts like the COVID N3C initiative be expanded to pull broader ranges of clinical data directly from providers, effectively bypassing vendors altogether, while also enabling a more democratized and standardized research asset that could be utilized to study a wide range of disease states? And how will transparency and consistency be facilitated in the near term to ensure public trust?

For RWD vendors: Vendors will need to think through some interesting decisions if MAHA/NIH has a major data budget to compile its asset. How may the attractiveness of a potentially large government data budget clash with some of the less desirable licensing challenges, such as potential IP issues or loss of control over assets, that may drive vendors to walk away from dealmaking?

For A.I. healthcare companies: As A.I.-driven healthcare companies appear to exponentially proliferate in the marketplace, how will these emerging players balance the reality that many of their offerings are limited without robust healthcare data underpinning model training, while also recognizing that some RWD vendors will look to impose stricter terms on them (potentially including equity share or higher licensing fees) than on researchers engaged in non-commercial endeavors?

For providers: As many health systems contribute data to third parties who aggregate and commercialize their data amongst other sources, how will providers continue to view the value of their data assets in the landscape of RWD, and are shifts on the horizon in their willingness to allow third parties to monetize their data? Or will they look to collaboratives outside these commercial silos?

For patients: Do very public efforts like MAHA’s move into RWD increase patient awareness about the use, and commercialization, of their personal health data? And will that spur a pullback in individuals’ willingness to allow their data to be used in RWD licensing?