In the era of big data, there aren’t many public entities with bigger data than the Veterans Health Administration (VHA). The nation’s largest integrated health care network, serving about 9 million veterans, was an early adopter of electronic medical records, beginning in the 1970s. This digital platform has evolved over the years and now constitutes an electronic health record that contains inpatient and outpatient diagnoses and procedures, lab results, prescriptions, and other veteran medical data. In total, VHA has compiled more than 78 billion records from all of its VA medical centers.
At the same time, the VHA’s groundbreaking efforts to understand how genes affect health and illness continue to grow. The most prominent of these efforts, the Million Veteran Program (MVP), was launched in 2011 to learn how genes, lifestyle, and military exposures interact to affect health and illness. To date, about 850,000 veteran volunteers have donated genetic samples – vials of blood – to aid in studies that look for associations between genes and medical record data, as well as self-reported survey data on lifestyle and military exposures.
The investigations enabled by the collection of genetic material are generally of two types. The first, genotypic or “candidate gene” studies, focus on associations between pre-specified genetic variations and disease states. VHA investigators were pioneers in these kinds of studies; beginning in the mid-1990s, investigators at the Puget Sound Health Care System made a series of discoveries linking genetic mutations to Werner’s syndrome (a hereditary disease that causes premature aging and death), schizophrenia, and dementia.
When the international Human Genome Project showed how to map the entire sequence of chemical base pairs that make up human DNA, it opened the door to a new kind of investigation: the genome-wide association study, or GWAS, in which entire genomes from a large cohort of people are scanned for common genetic variations. With the help of powerful supercomputers, investigators can discover which variants appear more often in one group – such as people with breast cancer – than in another.
A useful GWAS requires several elements: a large sample of genetic material, the infrastructure necessary to store and examine it, and powerful computational tools for comparing hundreds of thousands, maybe even millions, of variants. Samples from MVP participants are collected and stored in a massive freezer at the Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC) within the Boston VA facility. The 850,000 DNA samples, taken from veterans enrolled in MVP, are sorted, stored, and retrieved by a sophisticated robotic system.
Great pains have been taken to guarantee the security of this data and the anonymity of donors. Names are removed from samples immediately; each participant becomes an anonymous blood vial, and data collected from other potentially identifying information, such as surveys, are assigned codes.
According to Sumitra Muralidhar, PhD, who directs the Million Veteran Program, the point of building such a massive trove of genetic samples is to enable comparison studies on a scale that will reveal hidden secrets. “We’ll have sufficient numbers of people with a disease and without a disease,” she said, “and we’ll compare their genetics, so we can then identify what genetic risk factors may be associated with a certain disease.”
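The comparison Muralidhar describes can be sketched, in very simplified form, as a per-variant test of whether an allele is more common among people with a disease than among people without it. The sketch below is an illustration only, not MVP's actual pipeline: the variant names and counts are hypothetical, and real studies typically use regression models with covariates and a stringent correction for testing many variants at once.

```python
def allele_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 allele-count table.

    Assumes every row and column total is non-zero.
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    total = case_alt + case_ref + control_alt + control_ref
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + control_alt, case_ref + control_ref]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if allele and disease were unrelated
            expected = row_sums[i] * col_sums[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat


# Hypothetical allele counts: (case_alt, case_ref, control_alt, control_ref)
variants = {
    "variant_A": (300, 700, 200, 800),  # allele is more common in cases
    "variant_B": (250, 750, 250, 750),  # identical frequency in both groups
}

scores = {name: allele_chi_square(*counts) for name, counts in variants.items()}
for name, score in scores.items():
    print(f"{name}: chi-square = {score:.2f}")
```

A larger statistic means the allele frequencies differ more between the two groups than chance alone would suggest. As Muralidhar notes later, such an association is a starting point and does not by itself establish cause and effect.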
Most MVP studies to date have compared genotypes – differences in the DNA sequences of the samples stored at MAVERIC – with phenotypes, which Muralidhar defines as simply any observable characteristic or trait. “A phenotype could be as simple as the color of your hair or the color of your eye,” she said. “It could be a disease like diabetes. PTSD is a phenotype.” Post-traumatic stress disorder is a complex phenotype with many variations, Muralidhar said, as is a multi-variable condition like cardiovascular disease. “Lipid level, blood pressure – all of these are what we call phenotypes: biochemical, biomedical characteristics that result from the genes we have.”
In less than a decade of existence, the power of the VHA’s considerable data stockpile has been made evident in landmark studies comparing genetic information to the phenotypes described in electronic health records and other data sources.
In January 2020, for example, VA researchers reported new evidence of the underlying biological causes of anxiety disorder, which affects about 10 percent of all Americans. Derived from the genetic and health data from 200,000 MVP volunteers, the study was the largest GWAS of anxiety traits to date.
In 2019 alone, said Muralidhar, MVP investigators “had 19 high-impact papers released.” In December 2019, when the American Heart Association published its annual list of the 10 greatest advances in heart- and stroke-related research, it included two MVP studies that offered insights into the genetic basis of both venous thromboembolism (VTE) and peripheral artery disease (PAD).
Identifying risk factors, Muralidhar said, is only a starting point in putting this data to work for veterans. “Being associated with something doesn’t mean there’s a cause and effect,” she said. “You have to then do more functional studies to actually show how that change in your DNA is making you more susceptible to an illness or more responsive to a medication.” The ultimate goal of the MVP is to practice what’s known as “precision medicine,” an approach for disease treatment and prevention that takes into account individual variability in genes, environment, and lifestyle for each person. In the future, VA clinicians will take what’s learned from these large-scale genomic studies and either prevent diseases entirely or optimize treatments. But it remains to be seen how soon that future will arrive.
BUILDING ON THE MVP’s POTENTIAL
Putting data to work for veterans is one of the strategic priorities identified by the Department of Veterans Affairs’ (VA) Office of Research and Development (ORD) in the department’s 2021 budget proposal – an acknowledgement that the vast, rich datasets at the VA’s disposal are resources that remain largely untapped. Because these datasets hold information that will benefit veterans and all Americans, VA is taking steps to accelerate the pace at which this data can be discovered, extracted, and used for research that will unlock advances in precision medicine.
There’s a good reason, said Muralidhar, that most studies of MVP genetic data are genotypic studies that scan DNA for changes called single nucleotide polymorphisms, or SNPs. “Genotyping is the first level,” she said. “We’re only looking at certain points along the DNA.” Depending on how a test is designed, investigators can look for any number of markers – but it’s still scanning for variations, rather than mapping the entire sequence of the genome.
Having the complete picture of variants across the genomes of many individuals is likely to yield better results than scans of candidate segments. But moving to this next level is far more expensive, by a factor of roughly 22 to 27: Genotyping costs about $45 per sample, plus some additional costs for analysis, compared to about $1,000 to $1,200 for a research-grade whole-genome sequence.
At that rate, the VA obviously will not be able to sequence a million genomes – but it has invested in a significant increase in investigators’ ability to conduct genome-wide association studies. “We have done about 53,000 [whole-genome sequences], so we are halfway there,” Muralidhar said. “We’ve already invested the funding and committed to doing 100,000 whole genomes.”
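The arithmetic behind those figures is easy to check. The short sketch below uses only the per-sample prices quoted in the article; the function names are illustrative, and the genotyping figure leaves out the unspecified "additional costs for analysis."

```python
GENOTYPE_COST = 45               # dollars per sample, as quoted
WGS_COST_RANGE = (1_000, 1_200)  # dollars per whole-genome sequence, as quoted


def cost_ratio(wgs_cost, genotype_cost=GENOTYPE_COST):
    """How many times more a whole-genome sequence costs than a genotype."""
    return wgs_cost / genotype_cost


def total_cost(n_samples, unit_cost):
    """Total dollars to process n_samples at a flat per-sample price."""
    return n_samples * unit_cost


ratios = [round(cost_ratio(c), 1) for c in WGS_COST_RANGE]
committed = [total_cost(100_000, c) for c in WGS_COST_RANGE]      # the 100,000 committed genomes
full_cohort = [total_cost(1_000_000, c) for c in WGS_COST_RANGE]  # a full million genomes
genotype_full = total_cost(1_000_000, GENOTYPE_COST)              # genotyping a full million

print(f"Sequencing costs {ratios[0]}x to {ratios[1]}x as much as genotyping")
print(f"100,000 genomes: ${committed[0]:,} to ${committed[1]:,}")
print(f"1,000,000 genomes: ${full_cohort[0]:,} to ${full_cohort[1]:,}")
print(f"1,000,000 genotypes: ${genotype_full:,}")
```

At the quoted prices, sequencing a full million genomes would run to $1 billion or more, against $45 million for genotyping the same number of samples, which is why the commitment stops at 100,000.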
A fuller picture of genetic data is one piece of the puzzle. VA investigators also need a more complete picture of the phenotypes against which these genes are being compared. “We’re looking now at how we can translate some of these findings, move them along the pipeline to actually become useful in the clinic … and that will involve functional studies and validation of the research findings,” said Muralidhar.
The most basic indicator of a phenotype in the electronic health record is a short alphanumeric code known as the ICD (International Classification of Diseases) code, which is a fairly blunt instrument; as Muralidhar pointed out, many conditions, such as PTSD, are so multi-faceted that a single code is insufficient for any but the most basic analysis. A primary care physician’s entry of the PTSD code into a record, for example, doesn’t necessarily mean the patient has PTSD; it’s a diagnosis that must be confirmed in follow-up visits to mental health clinicians.
The VA is fortunate in that it has one of the most complete and accessible electronic health records in the world. “They are deep and go back almost 30 years,” said Muralidhar. “We know what medications people are on. We know when they were diagnosed with certain illnesses, from their labs what kind of tests were done, what the results of those were – all of that information is there.”
To unlock the potential contained within the VA’s datasets, the ORD has launched an ambitious effort to establish a centralized VA Phenotype Library that processes and curates phenotypic data and associated “metadata,” such as statistics or references that describe or otherwise help to understand that data. As it happens, a team of VA data experts has already curated the PTSD phenotype for studies of the condition: They developed an algorithm that combed through all available data and metadata, using natural language processing that could read, decipher, and understand clinicians’ written notes, and used it to identify patients with the PTSD phenotype – essentially, digging the phenotype out of the medical record, whether or not there had been an official diagnosis. Follow-up calls to about 200 clinicians validated the instrument. “So now with great confidence – 95 to 98 percent confidence,” Muralidhar said, “we can say that if you use this algorithm, you’ll pick out all the people who have that illness.”
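The general idea of a phenotype algorithm, combining several weak signals from a record and then checking the result against clinician review, can be sketched as follows. This is not the VA's algorithm. The record fields, the placeholder code "PTSD_CODE", the two-signal threshold, and the sample records are all assumptions made for illustration, and the keyword pattern is a crude stand-in for real natural language processing (it would, for instance, wrongly match "denies PTSD").

```python
import re

# Crude stand-in for NLP over clinical notes (assumption, not the VA's method)
NOTE_PATTERN = re.compile(r"\b(ptsd|post-?traumatic stress)\b", re.IGNORECASE)


def has_phenotype(record, code="PTSD_CODE", min_signals=2):
    """Flag a record when at least min_signals independent signals agree."""
    signals = 0
    # Signal 1: the diagnosis code appears on two or more visits
    if record.get("codes", []).count(code) >= 2:
        signals += 1
    # Signal 2: a clinician's note mentions the condition
    if any(NOTE_PATTERN.search(note) for note in record.get("notes", [])):
        signals += 1
    # Signal 3: the patient has been seen in a mental health clinic
    if record.get("mental_health_visits", 0) > 0:
        signals += 1
    return signals >= min_signals


def positive_predictive_value(flagged, confirmed):
    """Share of flagged patients whom clinician review confirmed."""
    flagged = set(flagged)
    if not flagged:
        return None
    return len(flagged & set(confirmed)) / len(flagged)


# Hypothetical records
records = {
    "p1": {"codes": ["PTSD_CODE", "PTSD_CODE"],
           "notes": ["Patient reports nightmares."],
           "mental_health_visits": 3},
    "p2": {"codes": ["PTSD_CODE"],
           "notes": ["Routine visit."],
           "mental_health_visits": 0},
    "p3": {"codes": [],
           "notes": ["Symptoms consistent with post-traumatic stress."],
           "mental_health_visits": 1},
}

flagged = [pid for pid, rec in records.items() if has_phenotype(rec)]
ppv = positive_predictive_value(flagged, confirmed={"p1"})
print(flagged, ppv)
```

In this toy data, patient p3 is flagged without any diagnosis code, which mirrors the point about digging a phenotype out of the record. Patient p2, who has a single code entry and nothing else, is not flagged, matching the earlier caution that one code entry does not confirm a diagnosis.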
Curating phenotypes using artificial intelligence and algorithms, unsurprisingly, is a meticulous and lengthy process, but Muralidhar said the VA’s data experts have already curated more than 200 phenotypes to the Phenotype Library – which will make things considerably easier for new studies. “You don’t have to go and reinvent the wheel,” Muralidhar said. “You can look at our library and say: ‘Here it is, and this is how they did it, so I’m going to use that same algorithm to find the patients I need for a clinical trial,’ or for any purpose you may want to identify patients with a certain illness. … In the future, the VA Phenotype Library will be available even to external researchers – but we are going to start with VA researchers.”
UNLOCKING THE POTENTIAL OF THE VA’s COMPUTING INFRASTRUCTURE
One of the challenges associated with maintaining such vast datasets, obviously, is making them more widely available to researchers while protecting data security. The anonymity of MVP volunteers is protected by the data platform, GenISIS, and investigators search through that data using the VA’s specialized health services research platform, essentially a set of servers known collectively as VINCI (VA Informatics and Computing Infrastructure). From the outset, said Muralidhar, MVP leaders decided that genetic data would not be distributed to researchers. “We put the data in a central secure repository,” she said, “and we bring researchers to the data.”
The man in charge of VINCI is Scott DuVall, PhD, who describes it as “a central, secure platform that allows veteran data, mostly medical record data, to be accessed in a controlled way where we are limiting people to just what is approved and just what’s necessary to perform a certain study. And that data is combined with tools that allow people to analyze and explore the data looking for scientific discoveries.” VINCI also includes analytical tools that help researchers evaluate this data in a computing environment.
DuVall is leading the effort to put data to work for veterans and, ultimately, for all Americans. The first step, he said, is to prepare the data to go to work. “The VA has some of the most complex and comprehensive medical record data in the entire world, and it’s spread across a national health care system with close to 400,000 employees,” he said. “Of those, there are tens of thousands of physicians, and 100,000-plus nurses and other personnel. These are all people who are authoring new data on our veterans. And the veterans themselves submit information, or the devices they wear, like cardiac devices and others that collect and author data. We’ve got more than 20 years’ worth of longitudinal medical record data.” Getting this data ready for work, DuVall said, is akin to brushing its teeth and sending it to the shower and driving it to the office, where it’s ready for action. “What that means,” he said, “is we’re standardizing data across time, and standardizing across different parts of the country.”
Though it seems like common sense for “diabetes” to mean the same thing in Salt Lake City, where DuVall works, as it does in Atlanta – or to mean the same thing today that it did 10 years ago – codes and conventions often vary from place to place (for example, between clinical and administrative settings) and from time to time. Preparing VA data for work means transforming and mapping it onto a community-supported common data model – the Observational Medical Outcomes Partnership (OMOP) model, developed by the Food and Drug Administration. “Doing it like this allows us to use the same code sets that they use at Harvard, and the same ones they use at Johns Hopkins and the Mayo Clinic,” DuVall said, “so that we can run the same type of studies that they do and we can share our studies with them.” A common data model also allows for data to be retrieved from sources such as the Centers for Disease Control and Prevention, the National Institutes of Health, the Department of Defense (DOD), and, for veterans who are receiving care outside the VA, from the Centers for Medicare and Medicaid Services.
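In miniature, standardizing data across sites amounts to translating each site's local code into one shared concept. The sketch below assumes a simple lookup table. The site names come from the article, but the local codes, the standard concept labels, and the row layout are invented for illustration and are not taken from any real vocabulary or from the common data model itself.

```python
# Hypothetical mapping: (site, local code) -> shared concept label
LOCAL_TO_STANDARD = {
    ("salt_lake_city", "DM2"): "STD_DIABETES_T2",
    ("atlanta", "T2-DIAB"): "STD_DIABETES_T2",
    ("atlanta", "HTN"): "STD_HYPERTENSION",
}


def standardize(rows, mapping=LOCAL_TO_STANDARD):
    """Translate site-specific codes; set aside rows that have no mapping."""
    mapped, unmapped = [], []
    for row in rows:
        key = (row["site"], row["local_code"])
        if key in mapping:
            mapped.append({"patient_id": row["patient_id"],
                           "concept": mapping[key]})
        else:
            # Keep unmapped rows for human review instead of dropping them
            unmapped.append(row)
    return mapped, unmapped


rows = [
    {"patient_id": 1, "site": "salt_lake_city", "local_code": "DM2"},
    {"patient_id": 2, "site": "atlanta", "local_code": "T2-DIAB"},
    {"patient_id": 3, "site": "atlanta", "local_code": "XYZ"},
]

mapped, unmapped = standardize(rows)
diabetes_ids = [m["patient_id"] for m in mapped
                if m["concept"] == "STD_DIABETES_T2"]
print(diabetes_ids, len(unmapped))
```

Once both sites' records carry the same concept label, a single query finds the diabetes patients in Salt Lake City and Atlanta alike, and the same query could in principle run at any other institution that uses the shared model.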
According to Grant Huang, PhD, director of VHA’s Cooperative Studies Program, a common data model will help VA investigators more precisely design multicenter clinical trials from its five Coordinating Centers. If the CSP is developing a study of a diabetes treatment, for example, it may help to target veteran patients who share certain genetic or lifestyle characteristics. “So, what we can do from an informatics standpoint,” Huang said, “is work with Scott’s team and say: ‘We need to know not only which patients have diabetes, but which patients have diabetes and some other characteristics.’ And he has the data that can tell us where they are. So that helps us to be a little more precise and target, say, the 20 sites where the data show us that’s where the veterans are.”
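The site-targeting step Huang describes can be sketched as a count of eligible patients per site followed by a ranking. The data layout, trait names, and site labels below are hypothetical, and a real query would run inside the VA's secure environment against standardized records.

```python
from collections import Counter


def top_sites(patients, required, n_sites):
    """Rank sites by how many patients have every trait in `required`."""
    counts = Counter(
        p["site"] for p in patients
        if required <= p["traits"]  # subset test: patient has all required traits
    )
    return counts.most_common(n_sites)


# Hypothetical patient records
patients = [
    {"site": "A", "traits": {"diabetes", "smoker"}},
    {"site": "A", "traits": {"diabetes", "smoker", "over_65"}},
    {"site": "B", "traits": {"diabetes"}},
    {"site": "B", "traits": {"diabetes", "smoker"}},
    {"site": "C", "traits": {"smoker"}},
]

ranking = top_sites(patients, required={"diabetes", "smoker"}, n_sites=2)
print(ranking)
```

With real data, `n_sites=20` would correspond to Huang's example of targeting the 20 sites where the data show the eligible veterans are.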
Brushing data’s teeth and getting it to the office is only the first step outlined by DuVall; VA’s data experts also want to expand data’s workplace, building a computing environment beyond GenISIS and VINCI to include cloud-computing environments such as the VA Enterprise GovCloud and the VA Data Commons, a pilot project conducted in collaboration with the University of Chicago to integrate clinical, genomic, imaging, and other data from VA records within a scalable infrastructure. As the electronic health record evolves into a shared platform with the DOD’s medical records, researchers will be able to capitalize on some of the built-in capabilities of the vendor, Cerner, for accessing and sorting complex data sets. For certain applications, such as the complex algorithms involved in accessing the Phenotype Library, the VA has an agreement with the Department of Energy (DOE) to make use of the expertise and supercomputing capacity at several national laboratories.
“They have the horsepower needed to make scientific discoveries with as much data as we have,” DuVall said. One VA/DOE project involves building a predictive model that takes into account all pieces of information that may indicate suicide risk – not primarily for research but in order to enable early interventions to provide support. “You combine codes, medications,” said DuVall. “You combine that with genetic data. And then you combine that with the language written in clinical notes and mental health notes, discharge summaries and depression surveys, and all of these other pieces. And you put it in these supercomputers and crunch it, looking for some associations.” Additional DOE projects are underway to assess risks for cardiovascular disease and metastatic prostate cancer.
The VA is forming numerous such partnerships to support planning and policy related to artificial intelligence and data science. These efforts include subject matter experts from universities, nonprofit foundations, health care systems, and industry partners to accelerate discovery and enhance the VA’s ability to analyze data. In DuVall’s data workplace analogy, these are data’s coworkers, laboring every day to read and interpret data in ways that will improve clinical care – making it more streamlined, cohesive, and effective – and unravel the mysteries that remain encoded in veterans’ genes and documentary records.
“Everything we do in research,” DuVall said, “is trying to connect the dots: why people get sick, and when they do, which treatments or medications or lifestyle changes can help them feel better. The more we can look at the whole picture, the more accurately those associations can be made.” And the more the VHA is able to turn its data into knowledge, the more it will be able to improve the health and well-being of American veterans.