A Data Science Central Community
This article discusses some of the data challenges that the healthcare industry faces. It also revisits Statice's collaboration with the leading health organization Roche to test the use of synthetic medical data for clinical research, and the opportunities we see arising from it.
Perhaps more than in other industries, research and innovation in healthcare rely on the ability to access and analyze data. Data fuels the machine learning models that help to discover new diseases. It powers personalized medicine and supports research on drug efficacy. So being able to use granular, statistically representative data is essential. Unfortunately, the road to data-driven healthcare is paved with obstacles.
Firstly, the healthcare data landscape is mostly made of EHR systems with proprietary formats and siloed IT infrastructures. These systems prevent researchers from quickly accessing and sharing clinical data for innovation or analysis. Combining and formatting these segmented and dispersed sources is cumbersome and costs organizations time and money.
Even when organizations manage to create aggregated data views, where information could be made available to internal stakeholders, other challenges arise. Long internal governance and sharing processes prevent data from flowing seamlessly. When medical researchers request data, processing these queries can take weeks and still fail to return the desired data points. In the context of crisis management, slow data access is even more detrimental.
Then, organizations have to comply with stringent regulatory requirements for the processing of personal medical data. In Europe, the GDPR strictly regulates health data processing and is supplemented by national sets of regulations. Additional guidelines govern the use of data for specific use cases, such as the European Medicines Agency's guideline for the publication of clinical data. In the US, the Health Insurance Portability and Accountability Act (HIPAA) and the Health Information Technology for Economic and Clinical Health Act (HITECH) regulate the storage and processing of personally identifiable medical data. These regulations are essential to help protect patients' privacy. But they also cause data inertia. Some companies prefer not to do anything with the data for fear of not complying with the regulations.
To all of this, we must add privacy and cybersecurity concerns. While collaborative research across institutions is a significant innovation enabler, it raises challenges for the protection of patients' privacy. Data breaches and re-identification risks pose threats. Organizations' corporate and financial risks keep rising with the growing number of healthcare data breaches every year. In the UK alone, more than half of healthcare organizations experienced a cybersecurity incident during 2019.
Last but not least, maintaining both privacy and data usefulness is a further difficulty for organizations that choose to leverage data. Data that has undergone strict PII removal and anonymization is frequently of low quality for analysis. For example, HIPAA's Safe Harbor de-identification method requires removing all elements of dates (except the year) that are directly related to an individual. Such deletion complicates any research involving temporal factors.
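To make the temporal problem concrete, here is a minimal Python sketch. It assumes a de-identification step that keeps only the year of each date, as HIPAA's Safe Harbor method allows. The patient record, the dates, and the `strip_to_year` helper are all invented for illustration.

```python
from datetime import date


def strip_to_year(d: date) -> int:
    """Keep only the year of a date (hypothetical de-identification helper)."""
    return d.year


# Invented example record: admission and discharge within the same year.
admission = date(2019, 3, 4)
discharge = date(2019, 3, 18)

# With full dates, the length of stay can be computed.
length_of_stay_days = (discharge - admission).days

# After generalization to the year, the interval collapses.
year_difference = strip_to_year(discharge) - strip_to_year(admission)

print(length_of_stay_days, year_difference)
```

Any analysis that depends on intervals shorter than a year, such as length of stay or time between a treatment and a lab result, can no longer be run on the generalized record.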
So how can healthcare organizations gain data agility without compromising data privacy or utility?
At Statice, we see ourselves as enablers of data-driven innovation and as safeguards of individuals' privacy. We are building a synthetic data solution that provides healthcare companies with a reliable alternative to using sensitive data.
The concept is simple. From the sensitive medical data, organizations generate simulated data that offers the same analytical value but contains no real patient information. This new dataset mimics the original one. It can power all sorts of analysis while entirely protecting patients' privacy.
Under the hood, the software uses machine learning models that train on an original data source. These models learn the statistical distribution of the sensitive data and generate a new set of data with no one-to-one correspondence to the original records.
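As a deliberately simplified illustration of "learn the distribution, then sample from it", the sketch below fits a bivariate Gaussian to two numeric columns and draws new rows from it. This is a stand-in technique chosen for brevity, not the model Statice uses, which this article does not describe. The function names and all numbers are invented.

```python
import math
import random


def fit_bivariate_gaussian(rows):
    """Estimate means, standard deviations and correlation from (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    return mean_x, mean_y, sd_x, sd_y, cov / (sd_x * sd_y)


def sample_synthetic(params, n, rng):
    """Draw n new (x, y) rows from the fitted distribution."""
    mean_x, mean_y, sd_x, sd_y, rho = params
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + sd_x * z1
        # Mix z1 into y so that x and y keep the fitted correlation rho.
        y = mean_y + sd_y * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        rows.append((x, y))
    return rows


rng = random.Random(0)

# Invented "original" data: two correlated measurements for 200 records.
original = []
for _ in range(200):
    x = rng.gauss(120, 15)
    original.append((x, 0.5 * x + rng.gauss(0, 5)))

params = fit_bivariate_gaussian(original)
synthetic = sample_synthetic(params, 5000, rng)
synthetic_params = fit_bivariate_gaussian(synthetic)
```

The synthetic rows are drawn from the fitted model rather than copied or perturbed from individual records, which is why there is no row-to-row mapping back to the original. Real clinical data mixes categorical, temporal, and skewed variables, so a production system would need a far richer model than a Gaussian.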
For Roche, this represented an opportunity to benefit from an unlimited source of granular yet privacy-preserving data. Together with the company, we worked on understanding how to generate synthetic data from sensitive clinical trial data. Later, this synthetic data could serve as test data or training data for machine learning applications.
The project had two phases: a first phase focusing on data and requirements testing, and a second phase of synthetic data validation. The requirement was:
We decided to evaluate the feasibility of producing synthetic medical data that meets these requirements using a Harvard Dataverse dataset. This public dataset provides a set of four datasets of clinical trial data, containing 33 variables and 55,660 observations on 123 patients.
The small number of patients in the data was an exciting challenge. When working with statistical distributions, the fewer individuals you have, the higher the disclosure risk. We had to make sure that no statistical feature was singular enough to disclose information from the original records.
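One simple way to spot such singular features is to count the records whose combination of attribute values appears only once in the data. The sketch below is an assumption about how such a check could look, not a description of the checks run in the project. The field names and records are hypothetical.

```python
from collections import Counter


def singular_records(records, keys):
    """Return the records whose combination of `keys` values is unique."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    return [r for r in records if combos[tuple(r[k] for k in keys)] == 1]


# Hypothetical patient attributes, invented for illustration.
patients = [
    {"age_band": "40-49", "sex": "F", "site": "A"},
    {"age_band": "40-49", "sex": "F", "site": "A"},
    {"age_band": "40-49", "sex": "M", "site": "A"},
    {"age_band": "80-89", "sex": "M", "site": "B"},
]

unique = singular_records(patients, ["age_band", "sex", "site"])
print(len(unique), "of", len(patients), "records are unique on these attributes")
```

With only 123 patients, a large share of records is likely to be unique on even a few attributes, which is why a small sample makes it harder to generate data that does not leak individual information.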
Once we had the synthetic data, we evaluated it by comparing its properties with those of the original data. The results were promising, especially given the small size of the sample. The synthetic data proved to largely preserve the statistical patterns initially present. The evaluations below were generated by Statice's software. They show that the marginal distributions in the synthetic data were similar to those from the original data.
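A common way to quantify how close two marginal distributions are is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical cumulative distribution functions. The sketch below is an assumption about how such a comparison could be made. This article does not say which metrics Statice's software computes, and the sample values are invented.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b (0 = identical)."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # Both CDFs are step functions, so the largest gap occurs at a sample point.
    for v in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, v) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, v) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Invented values for one lab measurement in each dataset.
original_values = [4.1, 4.8, 5.2, 5.9, 6.3]
synthetic_values = [4.0, 4.9, 5.3, 5.8, 6.4]

marginal_gap = ks_statistic(original_values, synthetic_values)
print(f"KS statistic: {marginal_gap:.2f}")
```

A value near 0 means the two columns are distributed almost identically, while a value near 1 means they barely overlap.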
They also highlight the fact that correlations in the data were well preserved. In the case of the Harvard dataset, the correlations were between lab measurements. The conclusion was that synthetic data would be as useful for analysis as the original would have been.
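Correlation preservation can be checked in a similar spirit by computing each pairwise correlation in both datasets and looking at the difference. The sketch below uses the Pearson coefficient as an assumed metric. The column names and values are hypothetical and do not come from the Harvard dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def correlation_gaps(original, synthetic):
    """Absolute difference in correlation for every pair of columns."""
    names = sorted(original)
    gaps = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            gaps[(a, b)] = abs(
                pearson(original[a], original[b])
                - pearson(synthetic[a], synthetic[b])
            )
    return gaps


# Hypothetical lab measurements, invented for illustration.
original = {"glucose": [90, 100, 110, 120], "hba1c": [5.0, 5.5, 6.0, 6.5]}
synthetic = {"glucose": [92, 101, 108, 121], "hba1c": [5.1, 5.4, 6.1, 6.4]}

gaps = correlation_gaps(original, synthetic)
```

Small gaps across all column pairs indicate that relationships between variables, and not only each variable on its own, carried over to the synthetic data.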
Statice’s software offers additional privacy evaluations. They showed that no one-to-one connection existed between the original and the new data. This means that it wasn't possible to tell whether an individual was part of the original dataset.
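One widely used heuristic for this kind of check is the distance from each synthetic record to its closest original record: a distance of zero flags an exact copy. The sketch below illustrates that heuristic under invented data. It is not a description of Statice's privacy evaluations, and passing it does not by itself rule out membership inference.

```python
import math


def distance_to_closest_record(synthetic, original):
    """For each synthetic row, the Euclidean distance to its nearest original row."""
    return [min(math.dist(s, o) for o in original) for s in synthetic]


# Invented numeric records; the first synthetic row copies an original one.
original_rows = [(1.0, 2.0), (4.0, 6.0)]
synthetic_rows = [(1.0, 2.0), (4.0, 9.0)]

distances = distance_to_closest_record(synthetic_rows, original_rows)
exact_copies = sum(1 for d in distances if d == 0)
print(distances, exact_copies)
```

In practice the columns would be scaled first so that no single variable dominates the distance, and the distances would be compared against those between two independent samples of real data.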
For us, this project was another strong signal of the potential of synthetic data in healthcare. Where privacy regulations, legacy infrastructure, and governance processes restrict the data’s availability, synthetic data can help drive data agility for teams.
The use cases in the industry are numerous. Synthetic medical data can support the development of healthcare applications. For example, M-Sense is the company behind a migraine monitoring application. They use synthetic data to conduct migraine research on patients' data while ensuring complete privacy and anonymity.
In Germany, the Charité Lab for Artificial Intelligence in Medicine is developing synthetic data to support collaborative research and advance various medical use cases. In Canada, the health economic hub Health City is developing a synthetic patient database that reflects the characteristics of the real population for health research. And these are only a handful of examples.
There are still challenges related to generating and implementing synthetic data for innovation in healthcare. But the initiatives we see worldwide demonstrate the willingness of organizations to find solutions to the data challenges of the current healthcare ecosystem.