AI-Ready Data Is Crucial to Advancing Precision Health

Implementing AI-Ready Data Practices to Promote Equitable, Protected, Machine-Readable, and Well-Defined Precision Health Data

We offer the following four approaches for ensuring that precision health data is AI-ready. Several of these approaches were informed by practices from Data-Centric AI (DCAI), a movement focused on improving the data practices (e.g., data engineering) used in AI development.

1. Build Well-Defined Datasets to Reduce Domain Knowledge Barriers

Precision health datatypes are complex, so analyzing them often requires domain-specific knowledge. Adopting effective data documentation protocols greatly reduces this barrier. For instance, data sheets that detail a dataset’s motivation, collection process, maintenance, intended use (e.g., how an individual’s data is used and shared), and distribution plan can help ensure its appropriate use. Data generators and AI/ML modelers should also use automated anomaly detection tools and statistical techniques (e.g., Random Cut Forest) to validate data quality by identifying anomalous data points. Lastly, manual data inspection, including a review of basic distributional statistics, should supplement automated tools: it helps identify potential data quality issues and adds a dimension of contextual understanding that automated checks lack.
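
As a minimal sketch of the inspection step described above, the following Python example computes basic distributional statistics and flags extreme values for manual review. It uses a simple interquartile range rule (Tukey's rule) as a stand-in for more capable anomaly detectors such as Random Cut Forest; the blood pressure readings, the function names, and the threshold `k=1.5` are illustrative assumptions, not part of any specific toolkit.

```python
import statistics


def summarize(values):
    """Return basic distributional statistics for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


def flag_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical systolic blood pressure readings (mmHg).
# The value 1200 is a plausible data-entry error.
readings = [118, 122, 125, 130, 119, 121, 127, 1200, 124, 120]

stats = summarize(readings)
outliers = flag_outliers(readings)

print(stats)
print("Flagged for manual review:", outliers)
```

A flagged value is a prompt for human review, not an automatic deletion: a domain expert decides whether it is an entry error or a genuine extreme measurement.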

2. Employ Equitable Data Curation and Application Standards to Improve Trust

Training precision health AI/ML models with non-representative data can lead to poor or biased performance on tasks such as identifying patients with complex health needs. To ensure that data is equitable, both technical and non-technical approaches should be employed. For example, the data collection process should be designed to proactively identify and address subpopulation considerations, and organizations should create diverse and inclusive data science teams that are trained to recognize intergroup disparities in health and outcomes. Statistical tests (e.g., Chi-Square, ANOVA) and bias detection software, which are included in Booz Allen’s aiSSEMBLE offering, should be used to identify data biases. Thoughtfully generated synthetic data can then be used to augment existing datasets and correct representation imbalances.
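
One way to apply a chi-square test to representation, sketched here under stated assumptions, is a goodness-of-fit check that compares subgroup counts in a dataset against reference population proportions. The group names, counts, and proportions below are hypothetical, and the helper `chi_square_gof` is written for illustration only. The critical values are the standard chi-square table entries for a 0.05 significance level.

```python
# Chi-square critical values at alpha = 0.05, keyed by degrees of freedom.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square_gof(observed, expected_props):
    """Chi-square goodness-of-fit statistic for observed counts
    against reference proportions. Returns (statistic, degrees of freedom)."""
    total = sum(observed.values())
    stat = 0.0
    for group, count in observed.items():
        expected = total * expected_props[group]
        stat += (count - expected) ** 2 / expected
    return stat, len(observed) - 1


# Hypothetical subgroup counts in a training dataset.
dataset_counts = {"group_a": 700, "group_b": 200, "group_c": 100}
# Hypothetical proportions in the population the model is meant to serve.
reference_props = {"group_a": 0.60, "group_b": 0.25, "group_c": 0.15}

stat, dof = chi_square_gof(dataset_counts, reference_props)
is_imbalanced = stat > CHI2_CRITICAL_05[dof]

print(f"chi-square = {stat:.2f}, dof = {dof}, imbalanced = {is_imbalanced}")
```

A statistic above the critical value suggests the dataset's subgroup mix differs from the reference population by more than chance would explain, which is a signal to revisit collection or consider augmentation.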

Unreliable or sparse target variable labels, insufficient label complexity, or unclear label definitions can also cause poor or biased AI/ML model performance. AI/ML modelers and dataset creators can use crowdsourcing to augment labels and improve annotations.
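
Crowdsourced labels usually need to be reconciled before use. The sketch below shows one simple approach, majority voting with an agreement threshold, so that low-consensus items are routed to expert review rather than silently assigned a label. The item IDs, label names, and the `min_agreement` value of 0.6 are hypothetical choices for illustration.

```python
from collections import Counter


def aggregate_labels(annotations, min_agreement=0.6):
    """Majority-vote aggregation of crowdsourced labels.

    Returns {item_id: (label, agreement)}. The label is None when the
    share of annotators backing the top label falls below min_agreement.
    """
    results = {}
    for item_id, labels in annotations.items():
        top_label, votes = Counter(labels).most_common(1)[0]
        agreement = votes / len(labels)
        label = top_label if agreement >= min_agreement else None
        results[item_id] = (label, agreement)
    return results


# Hypothetical annotations from three annotators per item.
annotations = {
    "scan_001": ["benign", "benign", "benign"],
    "scan_002": ["benign", "malignant", "malignant"],
    "scan_003": ["benign", "malignant", "uncertain"],
}

consensus = aggregate_labels(annotations)
for item_id, (label, agreement) in consensus.items():
    print(item_id, label, round(agreement, 2))
```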

3. Create and Apply Data Protection and Privacy Principles

Patients are more likely to provide reliable data when they trust that those collecting and using their sensitive health information will protect and handle it appropriately. Appropriate handling includes adhering to the Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy and Security Rules governing the protection of sensitive individual and health information. Organizations should also consider creating new or utilizing existing data privacy principles, such as those outlined by the EU’s General Data Protection Regulation (GDPR) (e.g., purpose limitation, data minimization, accuracy, security). There is currently no U.S. federal law equivalent to GDPR, but some states have adopted similar legislation at the state level. For example, the California Consumer Privacy Act and the subsequent California Privacy Rights Act enable consumers to know and control how their personal information is used by the businesses that collect it.

Recent privacy-enhancing technology innovations and synthetic data generation advancements promise to better enable AI/ML model development using protected sensitive data. Emerging privacy-enhancing technology solutions, such as federated learning and differential privacy, protect personal data by minimizing unnecessary data sharing, encrypting or anonymizing data, and ensuring confidentiality in aggregate data. Federated learning is a method of AI/ML model training in which models are trained locally on independent datasets and then iteratively combined into a shared model, avoiding the explicit exchange of training data. Research has also shown that synthetic data generation techniques enabled by Generative Adversarial Networks (GANs) can produce realistic synthetic image and tabular (numerical, text) data, enabling AI/ML model development while allowing sensitive health data to remain protected.
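
To make the federated learning description concrete, here is a minimal sketch of federated averaging on a toy one-parameter linear model. Each simulated site trains on its own data, and only the resulting weight is sent back to be averaged. The synthetic data, the true weight of 3.0, the learning rate, and all function names are assumptions made for illustration; production systems add secure aggregation, communication handling, and much larger models.

```python
import random


def local_update(w, data, lr=0.1, epochs=5):
    """Run a few gradient-descent steps for the model y = w * x
    using only one site's private data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(site_datasets, rounds=20):
    """Federated averaging: only model weights leave each site,
    never the raw records."""
    global_w = 0.0
    total = sum(len(d) for d in site_datasets)
    for _ in range(rounds):
        # Each site starts from the current shared weight and trains locally.
        local_ws = [local_update(global_w, d) for d in site_datasets]
        # The coordinator combines weights, weighted by site sample size.
        global_w = sum(w * len(d) for w, d in zip(local_ws, site_datasets)) / total
    return global_w


def make_site(n, true_w=3.0):
    """Generate a hypothetical private dataset for one site."""
    xs = [random.uniform(0, 2) for _ in range(n)]
    return [(x, true_w * x + random.gauss(0, 0.1)) for x in xs]


random.seed(0)
sites = [make_site(50), make_site(80), make_site(30)]
learned_w = federated_average(sites)

print(f"learned weight: {learned_w:.3f}")
```

The shared weight converges close to the value used to generate the data even though no site ever sees another site's records.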

4. Test the Machine Readability of Your Data

Properly preparing data, including ensuring that it can be processed by a computer, is a critical and often time-consuming prerequisite for advanced analytics. To expedite AI/ML modeling, AI-ready data should be distributed in file formats and structures that ease ingestion into coding environments. Data repositories should provide random representative subsets of full datasets to enable quick data readability and suitability checks. AI/ML developers can use these random representative data samples to easily ingest data into coding notebook environments, such as Jupyter and RStudio, and use libraries such as pandas and dplyr to gain a basic understanding of a dataset, including descriptive statistics, features and data types, and presence of null or missing values. AI/ML developers should also consider using AutoML tools, which automate many of the steps of the machine-learning process (e.g., feature and model selection), to accelerate suitability and exploratory analysis and inform future modeling efforts.
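
A quick readability check of this kind might look like the following sketch, which loads a small sample with pandas and reports data types, missing values, and descriptive statistics. The inline CSV, its column names, and the `profile` helper are hypothetical; in practice the sample would be read from a file provided by the data repository.

```python
import io

import pandas as pd

# Hypothetical representative sample; in practice, pass a file path to read_csv.
SAMPLE_CSV = """patient_id,age,sex,hba1c,diagnosis_date
P001,54,F,6.1,2021-03-14
P002,61,M,,2020-11-02
P003,47,F,7.4,2022-01-23
P004,,M,5.8,2021-07-09
"""


def profile(df):
    """Return a per-column summary: data type, missing count, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })


# If this call fails, the file is not machine readable as distributed.
df = pd.read_csv(io.StringIO(SAMPLE_CSV), parse_dates=["diagnosis_date"])
summary = profile(df)

print(summary)
print(df.describe())  # descriptive statistics for numeric columns
```

If a column expected to be numeric loads with a text data type, or a date column fails to parse, that is an early sign the file needs cleaning before modeling.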

Through the application of these four AI-ready data practices, organizations can accelerate AI/ML research, discovery, and utilization to drive better precision health outcomes. 
