Advance Science with Privacy-Preserving Record Linkage

By Susan Tenney, Lucy Han, and Anya Dabic


What Is PPRL and What Does It Do?

Privacy-preserving record linkage (PPRL) is an approach for linking two or more records that belong to the same individual without revealing that individual's identity. It allows policymakers, decision-makers, scientists, researchers, clinicians, administrators, and others to glean information from the linked data and transform it into knowledge that can help that individual (for example, by designing precise medical treatments for them) or the public as a whole.

Record linkage dates back to 1946, when Halbert L. Dunn of the U.S. National Office of Vital Statistics coined the term to describe the process of assembling the records (or data) of an individual from birth to death. Since then, various organizations in the public and private sectors have performed such record linkage. The traditional method uses combinations of direct personal identifiers, such as name, date of birth, social security number, and home address, to match records belonging to the same individual (see Figure 1). This requires sharing personally identifiable information (PII) across the sources that link the records, which raises a host of privacy and security concerns.

Figure 1: Traditional Record Linkage vs. Privacy-Preserving Record Linkage

PPRL addresses data privacy and security concerns by using a hashing algorithm to encode PII and generate tokens and then sharing only the encoded tokens between organizations and sources (as shown in Figure 1). Tokens (strings of characters and numbers) generated using the same combination of PII by the same algorithm across data sources will be identical. They can be used to match records from the same individual without revealing their identity. These matched records of the same individual can then be linked (or combined) for use by researchers.
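The tokenization idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used by any particular PPRL tool: the choice of a keyed SHA-256 hash (HMAC), the `SECRET_KEY` value, the `normalize` and `make_token` helpers, and the sample records are all hypothetical. The sketch shows that two sources that normalize and hash the same PII in the same way produce identical tokens, so records can be matched without exchanging the PII itself.

```python
import hashlib
import hmac
import unicodedata

# Hypothetical shared secret (an assumption for this sketch). All sources
# would need to use the same key and keep it private.
SECRET_KEY = b"example-shared-secret"


def normalize(value: str) -> str:
    """Lowercase, strip accents, and drop non-alphanumeric characters."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.lower() if ch.isalnum())


def make_token(first_name: str, last_name: str, date_of_birth: str) -> str:
    """Return a keyed SHA-256 digest of the normalized PII fields."""
    message = "|".join(
        normalize(part) for part in (first_name, last_name, date_of_birth)
    )
    return hmac.new(SECRET_KEY, message.encode("utf-8"), hashlib.sha256).hexdigest()


# Each source tokenizes locally and shares only token-keyed data, never PII.
ehr_records = {
    make_token("María", "O'Neil", "1980-04-12"): {"diagnosis": "asthma"},
}
claims_records = {
    make_token("maria", "ONeil ", "1980-04-12"): {"claim_total": 250.0},
}

# Records whose tokens appear in both sources are combined into one record.
linked = {
    token: {**ehr_records[token], **claims_records[token]}
    for token in ehr_records.keys() & claims_records.keys()
}
```

Normalizing before hashing matters because a hash changes completely when its input changes by even one character; without it, "María" and "maria" would yield different tokens and the match would be missed.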

PPRL-based data linkage offers two key benefits:

  • Data enrichment across two dimensions of data—diverse types and longitudinal—for the same individual: PPRL facilitates combining data of different types and over different periods of time (longitudinal) collected for the same individual by different sources. Different data types might include electronic health records (EHR), clinical trial data, surveys, claims, and environmental information.
  • Broader data sharing: PPRL promotes broader data sharing and collaboration across multidisciplinary stakeholders while protecting the privacy of individuals whose data are shared. It also helps eliminate duplication of costly data generation—such as images or genomic data—that have already been collected by one source and made available for sharing.

Such enriched and more broadly shared data have greater potential to address unique and challenging scientific and public health research questions that a single dataset, or even a subset of the available datasets, cannot answer. Further, linked datasets provide a more holistic view of the individual, enabling precision medicine, innovative research, establishment of disease or population-focused national patient registries, and public health monitoring.

How Is PPRL Being Used in the Real World?

Researchers from the National Institutes of Health (NIH) used PPRL to help address a real-world public health question regarding COVID-19 infections during the pandemic.

One of the central questions facing health researchers, clinicians, government decision-makers, and the public during the COVID-19 pandemic was: How likely are you to be reinfected with the SARS-CoV-2 virus after a previous infection?

At the start of the pandemic, when little was known about how the virus spread between and within individuals, it was unclear whether a prior infection with SARS-CoV-2 could protect someone from subsequent infections. Learning more on this subject would benefit individual patients as well as society as a whole by informing decisions and response activities at multiple levels, such as when it would be safe to return to school or work, when it would be safe to engage in group activities, and how best to prioritize vaccine distribution among various populations and geographies.

To answer this question, researchers used PPRL to link real-world data from various sources, including EHRs, medical and pharmacy claims, and COVID-19 antibody testing data, for more than 3 million people between January and August 2020. From the linked data, researchers found that having SARS-CoV-2 antibodies in the blood was associated with a lower risk of reinfection. However, additional research was needed to determine how long the protection lasted.

Similar research and public health questions are being answered using PPRL by various organizations.

  • NIH's National COVID Cohort Collaborative (N3C) links EHRs with different types of data to accelerate COVID-19 research.
  • NIH's All of Us (AoU) program establishes a Center for Linkage and Acquisition of Data (CLAD) for linking AoU participant data, including data obtained from biosamples, surveys, wearable devices, physical measurements, and EHRs.
  • The Centers for Disease Control and Prevention (CDC) links COVID-19 case and vaccination data from reporting state jurisdictions to track the spread and prevention of COVID-19.
  • The Department of Veterans Affairs (VA) links administrative data to EHR data from the Chicago HealthLNK Data Repository to identify veterans eligible for VA services.

Additional potential use cases for PPRL that we’ve identified include:

  • Developing prediction models to assess mental health risks in adolescents or other populations of interest
  • Tracking mother-child pairs over time to study a particular inherited disease or the effect of the mother’s exposure to an environmental agent on the child’s development
  • Achieving accurate counts and distributions of individuals for epidemiological studies—to understand disease burden, rate of disease spread, or other trends
  • Preventing the generation of duplicate data on an individual across sources to maximize your investment
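As a rough illustration of the counting use case above, the sketch below assumes each source has already produced tokens for its records. The function name and the short token strings are hypothetical placeholders; real tokens would be long hash digests. It shows how shared tokens let an analyst count distinct individuals rather than rows.

```python
def count_unique_individuals(*sources):
    """Count distinct tokens across any number of token collections."""
    unique_tokens = set()
    for tokens in sources:
        unique_tokens.update(tokens)
    return len(unique_tokens)


# Hypothetical tokens reported by two sources. Two people appear in both
# sources, and one person appears twice within the lab data.
clinic_tokens = ["a1f3", "b7c9", "d2e8"]
lab_tokens = ["b7c9", "d2e8", "f0a4", "f0a4"]

naive_total = len(clinic_tokens) + len(lab_tokens)  # counts rows
unique_total = count_unique_individuals(clinic_tokens, lab_tokens)  # counts people
```

Here the row count is 7 while the number of distinct individuals is 4, which is the kind of overcounting that token-based deduplication is meant to avoid.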

What Do You Need to Know and Do to Implement PPRL?

With all the realized benefits and potential opportunities associated with PPRL, you might be wondering, why are more health researchers not using it? What do I need to know and do to use it for my research?

Five essential components must be addressed when considering PPRL (as shown in Figure 2): participants, data, technology, governance, and resources. These components intersect at various points to unlock the hidden insights from PPRL-linked data to address real-world use cases effectively and efficiently.

Figure 2: Five Essential PPRL Components

Participants: Who are the participants you need to engage for your studies? Do they understand the benefits of linking their data? Have they consented to the linkage of their data? If not, can you get consent, a waiver of consent, or approval from a human subjects privacy board or an institutional review board to link their data?


Data: What types of data do you need, and from what period of time? Who has that data? What is the quality? Is it standardized in a way that you can link it? Will they share the data for linking? How can you gain access to the data?


PPRL Tools and Technology: What PPRL tool should you use for matching participants and linking their data? Is it freely available (open source) or proprietary? Is it compatible with the PII elements available in the data? Can it scale up to accommodate increasing volumes of data?


Governance: What governance (policies, terms, and conditions for use) do you need to comply with for linking the data (and sharing the linked data with others, if needed)? If no specific governance is available, do you know who the data stewards are with whom you would need to work to identify the governance?


Resources: Do you have the resources necessary, such as infrastructure and staff expertise, to perform the tokenization and linking, and the sharing of the linked data? Can the resources scale to your growing needs for linking?

Below are some key considerations to take into account when implementing PPRL within or across organizations:

  • Engender trust among participants whose data are to be linked and shared: Individual participants are at the heart of PPRL. In general, participants must give informed consent for data sources to link and share their data. Participants need to have confidence and trust that their data will not be shared or used beyond the scope of their consent. This can be addressed by including explicit language in the consent forms regarding linking and sharing their data and the associated risks, implementing robust oversight and governance that includes data privacy and security control mechanisms, and then educating participants on these measures so that it is a collaborative decision-making process with the participant. Research has shown that most participants are willing to share their data if the data are de-identified and appropriate privacy and security measures are in place.
  • Use high-quality standardized data for tokenization and linking: Garbage in, garbage out applies to PPRL just as it does to other data-related realms. The quality of the PII used for encoding (for example, low rates of missing values or errors) is critical for generating tokens and subsequent matches with high accuracy. The data sources need to agree on PII quality requirements so that the PII is standardized across sources before tokenizing and matching. In addition to the PII, the actual datasets should also be of high quality for the linked data to be of meaningful use. This is achieved either by collecting the data in a standardized manner or by harmonizing data that have already been collected, in both cases using common data elements or data models. It is less efficient and more costly to standardize data after they have been collected.
  • Mitigate any potential for re-identification of participants when linking datasets: Linking two or more datasets inherently raises the risk of re-identification, even when each of the original datasets has been fully stripped of PII, because linkage produces a richer dataset with more data points on the individual. Therefore, re-identification risk mitigation must be performed before linking the datasets, and disclosure avoidance measures must be established before sharing the linked datasets. Some risk-mitigation mechanisms include suppressing or altering variables that, when combined, could potentially re-identify the participant; collapsing small counts of certain variables, such as race, ethnicity, or rare diseases; and perturbing data by adding noise. Some disclosure avoidance measures include reviewing outputs of linkage and using strict physical or technical controls (e.g., a data enclave) or administrative controls (e.g., a review committee) for data access.
  • Establish rules for linking and using the linked data, ideally through standardized data-use agreements: Data have significantly greater value when shared with the broader health research community, but data sharing also brings some risks of misuse, especially when linked datasets are involved. Therefore, data sources that agree to linking must adhere to a set of best practices for linkage and must establish appropriate agreements with stringent terms of use. Such terms include prohibiting attempts to re-identify participants, with penalties for re-identification that may include legal action, and prohibiting the sharing of linked data with non-approved individuals or the linking of data for non-approved purposes.
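One of the risk-mitigation mechanisms mentioned above, collapsing small counts, can be sketched as follows. This is an illustrative example only: the function name, the default threshold of 5, and the replacement label are assumptions made for the sketch, and the appropriate threshold would be set by the governing data-use agreement.

```python
from collections import Counter


def suppress_small_counts(values, threshold=5, other_label="Other/suppressed"):
    """Replace categories seen fewer than `threshold` times with one label."""
    counts = Counter(values)
    return [v if counts[v] >= threshold else other_label for v in values]


# Hypothetical diagnosis column from a linked dataset: the rare condition
# appears only twice, so it is collapsed before the data are shared.
diagnoses = ["asthma"] * 6 + ["rare condition X"] * 2
safe_diagnoses = suppress_small_counts(diagnoses)
```

Note that the collapsed category can itself end up small, so in practice its size would also need to be reviewed before release.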

How Can Federal Agencies Advance PPRL to Support Their Mission?

The U.S. government has a crucial role in protecting the privacy of the data generated from federally funded activities, and it has already instituted multiple privacy and confidentiality laws, including:

  • The Health Insurance Portability and Accountability Act (HIPAA)
  • The Confidential Information Protection and Statistical Efficiency Act (CIPSEA)
  • The Privacy Act
  • The Federal Information Security Modernization Act of 2014 (FISMA)

Data stewards who are required by law to maintain the privacy and confidentiality of their participants are hesitant to share data with others because of real and perceived risks. This hesitancy has limited the application of advanced analytics, including AI and machine learning technologies, that could generate valuable insights and drive innovation for public benefit. PPRL and other privacy-enhancing technologies are at the core of solving this challenge of protecting participant privacy when linking and sharing data.

In March 2023, the White House Office of Science and Technology Policy released the National Strategy to Advance Privacy-Preserving Data Sharing and Analytics (PPDSA), which included multiple strategic priorities to advance PPDSA methods and technologies, including PPRL. Building upon this strategy report, below are a few general approaches that federal agencies can take to advance PPRL within their organizations:

  • Educate data stewards on the benefits of PPRL and the strategies for implementing PPRL to address use cases that directly drive agency missions—for example, high-impact clinical trials or longitudinal studies at NIH, post-marketing surveillance or adverse event tracking at FDA, or public health monitoring or infectious disease tracking at CDC.
  • Implement a shared PPRL infrastructure, including PPRL tools, data-linkage platforms, governance, staff expertise, and other resources necessary across programs within an agency to maximize cost savings and economies of scale.
  • Establish best practices for FAIR (findable, accessible, interoperable, and reusable) data using data and metadata standards, common data elements, and data models throughout the entire data lifecycle to promote efficient linking and sharing.
  • Institute policies and standard governance elements, including agreements and clear lines of responsibilities for data linking so that all stakeholders, especially participants, data stewards, and data users, are fully aware and informed of expectations when linking and using linked data.
  • Encourage PPRL-based data sharing within and across agencies to address multidisciplinary and cross-cutting areas that align with the agencies’ missions and advance the well-being of the public, including minority and underserved populations.

Ready to harness the power of data while safeguarding privacy?

Join the PPRL movement and discover how you can unlock insights, foster groundbreaking research, and drive innovation—all while respecting participant privacy.

Contact us to learn more and become a part of this transformative journey.

