Privacy Preserving Record Linkage

Written by Alison Amar and Erin McAuley


What Is Privacy-Preserving Record Linkage? A Definition and Brief History

Record linkage, established in 1946 by Halbert L. Dunn, is the task of matching records that refer to the same entity across disparate data sources. It is fundamental to the generation of official statistics and has recently become particularly important in healthcare and biomedical research. Its uses in that realm include linking the genomic and clinical data of individual patients within research studies, as well as sharing data within clinical research networks to support centralized repositories that store longitudinal patient-level data from healthcare systems, clinical practices, government agencies, and academic institutions.

Traditional record linkage often depends on personally identifiable information (PII), whose use and disclosure are limited by the Health Insurance Portability and Accountability Act (HIPAA). PII attributes can be direct identifiers, such as first and last name, or quasi-identifiers, such as date of birth and zip code. While these attributes are used in traditional record linkage processes, they are typically excluded from research studies and final datasets to protect privacy. Accurate record linkage based on erroneous, missing, incomplete, non-standardized, or outdated data also represents a significant challenge.

Privacy-preserving record linkage (PPRL) is an emerging solution to this and related challenges that uses algorithmic techniques to effectively link records by matching PII attributes without revealing them, thus keeping them protected.

PPRL is an increasingly relevant, safe, and responsible way to produce broader, larger datasets that can be input to artificial intelligence (AI) and machine learning (ML) algorithms. Use of PPRL is an essential prerequisite for deploying meaningful analytic models that ingest a combination of disparate data sources to provide a more comprehensive view of the factors that most influence human health.

From their origin in the 1990s to the present, PPRL methods have continued to evolve to accommodate the volume, velocity, variety, and veracity of healthcare-related big data.

Figure 1. Example of non-standardized data to be encoded for record linkage (chart comparing Name, Date of Birth, and City values across clinical, imaging, and genomics data)

PPRL’s Rapid Evolution: From 1998 to Present

First-generation PPRL techniques (1998–2004) focused on matching entities that share exactly the same identifying values across datasets. This was commonly done by algorithmically creating unique, irreversible hashed pseudonyms for entities from the input PII, allowing data owners at different institutions to compare the one-way hashed tokens across multiple datasets and sources and match records belonging to the same individual. A hash pseudonym can be thought of as a fingerprint of the data: it is a transformation of a given string into another value that represents the original string. Because a hash is one-way, it is not meant to be deciphered; combining the hash with a secret key shared by the participating sites strengthens security because only parties holding the key can generate tokens that match.
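The exact-match approach can be sketched as follows. This is a minimal illustration, not a description of any specific first-generation tool: it assumes HMAC-SHA256 as the one-way function, and the shared key, the `normalize` and `pseudonym` helpers, and the sample identifiers are all hypothetical.

```python
import hashlib
import hmac
import unicodedata

# Hypothetical secret, agreed out of band by the participating sites.
SHARED_KEY = b"example-key-agreed-by-both-sites"


def normalize(value: str) -> str:
    """Strip accents, case, and extra whitespace before hashing."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return " ".join(ascii_only.lower().split())


def pseudonym(first: str, last: str, dob: str, key: bytes = SHARED_KEY) -> str:
    """Return a one-way keyed token for the concatenated identifiers."""
    message = "|".join(normalize(part) for part in (first, last, dob))
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).hexdigest()


site_a = pseudonym("Maria", "Lopez", "1980-02-14")
site_b = pseudonym(" MARIA ", "López", "1980-02-14")
typo = pseudonym("Maria", "Lopes", "1980-02-14")

print(site_a == site_b)  # True: formatting differences are normalized away
print(site_a == typo)    # False: a one-letter spelling difference breaks the match
```

The second comparison shows the limitation of this generation of techniques: any residual difference in the input produces a completely different token.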

Because they relied on exact matches, these techniques could not accommodate data errors and variations. Second-generation tools (2004–2009) therefore adopted “fuzzy matching” to account for differences in spelling, punctuation, and capitalization. These improvements relied on the introduction of Bloom filters, space-efficient probabilistic data structures used to test whether an element is a member of a set.
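To make the Bloom filter idea concrete, the sketch below encodes the character bigrams of a string into a set of bit positions and compares two encodings with the Dice coefficient. The filter length, the number of hash functions, the key, and the double-hashing scheme are illustrative assumptions rather than the parameters of any particular PPRL tool.

```python
import hashlib
import hmac

FILTER_BITS = 256   # assumed filter length
NUM_HASHES = 4      # assumed number of hash functions per bigram
SECRET_KEY = b"example-key-agreed-by-both-sites"  # hypothetical shared key


def bigrams(text: str) -> set[str]:
    """Split a padded, lowercased string into overlapping 2-grams."""
    padded = f"_{text.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(text: str) -> set[int]:
    """Return the bit positions switched on by the string's bigrams."""
    positions = set()
    for gram in bigrams(text):
        digest = hmac.new(SECRET_KEY, gram.encode("utf-8"), hashlib.sha256).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(NUM_HASHES):  # double hashing: h1 + i * h2
            positions.add((h1 + i * h2) % FILTER_BITS)
    return positions


def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient of two bit-position sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


similar = dice(bloom_encode("Katherine"), bloom_encode("Catherine"))
different = dice(bloom_encode("Katherine"), bloom_encode("William"))
print(f"similar names: {similar:.2f}, different names: {different:.2f}")
```

Names that share most of their bigrams set mostly the same bits, so they score close to 1 even though the tokens are not identical, which is what allows matching despite spelling variation.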

As more digitized data were being collected, the third generation of PPRL techniques (2009–2014) sought to address scalability and computational resource usage.

In parallel with the big data revolution, the fourth generation of PPRL approaches (2014–2020) was tailored to optimize processes and tools for large datasets. Areas of enhancement included computational resource usage, schema optimization, privacy, and the development of tools and applications for practical utility.

More recent progress has focused on enabling use of more sophisticated analytics including machine and deep learning.

How Does PPRL Technology Work? A High-Level Review

Linkage Milestones

Pre-processing and Encoding
Purpose: Pre-process and transform sensitive data
  • Pre-processing
    • Deduplication
    • Data Cleaning
    • Standardization
  • Encoding
    • Bloom-filter based

Blocking
Purpose: Prune pairs that are unlikely to be a match based on some filtering criteria
  • Locality-sensitive hashing (LSH)
  • Sorted Neighborhood
  • Q-gram-based
  • Suffix-based Indexing
  • Canopy Clustering

Comparing
Purpose: Compare each candidate pair using a pre-defined metric, resulting in a similarity score
  • Exact
  • Approximate
    • Distance & Token Based
    • Substring Comparisons
    • Soft TF-IDF

Classifying
Purpose: Use a combination of algorithms to determine which records match
  • Threshold-based
  • Probabilistic Matching
  • Rule-based
  • Machine Learning-based
    • Supervised
    • Semi-Supervised
    • Unsupervised

Evaluating
Purpose: Assess the performance of the linkage process
  • Linkage Success Metrics
  • Quality
    • Efficiency
    • Effectiveness
    • Privacy Protection and Disclosure Risk

Figure 2. A PPRL Overview: The Five Key Steps

The technical steps that comprise PPRL are shown in Figure 2 and outlined below. These milestones correspond with a use case in which two database owners, site A and site B, seek to share data. Each participant is responsible for data preprocessing and cleaning, which can be done according to their own procedures and on their own data systems. But the two sites will have to agree on a shared linkage schema—a detailed description of exactly how to carry out the data encoding operation. Cryptographic hashing remains the predominant PPRL encoding approach. 
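The blocking, comparing, classifying, and evaluating milestones can be sketched on records that have already been encoded. Everything in this example is hypothetical: the record IDs, the blocking keys, the bit positions standing in for Bloom filter encodings, the ground truth, and the 0.8 threshold.

```python
from collections import defaultdict

# Hypothetical encoded records: (record id, blocking key, Bloom filter bit positions).
SITE_A = [
    ("a1", "blk-1980", frozenset({1, 4, 9, 12, 20, 33})),
    ("a2", "blk-1980", frozenset({2, 5, 7, 18, 25, 40})),
    ("a3", "blk-1975", frozenset({3, 6, 11, 14, 22, 31})),
]
SITE_B = [
    ("b1", "blk-1980", frozenset({1, 4, 9, 12, 20, 35})),
    ("b2", "blk-1975", frozenset({3, 6, 11, 15, 23, 30})),
    ("b3", "blk-1990", frozenset({8, 10, 13, 17, 19, 21})),
]
TRUE_MATCHES = {("a1", "b1"), ("a3", "b2")}  # assumed ground truth for evaluation
THRESHOLD = 0.8                              # assumed match threshold


def dice(a, b):
    """Dice coefficient of two bit-position sets."""
    return 2 * len(a & b) / (len(a) + len(b))


def candidate_pairs(left, right):
    """Blocking: only pair records that share a blocking key."""
    blocks = defaultdict(list)
    for rec in right:
        blocks[rec[1]].append(rec)
    for rec in left:
        for other in blocks.get(rec[1], []):
            yield rec, other


# Comparing and classifying: score each candidate pair, keep those above the threshold.
pairs = list(candidate_pairs(SITE_A, SITE_B))
predicted = {(a[0], b[0]) for a, b in pairs if dice(a[2], b[2]) >= THRESHOLD}

# Evaluating: how much work blocking saved, and how good the links are.
reduction_ratio = 1 - len(pairs) / (len(SITE_A) * len(SITE_B))
true_pos = len(predicted & TRUE_MATCHES)
precision = true_pos / len(predicted) if predicted else 0.0
recall = true_pos / len(TRUE_MATCHES)

print(f"candidate pairs: {len(pairs)}, reduction ratio: {reduction_ratio:.2f}")
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```

In this toy data, blocking cuts nine possible comparisons to three, and the threshold accepts one of the two true matches, so precision is 1.0 while recall is 0.5. Lowering the threshold would trade precision for recall.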

Who Uses PPRL and What For? The Real-World PPRL Community

PPRL tools have been adopted across large-scale initiatives to allow seamless collaboration across healthcare networks, research institutes, and data repositories. Given the complex ethical and legal considerations that data linkage introduces, several institutions and government agencies have established working groups and PPRL governance frameworks. Such efforts aim to develop customized approaches to protecting individual privacy, accommodating organizational data stewardship requirements, and complying with laws and regulations during the linkage process.

Several well-established PPRL governance frameworks guide the use of PPRL tools and methods within a designated group or agency, including those in use at the Biomedical Research Informatics Computing System, the National Institute of Mental Health Data Archive (NDA) Repository, the National COVID Cohort Collaborative (N3C), and the National Patient-Centered Clinical Research Network (PCORnet). International data linkage has been explored in a publication by a task force coordinated by members of the International Rare Diseases Research Consortium in collaboration with the Global Alliance for Genomics and Health (GA4GH).

Privacy-preserving record linkage techniques are available as open-source tools or through commercial vendors. Examples of open-source tools include PPRL (R-based), clkhash/Anonlink (Python-based), and PRIMAT (Java-based), each of which is configurable for most of the milestones in a PPRL workflow. Another existing tool is the Privacy preserving EHR linkage tool.

Additionally, some vendors offer seamless, high-governance, and privacy-compliant PPRL infrastructures and services. Technologies recognized by some of the largest healthcare organizations and agencies today include the HealthVerity IPGE (Identity, Privacy, Governance, and Exchange) Platform, Datavant, the Senzing entity resolution software, and the Veeva Crossix Data Platform.

On the Horizon: Open Areas of PPRL Research and Development


PPRL algorithms depend heavily on how two organizations configure the parameters for data encoding, blocking, record comparison, and match classification. This configuration (the “schema”) includes the selection of the PII elements to encode as well as the relative weight of those attributes (for example, assigning more weight to a Social Security number than to a patient’s last name). Due to real-world variations in data, the linkage schema often needs to be optimized and iterated upon to produce high-quality, one-to-one linkages. An open area of research and development is the creation of tools for flexible schemas that accommodate linkage of multiple large datasets.
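As an illustration of attribute weighting and one-to-one linkage, the sketch below combines per-field similarity scores using schema weights and then greedily accepts the best-scoring pairs. The field names, weights, similarity values, threshold, and the greedy assignment strategy are all assumptions chosen for the example, not a recommended configuration.

```python
# Hypothetical schema: the fields to compare and the weight each one carries.
SCHEMA_WEIGHTS = {"ssn": 0.5, "last_name": 0.2, "first_name": 0.15, "dob": 0.15}


def weighted_score(field_sims: dict[str, float]) -> float:
    """Weighted average of per-field similarities; missing fields are skipped."""
    used = {f: w for f, w in SCHEMA_WEIGHTS.items() if f in field_sims}
    total = sum(used.values())
    if total == 0:
        return 0.0
    return sum(field_sims[f] * w for f, w in used.items()) / total


def one_to_one(scored_pairs, threshold):
    """Greedy assignment: accept the best pairs first, using each record once."""
    links, used_a, used_b = [], set(), set()
    for a_id, b_id, score in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if score >= threshold and a_id not in used_a and b_id not in used_b:
            links.append((a_id, b_id))
            used_a.add(a_id)
            used_b.add(b_id)
    return links


# Hypothetical per-field similarities for three candidate pairs.
candidates = [
    ("a1", "b1", {"ssn": 1.0, "last_name": 0.9, "first_name": 1.0, "dob": 1.0}),
    ("a1", "b2", {"ssn": 0.5, "last_name": 1.0, "first_name": 1.0, "dob": 1.0}),
    ("a2", "b2", {"last_name": 0.8, "first_name": 0.9, "dob": 1.0}),  # SSN missing
]
scored = [(a, b, weighted_score(sims)) for a, b, sims in candidates]
links = one_to_one(scored, threshold=0.7)
print(links)  # [('a1', 'b1'), ('a2', 'b2')]
```

The pair ("a1", "b2") clears the threshold on its own but is rejected because "a1" is already linked to a better match, which is the behavior a one-to-one constraint is meant to enforce. Changing the weights changes which pairs win, which is why schemas usually need iteration.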


Algorithmic Performance

Strategies to classify records as "matches" or "non-matches" in PPRL continue to incorporate machine learning methods to return high-confidence, one-to-one linkages. However, classification algorithms can be biased against certain subpopulations defined by one or more protected or sensitive attributes. In the context of PPRL, this can appear as a difference in algorithmic performance between two subpopulations in the data. In addition to producing correct linkages, machine learning models must incorporate strategies to minimize bias and maximize equity or fairness in the linkage results. Examination of algorithmic fairness in PPRL, particularly quantitative metrics and mitigation strategies, is an open area of research and development.
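One simple way to quantify such a performance difference is to compute linkage recall separately for each subpopulation and compare the results. The sketch below assumes a labeled evaluation set; the pairs, the predictions, and the group labels are hypothetical, and recall gap is only one of several possible fairness metrics.

```python
from collections import defaultdict


def recall_by_group(true_matches, predicted_matches, group_of):
    """Share of true matches that the linkage found, per subpopulation."""
    found = defaultdict(int)
    total = defaultdict(int)
    for pair in true_matches:
        group = group_of[pair]
        total[group] += 1
        if pair in predicted_matches:
            found[group] += 1
    return {group: found[group] / total[group] for group in total}


# Hypothetical evaluation data: six true matches, four of them recovered.
TRUE = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"),
        ("a4", "b4"), ("a5", "b5"), ("a6", "b6")}
PREDICTED = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
GROUP = {
    ("a1", "b1"): "group_x", ("a2", "b2"): "group_x", ("a3", "b3"): "group_x",
    ("a4", "b4"): "group_y", ("a5", "b5"): "group_y", ("a6", "b6"): "group_y",
}

recalls = recall_by_group(TRUE, PREDICTED, GROUP)
gap = max(recalls.values()) - min(recalls.values())
print(recalls, f"recall gap: {gap:.2f}")
```

Here the overall recall of 4 out of 6 hides the fact that every missed match falls in one group, which is the kind of disparity a fairness audit would flag.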


Computational Efficiency

As datasets continue to grow, PPRL algorithms face challenges with computational scalability and resource usage. While techniques like blocking and pre-processing can cut down on computational resource usage, other methods to scale PPRL up include parallelization via batch processing, distributed computing, and hardware acceleration. Additional work, recently reviewed, focuses on ensuring that all steps of a PPRL pipeline, including blocking, can support multiparty blocking and linkage of multiple large datasets.  
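The batch-processing idea can be sketched by splitting candidate pairs into batches and scoring the batches concurrently. This example uses a thread pool only to show the structure; CPU-bound scoring in Python generally needs separate processes or distributed workers to see a real speedup. The batch size, worker count, and encoded records are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def dice(a, b):
    """Dice coefficient of two bit-position sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def score_batch(batch):
    """Score every candidate pair in one batch."""
    return [(a_id, b_id, dice(a_bits, b_bits))
            for (a_id, a_bits), (b_id, b_bits) in batch]


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def score_in_parallel(pairs, batch_size=2, workers=2):
    """Score batches concurrently; results come back in the original order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(score_batch, batched(pairs, batch_size))
        return [row for batch in results for row in batch]


# Hypothetical encoded records (id, Bloom filter bit positions).
a1, a2, a3 = ("a1", {1, 2, 3, 4}), ("a2", {7, 8}), ("a3", {5, 6})
b1, b2 = ("b1", {1, 2, 3, 4}), ("b2", {3, 4, 5, 6})
candidate_pairs = [(a1, b1), (a1, b2), (a2, b1), (a2, b2), (a3, b2)]

scores = score_in_parallel(candidate_pairs)
print(scores[0])  # ('a1', 'b1', 1.0)
```

Because each batch is scored independently, the same structure carries over to a process pool or a cluster scheduler without changing the scoring function.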


Privacy Preservation

Recent work has demonstrated that data encoded via Bloom filters can potentially be re-identified, and that machine learning can reveal potentially sensitive attributes about data points used for training models. Methods to strengthen data security include asymmetric key cryptography, which uses local encryption keys for each data site, thus eliminating the need for a common secret encryption key. In response to the prospect of quantum computers that can break many of the public-key cryptosystems currently in use, post-quantum cryptographic systems are being developed to be secure against both quantum and classical computers.

PPRL technologies are an essential prerequisite for the development of AI-ready health datasets and, ultimately, for fully realizing the vast potential of precision medicine to transform healthcare. Privacy-enhancing analytics and technologies drive world-class biomedical research by helping organizations architect appropriate information flows to increase privacy, transparency, and trust in research and clinical operations. Enhanced data privacy capabilities will enable organizations to be ready for modernized health privacy legislation, proactively protect existing data assets from data breaches, facilitate adoption and development of AI aligned with organizational missions, and respond to growing demand signals from consumers.
