Aug 06, 2024 ● No License

Intelligent PII Redaction: Leveraging RoBERTa for Secure Document Anonymization

Mo Abdelhamid

Introduction

In the era of big data and digital information exchange, protecting personally identifiable information (PII) has become a critical concern for organizations and individuals alike. The inadvertent disclosure of sensitive personal data can lead to privacy breaches, identity theft, and legal complications. As the volume of digital documents containing PII continues to grow, manual redaction has become increasingly time-consuming, error-prone, and impractical. This publication presents an automated PII redaction model built on the RoBERTa (Robustly Optimized BERT Pretraining Approach) language model.

Data Source and Description

Our study uses the dataset from the 2014 n2c2 (National NLP Clinical Challenges) competition, specifically the task of de-identifying protected health information (PHI) in medical records. This dataset is widely recognized in the field of clinical natural language processing and provides a robust foundation for developing and evaluating PHI redaction models.
Key characteristics of the dataset:

Origin: The data originates from the n2c2 de-identification challenge, which aims to advance the state of the art in automatically removing personal health information from medical records.
Format: The original data is provided in XML format, structured to contain both the raw text of medical records and corresponding annotation tags for various types of PHI.
Data Structure:

The dataset is divided into two main sets: training-PHI-Gold-Set1 and training-PHI-Gold-Set2.
Each XML file in these sets represents a single medical record.
Within each file, there are two primary sections:
a) A TEXT element containing the raw text of the medical record.
b) A TAGS element listing all the PHI entities present in the text, along with their types and positions.

PHI Types: The original dataset includes a wide range of PHI types, including but not limited to:
- PATIENT and DOCTOR (which we consolidated into PERSON)
- LOCATION (including subcategories like STREET, CITY, STATE, COUNTRY)
- DATE
- PHONE (which we relabeled as PHONE_NUMBER)
- EMAIL
- Other types such as AGE, IDNUM, HOSPITAL, etc.
Content: The records are comprehensive medical documents containing sections such as patient history, examination results, medication lists, and clinical narratives. Many of them are long, which is why the preprocessing steps include handling for long sequences of text.
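
To make the file layout described above concrete, here is a minimal sketch of reading one record with Python's standard library. The sample record is invented for illustration, and the root element name and the attribute names (start, end, text, TYPE) are assumptions about the annotation layout rather than details taken from the dataset or the notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature record: a TEXT element with the raw note and a TAGS
# element listing PHI entities. Element/attribute names are assumptions.
SAMPLE_RECORD = """<record>
<TEXT><![CDATA[Seen by Dr. Jane Roe on 2090-01-05.]]></TEXT>
<TAGS>
<NAME id="P0" start="12" end="20" text="Jane Roe" TYPE="DOCTOR" />
<DATE id="P1" start="24" end="34" text="2090-01-05" TYPE="DATE" />
</TAGS>
</record>"""


def parse_record(xml_string):
    """Return (raw_text, entities) for one annotated record."""
    root = ET.fromstring(xml_string)
    text = root.findtext("TEXT")
    entities = []
    for tag in root.find("TAGS"):
        entities.append({
            "type": tag.get("TYPE"),
            "start": int(tag.get("start")),  # character offset, inclusive
            "end": int(tag.get("end")),      # character offset, exclusive
            "text": tag.get("text"),
        })
    return text, entities


text, entities = parse_record(SAMPLE_RECORD)
for ent in entities:
    # Sanity check: each annotated span should match the raw text.
    assert text[ent["start"]:ent["end"]] == ent["text"]
```

Checking each span against the raw text, as in the final loop, is a cheap way to catch offset mismatches before any labels are generated.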

In our preprocessing pipeline, we focused on a subset of the PHI types, specifically PERSON, LOCATION, PHONE_NUMBER, DATE, and EMAIL, as these represent the most critical and common types of personal information in medical records. This selective approach allows us to concentrate on the most impactful elements of PHI redaction while maintaining a manageable scope for our model development.
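
The label consolidation can be sketched as a simple lookup table. This is an illustrative version, not the notebook's code: the function name is hypothetical, and the exact set of location subcategories folded into LOCATION is an assumption based on the list above. Types without an entry (for example AGE, IDNUM, HOSPITAL) are dropped.

```python
# Assumed mapping from original PHI types to the five target labels.
LABEL_MAP = {
    "PATIENT": "PERSON",
    "DOCTOR": "PERSON",
    "LOCATION": "LOCATION",
    "STREET": "LOCATION",
    "CITY": "LOCATION",
    "STATE": "LOCATION",
    "COUNTRY": "LOCATION",
    "PHONE": "PHONE_NUMBER",
    "DATE": "DATE",
    "EMAIL": "EMAIL",
}


def consolidate(entities):
    """Relabel entities to the target scheme and drop unmapped types."""
    kept = []
    for ent in entities:
        target = LABEL_MAP.get(ent["type"])
        if target is not None:
            kept.append({**ent, "type": target})
    return kept


example = [
    {"type": "DOCTOR", "start": 12, "end": 20},
    {"type": "AGE", "start": 30, "end": 32},
    {"type": "PHONE", "start": 40, "end": 52},
]
print([e["type"] for e in consolidate(example)])  # ['PERSON', 'PHONE_NUMBER']
```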

Preprocessing

For information about the preprocessing steps, check the preprocessing_n2c2.ipynb notebook in the resources section.
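
The notebook is not reproduced here. As one illustration of handling long records: the standard RoBERTa checkpoints accept at most 512 tokens per input, so long documents are commonly split into overlapping windows. The sketch below assumes that approach; the function name, window size, and stride are illustrative choices and may differ from what the notebook does.

```python
def split_into_windows(tokens, labels, max_len=256, stride=128):
    """Split parallel token/label lists into overlapping windows.

    Consecutive windows start `stride` items apart, so neighbouring
    windows share `max_len - stride` items of context.
    """
    if not 0 < stride <= max_len:
        raise ValueError("stride must be between 1 and max_len")
    windows = []
    start = 0
    while True:
        end = start + max_len
        windows.append((tokens[start:end], labels[start:end]))
        if end >= len(tokens):  # this window reached the end of the record
            break
        start += stride
    return windows


tokens = [f"tok{i}" for i in range(10)]
labels = ["O"] * 10
for window_tokens, _ in split_into_windows(tokens, labels, max_len=4, stride=2):
    print(window_tokens)
```

The overlap gives entities near a window boundary a second chance to appear with enough surrounding context in the neighbouring window.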

Training RoBERTa for PII Redaction

For information about the training setup, check the roberta_for_ner.ipynb notebook in the resources section.
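
Fine-tuning a model for named entity recognition generally requires one label per token. The sketch below shows one common way to derive such labels from character-offset annotations using the BIO scheme. It is an assumption-laden illustration: it splits on whitespace to stay within the standard library, whereas RoBERTa operates on subword tokens, so the notebook's alignment logic will differ in detail.

```python
import re


def bio_labels(text, entities):
    """Assign a BIO label to each whitespace-delimited token.

    `entities` is a list of dicts with "start", "end" (character offsets,
    end exclusive) and "type". A token overlapping an entity gets B-<type>
    if it covers the entity's first character, otherwise I-<type>.
    """
    tokens, labels = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        label = "O"
        for ent in entities:
            if tok_start < ent["end"] and tok_end > ent["start"]:
                prefix = "B" if tok_start <= ent["start"] else "I"
                label = f"{prefix}-{ent['type']}"
                break
        tokens.append(match.group())
        labels.append(label)
    return tokens, labels


text = "Seen by Dr. Jane Roe on 2090-01-05."
entities = [
    {"start": 12, "end": 20, "type": "PERSON"},
    {"start": 24, "end": 34, "type": "DATE"},
]
tokens, labels = bio_labels(text, entities)
print(list(zip(tokens, labels)))
```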

Results

Entity type	Recall	Precision	F1-score
PERSON	0.98	0.97	0.97
DATE	0.98	0.98	0.98
LOCATION	0.94	0.91	0.93
PHONE_NUMBER	0.95	0.83	0.89

micro	0.98	0.97	0.97
macro	0.96	0.92	0.94
weighted	0.98	0.97	0.97
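
As a reading aid for the averages, the macro row is the unweighted mean of the four per-class rows, which the short sketch below reproduces from the table's own numbers. The function names are illustrative. The micro and weighted rows depend on per-class entity counts, which are not listed here, so they cannot be recomputed from the table alone.

```python
from statistics import mean

# Per-class scores copied from the table above: (recall, precision, F1).
PER_CLASS = {
    "PERSON": (0.98, 0.97, 0.97),
    "DATE": (0.98, 0.98, 0.98),
    "LOCATION": (0.94, 0.91, 0.93),
    "PHONE_NUMBER": (0.95, 0.83, 0.89),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(per_class):
    """Unweighted mean of each metric column, rounded to two decimals."""
    columns = zip(*per_class.values())
    return tuple(round(mean(column), 2) for column in columns)


print(macro_average(PER_CLASS))  # (0.96, 0.92, 0.94)
```

Recomputing F1 from the rounded precision and recall values with f1_score can differ from the reported F1 by 0.01, because the table's inputs are themselves rounded.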

Redaction of Personally Identifiable Information (PII)

To enhance the privacy and security of our dataset, we developed a script to redact specific types of PII using regular expressions and the Faker library. This redaction process involves identifying PII elements such as emails, phone numbers, and URLs in the text and replacing them with realistic fake data. The steps are as follows:

Regex Patterns: We defined comprehensive regex patterns to accurately identify emails, phone numbers, and URLs within the text data. These patterns are designed to capture various formats, ensuring robust detection.

Faker Library: We utilized the Faker library to generate fake data. Faker provides realistic and randomly generated values that mimic the structure of real data. This helps maintain the integrity and usability of the text while ensuring that sensitive information is anonymized.

Search and Replace Functionality: Our script employs a search and replace mechanism where identified PII elements are substituted with fake data. For instance, email addresses detected by the regex pattern are replaced with fake email addresses generated by Faker.

Implementation: The implementation is straightforward, involving reading the text, applying regex-based search, and replacing detected PII with fake data. This ensures that the resulting text is free from real PII, mitigating privacy risks while retaining the document’s format and readability.
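
The steps above can be sketched as follows. This is a simplified illustration rather than the repository's script: the regex patterns are deliberately basic (the phone pattern only covers common North American formats), and the function names are hypothetical. The Faker calls used (email, url, phone_number, seed_instance) are part of the Faker library; the import is kept inside a helper so the regex logic can be run without Faker installed.

```python
import re

# One combined pattern with a named group per PII type. A single pass with
# re.sub means replacement values are never re-scanned by a later pattern.
PII_PATTERN = re.compile(
    r"(?P<EMAIL>[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"
    r"|(?P<URL>https?://[^\s<>\"']+)"
    r"|(?P<PHONE_NUMBER>(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d))"
)


def faker_generators(seed=None):
    """Build one fake-value generator per PII type using Faker."""
    from faker import Faker  # third-party: pip install Faker

    fake = Faker()
    if seed is not None:
        fake.seed_instance(seed)  # reproducible output for a given seed
    return {
        "EMAIL": fake.email,
        "URL": fake.url,
        "PHONE_NUMBER": fake.phone_number,
    }


def redact(text, generators):
    """Replace every regex match with a value from the matching generator."""
    def replace(match):
        # lastgroup is the name of the alternative that matched.
        return generators[match.lastgroup]()

    return PII_PATTERN.sub(replace, text)
```

With Faker installed, a call such as redact(document_text, faker_generators(seed=0)) returns the text with each detected email, URL, and phone number swapped for a synthetic value. Passing the generators as a dictionary also makes it easy to substitute fixed placeholders during testing.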

Usage

To use the model on your own data, download the GitHub repository and follow the usage section in the README.

Example

[Figure: example output, with the redacted text shown in red and the replacement text shown in green.]

Summary

This publication introduces an automated PII redaction system built on the RoBERTa language model, designed to address the challenge of protecting personally identifiable information in large volumes of digital documents. Using the n2c2 2014 dataset, which contains medical records annotated with various types of protected health information, the model reaches F1-scores between 0.89 and 0.98 on personal names, dates, locations, and phone numbers, while a regex-based step handles emails, phone numbers, and URLs.

Key Features:

Data Utilization: Draws from the structured n2c2 dataset, focusing on critical PHI types for comprehensive redaction.
Advanced Processing: Employs RoBERTa for deep learning-based named entity recognition, ensuring precise PII detection and redaction.
Efficiency: Combines regex patterns and the Faker library to replace real PII with realistic synthetic alternatives, enhancing data privacy while maintaining text integrity.

Performance Results:

The model achieved a micro-averaged recall of 0.98, precision of 0.97, and F1-score of 0.97, with per-class F1-scores ranging from 0.89 (PHONE_NUMBER) to 0.98 (DATE).

This solution not only enhances the security of sensitive information in medical records but also offers a scalable approach to PII redaction, suitable for various domains requiring stringent data privacy measures.