Protecting personally identifiable information (PII) has become a critical concern for organizations and individuals alike. Inadvertent disclosure of sensitive personal data can lead to privacy breaches, identity theft, and legal complications. As the volume of digital documents containing PII grows, manual redaction becomes increasingly time-consuming, error-prone, and impractical. This publication presents an automated PII redaction model built on the RoBERTa (Robustly Optimized BERT Pretraining Approach) language model.
Our study utilizes the dataset from the n2c2 2014 (National NLP Clinical Challenges) competition, specifically focusing on the task of de-identification of protected health information (PHI) in medical records. This dataset is widely recognized in the field of clinical natural language processing and provides a robust foundation for developing and evaluating PHI redaction models.
Key characteristics of the dataset:
In our preprocessing pipeline, we focused on a subset of the PHI types, specifically PERSON, LOCATION, PHONE_NUMBER, DATE, and EMAIL, as these represent the most critical and common types of personal information in medical records. This selective approach allows us to concentrate on the most impactful elements of PHI redaction while maintaining a manageable scope for our model development.
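As a rough illustration of what restricting the label set can look like, the sketch below converts character-span annotations into token-level BIO tags and drops every tag outside the chosen subset. The `TAG_TO_LABEL` mapping, the whitespace tokenization, and the `to_bio` helper are assumptions made for this example; the actual preprocessing is in the notebook and may differ.

```python
import re

# Hypothetical mapping from source annotation tags to the five labels kept
# in this work. The real mapping is defined in the preprocessing notebook.
TAG_TO_LABEL = {
    "PATIENT": "PERSON",
    "DOCTOR": "PERSON",
    "HOSPITAL": "LOCATION",
    "STREET": "LOCATION",
    "CITY": "LOCATION",
    "STATE": "LOCATION",
    "PHONE": "PHONE_NUMBER",
    "DATE": "DATE",
    "EMAIL": "EMAIL",
}


def to_bio(text, annotations):
    """Turn (start, end, tag) character spans into (token, BIO label) pairs.

    Tags missing from TAG_TO_LABEL are ignored, so their tokens become "O".
    Tokens are whitespace-delimited, which is a simplification.
    """
    spans = sorted(
        (start, end, TAG_TO_LABEL[tag])
        for start, end, tag in annotations
        if tag in TAG_TO_LABEL
    )
    labeled = []
    for match in re.finditer(r"\S+", text):
        label = "O"
        for start, end, name in spans:
            # A token belongs to a span if their character ranges overlap.
            if match.start() < end and match.end() > start:
                prefix = "B" if match.start() <= start else "I"
                label = f"{prefix}-{name}"
                break
        labeled.append((match.group(), label))
    return labeled
```

Dropping a tag this way turns its tokens into ordinary "O" context, so the model never learns to flag them; that is the trade-off of narrowing the scope.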
For details on the preprocessing steps, see the preprocessing_n2c2.ipynb notebook in the resources section.
For details on the training setup, see the roberta_for_ner.ipynb notebook in the resources section.
Entity | Recall | Precision | F1-score |
---|---|---|---|
PERSON | 0.98 | 0.97 | 0.97 |
DATE | 0.98 | 0.98 | 0.98 |
LOCATION | 0.94 | 0.91 | 0.93 |
PHONE_NUMBER | 0.95 | 0.83 | 0.89 |
micro | 0.98 | 0.97 | 0.97 |
macro | 0.96 | 0.92 | 0.94 |
weighted | 0.98 | 0.97 | 0.97 |
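The macro row is the unweighted mean of the four per-entity rows, which can be checked directly from the table. The short sketch below does that; the `macro_average` helper is named here for illustration only. The micro and weighted rows cannot be reproduced the same way because they depend on per-entity support counts, which the table does not list.

```python
# Per-entity scores copied from the table above.
per_entity = {
    "PERSON": {"recall": 0.98, "precision": 0.97, "f1": 0.97},
    "DATE": {"recall": 0.98, "precision": 0.98, "f1": 0.98},
    "LOCATION": {"recall": 0.94, "precision": 0.91, "f1": 0.93},
    "PHONE_NUMBER": {"recall": 0.95, "precision": 0.83, "f1": 0.89},
}


def macro_average(scores):
    """Unweighted mean of each metric across entity types, rounded to 2 places."""
    metrics = ("recall", "precision", "f1")
    return {
        metric: round(sum(row[metric] for row in scores.values()) / len(scores), 2)
        for metric in metrics
    }


macro = macro_average(per_entity)
print(macro)  # {'recall': 0.96, 'precision': 0.92, 'f1': 0.94}
```

Because every entity type counts equally, the lower PHONE_NUMBER precision pulls the macro precision down to 0.92, while the micro and weighted figures stay at 0.97.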
To enhance the privacy and security of our dataset, we developed a script to redact specific types of PII using regular expressions and the Faker library. This redaction process involves identifying PII elements such as emails, phone numbers, and URLs in the text and replacing them with realistic fake data. The steps are as follows:
Regex Patterns: We defined comprehensive regex patterns to accurately identify emails, phone numbers, and URLs within the text data. These patterns are designed to capture various formats, ensuring robust detection.
Faker Library: We utilized the Faker library to generate fake data. Faker provides realistic and randomly generated values that mimic the structure of real data. This helps maintain the integrity and usability of the text while ensuring that sensitive information is anonymized.
Search and Replace Functionality: Our script employs a search and replace mechanism where identified PII elements are substituted with fake data. For instance, email addresses detected by the regex pattern are replaced with fake email addresses generated by Faker.
Implementation: The implementation is straightforward, involving reading the text, applying regex-based search, and replacing detected PII with fake data. This ensures that the resulting text is free from real PII, mitigating privacy risks while retaining the document’s format and readability.
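The steps above can be sketched as follows. This is a minimal illustration, not the project's script: the regex patterns are simplified assumptions, and `PII_PATTERN`, `redact`, and `faker_generators` are names invented for this example. The Faker calls used (`email`, `url`, `phone_number`, `seed_instance`) are standard Faker methods. All patterns are combined into one expression so the text is scanned once and replacement values are never re-matched.

```python
import re

# Simplified, illustrative patterns. Real-world formats vary far more.
PII_PATTERN = re.compile(
    r"(?P<EMAIL>[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"
    r"|(?P<URL>https?://[^\s<>]*[^\s<>.,;:!?)])"
    r"|(?P<PHONE>(?<!\d)(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}(?!\d))"
)


def redact(text, generators):
    """Replace every match with the output of the generator for its PII type.

    `generators` maps "EMAIL", "URL", and "PHONE" to zero-argument callables.
    """
    def _substitute(match):
        # lastgroup is the name of the alternative that matched.
        return generators[match.lastgroup]()

    return PII_PATTERN.sub(_substitute, text)


def faker_generators(seed=None):
    """Build generators backed by Faker (third-party: pip install Faker)."""
    from faker import Faker  # imported lazily so redact() works without it

    fake = Faker()
    if seed is not None:
        fake.seed_instance(seed)  # makes the fake values reproducible
    return {"EMAIL": fake.email, "URL": fake.url, "PHONE": fake.phone_number}
```

Calling `redact(text, faker_generators())` swaps each detected value for a Faker-generated one of the same type. Passing plain placeholder callables instead (for example `lambda: "<EMAIL>"`) is a convenient way to inspect what the patterns catch before trusting them on real records.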
To use the model on your own data, download the GitHub repository and follow the usage section in the readme.
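For orientation, the replacement step that follows entity prediction can be sketched as below. It assumes the model's predictions are available as dictionaries with `start`, `end`, and `entity_group` keys, which is the shape produced by a Hugging Face token-classification pipeline with an aggregation strategy. The repository's own interface may differ, and `replace_spans` is a hypothetical helper written for this example.

```python
def replace_spans(text, entities, make_replacement):
    """Replace each predicted entity span with generated text.

    `entities` is a list of dicts with "start", "end", and "entity_group".
    `make_replacement` takes the entity label and returns the substitute.
    Spans are processed right to left so that the character offsets of
    spans earlier in the text remain valid after each substitution.
    """
    redacted = text
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        substitute = make_replacement(entity["entity_group"])
        redacted = redacted[: entity["start"]] + substitute + redacted[entity["end"]:]
    return redacted


note = "Jane Smith was admitted on 2014-03-02."
predicted = [
    {"entity_group": "PERSON", "start": 0, "end": 10},
    {"entity_group": "DATE", "start": 27, "end": 37},
]
print(replace_spans(note, predicted, lambda label: f"[{label}]"))
```

The `make_replacement` callable is where a fake-data generator could be plugged in instead of the bracketed labels used here.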
You can see the redacted text in red and the text used as replacement in green:
This publication introduces an advanced automated PII redaction system leveraging the RoBERTa language model, designed to address the challenges of protecting personally identifiable information in large volumes of digital documents. Utilizing the n2c2 2014 dataset, which contains medical records annotated with various types of protected health information, the model demonstrates high accuracy in identifying and redacting sensitive data such as personal names, locations, phone numbers, and email addresses.
Key Features:
Performance Results:
This solution not only enhances the security of sensitive information in medical records but also offers a scalable approach to PII redaction, suitable for various domains requiring stringent data privacy measures.