Scrubber 3.0 Beta Released!

Scrubber v3.0 now uses Apache cTAKES to provide parallel concept extraction during de-identification. The Apache cTAKES project graciously invited us to port the Scrubber de-identification pipeline to the Apache-hosted codebase. The 2.X maintenance version will remain available. The publication describing this work has been accepted with minor revisions; this site will be updated shortly to reflect the described methods and results.

McMurry* AJ, Fitch* B, Savova G, Kohane IS, Reis BY. “Improved de-identification of physician notes through integrative modeling of both identifying and non-identifying medical text”, BMC Medical Informatics and Decision Making. Accepted with minor revisions, Jan 2013.

Motivation

"Free Text" medical notes contain information which can be used to locate human biospecimens and even predict patient outcomes.
Because medical notes often contain Protected Health Information, it is necessary to "scrub" notes of sensitive information prior to sharing with a clinical investigator. Towards this goal, we have developed Open Source software that removes PHI from raw text, XML, or databases.
The software has been approved for use by numerous hospital IRBs, and has been manually reviewed by physician experts.

Challenge

"Free Text" medical notes can be "messy", often lacking even complete sentences. Assuring that Free Text data is "cleaned" prior to sharing with an investigator is challenging. Furthermore, differences between hospital coding styles make it difficult to reuse NLP technology at other institutions. Distinguishing pertinent clinical facts from sensitive patient identifiers in free text clinical narratives is a difficult classification task. 
One reason is that variations in physician writing styles have limited how broadly NLP algorithms can be utilized in multi-site studies. 
Another reason is that hospital IRBs have differing perspectives regarding "privacy risk to research benefit". 
As a result, relatively few physician notes are used in research studies despite the wealth of high quality clinical phenotypes they contain.

Approach

De-Identify Patients

The scrubber software removes confidential identifiers from structured XML or plain text by comparing phrases in the input text to a list of known identifiers (names, states, etc.) and a series of Regular Expressions. While typically used to make confidential reports compliant with HIPAA standards, this utility is practical for any organization looking to protect the privacy of its records, regardless of whether they are used for medical purposes.
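
The matching step described above can be illustrated with a minimal Python sketch. It is not the Scrubber implementation: the identifier list, the two patterns, the `[PHI]` placeholder, and the `scrub` function name are all assumptions chosen for illustration.

```python
import re

# Hypothetical identifier list and patterns, for illustration only.
KNOWN_IDENTIFIERS = {"smith", "boston", "massachusetts"}
PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # SSN-like number
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),  # date such as 01/02/2013
]


def scrub(text, placeholder="[PHI]"):
    """Mask regex matches first, then any word found in the identifier list."""
    for pattern in PATTERNS:
        text = pattern.sub(placeholder, text)

    def mask(match):
        word = match.group(0)
        # Case-insensitive lookup against the known identifiers.
        return placeholder if word.lower() in KNOWN_IDENTIFIERS else word

    return re.sub(r"[A-Za-z]+", mask, text)


print(scrub("Mr. Smith seen in Boston on 01/02/2013, SSN 123-45-6789."))
```

A real deployment would need far larger identifier lists and more patterns; the sketch only shows how list lookup and Regular Expressions combine.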

Acknowledge Incomplete Sentence Structure

We use a simple approach that relaxes grammar rules in recognition that proper grammar is not always followed. Instead, we will train our program to "score" words according to their frequency in medical literature: frequently occurring words are unlikely to be identifiers, while infrequently occurring words potentially could be. For more on word scoring, see the ROADMAP.
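
As a rough illustration of frequency-based word scoring, the Python sketch below scores each word by its relative frequency in a reference corpus and flags rare words as possible identifiers. The function names, the tiny corpus, and the threshold are hypothetical; the scoring actually planned for Scrubber is described in the ROADMAP and may differ.

```python
import re
from collections import Counter


def build_frequency_table(corpus_texts):
    """Count lower-cased word occurrences across reference texts."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def score_word(word, counts, total):
    """Relative frequency in the reference corpus; 0.0 for unseen words."""
    return counts[word.lower()] / total if total else 0.0


def likely_identifiers(note, counts, threshold):
    """Return words from the note whose score falls below the threshold."""
    total = sum(counts.values())
    words = re.findall(r"[A-Za-z]+", note)
    return [w for w in words if score_word(w, counts, total) < threshold]


# Stand-in for medical literature; a real table would be far larger.
corpus = [
    "patient presents with chest pain",
    "the patient reports pain in the chest",
]
counts = build_frequency_table(corpus)
print(likely_identifiers("Patient Zelinski reports chest pain", counts, 0.05))
```

Because the score depends only on individual words, it still works on notes that lack complete sentences.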

Enable site-specific configuration without recompiling the software

The HMS Scrubber builds on years of community progress in de-identification and NLP. 
In 2006, Beckwith developed and validated a rule-based system to de-identify pathology reports. 
This widely accessed de-id program performed well in the pathology setting and was approved by four IRBs at Harvard teaching hospitals. 

Porting this software to other hospital settings and note types proved difficult and required fine-tuning the regular expressions for each installation. 
This led to the creation of the "3.X" Scrubber, combining autocoding and de-identification tasks to maximize research utility and minimize site-specific customization.

This new approach uses machine learning to analyze similarities and differences between physician notes, medical dictionaries, and medical journal publications. We allow user-provided data dictionaries and regular expressions to be included (such as doctor names and patient metadata).
For details, see the Scrubber User Guide.
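
One way to picture site-specific configuration without recompiling is to load dictionaries and regular expressions from a file at run time. The Python sketch below assumes a JSON layout with `dictionary_terms` and `regular_expressions` keys; this schema and the `load_site_config` name are hypothetical and are not the format documented in the Scrubber User Guide.

```python
import json
import re


def load_site_config(json_text):
    """Parse a site config (hypothetical schema) into terms and compiled patterns."""
    config = json.loads(json_text)
    terms = {t.lower() for t in config.get("dictionary_terms", [])}
    patterns = [re.compile(p) for p in config.get("regular_expressions", [])]
    return terms, patterns


# Example site file contents: local doctor names and a record-number pattern.
EXAMPLE_CONFIG = r"""
{
  "dictionary_terms": ["Welby", "Jones"],
  "regular_expressions": ["MRN[: ]*\\d{6,8}"]
}
"""

terms, patterns = load_site_config(EXAMPLE_CONFIG)
print(sorted(terms))
print(bool(patterns[0].search("MRN: 1234567")))
```

Keeping these lists in data files means a site can add its own names or patterns by editing the file, with no change to the program itself.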