Motivation

"Free Text" medical notes contain information which can be used to locate human biospecimens and even predict patient outcomes.
Because medical notes often contain Protected Health Information (PHI), notes must be "scrubbed" of sensitive information before they are shared with a clinical investigator. To that end, we have developed open source software that removes PHI from raw text, XML, or databases.
The software has been approved for use by numerous hospital IRBs and has been manually reviewed by physician experts.

Challenge

"Free Text" medical notes can be "messy", often lacking even complete sentences. Ensuring that Free Text data is "cleaned" before it is shared with an investigator is challenging. Furthermore, differences between hospital coding styles make it difficult to reuse NLP technology at other institutions. As a result, relatively few medical notes are used in research despite the wealth of data they provide.

Approach

De-Identify Patients

The scrubber software removes confidential identifiers from structured XML or plain text by comparing phrases in the input text against a list of known identifiers (names, states, etc.) and a series of regular expressions. While it is typically used to make confidential reports compliant with HIPAA standards, the utility is practical for any organization that wants to protect the privacy of its records, regardless of whether they are used for medical purposes.
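The matching step described above can be sketched roughly as follows. This is a minimal illustration, not the scrubber's actual implementation: the identifier list, the patterns, the placeholder tags, and the `scrub` function name are all assumptions made for the example.

```python
import re

# Hypothetical identifier list; a real deployment would load much larger dictionaries.
KNOWN_IDENTIFIERS = {"smith", "jones", "massachusetts"}

# Illustrative patterns only; these are not the scrubber's own expressions.
PHI_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),
]

WORD = re.compile(r"[A-Za-z]+")


def scrub(text):
    """Replace dictionary hits, then pattern matches, with placeholder tags."""

    def replace_word(match):
        word = match.group(0)
        # Case-insensitive lookup against the known-identifier list.
        return "[IDENTIFIER]" if word.lower() in KNOWN_IDENTIFIERS else word

    text = WORD.sub(replace_word, text)
    for pattern, tag in PHI_PATTERNS:
        text = pattern.sub(tag, text)
    return text


print(scrub("Pt Smith seen 3/14/2011, call 617-555-0100."))
```

Dictionary lookups run before the patterns here so that the digit-based expressions never have to consider text that has already been replaced by a tag.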

Acknowledge Incomplete Sentence Structure

Because clinical notes do not always follow proper grammar, we use a simple approach that does not depend on strict grammar rules. Instead, we will train the program to "score" words by their frequency in medical literature: frequently occurring words are unlikely to be identifiers, while infrequently occurring words might be. For more on word scoring, see the ROADMAP.
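One way to picture frequency-based scoring is the sketch below. It is only an assumption about how such scoring could work, since the ROADMAP defines the real design: the function names, the tiny stand-in corpus, and the `min_count` threshold are all hypothetical.

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z]+")


def build_frequencies(corpus_texts):
    """Count how often each lowercased word appears in a reference corpus."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(word.lower() for word in WORD.findall(text))
    return counts


def flag_rare_words(note, counts, min_count=1):
    """Return the words in a note seen fewer than min_count times in the corpus."""
    return [word for word in WORD.findall(note) if counts[word.lower()] < min_count]


# Tiny stand-in for medical literature (hypothetical data).
corpus = [
    "the patient has chest pain",
    "the patient reports pain in the chest",
]
counts = build_frequencies(corpus)
print(flag_rare_words("patient Okafor has chest pain", counts))
```

Words flagged this way are only candidates for review or removal; a rare word may be an unusual clinical term rather than an identifier.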

Enable site-specific configuration without recompiling the software

Users can supply their own data dictionaries and regular expressions (for example, doctor names or patient metadata) to be included in scrubbing.
For details, see the Scrubber User Guide.
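As a rough illustration of configuration that changes without recompiling, the sketch below reads site-specific terms and expressions from a JSON document at run time. The JSON layout, the key names `dictionary_terms` and `regular_expressions`, and both function names are assumptions for this example; the Scrubber User Guide describes the actual configuration format.

```python
import json
import re


def parse_site_config(json_text):
    """Parse site-specific terms and patterns from a JSON string (hypothetical layout)."""
    raw = json.loads(json_text)
    terms = {term.lower() for term in raw.get("dictionary_terms", [])}
    patterns = [re.compile(expr) for expr in raw.get("regular_expressions", [])]
    return terms, patterns


def load_site_config(path):
    """Read the same hypothetical layout from a file supplied by the site."""
    with open(path, encoding="utf-8") as handle:
        return parse_site_config(handle.read())


example = """
{
  "dictionary_terms": ["Okafor", "Lindqvist"],
  "regular_expressions": ["\\\\bMRN[- ]?\\\\d{6,8}\\\\b"]
}
"""
terms, patterns = parse_site_config(example)
print(sorted(terms))
print(bool(patterns[0].search("Chart MRN-1234567 reviewed")))
```

Because the terms and expressions live in a data file, a site can add a new doctor name or record-number format by editing that file alone.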