Motivation

"Free Text" medical notes contain information that can be used to locate human biospecimens and even predict patient outcomes.
Because medical notes often contain Protected Health Information, it is necessary to "scrub" notes of sensitive information prior to sharing with a clinical investigator. Towards this goal, we have developed Open Source software that removes PHI from raw text, XML, or databases.
The software has been approved for use by numerous hospital IRBs, and has been manually reviewed by physician experts.

Challenge

"Free Text" medical notes can be "messy", often lacking even complete sentences. Ensuring that Free Text data is "cleaned" prior to sharing with an investigator is challenging. Furthermore, differences between hospital coding styles make it difficult to reuse NLP technology at other institutions. As a result, relatively few medical notes are used in the research setting despite the wealth of data that notes provide.

Approach

De-Identify Patients

The scrubber software removes confidential identifiers from structured XML or plain text by comparing the input text phrases to a list of known identifiers (names, states, etc.) and a series of Regular Expressions. While typically used to make confidential reports compliant with HIPAA standards, this utility is practical for any organization looking to protect the privacy of its records, regardless of whether they are used for medical purposes.
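
The matching step described above can be sketched roughly as follows. This is a minimal illustration in Python, not the scrubber's actual code: the identifier list, the three regular expressions, the `[REDACTED]` placeholder, and the function name `scrub` are all assumptions made for the example.

```python
import re

# Hypothetical list of known identifiers (names, states, etc.), lower-cased.
KNOWN_IDENTIFIERS = {"smith", "jones", "massachusetts"}

# Hypothetical patterns for strings that look like identifiers.
PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # SSN-like number
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),  # date-like string
    re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),        # phone-like number
]

PLACEHOLDER = "[REDACTED]"  # assumed replacement token


def scrub(text: str) -> str:
    """Replace pattern matches and known identifiers with a placeholder."""
    # Pass 1: regular expressions.
    for pattern in PATTERNS:
        text = pattern.sub(PLACEHOLDER, text)

    # Pass 2: compare each word against the known-identifier list.
    def replace_word(match: re.Match) -> str:
        word = match.group(0)
        return PLACEHOLDER if word.lower() in KNOWN_IDENTIFIERS else word

    return re.sub(r"[A-Za-z]+", replace_word, text)


if __name__ == "__main__":
    print(scrub("Mr. Smith seen 01/02/2003, SSN 123-45-6789."))
```

A real deployment would load the identifier list and patterns from site-maintained resources rather than hard-coding them as done here.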

Acknowledge Incomplete Sentence Structure

We use a simple approach that relaxes grammar rules, recognizing that proper grammar is not always followed. Instead, the program will be trained to "score" words according to their frequency in medical literature: frequently occurring words are unlikely to be identifiers, while infrequently occurring words potentially are. For more on word scoring, see the ROADMAP.
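
One way such frequency-based scoring could look is sketched below. The word-scoring feature is described here only as planned, so everything in this example is an assumption: the function names `build_frequencies` and `candidate_identifiers`, the tiny stand-in corpus, and the cut-off of two occurrences.

```python
import re
from collections import Counter


def build_frequencies(corpus):
    """Count how often each lower-cased word occurs in the reference corpus."""
    counts = Counter()
    for document in corpus:
        counts.update(re.findall(r"[a-z]+", document.lower()))
    return counts


def candidate_identifiers(note, frequencies, threshold=2):
    """Return words from the note seen fewer than `threshold` times in the corpus."""
    candidates = []
    for word in re.findall(r"[A-Za-z]+", note):
        # Counter returns 0 for unseen words, so rare and unknown words both qualify.
        if frequencies[word.lower()] < threshold:
            candidates.append(word)
    return candidates


if __name__ == "__main__":
    # Stand-in for medical literature; a real corpus would be far larger.
    literature = [
        "the patient has a tumor",
        "the tumor was removed from the patient",
    ]
    freqs = build_frequencies(literature)
    print(candidate_identifiers("Patient Okafor tumor", freqs))
```

Note that no sentence parsing is involved: each word is scored independently, which is what lets the approach tolerate fragmentary notes.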

Enable site-specific configuration without recompiling the software

Much is made in this era of translational research of the need to cross boundaries between biomedical research organizations that individually have too few patient-subjects. Concurrently, the noisiness and high dimensionality of genome-scale measurements have made it all the more pressing to increase sample size to overcome problems of false discovery and irreproducibility that stem in part from insufficiently powered studies. Building on the success of SPIN, the Pathology Specimen Locator (PSL) is a core developed to facilitate translational research requiring human specimens. The PSL is a distributed network of databases containing de-identified information on archived specimens from IRB-approved repositories within Dana-Farber/Harvard Cancer Center affiliated institutions.

Approach

Using a peer-to-peer architecture, institutions become PSL members (nodes) by securing institutional review board (IRB) approvals and deploying the PSL software. At any time, an institution can withdraw from the network without leaving its data behind or disabling the network. PSL nodes can serve as peers or supernodes to query local databases or networks of child nodes, respectively.
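
The peer and supernode roles can be pictured with a small in-memory model. This is only a sketch of the idea, not the PSL software: the class names `Peer` and `Supernode`, the `query` and `withdraw` methods, and the use of plain string lists as "local databases" are assumptions made for illustration.

```python
class Peer:
    """A node that answers queries from its own local database."""

    def __init__(self, name, reports):
        self.name = name
        self.reports = list(reports)  # stand-in for a local de-identified database

    def query(self, term):
        hits = sum(1 for report in self.reports if term.lower() in report.lower())
        return {self.name: hits}


class Supernode:
    """A node that forwards queries to its child nodes and merges the results."""

    def __init__(self, name, children):
        self.name = name
        self.children = list(children)

    def query(self, term):
        results = {}
        for child in self.children:
            results.update(child.query(term))
        return results

    def withdraw(self, child_name):
        # The child's data lives only on the child, so dropping the node
        # leaves nothing behind and the remaining network keeps working.
        self.children = [c for c in self.children if c.name != child_name]


if __name__ == "__main__":
    site_a = Peer("A", ["breast carcinoma", "colon adenoma"])
    site_b = Peer("B", ["Breast biopsy"])
    root = Supernode("root", [site_a, site_b])
    print(root.query("breast"))
    root.withdraw("B")
    print(root.query("breast"))
```

Because a supernode holds only references to its children, supernodes can also be nested under other supernodes in this model.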

PSL allows institutions to expose de-identified pathology reports while keeping corresponding reports containing Protected Health Information (PHI) disconnected from the Internet. A randomly generated unique identifier is assigned to both the PHI and de-identified reports in a locally controlled codebook. The machine storing the codebook is disconnected from the Internet and protected according to each participating site's policies. The resulting solution is flexible and compliant with HIPAA regulations.
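
The codebook idea can be illustrated with a short sketch. It assumes, hypothetically, that the codebook is a simple mapping kept on the offline machine and that identifiers are random UUIDs; the function name `register_report` and the record layout are invented for the example and are not taken from the PSL software.

```python
import uuid


def register_report(codebook, phi_report, deidentified_report):
    """Link a PHI report and its de-identified copy through one random identifier.

    `codebook` stays on the offline machine; only the returned record,
    which contains no PHI, would be exposed to the network.
    """
    report_id = uuid.uuid4().hex      # random identifier, not derived from patient data
    codebook[report_id] = phi_report  # offline: identifier -> original report
    return {"id": report_id, "text": deidentified_report}


if __name__ == "__main__":
    codebook = {}
    public_record = register_report(
        codebook,
        phi_report="John Smith, MRN 12345: invasive ductal carcinoma",
        deidentified_report="[REDACTED], MRN [REDACTED]: invasive ductal carcinoma",
    )
    # Only someone with access to the offline codebook can map the identifier back.
    print(public_record["id"] in codebook)
```

Because the identifier is random, it reveals nothing about the patient even if the de-identified record is widely shared.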

PSL provides three levels of increasing access commensurate with investigator credentials and IRB approvals.

  1. First, feasibility studies are conducted using a statistical-level query that returns only aggregated results.
  2. Second, individual de-identified cases are selected by investigators certified by one of the participating institutions.
  3. Third, requests for specimens and clinical data are submitted and must be approved by the institution storing the requested data.
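
The three levels above can be sketched as a single query function whose response depends on the caller's access level. The names `AccessLevel` and `run_query`, the dictionary shapes, and the "pending"/"approved" status values are hypothetical; the sketch only illustrates how each level exposes more than the one before it.

```python
from enum import IntEnum


class AccessLevel(IntEnum):
    STATISTICAL = 1       # aggregated counts only
    CASE_REVIEW = 2       # individual de-identified cases
    SPECIMEN_REQUEST = 3  # request specimens and clinical data


def run_query(cases, term, level, institution_approved=False):
    """Return as much detail as the given access level permits."""
    matches = [c for c in cases if term.lower() in c["text"].lower()]
    response = {"count": len(matches)}            # level 1: aggregate only
    if level >= AccessLevel.CASE_REVIEW:
        response["cases"] = matches               # level 2: de-identified cases
    if level >= AccessLevel.SPECIMEN_REQUEST:
        # Level 3: the request stays pending until the storing institution approves it.
        response["request_status"] = "approved" if institution_approved else "pending"
    return response


if __name__ == "__main__":
    cases = [
        {"id": "a1", "text": "breast carcinoma"},
        {"id": "b2", "text": "colon adenoma"},
    ]
    print(run_query(cases, "carcinoma", AccessLevel.STATISTICAL))
    print(run_query(cases, "carcinoma", AccessLevel.SPECIMEN_REQUEST))
```

Checking of investigator credentials and IRB approvals is outside the scope of this sketch; here the level is simply passed in by the caller.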



Request Human Tissue for Cancer Studies

...