How to Use ClinicalBERT for De-identification of Medical Notes

Aug 26, 2022 | Educational

In the era of big data, ensuring patient confidentiality is paramount. One of the tools aiding in this endeavor is a ClinicalBERT model fine-tuned for the de-identification of medical notes. This guide walks you through using the model effectively, with a simple analogy to build intuition and troubleshooting tips for common pitfalls.

Understanding the ClinicalBERT Model

Imagine you are a librarian tasked with organizing a vast library of books. Each book contains various types of information—some sensitive (like personal letters) and some general (like encyclopedias). Your job is to remove all personal letters from these books while retaining all the other useful information. The ClinicalBERT model operates similarly: it classifies each token (word or sub-word piece) as either protected health information (PHI) or non-PHI, so the medical notes can be sanitized while their essential content is preserved, as the toy example below shows.
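To make the analogy concrete, here is a toy sketch of token-level labeling in Python. The tokens and labels below are hand-written for illustration; in practice the fine-tuned model predicts them.

```python
# Toy illustration of the librarian analogy: each token gets a label,
# and only PHI-labeled tokens are removed from the note.
# Labels are hand-written here; the model predicts them in practice.
note_tokens = ["John", "Smith", "was", "admitted", "on", "03/14/2014", "."]
labels      = ["PHI",  "PHI",   "O",   "O",        "O",  "PHI",        "O"]

# Keep non-PHI tokens; replace PHI tokens with a placeholder
sanitized = [tok if lab == "O" else "[REDACTED]"
             for tok, lab in zip(note_tokens, labels)]
print(" ".join(sanitized))
# -> [REDACTED] [REDACTED] was admitted on [REDACTED] .
```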

How to Use the Model

Follow these steps to effectively implement the ClinicalBERT model:

  • Demo: a detailed demonstration is available in the Robust DeID GitHub repository (linked under Troubleshooting below).
  • Steps to Run a Forward Pass: step-by-step guidelines are available in the same repository.
  • In Brief:
    • Sentencize the dataset into individual sentences.
    • Tokenize the data.
    • Use the model’s predict function to gather token-level predictions.
    • Aggregate the sentence-level predictions back to the note level.
    • Finally, remove PHI from the original note text using these predictions (see the sketch after this list).
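Putting these steps together, here is a minimal Python sketch of the forward pass. It assumes the transformers library, the SciSpaCy en_core_sci_sm model, and obi/deid_bert_i2b2 as the fine-tuned checkpoint; the checkpoint ID is an assumption, so substitute whichever de-identification model you are actually using.

```python
# Minimal sketch of the forward pass: sentencize, predict token labels,
# map sentence-level spans back to note level, and redact PHI.
# "obi/deid_bert_i2b2" is an assumed checkpoint ID -- adapt as needed.
import spacy
from transformers import pipeline

nlp = spacy.load("en_core_sci_sm")  # SciSpaCy model; installed separately
tagger = pipeline("token-classification",
                  model="obi/deid_bert_i2b2",
                  aggregation_strategy="simple")

note = "Mr. John Smith was seen at Mercy Hospital on 03/14/2014."

# Collect (start, end, label) spans in note-level character offsets
spans = []
for sent in nlp(note).sents:
    for ent in tagger(sent.text):
        spans.append((sent.start_char + ent["start"],
                      sent.start_char + ent["end"],
                      ent["entity_group"]))

# Replace spans right to left so earlier character offsets stay valid
redacted = note
for start, end, label in sorted(spans, reverse=True):
    redacted = redacted[:start] + f"[{label}]" + redacted[end:]

print(redacted)
```

The right-to-left replacement is a simple way to keep character offsets valid while editing the string; production pipelines typically batch sentences and aggregate predictions before redacting.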

Dataset Information

The model was trained using the I2B2 2014 de-identification challenge dataset, which consists of medical notes annotated with PHI labels. The distribution of PHI labels in this dataset is well documented, which helps ensure the trained model covers the full range of PHI categories it needs to recognize.

Training Procedure

The training process encompasses multiple steps to ensure accuracy and efficiency:

  • Sentencizing with the en_core_sci_sm sentencizer from SciSpaCy.
  • Tokenization using a custom tokenizer built on the en_core_sci_sm model.
  • Context was added by prepending and appending 32 tokens from the previous and next sentences; these context tokens are not used for learning (they are excluded from the loss) but give the model surrounding context (see the sketch after this list).
  • Sequences are truncated to a maximum of 128 tokens; longer sequences are split accordingly.
  • The tokenized dataset with token-level labels is used for training.
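To illustrate the context scheme, here is a hedged sketch of how one training chunk could be assembled. The build_chunk helper is hypothetical, not the authors' actual preprocessing code; it uses the common convention of labeling context positions with -100, which PyTorch's cross-entropy loss ignores, so those tokens inform predictions without contributing to learning.

```python
# Hypothetical sketch of chunk assembly with flanking context tokens.
# Context positions get label -100, the conventional "ignore" index for
# PyTorch cross-entropy, so they provide context without driving the loss.
CONTEXT = 32   # tokens of context taken from each neighboring sentence
MAX_LEN = 128  # maximum sequence length; longer chunks are split upstream

def build_chunk(prev_tokens, sent_tokens, sent_labels, next_tokens):
    """Assemble one training example with left/right context tokens."""
    left = prev_tokens[-CONTEXT:]   # up to 32 tokens of left context
    right = next_tokens[:CONTEXT]   # up to 32 tokens of right context
    tokens = left + sent_tokens + right
    labels = [-100] * len(left) + sent_labels + [-100] * len(right)
    return tokens[:MAX_LEN], labels[:MAX_LEN]

tokens, labels = build_chunk(
    prev_tokens=["admitted", "yesterday", "."],
    sent_tokens=["John", "Smith", "is", "74", "years", "old", "."],
    sent_labels=["B-NAME", "I-NAME", "O", "B-AGE", "O", "O", "O"],
    next_tokens=["He", "lives", "in", "Boston", "."],
)
print(list(zip(tokens, labels)))
```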

Results

The result is a model that efficiently removes sensitive information from medical notes while keeping the remaining content intact. This capability is crucial for complying with regulations such as HIPAA.

Troubleshooting

If you encounter any issues while working with the ClinicalBERT model, consider the following troubleshooting steps:

  • Ensure that your dataset is correctly formatted; inconsistencies in the data can lead to unexpected results.
  • Verify the versions of the libraries you are using; outdated versions may not support certain functionality (a quick check is sketched after this list).
  • If the model does not perform as expected, try adjusting the inference parameters or using different tokenizer settings.
  • For further assistance, you can post an issue on the GitHub repo: Robust DeID.
  • For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
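As a first diagnostic for the version issues mentioned above, printing the installed library versions is a quick sanity check:

```python
# Print the versions of the core libraries used in this workflow;
# mismatched versions are a common source of unexpected behavior.
import spacy
import torch
import transformers

print("transformers:", transformers.__version__)
print("spacy:", spacy.__version__)
print("torch:", torch.__version__)
```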

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
