Welcome to your one-stop guide to MS-BERT! This post walks you through the steps needed to use MS-BERT effectively, a cutting-edge model pre-trained specifically on Multiple Sclerosis (MS) patient data.
Introduction to MS-BERT
MS-BERT is a notable achievement in clinical natural language processing (NLP). The model was pre-trained on clinical notes from neurological examinations of MS patients at St. Michael’s Hospital in Toronto: a dataset of 75,000 clinical notes from approximately 5,000 patients. That’s about 35.7 million words detailing patient conditions, progress, and diagnoses!
Data Pre-processing: The Art of De-identification
Before diving into MS-BERT’s applications, it’s crucial to discuss data pre-processing. Imagine the dataset as a beautiful, pristine lake you want to keep clean and unobstructed; in this analogy, identifying information is the debris that clutters it. We cleaned the dataset by removing:
- Patient names
- Doctor names
- Hospital names
- Patient identification numbers
- Addresses and phone numbers
- Dates and times
To achieve this, we paired a curated database of identifying information with regular expressions to meticulously scrub the dataset. Each identifier was replaced with a token that keeps a semblance of its semantic category while ensuring anonymity; a code sketch of this approach follows the list below. Examples include:
- Female first names – Lucie
- Male first names – Ezekiel
- Last names – Salamanca
- Dates – 2010s
- Patient IDs – 999
- Phone numbers – 1718
- Addresses – Silesia
- Times – 1610
- Hospital/Clinic names – Troy
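To make this concrete, here is a minimal sketch of that kind of regex-based de-identification. It is not the exact pipeline used for MS-BERT: the `KNOWN_NAMES` dictionary stands in for the curated database mentioned above, and the patterns are deliberately simplified.

```python
import re

# Stand-in for the curated database of known identifiers
# (illustrative; the real pipeline used a far larger list).
KNOWN_NAMES = {
    "female_first": ["Mary", "Susan"],
    "male_first": ["John", "David"],
    "last": ["Smith", "Jones"],
}

# Replacement tokens that preserve a rough semantic category.
REPLACEMENTS = {
    "female_first": "Lucie",
    "male_first": "Ezekiel",
    "last": "Salamanca",
    "date": "2010s",
    "patient_id": "999",
    "phone": "1718",
    "time": "1610",
}

# Simplified regexes for structured identifiers.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "patient_id": re.compile(r"\bMRN[:\s]*\d+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "time": re.compile(r"\b\d{1,2}:\d{2}\s?(?:[AaPp][Mm])?\b"),
}

def deidentify(note: str) -> str:
    """Replace identifying spans with category-preserving tokens."""
    for category, pattern in PATTERNS.items():
        note = pattern.sub(REPLACEMENTS[category], note)
    for category, names in KNOWN_NAMES.items():
        for name in names:
            note = re.sub(rf"\b{re.escape(name)}\b", REPLACEMENTS[category], note)
    return note

print(deidentify("Mary Smith (MRN: 1234567) was seen on 2019-03-14 at 10:30 AM."))
# -> Lucie Salamanca (999) was seen on 2010s at 1610.
```

Replacing identifiers with plausible same-category tokens, rather than a generic `[REDACTED]` marker, keeps the sentences grammatical, so the language model still sees natural-looking text during pre-training.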
Pre-training MS-BERT
Now that our dataset has been cleaned, we can begin the pre-training of MS-BERT. Think of this process as feeding a growing plant: the nutrients you provide will determine its strength and resilience. MS-BERT’s starting point is the pre-trained BlueBERT model, which was itself trained on PubMed abstracts and MIMIC-III clinical notes.
From there, MS-BERT was further trained on the masked language modeling (MLM) task, implemented with the Hugging Face Transformers library, allowing it to learn from the abundant clinical notes. The hyperparameters used for this run can be found in the config file in the MS-BERT repository.
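For orientation, here is a minimal sketch of what continued MLM pre-training looks like with the Transformers `Trainer`. The BlueBERT checkpoint id, the `deidentified_notes.txt` path, and the hyperparameter values are placeholders; the actual values live in the config file mentioned above.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from a pre-trained BlueBERT checkpoint (id is illustrative;
# use the checkpoint referenced in the MS-BERT config).
checkpoint = "bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Load the de-identified notes, one note per line (path is a placeholder).
dataset = load_dataset("text", data_files={"train": "deidentified_notes.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The collator masks 15% of tokens at random, the standard BERT rate;
# predicting them back is the entire pre-training signal.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="ms-bert",
    per_device_train_batch_size=8,  # hyperparameters are illustrative;
    num_train_epochs=3,             # see the repository's config file
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```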
Troubleshooting Tips
While working on MS-BERT, you might run into a few hiccups. Here are some troubleshooting tips to help you out:
- Model Loading Errors: Make sure you have the necessary dependencies installed, including the Hugging Face Transformers library; a quick sanity check follows this list.
- Performance Issues: Ensure that your dataset is properly pre-processed and confirm that there are no parsing errors; even tiny discrepancies can throw the workflow off.
- Inconsistent Results: Verify that the hyperparameters you are using match those recommended in the MS-BERT documentation.
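As a quick sanity check for loading errors, the snippet below loads the published MS-BERT weights and runs a fill-mask query. This assumes the checkpoint is available on the Hugging Face Hub as `NLP4H/ms_bert`; verify the id against the model card if loading fails.

```python
from transformers import pipeline

# Load the released MS-BERT checkpoint and run a fill-mask query.
fill = pipeline("fill-mask", model="NLP4H/ms_bert")

for prediction in fill("The patient was diagnosed with multiple [MASK]."):
    print(f"{prediction['token_str']:>12}  {prediction['score']:.3f}")
```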
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Acknowledgements
This groundbreaking project wouldn’t have been possible without the invaluable support from the Data Science and Advanced Analytics (DSAA) department at St. Michael’s Hospital. Special thanks to Dr. Marzyeh Ghassemi, Taylor Killian, Nathan Ng, and Haoran Zhang for their guidance throughout this initiative.
Disclaimer
It’s essential to remember that the results produced by MS-BERT are not intended for direct diagnostic use or medical decision-making without a clinical professional’s oversight. Always consult a healthcare professional if you have any questions regarding MS-BERT or its outputs.
Closing Thoughts
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

