Welcome to the world of GatorTron-Medium! Developed collaboratively by the University of Florida and NVIDIA, this clinical language model has 3.9 billion parameters and is built on a BERT-style architecture. In this guide, we will walk you through the basics of GatorTron-Medium, its applications, and how to get started.
What is GatorTron-Medium?
GatorTron-Medium is a groundbreaking clinical language model, pre-trained using an extensive dataset that includes:
- 82 billion words of de-identified clinical notes from the University of Florida Health System
- 6.1 billion words from PubMed CC0
- 2.5 billion words from WikiText
- 0.5 billion words of de-identified clinical notes from MIMIC-III
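Adding up the corpora above, the pre-training data comes to roughly 91 billion words. A quick sketch of that arithmetic (corpus names abbreviated from the list above):

```python
# Approximate pre-training corpus sizes, in billions of words,
# taken from the list above.
corpora = {
    "UF Health clinical notes": 82.0,
    "PubMed CC0": 6.1,
    "WikiText": 2.5,
    "MIMIC-III clinical notes": 0.5,
}

total_billions = sum(corpora.values())
print(f"Total: ~{total_billions:.1f} billion words")  # ~91.1 billion words
```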
With GatorTron-Medium, users can easily perform various natural language processing (NLP) tasks in the realm of healthcare.
Model Variations
GatorTron comes in several variations catering to different needs:
- gatortron-base: 345 million parameters
- gatortronS: 345 million parameters
- gatortron-medium (this model): 3.9 billion parameters
- gatortron-large: 8.9 billion parameters
How to Use GatorTron-Medium
Getting started with GatorTron-Medium is straightforward. Think of the model as a chef who needs the right ingredients: the tokenizer, the configuration, and the model weights each play a part in the final recipe, which is running the clinical language model on your text.
```python
from transformers import AutoModel, AutoTokenizer, AutoConfig

# Load the tokenizer, configuration, and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('UFNLP/gatortron-medium')
config = AutoConfig.from_pretrained('UFNLP/gatortron-medium')
my_model = AutoModel.from_pretrained('UFNLP/gatortron-medium')

# Tokenize a clinical sentence and run it through the model
encoded_input = tokenizer("Bone scan: Negative for distant metastasis.", return_tensors='pt')
encoded_output = my_model(**encoded_input)
```
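The model returns per-token hidden states rather than a single sentence vector. One common way to get a sentence embedding is mean pooling over the real (non-padding) tokens, using the attention mask. The sketch below illustrates that pooling logic on a dummy NumPy array whose shape mirrors a transformers output of `(batch, seq_len, hidden_size)`; the values are placeholders, not actual GatorTron output:

```python
import numpy as np

# Dummy stand-in for my_model(**encoded_input).last_hidden_state:
# shape (batch=2, seq_len=4, hidden_size=3)
hidden = np.arange(2 * 4 * 3, dtype=np.float64).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],   # last token is padding
                 [1, 1, 0, 0]])  # last two tokens are padding

mask_f = mask[:, :, None].astype(np.float64)  # (2, 4, 1) for broadcasting
summed = (hidden * mask_f).sum(axis=1)        # sum over real tokens only
counts = mask_f.sum(axis=1)                   # number of real tokens per row
sentence_vecs = summed / counts               # (2, 3) mean-pooled vectors

print(sentence_vecs.shape)  # (2, 3)
```

With real model output, `hidden` would be `encoded_output.last_hidden_state.detach().numpy()` and `mask` would come from `encoded_input['attention_mask']`.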
Application of GatorTron-Medium
GatorTron can serve as the backbone for a range of clinical NLP tasks, including:
- Clinical Concept Extraction (Named Entity Recognition)
- Relation Extraction
- Extraction of Social Determinants of Health (SDoH)
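For clinical concept extraction, a fine-tuned GatorTron model typically emits token-level BIO tags, which then need to be decoded into entity spans. That decoding step is model-independent; here is a minimal sketch, where the tokens and tags are hypothetical examples standing in for a fine-tuned NER head's output:

```python
def bio_to_entities(tokens, tags):
    """Collapse BIO tags into (entity_text, label) spans."""
    entities, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close any span in progress
                entities.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)  # continue the current span
        else:
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

# Hypothetical tags for the example sentence from the code above
tokens = ["Bone", "scan", ":", "Negative", "for", "distant", "metastasis", "."]
tags   = ["B-TEST", "I-TEST", "O", "O", "O", "B-PROBLEM", "I-PROBLEM", "O"]
print(bio_to_entities(tokens, tags))
# [('Bone scan', 'TEST'), ('distant metastasis', 'PROBLEM')]
```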
De-identification Feature
One of the remarkable features of GatorTron-Medium is its de-identification system. This is crucial for maintaining patient privacy. Using the safe-harbor method, GatorTron removes Protected Health Information (PHI) by replacing sensitive information (like names) with dummy strings (e.g., [**NAME**]). This system complies with HIPAA regulations.
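To make the replacement convention concrete, here is a toy sketch of swapping detected PHI for dummy strings. Note this is an illustration only: the real GatorTron de-identification system is a trained model, and the regex patterns below are simplistic placeholders, not how the system actually finds PHI:

```python
import re

# Toy patterns standing in for model-detected PHI spans
PHI_PATTERNS = [
    (re.compile(r"\bDr\.\s+[A-Z][a-z]+\b"), "[**NAME**]"),
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[**PHONE**]"),
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "[**DATE**]"),
]

def redact(text):
    """Replace matched PHI spans with safe-harbor-style dummy strings."""
    for pattern, placeholder in PHI_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

note = "Seen by Dr. Smith on 03/14/2023; callback 352-555-0100."
print(redact(note))
# Seen by [**NAME**] on [**DATE**]; callback [**PHONE**].
```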
Troubleshooting Common Issues
If you encounter any hiccups while using GatorTron-Medium, here are some tips to help you overcome them:
- Ensure all dependencies are properly installed, particularly the transformers library.
- Double-check that you are using the correct model identifier when loading the model.
- If you experience memory issues, consider using smaller model variations.
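A quick back-of-envelope calculation helps when picking a variant: weight memory is roughly the parameter count times bytes per parameter (about 2 bytes in fp16), ignoring activations and optimizer state. The sketch below uses the parameter counts listed earlier:

```python
# Rough inference-time weight memory per variant, assuming
# 2 bytes/parameter (fp16) and counting weights only.
variants = {
    "gatortron-base": 345e6,
    "gatortronS": 345e6,
    "gatortron-medium": 3.9e9,
    "gatortron-large": 8.9e9,
}

for name, params in variants.items():
    gb = params * 2 / 1024**3
    print(f"{name}: ~{gb:.1f} GB in fp16")
```

By this estimate, gatortron-medium needs on the order of 7 GB just for weights in fp16, which is why dropping to gatortron-base can resolve out-of-memory errors on smaller GPUs.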
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Citation Information
For academic purposes, please cite the following study: Yang, Xi et al. (2022). A large language model for electronic health records. npj Digital Medicine. Nature Publishing Group.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
Contact Information
For further inquiries, you can reach out to:
- Yonghui Wu: yonghui.wu@ufl.edu
- Cheng Peng: c.peng@ufl.edu
Now, go forth and make the most of GatorTron-Medium in your NLP endeavors!

