In bioinformatics, extracting information about genes and their products from unstructured text is crucial, and Named Entity Recognition (NER) has proven to be a powerful tool for this purpose. In this post, we’ll explore how to set up and use a pre-trained NER model to find genes and gene products in text.
Getting Started with NER
This guide outlines the use of a pre-trained NER model, specifically designed for extracting gene and gene product information. Let’s break down the key components involved in utilizing this model effectively.
Understanding the Model
Our NER model was fine-tuned on the JNLPBA dataset, starting from a RoBERTa model pre-trained on PubMed text. It can identify various token classes, including:
- DNA
- RNA
- Protein
- Cell line
- Cell type
- Other entities
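Keep in mind that the B- and I- prefixes of the usual BIO tagging scheme have been removed from the labels to simplify analysis, so each token carries only its entity class. As a result, the labels alone cannot tell you where one entity ends and an adjacent entity of the same class begins.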
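To make the prefix removal concrete, here is a minimal sketch of how BIO-style tags collapse into plain class labels. The helper `strip_bio_prefix` and the sample labels are hypothetical and only illustrate the idea; they are not taken from the model's training code.

```python
def strip_bio_prefix(label: str) -> str:
    """Drop a leading 'B-' or 'I-' from a BIO-style tag; leave other labels unchanged."""
    if label.startswith(("B-", "I-")):
        return label[2:]
    return label


# Hypothetical BIO-tagged labels, for illustration only.
bio_labels = ["B-protein", "I-protein", "O", "B-DNA", "I-DNA"]
flat_labels = [strip_bio_prefix(label) for label in bio_labels]
print(flat_labels)  # ['protein', 'protein', 'O', 'DNA', 'DNA']
```

After flattening, the first two tokens share the label `protein`, and nothing in the label itself marks which token started the entity.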
Installing Necessary Libraries
Before starting, ensure you have the required libraries installed:
- Transformers
- Pandas
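A quick way to confirm that both libraries are present is to query their installed versions from Python. This is a small sketch using only the standard library; the helper name `installed_versions` is our own. Note that Transformers also needs a deep learning backend such as PyTorch to actually run a model.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(["transformers", "pandas"]).items():
    print(f"{name}: {ver if ver else 'not installed'}")
```

Any package reported as not installed can be added with `pip install transformers pandas`.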
Setting Up the NER Pipeline
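You can create the NER pipeline with the following Python code:
from transformers import pipeline

# Model ID on the Hugging Face Hub
PRETRAINED = 'raynardj/ner-gene-dna-rna-jnlpba-pubmed'
ner = pipeline(task='ner', model=PRETRAINED, tokenizer=PRETRAINED)

# 'first' groups sub-word tokens into words and labels each word by its first token
ner('Your text', aggregation_strategy='first')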
This code initializes the NER pipeline with the pre-trained model, allowing you to process any text for gene entity extraction.
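When an aggregation strategy is set, the pipeline returns a list of dictionaries with the keys `entity_group`, `score`, `word`, `start`, and `end`. The sketch below shows one way to post-process such a list without loading the model. The sample records, the label names, and the helper `group_by_label` are hypothetical and hand-written for illustration; they are not real predictions.

```python
from collections import defaultdict

# Hypothetical records for the sentence "IL-2 promoter activity in T cells",
# written by hand in the shape of aggregated pipeline output.
sample_entities = [
    {"entity_group": "protein", "score": 0.98, "word": "IL-2", "start": 0, "end": 4},
    {"entity_group": "DNA", "score": 0.41, "word": "promoter", "start": 5, "end": 13},
    {"entity_group": "cell_type", "score": 0.93, "word": "T cells", "start": 26, "end": 33},
]


def group_by_label(entities, min_score=0.5):
    """Collect entity texts per label, skipping predictions below min_score."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


print(group_by_label(sample_entities))
# {'protein': ['IL-2'], 'cell_type': ['T cells']}
```

The `min_score` threshold of 0.5 is an arbitrary choice here; a suitable value depends on how much precision you need.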
Cleansing the Output
The raw pipeline output can be refined for better readability and organization. Here’s an analogy: think of it as filtering through a box of assorted items to find only the essentials. The function below merges tokens with consecutive indices into a single entity and returns a list of records holding the entity text, its character offsets, and its label:
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(PRETRAINED)


def clean_output(outputs):
    """Merge per-token pipeline records (with 'index' and 'entity' keys) into entities."""
    # Step 1: split the token records into runs of consecutive token indices.
    results = []
    current = []
    last_idx = 0
    for output in outputs:
        if output['index'] - 1 == last_idx:
            current.append(output)
        else:
            results.append(current)
            current = [output]
        last_idx = output['index']
    if len(current) > 0:
        results.append(current)

    # Step 2: turn each run into one record with text, offsets, and label.
    strings = []
    for c in results:
        tokens = []
        starts = []
        ends = []
        for o in c:
            tokens.append(o['word'])
            starts.append(o['start'])
            ends.append(o['end'])

        new_str = tokenizer.convert_tokens_to_string(tokens)
        # Empty runs produce an empty string and are skipped.
        if new_str != '':
            strings.append(dict(
                word=new_str,
                start=min(starts),
                end=max(ends),
                entity=c[0]['entity']  # label of the first token in the run
            ))
    return strings
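The grouping idea inside `clean_output` can be tried in isolation, without a tokenizer or a model. The sketch below is a simplified variant of that step, not the function itself: `group_consecutive` is a hypothetical helper, and the token records are hand-written, with plain strings in place of real sub-word tokens.

```python
def group_consecutive(outputs):
    """Split token records into runs whose 'index' values are consecutive."""
    groups = []
    for output in outputs:
        if groups and output["index"] == groups[-1][-1]["index"] + 1:
            groups[-1].append(output)  # continues the current run
        else:
            groups.append([output])  # gap in the indices: start a new run
    return groups


# Hypothetical token records, for illustration only.
tokens = [
    {"index": 1, "word": "IL", "entity": "protein"},
    {"index": 2, "word": "-", "entity": "protein"},
    {"index": 3, "word": "2", "entity": "protein"},
    {"index": 7, "word": "T", "entity": "cell_type"},
    {"index": 8, "word": "cells", "entity": "cell_type"},
]

for group in group_consecutive(tokens):
    print([t["index"] for t in group], group[0]["entity"])
# [1, 2, 3] protein
# [7, 8] cell_type
```

Because runs are split only on index gaps, two neighbouring tokens with different labels still land in the same run and take the label of the first token. The same holds for `clean_output`.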
Creating an Entity Table
Finally, you can present the results as a pandas DataFrame, making it easier to analyze the extracted entities:
def entity_table(ner_pipeline, **pipeline_kw):
    # clean_output reads the per-token 'index' and 'entity' keys, which the
    # pipeline only emits when tokens are not aggregated, so default to 'none'.
    pipeline_kw.setdefault('aggregation_strategy', 'none')

    def create_table(text):
        return pd.DataFrame(
            clean_output(
                ner_pipeline(text, **pipeline_kw)
            )
        )
    return create_table


entity_table(ner)('YOUR_VERY_CONTENTFUL_TEXT')
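The `start` and `end` columns are character offsets into the input text, so they can be used to pull each entity's exact wording back out of the source. Here is a minimal sketch with plain dictionaries standing in for table rows. The rows and the helper `add_source_span` are hypothetical, and the sketch assumes the offsets refer to the unmodified input string.

```python
text = "IL-2 promoter activity in T cells"

# Hypothetical rows in the shape of the entity table, for illustration only.
rows = [
    {"word": "IL-2", "start": 0, "end": 4, "entity": "protein"},
    {"word": "T cells", "start": 26, "end": 33, "entity": "cell_type"},
]


def add_source_span(text, rows):
    """Return copies of the rows with the slice of text each one points to."""
    return [{**row, "span": text[row["start"]:row["end"]]} for row in rows]


for row in add_source_span(text, rows):
    print(row["entity"], repr(row["span"]))
# protein 'IL-2'
# cell_type 'T cells'
```

Comparing `span` with `word` is a cheap sanity check: a mismatch beyond leading or trailing whitespace suggests the offsets and the text are out of sync.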
Troubleshooting Tips
If you encounter issues while using the NER model, consider the following troubleshooting steps:
- Ensure that you have updated the Transformers library and compatible dependencies.
- Check that your internet connection is stable, as the model is downloaded from the Hugging Face Hub the first time you run it.
- Review the input text for any discrepancies or errors that may affect tokenization.
- If all else fails, reach out to experts or communities for guidance.
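For the input-text check, one option is to normalize the text before passing it to the pipeline. The sketch below uses only the standard library; `normalize_text` is a hypothetical helper, and NFKC normalization is our assumption about what helps, since it can also rewrite characters you may want to keep, such as superscripts.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Apply NFKC normalization and collapse whitespace runs to single spaces."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


# A ligature, a non-breaking space, and stray line breaks, as often found in PDF extracts.
raw = "puri\ufb01ed\u00a0IL-2 \n\n receptor"
print(normalize_text(raw))  # purified IL-2 receptor
```

Normalization changes character positions, so entity offsets returned by the pipeline will refer to the normalized string rather than the original one.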
Conclusion
Using an NER model for gene and gene product extraction can significantly enhance your bioinformatics research. By following the guidelines outlined above, you can effortlessly incorporate NER into your workflow.