How to Use NER for Gene and Gene Product Extraction

Nov 8, 2021 | Educational


In bioinformatics, extracting information about genes and their products from unstructured text is crucial. Named Entity Recognition (NER) is a powerful tool for this purpose. In this post, we’ll explore how to set up and use NER to find genes and gene products with a pre-trained model.

Getting Started with NER

This guide outlines the use of a pre-trained NER model, specifically designed for extracting gene and gene product information. Let’s break down the key components involved in utilizing this model effectively.

Understanding the Model

Our NER model starts from a RoBERTa model pre-trained on PubMed text and is fine-tuned on the JNLPBA dataset. It can identify the following token classes:

DNA
RNA
Protein
Cell line
Cell type
Other entities

Keep in mind that the token prefixes B- and I- have been removed from the data labels to simplify analysis, so each token carries only its class name.
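To illustrate what dropping the prefixes means: in the common BIO tagging scheme, B- marks the first token of an entity and I- marks a continuation. The snippet below is a minimal sketch, assuming BIO-style label strings such as B-DNA; the helper name strip_bio_prefix is hypothetical and is not part of the model or the Transformers library.

```python
def strip_bio_prefix(label):
    """Return the bare class name for a BIO-style tag, e.g. 'B-DNA' -> 'DNA'."""
    if label[:2] in ("B-", "I-"):
        return label[2:]
    return label


# Hand-written example tags, not taken from the dataset itself.
tags = ["B-DNA", "I-DNA", "O", "B-protein"]
bare = [strip_bio_prefix(t) for t in tags]
print(bare)
```

One consequence of this simplification is that two adjacent entities of the same class can no longer be told apart from the labels alone.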

Installing Necessary Libraries

Before starting, ensure you have the required libraries installed:

Transformers
Pandas
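Both packages are typically installed with pip (pip install transformers pandas), and Transformers also needs a deep learning backend such as PyTorch to run a pipeline. As a quick check before going further, the sketch below reports which of the two packages are present; the helper name installed_versions is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(["transformers", "pandas"])
for name, ver in versions.items():
    status = ver if ver is not None else f"not installed (try: pip install {name})"
    print(f"{name}: {status}")
```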

Setting Up the NER Pipeline

You can create the NER pipeline using the following Python code:

from transformers import pipeline

PRETRAINED = 'raynardj/ner-gene-dna-rna-jnlpba-pubmed'
ner = pipeline(task='ner', model=PRETRAINED, tokenizer=PRETRAINED)

ner('Your text', aggregation_strategy='first')

This code initializes the NER pipeline with the pre-trained model, allowing you to process any text for gene entity extraction.

Cleansing the Output

The pipeline returns its predictions as a list of dictionaries, so you may want to refine the output for better readability and organization. Think of it as sorting through a box of assorted items to keep only the essentials. The function below merges neighboring tokens into spans and organizes the NER output into a structured format:

import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(PRETRAINED)

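def clean_output(outputs):
    # Group token-level predictions whose token indices are consecutive.
    # Recent Transformers releases return aggregated entities without an
    # 'index' key; each such item is already a full span and forms its own group.
    results = []
    current = []
    last_idx = None
    for output in outputs:
        idx = output.get('index')
        if current and idx is not None and last_idx is not None and idx - 1 == last_idx:
            current.append(output)
        else:
            if current:
                results.append(current)
            current = [output]
        last_idx = idx
    if current:
        results.append(current)

    strings = []
    for c in results:
        if 'index' in c[0]:
            # Token-level output: merge sub-word tokens back into text.
            new_str = tokenizer.convert_tokens_to_string([o['word'] for o in c])
        else:
            # Aggregated output: 'word' is already decoded text.
            new_str = c[0]['word']
        if new_str != '':
            strings.append(dict(
                word=new_str,
                start=min(o['start'] for o in c),
                end=max(o['end'] for o in c),
                # 'entity' on token-level output, 'entity_group' when aggregated
                entity=c[0].get('entity', c[0].get('entity_group'))
            ))
    return strings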
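The grouping idea behind clean_output can be tried without downloading the model. The sketch below is a simplified illustration, not the function above: it assumes token-level predictions that carry an index, start, end, and entity key, and the sample records are hand-written rather than real model output. The helper name group_consecutive is hypothetical.

```python
def group_consecutive(outputs):
    """Split token predictions into runs whose 'index' values are consecutive."""
    groups = []
    for output in outputs:
        if groups and output["index"] - 1 == groups[-1][-1]["index"]:
            groups[-1].append(output)
        else:
            groups.append([output])
    return groups


# Hand-written sample records for illustration only.
sample = [
    {"index": 3, "word": "IL", "start": 10, "end": 12, "entity": "DNA"},
    {"index": 4, "word": "-2", "start": 12, "end": 14, "entity": "DNA"},
    {"index": 9, "word": "T", "start": 30, "end": 31, "entity": "cell_type"},
]

# Each group becomes one span covering its first start to its last end.
spans = [
    {
        "start": min(o["start"] for o in g),
        "end": max(o["end"] for o in g),
        "entity": g[0]["entity"],
    }
    for g in group_consecutive(sample)
]
print(spans)
```

Tokens 3 and 4 are adjacent and merge into one span, while token 9 starts a new one.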

Creating an Entity Table

Finally, you can present the results as a pandas DataFrame, which makes the extracted entities easier to analyze:

def entity_table(ner_pipeline, **pipeline_kw):
    # Default to the 'first' aggregation strategy unless the caller sets one.
    pipeline_kw.setdefault('aggregation_strategy', 'first')

    def create_table(text):
        # Run NER, clean the spans, and load them into a DataFrame.
        return pd.DataFrame(
            clean_output(
                ner_pipeline(text, **pipeline_kw)
            )
        )
    return create_table

entity_table(ner)('YOUR_VERY_CONTENTFUL_TEXT')
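Once the entities are in tabular form, a common first step is to count how many of each type were found. The sketch below does this on plain dictionaries shaped like the rows clean_output returns (word, start, end, entity); the rows are hand-written for illustration and the helper name count_by_entity is hypothetical.

```python
from collections import Counter


def count_by_entity(rows):
    """Count how many extracted spans fall into each entity class."""
    return Counter(row["entity"] for row in rows)


# Hand-written rows for illustration only, not real model output.
rows = [
    {"word": "IL-2 gene", "start": 4, "end": 13, "entity": "DNA"},
    {"word": "NF-kappa B", "start": 30, "end": 40, "entity": "protein"},
    {"word": "T cells", "start": 55, "end": 62, "entity": "cell_type"},
    {"word": "CD28", "start": 70, "end": 74, "entity": "protein"},
]

counts = count_by_entity(rows)
print(counts.most_common())
```

On the DataFrame itself, the equivalent pandas call would be df['entity'].value_counts().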

Troubleshooting Tips

If you encounter issues while using the NER model, consider the following troubleshooting steps:

Ensure that you have updated the Transformers library and compatible dependencies.
Check that your internet connection is stable, as the model must be downloaded the first time it is used.
Review the input text for any discrepancies or errors that may affect tokenization.
If all else fails, reach out to experts or communities for guidance.
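For the input-text check, one option is to normalize the text before passing it to the pipeline. The sketch below is one possible approach, not something the model requires: it applies Unicode NFKC normalization and collapses whitespace. The helper name normalize_text is hypothetical.

```python
import re
import unicodedata


def normalize_text(text):
    """Apply NFKC normalization, then collapse whitespace runs to single spaces."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


# Contains a double space, a non-breaking space, a newline, and a tab.
raw = "The  IL\u00a02\n gene\tis expressed."
print(normalize_text(raw))
```

Note that NFKC also rewrites some characters (for example ligatures and superscripts), and the start and end offsets returned by the pipeline will then refer to the normalized text rather than the original.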

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

Using an NER model for gene and gene product extraction can speed up literature mining in bioinformatics research. By following the steps outlined above, you can incorporate NER into your workflow.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
