Unlocking the Nucleotide Transformer: A Guide to Using the Multi-Species Model

July 22, 2024

The Nucleotide Transformer models, and in particular nucleotide-transformer-v2-500m-multi-species, are foundation language models for genomics, pre-trained on DNA sequences from a large and diverse set of genomes. In this guide, we explore how to use the multi-species model for your own genomic tasks.

What is the Nucleotide Transformer?

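The Nucleotide Transformer collection was developed to provide accurate predictions of molecular phenotypes by integrating DNA sequences from over 3,200 human genomes and 850 genomes from a wide range of species. According to its model card, nucleotide-transformer-v2-500m-multi-species is the member of that collection pre-trained on the 850 multi-species genomes. In practice, it gives you a reusable representation of DNA that has already absorbed patterns from many organisms, which you can apply to your own sequences.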

How to Use the Nucleotide Transformer

Using the Nucleotide Transformer is straightforward. Below, we cover the setup and a sample code snippet to get you started.

Setup Instructions

First, ensure you have Python installed on your device, along with PyTorch (the torch package), which the sample code imports.
Then install the transformers library from source with the command:

pip install --upgrade git+https://github.com/huggingface/transformers.git
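
Before running the sample, it can help to confirm that the required packages are importable. The following is a minimal sketch using only the Python standard library; the helper name check_environment is made up for this guide and is not part of any library.

```python
import importlib.util
import sys


def check_environment(packages=("transformers", "torch")):
    """Return a dict mapping each package name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    # Report the interpreter version and whether each package was found.
    print(f"Python version: {sys.version.split()[0]}")
    for name, available in check_environment().items():
        status = "found" if available else "MISSING"
        print(f"{name}: {status}")
```

If a package is reported as missing, install it before moving on to the inference code.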

Code Sample for Inference

The following Python snippet demonstrates how to retrieve embeddings from two dummy DNA sequences; the logits are returned in the same output object:

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Import the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-500m-multi-species", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-500m-multi-species", trust_remote_code=True)

# Specify input sequence length
max_length = tokenizer.model_max_length

# Create two dummy DNA sequences and tokenize them, padding each to max_length
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
tokens_ids = tokenizer(sequences, return_tensors="pt", padding="max_length", max_length=max_length)["input_ids"]

# Compute the embeddings
attention_mask = tokens_ids != tokenizer.pad_token_id
torch_outs = model(
    tokens_ids,
    attention_mask=attention_mask,
    encoder_attention_mask=attention_mask,
    output_hidden_states=True
)

# Take the per-token embeddings from the last hidden layer
embeddings = torch_outs["hidden_states"][-1].detach()
print(f"Embeddings shape: {embeddings.shape}")
print(f"Embeddings per token: {embeddings}")

# Add an embedding-dimension axis so the mask broadcasts over the embeddings
mask = attention_mask.unsqueeze(-1)

# Compute mean embeddings per sequence, ignoring padding tokens
mean_sequence_embeddings = torch.sum(mask * embeddings, dim=1) / torch.sum(mask, dim=1)
print(f"Mean sequence embeddings: {mean_sequence_embeddings}")
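
To see what the masked mean in the last step computes, here is a toy version of the same arithmetic in plain Python, with no model involved. The numbers and the helper name masked_mean are illustrative assumptions; in the snippet above the equivalent operation is carried out by torch on tensors of shape (batch, tokens, hidden size).

```python
def masked_mean(token_embeddings, mask):
    """Average the embedding vectors whose mask entry is True.

    token_embeddings: list of equal-length vectors, one per token.
    mask: list of booleans, True for real tokens and False for padding.
    """
    kept = [vec for vec, keep in zip(token_embeddings, mask) if keep]
    if not kept:
        raise ValueError("mask excludes every token")
    hidden_size = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(hidden_size)]


# Toy sequence: two real tokens followed by one padding token.
toy_embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [True, True, False]

print(masked_mean(toy_embeddings, toy_mask))  # [2.0, 3.0]
```

The padding vector is excluded from both the sum and the count, so it cannot distort the average.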

Understanding the Code with an Analogy

Imagine you’re a skilled chef wanting to create a new recipe. Instead of using a single ingredient, you’d like to incorporate flavors from a variety of spices and herbs. Each component adds a unique touch, and when combined, they yield a delicious outcome.

In this code, the DNA sequences act like those ingredients. The tokenizer converts the “ingredients” (DNA sequences) into token IDs, a format the model can consume. The model then processes these tokens, much as a chef blends ingredients, to produce meaningful “flavors” in the form of one embedding vector per token. Finally, the embeddings of each sequence are averaged over its non-padding tokens, giving a single vector that summarizes the whole sequence.
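
As a rough picture of what the tokenizer step does, the sketch below splits a DNA string into pieces. It is a simplified illustration, not the model's actual tokenizer: it assumes nucleotides are grouped into non-overlapping 6-mers from left to right, with any leftover tail emitted one nucleotide at a time. The function name sketch_kmer_tokens is hypothetical, and the real tokenizer additionally adds special tokens and maps every token to an integer ID.

```python
def sketch_kmer_tokens(sequence, k=6):
    """Split a DNA string into non-overlapping k-mers, left to right.

    Any leftover tail shorter than k is emitted one nucleotide at a time.
    """
    full = len(sequence) - len(sequence) % k
    tokens = [sequence[i:i + k] for i in range(0, full, k)]
    tokens.extend(sequence[full:])  # extending with a string adds single characters
    return tokens


print(sketch_kmer_tokens("ATTCCGATTCCGATTCCG"))
# ['ATTCCG', 'ATTCCG', 'ATTCCG']
print(sketch_kmer_tokens("ATTCCGAT"))
# ['ATTCCG', 'A', 'T']
```

To see the tokens the model really receives, inspect the output of the tokenizer loaded in the inference snippet rather than relying on this sketch.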

Troubleshooting

If you encounter issues while using the Nucleotide Transformer, here are some troubleshooting steps:

Ensure that all dependencies are installed correctly, especially the transformers library.
Check your Python version compatibility with the libraries used.
If you face memory issues, consider using smaller batch sizes or reducing the input sequence length.
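
The memory advice above can be sketched with two small helpers: one that splits a list of sequences into smaller batches, and one that cuts a long sequence into shorter windows. Both names (iter_batches, split_into_windows) and the sizes used are hypothetical, chosen only for illustration. Each batch would then be tokenized and passed to the model as in the inference snippet, ideally inside torch.no_grad() so that no gradients are stored.

```python
def iter_batches(sequences, batch_size):
    """Yield consecutive slices of `sequences` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(sequences), batch_size):
        yield sequences[start:start + batch_size]


def split_into_windows(sequence, window):
    """Cut one sequence into non-overlapping pieces of at most `window` characters."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [sequence[i:i + window] for i in range(0, len(sequence), window)]


sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT", "ACGTACGTACGT"]

for batch in iter_batches(sequences, batch_size=2):
    print(len(batch), "sequence(s) in this batch")

print(split_into_windows("ACGTACGTACGT", window=6))  # ['ACGTAC', 'GTACGT']
```

Padding each batch to its longest sequence (padding="longest" in the tokenizer call) instead of to max_length is another way to shrink the tensors, at the cost of batches having different shapes.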


Conclusion

The Nucleotide Transformer is a powerful tool for genomics, enabling scientists to leverage extensive genomic data for analysis. We hope this guide has helped you get started with this revolutionary model.
