Welcome to your comprehensive guide on using GenerRNA, a state-of-the-art generative RNA language model! This guide will walk you through the installation, setup, and usage of GenerRNA, making the complex process user-friendly and accessible. Whether you’re looking to explore RNA sequences or fine-tune the model to suit specific needs, we’ve got you covered!
What is GenerRNA?
GenerRNA is a cutting-edge generative RNA language model based on a Transformer decoder-only architecture. It was trained on a whopping 30 million sequences, comprising 17 billion nucleotides! With this powerful tool, you can generate and fine-tune RNA sequences effortlessly.
Requirements
- A CUDA environment
- Minimum VRAM of 8GB
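Before moving on, you can confirm from Python that a CUDA device is visible and check how much VRAM it exposes. This is a quick check using PyTorch's standard API:

import torch

# Confirm a CUDA device is visible and report its total VRAM.
assert torch.cuda.is_available(), "No CUDA device found"
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")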
Dependencies
Before diving in, make sure you have the following dependencies installed:
- torch>=2.0
- numpy
- transformers==4.33.0.dev0
- datasets==2.14.4
- tqdm
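If you are setting up a fresh environment, a pip invocation along these lines should cover them. Note that transformers==4.33.0.dev0 is a development build, so it may not be available on PyPI; installing transformers from source (pinning a commit if you need that exact version) is the usual workaround:

pip install "torch>=2.0" numpy "datasets==2.14.4" tqdm
pip install git+https://github.com/huggingface/transformers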
Setting Up GenerRNA
To get started with GenerRNA, you’ll need to combine the split model files into a single model file. Follow these simple steps:
cat model.pt.part-* > model.pt.recombined
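The shell glob expands the parts in lexicographic order (part-aa, part-ab, and so on), so they are concatenated correctly. If you are on a system without cat, here is a minimal Python equivalent, assuming the same file names:

import glob
import shutil

# Stream the split checkpoint parts, in lexicographic order, into one file.
with open("model.pt.recombined", "wb") as out:
    for part in sorted(glob.glob("model.pt.part-*")):
        with open(part, "rb") as f:
            shutil.copyfileobj(f, out)  # chunked copy; avoids loading a whole part into RAM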
Understanding the Directory Structure
Your setup will look something like this:
├── LICENSE
├── README.md
├── configs
│ ├── example_finetuning.py
│ └── example_pretraining.py
├── experiments_data
├── model.pt.part-aa # split bin data of pre-trained model
├── model.pt.part-ab
├── model.pt.part-ac
├── model.pt.part-ad
├── model.py # define the architecture
├── sampling.py # script to generate sequences
├── tokenization.py # prepare data
├── tokenizer_bpe_1024
│ ├── tokenizer.json
│ ├── ....
├── train.py # script for training and fine-tuning
Think of this directory structure like the layout of a library. Each file is like a separate book or resource, neatly organized for easy access whenever you need to refer to a specific “topic” in your RNA research.
Usage of GenerRNA
Generating Sequences in a Zero-shot Fashion
To generate RNA sequences in a zero-shot fashion, i.e. straight from the pre-trained model without any fine-tuning, use the following command:
python sampling.py --out_path output_file_path --max_new_tokens 256 --ckpt_path model.pt.recombined --tokenizer_path path_to_tokenizer_directory
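Under the hood, sampling with a decoder-only model is an autoregressive loop: feed the tokens generated so far back into the model, sample the next token from its output distribution, and repeat. The following is a schematic sketch of that loop, not the repo's actual sampling.py; it assumes model(idx) returns logits of shape (batch, length, vocab):

import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, top_k=None):
    # idx: (batch, length) tensor of the token ids generated so far
    for _ in range(max_new_tokens):
        logits = model(idx)[:, -1, :] / temperature      # logits for the next position
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float("inf")  # mask everything below the top k
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_token], dim=1)        # append and feed back in
    return idx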
Pre-training or Fine-tuning on Your Own Sequences
If you wish to fine-tune the model using your own sequences, follow these steps:
- First, tokenize your sequence data. Ensure each sequence is on a separate line and that the file has no header (a minimal data-prep sketch follows the commands below).
- Next, refer to configs/example_**.py to create a configuration file for the model.
- Finally, execute the tokenization and training commands:
python tokenization.py --data_dir path_to_the_directory_containing_sequence_data --file_name file_name_of_sequence_data --tokenizer_path path_to_tokenizer_directory --out_dir directory_to_save_tokenized_data --block_size 256
python train.py --config path_to_your_config_file
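As mentioned in the first step, tokenization.py expects a plain-text file with one sequence per line and no header. Here is a tiny, hypothetical data-prep sketch; the sequences and file name are placeholders for your own data:

# Hypothetical data prep: one RNA sequence per line, no header row.
sequences = ["AUGGCUACGUAGCUAGC", "GGGAAACCCUUUGGGAA"]  # replace with your own sequences
with open("my_sequences.txt", "w") as f:
    f.write("\n".join(sequences) + "\n")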
Training Your Own Tokenizer
To train your own tokenizer on a plain-text file (one sequence per line), use the command:
python train_BPE.py --txt_file_path path_to_training_file --vocab_size 50256 --new_tokenizer_path directory_to_save_trained_tokenizer
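As a sanity check, you can load the resulting tokenizer with the Hugging Face tokenizers library and encode a test sequence. This assumes the trained tokenizer is saved as tokenizer.json, matching the bundled tokenizer_bpe_1024 directory:

from tokenizers import Tokenizer

# Load the trained BPE tokenizer and inspect how it segments a sequence.
tok = Tokenizer.from_file("directory_to_save_trained_tokenizer/tokenizer.json")
enc = tok.encode("AUGGCUACGUAGCUAGCUA")
print(enc.tokens)  # BPE merges learned over the nucleotide alphabet
print(enc.ids)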
Troubleshooting Common Issues
If you encounter any issues during setup or usage, consider the following troubleshooting tips:
- Ensure that your CUDA environment is properly configured and that you have sufficient VRAM.
- Double-check that you have installed the required dependencies and that they are the correct versions.
- If you’re having trouble with sequence generation or tokenization, verify that your input data meets the specified formats.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
With this guide, you’re now equipped to harness the power of GenerRNA in your own RNA research. Enjoy exploring the RNA space!

