Welcome to your comprehensive guide on using GenerRNA, a state-of-the-art generative RNA language model! This guide will walk you through the installation, setup, and usage of GenerRNA, making the complex process user-friendly and accessible. Whether you’re looking to explore RNA sequences or fine-tune the model to suit specific needs, we’ve got you covered!
What is GenerRNA?
GenerRNA is a cutting-edge generative RNA language model based on a Transformer decoder-only architecture. It was trained on a whopping 30 million sequences, comprising 17 billion nucleotides! With this powerful tool, you can generate and fine-tune RNA sequences effortlessly.
Requirements
- A CUDA environment
- Minimum VRAM of 8GB
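Before moving on, you can confirm from Python that a CUDA device is visible and check how much VRAM it exposes. This is a quick check using PyTorch's standard API:

import torch

# Confirm a CUDA device is visible and report its total VRAM.
assert torch.cuda.is_available(), "No CUDA device found"
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")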
Dependencies
Before diving in, make sure you have the following dependencies installed:
- torch>=2.0
- numpy
- transformers==4.33.0.dev0
- datasets==2.14.4
- tqdm
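If you are setting up a fresh environment, a pip invocation along these lines should cover them. Note that transformers==4.33.0.dev0 is a development build, so it may not be available on PyPI; installing transformers from source (pinning a commit if you need that exact version) is the usual workaround:

pip install "torch>=2.0" numpy "datasets==2.14.4" tqdm
pip install git+https://github.com/huggingface/transformers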
Setting Up GenerRNA
To get started with GenerRNA, you’ll need to combine the split model files into a single model file. Follow these simple steps:
cat model.pt.part-* > model.pt.recombined
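The shell glob expands the parts in lexicographic order (part-aa, part-ab, and so on), so they are concatenated correctly. If you are on a system without cat, here is a minimal Python equivalent, assuming the same file names:

import glob
import shutil

# Stream the split checkpoint parts, in lexicographic order, into one file.
with open("model.pt.recombined", "wb") as out:
    for part in sorted(glob.glob("model.pt.part-*")):
        with open(part, "rb") as f:
            shutil.copyfileobj(f, out)  # chunked copy; avoids loading a whole part into RAM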
Understanding the Directory Structure
Your setup will look something like this:
├── LICENSE
├── README.md
├── configs
│ ├── example_finetuning.py
│ └── example_pretraining.py
├── experiments_data
├── model.pt.part-aa # split bin data of pre-trained model
├── model.pt.part-ab
├── model.pt.part-ac
├── model.pt.part-ad
├── model.py # define the architecture
├── sampling.py # script to generate sequences
├── tokenization.py # prepare data
├── tokenizer_bpe_1024
│ ├── tokenizer.json
│ ├── ....
├── train.py # script for training and fine-tuning
Think of this directory structure like the layout of a library. Each file is like a separate book or resource, neatly organized for easy access whenever you need to refer to a specific “topic” in your RNA research.
Usage of GenerRNA
Generating Sequences in a Zero-shot Fashion
To generate RNA sequences in a zero-shot fashion, i.e. straight from the pre-trained model without any fine-tuning, use the following command:
python sampling.py --out_path output_file_path --max_new_tokens 256 --ckpt_path model.pt.recombined --tokenizer_path path_to_tokenizer_directory
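Under the hood, sampling with a decoder-only model is an autoregressive loop: feed the tokens generated so far back into the model, sample the next token from its output distribution, and repeat. The following is a schematic sketch of that loop, not the repo's actual sampling.py; it assumes model(idx) returns logits of shape (batch, length, vocab):

import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, top_k=None):
    # idx: (batch, length) tensor of the token ids generated so far
    for _ in range(max_new_tokens):
        logits = model(idx)[:, -1, :] / temperature      # logits for the next position
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float("inf")  # mask everything below the top k
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_token], dim=1)        # append and feed back in
    return idx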
Pre-training or Fine-tuning on Your Own Sequences
If you wish to fine-tune the model using your own sequences, follow these steps:
- First, tokenize your sequence data. Ensure each sequence is on a separate line and that the file has no header (a minimal data-prep sketch follows the commands below).
- Next, refer to configs/example_**.py to create a configuration file for the model.
- Finally, execute the tokenization and training commands:
python tokenization.py --data_dir path_to_the_directory_containing_sequence_data --file_name file_name_of_sequence_data --tokenizer_path path_to_tokenizer_directory --out_dir directory_to_save_tokenized_data --block_size 256
python train.py --config path_to_your_config_file
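As mentioned in the first step, tokenization.py expects a plain-text file with one sequence per line and no header. Here is a tiny, hypothetical data-prep sketch; the sequences and file name are placeholders for your own data:

# Hypothetical data prep: one RNA sequence per line, no header row.
sequences = ["AUGGCUACGUAGCUAGC", "GGGAAACCCUUUGGGAA"]  # replace with your own sequences
with open("my_sequences.txt", "w") as f:
    f.write("\n".join(sequences) + "\n")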
Training Your Own Tokenizer
To train your own tokenizer on a plain-text file (one sequence per line), use the command:
python train_BPE.py --txt_file_path path_to_training_file --vocab_size 50256 --new_tokenizer_path directory_to_save_trained_tokenizer
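As a sanity check, you can load the resulting tokenizer with the Hugging Face tokenizers library and encode a test sequence. This assumes the trained tokenizer is saved as tokenizer.json, matching the bundled tokenizer_bpe_1024 directory:

from tokenizers import Tokenizer

# Load the trained BPE tokenizer and inspect how it segments a sequence.
tok = Tokenizer.from_file("directory_to_save_trained_tokenizer/tokenizer.json")
enc = tok.encode("AUGGCUACGUAGCUAGCUA")
print(enc.tokens)  # BPE merges learned over the nucleotide alphabet
print(enc.ids)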
Troubleshooting Common Issues
If you encounter any issues during setup or usage, consider the following troubleshooting tips:
- Ensure that your CUDA environment is properly configured and that you have sufficient VRAM.
- Double-check that you have installed the required dependencies and that they are the correct versions.
- If you’re having trouble with sequence generation or tokenization, verify that your input data meets the specified formats.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
With this guide, you’re now equipped to harness the power of GenerRNA in your own RNA research. Enjoy exploring the RNA space!

