How to Get Started with MoLFormer: A Large-Scale Chemical Language Model

Aug 29, 2023 | Educational

Welcome to the world of chemistry and AI! MoLFormer is an innovative model designed for learning about small molecules through their SMILES (Simplified Molecular Input Line Entry System) strings. This guide walks you through the steps needed to get started with MoLFormer, keeping the process straightforward and efficient.

Table of Contents

  • Getting Started
  • Pretrained Models and Training Logs
  • Replicating Conda Environment
  • Data
  • Pretraining
  • Finetuning
  • Feature Extraction
  • Attention Visualization Analysis
  • Troubleshooting

Getting Started

This model has been tested with Nvidia V100s. Before diving into training, you should have a few resources ready.
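Before anything else, it is worth confirming that PyTorch can actually see a CUDA-capable GPU. A minimal check (assuming PyTorch is already installed; this snippet is not part of the MoLFormer codebase):

import torch

# Confirm that a CUDA device is visible to PyTorch before launching training
if torch.cuda.is_available():
    print("Found GPU:", torch.cuda.get_device_name(0))  # e.g. a V100
else:
    print("No CUDA device found; GPU training will not be possible.")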

Pretrained Models and Training Logs

If you prefer not to train from scratch, the pretrained models and their associated training logs are located in the data directory within a structured hierarchy:

data
├── Pretrained MoLFormer
│   ├── checkpoints
│   ├── events.out.tfevents
│   └── hparams.yaml
├── checkpoints
├── Full_Attention_Rotary_Training_Logs
└── Linear_Rotary_Training_Logs

MoLFormer has been pretrained on a dataset of approximately 100 million molecules, and this large-scale pretraining is what drives its strong performance across various benchmarks.
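If you want to inspect a downloaded checkpoint before using it, a minimal sketch might look like the following. It assumes the layout in the tree above and uses a placeholder file name, last.ckpt, inside checkpoints/ — substitute whatever file is actually there:

import torch
import yaml  # PyYAML

# Read the hyperparameters the model was pretrained with
with open("data/Pretrained MoLFormer/hparams.yaml") as f:
    hparams = yaml.safe_load(f)
print(hparams)

# Peek inside the checkpoint without building the model;
# "last.ckpt" is a placeholder name, not guaranteed by the repo
ckpt = torch.load("data/Pretrained MoLFormer/checkpoints/last.ckpt",
                  map_location="cpu")
print(list(ckpt.keys()))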

Replicating Conda Environment

Replicating the required environment is a crucial first step. Note that Apex must be compiled from source, since the example code depends on it. Detailed instructions are located in environment.md.

Data

Data is at the heart of any machine learning model. The datasets needed for MoLFormer are accessible through this link. Ensure you have the right formats before proceeding.

Pretraining Datasets

The code expects two datasets, ZINC15 and PubChem, organized in the following directory structure:

data
├── pubchem
│   └── CID-SMILES-CANONICAL.smi
└── ZINC
    ├── Example.smi
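Before kicking off pretraining, it can help to sanity-check that these files parse. A rough sketch, assuming the usual conventions for such files (PubChem's CID-SMILES export is tab-separated with a compound ID followed by a SMILES string, and ZINC .smi files carry one SMILES per line — both assumptions, so adjust to what your downloads actually contain):

import pandas as pd

# PubChem: tab-separated compound ID and canonical SMILES (assumed layout)
pubchem = pd.read_csv("data/pubchem/CID-SMILES-CANONICAL.smi",
                      sep="\t", header=None, names=["cid", "smiles"])
print(pubchem.head())

# ZINC: one SMILES string per line (assumed layout)
with open("data/ZINC/Example.smi") as f:
    zinc_smiles = [line.split()[0] for line in f if line.strip()]
print(zinc_smiles[:5])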

Finetuning Datasets

The finetuning datasets also need to be in a specific format, as shown below:

data
├── bace
│   ├── test.csv
│   ├── train.csv
│   └── valid.csv
├── bbbp
│   ├── test.csv
│   ├── train.csv
│   └── valid.csv
└── tox21
    ├── test.csv
    ├── train.csv
    └── valid.csv
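A quick way to confirm a split loads cleanly is to read it with pandas and inspect the columns. This sketch assumes the usual MoleculeNet-style layout of a SMILES column plus a label column; the exact column names may vary by task:

import pandas as pd

train = pd.read_csv("data/bace/train.csv")
print(train.shape)
print(train.columns.tolist())  # check the SMILES and label column names
print(train.head())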

Pretraining

MoLFormer is pretrained with a masked language modeling objective on SMILES strings. It is crucial to filter compounds to a maximum length of 211 characters, which keeps sequence lengths bounded and training efficient.
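The length filter itself is simple to apply. A minimal sketch using the 211-character limit mentioned above:

# Keep only SMILES strings within the 211-character pretraining limit
MAX_LEN = 211

def filter_smiles(smiles_list):
    """Drop SMILES strings longer than MAX_LEN characters."""
    return [s for s in smiles_list if len(s) <= MAX_LEN]

smiles = ["CCO", "c1ccccc1", "C" * 300]  # toy inputs; the last is too long
print(filter_smiles(smiles))  # ['CCO', 'c1ccccc1']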

Finetuning

After pretraining, you can launch a finetuning task by executing the command bash run_finetune_mu.sh.

Feature Extraction

Use the provided notebook frozen_embeddings_classification.ipynb to extract features with the pretrained model.
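The notebook is the recommended route. If you only want a feel for what frozen-embedding extraction looks like, here is a rough sketch using the MoLFormer checkpoint published on the Hugging Face Hub (ibm/MoLFormer-XL-both-10pct); note that this checkpoint and API are an assumption on our part, separate from this repository's own weights and notebook:

import torch
from transformers import AutoModel, AutoTokenizer

model_id = "ibm/MoLFormer-XL-both-10pct"  # assumed Hub checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
model.eval()

smiles = ["CCO", "c1ccccc1O"]  # ethanol and phenol as toy inputs
inputs = tokenizer(smiles, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One fixed-size vector per molecule, usable as frozen features
embeddings = outputs.pooler_output
print(embeddings.shape)

These frozen embeddings can then feed any downstream classifier, which is the pattern the notebook demonstrates.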

Attention Visualization Analysis

For attention visualization, refer to the two notebooks provided, which walk through the analysis in depth.
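As a rough idea of what those notebooks produce, the core step is rendering an attention matrix as a heatmap over SMILES tokens. A self-contained sketch with synthetic weights (the real matrices come from the model inside the notebooks):

import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for one attention head's weights over SMILES tokens;
# each row sums to 1, as real attention weights do
tokens = ["C", "C", "(", "=", "O", ")", "O"]
rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(len(tokens)), size=len(tokens))

fig, ax = plt.subplots()
im = ax.imshow(attn, cmap="viridis")
ax.set_xticks(range(len(tokens)), tokens)
ax.set_yticks(range(len(tokens)), tokens)
fig.colorbar(im, ax=ax, label="attention weight")
plt.show()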

Troubleshooting

While working with MoLFormer, you might encounter some hiccups. Here are a few troubleshooting tips:

  • Ensure your datasets are correctly formatted according to the specifications.
  • Check that all dependencies listed in environment.md are installed properly.
  • If you face issues with training on specific GPUs, consider reallocating resources or switching GPUs.

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
