LLamol is an advanced tool designed for de novo molecular design utilizing a dynamic multi-conditional generative transformer model. Whether you’re a developer, researcher, or a curious mind, this guide aims to simplify your journey through installation, dataset preparation, and model training.
Installation of LLamol
To get started, install LLamol using Micromamba, the preferred method:
```bash
$ "${SHELL}" <(curl -L micro.mamba.pm/install.sh)
$ micromamba env create -f torch2-env.yaml
$ micromamba activate torch2-llamol
$ python sample.py
```
Downloading and Preprocessing the OrganiX13 Dataset
If you wish to train using the full dataset of 13 million entries, follow these steps (a consolidated sketch of the pipeline follows the list):
- Download and preprocess the OPV dataset by running `data/opv/prepare_opv.py`.
- Download and preprocess the ZINC dataset by running `data/zinc/zinc_complete_run_download.py` followed by `data/zinc/convert_to_parquet.py` (requires at least 16GB of RAM).
- Download and convert the QM9/ZINC250k/CEP dataset using `data/qm9/zinc250k_cep/convert_to_parquet.py`.
- Combine all datasets by running `data/combine_all.py` (this step can take a while).
- Run `preprocess_dataset.py` to create `.cache/processed_dataset_None.pkl`.
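If you prefer to drive these steps from one place, here is a minimal sketch that simply runs the scripts above in order. It assumes you launch it from the repository root with the torch2-llamol environment active; the script paths are the ones listed above, and the script itself is not part of the LLamol repository.

```python
import subprocess
import sys

# Preprocessing scripts in the order given above. Assumes the repository
# root as the working directory and the torch2-llamol environment active.
steps = [
    "data/opv/prepare_opv.py",                      # OPV: download + preprocess
    "data/zinc/zinc_complete_run_download.py",      # ZINC: download
    "data/zinc/convert_to_parquet.py",              # ZINC: convert (needs at least 16GB RAM)
    "data/qm9/zinc250k_cep/convert_to_parquet.py",  # QM9/ZINC250k/CEP: convert
    "data/combine_all.py",                          # merge everything (can take a while)
    "preprocess_dataset.py",                        # writes .cache/processed_dataset_None.pkl
]

for script in steps:
    print(f"Running {script} ...")
    # check=True stops the pipeline as soon as any step fails.
    subprocess.run([sys.executable, script], check=True)
```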
Interactive Demo
After installation, you can explore the model using the demonstrator.ipynb file. Run all cells and navigate to the last cell where a user interface awaits your commands.
Training the Model
Before training, activate your environment:
```bash
# When installed with conda instead of micromamba:
$ conda activate torch2-llamol
# OR
$ micromamba activate torch2-llamol
```
To initiate local training, run:
```bash
$ python train.py train=llama2-M-Full-RSS
```
If you need to customize parameters, you may override them like this:
```bash
$ python train.py train=llama2-M-Full-RSS train.model.dim=1024
```
For multi-GPU training on a SLURM cluster, follow the script examples provided in the README and adjust the GPU settings in the trainLLamaMol.sh file as necessary.
Sampling Your Results
To sample results, you can modify parameters as shown below:
```bash
$ python sample.py --num_samples 2000 --ckpt_path out/llama2-M-Full-RSS.pt --max_new_tokens 256 --cmp_dataset_path=data/OrganiX13.parquet --seed 4312 --context_cols logp sascore mol_weight --temperature 0.8
```
Using Your Own Dataset
To use a custom dataset, process it with preprocess_dataset.py. Ensure your data follows the expected format, with the SMILES strings in a designated column. After processing, rename the resulting file if necessary and point your training configuration at it. A minimal sketch of assembling such a file is shown below.
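As a starting point, here is a minimal sketch of assembling such a parquet file with pandas and RDKit. It is not code from the LLamol repository: the output path is hypothetical, and the column names (smiles, logp, mol_weight) are assumptions based on the conditioning columns used in the sampling command above, so check preprocess_dataset.py for the names and format it actually expects.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors

# Hypothetical example molecules; replace with your own data.
candidate_smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]

rows = []
for smi in candidate_smiles:
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        continue  # skip SMILES that RDKit cannot parse
    rows.append({
        "smiles": smi,                         # assumed name of the SMILES column
        "logp": Crippen.MolLogP(mol),          # octanol-water partition coefficient
        "mol_weight": Descriptors.MolWt(mol),  # molecular weight
    })

# Write a parquet file to hand to preprocess_dataset.py; verify the exact
# path and column requirements against that script.
pd.DataFrame(rows).to_parquet("data/my_custom_dataset.parquet", index=False)
```

The sascore column from the sampling example is omitted here, since synthetic accessibility scoring typically relies on RDKit's contrib sascorer module.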
Understanding the Training Method
The LLamol model employs a method referred to as Random SMILES Sampling (RSS). Think of this as training a chef (the model) to prepare a unique dish by sampling different ingredient combinations (the tokens of a SMILES string). The chef learns by trying out various ingredient arrangements to create delightful new recipes (molecules), instead of always sticking to the same classical recipe (dataset). This encourages creativity and innovation in molecular design.
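To make the "different ingredient arrangements" concrete: the same molecule can be written as many different but chemically equivalent SMILES strings. Here is a minimal RDKit sketch (illustrative only, not code from the LLamol repository) that enumerates a few random SMILES for one molecule:

```python
from rdkit import Chem

# Aspirin, written in one canonical form.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# doRandom=True starts the SMILES traversal from a random atom ordering,
# so repeated calls can yield different strings for the same molecule.
variants = {Chem.MolToSmiles(mol, canonical=False, doRandom=True) for _ in range(10)}

for smi in sorted(variants):
    print(smi)
```

Each printed string encodes the same molecule in a different arrangement, which is the kind of variation the analogy above is pointing at.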
Troubleshooting
If you encounter issues while using LLamol, here are a few troubleshooting tips:
- Performance Issues: Ensure you have sufficient RAM (at least 16GB) as recommended for dataset processing.
- Training Errors: Check your configurations in the train YAML files to ensure the parameters align with your setup.
- Dataset Issues: Confirm your custom dataset is correctly formatted and includes the required SMILES column.
- For further assistance, brainstorm with fellow developers and access curated information by visiting **[fxis.ai](https://fxis.ai)**.
At [fxis.ai](https://fxis.ai), we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
Armed with this guide, you should be well-equipped to leverage the power of LLamol for transformative molecular design. Happy coding!

