IUPAC2SMILES: Your Guide to Translating Chemical Names into SMILES

Feb 17, 2024 | Educational

Converting IUPAC chemical names into their corresponding SMILES (Simplified Molecular Input Line Entry System) representations is essential for various fields such as chemistry, biology, and medicine. The IUPAC2SMILES-canonical-base model simplifies this process using advanced machine learning techniques. In this blog, we’ll walk you through how to use this model effectively, troubleshoot common issues, and understand its intricacies.

Model Overview

The IUPAC2SMILES-canonical-base model was developed by Knowledgator Engineering. It is based on the mT5 model, with separate tokenizers for the encoder and the decoder to improve accuracy on both sides. It operates as an encoder-decoder with an attention mechanism, treating IUPAC names (in English) and SMILES as two languages and translating from one to the other.

Getting Started

To start using the IUPAC2SMILES model, you’ll first need to install the required library. Here’s how you can do that:

pip install chemical-converters

Translating IUPAC to SMILES

Once you’ve installed the library, you can easily translate a chemical name into its SMILES representation:

from chemicalconverters import NamesConverter

# Load the IUPAC-to-SMILES model by name
converter = NamesConverter(model_name='knowledgator/IUPAC2SMILES-canonical-base')
# Translate a single name
print(converter.iupac_to_smiles('ethanol'))
# Translate a list of names in one call
print(converter.iupac_to_smiles(['ethanol', 'ethanol', 'ethanol']))

In the code above, the input ‘ethanol’ produces a list containing CCO, which is the SMILES string for ethanol.
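When the same name appears several times in an input list, as in the second call above, it can be convenient to translate each distinct name once and look results up by name. The sketch below assumes the translation callable (for example converter.iupac_to_smiles) accepts a list of names and returns a list of SMILES strings in the same order. The helpers names_to_smiles and fake_translate are hypothetical, written only for this illustration, and are not part of chemical-converters.

```python
from typing import Callable, Dict, List


def names_to_smiles(
    names: List[str],
    translate: Callable[[List[str]], List[str]],
) -> Dict[str, str]:
    """Translate each distinct name once and return a name -> SMILES mapping."""
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    unique_names = list(dict.fromkeys(names))
    if not unique_names:
        return {}
    smiles = translate(unique_names)
    # Assumption: one result per input name, in the same order.
    if len(smiles) != len(unique_names):
        raise ValueError("translate() returned a different number of results than names")
    return dict(zip(unique_names, smiles))


# Stand-in for converter.iupac_to_smiles so the sketch runs without the model.
def fake_translate(batch: List[str]) -> List[str]:
    lookup = {"ethanol": "CCO", "methane": "C"}
    return [lookup[name] for name in batch]


mapping = names_to_smiles(["ethanol", "ethanol", "methane"], fake_translate)
print(mapping)
```

To use it with the real model, pass converter.iupac_to_smiles in place of fake_translate, provided the ordering assumption holds for your installed version.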

Batch Processing

If you need to process multiple names at once, the IUPAC2SMILES model supports batch processing:

from chemicalconverters import NamesConverter

converter = NamesConverter(model_name='knowledgator/IUPAC2SMILES-canonical-base')
# Ten copies of the same name, submitted as one batched call
names = ['buta-1,3-diene'] * 10
print(converter.iupac_to_smiles(names, num_beams=1,
                                process_in_batch=True, batch_size=1000))

This code processes ten instances of buta-1,3-diene in one go and returns a list of the corresponding SMILES strings.
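To make the role of a batch size concrete, here is a minimal sketch of how a list of names can be split into fixed-size batches. It illustrates the general idea only and is not the library's internal implementation; the chunked helper is a hypothetical name introduced for this example.

```python
from typing import Iterator, List


def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


names = ["buta-1,3-diene"] * 10
batches = list(chunked(names, batch_size=4))
# Ten names with a batch size of 4 give batches of 4, 4 and 2 names.
print([len(batch) for batch in batches])
```

With a batch size of 1000, as in the call above, all ten names would fit in a single batch.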

Understanding the Output Styles

The model can also predict different IUPAC styles, which are categorized as follows:

  • BASE: The commonly recognized name, often a mixture of traditional and systematic styles.
  • SYST: A style that completely embodies systematic naming without trivial names.
  • TRAD: A style derived from trivial names associated with parts of the substances.

Limitations and Risks

The IUPAC2SMILES-canonical-base model reaches 93.7% accuracy with a BLEU-4 score of 0.974, but it does have limitations. Its performance diminishes on larger molecular structures, and it currently lacks support for isomeric and isotopic SMILES.
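The post does not say how the accuracy figure was computed. One common reading is exact-match accuracy: the share of predictions whose SMILES string is identical to the reference. The sketch below shows that calculation under this assumption, using made-up example data; exact_match_accuracy is a hypothetical helper, not a function from chemical-converters.

```python
from typing import Sequence


def exact_match_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of positions where the predicted string equals the reference string."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("cannot compute accuracy on an empty set")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Illustrative data only: three of the four predictions match their reference.
predicted = ["CCO", "C=CC=C", "CC(=O)O", "CCC"]
reference = ["CCO", "C=CC=C", "CC(=O)O", "CCN"]
print(exact_match_accuracy(predicted, reference))
```

A plain string comparison is only meaningful when both sides use the same canonical form, since one molecule can be written in several ways (OCC and CCO both denote ethanol).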

Training Information

The model was trained on 100 million examples of SMILES-IUPAC pairs, utilizing a learning rate of 0.00001 and a batch size of 512 over 2 epochs. Such extensive training is critical to the model’s overall performance.
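As a rough sense of scale, the figures above can be turned into a count of optimizer steps. This back-of-the-envelope calculation assumes exactly 100 million examples, one optimizer step per batch (no gradient accumulation), and that the final partial batch of each epoch is kept; none of these details are stated in the source.

```python
import math

num_examples = 100_000_000  # "100 million" pairs, treated as an exact count here
batch_size = 512
epochs = 2

# Assumption: the last, smaller batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 195313
print(total_steps)      # 390626
```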

Evaluation Summary

Here’s a brief evaluation of the model’s performance against similar models:

Model                          Accuracy   BLEU-4 Score   Size (MB)
IUPAC2SMILES-canonical-small   88.9%      0.966          23
IUPAC2SMILES-canonical-base    93.7%      0.974          180
STOUT V2.0*                    68.47%     0.92           128

*According to the original STOUT V2.0 paper.

Troubleshooting

If you run into issues such as poor model accuracy or errors while translating chemical names, consider the following troubleshooting steps:

  • Ensure that the library is correctly installed by running pip show chemical-converters.
  • Verify that the IUPAC name is spelled correctly and follows standard naming conventions.
  • Use simpler or more common compounds to test if the model returns expected results.
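Some of these checks can be partly automated before and after calling the model. The sketch below uses only the standard library: clean_names tidies whitespace in input names, and brackets_balanced is a cheap structural test on a returned SMILES string. Both are hypothetical helpers written for this post. A balanced string is not necessarily a valid molecule, so a cheminformatics toolkit is still needed for real validation.

```python
import re
from typing import List


def clean_names(names: List[str]) -> List[str]:
    """Collapse whitespace, trim each name, and drop names that end up empty."""
    cleaned = []
    for name in names:
        # Case is left untouched because capital letters can be meaningful
        # in chemical names (for example the N locant in N-methyl groups).
        normalized = re.sub(r"\s+", " ", name).strip()
        if normalized:
            cleaned.append(normalized)
    return cleaned


def brackets_balanced(smiles: str) -> bool:
    """Return True if every '(' and '[' in the string is closed in the right order."""
    closing_to_opening = {")": "(", "]": "["}
    stack = []
    for ch in smiles:
        if ch in "([":
            stack.append(ch)
        elif ch in closing_to_opening:
            if not stack or stack.pop() != closing_to_opening[ch]:
                return False
    return not stack


print(clean_names(["  ethanol ", "", "acetic   acid"]))
print(brackets_balanced("CC(=O)O"), brackets_balanced("CC(=O"))
```

An output that fails the bracket check is a strong hint that the name was misread or the molecule is too large for the model to handle reliably.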

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
