🧪 Chemical-Converters

Chemical-Converters is a Python library developed by Knowledgator for translating between different chemical formats, specifically between SMILES (Simplified Molecular Input Line Entry System) strings and IUPAC (International Union of Pure and Applied Chemistry) names. This tool is designed to assist researchers, students, and professionals in the chemical domain by providing accurate and efficient conversions.


Features

  • Bidirectional Conversion: Translate between SMILES and IUPAC names.
  • Multiple Models: Choose from various models optimized for different accuracy and performance needs.
  • Style Preferences: Specify preferred IUPAC naming styles using style tokens.
  • Batch Processing: Efficiently process large datasets with batch conversion.
  • Validation: Validate conversions using Tanimoto similarity scores.

Installation

Install the library using pip:

pip install chemical-converters

Models

The models are based on Google's mT5 architecture, adapted to translate between chemical formats. Available models:

Model Name                      Accuracy  Size (MB)  Task
SMILES2IUPAC-canonical-small    75%       24         SMILES to IUPAC
SMILES2IUPAC-canonical-base     86.9%     180        SMILES to IUPAC
IUPAC2SMILES-canonical-small    88.9%     24         IUPAC to SMILES
IUPAC2SMILES-canonical-base     93.7%     180        IUPAC to SMILES
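
The trade-off in the table (larger models are more accurate) can be turned into a simple selection rule. The sketch below is illustrative only: `MODELS` copies the figures from the table, while `pick_model` and its `max_size_mb` parameter are hypothetical helpers, not part of the library.

```python
# Figures copied from the table above; the selection logic is an illustration.
MODELS = [
    {"name": "SMILES2IUPAC-canonical-small", "accuracy": 75.0, "size_mb": 24, "task": "SMILES to IUPAC"},
    {"name": "SMILES2IUPAC-canonical-base", "accuracy": 86.9, "size_mb": 180, "task": "SMILES to IUPAC"},
    {"name": "IUPAC2SMILES-canonical-small", "accuracy": 88.9, "size_mb": 24, "task": "IUPAC to SMILES"},
    {"name": "IUPAC2SMILES-canonical-base", "accuracy": 93.7, "size_mb": 180, "task": "IUPAC to SMILES"},
]


def pick_model(task, max_size_mb):
    """Return the most accurate model for `task` within the size budget, or None."""
    candidates = [
        m for m in MODELS
        if m["task"] == task and m["size_mb"] <= max_size_mb
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m["accuracy"])["name"]


print(pick_model("SMILES to IUPAC", max_size_mb=100))
print(pick_model("IUPAC to SMILES", max_size_mb=200))
```

The usage examples on this page load the base models with a `knowledgator/` prefix; the sketch keeps the bare names exactly as the table lists them.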

To check available models:

from chemicalconverters import NamesConverter

print(NamesConverter.available_models())

Usage

SMILES to IUPAC

from chemicalconverters import NamesConverter

converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-base")
result = converter.smiles_to_iupac('CCO')
print(result)  # Output: ['ethanol']

Specifying IUPAC Style

Use style tokens to specify the desired IUPAC naming convention:

  • <BASE>: Common name, possibly a mix of traditional and systematic styles.
  • <SYST>: Fully systematic name without trivial names.
  • <TRAD>: Name based on trivial names of substance parts.

Example:

converter.smiles_to_iupac(['<SYST>CCO', '<TRAD>CCO', '<BASE>CCO'])
# Output: ['ethanol', 'ethanol', 'ethanol']
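
Since a style token is just a prefix on the SMILES string, building styled inputs can be wrapped in a small helper. This is a minimal sketch: `with_style` and `all_styles` are hypothetical names introduced here, and only the three tokens listed above are assumed to be valid.

```python
# The three style tokens documented above.
STYLE_TOKENS = ("<BASE>", "<SYST>", "<TRAD>")


def with_style(smiles, style):
    """Prefix a SMILES string with a style token, e.g. 'syst' -> '<SYST>CCO'."""
    token = f"<{style.upper()}>"
    if token not in STYLE_TOKENS:
        raise ValueError(f"unknown style {style!r}; expected one of {STYLE_TOKENS}")
    return token + smiles


def all_styles(smiles):
    """Return the same SMILES string once per documented style token."""
    return [token + smiles for token in STYLE_TOKENS]


print(with_style("CCO", "syst"))
print(all_styles("CCO"))
```

The list returned by `all_styles("CCO")` has the same shape as the input in the example above and could be passed to `converter.smiles_to_iupac` directly.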

Batch Processing

converter.smiles_to_iupac(
    ["<BASE>C=CC=C" for _ in range(10)],
    num_beams=1,
    process_in_batch=True,
    batch_size=1000,
)
# Output: ['buta-1,3-diene', 'buta-1,3-diene', ...]

Validation

Validate the conversion by checking the Tanimoto similarity between the original and converted molecules:

converter.smiles_to_iupac('CCO', validate=True)
# Output: ['ethanol'], 1.0
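
For intuition about the score: the Tanimoto similarity of two fingerprints, viewed as sets of "on" bits A and B, is |A ∩ B| / |A ∪ B|, so identical fingerprints score 1.0 and disjoint ones score 0.0. The sketch below computes it on plain Python sets. How the library derives fingerprints is not covered on this page, so the bit sets here are made-up placeholders.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity of two collections of fingerprint bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / len(union)


# Made-up bit indices, for illustration only.
original = {1, 2, 3, 4}
round_trip_same = {1, 2, 3, 4}
round_trip_diff = {3, 4, 5, 6}

print(tanimoto(original, round_trip_same))  # identical sets
print(tanimoto(original, round_trip_diff))  # 2 shared bits out of 6 distinct bits
```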

Manual validation:

validation_model = NamesConverter(model_name="knowledgator/IUPAC2SMILES-canonical-base")
NamesConverter.validate_iupac(input_sequence='CCO', predicted_sequence='ethanol', validation_model=validation_model)
# Output: 1.0

Note: Validation is not implemented for batch processing.
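
One way to work around this is to convert items one at a time with `validate=True` and collect the scores. The sketch below assumes that a validated call returns a `(names, score)` pair, as the output comment above suggests. `validate_each`, its `threshold` parameter, and `StubConverter` are hypothetical names; the stub only exists so the example runs without downloading a model.

```python
def validate_each(converter, smiles_list, threshold=1.0):
    """Convert SMILES strings one by one and flag results scoring below `threshold`."""
    report = []
    for smiles in smiles_list:
        # Assumption: validate=True yields (list_of_names, tanimoto_score).
        names, score = converter.smiles_to_iupac(smiles, validate=True)
        report.append({
            "smiles": smiles,
            "name": names[0],
            "score": score,
            "ok": score >= threshold,
        })
    return report


class StubConverter:
    """Offline stand-in for NamesConverter; replace with a real converter."""

    _known = {"CCO": "ethanol"}

    def smiles_to_iupac(self, smiles, validate=False):
        name = self._known.get(smiles, "unknown")
        score = 1.0 if smiles in self._known else 0.0
        return ([name], score) if validate else [name]


report = validate_each(StubConverter(), ["CCO", "C=CC=C"])
for row in report:
    print(row)
```

Item-by-item conversion gives up the throughput of batch mode, so it is best suited to spot-checking a sample of a large dataset.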

IUPAC to SMILES

converter = NamesConverter(model_name="knowledgator/IUPAC2SMILES-canonical-base")
result = converter.iupac_to_smiles('ethanol')
print(result)  # Output: ['CCO']

Batch Processing

converter.iupac_to_smiles(
    ["buta-1,3-diene" for _ in range(10)],
    num_beams=1,
    process_in_batch=True,
    batch_size=1000,
)
# Output: ['<SYST>C=CC=C', '<SYST>C=CC=C', ...]

Remember, chemistry is not just about reactions; it's about connections. Let's build those connections together!