Building your own molecular representation class

This notebook walks through creating a custom molecular representation, a subclass of BaseCompoundVecRepresentation, that can be used for model training and prediction in Oloren ChemEngine (OCE).

Parent Class Specs

The custom representation class inherits from BaseCompoundVecRepresentation. All representations have two parameters defined by the parent class:

      “scale”: an sklearn scaler (or None) used to scale the output representation data. Defaults to StandardScaler.
      “collinear_thresh”: threshold for collinearity. Any representation feature whose correlation coefficient with another feature exceeds this threshold is removed.
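To make the effect of these two parameters concrete, the following sketch mimics them on a small feature matrix using NumPy only. It is not OCE's implementation: `drop_collinear` and `standardize` are hypothetical helpers, and the rule for which feature of a correlated pair is kept (here, the earlier column) is an assumption.

```python
import numpy as np

def drop_collinear(X, thresh):
    """Keep a column only if its |correlation| with every kept column is <= thresh.

    Hypothetical helper; the tie-breaking rule (earlier column wins) is assumed.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= thresh for k in keep):
            keep.append(j)
    return X[:, keep], keep

def standardize(X):
    """Scale each column to zero mean and unit variance, as a standard scaler does."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Four molecules, three features; the second feature is exactly twice the first.
X = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 0.0],
    [4.0, 8.0, 3.0],
])

X_kept, kept = drop_collinear(X, thresh=0.95)
X_scaled = standardize(X_kept)
```

Here the second column is dropped because it is perfectly correlated with the first, and the remaining columns are then scaled.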


You may define any custom parameters needed to calculate the representation in the __init__ method of your class. Make sure that __init__ accepts keyword arguments and passes them on to the parent class (e.g. scale, collinear_thresh).

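from olorenchemengine import BaseCompoundVecRepresentation
from olorenchemengine import log_arguments

class CustomRepresentation(BaseCompoundVecRepresentation):
    # log_arguments records the constructor arguments of the class
    @log_arguments
    def __init__(self, param1, log=True, **kwargs):
        self.param1 = param1
        # Forward remaining keyword arguments (e.g. scale, collinear_thresh) to the parent
        super().__init__(log=False, **kwargs)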
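The keyword-argument forwarding described above can be checked in isolation with plain Python. In this minimal sketch, `ParentSketch` and `ChildSketch` are hypothetical stand-ins, and the parent's default values are placeholders rather than OCE's actual defaults.

```python
class ParentSketch:
    """Hypothetical stand-in for the parent class; defaults are placeholders."""

    def __init__(self, scale="standard", collinear_thresh=1.0):
        self.scale = scale
        self.collinear_thresh = collinear_thresh


class ChildSketch(ParentSketch):
    def __init__(self, param1, **kwargs):
        self.param1 = param1
        # Everything the child does not consume is forwarded unchanged.
        super().__init__(**kwargs)


# Parent parameters can be set through the child's constructor...
rep = ChildSketch(param1=3, scale=None, collinear_thresh=0.9)
# ...or left out, in which case the parent's defaults apply.
default_rep = ChildSketch(param1=3)
```

Because the child never names the parent's parameters itself, parameters added to the parent later remain reachable without changing the child.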

_convert function

The only function needed in your custom representation class is the _convert helper function, which takes as a parameter the SMILES string representation of a single molecule and outputs a numpy array of its representation vector. The numpy array should be of shape (n,), where n is the number of bits/features for one molecule’s representation.

from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

class CustomRepresentation(BaseCompoundVecRepresentation):
    @log_arguments
    def __init__(self, radius=2, nbits=1024, log=True, **kwargs):
        # Store the constructor arguments rather than hard-coded values
        self.radius = radius
        self.nbits = nbits
        super().__init__(log=False, **kwargs)

    def _convert(self, smiles):
        '''Run calculations on a molecule's SMILES string.
        Example calculation using the Morgan fingerprint from RDKit.
        '''
        m = Chem.MolFromSmiles(smiles)
        fp = AllChem.GetMorganFingerprintAsBitVect(m, radius=self.radius, nBits=self.nbits)

        # Convert the bit vector to a numpy array of shape (nbits,)
        return np.array(fp)


The parent class’s convert function automatically calls the child’s _convert function, so it accepts either a single SMILES string or a list of SMILES strings.

representation = CustomRepresentation(radius = 2, nbits = 1024)
single_rep = representation.convert('CN=C=O')
list_rep = representation.convert(['CN=C=O', '[Cu+2].[O-]S(=O)(=O)[O-]', 'O=Cc1ccc(O)c(OC)c1'])
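The relationship between convert and _convert can be sketched without OCE or RDKit. The classes below are hypothetical stand-ins, not OCE's code: `SketchBase.convert` dispatches on the input type, and `CharCountSketch._convert` is a toy representation that only counts characters in the SMILES string (so it is not a true atom count). The return shapes that OCE itself produces may differ.

```python
import numpy as np

class SketchBase:
    """Hypothetical stand-in for the parent class; not OCE's actual code."""

    def _convert(self, smiles):
        raise NotImplementedError

    def convert(self, smiles):
        # A single string yields one vector of shape (n,).
        if isinstance(smiles, str):
            return np.asarray(self._convert(smiles))
        # A list yields a matrix of shape (len(smiles), n).
        return np.stack([np.asarray(self._convert(s)) for s in smiles])


class CharCountSketch(SketchBase):
    """Toy representation: counts of a few characters in the SMILES string."""

    symbols = ("C", "N", "O")

    def _convert(self, smiles):
        return np.array([smiles.count(sym) for sym in self.symbols], dtype=float)


rep = CharCountSketch()
single = rep.convert("CN=C=O")          # one molecule
batch = rep.convert(["CN=C=O", "CCO"])  # several molecules
```

The subclass only describes how to turn one molecule into a vector of shape (n,); handling of lists lives in one place in the parent.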

Using the representation in a model

You can now use your newly created representation class in any OCE model.

import olorenchemengine as oce
import pandas as pd

# Load the dataset (supply the path to your CSV file)
df = pd.read_csv("")
dataset = (oce.BaseDataset(data = df.to_csv(),
    structure_col = "Smiles", property_col = "pChEMBL Value") +
           oce.CleanStructures())

model = oce.BaseBoosting([
    oce.RandomForestModel(representation = representation, n_estimators=1000),
    oce.RandomForestModel(oce.OlorenCheckpoint("default"), n_estimators=1000),
    oce.ChemPropModel(epochs=20, batch_size=64)