1A First Model

In this example we’ll train a small permeability model using data from “ADME Properties Evaluation in Drug Discovery: Prediction of Caco-2 Cell Permeability Using a Combination of NSGA-II and Boosting” (https://pubs.acs.org/doi/10.1021/acs.jcim.5b00642). We’ll also compare our results with those presented in the paper.

Getting the data

# Downloading the data

import requests

r = requests.get("https://ndownloader.figstatic.com/files/4917022")
r.raise_for_status()  # fail early if the download did not succeed
with open("caco2_data.xlsx", "wb") as f:
    f.write(r.content)

# Reading the data into a dataframe
# Subsetting the data into molecule, split, and property
# Converting property values to floats
# Creating splits

import pandas as pd
import numpy as np

df = pd.read_excel("caco2_data.xlsx")

df["split"] = df["Dataset"].replace({"Tr": "train", "Te": "test"})
df = df[["smi", "split", "logPapp"]].dropna()

def isfloat(num):
    # Keep only values that can be parsed as a float
    try:
        float(num)
        return True
    except ValueError:
        return False

df = df[df["logPapp"].apply(isfloat)]

df["logPapp"] = df["logPapp"].astype('float')

# Now we use the dataframe to create a BaseDataset object.
# We will generate it from the pd.DataFrame object.
# We have defined our own split column, which will be used by the dataset object.

import olorenchemengine as oce

dataset = oce.BaseDataset(data = df.to_csv(), structure_col="smi", property_col="logPapp")

Making the model

import olorenchemengine as oce

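model = oce.BaseBoosting([
    oce.RandomForestModel(oce.DescriptastorusDescriptor("morgan3counts"), n_estimators=1000),
    oce.RandomForestModel(oce.DescriptastorusDescriptor("rdkit2dnormalized"), n_estimators=1000),
    oce.RandomForestModel(oce.OlorenCheckpoint("default"), n_estimators=1000),
])

# Fit on the training split before evaluating
# (assumes dataset.train_dataset mirrors the dataset.test_dataset used below)
model.fit(*dataset.train_dataset)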
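As rough intuition for what a boosting ensemble like the one above does, the sketch below fits each learner on the residuals left by the previous ones and sums their predictions. It assumes this residual-fitting scheme, which is the general idea behind boosting; the library's actual procedure may differ. `MeanModel`, `CenteredSlopeModel`, `fit_boosted`, and `predict_boosted` are hypothetical toy names, not part of olorenchemengine.

```python
# Toy residual boosting in plain Python (illustrative only, not the oce implementation).

class MeanModel:
    """Predicts the mean of the targets it was fitted on."""

    def fit(self, xs, ys):
        self.mean = sum(ys) / len(ys)

    def predict(self, xs):
        return [self.mean for _ in xs]


class CenteredSlopeModel:
    """Least-squares fit of y ~ slope * (x - mean(x)), with no intercept."""

    def fit(self, xs, ys):
        self.x_mean = sum(xs) / len(xs)
        centered = [x - self.x_mean for x in xs]
        denom = sum(c * c for c in centered)
        self.slope = sum(c * y for c, y in zip(centered, ys)) / denom if denom else 0.0

    def predict(self, xs):
        return [self.slope * (x - self.x_mean) for x in xs]


def fit_boosted(models, xs, ys):
    """Fit each model on what the previous models failed to explain."""
    residuals = list(ys)
    for model in models:
        model.fit(xs, residuals)
        residuals = [r - p for r, p in zip(residuals, model.predict(xs))]
    return residuals  # what is still unexplained after the last stage


def predict_boosted(models, xs):
    """Sum the stage-wise predictions."""
    totals = [0.0] * len(xs)
    for model in models:
        totals = [t + p for t, p in zip(totals, model.predict(xs))]
    return totals


# Made-up data following y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
stages = [MeanModel(), CenteredSlopeModel()]
leftover = fit_boosted(stages, xs, ys)
prediction = predict_boosted(stages, [5.0])
```

The first stage captures the overall level of the targets and the second stage explains the remaining trend, so each added learner only has to model what earlier ones missed.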
# Now we'll evaluate on the test set, achieving an RMSE of about 0.37 (see the output below).
# This is a good score for such a model. Later examples explore how to tune it
# and how to use models with longer training times to improve performance.
# There are also a few issues with dataset splitting that we'll discuss in
# further examples.

results = model.test(*dataset.test_dataset, values = True)
model_preds = results.pop("values")
results
{'r2': 0.7786276730476042,
 'Spearman': 0.8772959326459585,
 'Explained Variance': 0.7786669441658207,
 'Max Error': 1.299909006114051,
 'Mean Absolute Error': 0.2682983744619139,
 'Mean Squared Error': 0.13329354636831567,
 'Root Mean Squared Error': 0.36509388705963797}
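The error metrics in the output above follow standard definitions. Here is a minimal sketch that computes several of them from scratch, assuming plain lists of true and predicted values. `regression_metrics` is a hypothetical helper, not an olorenchemengine function, and the library may compute its metrics differently. The toy values are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute a few standard regression metrics from two equal-length lists."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1.0 - (mse * n) / ss_tot,  # 1 - SS_res / SS_tot
        "Max Error": max(abs(e) for e in errors),
        "Mean Absolute Error": mae,
        "Mean Squared Error": mse,
        "Root Mean Squared Error": math.sqrt(mse),
    }


# Made-up logPapp values, not taken from the Caco-2 dataset
toy_true = [-5.0, -4.5, -6.0, -5.5]
toy_pred = [-5.2, -4.4, -5.7, -5.5]
toy_metrics = regression_metrics(toy_true, toy_pred)

# RMSE is the square root of MSE; applying that to the reported MSE
# gives about 0.3651, matching the reported Root Mean Squared Error.
rmse_from_reported_mse = math.sqrt(0.13329354636831567)
```

Because RMSE squares the errors before averaging, a few large misses raise it more than they raise the mean absolute error, which is why the two values in the output differ.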
# Now we'll plot the predicted vs true values for the test set.
# Notice how the compounds with higher similarity (red) to the training set have more
# accurate predictions compared to those with lower similarity (blue) to the training set.

vis = oce.VisualizeModelSim(dataset, model)
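One common way to quantify how similar a test compound is to a training set is the maximum Tanimoto similarity between fingerprints. The sketch below is an assumption about the kind of similarity being visualized, not a description of how `VisualizeModelSim` works internally. Fingerprints are represented as toy sets of on-bit indices, and `tanimoto` and `max_similarity_to_training` are hypothetical helper names.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union


def max_similarity_to_training(query_bits, training_fingerprints):
    """Similarity of a query to its nearest neighbour in the training set."""
    return max(
        (tanimoto(query_bits, fp) for fp in training_fingerprints),
        default=0.0,
    )


# Made-up fingerprints for illustration
training = [{1, 2, 3, 8}, {2, 5, 9}]
close_query = {1, 2, 3}   # shares most bits with the first training compound
far_query = {4, 6, 7}     # shares no bits with any training compound

close_score = max_similarity_to_training(close_query, training)
far_score = max_similarity_to_training(far_query, training)
```

A query with a high score has a close analogue in the training data, which is the situation in which a model's predictions tend to be more reliable.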