Integrated Error Models#

In addition to creating your own error models to evaluate pre-trained models, error models can also be built alongside a model.

import olorenchemengine as oce
import pandas as pd
import numpy as np
import json
import tqdm

import matplotlib.pyplot as plt
from scipy.stats import linregress

#lipo_dataset = oce.DatasetFromCSV("Lipophilicity.csv", structure_col = "smiles", property_col = "exp")
#splitter = oce.RandomSplit(split_proportions=[0.8,0.1,0.1])
#lipo_dataset = splitter.transform(lipo_dataset)
#oce.save(lipo_dataset, 'lipophilicity_dataset.oce')

dataset = oce.load('lipophilicity_dataset.oce')
model = oce.RandomForestModel(oce.OlorenCheckpoint("default"), n_estimators=1000)

To build an error model during model training, simply input the error model you wish to use. Here, we will use the oce.SDC error model.

error_model = oce.SDC()
model.fit(dataset.train_dataset[0], dataset.train_dataset[1], error_model=error_model)

The error model is now built and stored in model.error_model. From here, any error model methods, such as .train() and .train_cv() for aggregate error models, or .fit() and .fit_cv() for all error models, can be run. Note that by default, .train is not run for aggregate error models, and must be run individually before model fitting.

Fitting can also be done when running model.test() by setting fit_error_model=True.

model.test(dataset.valid_dataset[0], dataset.valid_dataset[1], fit_error_model=True)

Finally, if a model contains a fitted error model, setting return_ci=True when running model.predict() will return the confidence intervals. Setting return_vis=True will in turn return VisualizeError objects.

df = model.predict(dataset.test_dataset[0], return_ci=True, return_vis=True)
df.head()
df.vis[0].render_ipynb()

Production Level Models#

Production level models use the entire dataset to train the model. As such, metrics and error model training and fitting are done via cross validation. The entire process can be done by calling the .fit_cv() function.

model = oce.RandomForestModel(oce.OlorenCheckpoint("default"), n_estimators=1000)
error_model = oce.SDC()

model.fit_cv(dataset.entire_dataset[0], dataset.entire_dataset[1], error_model=error_model, scoring = "r2")

The trained error model will be stored in model.error_model