olorenchemengine package#

Subpackages#

Submodules#

olorenchemengine.base_class module#

base_class consists of the building blocks of models: the base classes models should extend from and relevant utility functions.

class olorenchemengine.base_class.BaseErrorModel(ci: float = 0.95, method: str = 'qbin', curvetype: str = 'auto', window: int = None, bins: int = None, log=True, **kwargs)#

Bases: BaseClass

Base class for error models.

Estimates confidence intervals for trained oce models.

Parameters:
  • ci (float) – desired confidence interval

  • method ({'bin','qbin','roll'}) – whether to fit the error model via binning, quantile binning, or rolling quantile

  • bins (int) – number of bins for binned quantiles. If None, selects the number of points per bin as n^(2/3) / 2.

  • window (int) – number of points per window for rolling quantiles. If None, selects the number of points per window as n^(2/3) / 2.

  • curvetype (str) – function used for regression. If auto, the function is chosen automatically to minimize the MSE.

build()#

builds the error model from a trained BaseModel and dataset

_build()#

optionally implemented; performs error model-specific computations

fit()#

fits confidence scores to a trained model and external dataset

fit_cv()#

fits confidence scores to k-fold cross validation on the training dataset

_fit()#

fits confidence scores to residuals

calculate()#

calculates confidence scores from inputs

score()#

returns confidence intervals on a dataset
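Example

A minimal sketch of the typical workflow, using oce.SDC (the concrete error model from the create_error_model example below); model, train, valid, and test are assumed to be a trained BaseModel and pandas datasets:

import olorenchemengine as oce

error_model = oce.SDC(ci = 0.95, method = 'qbin')
error_model.build(model, train['Drug'], train['Y'])  # attach the trained model and training data
error_model.fit(valid['Drug'], valid['Y'])           # or: error_model.fit_cv()
intervals = error_model.score(test['Drug'])          # confidence intervals for each input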

build(model: BaseModel, X: Union[pd.DataFrame, np.ndarray, list, pd.Series], y: Union[np.ndarray, list, pd.Series], **kwargs)#

Builds the error model with a trained model and training dataset

Parameters:
  • model (BaseModel) – trained model

  • X (array-like) – training features, list of SMILES

  • y (array-like) – training values

abstract calculate(X: Union[pd.DataFrame, np.ndarray, list, pd.Series], y_pred: np.ndarray) np.ndarray#

To be implemented by the child class; calculates confidence scores from inputs.

Parameters:
  • X – features, list of SMILES

  • y_pred (1-dimensional np.ndarray) – predicted values

Returns:

scores (1-dimensional np.ndarray)

copy() BaseErrorModel#

returns a copy of itself

Returns:

copied instance of itself

Return type:

BaseErrorModel

fit(X: Union[pd.DataFrame, np.ndarray, list, pd.Series], y: Union[np.ndarray, list, pd.Series]) plotly.graph_objects.Figure#

Fits confidence scores to an external dataset

Parameters:
  • X (array-like) – features, smiles

  • y (array-like) – true values

Returns:

plotly figure of fitted model against validation dataset

fit_cv(n_splits: int = 10) plotly.graph_objects.Figure#

Fits confidence scores to the training dataset via cross validation.

Parameters:

n_splits (int) – Number of cross validation splits, default 10

Returns:

plotly figure of fitted model against validation dataset

score(X: Union[pd.DataFrame, np.ndarray, list, pd.Series]) np.ndarray#

Calculates confidence scores on a dataset.

Parameters:

X (array-like) – target dataset, list of SMILES

Returns:

a list of confidence intervals as tuples for each input

class olorenchemengine.base_class.BaseModel(normalization='zscore', setting='auto', name=None, **kwargs)#

Bases: BaseClass

BaseModel for training and evaluating different models

Parameters:
  • normalization (BasePreprocessor or str) – the normalization to be used for the data

  • setting (str) – whether the model is a “classification” model or a “regression” model. Default is “auto” which automatically detects the setting from the dataset.

  • model_name (str) – the name of the model. Default is None, which instructs BaseModel to use model_name_from_model to select the name of the model.

setting#

whether the model is a “classification” model or a “regression” model

Type:

str

model_name#

the name of the model, either assigned or a hashed name

Type:

str

preprocess()#

preprocess the inputted data into the appropriate format

_fit()#

fit the model to the preprocessed data; implemented by child classes and used internally

fit()#

fit the model to the inputted data, user can specify if they want regression or classification using the setting parameter.

_predict()#

predict the properties of the inputted data; implemented by child classes and used internally

predict()#

predict the properties of the inputted data

test()#

test the model on the inputted data, output metrics and optionally predicted values

copy()#

returns a copy of the model (internal state not copied)

calibrate(X_valid, y_valid)#
copy() BaseModel#

returns a copy of itself

Returns:

copied instance of itself

Return type:

BaseModel

create_error_model(error_model: BaseErrorModel, X_train: Union[pd.DataFrame, np.ndarray, list, pd.Series], y_train: Union[np.ndarray, list, pd.Series], X_valid: Optional[Union[pd.DataFrame, np.ndarray, list, pd.Series]] = None, y_valid: Optional[Union[np.ndarray, list, pd.Series]] = None, **kwargs)#

Initializes, builds, and fits an error model on the input value.

The error model is built with the training dataset and fit via either a validation dataset or cross validation. The error model is stored in model.error_model.

Parameters:
  • error_model (BaseErrorModel) – Error model type to be created

  • X_train (array-like) – Input data for model training

  • y_train (array-like) – Values for model training

  • X_valid (array-like) – Input data for error model fitting. If no value passed in, the error model is fit via cross validation on the training dataset.

  • y_valid (array-like) – Values for error model fitting. If no value passed in, the error model is fit via cross validation on the training dataset.

Example

import olorenchemengine as oce

model = oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000)
model.fit(train["Drug"], train["Y"])
oce.create_error_model(model, oce.SDC(), train["Drug"], train["Y"], valid["Drug"], valid["Y"], ci = 0.95, method = "roll")
model.error_model.score(test["Drug"])

fit(X_train: Union[pd.DataFrame, np.ndarray], y_train: Union[pd.Series, list, np.ndarray], valid: Optional[Tuple[Union[pd.DataFrame, np.ndarray], Union[pd.Series, list, np.ndarray]]] = None, error_model: Optional[BaseErrorModel] = None)#

Calls the _fit method of the model to fit the model on the provided dataset.

Parameters:
  • X_train (Union[pd.DataFrame, np.ndarray]) – Input data to be fit on (structures + optionally features) e.g. a pd.DataFrame containing a “smiles” column or a list of experimental data

  • y_train (Union[pd.Series, list, np.ndarray]) – Values to predict from the input data

  • valid (Tuple[Union[pd.DataFrame, np.ndarray], Union[pd.Series, list, np.ndarray]]) – Optional validation data, which can be used with methods like early stopping and model averaging.

  • error_model (BaseErrorModel) – Optional error model, which can be used to predict confidence intervals.

fit_class(X_train: Union[pd.DataFrame, np.ndarray], y_train: Union[pd.Series, list, np.ndarray], valid: Optional[Tuple[Union[pd.DataFrame, np.ndarray], Union[pd.Series, list, np.ndarray]]] = None)#
fit_cv(X: Union[pd.DataFrame, np.ndarray], y: Union[pd.Series, list, np.ndarray], kf: BaseKFold = RandomKFold(), error_model: Optional[BaseErrorModel] = None, scoring: Optional[str] = None, **kwargs)#

Trains a production-ready model.

This method trains the model on the entire dataset. It also performs an intermediate cross-validation step over the dataset, both to generate test metrics across the entire dataset and to generate information used to calibrate the trained model.

Calibration means ensuring that the probabilities output by classifiers reflect true distributions, and creating appropriate confidence intervals for regression data.

Parameters:
  • X (Union[pd.DataFrame, np.ndarray]) – Input data to be fit on (structures + optionally features) e.g. a pd.DataFrame containing a “smiles” column or a list of experimental data

  • y (Union[pd.Series, list, np.ndarray]) – Values to predict from the input data

  • kf (BaseKFold) – Cross validation splitting strategy, default RandomKFold

  • error_model (BaseErrorModel) – Optional error model, which can be used to predict confidence intervals.

  • ci (float) – the confidence interval predicted by the error model

  • scoring (str) – Metric function to use for scoring cross validation splits; must be in metric_functions

Returns:

Cross validation metrics for each split

Return type:

list
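Example

A sketch, assuming RandomKFold is exposed at the package level as in this module's other examples:

import olorenchemengine as oce

model = oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048))
cv_metrics = model.fit_cv(train['Drug'], train['Y'], kf = oce.RandomKFold(n_splits = 5))
# cv_metrics holds one set of metrics per split; model is now trained and calibrated on the full dataset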

predict(X: Union[pd.DataFrame, np.ndarray, list, pd.Series], return_ci=False, return_vis=False, skip_preprocess=False, **kwargs) np.ndarray#

Calls the _predict method of the model and returns the predicted values for provided dataset.

Parameters:
  • X (Union[pd.DataFrame, np.ndarray, list, pd.Series]) – Input data to be predicted (structures + optionally features), will be preprocessed by the preprocess method.

  • return_ci (bool) – If error model fitted, whether or not to return the confidence intervals.

  • return_vis (bool) – If error model fitted, whether or not to return BaseVisualization objects.

Returns:

Predicted values for the provided dataset. Shape: (number of samples, number of predicted values)

If return_ci or return_vis is True, a pd.DataFrame of predicted values, confidence intervals, and/or error plots is returned instead.

Return type:

np.ndarray if return_ci and return_vis are False; otherwise pd.DataFrame
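Example

A sketch; return_ci = True assumes an error model has already been created, e.g. via create_error_model:

preds = model.predict(test['Drug'])                       # np.ndarray of predictions
preds_ci = model.predict(test['Drug'], return_ci = True)  # pd.DataFrame with predictions and confidence intervals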

preprocess(X, y, fit=False)#
Parameters:

X (list of SMILES) – the input structures to be preprocessed

Returns:

The processed input, converted into the appropriate format for the model

test(X: Union[pd.DataFrame, np.ndarray], y: Union[pd.Series, list, np.ndarray], values: bool = False, fit_error_model: bool = False) dict#

Tests the model on the provided dataset returning a dictionary of metrics and optionally the predicted values.

Parameters:
  • X (Union[pd.DataFrame, np.ndarray]) – Input test data to be predicted (structures + optionally features)

  • y (Union[pd.Series, list, np.ndarray]) – True values for the properties

  • values (bool, optional) – Whether or not to return the predicted values for the test data. Defaults to False.

  • fit_error_model (bool) – whether or not to fit the error model on the test data, if one is present.

Returns:

Dictionary of metrics and optionally the predicted values

Return type:

dict
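Example

A sketch; the exact metric keys in the returned dict depend on whether the model is a classification or regression model:

results = model.test(test['Drug'], test['Y'], values = True)
# `results` maps metric names to scores, plus the predicted values since values = True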

upload_oas(fname: Optional[str] = None)#

uploads the BaseClass object to cloud for OAS access. Model must be trained.

Parameters:

fname (str, optional) – the file name for the uploaded model file. If None, a default name associated with the BaseClass object is used.

visualize_parameters_ipynb()#
class olorenchemengine.base_class.BaseReduction#

Bases: BaseClass

BaseReduction for applying dimensionality reduction on high-dimensional data

Parameters:

n_components (int) – the number of components to keep

fit()#

fit the model with input data

fit_transform()#

fit the model with input data and apply dimensionality reduction to it

transform()#

apply dimensionality reduction to input data

abstract fit(X)#
abstract fit_transform(X)#
abstract transform(X)#
class olorenchemengine.base_class.BaseSKLearnModel(representation, regression_model, classification_model, log=True, **kwargs)#

Bases: BaseModel

Base class for creating sklearn-type models, e.g. with a sklearn RandomForestRegressor and RandomForestClassifier.

representation#

Representation to be used to preprocess the input data.

Type:

BaseRepresentation

regression_model#

Model to be used for regression tasks.

Type:

BaseModel or BaseEstimator

classification_model#

Model to be used for classification tasks.

Type:

BaseModel or BaseEstimator

preprocess(X, y, fit=False)#
Parameters:

X (list of SMILES) – the input structures to be preprocessed

Returns:

The processed input, converted into the appropriate format for the model

class olorenchemengine.base_class.BaseSKLearnReduction#

Bases: BaseReduction

Base class for creating sklearn dimensionality reduction

fit(X)#
fit_transform(X)#
transform(X)#
class olorenchemengine.base_class.MakeMultiClassModel(individual_classifier: BaseModel)#

Bases: BaseModel

Base class for extending the classification capabilities of BaseModel to more than two classes, e.g. classes {W,X,Y,Z}. Uses the commonly-implemented One-vs-Rest (OvR) strategy: for each class, a classifier is fitted against all the other classes. The probabilities are then normalized and compared across classes.

Parameters:

individual_classifier (BaseModel) – Model for binary classification tasks, which is to be turned into a multi-class model.
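Example

A minimal sketch; train['Y'] is assumed to contain more than two classes:

import olorenchemengine as oce

binary_model = oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048))
model = oce.MakeMultiClassModel(binary_model)
model.fit(train['Drug'], train['Y'])  # one OvR classifier is fitted per class
probs = model.predict(test['Drug'])   # pd.DataFrame with one column per class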

fit(X_train: Union[pd.DataFrame, np.ndarray], y_train: Union[pd.Series, list, np.ndarray], valid: Optional[Tuple[Union[pd.DataFrame, np.ndarray], Union[pd.Series, list, np.ndarray]]] = None)#

Calls the _fit method of the model to fit the model on the provided dataset.

Parameters:
  • X_train (Union[pd.DataFrame, np.ndarray]) – Input data to be fit on (structures + optionally features) e.g. a pd.DataFrame containing a “smiles” column or a list of experimental data

  • y_train (Union[pd.Series, list, np.ndarray]) – Values to predict from the input data

  • valid (Tuple[Union[pd.DataFrame, np.ndarray], Union[pd.Series, list, np.ndarray]]) – Optional validation data, which can be used with methods like early stopping and model averaging.

  • error_model (BaseErrorModel) – Optional error model, which can be used to predict confidence intervals.

predict(X: Union[pd.DataFrame, np.ndarray, list, pd.Series])#
Parameters:

X (Union[pd.DataFrame, np.ndarray, list, pd.Series]) – Input data to be predicted (structures + optionally features).

Returns:

Predicted values for the provided dataset, with multiple columns for the 2+ different classes and each row representing a different prediction.

Return type:

pd.DataFrame

olorenchemengine.basics module#

Machine learning algorithms for use with molecular vector representations and features from experimental data.

class olorenchemengine.basics.AutoRandomForestModel(representation, n_iter=100, scoring=None, verbose=2, cv=5, **kwargs)#

Bases: BaseSKLearnModel, BaseObject

RandomForestModel with automatically tuned hyperparameters

Parameters:

  • representation (str) – The representation to use for the model.

  • n_iter (int) – The number of iterations to run the hyperparameter tuning.

  • scoring (str) – The scoring metric to use for the hyperparameter tuning.

  • verbose (int) – The verbosity level of the hyperparameter tuning.

  • cv (int) – The number of folds to use for the hyperparameter tuning.

Example

import olorenchemengine as oce

model = oce.AutoRandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048))
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])

autofit(model, n_iter, cv, scoring, verbose)#

Takes a sklearn-type model and replaces its fit function with one that automatically tunes the model hyperparameters

Parameters:
  • model (sklearn model) – The model to be tuned

  • n_iter (int) – Number of iterations to run the hyperparameter tuning

  • cv (int) – Number of folds to use for cross-validation

  • scoring (str) – Scoring metric to use for cross-validation

  • verbose (int) – Verbosity level

Returns:

The tuned model

Return type:

model (sklearn model)

class olorenchemengine.basics.BaseMLPClassifier(*args, **kwargs)#

Bases: BaseEstimator

Wrapper for sklearn MLP

class olorenchemengine.basics.BaseMLPRegressor(*args, **kwargs)#

Bases: BaseEstimator

Wrapper for sklearn MLP

class olorenchemengine.basics.FeaturesClassification(config='lineardiscriminant')#

Bases: BaseModel

FeaturesClassification uses machine learning models to perform classification based on features from experimental data

obj#

Machine learning model to use.

Type:

sklearn.base.BaseEstimator

Parameters:

config (str) – Configuration to use for the model.

class olorenchemengine.basics.GuessingRegression(config='full', reg='lr', **kwargs)#

Bases: BaseModel

Guessing model for regression, used to infer non-linear relationships.

This model tries different non-linear relationships between each feature and the property, selecting the best such relationship for each feature. The transformed features are then aggregated, using either linear regression or averaging, to obtain the final prediction for the property. This is best suited to experimental features with direct relationships to the properties.

transformations#

List of transformations to apply to the data i.e. possible relationships between feature and property.

Type:

List[Callable]

state#

State of the model, best transformation for each feature.

Type:

Dict[str, Tuple[int, float, float]]

reg#

Method to use for combining features, either “lr” linear regression or “avg” average.

Type:

str

linearize(X)#

Linearize the data by applying the best transformation to each feature.

Parameters:

X (np.ndarray) – List of lists of features.

Returns:

List of lists of features. Shape: (n_samples, n_features)

Return type:

np.ndarray

preprocess(X, y, fit=False)#

This method is used to preprocess the data before training.

Parameters:
  • X (np.ndarray) – List of lists of features.

  • y (np.array) – List of properties.

Returns:

List of lists of features. Shape: (n_samples, n_features)

Return type:

np.ndarray

class olorenchemengine.basics.KBestLinearRegression(k=1, *args, **kwargs)#

Bases: BaseEstimator

Selects the K-best features and then does linear regression

class olorenchemengine.basics.KNN(representation, **kwargs)#

Bases: BaseSKLearnModel

KNN model

Parameters:

representation (str) – The representation to use for the model.

Example

import olorenchemengine as oce

model = oce.KNN(representation = oce.MorganVecRepresentation(radius=2, nbits=2048))
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])

class olorenchemengine.basics.KNeighborsClassifier(*args, **kwargs)#

Bases: BaseEstimator

Wrapper for sklearn KNeighborsClassifier

predict(X)#

Predict the output of the estimator

Parameters:

X (np.array) – The data to predict the output of the estimator on

Returns:

The predicted output of the estimator

Return type:

y (np.array)

class olorenchemengine.basics.KNeighborsRegressor(*args, **kwargs)#

Bases: BaseEstimator

Wrapper for sklearn KNeighborsRegressor

class olorenchemengine.basics.LogisticRegression(*args, **kwargs)#

Bases: BaseEstimator

Wrapper for sklearn LogisticRegression

class olorenchemengine.basics.MLP(representation: BaseVecRepresentation, layer_dims=[2048, 512, 128], activation='tanh', epochs=100, batch_size=16, lr=0.0005, dropout=0, kernel_regularizer=0.0001, **kwargs)#

Bases: BaseSKLearnModel

MLP model

Parameters:

  • representation (BaseVecRepresentation) – The representation to use for the model.

  • layer_dims (List[int]) – The hidden layer sizes to use for the model.

  • activation (str) – The activation function to use for the model.

  • epochs (int) – The number of epochs to use for the model.

  • batch_size (int) – The batch size to use for the model.

  • lr (float) – The learning rate to use for the model.

  • dropout (float) – The dropout rate to use for the model.

  • kernel_regularizer (float) – The kernel regularizer to use for the model.

Example

import olorenchemengine as oce

model = oce.MLP(representation = oce.MorganVecRepresentation(radius=2, nbits=2048))
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])

preprocess(X, y, fit=False)#
Parameters:

X (list of SMILES) – the input structures to be preprocessed

Returns:

The processed input, converted into the appropriate format for the model

class olorenchemengine.basics.RandomForestClassifier(*args, **kwargs)#

Bases: BaseEstimator

Wrapper for sklearn RandomForestClassifier

predict(X)#

Predict the output of the estimator

Parameters:

X (np.array) – The data to predict the output of the estimator on

Returns:

The predicted output of the estimator

Return type:

y (np.array)

class olorenchemengine.basics.RandomForestModel(representation, max_features='log2', max_depth=None, criterion='entropy', class_weight=None, bootstrap=True, n_estimators=100, random_state=None, **kwargs)#

Bases: BaseSKLearnModel

Random forest model

Parameters:

  • n_estimators (int) – The number of trees in the forest.

  • max_depth (int) – The maximum depth of the tree.

  • max_features (int) – The number of features to consider when looking for the best split.

  • bootstrap (bool) – Whether bootstrap samples are used when building trees.

  • criterion (str) – The function to measure the quality of a split.

  • class_weight (str) – Dict or ‘balanced’, defaults to None.

Example

import olorenchemengine as oce

model = oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000)
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])

class olorenchemengine.basics.RandomForestRegressor(*args, **kwargs)#

Bases: BaseEstimator

Wrapper for sklearn RandomForestRegressor

class olorenchemengine.basics.RandomizedSearchCVModel(*args, **kwargs)#

Bases: BaseEstimator

Wrapper class for RandomizedSearchCV

fit(*args, **kwargs)#

Fit the estimator to the data

Parameters:
  • X (np.array) – The data to fit the estimator to

  • y (np.array) – The target data to fit the estimator to

Returns:

The estimator object fit to the data

Return type:

self (object)

predict(*args, **kwargs)#

Predict the output of the estimator

Parameters:

X (np.array) – The data to predict the output of the estimator on

Returns:

The predicted output of the estimator

Return type:

y (np.array)

class olorenchemengine.basics.SVC(*args, **kwargs)#

Bases: BaseEstimator

Wrapper for sklearn SVC

class olorenchemengine.basics.SVR(*args, **kwargs)#

Bases: BaseEstimator

Wrapper for sklearn SVR

class olorenchemengine.basics.SklearnMLP(representation, hidden_layer_sizes=[100], activation='relu', solver='adam', alpha=0.0001, batch_size=32, learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, log=True, **kwargs)#

Bases: BaseSKLearnModel

MLP Model based on sklearn implementation

Parameters:

  • representation (BaseVecRepresentation) – The representation to use for the model.

  • hidden_layer_sizes (list) – The number of neurons in each hidden layer.

  • activation (str) – The activation function to use.

  • solver (str) – The solver to use.

  • alpha (float) – L2 regularization strength.

  • batch_size (int) – The size of the minibatches for stochastic optimizers.

  • learning_rate (str) – The learning rate schedule.

  • learning_rate_init (float) – The initial learning rate for the solver.

  • power_t (float) – The exponent for inverse scaling learning rate.

  • max_iter (int) – Maximum number of iterations.

Example

import olorenchemengine as oce

model = oce.SklearnMLP(representation = oce.MorganVecRepresentation(radius=2, nbits=2048))
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])

class olorenchemengine.basics.SupportVectorMachine(representation, C=0.8, kernel='rbf', gamma='scale', coef0=0, cache_size=500, **kwargs)#

Bases: BaseSKLearnModel

Support vector machine

Parameters:

  • representation (str) – The representation to use for the model.

  • kernel (str) – The kernel to use for the model.

  • C (float) – The C parameter for the model.

  • gamma (float) – The gamma parameter for the model.

  • coef0 (float) – The coef0 parameter for the model.

  • cache_size (int) – The cache size parameter for the model.

Example

import olorenchemengine as oce

model = oce.SupportVectorMachine(representation = oce.MorganVecRepresentation(radius=2, nbits=2048))
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])

class olorenchemengine.basics.TorchMLP(representation, hidden_layer_sizes=[100], norm_layer: str = None, activation_layer: str = None, dropout: float = 0.0, epochs: int = 100, log=True, **kwargs)#

Bases: BaseModel

MLP Model based on torch implementation

Parameters:

  • representation (BaseVecRepresentation) – The representation to use for the model.

  • hidden_layer_sizes (list) – The number of neurons in each hidden layer.

  • norm_layer (str) – The normalization to use for a final normalization layer. Default None.

  • activation_layer (str) – The activation function to use for a final activation layer. Default None.

  • dropout (float) – The dropout rate to use for the model.

Example

import olorenchemengine as oce

model = oce.TorchMLP(representation = oce.MorganVecRepresentation(radius=2, nbits=2048))
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])

preprocess(X, y, fit=False)#
Parameters:

X (list of SMILES) – the input structures to be preprocessed

Returns:

The processed input, converted into the appropriate format for the model

class olorenchemengine.basics.XGBClassifier(*args, **kwargs)#

Bases: BaseEstimator

Wrapper for xgboost XGBClassifier

class olorenchemengine.basics.XGBRegressor(*args, **kwargs)#

Bases: BaseEstimator

Wrapper for xgboost XGBRegressor

class olorenchemengine.basics.XGBoostModel(representation, n_estimators=2000, max_depth=6, subsample=0.5, max_leaves=5, learning_rate=0.05, colsample_bytree=0.8, min_child_weight=1, log=True, **kwargs)#

Bases: BaseSKLearnModel, BaseObject

XGBoost model

Parameters:

  • representation (str) – The representation to use for the model.

  • n_estimators (int) – Number of gradient boosted trees.

  • max_depth (int) – Maximum tree depth.

  • subsample (float) – Subsample ratio of the training instances.

  • max_leaves (int) – Maximum number of leaves per tree.

  • learning_rate (float) – Boosting learning rate.

  • colsample_bytree (float) – Subsample ratio of columns when constructing each tree.

  • min_child_weight (float) – Minimum sum of instance weight needed in a child.

Example

import olorenchemengine as oce

model = oce.XGBoostModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048))
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])

class olorenchemengine.basics.ZWK_XGBoostModel(representation, n_iter=100, scoring=None, verbose=2, cv=5, **kwargs)#

Bases: BaseSKLearnModel, BaseObject

XGBoost model from https://github.com/smu-tao-group/ADMET_XGBoost

Parameters:

  • representation (str) – The representation to use for the model.

  • n_iter (int) – The number of iterations to run the hyperparameter tuning.

  • scoring (str) – The scoring metric to use for the hyperparameter tuning.

  • verbose (int) – The verbosity level of the hyperparameter tuning.

  • cv (int) – The number of folds to use for the hyperparameter tuning.

Example

import olorenchemengine as oce

model = oce.ZWK_XGBoostModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048))
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])

autofit(model, n_iter, cv, scoring, verbose)#

Takes an XGBoost model and replaces its fit function with one that automatically tunes the model hyperparameters

Parameters:
  • model (sklearn model) – The model to be tuned

  • n_iter (int) – Number of iterations to run the hyperparameter tuning

  • cv (int) – Number of folds to use for cross-validation

  • scoring (str) – Scoring metric to use for cross-validation

  • verbose (int) – Verbosity level

Returns:

The tuned model

Return type:

model (sklearn model)

olorenchemengine.dataset module#

class olorenchemengine.dataset.BaseDataset(name: str = None, data: str = None, structure_col: str = None, property_col: str = None, feature_cols: list = [], date_col: str = None, log=True, **kwargs)#

Bases: BaseClass

BaseDataset for all dataset objects

BaseDataset holds its data in a Pandas DataFrame.

Parameters:
  • name (str) – Name of the dataset

  • data (str) – The output of df.to_csv(), where df is the pd.DataFrame containing the dataset.

  • structure_col (str) – Name of column containing structure information, e.g. “smiles”

  • feature_cols (list[str]) – List of names of columns containing features, e.g. [“X1”, “X2”]

  • property_col (str) – Name of property of interest, e.g. “Y”
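Example

A minimal sketch; data.csv is a hypothetical file with “smiles” and “Y” columns:

import pandas as pd
import olorenchemengine as oce

df = pd.read_csv('data.csv')  # hypothetical CSV with 'smiles' and 'Y' columns
dataset = oce.BaseDataset(data = df.to_csv(), structure_col = 'smiles', property_col = 'Y')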

property entire_dataset#

Returns the entire dataset

Returns:

The entire dataset

Return type:

pd.DataFrame

property entire_dataset_split#

Returns a tuple of three elements where the first is the input train data, the second is the input validation data, and the third is the input test data

Returns:

(train_data, val_data, test_data)

Return type:

tuple

property size#
property test_dataset#

Gives a tuple of two elements where the first is the input test data and the second is the property of interest

Returns:

The test data

Return type:

pd.DataFrame

property train_dataset#

Returns the train dataset

property trainval_dataset#

Returns the train and validation dataset

transform(dataset: Self)#

Combines this dataset with the passed dataset object

property valid_dataset#

Gives a tuple of two elements where the first is the input val data and the second is the property of interest

Returns:

The validation data

Return type:

pd.DataFrame

class olorenchemengine.dataset.BaseDatasetTransform(log=True)#

Bases: BaseClass

Applies a transformation onto the inputted BaseDataset.

Transformation applied as defined in the abstract method transform.

Parameters:

dataset (BaseDataset) – The dataset to transform.

abstract transform(dataset: BaseDataset) BaseDataset#

Applies a transformation onto the inputted BaseDataset.

Parameters: dataset (BaseDataset): The dataset to transform.

class olorenchemengine.dataset.BaseKFold(n_splits: int = 10, log=True)#

Bases: BaseDatasetTransform

Base class for all classes which split the data into K folds for cross-validation with various strategies.

get_n_splits()#
abstract transform(dataset: BaseDataset, random_state: int = 42, *args, **kwargs)#

Splits the dataset into folds, identified by 1, …, n_splits in the ‘cv’ column.

class olorenchemengine.dataset.CleanStructures(log=True)#

Bases: BaseDatasetTransform

CleanStructures creates a new dataset from the original dataset by removing structures that are not valid.

Parameters:

dataset (BaseDataset) – The dataset to clean.

transform(dataset: BaseDataset, dropna_property: bool = True, **kwargs)#

Applies a transformation onto the inputted BaseDataset.

Parameters: dataset (BaseDataset): The dataset to transform.
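Example

A sketch, applying the transform directly as defined by BaseDatasetTransform; `dataset` is assumed to be an existing BaseDataset:

import olorenchemengine as oce

cleaned = oce.CleanStructures().transform(dataset)  # rows with invalid structures are removed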

class olorenchemengine.dataset.DatasetFromCDDSearch(search_id, cache_file_path=None, update=True, log=True, **kwargs)#

Bases: BaseDataset

Dataset for retrieving data from CDD via a saved search.

Requires a CDD Token to be set.

Parameters:
  • search_id (str) – The ID of the saved CDD search to use.

  • cache_file_path (str) – The path to the file to cache the dataset.

  • update (bool) – Whether or not to update the cached dataset by redoing the CDD search

check_export_status(export_id)#

Uses the export_id passed as a parameter to find the pertinent dataset and return its export status

Parameters: export_id (str): The unique export ID of the dataset searched for

Uses a CDD Token (passed as search_id) to search saved datasets to find and return its related dataset export id. Using the export id, it then checks the export status and returns the dataset’s data in CSV format.

Parameters: search_id (str): The ID of the saved CDD search to use.

get_export(export_id)#

Uses the export_id passed as a parameter to find the pertinent dataset and return the dataset’s data in CSV format.

Parameters: export_id (str): The unique export ID of the dataset searched for

Uses a CDD Token (passed as search_id) to search saved datasets to find and return its related dataset export id.

Parameters: search_id (str): The ID of the saved CDD search to use.

class olorenchemengine.dataset.DatasetFromCSV(file_path, log=True, **kwargs)#

Bases: BaseDataset

DatasetFromCSV creates a dataset object from a local CSV file.

Parameters:

file_path (str) – Relative or absolute to a local CSV file

class olorenchemengine.dataset.Discretize(prop_cutoff: float, dir: str = 'larger', log=True, **kwargs)#

Bases: BaseDatasetTransform

Discretize creates a new dataset from the original dataset by discretizing the property column.

Parameters:
  • prop_cutoff (float) – where to threshold the property column.

  • dir (str) – Whether the 1 class should be “smaller” or “larger” than the cutoff value. Default, “larger”.

transform(dataset: BaseDataset, **kwargs)#

Applies a transformation onto the inputted BaseDataset.

Parameters: dataset (BaseDataset): The dataset to transform.

class olorenchemengine.dataset.KMeansKFold(rep: BaseVecRepresentation, n_splits: int = 10, log=True)#

Bases: BaseKFold

transform(dataset: BaseDataset, random_state: int = 42, *args, **kwargs)#

Splits the dataset into folds, identified by 1, …, n_splits in the ‘cv’ column.

class olorenchemengine.dataset.OneHotEncode(feature_col: str, log=True, **kwargs)#

Bases: BaseDatasetTransform

One-hot encodes a given feature column.

Parameters:

feature_col (str) – The feature column to one hot encode.

transform(dataset: BaseDataset, **kwargs)#

Applies a transformation onto the inputted BaseDataset.

Parameters: dataset (BaseDataset): The dataset to transform.

class olorenchemengine.dataset.RandomKFold(n_splits: int = 10, log=True)#

Bases: BaseKFold

transform(dataset: BaseDataset, *args, random_state: int = 42, **kwargs)#

Splits the dataset into folds, identified by 1, …, n_splits in the ‘cv’ column.

class olorenchemengine.dataset.ScaffoldKFold(n_splits: int = 10, log=True)#

Bases: BaseKFold

transform(dataset: BaseDataset, *args, random_state: int = 42, **kwargs)#

Splits the dataset into folds, identified by 1, …, n_splits in the ‘cv’ column.

class olorenchemengine.dataset.ScaffoldKMeansKFold(rep: BaseVecRepresentation, n_splits: int = 10, log=True)#

Bases: BaseKFold

transform(dataset: BaseDataset, random_state: int = 42, *args, **kwargs)#

Splits the dataset into folds, identified by 1, …, n_splits in the ‘cv’ column.

olorenchemengine.dataset.func(self: BaseDataset, other: BaseDatasetTransform) BaseDataset#

olorenchemengine.ensemble module#

Ensembling methods to combine `BaseModel`s to create better, combined models.

class olorenchemengine.ensemble.Averager(models: List[BaseModel], n: int = 1, log: bool = True, **kwargs)#

Bases: BaseModel

Averager averages the predictions of multiple models for an ensembled prediction.

Parameters:
  • models (list) – list of BaseModel objects to be averaged.

  • n (int, optional) – Number of times to repeat the given models. Defaults to 1.

Example

import olorenchemengine as oce

model = oce.Averager(models = [
    oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000),
    oce.SupportVectorMachine(representation = oce.Mol2Vec())
])
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])

preprocess(X, y, fit=False)#

Preprocesses the data for the model.

Parameters:
  • X (pd.DataFrame) – Dataframe of features.

  • y (pd.DataFrame) – Dataframe of labels.

Returns:

Dataframe of features.

Return type:

X (pd.DataFrame)

class olorenchemengine.ensemble.BaseBoosting(models: List[BaseModel], n: int = 1, oof=False, nfolds=5, log: bool = True, **kwargs)#

Bases: BaseModel

BaseBoosting uses models in a gradient boosting fashion to create an ensembled model.

Parameters:
  • models (List[BaseModel]) – list of models to use for the learners to be stacked together.

  • n (int, optional) – Number of times to repeat the given models. Defaults to 1.

  • oof (bool, optional) – Whether or not to use out-of-fold predictions for the ensembled model. Defaults to False.

  • log (bool, optional) – Whether or not to log the arguments of this constructor. Defaults to True.

Example

import olorenchemengine as oce

model = oce.BaseBoosting(models = [
    oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000),
    oce.SupportVectorMachine(representation = oce.Mol2Vec())
])
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])

class olorenchemengine.ensemble.BaseStacking(models: List[BaseModel], stacker_model: BaseModel, n: int = 1, oof: bool = False, split: float = 0.0, log: bool = True, nfolds=5, **kwargs)#

Bases: BaseModel

BaseStacking stacks the predictions of models for an ensembled prediction.

Parameters:
  • models (List[BaseModel]) – list of models to use for the learners to be stacked together.

  • stacker_model (BaseModel) – a model to use for stacking the models.

Called only by child classes. Not to be called directly by user.

featurize(X)#

Featurizes the data for the model.

Parameters:
  • X (pd.DataFrame) – Dataframe of features.

  • y (pd.DataFrame) – Dataframe of labels.

Returns:

featurized dataset.

Return type:

data

preprocess(X, y, fit=False)#
Parameters:

X (list of SMILES) – the input structures to be preprocessed

Returns:

The processed input, converted into the appropriate format for the model

class olorenchemengine.ensemble.BestStacker(models: List[BaseModel], n: int = 1, k: int = 1, log: bool = True, **kwargs)#

Bases: BaseStacking

BestStacker is a stacking method that uses the best model from a collection of models to make an ensembled prediction.

Parameters:
  • models (List[BaseModel]) – list of models to use for the learners to be stacked together.

  • n (int, optional) – Number of times to repeat the given models. Defaults to 1.

  • log (bool, optional) – Whether or not to log the arguments of this constructor. Defaults to True.

Example

import olorenchemengine as oce

model = oce.BestStacker(models = [
    oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000),
    oce.SupportVectorMachine(representation = oce.Mol2Vec())
])
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])

class olorenchemengine.ensemble.LinearRegressionStacker(models: List[BaseModel], n: int = 1, log: bool = True, **kwargs)#

Bases: BaseStacking

LinearRegressionStacker is a stacking method that uses linear regression on the predictions from a collection of models to make an ensembled prediction.

Parameters:

  • models (List[BaseModel]) – list of models to use for the learners to be stacked together.

  • n (int, optional) – Number of times to repeat the given models. Defaults to 1.

  • log (bool, optional) – Whether or not to log the arguments of this constructor. Defaults to True.

Example

import olorenchemengine as oce

model = oce.LinearRegressionStacker(models = [
    oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000),
    oce.SupportVectorMachine(representation = oce.Mol2Vec())
])
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])

class olorenchemengine.ensemble.MLPStacker(models, layer_dims=[2048, 512, 128], activation='tanh', epochs=100, batch_size=16, verbose=0, n=1, log=True, **kwargs)#

Bases: SKLearnStacker

MLPStacker is a subclass of SKLearnStacker that uses a multi-layer perceptron model to make an ensembled prediction.

Parameters:

  • models (List[BaseModel]) – list of models to use for the learners to be stacked together.

  • layer_dims (List[int]) – list of layer dimensions for the MLP.

  • activation (str, optional) – activation function to use for the MLP. Defaults to ‘tanh’.

  • epochs (int, optional) – number of epochs to train the MLP. Defaults to 100.

  • batch_size (int, optional) – batch size for the MLP. Defaults to 16.

  • verbose (int, optional) – verbosity level for the MLP. Defaults to 0.

  • n (int, optional) – Number of times to repeat the given models. Defaults to 1.

  • log (bool, optional) – Whether or not to log the arguments of this constructor. Defaults to True.

Example

import olorenchemengine as oce

model = oce.MLPStacker(
    models = [
        oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000),
        oce.SupportVectorMachine(representation = oce.Mol2Vec())],
    layer_dims = [32, 32],
    activation = 'tanh',
    epochs = 15,
    batch_size = 16,
    verbose = 0,
)
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])

class olorenchemengine.ensemble.RFStacker(models: List[BaseModel], n_estimators: int = 100, max_features: str = 'log2', n: int = 1, log: bool = True, **kwargs)#

Bases: SKLearnStacker

RFStacker is a subclass of SKLearnStacker that uses random forest models to make an ensembled prediction.

Parameters:

  • models (List[BaseModel]) – list of models to use for the learners to be stacked together.

  • n_estimators (int, optional) – Number of trees in the forest. Defaults to 100.

  • n (int, optional) – Number of times to repeat the given models. Defaults to 1.

  • log (bool, optional) – Whether or not to log the arguments of this constructor. Defaults to True.

Example

import olorenchemengine as oce

model = oce.RFStacker(
    models = [
        oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000),
        oce.SupportVectorMachine(representation = oce.Mol2Vec())],
    n_estimators = 100
)
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])

class olorenchemengine.ensemble.Resample1(model: BaseModel, log=True)#

Bases: BaseModel

Samples from an imbalanced dataset: takes all compounds from the smaller class and then samples an equal number from the larger class.

Parameters:

model (BaseModel) – Model to use for classification.

Example

import olorenchemengine as oce

model = oce.Resample1(
    oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000)
)
model.fit(train['Drug'], train['Y'])
preds = model.predict(test['Drug'])

Note: may only be used on binary classification data.

fit(X_train, y_train)#

Calls the _fit method of the model to fit the model on the provided dataset.

Parameters:
  • X_train (Union[pd.DataFrame, np.ndarray]) – Input data to be fit on (structures + optionally features) e.g. a pd.DataFrame containing a “smiles” column or a list of experimental data

  • y_train (Union[pd.Series, list, np.ndarray]) – Values to predict from the input data

class olorenchemengine.ensemble.Resample2(model: BaseModel, log=True)#

Bases: BaseModel

Samples from an imbalanced dataset: takes all compounds from the smaller class and then samples an equal number from the larger class.

fit(X_train, y_train)#

Calls the _fit method of the model to fit the model on the provided dataset.

Parameters:
  • X_train (Union[pd.DataFrame, np.ndarray]) – Input data to be fit on (structures + optionally features) e.g. a pd.DataFrame containing a “smiles” column or a list of experimental data

  • y_train (Union[pd.Series, list, np.ndarray]) – Values to predict from the input data

class olorenchemengine.ensemble.ResampleAdaboost(models: List[BaseModel], n: int = 1, factor: int = 8, size: int = None, equation: str = 'abs', log: bool = True, **kwargs)#

Bases: BaseBoosting

ResampleAdaboost performs AdaBoost, with sample weighting done via resampling of the dataset, to create an ensembled model.

Parameters:
  • models (List[BaseModel]) – list of models to use for the learners to be stacked together.

  • n (int, optional) – Number of times to repeat the given models. Defaults to 1.

  • size (int, optional) – Size of the resampled dataset. Defaults to None.

  • factor (int, optional) – Factor by which to resample the dataset. Defaults to 8.

  • equation (str, optional) – Equation to use for resampling. Defaults to “abs”.

  • log (bool, optional) – Whether or not to log the arguments of this constructor. Defaults to True.

Example

import olorenchemengine as oce

model = oce.ResampleAdaboost(
    models = [
        oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000),
        oce.SupportVectorMachine(representation = oce.Mol2Vec())],
    factor = 8,
    equation = 'abs'
)
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])

class olorenchemengine.ensemble.SKLearnStacker(models: List[BaseModel], regression_stacker_model: BaseEstimator, classification_stacker_model: BaseEstimator, n: int = 1, log: bool = True, **kwargs)#

Bases: BaseSKLearnModel

SKLearnStacker is a stacking method that uses sklearn-like models to make an ensembled prediction.

reg_stack#

a sklearn-like model to use for stacking the models for regression tasks.

class_stack#

a sklearn-like model to use for stacking the models for classification tasks.

Called only by child classes. Not to be called directly by user.

olorenchemengine.ensemble.get_oof(self, model, X, y, kf)#

olorenchemengine.gnn module#

Building blocks for graph neural networks.

class olorenchemengine.gnn.AttentiveFP(hidden_channels=4, out_channels=1, num_layers=1, num_timesteps=1, dropout=0, skip_lin=True, layer_dims=[512, 128], activation='leakyrelu', optim='adamw', **kwargs)#

Bases: BaseLightningModule

AttentiveFP is a wrapper for the PyTorch Geometric interpretation of https://pubs.acs.org/doi/10.1021/acs.jmedchem.9b00959.

Parameters:
  • hidden_channels (int, optional) – the number of hidden channels to use in the model. Defaults to 4.

  • out_channels (int, optional) – the number of output channels to use in the model. Defaults to 1.

  • num_layers (int, optional) – the number of layers to use in the model. Defaults to 1.

  • num_timesteps (int, optional) – the number of timesteps to use in the model. Defaults to 1.

  • dropout (float, optional) – the dropout rate to use in the model. Defaults to 0.

  • skip_lin (bool, optional) – whether to use skip connections in the model. Defaults to True.

  • layer_dims (List, optional) – the dimensions to use for each layer in the model. Defaults to [512, 128].

  • activation (str, optional) – the activation function to use in the model. Defaults to “leakyrelu”.

  • optim (str, optional) – the optimizer to use in the model. Defaults to “adamw”.

create(dimensions)#

Create the model.

Parameters:

dimensions (list) – the dimensions of the input data.

forward(data)#
class olorenchemengine.gnn.BaseLightningModule(*args, optim: str = 'adam', input_dimensions: Optional[Tuple] = None, **kwargs)#

Bases: BaseClass

BaseLightningModule allows for the use of a Pytorch Lightning module as a BaseClass to be incorporated into the framework.

Parameters:
  • optim (str, optional) – parameter describing what kind of optimizer to use. Defaults to “adam”.

  • input_dimensions (Tuple, optional) – Tuple describing the dimensions of the input data. Defaults to None.

configure_optimizers()#
forward(batch)#
loss(y_pred, y_true)#

Calculate the loss for the model.

Parameters:
  • y_pred (torch.tensor) – the predictions for the model.

  • y_true (torch.tensor) – the true labels for the model.

Returns:

the loss for the model.

Return type:

torch.tensor

set_task_type(task_type, pos_weight=torch.tensor([1]))#

Sets the task type for the model.

Parameters:
  • task_type (str) – the task type to set the model to.

  • pos_weight (torch.tensor, optional) – the weight to use for the positive class. Defaults to torch.tensor([1]).

test_step(batch, batch_idx)#
training_step(batch, batch_idx)#
validation_step(batch, batch_idx)#
class olorenchemengine.gnn.BaseTorchGeometricModel(network: BaseLightningModule, representation: BaseRepresentation = TorchGeometricGraph(), epochs: int = 1, batch_size: int = 16, lr: float = 0.0001, auto_lr_find: bool = True, pos_weight: str = 'balanced', preinitialized: bool = False, log: bool = True, **kwargs)#

Bases: BaseModel

BaseTorchGeometricModel is a base class for models in the PyTorch Geometric framework.

Parameters:
  • network (BaseLightningModule) – The network to be used for the model.

  • representation (BaseRepresentation, optional) – The representation to be used for the model. Note that the representation must be compatible with the network, so the default, TorchGeometricGraph(), is highly recommended

  • epochs (int, optional) – The number of epochs to train the model for.

  • batch_size (int, optional) – The batch size to use for training.

  • lr (float, optional) – The learning rate to use for training.

  • auto_lr_find (bool, optional) – Whether to automatically adjust the learning rate.

  • pos_weight (str, optional) – Strategy for weighting positives in classification

  • preinitialized (bool, optional) – Whether the network is pre-initialized.

  • log (bool, optional) – Log arguments or not. Should only be true if it is not nested. Defaults to True.
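Example

A minimal sketch using AttentiveFP (defined above) as the network; the hyperparameter values are illustrative:

import olorenchemengine as oce

network = oce.AttentiveFP(hidden_channels = 32, num_layers = 2)
model = oce.BaseTorchGeometricModel(network, epochs = 10, batch_size = 32)
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])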

preprocess(X, y, fit=False)#
Parameters:

X (list of SMILES) – the input structures to be preprocessed

Returns:

The processed input, converted into the appropriate format for the model

class olorenchemengine.gnn.TLFromCheckpoint(model_path, num_tasks: int = 2048, dropout: float = 0.1, lr: float = 0.0001, optim: str = 'adam', reset: bool = False, **kwargs)#

Bases: BaseLightningModule

TLFromCheckpoint is a class for transfer-learning from an OlorenVec PyTorch-lightning checkpoint.

Parameters:
  • model_path (str, optional) – The path to the PyTorch-lightning checkpoint. Use “default” to use a pretrained OlorenVec model.

  • map_location (str, optional) – The location to map the model to. Default is “cuda:0”.

  • num_tasks (int, optional) – The number of tasks in the OlorenVec model

  • dropout (float, optional) – The dropout rate to use for the model. Default is 0.1.

  • lr (float, optional) – The learning rate to use for training. Default is 1e-4.

  • optim (str, optional) – The optimizer to use for training. Default is “adam”.

olorenchemengine.hyperparameters module#

Contains the basic framework for hyperparameter optimization.

We use hyperopt as our framework for hyperparameter optimization, and the class Opt functions as the bridge between olorenchemengine and hyperopt. Hyperparameters are defined in Opt which is used as an argument in a BaseClass object’s instantiation. These hyperparameters are then collated and used for hyperparameter optimization.

The following is a brief introduction to hyperopt and is a useful starting point for understanding our hyperparameter optimization engine: https://github.com/hyperopt/hyperopt/wiki/FMin.
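Example

A sketch, assuming OptChoice takes a label followed by the candidate values, mirroring hyperopt's hp.choice; `runner` stands in for a BaseModelManager or scoring callable:

import olorenchemengine as oce

model = oce.RandomForestModel(
    representation = oce.MorganVecRepresentation(radius=2, nbits=2048),
    n_estimators = oce.OptChoice('n_estimators', [100, 500, 1000]),
)
# best = oce.optimize(model, runner, max_evals = 3)  # `runner` is assumed; see optimize below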

class olorenchemengine.hyperparameters.Opt(label, *args, use_int=False, **kwargs)#

Bases: BaseClass

abstract property get_hp#
class olorenchemengine.hyperparameters.OptChoice(label, *args, use_int=False, **kwargs)#

Bases: Opt

property get_hp#
class olorenchemengine.hyperparameters.OptLogNormal(label, *args, use_int=False, **kwargs)#

Bases: Opt

property get_hp#
class olorenchemengine.hyperparameters.OptLogUniform(label, *args, use_int=False, **kwargs)#

Bases: Opt

property get_hp#
class olorenchemengine.hyperparameters.OptQLogNormal(label, *args, use_int=False, **kwargs)#

Bases: Opt

property get_hp#
class olorenchemengine.hyperparameters.OptQLogUniform(label, *args, use_int=False, **kwargs)#

Bases: Opt

property get_hp#
class olorenchemengine.hyperparameters.OptQNormal(label, *args, use_int=False, **kwargs)#

Bases: Opt

property get_hp#
class olorenchemengine.hyperparameters.OptQUniform(label, *args, use_int=False, **kwargs)#

Bases: Opt

property get_hp#
class olorenchemengine.hyperparameters.OptRandInt(label, *args, use_int=False, **kwargs)#

Bases: Opt

property get_hp#
class olorenchemengine.hyperparameters.OptUniform(label, *args, use_int=False, **kwargs)#

Bases: Opt

property get_hp#
olorenchemengine.hyperparameters.cast_int(f: float) int#
olorenchemengine.hyperparameters.index_hyperparameters(object: BaseClass) dict#

Returns a dictionary of hyperparameters for the model.

olorenchemengine.hyperparameters.load_hyperparameters(object: BaseClass, hyperparameter_dictionary: dict) dict#
olorenchemengine.hyperparameters.load_hyperparameters_(object: BaseClass, hyperparameter_dictionary: dict) dict#
olorenchemengine.hyperparameters.optimize(model: Union[BaseModel, dict], runner: Union[BaseModelManager, Callable], max_evals=3)#

olorenchemengine.internal module#

class olorenchemengine.internal.BaseClass(log=True)#

Bases: BaseRemoteSymbol

BaseClass is the base class for all models.

All classes in Oloren ChemEngine should inherit from BaseClass to enable universal saving and loading of both parameters and internal state. This requires the implementation of the abstract methods _save and _load.

Registry()#

returns a dictionary mapping the name of a class to the class itself for all subclasses of the class.

_save()#

saves an instance of a BaseClass to a dictionary (abstract method to be implemented by subclasses)

_load()#

loads an instance of a BaseClass from a dictionary (abstract method to be implemented by subclasses)

classmethod AllInstances()#

AllInstances returns a list of all standard instances of all subclasses of BaseClass.

Standard instances means that all required parameters for instantiation of the subclasses are set with canonical values.

classmethod Opt(*args, **kwargs)#
classmethod Registry()#

Registry is a recursive method to create a dictionary of all subclasses of BaseClass, with the key being the name of the subclass and the value being the subclass itself.
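Example

A sketch; RandomForestModel is assumed to be registered under its class name:

import olorenchemengine as oce

registry = oce.BaseModel.Registry()         # {subclass name: subclass, ...}
ModelClass = registry['RandomForestModel']  # look up a model class by name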

copy()#
class olorenchemengine.internal.BaseDepreceated(*args, **kwargs)#

Bases: BaseClass

BaseDepreceated is a class which is used to deprecate a class.

Deprecated classes will raise an Exception and will not run.

class olorenchemengine.internal.BaseEstimator(obj=None)#

Bases: BaseObject

Utility class used to wrap any object with a fit and predict method

fit(X, y)#

Fit the estimator to the data

Parameters:
  • X (np.array) – The data to fit the estimator to

  • y (np.array) – The target data to fit the estimator to

Returns:

The estimator object fit to the data

Return type:

self (object)

predict(X)#

Predict the output of the estimator

Parameters:

X (np.array) – The data to predict the output of the estimator on

Returns:

The predicted output of the estimator

Return type:

y (np.array)

class olorenchemengine.internal.BaseObject(obj=None)#

Bases: BaseClass

BaseObject is the parent class for all classes which directly wrap some object to be saved via joblib.

obj#

the object which is wrapped by the BaseObject

Type:

object

class olorenchemengine.internal.BasePreprocessor(obj=None)#

Bases: BaseObject

BasePreprocessor is the parent class for all preprocessors which transform the features or properties of a dataset.

fit()#

fit the preprocessor to the dataset

fit_transform()#

fit the preprocessor to the dataset and return the transformed values

transform()#

return the transformed values

inverse_transform()#

return the original values from the transformed values

fit(X)#

Fits the preprocessor to the dataset.

Parameters:

X (np.ndarray) – the dataset

Returns:

The fit preprocessor instance

fit_transform(X)#

Fits the preprocessor to the dataset and returns the transformed values.

Parameters:

X (np.ndarray) – the dataset

Returns:

The transformed values of the dataset as a numpy array

inverse_transform(X)#

Returns the original values from the transformed values.

Parameters:

X (np.ndarray) – the transformed values

Returns:

The original values from the transformed values

transform(X)#

Returns the transformed values of the dataset as a numpy array.

Parameters:

X (np.ndarray) – the dataset

Returns:

The transformed values of the dataset as a numpy array
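Example

A sketch; `preprocessor` stands in for any concrete BasePreprocessor subclass instance:

import numpy as np

y = np.array([0.1, 1.0, 10.0])
y_t = preprocessor.fit_transform(y)           # fit to the data, then return the transformed values
y_orig = preprocessor.inverse_transform(y_t)  # recover the original values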

class olorenchemengine.internal.BaseRemoteSymbol(REMOTE_SYMBOL_NAME, REMOTE_PARENT, args=None, kwargs=None)#

Bases: object

classmethod from_rid(rid)#
class olorenchemengine.internal.BaseRepresentation(log=True)#

Bases: BaseClass

BaseClass for all molecular representations (PyTorch Geometric graphs, descriptors, fingerprints, etc.)

Parameters:

log (boolean) – Whether to log the representation or not

_convert(smiles: str, y: Union[int, float, np.number] = None) -> Any#

converts a single structure (represented by a SMILES string) to a representation

convert(Xs: Union[list, pd.DataFrame, dict, str], ys: Union[list, pd.Series, np.ndarray] = None) -> List[Any]#

converts input data to a list of representations

convert(Xs: Union[list, _MockObject.DataFrame, dict, str], ys: Optional[Union[list, _MockObject.Series, _MockObject.ndarray]] = None, **kwargs) List[Any]#

Converts input data to a list of representations.

Parameters:
  • Xs (Union[list, pd.DataFrame, dict, str]) – input data

  • ys (Union[list, pd.Series, np.ndarray], optional) – target values of the input data

Returns:

list of representations of the input data

Return type:

List[Any]

class olorenchemengine.internal.BaseVecRepresentation(*args, collinear_thresh=1.01, scale=<olorenchemengine.internal.StandardScaler object>, names=None, log=True, **kwargs)#

Bases: BaseRepresentation

Representation where given input data, returns a vector representation for each compound.

calculate_distance(x1: Union[str, List[str]], x2: Union[str, List[str]], metric: str = 'cosine', **kwargs) _MockObject.ndarray#

Calculates the distance between two molecules or list of molecules.

Returns a 2D array of distances between each pair of molecules of shape len(x1) by len(x2).

This uses pairwise_distances from sklearn.metrics to calculate distances between the vector representations of the molecules. Valid values for metric are:

From scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']. These metrics support sparse matrix inputs. ['nan_euclidean'] is also supported but does not yet support sparse matrices.

From scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'].

See the documentation for scipy.spatial.distance for details on these metrics.
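
For instance, a sketch using the MorganVecRepresentation documented later in this document:

import olorenchemengine as oce

rep = oce.MorganVecRepresentation(radius=2, nbits=1024)
# 2 x 1 array of cosine distances between the vector representations
D = rep.calculate_distance(["CCO", "c1ccccc1"], ["CC(=O)O"], metric = "cosine")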

convert(Xs: Union[list, _MockObject.DataFrame, dict, str], ys: Optional[Union[list, _MockObject.Series, _MockObject.ndarray]] = None, lambda_convert: Optional[Callable] = None, fit=False, **kwargs) List[_MockObject.ndarray]#

BaseVecRepresentation’s convert returns a list of numpy arrays.

Parameters:
  • Xs (Union[list, pd.DataFrame, dict, str]) – input data

  • ys (Union[list, pd.Series, np.ndarray], optional) – included for compatibility, unused argument. Defaults to None.

Returns:

list of molecular vector representations

Return type:

List[np.ndarray]

property names#
class olorenchemengine.internal.ConcatenatedVecRepresentation(rep1: BaseVecRepresentation, rep2: BaseVecRepresentation, log=True, **kwargs)#

Bases: BaseVecRepresentation

Creates a structure vector representation by concatenating multiple representations.

Can be created by adding two representations together using the + operator.

Example

import olorenchemengine as oce

combo_rep = oce.MorganVecRepresentation(radius=2, nbits=2048) + oce.Mol2Vec()
model = oce.RandomForestModel(representation = combo_rep, n_estimators = 1000)
model.fit(train["Drug"], train["Y"])
model.predict(test["Drug"])

convert(smiles_list, ys=None, fit=False, **kwargs)#

BaseVecRepresentation’s convert returns a list of numpy arrays.

Parameters:
  • Xs (Union[list, pd.DataFrame, dict, str]) – input data

  • ys (Union[list, pd.Series, np.ndarray], optional) – included for compatibility, unused argument. Defaults to None.

Returns:

list of molecular vector representations

Return type:

List[np.ndarray]

class olorenchemengine.internal.LinearRegression(*args, **kwargs)#

Bases: BaseEstimator

Wrapper for sklearn LinearRegression

class olorenchemengine.internal.LogScaler(min_value=0, with_mean=True, with_std=True)#

Bases: BasePreprocessor

LogScaler is a BasePreprocessor which standardizes the data by taking the log and then removing the mean and scaling to unit variance.

fit(X)#

Fits the preprocessor to the dataset.

Parameters:

X (np.ndarray) – the dataset

Returns:

The fit preprocessor instance

fit_transform(X)#

Fits the preprocessor to the dataset and returns the transformed values.

Parameters:

X (np.ndarray) – the dataset

Returns:

The transformed values of the dataset as a numpy array

inverse_transform(X)#

Returns the original values from the transformed values.

Parameters:

X (np.ndarray) – the transformed values

Returns:

The original values from the transformed values

transform(X)#

Returns the transformed values of the dataset as a numpy array.

Parameters:

X (np.ndarray) – the dataset

Returns:

The transformed values of the dataset as a numpy array

class olorenchemengine.internal.OASConnector#

Bases: object

Class which links oce to OAS and can move data between them using Firestore and an API

authenticate()#
upload_model(model, model_name)#
upload_vis(visualization)#
class olorenchemengine.internal.QuantileTransformer(n_quantiles=1000, output_distribution='normal', subsample=100000.0, random_state=None)#

Bases: BasePreprocessor

QuantileTransformer is a BasePreprocessor which transforms a dataset via quantile transformation to a specified distribution.

obj#

the object which is wrapped by the BasePreprocessor

Type:

sklearn.preprocessing.QuantileTransformer

class olorenchemengine.internal.Remote(remote_url, session_id=None, keep_alive=False, debug=False)#

Bases: object

class olorenchemengine.internal.RemoteObj(remote_id)#

Bases: BaseRemoteSymbol

Dummy object to represent remote objects.

class olorenchemengine.internal.SMILESRepresentation(log=True)#

Bases: BaseRepresentation

Extracts the SMILES strings from inputted data

convert(Xs: Union[list, pd.DataFrame, dict, str], ys: Union[list, pd.Series, np.ndarray] = None) -> List[Any]: converts input data to a list of SMILES strings

Data types:

  • pd.DataFrames will have columns "smiles" or "Smiles" or "SMILES" extracted

  • lists and tuples of multiple elements will have their first element extracted

  • strings will be converted to a list of one element

  • everything else will be returned as inputted

convert(Xs, ys=None, **kwargs)#

Converts input data to a list of representations.

Parameters:
  • Xs (Union[list, pd.DataFrame, dict, str]) – input data

  • ys (Union[list, pd.Series, np.ndarray], optional) – target values of the input data

Returns:

list of representations of the input data

Return type:

List[Any]
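
A sketch of the conversion rules above:

import pandas as pd
import olorenchemengine as oce

rep = oce.SMILESRepresentation()
rep.convert("CCO")                                     # -> ["CCO"]
rep.convert(pd.DataFrame({"smiles": ["CCO", "CCN"]}))  # -> ["CCO", "CCN"]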

class olorenchemengine.internal.StandardScaler(with_mean=True, with_std=True)#

Bases: BasePreprocessor

StandardScaler is a BasePreprocessor which standardizes the data by removing the mean and scaling to unit variance.

obj#

the object which is wrapped by the BasePreprocessor

Type:

sklearn.preprocessing.StandardScaler

olorenchemengine.internal.all_subclasses(cls)#

Helper function to return all subclasses of a class

olorenchemengine.internal.create_BC(d: dict) BaseClass#

create_BC is a method which creates a BaseClass object from a dictionary of parameters.

Note that the instance variables of the object are not specified.

Parameters:

d (dict) – a dictionary of parameters returned by parameterize

Returns:

the object created from the parameters

Return type:

BaseClass

olorenchemengine.internal.deparametrize_args_kwargs(params)#
olorenchemengine.internal.detect_setting(data)#
olorenchemengine.internal.download_public_file(path, redownload=False)#

Downloads a public file from Oloren's Public Storage and returns the contents.

Parameters:
  • path – the path to the file to read.

  • redownload – whether to redownload the file if it already exists.

olorenchemengine.internal.generate_uuid()#
olorenchemengine.internal.get_all_reps()#
olorenchemengine.internal.get_default_args(func)#
olorenchemengine.internal.get_runtime()#
olorenchemengine.internal.import_or_install(package_name: str, statement: Optional[str] = None, scope: Optional[dict] = None)#
olorenchemengine.internal.install_with_permission(package_name: str)#
olorenchemengine.internal.json_params_str(base: Union[BaseClass, dict]) str#

Returns a json string of the parameters of the passed BaseClass object so that the model parameter dictionary can be reconstructed with json.loads(params_str)

olorenchemengine.internal.load(fname: str) BaseClass#

loads a BaseClass from a file

Parameters:

fname (str) – name of the file to load the object from

Returns:

the BaseClass object which has been recreated from the file

Return type:

BaseClass

olorenchemengine.internal.loads(d: dict) BaseClass#

loads is a method which recreates a BaseClass object from a save.

Parameters:

d (dict) – the dictionary returned by saves which saves the state of a BaseClass object

Returns:

the recreated object

Return type:

BaseClass

olorenchemengine.internal.log_arguments(func: Callable[[...], None]) Callable[[...], None]#
log_arguments is a decorator which logs the arguments of a BaseClass constructor to instance variables for use in model parameterization.

Parameters:

func (function) – a __init__(self, *args, **kwargs) function of a baseclass.

Returns:

the same __init__ function with arguments saved to instance variables.

Return type:

wrapper (function)
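
A sketch of the intended usage on a user-defined subclass; the subclass and the _save/_load bodies below are hypothetical, written only to match the descriptions earlier in this module:

from olorenchemengine.internal import log_arguments
import olorenchemengine as oce

class MyRepresentation(oce.BaseClass):
    @log_arguments
    def __init__(self, alpha = 1.0, log = True, **kwargs):
        self.alpha = alpha

    def _save(self) -> dict:
        return {}  # instance state to save, per the _save description above

    def _load(self, d: dict):
        pass  # restore instance state, per the _load description above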

olorenchemengine.internal.mock_imports(g, *args)#
olorenchemengine.internal.model_name_from_model(model: BaseClass) str#

model_name_from_model creates a unique name for a model.

Parameters:

model (BaseClass) – the model to be named

Returns:

the model name consisting of the model class name with a hash of the parameters

Return type:

str

olorenchemengine.internal.model_name_from_params(param_dict: dict) str#

model_name_from_params creates a unique name for a model based on the parameters passed to it.

Parameters:

param_dict (dict) – dictionary of parameters returned by parameterize, necessary to instantiate the model (note this is different from the instance save)

Returns:

the model name consisting of the model class name with a hash of the parameters

Return type:

str

olorenchemengine.internal.package_available(package_name: str) bool#

Checks if a package is available.

Parameters:

package_name (str) – the name of the package to check for

Returns:

True if the package is available, False otherwise

Return type:

bool

olorenchemengine.internal.parameterize(object: Optional[Union[BaseClass, list, dict, int, float, str]]) dict#

parameterize is a recursive method which creates a dictionary of all arguments necessary to instantiate a BaseClass object.

Note that only objects which are instances of subclasses of BaseClass can be parameterized; the other supported types exist to enable the recursive use of parameterize but cannot themselves be parameterized.

Parameters:

object (Union[BaseClass, list, dict, int, float, str, None]) – the object to be parameterized.

Raises:

ValueError – Object is not of type that can be parameterized

Returns:

dictionary of parameters necessary to instantiate the object.

Return type:

dict
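
A round-trip sketch, combining parameterize with create_BC as described above:

import olorenchemengine as oce
from olorenchemengine.internal import parameterize, create_BC

model = oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=1024), n_estimators = 100)
params = parameterize(model)  # dict of constructor arguments
clone = create_BC(params)     # new instance; instance variables are not copied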

olorenchemengine.internal.parametrize_args_kwargs(args, kwargs)#
olorenchemengine.internal.pretty_args_kwargs(args, kwargs)#
olorenchemengine.internal.pretty_params(base: Union[BaseClass, dict]) dict#

Returns a dictionary of the parameters of the passed BaseClass object, formatted to be human readable, with the names of the arguments included.

olorenchemengine.internal.pretty_params_str(base: Union[BaseClass, dict]) str#

Returns a string of the parameters of the passed BaseClass object, formatted to be human readable.

olorenchemengine.internal.recursive_get_attr(parent, attr)#
olorenchemengine.internal.save(model: BaseClass, fname: str)#

saves a BaseClass object to a file

Parameters:
  • model (BaseClass) – the object to be saved

  • fname (str) – the file name to save the model to
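
A save/load round trip, as a sketch (assuming save and load are re-exported at the package top level; model stands for any trained BaseClass object, and the file name is arbitrary):

import olorenchemengine as oce

oce.save(model, "model.oce")    # serialize to a file
model2 = oce.load("model.oce")  # recreate the object from the file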

olorenchemengine.internal.saves(object: Optional[Union[BaseClass, dict, list, int, float, str]]) dict#

saves is a method which saves a BaseClass object; the object can be recovered via loads.

Parameters:

object (Union[BaseClass, dict, list, int, float, str, None]) – the object to be saved

Returns:

a dictionary which can be passed to loads to recreate the object

Return type:

dict

olorenchemengine.internal.set_runner(runner)#

olorenchemengine.interpret module#

class olorenchemengine.interpret.CounterfactualEngine(model: BaseModel, perturbation_engine: PerturbationEngine = 'default')#

Bases: BaseClass

Generates counterfactual compounds based on the exmol GitHub repository: "Model agnostic generation of counterfactual explanations for molecules".

generate_cfs(delta: Union[int, float, Tuple] = (-1, 1), n: int = 4) None#

Generates counterfactuals and stores them in self.cfs as a list of dictionaries.

Parameters:
  • delta – margin defining counterfactuals for regression models

  • n – number of counterfactuals

generate_samples(smiles: str) None#

Generates candidate counterfactuals and stores them in self.samples as a list of dictionaries.

Parameters:

smiles – SMILES string of the target prediction

get_cfs() _MockObject.DataFrame#

Returns counterfactuals as a pandas dataframe.

Returns:

pandas dataframe of counterfactuals

get_samples() _MockObject.DataFrame#

Returns candidate counterfactuals as a pandas dataframe.

Returns:

pandas dataframe of candidate counterfactuals
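
A workflow sketch based on the method descriptions above (model is a trained BaseModel; the SMILES string is arbitrary):

import olorenchemengine as oce

engine = oce.CounterfactualEngine(model)
engine.generate_samples("CC(=O)Nc1ccc(O)cc1")  # candidate counterfactuals
engine.generate_cfs(delta=(-1, 1), n=4)        # select counterfactuals
cfs = engine.get_cfs()                         # pandas dataframe of results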

class olorenchemengine.interpret.PerturbationEngine(log=True)#

Bases: BaseClass

PerturbationEngine is the base class for techniques which mutate or perturb a compound into a similar one with a small difference.

get_compound_at_idx()#

returns a compound with a modification at a given atom index

get_compound()#

returns a compound with a randomly chosen modification

get_compound_list()#

returns a list of compounds with modifications; the list is meant to comprehensively cover the results of applying an entire class of modifications.

abstract get_compound(smiles, n=1, **kwargs) str#
abstract get_compound_at_idx(mol: _MockObject.Chem.Mol, idx: int) str#
abstract get_compound_list(smiles, **kwargs) list#
class olorenchemengine.interpret.STONEDMutations(mutations: int = 1, log=True)#

Bases: PerturbationEngine

Implements STONED-SELFIES algorithm for generating modified compounds.

See the STONED-SELFIES GitHub repository: "Beyond Generative Models: Superfast Traversal, Optimization, Novelty, Exploration and Discovery (STONED) Algorithm for Molecules using SELFIES".

get_compound_at_idx()#

returns a compound with a randomly chosen modification at idx

get_compound()#

returns a compound with a randomly chosen modification

get_compound_list()#

returns a list of num_samples compounds with randomly chosen modifications

get_compound(smiles: str, **kwargs) str#
get_compound_at_idx(mol: _MockObject.Chem.Mol, idx: int, **kwargs) str#
get_compound_list(smiles: str, num_samples: int = 1000, **kwargs) list#
class olorenchemengine.interpret.SwapMutations(radius=0, log=True)#

Bases: PerturbationEngine

SwapMutations replaces a substructure of radius r with another substructure of radius < r. The replacement substructure is chosen such that it has the same outgoing bonds, and this set of substructures is identified through a comprehensive enumeration of a large set of lead-like compounds.

get_compound()#

returns a compound with a randomly chosen modification

get_compound_list()#

returns a list of compounds with modifications; the list is meant to comprehensively cover the results of applying an entire class of modifications

get_compound(smiles, **kwargs)#
get_compound_at_idx(mol, idx, **kwargs)#
get_compound_list(smiles, idx: Optional[int] = None, **kwargs) list#
get_entry(m, idx, r=1)#
get_substitution(m, idx, r=1)#
stitch(m)#
olorenchemengine.interpret.model_molecule_sensitivity(model: BaseModel, smiles: str, perturbation_engine: PerturbationEngine = 'default', n: int = 30) _MockObject.Chem.Mol#

Calculates the sensitivity of a model to perturbations on each of a molecule's atoms, outputting an rdkit molecule with the sensitivity stored as an atom property.

Parameters:
  • model – model to be used for sensitivity calculation

  • smiles – SMILES string of the target prediction

  • perturbation_engine – perturbation engine to be used for sensitivity calculation

  • n – number of perturbations to be used for sensitivity calculation

Returns:

rdkit molecule with sensitivity as an atom property
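
A sketch of reading the per-atom sensitivities off the returned molecule; note that the atom property name used below ("sensitivity") is an assumption not confirmed by this documentation:

import olorenchemengine as oce

mol = oce.model_molecule_sensitivity(model, "CCO", n=30)
for atom in mol.GetAtoms():
    print(atom.GetIdx(), atom.GetProp("sensitivity"))  # assumed property name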

olorenchemengine.manager module#

class olorenchemengine.manager.BaseModelManager(dataset: BaseDataset, metrics: List[str], file_path: str = None, primary_metric: str = None, verbose=True, log=True)#

Bases: BaseClass

BaseModelManager is the base class for all model managers.

Parameters:
  • dataset (Dataset) – The dataset to use for training and testing.

  • metrics (List[str]) – A list of metrics to use.

  • verbose (bool) – Whether or not to print progress.

  • file_path (str) – The path to the model_database.

  • log (bool) – Whether or not to log to the model_database.

property direction#
get_dataset()#
get_model_database()#
primary_metric()#
run(models: Union[BaseModel, List[BaseModel]], return_models: bool = False)#

Runs the model on the dataset and saves the results to the model_database.

Parameters:
  • models (Union[BaseModel,List[BaseModel]]) – The model(s) to run.

  • return_models (bool) – Whether or not to return the trained models.
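
For example, a sketch assuming a BaseDataset built as in the splitter examples later in this document:

import olorenchemengine as oce

mm = oce.ModelManager(dataset, metrics = ["Root Mean Squared Error"])
mm.run([oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=1024), n_estimators = 100)])
print(mm.get_model_database())  # parameter settings and metrics for each run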

class olorenchemengine.manager.FirebaseModelManager(dataset: BaseDataset, metrics: List[str], uid: str, primary_metric: str = None, file_path: str = None, log=True)#

Bases: BaseModelManager

FirebaseModelManager is a ModelManager that saves model parameters and performances to a Firebase database.

A Firebase service account key in oce.CONFIG is required for database access.

Model information is saved to a collection called ‘models’ in the database. For each document, the following is saved:

  • uid: the user id of the user associated with the model

  • did: the dataset_id of the dataset on which the model was trained

  • model_parameters: parameters of the BaseModel oce object

  • model_name

  • model_status

  • fit_time

  • metrics: model training metrics

Dataset information is saved to a collection called ‘datasets’ in the database. For each document, the following is saved:

  • dataset: map representation of the BaseDataset oce object

  • hashed_dataset: md5 hash of the dataset data

  • uid: the user id of the user associated with the dataset

Parameters:
  • dataset (BaseDataset) – The dataset to use for model development.

  • metrics (list[Str]) – The metrics to track e.g. ROC AUC, Root Mean Squared Error.

  • file_path (str) – The path to save the model manager to.

  • uid (str) – The user id associated with the model manager

run(models: Union[BaseModel, List[BaseModel]])#

Run the model manager on the given model(s).

Parameters:

models (BaseModel or list[BaseModel]) – The model(s) to run.

class olorenchemengine.manager.ModelManager(dataset: BaseDataset, metrics: List[str], file_path: str = None, primary_metric: str = None, verbose=True, log=True)#

Bases: BaseModelManager

ModelManager is the class that tracks model development against a specified dataset. It is responsible for saving parameter settings and metrics.

Parameters:
  • dataset (BaseDataset) – The dataset to use for model development.

  • metrics (list[Str]) – The metrics to track e.g. ROC AUC, Root Mean Squared Error.

  • file_path (str) – The path to save the model manager to. Optional.

class olorenchemengine.manager.SheetsModelManager(dataset: BaseDataset, metrics: List[str], file_path: str = None, primary_metric: str = None, name: str = 'SheetsModelManager', email: str = '', log=True)#

Bases: BaseModelManager

SheetsModelManager is the class that tracks model development against a specified dataset on Google Sheets. It is responsible for saving parameter settings and metrics.

Parameters:
  • dataset (BaseDataset) – The dataset to use for model development.

  • metrics (list[Str]) – The metrics to track e.g. ROC AUC, Root Mean Squared Error.

  • name (str) – The name of the Google Sheets to save this to. Optional.

  • email (str) – The email to share the results to. Optional, Default is share to anyone with the link.

olorenchemengine.manager.TOP_MODELS_ADMET() List[BaseModel]#

Returns a list of the top models from the ADMET dataset.

Returns:

A list of the top models from the ADMET dataset.

Return type:

List[BaseModel]

olorenchemengine.reduction module#

class olorenchemengine.reduction.FactorAnalysis(*args, **kwargs)#

Bases: BaseSKLearnReduction

Wrapper for sklearn FactorAnalysis

class olorenchemengine.reduction.PCA(*args, **kwargs)#

Bases: BaseSKLearnReduction

Wrapper for sklearn PCA

olorenchemengine.representations module#

A library of various molecular representations.

class olorenchemengine.representations.AtomFeaturizer(log=True)#

Bases: BaseClass

Abstract class for atom featurizers, which create a vector representation for a single atom.

length(self) int#

returns the length of the atom vector representation, to be implemented by subclasses

convert(self, atom: Chem.Atom) -> np.ndarray: converts a single Chem.Atom to a vector representation, to be implemented by subclasses

abstract convert(atom: _MockObject.Chem.Atom) _MockObject.ndarray#
abstract property length: int#
class olorenchemengine.representations.BaseCompoundVecRepresentation(normalize=False, **kwargs)#

Bases: BaseVecRepresentation

Computes a vector representation from each structure.

Parameters:
  • normalize (bool) – whether to normalize the vector representation or not

  • names (List[str]) – list of the names of the features in the vector representation, optional.

convert(Xs: Union[list, _MockObject.Series, _MockObject.DataFrame, dict, str], ys: Optional[Union[list, _MockObject.Series, _MockObject.ndarray]] = None, lambda_convert: Optional[Callable] = None, fit=False, **kwargs) _MockObject.ndarray#

Computes a vector representation from each structure in Xs.

inverse(Xs)#

Inverts the vector representation to the original feature values

Parameters:

Xs (np.ndarray) – vector representation of the structures

Returns:

list of the original feature values

Return type:

list

class olorenchemengine.representations.BondFeaturizer(log=True)#

Bases: BaseClass

Abstract class for bond featurizers, which create a vector representation for a single bond.

length(self) int#

returns the length of the bond vector representation, to be implemented by subclasses

convert(self, bond: Chem.Bond) -> np.ndarray: converts a single Chem.Bond to a vector representation, to be implemented by subclasses

abstract convert(bond: _MockObject.Chem.Bond) _MockObject.ndarray#
abstract property length: int#
class olorenchemengine.representations.ConcatenatedAtomFeaturizers(atom_featurizers: List[AtomFeaturizer])#

Bases: AtomFeaturizer

Concatenates multiple atom featurizers into a single vector representation.

length(self) int#

returns the length of the atom vector representation, to be implemented by subclasses

convert(self, atom: Chem.Atom) -> np.ndarray: converts a single Chem.Atom to a vector representation, to be implemented by subclasses

convert(atom: _MockObject.Chem.Atom) _MockObject.ndarray#
property length: int#
class olorenchemengine.representations.ConcatenatedBondFeaturizers(bond_featurizers: List[BondFeaturizer])#

Bases: BondFeaturizer

Concatenates multiple bond featurizers into a single vector representation.

length(self) int#

returns the length of the bond vector representation, to be implemented by subclasses

convert(self, bond: Chem.Bond) -> np.ndarray: converts a single Chem.Bond to a vector representation, to be implemented by subclasses

convert(bond: _MockObject.Chem.Bond) _MockObject.ndarray#
property length: int#
class olorenchemengine.representations.ConcatenatedStructVecRepresentation(rep1: BaseCompoundVecRepresentation, rep2: BaseCompoundVecRepresentation, log=True, **kwargs)#

Bases: BaseCompoundVecRepresentation

Creates a structure vector representation by concatenating multiple representations.

DEPRECATED, use ConcatenatedVecRepresentation instead.

class olorenchemengine.representations.DatasetFeatures(*args, collinear_thresh=1.01, scale=<olorenchemengine.internal.StandardScaler object>, names=None, log=True, **kwargs)#

Bases: BaseVecRepresentation

Selects features from the input dataset as the vector representation

convert(X, **kwargs)#

BaseVecRepresentation’s convert returns a list of numpy arrays.

Parameters:
  • Xs (Union[list, pd.DataFrame, dict, str]) – input data

  • ys (Union[list, pd.Series, np.ndarray], optional) – included for compatibility, unused argument. Defaults to None.

Returns:

list of molecular vector representations

Return type:

List[np.ndarray]

class olorenchemengine.representations.DescriptastorusDescriptor(name, *args, log=True, scale=None, **kwargs)#

Bases: BaseCompoundVecRepresentation

Wrapper for DescriptaStorus descriptors (https://github.com/bp-kelley/descriptastorus)

Parameters:
  • name (str) – name of the descriptor. Either "atompaircounts", "morgan3counts", "morganchiral3counts", "morganfeature3counts", "rdkit2d", "rdkit2dnormalized", "rdkitfpbits"

  • log (bool) – whether to log the representations or not

classmethod AllInstances()#

AllInstances returns a list of all standard instances of all subclasses of BaseClass.

Standard instances means that all required parameters for instantiation of the subclasses are set with canonical values.

available_descriptors = ['atompaircounts', 'morgan3counts', 'morganchiral3counts', 'morganfeature3counts', 'rdkit2d', 'rdkit2dnormalized', 'rdkitfpbits']#
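
For instance, a minimal sketch:

import olorenchemengine as oce

rep = oce.DescriptastorusDescriptor("rdkit2dnormalized")
features = rep.convert(["CCO", "c1ccccc1"])  # one descriptor vector per SMILES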
class olorenchemengine.representations.FragmentIndicator(log=True, **kwargs)#

Bases: BaseCompoundVecRepresentation

Indicator variables for all fragments in rdkit.Chem.Fragments

http://rdkit.org/docs/source/rdkit.Chem.Fragments.html

class olorenchemengine.representations.GobbiPharma2D(normalize=False, **kwargs)#

Bases: BaseCompoundVecRepresentation

2D Gobbi pharmacophore descriptor (implemented in RDKit, from https://doi.org/10.1002/(SICI)1097-0290(199824)61:1<47::AID-BIT9>3.0.CO;2-Z)

class olorenchemengine.representations.GobbiPharma3D(normalize=False, **kwargs)#

Bases: BaseCompoundVecRepresentation

3D Gobbi pharmacophore descriptor (implemented in RDKit, from https://doi.org/10.1002/(SICI)1097-0290(199824)61:1<47::AID-BIT9>3.0.CO;2-Z)

class olorenchemengine.representations.LipinskiDescriptor(log=True, **kwargs)#

Bases: BaseCompoundVecRepresentation

Wrapper for Lipinski descriptors (https://www.rdkit.org/docs/RDKit_Book.html#Lipinski_Descriptors)

Parameters:

log (bool) – whether to log the representations or not

class olorenchemengine.representations.MACCSKeys#

Bases: BaseCompoundVecRepresentation

Calculate MACCS (Molecular ACCess System) Keys fingerprint.

Durant, Joseph L., et al. “Reoptimization of MDL keys for use in drug discovery.” Journal of chemical information and computer sciences 42.6 (2002): 1273-1280.

class olorenchemengine.representations.MCSClusterRep(dataset: BaseDataset, *args, eval_set='train', timeout: int = 5, threshold: float = 0.9, cached=False, log=True, **kwargs)#

Bases: BaseCompoundVecRepresentation

Clusters a train set of compounds and then finds the maximum common substructure (MCS) within each cluster. The presence of each cluster's MCS is used as a feature.

class olorenchemengine.representations.ModelAsRep(model: Union[BaseModel, str], name='ModelAsRep', download_public_file=False, log=True, **kwargs)#

Bases: BaseCompoundVecRepresentation

Uses a trained model itself as a representation.

If we are trying to predict property A, and there is a highly related property B with plentiful data, we could train a model on property B and use that model with ModelAsRep as a representation for predicting property A.

Parameters:
  • model (BaseModel, str) – A trained model to be used as the representation, either a BaseModel object or a path to a saved model

  • download_public_file (bool, optional) – If True, will download the specified model from OCE’s public warehouse of models. Defaults to False.

  • name (str) – name of the property the passed model predicts, which is useful for clear save files and interpretability visualizations. Optional.

class olorenchemengine.representations.MordredDescriptor(descriptor_set: Union[str, list] = '2d', log: bool = True, normalize: bool = False, **kwargs)#

Bases: BaseCompoundVecRepresentation

Wrapper for Mordred descriptors (https://github.com/mordred-descriptor/mordred)

Parameters:
  • log (bool) – whether to log the representations or not

  • descriptor_set (str) – name of the descriptor set to use

  • normalize (bool) – whether to normalize the descriptors or not

convert(Xs, ys=None, **kwargs)#

Computes a vector representation from each structure in Xs.

convert_full(Xs, ys=None, **kwargs)#

Convert list of SMILES to descriptors in the form of a numpy array.

Parameters:
  • Xs (list) – List of SMILES strings.

  • ys (list) – List of labels.

  • normalize (bool) – Whether to normalize the descriptors.

Returns:

Array of descriptors. Shape: (len(Xs), len(self.names))

Return type:

np.ndarray

class olorenchemengine.representations.MorganVecRepresentation(radius=2, nbits=1024, scale=None, log=True, **kwargs)#

Bases: BaseCompoundVecRepresentation

info(smiles)#
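
A minimal usage sketch:

import olorenchemengine as oce

rep = oce.MorganVecRepresentation(radius=2, nbits=1024)
fps = rep.convert(["CCO", "CCN"])  # one 1024-bit Morgan fingerprint per SMILES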
class olorenchemengine.representations.NoisyVec(rep: BaseVecRepresentation, *args, a_std=0.1, m_std=0.1, **kwargs)#

Bases: BaseVecRepresentation

Adds noise to a given BaseVecRepresentation

Parameters:
  • rep (BaseVecRepresentation) – BaseVecRepresentation to add noise to

  • a_std (float) – standard deviation of the additive noise. Defaults to 0.1.

  • m_std (float) – standard deviation of the multiplicative noise. Defaults to 0.1.

  • names (List[str]) – list of the names of the features in the vector representation, optional.

Example

import olorenchemengine as oce

# substitute a concrete BaseCompoundVecRepresentation for the placeholder below
model = oce.RandomForestModel(representation = oce.BaseCompoundVecRepresentation(Params), n_estimators = 1000)
model.fit(train["Drug"], train["Y"])
model.predict(test["Drug"])

class olorenchemengine.representations.OGBAtomFeaturizer#

Bases: AtomFeaturizer

Creates a vector representation for a single atom using the Open Graph Benchmark’s atom_to_feature_vector function.

convert(atom: _MockObject.Chem.Atom)#
property length#
class olorenchemengine.representations.OGBBondFeaturizer#

Bases: BondFeaturizer

Creates a vector representation for a single bond using the Open Graph Benchmark’s bond_to_feature_vector function.

convert(bond: _MockObject.Chem.Bond)#
property length#
class olorenchemengine.representations.OlorenCheckpoint(model_path: str, num_tasks: int = 2048, log: bool = True, **kwargs)#

Bases: BaseCompoundVecRepresentation

Use OlorenVec from checkpoint as a molecular representation

Parameters:
  • model_path (str) – path to checkpoint file for OlorenVec. Use “default” if unsure

  • num_tasks (int) – number of coordinates in the vector representation

  • log (bool, optional) – Log arguments or not. Should only be true if it is not nested. Defaults to True.

classmethod AllInstances()#

AllInstances returns a list of all standard instances of all subclasses of BaseClass.

Standard instances means that all required parameters for instantiation of the subclasses are set with canonical values.

molecule2graph(mol, include_mol=False)#

Convert a molecule to a PyG graph with features and labels

Parameters:
  • mol (rdkit.Chem.rdchem.Mol) – molecule to convert

  • include_mol (bool, optional) – Whether or not to include the molecule in the graph. Defaults to False.

Returns:

PyG graph

Return type:

graph

smiles2pyg(smiles_str, y, morgan_params={'nBits': 1024, 'radius': 2})#

Convert a SMILES string to a PyG graph with features and labels

Parameters:
  • smiles_str (str) – SMILES string to convert

  • y (int) – label of the molecule

  • morgan_params (dict, optional) – parameters for morgan fingerprint. Defaults to {“radius”: 2, “nBits”: 1024}.

Returns:

PyG graph

Return type:

graph

class olorenchemengine.representations.PeptideDescriptors1(log=True, **kwargs)#

Bases: BaseCompoundVecRepresentation

class olorenchemengine.representations.PubChemFingerprint#

Bases: BaseCompoundVecRepresentation

PubChem Fingerprint

Implemented as a fingerprint which runs locally, rather than by calling the PubChem Fingerprint (PCFP) webservice, using RDKit to calculate the fingerprint.

Specs are described in ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt. Search patterns are from https://bitbucket.org/caodac/pcfp/src/master/src/tripod/fingerprint/PCFP.java.

class olorenchemengine.representations.PubChemFingerprint_local#

Bases: BaseCompoundVecRepresentation

PubChem Fingerprint

Implemented as a fingerprint which runs locally, rather than by calling the PubChem Fingerprint (PCFP) webservice, using RDKit to calculate the fingerprint.

On a validation set of 400 compounds from the FDA Orange Book, PubChemFingerprint_local matches the PubChem server-based version on 331/400 compounds and is within 1 bit on 360/400 compounds. There are, however, 28/400 compounds where it is between 50 and 100 bits off.

Specs are described in ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt. Search patterns are from https://bitbucket.org/caodac/pcfp/src/master/src/tripod/fingerprint/PCFP.java.

class olorenchemengine.representations.TorchGeometricGraph(atom_featurizer: ~olorenchemengine.representations.AtomFeaturizer = <olorenchemengine.representations.OGBAtomFeaturizer object>, bond_featurizer: ~olorenchemengine.representations.BondFeaturizer = <olorenchemengine.representations.OGBBondFeaturizer object>, **kwargs)#

Bases: BaseRepresentation

Representation which returns torch_geometric.data.Data objects.

dimensions#

number of dimensions for the atom and bond representations

Type:

Tuple[int, int]

_convert(self, smiles: str, y: Any = None) -> Data: converts a single SMILES string to a torch_geometric.data.Data object

convert(Xs: Union[list, _MockObject.DataFrame, dict, str], ys: Optional[Union[list, _MockObject.Series, _MockObject.ndarray]] = None, **kwargs) List[Any]#

Converts input data to a list of representations.

Parameters:
  • Xs (Union[list, pd.DataFrame, dict, str]) – input data

  • ys (Union[list, pd.Series, np.ndarray], optional) – target values of the input data

Returns:

list of representations of the input data

Return type:

List[Any]

property dimensions#
olorenchemengine.representations.countAnyRing(mol, rings, size)#
olorenchemengine.representations.countAromaticRing(mol, rings)#
olorenchemengine.representations.countHeteroAromaticRing(mol, rings)#
olorenchemengine.representations.countHeteroInRing(mol, rings, size)#
olorenchemengine.representations.countNitrogenInRing(mol, rings, size)#
olorenchemengine.representations.countSaturatedOrAromaticCarbonOnlyRing(mol, rings, size)#
olorenchemengine.representations.countSaturatedOrAromaticHeteroContainingRing(mol, rings, size)#
olorenchemengine.representations.countSaturatedOrAromaticNitrogenContainingRing(mol, rings, size)#
olorenchemengine.representations.countUnsaturatedCarbonOnlyRing(mol, rings, size)#
olorenchemengine.representations.countUnsaturatedHeteroContainingRing(mol, rings, size)#
olorenchemengine.representations.countUnsaturatedNitrogenContainingRing(mol, rings, size)#
olorenchemengine.representations.get_valid_combinations(sets)#
olorenchemengine.representations.isAromaticRing(mol, atoms)#
olorenchemengine.representations.isCarbonOnlyRing(mol, atoms)#
olorenchemengine.representations.isRingSaturated(mol, atoms)#
olorenchemengine.representations.isRingUnsaturated(mol, atoms, all_rings)#

olorenchemengine.splitters module#

For creating splits on the data

class olorenchemengine.splitters.BaseSplitter(split_proportions=[0.8, 0.1, 0.1], log=True)#

Bases: BaseDatasetTransform

Base class for all splitters.

Parameters:
  • split_proportions (float) – Proportion of the data to be used for training.

  • log (bool) – Whether to log the data or not.

abstract split(data, *args, **kwargs)#

Split data into train/val/test sets.

Parameters:

data (pandas.DataFrame) – Dataset to split, must have a structure column.

Returns:

Tuple of training, validation, and testing dataframes.

Return type:

(tuple)

transform(dataset: BaseDataset, *args, **kwargs) BaseDataset#

Applies a transformation onto the inputted BaseDataset.

Parameters:

dataset (BaseDataset) – The dataset to transform.

class olorenchemengine.splitters.DateSplitter(log=True, **kwargs)#

Bases: BaseSplitter

Split data into train/val/test sets by date range.

Parameters:
  • date_col (str) – Name of the column to split by.

  • log (bool) – Whether to log the data or not.

split(data, date_col)#

Return array of train/val/test dataframes in format [train, val, test].

Example

import pandas as pd
import olorenchemengine as oce

df = pd.read_csv("Your Dataset")
dataset = (
    oce.BaseDataset(data = df.to_csv(), structure_col = "SMILES COLUMN", property_col = "PROPERTY COLUMN")
    + oce.DateSplitter(split_proportions = [0.8, 0.1, 0.1], date_col = "DATE COLUMN")
)
# OR
train, val, test = oce.DateSplitter(split_proportions = [0.8, 0.1, 0.1], date_col = "DATE COLUMN").split(df)

split(data, date_col, *args, **kwargs)#

Split data into train/val/test sets.

Parameters:

data (pandas.DataFrame) – Dataset to split, must have a structure column.

Returns:

Tuple of training, validation, and testing dataframes.

Return type:

(tuple)

class olorenchemengine.splitters.PropertySplit(property_col, threshold=None, noise=0.1, categorical=False, log=True, **kwargs)#

Bases: BaseSplitter

Split molecules into train/val/test based on user-defined property.

Parameters:
  • property_col (str) – column in dataset with property values to split data on

  • threshold (int, optional) – user-defined value to split data on. If set to None (default), the threshold will be determined based on split_proportions. The user defines a single threshold for the train/test split.

  • noise (int) – random noise to add to the dataset before splitting. Note: data is minmax scaled to the [0, 1] range before noise is introduced.

  • categorical (bool) – set True to convert property values to categorical format ([0, 1, 2]) based on the threshold.

Methods:

split(data): Return array of train/val/test dataframes in format [train, val, test].

Example

import pandas as pd
import olorenchemengine as oce

df = pd.read_csv("Your Dataset")
dataset = (
    oce.BaseDataset(data = df.to_csv(), structure_col = "SMILES COLUMN", property_col = "PROPERTY COLUMN")
    + oce.PropertySplit(split_proportions = [0.8, 0.1, 0.1], property_col = "PROPERTY COLUMN", threshold = 0.5, noise = 0.1, categorical = False)
)
# OR
train, val, test = oce.PropertySplit(split_proportions = [0.8, 0.1, 0.1], property_col = "PROPERTY COLUMN", threshold = 0.5, noise = 0.1, categorical = False).split(df)

split(data: _MockObject.DataFrame, *args, **kwargs)#

Split data into train/val/test sets.

Parameters:

data (pandas.DataFrame) – Dataset to split, must have a structure column.

Returns:

Tuple of training, validation, and testing dataframes.

Return type:

(tuple)

class olorenchemengine.splitters.RandomSplit(log=True, **kwargs)#

Bases: BaseSplitter

Split data randomly into train/val/test sets.

Parameters:
  • data (pandas.DataFrame) – Dataset to split.

  • split_proportions (tuple[int]) – Tuple of train/val/test proportions of data to split into.

  • log (bool) – Whether to log the data or not.

split(data)#

Return array of train/val/test dataframes in format [train, val, test].

Example

import pandas as pd
import olorenchemengine as oce

df = pd.read_csv("Your Dataset")
dataset = (
    oce.BaseDataset(data = df.to_csv(), structure_col = "SMILES COLUMN", property_col = "PROPERTY COLUMN")
    + oce.RandomSplit(split_proportions = [0.8, 0.1, 0.1])
)
# OR
train, val, test = oce.RandomSplit(split_proportions = [0.8, 0.1, 0.1]).split(df)

split(data, *args, **kwargs)#

Split data into train/val/test sets.

Parameters:

data (pandas.DataFrame) – Dataset to split, must have a structure column.

Returns:

Tuple of training, validation, and testing dataframes.

Return type:

(tuple)

class olorenchemengine.splitters.ScaffoldSplit(scaffold_filter_threshold: int = 0, split_type='murcko', log=True, **kwargs)#

Bases: BaseSplitter

Split data into train/val/test sets by scaffold. Makes sure that the same Bemis-Murcko scaffold is not used in both train and test.

Parameters:
  • scaffold_filter_threshold (float) – Threshold for minimum number of compounds per scaffold class for a scaffold class to be included.

  • split_proportions (tuple[int]) – Tuple of train/val/test proportions of data to split into.

  • split_type (str) – type of split. "murcko": split data by Bemis-Murcko scaffold. "kmeans_murcko": split data by k-means clustering of Murcko scaffolds.

split(data, structure_col)#

Return array of train/val/test dataframes in format [train, val, test].

Example

import pandas as pd
import olorenchemengine as oce

df = pd.read_csv("Your Dataset")
dataset = (
    oce.BaseDataset(data = df.to_csv(), structure_col = "SMILES COLUMN", property_col = "PROPERTY COLUMN")
    + oce.ScaffoldSplit(split_proportions = [0.8, 0.1, 0.1], scaffold_filter_threshold = 5, split_type = "murcko")
)
# OR
train, val, test = oce.ScaffoldSplit(split_proportions = [0.8, 0.1, 0.1], scaffold_filter_threshold = 5, split_type = "murcko").split(df, structure_col = "SMILES COLUMN")

split(data: _MockObject.DataFrame, *args, structure_col: str = 'Smiles', **kwargs)#

Split data into train/val/test sets.

Parameters:

data (pandas.DataFrame) – Dataset to split, must have a structure column.

Returns:

Tuple of training, validation, and testing dataframes.

Return type:

(tuple)

class olorenchemengine.splitters.StratifiedSplitter(value_col, log=True, **kwargs)#

Bases: BaseSplitter

Split data into train/val/test sets stratified by a value column (generally the label).

Parameters:
  • value_col (str) – Name of the column to stratify by.

  • log (bool) – Whether to log the data or not.

split(data)#

Return array of train/val/test dataframes in format [train, val, test].

Example

import pandas as pd
import olorenchemengine as oce

df = pd.read_csv("Your Dataset")
dataset = (
    oce.BaseDataset(data = df.to_csv(), structure_col = "SMILES COLUMN", property_col = "PROPERTY COLUMN")
    + oce.StratifiedSplitter(split_proportions = [0.8, 0.1, 0.1], value_col = "PROPERTY COLUMN")
)
# OR
train, val, test = oce.StratifiedSplitter(split_proportions = [0.8, 0.1, 0.1], value_col = "PROPERTY COLUMN").split(df)

split(data, *args, **kwargs)#

Split data into train/val/test sets.

Parameters:

data (pandas.DataFrame) – Dataset to split, must have a structure column.

Returns:

Tuple of training, validation, and testing dataframes.

Return type:

(tuple)

class olorenchemengine.splitters.dc_ScaffoldSplit(log=True, **kwargs)#

Bases: BaseSplitter

Split data into train/val/test sets by scaffold using DeepChem implementation. https://deepchem.readthedocs.io/en/latest/api_reference/splitters.html#scaffoldsplitter

Parameters:

split_proportions (tuple[int]) – Tuple of train/val/test proportions of data to split into.

split(data, structure_col)#

Return array of train/val/test dataframes in format [train, val, test].

Example

import pandas as pd
import olorenchemengine as oce

df = pd.read_csv("Your Dataset")
dataset = (
    oce.BaseDataset(data = df.to_csv(), structure_col = "SMILES COLUMN", property_col = "PROPERTY COLUMN")
    + oce.dc_ScaffoldSplit(split_proportions = [0.8, 0.1, 0.1])
)
# OR
train, val, test = oce.dc_ScaffoldSplit(split_proportions = [0.8, 0.1, 0.1]).split(df, structure_col = "SMILES COLUMN")

split(data: _MockObject.DataFrame, *args, structure_col: str = 'Smiles', **kwargs)#

Split data into train/val/test sets.

Parameters:

data (pandas.DataFrame) – Dataset to split, must have a structure column.

Returns:

Tuple of training, validation, and testing dataframes.

Return type:

(tuple)

class olorenchemengine.splitters.gg_ScaffoldSplit(log=True, **kwargs)#

Bases: BaseSplitter

Split data into train/val/test sets by scaffold using implementation from https://www.nature.com/articles/s42256-021-00438-4, https://github.com/PaddlePaddle/PaddleHelix

Parameters:

split_proportions (tuple[int]) – Tuple of train/val/test proportions of data to split into.

split(data, structure_col)#

Return array of train/val/test dataframes in format [train, val, test].

Example

import pandas as pd
import olorenchemengine as oce

df = pd.read_csv("Your Dataset")
dataset = (
    oce.BaseDataset(data = df.to_csv(), structure_col = "SMILES COLUMN", property_col = "PROPERTY COLUMN")
    + oce.gg_ScaffoldSplit(split_proportions = [0.8, 0.1, 0.1])
)
# OR
train, val, test = oce.gg_ScaffoldSplit(split_proportions = [0.8, 0.1, 0.1]).split(df, structure_col = "SMILES COLUMN")

generate_scaffold(smiles, include_chirality=False)#

Obtain Bemis-Murcko scaffold from smiles.

Parameters:
  • smiles – smiles sequence

  • include_chirality – Default=False

Returns:

the scaffold of the given smiles.

gg_split(dataset, frac_train=None, frac_valid=None, frac_test=None, structure_col='smiles')#
Parameters:
  • dataset (InMemoryDataset) – the dataset to split. Make sure each element in the dataset has key “smiles” which will be used to calculate the scaffold.

  • frac_train (float) – the fraction of data to be used for the train split.

  • frac_valid (float) – the fraction of data to be used for the valid split.

  • frac_test (float) – the fraction of data to be used for the test split.

split(data: _MockObject.DataFrame, *args, structure_col: str = 'smiles', **kwargs)#

Split data into train/val/test sets.

Parameters:

data (pandas.DataFrame) – Dataset to split, must have a structure column.

Returns:

Tuple of training, validation, and testing dataframes.

Return type:

(tuple)

olorenchemengine.uncertainty module#

Techniques for quantifying uncertainty and estimating confidence intervals for all oce models.

class olorenchemengine.uncertainty.ADAN(criterion: str = 'Category', rep: BaseCompoundVecRepresentation = None, dim_reduction: str = 'pls', explvar: float = 0.8, threshold: float = 0.95, log=True, **kwargs)#

Bases: BaseErrorModel

Applicability Domain Analysis

ADAN is an error model that predicts error bars based on one or multiple ADAN categories, following "Applicability Domain Analysis (ADAN): A Robust Method for Assessing the Reliability of Drug Property Predictions".

Parameters:
  • criterion (str) – the ADAN category or categories used to estimate error

  • rep (BaseCompoundVecRepresentation) – the representation to use. If None, uses the representation of the BaseModel object.

  • dim_reduction ({"pls", "pca"}) – the dimensionality reduction to use.

  • explvar (float) – the variance to be explained by the dimensionality reduction components, as a proportion of total variance.

  • threshold (float) – the percentile threshold for a value to be considered within its standard range.

Example

import olorenchemengine as oce

model = oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000) model.fit_cv(train[“Drug”], train[“Y”], error_model = oce.ADAN(“E_raw”)) model.predict(test[“Drug”], return_ci = True)

DModX(X: _MockObject.ndarray, Xp: _MockObject.ndarray) _MockObject.ndarray#

Computes the distance to the model (DmodX).

Computes the distance between a datapoint and the PLS model plane. See <https://www.jmp.com/support/help/en/15.2/index.shtml#page/jmp/dmodx-calculation.shtml> for more details about the statistic.

Parameters:
  • X (np.ndarray) – queries

  • Xp (np.ndarray) – queries transformed into latent space

SDEP(Xp: _MockObject.ndarray, n_drop: int = 0, neighbor_thresh: float = 0.05) _MockObject.ndarray#

Computes the standard deviation error of predictions (SDEP).

Computes the standard deviation training error of the neighbor_thresh fraction of closest training queries to each query in Xp in latent space.

Parameters:
  • Xp (np.ndarray) – queries transformed into latent space

  • n_drop (int) – 1 if fitting, 0 if scoring

  • neighbor_thresh (float) – fraction of closest training queries to consider

calculate(X, y_pred, standardize: bool = True)#

Calculates confidence scores.

calculate_full(X, standardize: bool = True)#

Calculates complete confidence scores for visualization.

preprocess(X, y=None)#

Preprocesses data into the appropriate representation.

class olorenchemengine.uncertainty.AggregateErrorModel(*error_models: ~olorenchemengine.base_class.BaseErrorModel, reduction: ~olorenchemengine.base_class.BaseReduction = <olorenchemengine.reduction.FactorAnalysis object>, log=True, **kwargs)#

Bases: BaseErrorModel

AggregateErrorModel estimates uncertainty by aggregating uncertainty scores from several different BaseErrorModels.

Parameters:
  • error_models (BaseErrorModel) – the error models to be aggregated

  • reduction (BaseReduction) – the reduction used to aggregate the scores. Must output 1 component. Default FactorAnalysis().

Example

import olorenchemengine as oce

model = oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000)
model.fit(train["Drug"], train["Y"])
error_model = oce.AggregateErrorModel(error_models = [oce.TargetDistDC(), oce.TrainDistDC()])
error_model.build(model, train["Drug"], train["Y"])
error_model.fit(valid["Drug"], valid["Y"])
error_model.score(test["Drug"])

calculate(X: Union[_MockObject.DataFrame, _MockObject.ndarray, list, _MockObject.Series], y_pred: _MockObject.ndarray) _MockObject.ndarray#

Computes aggregate error model score from inputs.

Parameters:
  • X – features, smiles

  • y_pred – predicted values

fit(X: Union[_MockObject.DataFrame, _MockObject.ndarray, list, _MockObject.Series], y: Union[_MockObject.ndarray, list, _MockObject.Series], **kwargs)#

Fits confidence scores to an external dataset

Parameters:
  • X (array-like) – features, smiles

  • y (array-like) – true values

Returns:

plotly figure of fitted model against validation dataset

fit_cv(n_splits: int = 10, **kwargs)#

Fits confidence scores to the training dataset via cross validation.

Parameters:

n_splits (int) – Number of cross validation splits, default 10

Returns:

plotly figure of fitted model against validation dataset

class olorenchemengine.uncertainty.BaseEnsembleModel(ensemble_model=None, n_ensembles=16, log=True, **kwargs)#

Bases: BaseErrorModel

BaseEnsembleModel is the base class for error models that estimate uncertainty based on the variance of an ensemble of models.

calculate(X, y_pred)#

To be implemented by the child class; calculates confidence scores from inputs.

Parameters:
  • X – features, list of SMILES

  • y_pred (1-dimensional np.ndarray) – predicted values

Returns:

scores (1-dimensional np.ndarray)

class olorenchemengine.uncertainty.BaseFingerprintModel(radius=2, log=True, **kwargs)#

Bases: BaseErrorModel

Morgan fingerprint-based error models.

BaseFingerprintModel is the base class for error models that require the computation of Morgan fingerprints.

get_fps(smiles: List[str]) List#
class olorenchemengine.uncertainty.BaseKernelError(kernel='power', h=3, log=True, **kwargs)#

Bases: BaseFingerprintModel

Base class for kernel methods of uncertainty quantification.

class olorenchemengine.uncertainty.BootstrapEnsemble(ensemble_model=None, n_ensembles=12, bootstrap_size=0.25, log=True, **kwargs)#

Bases: BaseEnsembleModel

Variance of an ensemble of bootstrapped models

BootstrapEnsemble estimates uncertainty based on the variance of several models trained on bootstrapped samples of the training data.

Parameters:
  • ensemble_model (BaseModel) – model used for each member of the ensemble

  • n_ensembles (int) – number of models in the ensemble

  • bootstrap_size (float) – size of each bootstrapped sample, as a fraction of the training data

Example

import olorenchemengine as oce

model = oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000)
model.fit_cv(train["Drug"], train["Y"], error_model = oce.BootstrapEnsemble(n_ensembles = 10))
model.predict(test["Drug"], return_ci = True)

class olorenchemengine.uncertainty.KNNSimilarity(*args, **kwargs)#

Bases: BaseDepreceated

class olorenchemengine.uncertainty.KernelDistanceError(kernel='power', h=3, weighted=True, log=True, **kwargs)#

Bases: BaseKernelError

Kernel distance error model.

KernelDistanceError uses an average of kernel distances to each molecule in the training set as the covariate for estimating confidence intervals. The distance function used is 1 - Tanimoto Similarity.

Parameters:
  • kernel (str) – kernel used as a weight-function. Default "power".

  • h (int or float) – bandwidth for most kernels, number of nearest neighbors for the nearest_neighbor kernel

  • weighted (bool) – if True, returns a kernel-weighted average of Tanimoto similarity. If False, returns an average kernel distance.

Example

# 5-nearest neighbor mean
error_model = oce.KernelDistanceError(kernel = "nearest_neighbor", h = 5, weighted = True)

# Sum of distance-weighted contributions (SDC)
error_model = oce.KernelDistanceError(kernel = "sdc", h = 3, weighted = False)

calculate(X, y_pred)#

To be implemented by the child class; calculates confidence scores from inputs.

Parameters:
  • X – features, list of SMILES

  • y_pred (1-dimensional np.ndarray) – predicted values

Returns:

scores (1-dimensional np.ndarray)

class olorenchemengine.uncertainty.KernelRegressionError(kernel='power', h=3, predictor='property', log=True, **kwargs)#

Bases: BaseKernelError

Kernel regression error model.

KernelRegressionError uses a kernel-weighted average of prediction errors as the covariate for estimating confidence intervals. It is inspired by the Nadaraya-Watson estimator, which generates a regression using a kernel-weighted average. The distance function used is 1 - Tanimoto Similarity.

This is the recommended error model for general purposes and models.

Parameters:
  • kernel (str) – kernel used as a weight-function. Default "power".

  • h (int or float) – bandwidth for most kernels, number of nearest neighbors for the nearest_neighbor kernel

  • predictor (str, {"property", "residual"}) – error predictor being estimated

Example

import olorenchemengine as oce

model = oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000)
model.fit_cv(train["Drug"], train["Y"], error_model = oce.KernelRegressionError())
model.predict(test["Drug"], return_ci = True)

calculate(X, y_pred)#

To be implemented by the child class; calculates confidence scores from inputs.

Parameters:
  • X – features, list of SMILES

  • y_pred (1-dimensional np.ndarray) – predicted values

Returns:

scores (1-dimensional np.ndarray)

class olorenchemengine.uncertainty.Naive(log=True, **kwargs)#

Bases: BaseErrorModel

Static confidence intervals

Naive is an error model that predicts a uniform confidence interval based on the errors of the fitting dataset. Used exclusively for benchmarking error models.

calculate(X, y_pred)#

To be implemented by the child class; calculates confidence scores from inputs.

Parameters:
  • X – features, list of SMILES

  • y_pred (1-dimensional np.ndarray) – predicted values

Returns:

scores (1-dimensional np.ndarray)

score(X: Union[_MockObject.DataFrame, _MockObject.ndarray, list, _MockObject.Series])#

Calculates confidence scores on a dataset.

Parameters:

X (array-like) – target dataset, list of SMILES

Returns:

a list of confidence intervals as tuples for each input

class olorenchemengine.uncertainty.Predicted(log=True, **kwargs)#

Bases: BaseErrorModel

Predicted value

Predicted is an error model that predicts error bars based on only the predicted value of a molecule.

Example

import olorenchemengine as oce

model = oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000)
model.fit_cv(train["Drug"], train["Y"], error_model = oce.AggregateErrorModel([oce.SDC(), oce.Predicted()]))
model.predict(test["Drug"], return_ci = True)

calculate(X, y_pred)#

To be implemented by the child class; calculates confidence scores from inputs.

Parameters:
  • X – features, list of SMILES

  • y_pred (1-dimensional np.ndarray) – predicted values

Returns:

scores (1-dimensional np.ndarray)

class olorenchemengine.uncertainty.RandomForestEnsemble(log=True, **kwargs)#

Bases: BaseEnsembleModel

Ensemble of random forests

RandomForestEnsemble estimates uncertainty based on the variance of several random forest models initialized to different random states.

Parameters:
  • ensemble_model (BaseModel) – model used for each member of the ensemble

  • n_ensembles (int) – number of models in the ensemble

Example

import olorenchemengine as oce

model = oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000)
model.fit_cv(train["Drug"], train["Y"], error_model = oce.RandomForestEnsemble(n_ensembles = 10))
model.predict(test["Drug"], return_ci = True)

class olorenchemengine.uncertainty.SDC(*args, **kwargs)#

Bases: BaseDepreceated

class olorenchemengine.uncertainty.TargetDistDC(*args, **kwargs)#

Bases: BaseDepreceated

class olorenchemengine.uncertainty.TrainDistDC(*args, **kwargs)#

Bases: BaseDepreceated

Module contents#

olorenchemengine.BACEDataset()#
olorenchemengine.ExampleDataFrame()#
olorenchemengine.ExampleDataset()#
olorenchemengine.MISSING_DEPENDENCIES()#
olorenchemengine.create_config_default_param(param: str, value: Union[str, int, float, bool])#

Create a default configuration parameter.

Parameters:
  • param – the parameter to create.

  • value – the value to set the parameter to.

olorenchemengine.online(session_url='https://aws.chemengine.org')#
olorenchemengine.remove_config_param(param: str)#

Remove a configuration parameter.

Parameters:

param – the parameter to remove.

olorenchemengine.set_config_param(param: str, value: Union[str, int, float, bool])#

Set a configuration parameter.

Parameters:
  • param – the parameter to set.

  • value – the value to set the parameter to.
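
For example (the parameter name below is purely illustrative):

import olorenchemengine as oce

oce.set_config_param("MY_PARAM", True)  # hypothetical parameter name
oce.remove_config_param("MY_PARAM")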

olorenchemengine.test_oce()#

Convenience function to test all functions of the oce package.

olorenchemengine.update_config()#

Update the configuration file.

This function is called when a new parameter is added to the configuration file.