olorenchemengine package#
Subpackages#
- olorenchemengine.beta package
- olorenchemengine.external package
- Subpackages
- olorenchemengine.external.ChemProp package
- olorenchemengine.external.GINNetwork package
- olorenchemengine.external.GaussianProcess package
- olorenchemengine.external.HondaSTRep package
- olorenchemengine.external.MolCLR package
- olorenchemengine.external.SPGNN package
- olorenchemengine.external.mol2vec package
- Submodules
- olorenchemengine.external.piCalculax module
- olorenchemengine.external.stoned module
- Module contents
- olorenchemengine.pyg namespace
- olorenchemengine.visualizations package
- Submodules
- olorenchemengine.visualizations.attribute_section module
- olorenchemengine.visualizations.compounds module
- olorenchemengine.visualizations.exploratory_analysis module
- olorenchemengine.visualizations.matched_pairs module
- olorenchemengine.visualizations.model_comparisons module
- olorenchemengine.visualizations.visualization module
BaseErrorWaterfall
BaseVisualization
BaseVisualization.JS_NAME
BaseVisualization.from_attributes()
BaseVisualization.get_attributes()
BaseVisualization.get_data()
BaseVisualization.get_html()
BaseVisualization.get_js()
BaseVisualization.get_link()
BaseVisualization.package_urls
BaseVisualization.render()
BaseVisualization.render_data_url()
BaseVisualization.render_ipynb()
BaseVisualization.render_oas()
BaseVisualization.save_html()
BaseVisualization.upload_oas()
ChemicalSpacePlot
CompoundScatterPlot
ModelPR
ModelROC
ModelROCThreshold
MorganContributions
ScatterPlot
VisualizeADAN
VisualizeCompounds
VisualizeCounterfactual
VisualizeDatasetCV
VisualizeDatasetCompounds
VisualizeDatasetDivision
VisualizeDatasetSplit
VisualizeError
VisualizeModelSim
VisualizeModelSim2
VisualizeMoleculePerturbations
get_oas_compatible_visualizations()
- olorenchemengine.visualizations.visualize_interpret module
- Module contents
Submodules#
olorenchemengine.base_class module#
base_class consists of the building blocks of models: the base classes models should extend from and relevant utility functions.
- class olorenchemengine.base_class.BaseErrorModel(ci: float = 0.95, method: str = 'qbin', curvetype: str = 'auto', window: int = None, bins: int = None, log=True, **kwargs)#
Bases:
BaseClass
Base class for error models.
Estimates confidence intervals for trained oce models.
- Parameters:
ci (float) – desired confidence interval
method ({'bin','qbin','roll'}) – whether to fit the error model via binning, quantile binning, or rolling quantile
bins (int) – number of bins for binned quantiles. If None, selects the number of points per bin as n^(2/3) / 2.
window (int) – number of points per window for rolling quantiles. If None, selects the number of points per window as n^(2/3) / 2.
curvetype (str) – function used for regression. If auto, the function is chosen automatically to minimize the mse.
- build()#
builds the error model from a trained BaseModel and dataset
- _build()#
optionally implemented, error model-specific computations
- fit()#
fits confidence scores to a trained model and external dataset
- fit_cv()#
fits confidence scores to k-fold cross validation on the training dataset
- _fit()#
fits confidence scores to residuals
- calculate()#
calculates confidence scores from inputs
- score()#
returns confidence intervals on a dataset
- build(model: BaseModel, X: Union[pd.DataFrame, np.ndarray, list, pd.Series], y: Union[np.ndarray, list, pd.Series], **kwargs)#
Builds the error model with a trained model and training dataset
- Parameters:
model (BaseModel) – trained model
X (array-like) – training features, list of SMILES
y (array-like) – training values
- abstract calculate(X: Union[pd.DataFrame, np.ndarray, list, pd.Series], y_pred: np.ndarray) → np.ndarray#
To be implemented by the child class; calculates confidence scores from inputs.
- Parameters:
X – features, list of SMILES
y_pred (1-dimensional np.ndarray) – predicted values
- Returns:
scores (1-dimensional np.ndarray)
- copy() → BaseErrorModel#
returns a copy of itself
- Returns:
copied instance of itself
- Return type:
BaseErrorModel
- fit(X: Union[pd.DataFrame, np.ndarray, list, pd.Series], y: Union[np.ndarray, list, pd.Series]) → plotly.graph_objects.Figure#
Fits confidence scores to an external dataset
- Parameters:
X (array-like) – features, smiles
y (array-like) – true values
- Returns:
plotly figure of fitted model against validation dataset
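A minimal end-to-end sketch of the build/fit/score workflow (oce.SDC is used here as one concrete BaseErrorModel, as in the create_error_model example below; model, train, valid, and test are assumed to exist):
import olorenchemengine as oce

# assumes `model` is a trained oce BaseModel
error_model = oce.SDC(ci = 0.95, method = "qbin")    # one concrete BaseErrorModel
error_model.build(model, train["Drug"], train["Y"])  # build from the training data
error_model.fit(valid["Drug"], valid["Y"])           # fit confidence scores on held-out data
intervals = error_model.score(test["Drug"])          # confidence intervals for new compounds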
- class olorenchemengine.base_class.BaseModel(normalization='zscore', setting='auto', name=None, **kwargs)#
Bases:
BaseClass
BaseModel for training and evaluating different models
- Parameters:
normalization (BasePreprocessor or str) – the normalization to be used for the data
setting (str) – whether the model is a “classification” model or a “regression” model. Default is “auto” which automatically detects the setting from the dataset.
name (str) – the name of the model. Default is None, which instructs BaseModel to use model_name_from_model to select the name of the model.
- preprocess()#
preprocess the inputted data into the appropriate format
- _fit()#
fit the model to the preprocessed data, to be used internally implemented by child classes
- fit()#
fit the model to the inputted data, user can specify if they want regression or classification using the setting parameter.
- _predict()#
predict the properties of the inputted data, to be used internally implemented by child classes
- predict()#
predict the properties of the inputted data
- test()#
test the model on the inputted data, output metrics and optionally predicted values
- copy()#
returns a copy of the model (internal state not copied)
- calibrate(X_valid, y_valid)#
- create_error_model(error_model: BaseErrorModel, X_train: Union[pd.DataFrame, np.ndarray, list, pd.Series], y_train: Union[np.ndarray, list, pd.Series], X_valid: Optional[Union[pd.DataFrame, np.ndarray, list, pd.Series]] = None, y_valid: Optional[Union[np.ndarray, list, pd.Series]] = None, **kwargs)#
Initializes, builds, and fits an error model on the input data.
The error model is built with the training dataset and fit via either a validation dataset or cross validation. The error model is stored in model.error_model.
- Parameters:
error_model (BaseErrorModel) – Error model type to be created
X_train (array-like) – Input data for model training
y_train (array-like) – Values for model training
X_valid (array-like) – Input data for error model fitting. If no value passed in, the error model is fit via cross validation on the training dataset.
y_valid (array-like) – Values for error model fitting. If no value passed in, the error model is fit via cross validation on the training dataset.
Example
import olorenchemengine as oce

model = oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000)
model.fit(train["Drug"], train["Y"])
oce.create_error_model(model, oce.SDC(), train["Drug"], train["Y"], valid["Drug"], valid["Y"], ci = 0.95, method = "roll")
model.error_model.score(test["Drug"])
- fit(X_train: Union[pd.DataFrame, np.ndarray], y_train: Union[pd.Series, list, np.ndarray], valid: Optional[Tuple[Union[pd.DataFrame, np.ndarray], Union[pd.Series, list, np.ndarray]]] = None, error_model: Optional[BaseErrorModel] = None)#
Calls the _fit method of the model to fit the model on the provided dataset.
- Parameters:
X_train (Union[pd.DataFrame, np.ndarray]) – Input data to be fit on (structures + optionally features) e.g. a pd.DataFrame containing a “smiles” column or a list of experimental data
y_train (Union[pd.Series, list, np.ndarray]) – Values to predict from the input data
valid (Tuple[Union[pd.DataFrame, np.ndarray], Union[pd.Series, list, np.ndarray]]) – Optional validation data, which can be used with methods like early stopping and model averaging.
error_model (BaseErrorModel) – Optional error model, which can be used to predict confidence intervals.
- fit_class(X_train: Union[pd.DataFrame, np.ndarray], y_train: Union[pd.Series, list, np.ndarray], valid: Optional[Tuple[Union[pd.DataFrame, np.ndarray], Union[pd.Series, list, np.ndarray]]] = None)#
- fit_cv(X: Union[pd.DataFrame, np.ndarray], y: Union[pd.Series, list, np.ndarray], kf: BaseKFold = RandomKFold(), error_model: Optional[BaseErrorModel] = None, scoring: Optional[str] = None, **kwargs)#
Trains a production-ready model.
This method trains the model on the entire dataset. It also performs an intermediate cross-validation step over dataset to both generate test metrics across the entire dataset, as well as to generate information which is used to calibrate the trained model.
Calibration means to ensure that the probabilities outputted by classifiers reflect true distributions and to create appropriate confidence intervals for regression data.
- Parameters:
X (Union[pd.DataFrame, np.ndarray]) – Input data to be fit on (structures + optionally features) e.g. a pd.DataFrame containing a “smiles” column or a list of experimental data
y (Union[pd.Series, list, np.ndarray]) – Values to predict from the input data
kf (BaseKFold) – Cross-validation splitting strategy; defaults to RandomKFold
error_model (BaseErrorModel) – Optional error model, which can be used to predict confidence intervals.
ci (float) – the confidence interval predicted by the error model
scoring (str) – Metric function to use for scoring cross validation splits; must be in metric_functions
- Returns:
Cross validation metrics for each split
- Return type:
- predict(X: Union[pd.DataFrame, np.ndarray, list, pd.Series], return_ci=False, return_vis=False, skip_preprocess=False, **kwargs) → np.ndarray#
Calls the _predict method of the model and returns the predicted values for provided dataset.
- Parameters:
X (Union[pd.DataFrame, np.ndarray, list, pd.Series]) – Input data to be predicted (structures + optionally features), will be preprocessed by the preprocess method.
return_ci (bool) – If error model fitted, whether or not to return the confidence intervals.
return_vis (bool) – If error model fitted, whether or not to return BaseVisualization objects.
- Returns:
- np.ndarray: Predicted values for the provided dataset.
Shape: (number of samples, number of predicted values)
- If return_ci or return_vis are true:
pd.DataFrame: Predicted values, confidence intervals, and/or error plots for the provided dataset.
- Return type:
np.ndarray if return_ci and return_vis are False, otherwise pd.DataFrame
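A short sketch of the two return modes (assuming an error model has already been created for the model, e.g. via create_error_model):
preds = model.predict(test["Drug"])                       # np.ndarray of predicted values
preds_ci = model.predict(test["Drug"], return_ci = True)  # pd.DataFrame with predictions and confidence intervals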
- preprocess(X, y, fit=False)#
- Parameters:
X (list of smiles) –
- Returns:
The processed list, converted into the appropriate input format for the model
- test(X: Union[pd.DataFrame, np.ndarray], y: Union[pd.Series, list, np.ndarray], values: bool = False, fit_error_model: bool = False) → dict#
Tests the model on the provided dataset returning a dictionary of metrics and optionally the predicted values.
- Parameters:
X (Union[pd.DataFrame, np.ndarray]) – Input test data to be predicted (structures + optionally features)
y (Union[pd.Series, list, np.ndarray]) – True values for the properties
values (bool, optional) – Whether or not to return the predicted values for the test data. Defaults to False.
fit_error_model (bool) – If an error model is present, whether or not to fit it on the test data.
- Returns:
Dictionary of metrics and optionally the predicted values
- Return type:
dict
- upload_oas(fname: Optional[str] = None)#
Uploads the BaseClass object to the cloud for OAS access. The model must be trained.
- Parameters:
fname (str, optional) – the file name for the uploaded model file. If left empty/None, the file is named with the default name associated with the BaseClass object.
- visualize_parameters_ipynb()#
- class olorenchemengine.base_class.BaseReduction#
Bases:
BaseClass
BaseReduction for applying dimensionality reduction on high-dimensional data
- Parameters:
n_components (int) – the number of components to keep
- fit()#
fit the model with input data
- fit_transform()#
fit the model with and apply dimensionality reduction to input data
- transform()#
apply dimensionality reduction to input data
- abstract fit(X)#
- abstract fit_transform(X)#
- abstract transform(X)#
- class olorenchemengine.base_class.BaseSKLearnModel(representation, regression_model, classification_model, log=True, **kwargs)#
Bases:
BaseModel
Base class for creating sklearn-type models, e.g. with a sklearn RandomForestRegressor and RandomForestClassifier.
- representation#
Representation to be used to preprocess the input data.
- Type:
- regression_model#
Model to be used for regression tasks.
- Type:
- classification_model#
Model to be used for classification tasks.
- Type:
- class olorenchemengine.base_class.BaseSKLearnReduction#
Bases:
BaseReduction
Base class for creating sklearn dimensionality reduction
- fit(X)#
- fit_transform(X)#
- transform(X)#
- class olorenchemengine.base_class.MakeMultiClassModel(individual_classifier: BaseModel)#
Bases:
BaseModel
Base class for extending the classification capabilities of BaseModel to more than two classes, e.g. classes {W, X, Y, Z}. Uses the commonly-implemented One-vs-Rest (OvR) strategy: for each class, a classifier is fitted against all the other classes. The probabilities are then normalized and compared for each class.
- Parameters:
individual_classifier (BaseModel) – Model for binary classification tasks, which is to be turned into a multi-class model.
- fit(X_train: Union[pd.DataFrame, np.ndarray], y_train: Union[pd.Series, list, np.ndarray], valid: Optional[Tuple[Union[pd.DataFrame, np.ndarray], Union[pd.Series, list, np.ndarray]]] = None)#
Calls the _fit method of the model to fit the model on the provided dataset.
- Parameters:
X_train (Union[pd.DataFrame, np.ndarray]) – Input data to be fit on (structures + optionally features) e.g. a pd.DataFrame containing a “smiles” column or a list of experimental data
y_train (Union[pd.Series, list, np.ndarray]) – Values to predict from the input data
valid (Tuple[Union[pd.DataFrame, np.ndarray], Union[pd.Series, list, np.ndarray]]) – Optional validation data, which can be used with methods like early stopping and model averaging.
error_model (BaseErrorModel) – Optional error model, which can be used to predict confidence intervals.
- predict(X: Union[pd.DataFrame, np.ndarray, list, pd.Series])#
- Parameters:
X (Union[pd.DataFrame, np.ndarray, list, pd.Series]) – Input data to be predicted (structures + optionally features).
- Returns:
Predicted values for the provided dataset, with multiple columns for the 2+ different classes and each row representing a different prediction.
- Return type:
pd.DataFrame
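A hedged usage sketch (assuming train['Y'] holds labels from three or more classes; any binary-classification BaseModel can serve as the individual_classifier):
import olorenchemengine as oce

model = oce.MakeMultiClassModel(
    oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000)
)
model.fit(train['Drug'], train['Y'])
probs = model.predict(test['Drug'])  # pd.DataFrame with one probability column per class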
olorenchemengine.basics module#
Machine learning algorithms for use with molecular vector representations and features from experimental data.
- class olorenchemengine.basics.AutoRandomForestModel(representation, n_iter=100, scoring=None, verbose=2, cv=5, **kwargs)#
Bases:
BaseSKLearnModel, BaseObject
RandomForestModel with automatically tuned hyperparameters
- Parameters:
representation (str): The representation to use for the model.
n_iter (int): The number of iterations to run the hyperparameter tuning.
scoring (str): The scoring metric to use for the hyperparameter tuning.
verbose (int): The verbosity level of the hyperparameter tuning.
cv (int): The number of folds to use for the hyperparameter tuning.
Example
import olorenchemengine as oce
model = oce.AutoRandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048))
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])
- autofit(model, n_iter, cv, scoring, verbose)#
Takes a model and replaces its fit function with one that automatically tunes the model hyperparameters
- Parameters:
model (sklearn model) – the model whose fit function is replaced
n_iter (int) – number of iterations for the hyperparameter search
cv (int) – number of cross-validation folds
scoring (str) – scoring metric for the search
verbose (int) – verbosity level
- Returns:
The tuned model
- Return type:
model (sklearn model)
- class olorenchemengine.basics.BaseMLPClassifier(*args, **kwargs)#
Bases:
BaseEstimator
Wrapper for sklearn MLP
- class olorenchemengine.basics.BaseMLPRegressor(*args, **kwargs)#
Bases:
BaseEstimator
Wrapper for sklearn MLP
- class olorenchemengine.basics.FeaturesClassification(config='lineardiscriminant')#
Bases:
BaseModel
FeaturesClassification uses machine learning models to classify features based on their experimental data
- obj#
Machine learning model to use.
- Parameters:
config (str) – Configuration to use for the model.
- class olorenchemengine.basics.GuessingRegression(config='full', reg='lr', **kwargs)#
Bases:
BaseModel
Guessing model for regression, used to infer non-linear relationships.
This model tries different non-linear relationships between each feature and the property, selecting the best such relationship for each feature. The features are then transformed and aggregated, using either linear regression or averaging, to obtain the final prediction for the property. This is best used with experimental features that have direct relationships to the properties.
- transformations#
List of transformations to apply to the data i.e. possible relationships between feature and property.
- Type:
List[Callable]
- state#
State of the model, best transformation for each feature.
- reg#
Method to use for combining features, either “lr” linear regression or “avg” average.
- Type:
- linearize(X)#
Linearize the data: apply the best transformation to each feature.
- Parameters:
X (np.ndarray) – List of lists of features.
- Returns:
List of lists of features. Shape: (n_samples, n_features)
- Return type:
np.ndarray
- preprocess(X, y, fit=False)#
This method is used to preprocess the data before training.
- Parameters:
X (np.ndarray) – List of lists of features.
y (np.array) – List of properties.
- Returns:
List of lists of features. Shape: (n_samples, n_features)
- Return type:
np.ndarray
- class olorenchemengine.basics.KBestLinearRegression(k=1, *args, **kwargs)#
Bases:
BaseEstimator
Selects the K-best features and then does linear regression
- class olorenchemengine.basics.KNN(representation, **kwargs)#
Bases:
BaseSKLearnModel
KNN model
- Parameters:
representation (str): The representation to use for the model.
Example
import olorenchemengine as oce
model = oce.KNN(representation = oce.MorganVecRepresentation(radius=2, nbits=2048))
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])
- class olorenchemengine.basics.KNeighborsClassifier(*args, **kwargs)#
Bases:
BaseEstimator
Wrapper for sklearn KNeighborsClassifier
- predict(X)#
Predict the output of the estimator
- Parameters:
X (np.array) – The data to predict the output of the estimator on
- Returns:
The predicted output of the estimator
- Return type:
y (np.array)
- class olorenchemengine.basics.KNeighborsRegressor(*args, **kwargs)#
Bases:
BaseEstimator
Wrapper for sklearn KNeighborsRegressor
- class olorenchemengine.basics.LogisticRegression(*args, **kwargs)#
Bases:
BaseEstimator
Wrapper for sklearn LogisticRegression
- class olorenchemengine.basics.MLP(representation: BaseVecRepresentation, layer_dims=[2048, 512, 128], activation='tanh', epochs=100, batch_size=16, lr=0.0005, dropout=0, kernel_regularizer=0.0001, **kwargs)#
Bases:
BaseSKLearnModel
MLP model
- Parameters:
representation (BaseVecRepresentation): The representation to use for the model.
layer_dims (List[int]): The hidden layer sizes to use for the model.
activation (str): The activation function to use for the model.
epochs (int): The number of epochs to use for the model.
batch_size (int): The batch size to use for the model.
lr (float): The learning rate to use for the model.
dropout (float): The dropout rate to use for the model.
kernel_regularizer (float): The kernel regularizer to use for the model.
Example
import olorenchemengine as oce
model = oce.MLP(representation = oce.MorganVecRepresentation(radius=2, nbits=2048))
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])
- class olorenchemengine.basics.RandomForestClassifier(*args, **kwargs)#
Bases:
BaseEstimator
Wrapper for sklearn RandomForestClassifier
- predict(X)#
Predict the output of the estimator
- Parameters:
X (np.array) – The data to predict the output of the estimator on
- Returns:
The predicted output of the estimator
- Return type:
y (np.array)
- class olorenchemengine.basics.RandomForestModel(representation, max_features='log2', max_depth=None, criterion='entropy', class_weight=None, bootstrap=True, n_estimators=100, random_state=None, **kwargs)#
Bases:
BaseSKLearnModel
Random forest model
- Parameters:
n_estimators (int): The number of trees in the forest.
max_depth (int): The maximum depth of the tree.
max_features (int): The number of features to consider when looking for the best split.
bootstrap (bool): Whether bootstrap samples are used when building trees.
criterion (str): The function to measure the quality of a split.
class_weight (str): Dict or ‘balanced’, defaults to None.
Example
import olorenchemengine as oce
model = oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000)
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])
- class olorenchemengine.basics.RandomForestRegressor(*args, **kwargs)#
Bases:
BaseEstimator
Wrapper for sklearn RandomForestRegressor
- class olorenchemengine.basics.RandomizedSearchCVModel(*args, **kwargs)#
Bases:
BaseEstimator
Wrapper class for RandomizedSearchCV
- fit(*args, **kwargs)#
Fit the estimator to the data
- Parameters:
X (np.array) – The data to fit the estimator to
y (np.array) – The target data to fit the estimator to
- Returns:
The estimator object fit to the data
- Return type:
self (object)
- predict(*args, **kwargs)#
Predict the output of the estimator
- Parameters:
X (np.array) – The data to predict the output of the estimator on
- Returns:
The predicted output of the estimator
- Return type:
y (np.array)
- class olorenchemengine.basics.SVC(*args, **kwargs)#
Bases:
BaseEstimator
Wrapper for sklearn SVC
- class olorenchemengine.basics.SVR(*args, **kwargs)#
Bases:
BaseEstimator
Wrapper for sklearn SVR
- class olorenchemengine.basics.SklearnMLP(representation, hidden_layer_sizes=[100], activation='relu', solver='adam', alpha=0.0001, batch_size=32, learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, log=True, **kwargs)#
Bases:
BaseSKLearnModel
MLP model based on the sklearn implementation
- Parameters:
representation (BaseVecRepresentation): The representation to use for the model.
hidden_layer_sizes (list): The number of neurons in each hidden layer.
activation (str): The activation function to use.
solver (str): The solver to use.
alpha (float): L2 regularization penalty parameter.
batch_size (int): The size of the minibatches for stochastic optimizers.
learning_rate (str): The learning rate schedule.
learning_rate_init (float): The initial learning rate for the solver.
power_t (float): The exponent for inverse scaling learning rate.
max_iter (int): Maximum number of iterations.
Example
import olorenchemengine as oce
model = oce.SklearnMLP(representation = oce.MorganVecRepresentation(radius=2, nbits=2048))
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])
- class olorenchemengine.basics.SupportVectorMachine(representation, C=0.8, kernel='rbf', gamma='scale', coef0=0, cache_size=500, **kwargs)#
Bases:
BaseSKLearnModel
Support vector machine
- Parameters:
representation (str): The representation to use for the model.
kernel (str): The kernel to use for the model.
C (float): The C parameter for the model.
gamma (float): The gamma parameter for the model.
coef0 (float): The coef0 parameter for the model.
cache_size (int): The cache size parameter for the model.
Example
import olorenchemengine as oce
model = oce.SupportVectorMachine(representation = oce.MorganVecRepresentation(radius=2, nbits=2048))
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])
- class olorenchemengine.basics.TorchMLP(representation, hidden_layer_sizes=[100], norm_layer: str = None, activation_layer: str = None, dropout: float = 0.0, epochs: int = 100, log=True, **kwargs)#
Bases:
BaseModel
MLP model based on a torch implementation
- Parameters:
representation (BaseVecRepresentation): The representation to use for the model.
hidden_layer_sizes (list): The number of neurons in each hidden layer.
norm_layer (str): The normalization to use for a final normalization layer. Default None.
activation_layer (str): The activation function to use for a final activation layer. Default None.
dropout (float): The dropout rate to use for the model.
Example
import olorenchemengine as oce
model = oce.TorchMLP(representation = oce.MorganVecRepresentation(radius=2, nbits=2048))
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])
- class olorenchemengine.basics.XGBClassifier(*args, **kwargs)#
Bases:
BaseEstimator
Wrapper for xgboost XGBClassifier
- class olorenchemengine.basics.XGBRegressor(*args, **kwargs)#
Bases:
BaseEstimator
Wrapper for xgboost XGBRegressor
- class olorenchemengine.basics.XGBoostModel(representation, n_estimators=2000, max_depth=6, subsample=0.5, max_leaves=5, learning_rate=0.05, colsample_bytree=0.8, min_child_weight=1, log=True, **kwargs)#
Bases:
BaseSKLearnModel, BaseObject
XGBoost model
- Parameters:
representation (str): The representation to use for the model.
n_estimators (int): Number of boosting rounds.
max_depth (int): Maximum depth of each tree.
subsample (float): Subsample ratio of the training instances.
max_leaves (int): Maximum number of leaves per tree.
learning_rate (float): Step size shrinkage applied at each boosting round.
colsample_bytree (float): Subsample ratio of columns when constructing each tree.
min_child_weight (float): Minimum sum of instance weight needed in a child.
Example
import olorenchemengine as oce
model = oce.XGBoostModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048))
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])
- class olorenchemengine.basics.ZWK_XGBoostModel(representation, n_iter=100, scoring=None, verbose=2, cv=5, **kwargs)#
Bases:
BaseSKLearnModel, BaseObject
XGBoost model from https://github.com/smu-tao-group/ADMET_XGBoost
- Parameters:
representation (str): The representation to use for the model.
n_iter (int): The number of iterations to run the hyperparameter tuning.
scoring (str): The scoring metric to use for the hyperparameter tuning.
verbose (int): The verbosity level of the hyperparameter tuning.
cv (int): The number of folds to use for the hyperparameter tuning.
Example
import olorenchemengine as oce
model = oce.ZWK_XGBoostModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048))
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])
- autofit(model, n_iter, cv, scoring, verbose)#
Takes an XGBoost model and replaces its fit function with one that automatically tunes the model hyperparameters
- Parameters:
model (sklearn model) – the XGBoost model whose fit function is replaced
n_iter (int) – number of iterations for the hyperparameter search
cv (int) – number of cross-validation folds
scoring (str) – scoring metric for the search
verbose (int) – verbosity level
- Returns:
The tuned model
- Return type:
model (sklearn model)
olorenchemengine.dataset module#
- class olorenchemengine.dataset.BaseDataset(name: str = None, data: str = None, structure_col: str = None, property_col: str = None, feature_cols: list = [], date_col: str = None, log=True, **kwargs)#
Bases:
BaseClass
BaseDataset for all dataset objects
BaseDataset holds its data in a Pandas DataFrame.
- Parameters:
name (str) – Name of the dataset
data (str) – The output of df.to_csv(), where df is the pd.DataFrame containing the dataset.
structure_col (str) – Name of column containing structure information, e.g. “smiles”
feature_cols (list[str]) – List of names of columns containing features, e.g. [“X1”, “X2”]
property_col (str) – Name of property of interest, e.g. “Y”
- property entire_dataset#
Returns the entire dataset
- Returns:
The entire dataset
- Return type:
pd.DataFrame
- property entire_dataset_split#
Returns a tuple of three elements where the first is the input train data, the second is the input validation data, and the third is the input test data
- Returns:
(train_data, val_data, test_data)
- Return type:
tuple
- property size#
- property test_dataset#
Gives a tuple of two elements where the first is the input test data and the second is the property of interest
- Returns:
The test data
- Return type:
pd.DataFrame
- property train_dataset#
Returns the train dataset
- property trainval_dataset#
Returns the train and validation dataset
- transform(dataset: Self)#
Combines this dataset with the passed dataset object
- property valid_dataset#
Gives a tuple of two elements where the first is the input val data and the second is the property of interest
- Returns:
The validation data
- Return type:
pd.DataFrame
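A minimal construction sketch (the toy DataFrame is hypothetical; BaseDataset is assumed to be exported at the package top level like the other classes on this page):
import olorenchemengine as oce
import pandas as pd

df = pd.DataFrame({"smiles": ["CCO", "c1ccccc1"], "Y": [0.5, 1.2]})
dataset = oce.BaseDataset(name = "toy", data = df.to_csv(), structure_col = "smiles", property_col = "Y")
train_data, val_data, test_data = dataset.entire_dataset_split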
- class olorenchemengine.dataset.BaseDatasetTransform(log=True)#
Bases:
BaseClass
Applies a transformation onto the inputted BaseDataset.
Transformation applied as defined in the abstract method transform.
- Parameters:
dataset (BaseDataset) – The dataset to transform.
- abstract transform(dataset: BaseDataset) → BaseDataset#
Applies a transformation onto the inputted BaseDataset.
Parameters: dataset (BaseDataset): The dataset to transform.
- class olorenchemengine.dataset.BaseKFold(n_splits: int = 10, log=True)#
Bases:
BaseDatasetTransform
Base class for all classes which split the data into KFolds for cross-validation with various strategies.
- get_n_splits()#
- abstract transform(dataset: BaseDataset, random_state: int = 42, *args, **kwargs)#
Splits the dataset into folds, identified by 1, …, n_splits in the ‘cv’ column.
- class olorenchemengine.dataset.CleanStructures(log=True)#
Bases:
BaseDatasetTransform
CleanStructures creates a new dataset from the original dataset by removing structures that are not valid.
- Parameters:
dataset (BaseDataset) – The dataset to clean.
- transform(dataset: BaseDataset, dropna_property: bool = True, **kwargs)#
Applies a transformation onto the inputted BaseDataset.
Parameters: dataset (BaseDataset): The dataset to transform.
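Dataset transforms are applied via their transform method; a short sketch:
dataset = oce.CleanStructures().transform(dataset)  # drops invalid structures (and rows with missing properties by default)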
- class olorenchemengine.dataset.DatasetFromCDDSearch(search_id, cache_file_path=None, update=True, log=True, **kwargs)#
Bases:
BaseDataset
Dataset for retrieving data from CDD via a saved search.
Requires a CDD Token to be set.
- Parameters:
search_id (str) – The ID of the saved CDD search to use.
cache_file_path (str, optional) – Path to a local file used to cache the retrieved dataset.
update (bool, optional) – Whether to re-run the saved search rather than use the cached data. Defaults to True.
- check_export_status(export_id)#
Uses the export_id passed as a parameter to find the pertinent dataset and return its export status
Parameters: export_id (str): The unique export ID of the dataset searched for
- get_dataset_cdd_saved_search(search_id)#
Runs the saved CDD search specified by search_id to find its related dataset export ID. Using the export ID, it then checks the export status and returns the dataset’s data in CSV format.
Parameters: search_id (str): The ID of the saved CDD search to use.
- get_export(export_id)#
Uses the export_id passed as a parameter to find the pertinent dataset and return the dataset’s data in CSV format.
Parameters: export_id (str): The unique export ID of the dataset searched for
- run_saved_search(search_id)#
Runs the saved CDD search specified by search_id to find and return its related dataset export ID.
Parameters: search_id (str): The ID of the saved CDD search to use.
- class olorenchemengine.dataset.DatasetFromCSV(file_path, log=True, **kwargs)#
Bases:
BaseDataset
Dataset created from a local CSV file
- Parameters:
file_path (str) – Relative or absolute to a local CSV file
- class olorenchemengine.dataset.Discretize(prop_cutoff: float, dir: str = 'larger', log=True, **kwargs)#
Bases:
BaseDatasetTransform
Discretize creates a new dataset from the original dataset by discretizing the property column.
- Parameters:
prop_cutoff (float) – The cutoff value used to discretize the property column.
dir (str) – Direction of the discretization; ‘larger’ treats values above the cutoff as the positive class. Defaults to ‘larger’.
- transform(dataset: BaseDataset, **kwargs)#
Applies a transformation onto the inputted BaseDataset.
Parameters: dataset (BaseDataset): The dataset to transform.
- class olorenchemengine.dataset.KMeansKFold(rep: BaseVecRepresentation, n_splits: int = 10, log=True)#
Bases:
BaseKFold
- transform(dataset: BaseDataset, random_state: int = 42, *args, **kwargs)#
Splits the dataset into folds, identified by 1, …, n_splits in the ‘cv’ column.
- class olorenchemengine.dataset.OneHotEncode(feature_col: str, log=True, **kwargs)#
Bases:
BaseDatasetTransform
One-hot encodes a given feature column
- Parameters:
feature_col (str) – The feature column to one hot encode.
- transform(dataset: BaseDataset, **kwargs)#
Applies a transformation onto the inputted BaseDataset.
Parameters: dataset (BaseDataset): The dataset to transform.
- class olorenchemengine.dataset.RandomKFold(n_splits: int = 10, log=True)#
Bases:
BaseKFold
- transform(dataset: BaseDataset, *args, random_state: int = 42, **kwargs)#
Splits the dataset into folds, identified by 1, …, n_splits in the ‘cv’ column.
- class olorenchemengine.dataset.ScaffoldKFold(n_splits: int = 10, log=True)#
Bases:
BaseKFold
- transform(dataset: BaseDataset, *args, random_state: int = 42, **kwargs)#
Splits the dataset into folds, identified by 1, …, n_splits in the ‘cv’ column.
- class olorenchemengine.dataset.ScaffoldKMeansKFold(rep: BaseVecRepresentation, n_splits: int = 10, log=True)#
Bases:
BaseKFold
- transform(dataset: BaseDataset, random_state: int = 42, *args, **kwargs)#
Splits the dataset into folds, identified by 1, …, n_splits in the ‘cv’ column.
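The KFold transforms plug into BaseModel.fit_cv (documented in the base_class module above); a hedged sketch, assuming a model and training data already exist:
kf = oce.ScaffoldKFold(n_splits = 5)
metrics = model.fit_cv(train['Drug'], train['Y'], kf = kf)  # cross-validation metrics for each split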
- olorenchemengine.dataset.func(self: BaseDataset, other: BaseDatasetTransform) → BaseDataset#
olorenchemengine.ensemble module#
Ensembling methods to combine `BaseModel`s to create better, combined models.
- class olorenchemengine.ensemble.Averager(models: List[BaseModel], n: int = 1, log: bool = True, **kwargs)#
Bases:
BaseModel
Averager averages the predictions of multiple models for an ensembled prediction.
- Parameters:
models (List[BaseModel]) – list of models to average.
n (int, optional) – Number of times to repeat the given models. Defaults to 1.
log (bool, optional) – Whether or not to log the arguments of this constructor. Defaults to True.
Example
import olorenchemengine as oce
model = oce.Averager(models = [
    oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000),
    oce.SupportVectorMachine(representation = oce.Mol2Vec())
])
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])
- preprocess(X, y, fit=False)#
Preprocesses the data for the model.
- Parameters:
X (pd.DataFrame) – Dataframe of features.
y (pd.DataFrame) – Dataframe of labels.
- Returns:
Dataframe of features.
- Return type:
X (pd.DataFrame)
- class olorenchemengine.ensemble.BaseBoosting(models: List[BaseModel], n: int = 1, oof=False, nfolds=5, log: bool = True, **kwargs)#
Bases:
BaseModel
BaseBoosting uses models in a gradient boosting fashion to create an ensembled model.
- Parameters:
models (List[BaseModel]) – list of models to use for the learners to be stacked together.
n (int, optional) – Number of times to repeat the given models. Defaults to 1.
oof (bool, optional) – Whether or not to use out-of-fold predictions for the ensembled model. Defaults to False.
log (bool, optional) – Whether or not to log the arguments of this constructor. Defaults to True.
Example
import olorenchemengine as oce
model = oce.BaseBoosting(
    models = [
        oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000),
        oce.SupportVectorMachine(representation = oce.Mol2Vec())]
)
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])
- class olorenchemengine.ensemble.BaseStacking(models: List[BaseModel], stacker_model: BaseModel, n: int = 1, oof: bool = False, split: float = 0.0, log: bool = True, nfolds=5, **kwargs)#
Bases:
BaseModel
BaseStacking stacks the predictions of models for an ensembled prediction.
- Parameters:
models (List[BaseModel]) – list of models to use for the learners to be stacked together.
stacker_model (BaseModel) – model used to combine the predictions of the stacked learners.
n (int, optional) – Number of times to repeat the given models. Defaults to 1.
oof (bool, optional) – Whether or not to use out-of-fold predictions for the stacked model. Defaults to False.
nfolds (int, optional) – Number of folds used for out-of-fold predictions. Defaults to 5.
log (bool, optional) – Whether or not to log the arguments of this constructor. Defaults to True.
Called only by child classes. Not to be called directly by user.
- featurize(X)#
Featurizes the data for the model.
- Parameters:
X (pd.DataFrame) – Dataframe of features.
y (pd.DataFrame) – Dataframe of labels.
- Returns:
featurized dataset.
- Return type:
data
- class olorenchemengine.ensemble.BestStacker(models: List[BaseModel], n: int = 1, k: int = 1, log: bool = True, **kwargs)#
Bases:
BaseStacking
BestStacker is a stacking method that uses the best model from a collection of models to make an ensembled prediction.
- Parameters:
models (List[BaseModel]) – list of models to use for the learners to be stacked together.
n (int, optional) – Number of times to repeat the given models. Defaults to 1.
k (int, optional) – Number of best models to select for the ensembled prediction. Defaults to 1.
log (bool, optional) – Whether or not to log the arguments of this constructor. Defaults to True.
Example
import olorenchemengine as oce
model = oce.BestStacker(models = [
    oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000),
    oce.SupportVectorMachine(representation = oce.Mol2Vec())
])
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])
- class olorenchemengine.ensemble.LinearRegressionStacker(models: List[BaseModel], n: int = 1, log: bool = True, **kwargs)#
Bases:
BaseStacking
LinearRegressionStacker is a stacking method that uses linear regression on the predictions from a collection of models to make an ensembled prediction.
- Parameters:
models (List[BaseModel]): list of models to use for the learners to be stacked together.
n (int, optional): Number of times to repeat the given models. Defaults to 1.
log (bool, optional): Whether or not to log the arguments of this constructor. Defaults to True.
Example
import olorenchemengine as oce
model = oce.LinearRegressionStacker(models = [
    oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000),
    oce.SupportVectorMachine(representation = oce.Mol2Vec())
])
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])
- class olorenchemengine.ensemble.MLPStacker(models, layer_dims=[2048, 512, 128], activation='tanh', epochs=100, batch_size=16, verbose=0, n=1, log=True, **kwargs)#
Bases:
SKLearnStacker
MLPStacker is a subclass of SKLearnStacker that uses a multi-layer perceptron model to make an ensembled prediction.
- Parameters:
models (List[BaseModel]): list of models to use for the learners to be stacked together.
layer_dims (List[int]): list of layer dimensions for the MLP.
activation (str, optional): activation function to use for the MLP. Defaults to ‘tanh’.
epochs (int, optional): number of epochs to train the MLP. Defaults to 100.
batch_size (int, optional): batch size for the MLP. Defaults to 16.
verbose (int, optional): verbosity level for the MLP. Defaults to 0.
n (int, optional): Number of times to repeat the given models. Defaults to 1.
log (bool, optional): Whether or not to log the arguments of this constructor. Defaults to True.
Example
import olorenchemengine as oce
model = oce.MLPStacker(
    models = [
        oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000),
        oce.SupportVectorMachine(representation = oce.Mol2Vec())],
    layer_dims = [32, 32],
    activation = 'tanh',
    epochs = 15,
    batch_size = 16,
    verbose = 0,
)
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])
- class olorenchemengine.ensemble.RFStacker(models: List[BaseModel], n_estimators: int = 100, max_features: str = 'log2', n: int = 1, log: bool = True, **kwargs)#
Bases:
SKLearnStacker
RFStacker is a subclass of SKLearnStacker that uses random forest models to make an ensembled prediction.
- Parameters:
models (List[BaseModel]): list of models to use for the learners to be stacked together.
n_estimators (int, optional): Number of trees in the forest. Defaults to 100.
n (int, optional): Number of times to repeat the given models. Defaults to 1.
log (bool, optional): Whether or not to log the arguments of this constructor. Defaults to True.
Example
import olorenchemengine as oce
model = oce.RFStacker(
    models = [
        oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000),
        oce.SupportVectorMachine(representation = oce.Mol2Vec())],
    n_estimators = 100
)
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])
- class olorenchemengine.ensemble.Resample1(model: BaseModel, log=True)#
Bases:
BaseModel
Samples from an imbalanced dataset: takes all compounds from the smaller class and then samples an equal number from the larger class.
- Parameters:
model (BaseModel) – Model to use for classification.
Example
import olorenchemengine as oce
model = oce.Resample1(
    oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000)
)
model.fit(train['Drug'], train['Y'])
preds = model.predict(test['Drug'])
Note: may only be used on binary classification data.
- fit(X_train, y_train)#
Calls the _fit method of the model to fit the model on the provided dataset.
- Parameters:
X_train (Union[pd.DataFrame, np.ndarray]) – Input data to be fit on (structures + optionally features) e.g. a pd.DataFrame containing a “smiles” column or a list of experimental data
y_train (Union[pd.Series, list, np.ndarray]) – Values to predict from the input data
valid (Tuple[Union[pd.DataFrame, np.ndarray], Union[pd.Series, list, np.ndarray]]) – Optional validation data, which can be used with methods like early stopping and model averaging.
error_model (BaseErrorModel) – Optional error model, which can be used to predict confidence intervals.
- class olorenchemengine.ensemble.Resample2(model: BaseModel, log=True)#
Bases:
BaseModel
Samples from an imbalanced dataset: takes all compounds from the smaller class and then samples an equal number from the larger class.
- fit(X_train, y_train)#
Calls the _fit method of the model to fit the model on the provided dataset.
- Parameters:
X_train (Union[pd.DataFrame, np.ndarray]) – Input data to be fit on (structures + optionally features) e.g. a pd.DataFrame containing a “smiles” column or a list of experimental data
y_train (Union[pd.Series, list, np.ndarray]) – Values to predict from the input data
valid (Tuple[Union[pd.DataFrame, np.ndarray], Union[pd.Series, list, np.ndarray]]) – Optional validation data, which can be used with methods like early stopping and model averaging.
error_model (BaseErrorModel) – Optional error model, which can be used to predict confidence intervals.
- class olorenchemengine.ensemble.ResampleAdaboost(models: List[BaseModel], n: int = 1, factor: int = 8, size: int = None, equation: str = 'abs', log: bool = True, **kwargs)#
Bases:
BaseBoosting
ResampleAdaboost performs AdaBoost, with sample weighting done via resampling of the dataset, to create an ensembled model.
- Parameters:
models (List[BaseModel]) – list of models to use for the learners to be stacked together.
n (int, optional) – Number of times to repeat the given models. Defaults to 1.
size (int, optional) – Size of the resampled dataset. Defaults to None.
factor (int, optional) – Factor by which to resample the dataset. Defaults to 8.
equation (str, optional) – Equation to use for resampling. Defaults to “abs”.
log (bool, optional) – Whether or not to log the arguments of this constructor. Defaults to True.
Example
import olorenchemengine as oce
model = oce.ResampleAdaboost(
    models = [
        oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000),
        oce.SupportVectorMachine(representation = oce.Mol2Vec())],
    factor = 8,
    equation = 'abs'
)
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])
- class olorenchemengine.ensemble.SKLearnStacker(models: List[BaseModel], regression_stacker_model: BaseEstimator, classification_stacker_model: BaseEstimator, n: int = 1, log: bool = True, **kwargs)#
Bases:
BaseSKLearnModel
SKLearnStacker is a stacking method that uses sklearn-like models to make an ensembled prediction.
- reg_stack#
a sklearn-like model to use for stacking the models for regression tasks.
- class_stack#
a sklearn-like model to use for stacking the models for classification tasks.
Called only by child classes. Not to be called directly by user.
- olorenchemengine.ensemble.get_oof(self, model, X, y, kf)#
olorenchemengine.gnn module#
Building blocks for graph neural networks.
- class olorenchemengine.gnn.AttentiveFP(hidden_channels=4, out_channels=1, num_layers=1, num_timesteps=1, dropout=0, skip_lin=True, layer_dims=[512, 128], activation='leakyrelu', optim='adamw', **kwargs)#
Bases:
BaseLightningModule
AttentiveFP is a wrapper for the PyTorch Geometric interpretation of https://pubs.acs.org/doi/10.1021/acs.jmedchem.9b00959.
- Parameters:
hidden_channels (int, optional) – the number of hidden channels to use in the model. Defaults to 4.
out_channels (int, optional) – the number of output channels to use in the model. Defaults to 1.
num_layers (int, optional) – the number of layers to use in the model. Defaults to 1.
num_timesteps (int, optional) – the number of timesteps to use in the model. Defaults to 1.
dropout (float, optional) – the dropout rate to use in the model. Defaults to 0.
skip_lin (bool, optional) – whether to use skip connections in the model. Defaults to True.
layer_dims (List, optional) – the dimensions to use for each layer in the model. Defaults to [512, 128].
activation (str, optional) – the activation function to use in the model. Defaults to “leakyrelu”.
optim (str, optional) – the optimizer to use in the model. Defaults to “adamw”.
- create(dimensions)#
Create the model.
- Parameters:
dimensions (list) – the dimensions of the input data.
- forward(data)#
- class olorenchemengine.gnn.BaseLightningModule(*args, optim: str = 'adam', input_dimensions: Optional[Tuple] = None, **kwargs)#
Bases:
BaseClass
BaseLightningModule allows for the use of a Pytorch Lightning module as a BaseClass to be incorporated into the framework.
- Parameters:
optim (str, optional) – parameter describing what kind of optimizer to use. Defaults to “adam”.
input_dimensions (Tuple, optional) – Tuple describing the dimensions of the input data. Defaults to None.
- configure_optimizers()#
- forward(batch)#
- loss(y_pred, y_true)#
Calculate the loss for the model.
- Parameters:
y_pred (torch.tensor) – the predictions for the model.
y_true (torch.tensor) – the true labels for the model.
- Returns:
the loss for the model.
- Return type:
torch.tensor
- set_task_type(task_type, pos_weight=torch.tensor([1]))#
Sets the task type for the model.
- Parameters:
task_type (str) – the task type to set the model to.
pos_weight (torch.tensor, optional) – the weight to use for the positive class. Defaults to torch.tensor([1]).
- test_step(batch, batch_idx)#
- training_step(batch, batch_idx)#
- validation_step(batch, batch_idx)#
- class olorenchemengine.gnn.BaseTorchGeometricModel(network: BaseLightningModule, representation: BaseRepresentation = TorchGeometricGraph(), epochs: int = 1, batch_size: int = 16, lr: float = 0.0001, auto_lr_find: bool = True, pos_weight: str = 'balanced', preinitialized: bool = False, log: bool = True, **kwargs)#
Bases:
BaseModel
BaseTorchGeometricModel is a base class for models in the PyTorch Geometric framework.
- Parameters:
network (BaseLightningModule) – The network to be used for the model.
representation (BaseRepresentation, optional) – The representation to be used for the model. Note that the representation must be compatible with the network, so the default, TorchGeometricGraph(), is highly recommended
epochs (int, optional) – The number of epochs to train the model for.
batch_size (int, optional) – The batch size to use for training.
lr (float, optional) – The learning rate to use for training.
auto_lr_find (bool, optional) – Whether to automatically adjust the learning rate.
pos_weight (str, optional) – Strategy for weighting positives in classification
preinitialized (bool, optional) – Whether the network is pre-initialized.
log (bool, optional) – Log arguments or not. Should only be true if it is not nested. Defaults to True.
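A hedged usage sketch pairing BaseTorchGeometricModel with the AttentiveFP network documented above (the hyperparameter values are illustrative only; train/test data assumed):
import olorenchemengine as oce

network = oce.AttentiveFP(hidden_channels = 32, num_layers = 2)
model = oce.BaseTorchGeometricModel(network, epochs = 10, batch_size = 32)
model.fit(train['Drug'], train['Y'])
model.predict(test['Drug'])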
- class olorenchemengine.gnn.TLFromCheckpoint(model_path, num_tasks: int = 2048, dropout: float = 0.1, lr: float = 0.0001, optim: str = 'adam', reset: bool = False, **kwargs)#
Bases:
BaseLightningModule
TLFromCheckpoint is a class for transfer-learning from an OlorenVec PyTorch-lightning checkpoint.
- Parameters:
model_path (str, optional) – The path to the PyTorch-lightning checkpoint. Use “default” to use a pretrained OlorenVec model.
map_location (str, optional) – The location to map the model to. Default is “cuda:0”.
num_tasks (int, optional) – The number of tasks in the OlorenVec model
dropout (float, optional) – The dropout rate to use for the model. Default is 0.1.
lr (float, optional) – The learning rate to use for training. Default is 1e-4.
optim (str, optional) – The optimizer to use for training. Default is “adam”.
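A hedged transfer-learning sketch (per the model_path parameter above, “default” loads a pretrained OlorenVec checkpoint; passing preinitialized = True to BaseTorchGeometricModel is an assumption based on its parameter list):
network = oce.TLFromCheckpoint("default")
model = oce.BaseTorchGeometricModel(network, preinitialized = True, epochs = 10)
model.fit(train['Drug'], train['Y'])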
olorenchemengine.hyperparameters module#
Contains the basic framework for hyperparameter optimization.
We use hyperopt as our framework for hyperparameter optimization, and the class Opt functions as the bridge between olorenchemengine and hyperopt. Hyperparameters are defined in Opt which is used as an argument in a BaseClass object’s instantiation. These hyperparameters are then collated and used for hyperparameter optimization.
The following is a brief introduction to hyperopt and is a useful starting point for understanding our hyperparameter optimization engine: https://github.com/hyperopt/hyperopt/wiki/FMin.
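A hedged sketch of the pattern described above, using OptChoice (documented below) to mark a constructor argument as tunable; the exact argument format passed to OptChoice is an assumption based on hyperopt's hp.choice:
import olorenchemengine as oce

model = oce.RandomForestModel(
    representation = oce.MorganVecRepresentation(radius=2, nbits=2048),
    n_estimators = oce.OptChoice("n_estimators", [100, 500, 1000]),  # tunable choice over three values
)
hyperparameters = oce.index_hyperparameters(model)  # collate the declared hyperparameters into a dict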
- class olorenchemengine.hyperparameters.Opt(label, *args, use_int=False, **kwargs)#
Bases:
BaseClass
- abstract property get_hp#
- class olorenchemengine.hyperparameters.OptChoice(label, *args, use_int=False, **kwargs)#
Bases:
Opt
- property get_hp#
- class olorenchemengine.hyperparameters.OptLogNormal(label, *args, use_int=False, **kwargs)#
Bases:
Opt
- property get_hp#
- class olorenchemengine.hyperparameters.OptLogUniform(label, *args, use_int=False, **kwargs)#
Bases:
Opt
- property get_hp#
- class olorenchemengine.hyperparameters.OptQLogNormal(label, *args, use_int=False, **kwargs)#
Bases:
Opt
- property get_hp#
- class olorenchemengine.hyperparameters.OptQLogUniform(label, *args, use_int=False, **kwargs)#
Bases:
Opt
- property get_hp#
- class olorenchemengine.hyperparameters.OptQNormal(label, *args, use_int=False, **kwargs)#
Bases:
Opt
- property get_hp#
- class olorenchemengine.hyperparameters.OptQUniform(label, *args, use_int=False, **kwargs)#
Bases:
Opt
- property get_hp#
- class olorenchemengine.hyperparameters.OptRandInt(label, *args, use_int=False, **kwargs)#
Bases:
Opt
- property get_hp#
- class olorenchemengine.hyperparameters.OptUniform(label, *args, use_int=False, **kwargs)#
Bases:
Opt
- property get_hp#
- olorenchemengine.hyperparameters.index_hyperparameters(object: BaseClass) → dict#
Returns a dictionary of hyperparameters for the model.
- olorenchemengine.hyperparameters.load_hyperparameters(object: BaseClass, hyperparameter_dictionary: dict) → dict#
olorenchemengine.internal module#
- class olorenchemengine.internal.BaseClass(log=True)#
Bases:
BaseRemoteSymbol
BaseClass is the base class for all models.
All classes in Oloren ChemEngine should inherit from BaseClass to enable universal saving and loading of both parameters and internal state. This requires the implementation of the abstract methods _save and _load.
- Registry()#
returns a dictionary mapping the name of a class to the class itself for all subclasses of the class.
- _save()#
saves an instance of a BaseClass to a dictionary (abstract method to be implemented by subclasses)
- _load()#
loads an instance of a BaseClass from a dictionary (abstract method to be implemented by subclasses)
- classmethod AllInstances()#
AllInstances returns a list of all standard instances of all subclasses of BaseClass.
Standard instances are instances where all required parameters for instantiation of the subclasses are set with canonical values.
- classmethod Opt(*args, **kwargs)#
- classmethod Registry()#
Registry is a recursive method to create a dictionary of all subclasses of BaseClass, with the key being the name of the subclass and the value being the subclass itself.
- copy()#
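A short sketch of the registry lookup (subclass names as keys, classes as values):
import olorenchemengine as oce

ModelClass = oce.BaseClass.Registry()["RandomForestModel"]  # look up a subclass by name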
- class olorenchemengine.internal.BaseDepreceated(*args, **kwargs)#
Bases:
BaseClass
BaseDepreceated is a class which is used to deprecate a class.
Deprecated classes will raise an Exception and will not run.
- class olorenchemengine.internal.BaseEstimator(obj=None)#
Bases:
BaseObject
Utility class used to wrap any object with a fit and predict method
- fit(X, y)#
Fit the estimator to the data
- Parameters:
X (np.array) – The data to fit the estimator to
y (np.array) – The target data to fit the estimator to
- Returns:
The estimator object fit to the data
- Return type:
self (object)
- predict(X)#
Predict the output of the estimator
- Parameters:
X (np.array) – The data to predict the output of the estimator on
- Returns:
The predicted output of the estimator
- Return type:
y (np.array)
- class olorenchemengine.internal.BaseObject(obj=None)#
Bases:
BaseClass
BaseObject is the parent class for all classes which directly wrap some object to be saved via joblib.
- class olorenchemengine.internal.BasePreprocessor(obj=None)#
Bases:
BaseObject
BasePreprocessor is the parent class for all preprocessors which transform the features or properties of a dataset.
- fit()#
fit the preprocessor to the dataset
- fit_transform()#
fit the preprocessor to the dataset return the transformed values
- transform()#
return the transformed values
- inverse_transform()#
return the original values from the transformed values
- fit(X)#
Fits the preprocessor to the dataset.
- Parameters:
X (np.ndarray) – the dataset
- Returns:
The fit preprocessor instance
- fit_transform(X)#
Fits the preprocessor to the dataset and returns the transformed values.
- Parameters:
X (np.ndarray) – the dataset
- Returns:
The transformed values of the dataset as a numpy array
- inverse_transform(X)#
Returns the original values from the transformed values.
- Parameters:
X (np.ndarray) – the transformed values
- Returns:
The original values from the transformed values
- transform(X)#
Returns the transformed values of the dataset as a numpy array.
- Parameters:
X (np.ndarray) – the dataset
- Returns:
The transformed values of the dataset as a numpy array
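A sketch of the fit/transform/inverse_transform round trip, using the QuantileTransformer documented below as a concrete BasePreprocessor (toy data; sklearn may clamp n_quantiles for such small inputs):
import numpy as np
import olorenchemengine as oce

y = np.array([[0.1], [1.4], [2.3], [3.7]])
pre = oce.QuantileTransformer()      # a concrete BasePreprocessor
y_t = pre.fit_transform(y)           # fit to the data and transform it
y_back = pre.inverse_transform(y_t)  # recover the original values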
- class olorenchemengine.internal.BaseRemoteSymbol(REMOTE_SYMBOL_NAME, REMOTE_PARENT, args=None, kwargs=None)#
Bases:
object
- classmethod from_rid(rid)#
- class olorenchemengine.internal.BaseRepresentation(log=True)#
Bases:
BaseClass
BaseClass for all molecular representations (PyTorch Geometric graphs, descriptors, fingerprints, etc.)
- Parameters:
log (boolean) – Whether to log the representation or not
- _convert(smiles: str, y: Union[int, float, np.number] = None) -> Any#
converts a single structure (represented by a SMILES string) to a representation
- convert(Xs: Union[list, pd.DataFrame, dict, str], ys: Union[list, pd.Series, np.ndarray] = None) -> List[Any]#
converts input data to a list of representations
- convert(Xs: Union[list, pd.DataFrame, dict, str], ys: Optional[Union[list, pd.Series, np.ndarray]] = None, **kwargs) → List[Any]#
Converts input data to a list of representations
- Parameters:
Xs (Union[list, pd.DataFrame, dict, str]) – input data
ys (Union[list, pd.Series, np.ndarray], optional) – target values of the input data
- Returns:
list of representations of the input data
- Return type:
List[Any]
- class olorenchemengine.internal.BaseVecRepresentation(*args, collinear_thresh=1.01, scale=StandardScaler(), names=None, log=True, **kwargs)#
Bases:
BaseRepresentation
Representation where given input data, returns a vector representation for each compound.
- calculate_distance(x1: Union[str, List[str]], x2: Union[str, List[str]], metric: str = 'cosine', **kwargs) → np.ndarray#
Calculates the distance between two molecules or list of molecules.
Returns a 2D array of distances between each pair of molecules of shape len(x1) by len(x2).
This uses pairwise_distances from sklearn.metrics to calculate distances between the vector representations of the molecules. Valid values for metric are:
- From scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']. These metrics support sparse matrix inputs. ['nan_euclidean'] is also valid but does not yet support sparse matrices.
- From scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'].
See the documentation for scipy.spatial.distance for details on these metrics.
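A short sketch (metric names per the lists above):
rep = oce.MorganVecRepresentation(radius=2, nbits=2048)
dists = rep.calculate_distance("CCO", ["CCO", "c1ccccc1"], metric = "cosine")  # 2D array of shape (1, 2)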
- convert(Xs: Union[list, pd.DataFrame, dict, str], ys: Optional[Union[list, pd.Series, np.ndarray]] = None, lambda_convert: Optional[Callable] = None, fit=False, **kwargs) → List[np.ndarray]#
BaseVecRepresentation’s convert returns a list of numpy arrays.
- property names#
- class olorenchemengine.internal.ConcatenatedVecRepresentation(rep1: BaseVecRepresentation, rep2: BaseVecRepresentation, log=True, **kwargs)#
Bases:
BaseVecRepresentation
Creates a structure vector representation by concatenating multiple representations.
- Parameters:
rep1 (BaseVecRepresentation) – first representation to concatenate
rep2 (BaseVecRepresentation) – second representation to concatenate
log (bool) – whether to log the representations or not
Can be created by adding two representations together using the + operator.
Example
import olorenchemengine as oce

combo_rep = oce.MorganVecRepresentation(radius=2, nbits=2048) + oce.Mol2Vec()
model = oce.RandomForestModel(representation = combo_rep, n_estimators = 1000)
model.fit(train["Drug"], train["Y"])
model.predict(test["Drug"])
- convert(smiles_list, ys=None, fit=False, **kwargs)#
BaseVecRepresentation’s convert returns a list of numpy arrays.
- class olorenchemengine.internal.LinearRegression(*args, **kwargs)#
Bases:
BaseEstimator
Wrapper for sklearn LinearRegression
- class olorenchemengine.internal.LogScaler(min_value=0, with_mean=True, with_std=True)#
Bases:
BasePreprocessor
LogScaler is a BasePreprocessor which standardizes the data by taking the log and then removing the mean and scaling to unit variance.
- fit(X)#
Fits the preprocessor to the dataset.
- Parameters:
X (np.ndarray) – the dataset
- Returns:
The fit preprocessor instance
- fit_transform(X)#
Fits the preprocessor to the dataset and returns the transformed values.
- Parameters:
X (np.ndarray) – the dataset
- Returns:
The transformed values of the dataset as a numpy array
- inverse_transform(X)#
Returns the original values from the transformed values.
- Parameters:
X (np.ndarray) – the transformed values
- Returns:
The original values from the transformed values
- transform(X)#
Returns the transformed values of the dataset as a numpy array.
- Parameters:
X (np.ndarray) – the dataset
- Returns:
The transformed values of the dataset as a numpy array
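A minimal round-trip sketch (assuming a 1-D numpy array is accepted, per the parameter docs above, and importing LogScaler from olorenchemengine.internal as documented):

import numpy as np
from olorenchemengine.internal import LogScaler

y = np.array([1.0, 10.0, 100.0, 1000.0])
scaler = LogScaler(min_value=0)
y_scaled = scaler.fit_transform(y)                # log-transform, then center and scale to unit variance
y_recovered = scaler.inverse_transform(y_scaled)  # recovers the original values (up to floating-point error)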
- class olorenchemengine.internal.OASConnector#
Bases:
object
Class which links oce to OAS and can move data between them using Firestore and an API
- authenticate()#
- upload_model(model, model_name)#
- upload_vis(visualization)#
- class olorenchemengine.internal.QuantileTransformer(n_quantiles=1000, output_distribution='normal', subsample=100000.0, random_state=None)#
Bases:
BasePreprocessor
QuantileTransformer is a BasePreprocessor which transforms a dataset by quantile transformation to a specified distribution.
- obj#
the object which is wrapped by the BasePreprocessor
- class olorenchemengine.internal.Remote(remote_url, session_id=None, keep_alive=False, debug=False)#
Bases:
object
- class olorenchemengine.internal.RemoteObj(remote_id)#
Bases:
BaseRemoteSymbol
Dummy object to represent remote objects.
- class olorenchemengine.internal.SMILESRepresentation(log=True)#
Bases:
BaseRepresentation
Extracts the SMILES strings from inputted data
- convert(Xs: Union[list, pd.DataFrame, dict, str], ys: Union[list, pd.Series, np.ndarray] = None) -> List[Any]: converts input data to a list of SMILES strings
Data types:
pd.DataFrames will have columns "smiles" or "Smiles" or "SMILES" extracted
lists and tuples of multiple elements will have their first element extracted
strings will be converted to a list of one element
everything else will be returned as inputted
- convert(Xs, ys=None, **kwargs)#
Converts input data to a list of representations
- Parameters:
Xs (Union[list, pd.DataFrame, dict, str]) – input data
ys (Union[list, pd.Series, np.ndarray], optional) – target values of the input data
- Returns:
list of representations of the input data
- Return type:
List[Any]
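A short sketch of the input handling described above (assuming SMILESRepresentation is exported at the package top level):

import pandas as pd
import olorenchemengine as oce

rep = oce.SMILESRepresentation()
rep.convert(pd.DataFrame({"SMILES": ["CCO", "CCN"]}))  # -> ["CCO", "CCN"]
rep.convert("CCO")                                     # -> ["CCO"]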
- class olorenchemengine.internal.StandardScaler(with_mean=True, with_std=True)#
Bases:
BasePreprocessor
StandardScaler is a BasePreprocessor which standardizes the data by removing the mean and scaling to unit variance.
- obj#
the object which is wrapped by the BasePreprocessor
- olorenchemengine.internal.all_subclasses(cls)#
Helper function to return all subclasses of class
- olorenchemengine.internal.create_BC(d: dict) BaseClass #
create_BC is a method which creates a BaseClass object from a dictionary of parameters.
Note that the instance variables of the object are not specified.
- olorenchemengine.internal.deparametrize_args_kwargs(params)#
- olorenchemengine.internal.detect_setting(data)#
- olorenchemengine.internal.download_public_file(path, redownload=False)#
Download a public file from Oloren's Public Storage and return its contents.
- Parameters:
path – The path to the file to read.
redownload – Whether to redownload the file if it already exists.
- olorenchemengine.internal.generate_uuid()#
- olorenchemengine.internal.get_all_reps()#
- olorenchemengine.internal.get_default_args(func)#
- olorenchemengine.internal.get_runtime()#
- olorenchemengine.internal.import_or_install(package_name: str, statement: Optional[str] = None, scope: Optional[dict] = None)#
- olorenchemengine.internal.json_params_str(base: Union[BaseClass, dict]) str #
Returns a json string of the parameters of the passed BaseClass object so that the model parameter dictionary can be reconstructed with json.loads(params_str)
- olorenchemengine.internal.loads(d: dict) BaseClass #
loads is a method which recreates a BaseClass object from a save.
- olorenchemengine.internal.log_arguments(func: Callable[[...], None]) Callable[[...], None] #
log_arguments is a decorator which logs the arguments of a BaseClass constructor to instance variables for use in model parameterization.
- olorenchemengine.internal.mock_imports(g, *args)#
- olorenchemengine.internal.model_name_from_model(model: BaseClass) str #
model_name_from_model creates a unique name for a model.
- olorenchemengine.internal.model_name_from_params(param_dict: dict) str #
model_name_from_params creates a unique name for a model based on the parameters passed to it.
- olorenchemengine.internal.package_available(package_name: str) bool #
Checks if a package is available.
- olorenchemengine.internal.parameterize(object: Optional[Union[BaseClass, list, dict, int, float, str]]) dict #
parameterize is a recursive method which creates a dictionary of all arguments necessary to instantiate a BaseClass object.
Note that only objects which are instances of subclasses of BaseClass can be parameterized; the other supported types exist to enable the recursive use of parameterize but cannot themselves be parameterized.
- Parameters:
object (Union[BaseClass, list, dict, int, float, str, None]) – the object to be parameterized.
- Raises:
ValueError – Object is not of type that can be parameterized
- Returns:
dictionary of parameters necessary to instantiate the object.
- Return type:
dict
- olorenchemengine.internal.parametrize_args_kwargs(args, kwargs)#
- olorenchemengine.internal.pretty_args_kwargs(args, kwargs)#
- olorenchemengine.internal.pretty_params(base: Union[BaseClass, dict]) dict #
Returns a dictionary of the parameters of the passed BaseClass object, formatted such that they are in a human readable format, with the names of the arguments included.
- olorenchemengine.internal.pretty_params_str(base: Union[BaseClass, dict]) str #
Returns a string of the parameters of the passed BaseClass object, formatted such that they are in a human readable format
- olorenchemengine.internal.recursive_get_attr(parent, attr)#
- olorenchemengine.internal.saves(object: Optional[Union[BaseClass, dict, list, int, float, str]]) dict #
saves is a method which saves a BaseClass object, which can be recovered via loads; a usage sketch follows the function list below.
- olorenchemengine.internal.set_runner(runner)#
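A minimal sketch of the parameterize/saves/loads round trip (assuming these helpers are re-exported at the package top level, as with other oce utilities; otherwise import them from olorenchemengine.internal):

import olorenchemengine as oce

model = oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000)
params = oce.parameterize(model)  # dict of constructor arguments only; instance variables are not captured
saved = oce.saves(model)          # dict save of the object, recoverable via loads
model2 = oce.loads(saved)         # recreates the BaseClass object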
olorenchemengine.interpret module#
- class olorenchemengine.interpret.CounterfactualEngine(model: BaseModel, perturbation_engine: PerturbationEngine = 'default')#
Bases:
BaseClass
Generates counterfactual compounds based on the exmol GitHub repository: Model agnostic generation of counterfactual explanations for molecules.
- generate_cfs(delta: Union[int, float, Tuple] = (-1, 1), n: int = 4) None #
Generates counterfactuals and stores them in self.cfs as a list of dictionaries.
- Parameters:
delta – margin defining counterfactuals for regression models
n – number of counterfactuals
- generate_samples(smiles: str) None #
Generates candidate counterfactuals and stores them in self.samples as a list of dictionaries.
- Parameters:
smiles – SMILES string of the target prediction
- get_cfs() _MockObject.DataFrame #
Returns counterfactuals as a pandas dataframe.
- Returns:
pandas dataframe of counterfactuals
- get_samples() _MockObject.DataFrame #
Returns candidate counterfactuals as a pandas dataframe.
- Returns:
pandas dataframe of candidate counterfactuals
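A minimal workflow sketch (assuming CounterfactualEngine is exported at the package top level; trained_model is a placeholder for any fitted BaseModel and the SMILES string is an arbitrary example):

import olorenchemengine as oce

engine = oce.CounterfactualEngine(model = trained_model)
engine.generate_samples("CC(=O)Nc1ccc(O)cc1")  # candidate counterfactuals stored in self.samples
engine.generate_cfs(delta = (-1, 1), n = 4)    # counterfactuals stored in self.cfs
cfs = engine.get_cfs()                         # counterfactuals as a pandas DataFrame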
- class olorenchemengine.interpret.PerturbationEngine(log=True)#
Bases:
BaseClass
PerturbationEngine is the base class for techniques which mutate or perturb a compound into a similar one with a small difference.
- get_compound_at_idx()#
returns a compound with a modification at a given atom index
- get_compound()#
returns a compound with a randomly chosen modification
- get_compound_list()#
returns a list of compounds with modifications; the list is meant to comprehensively cover the results of applying an entire class of modifications.
- class olorenchemengine.interpret.STONEDMutations(mutations: int = 1, log=True)#
Bases:
PerturbationEngine
Implements the STONED-SELFIES algorithm for generating modified compounds.
STONED-SELFIES GitHub repository: Beyond Generative Models: Superfast Traversal, Optimization, Novelty, Exploration and Discovery (STONED) Algorithm for Molecules using SELFIES.
- get_compound_at_idx()#
returns a compound with a randomly chosen modification at idx
- get_compound()#
returns a compound with a randomly chosen modification
- get_compound_list()#
returns a list of num_samples compounds with randomly chosen modifications
- class olorenchemengine.interpret.SwapMutations(radius=0, log=True)#
Bases:
PerturbationEngine
SwapMutations replaces substructures of radius r with another substructure of radius < r. The substructure is chosen such that it has the same outgoing bonds, and this set of substructures is identified through a comprehensive enumeration of a large set of lead-like compounds.
- get_compound()#
returns a compound with a randomly chosen modification
- get_compound_list()#
returns a list of compounds with modifications; the list is meant to comprehensively cover the results of applying an entire class of modifications.
- get_compound(smiles, **kwargs)#
- get_compound_at_idx(mol, idx, **kwargs)#
- get_entry(m, idx, r=1)#
- get_substitution(m, idx, r=1)#
- stitch(m)#
- olorenchemengine.interpret.model_molecule_sensitivity(model: BaseModel, smiles: str, perturbation_engine: PerturbationEngine = 'default', n: int = 30) _MockObject.Chem.Mol #
Calculates the sensitivity of a model to perturbations on each of a molecule's atoms, outputting an rdkit molecule with sensitivity as an atom property.
- Parameters:
model – model to be used for sensitivity calculation
smiles – SMILES string of the target prediction
perturbation_engine – perturbation engine to be used for sensitivity calculation
n – number of perturbations to be used for sensitivity calculation
- Returns:
rdkit molecule with sensitivity as an atom property
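A sketch of inspecting the result (model is a placeholder for a fitted BaseModel; the exact atom property name set by the function is not stated above, so the sketch dumps all atom properties):

import olorenchemengine as oce

mol = oce.model_molecule_sensitivity(model, "CC(=O)Nc1ccc(O)cc1", n = 30)
for atom in mol.GetAtoms():
    # the sensitivity value appears among the atom's properties
    print(atom.GetIdx(), atom.GetSymbol(), atom.GetPropsAsDict())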
olorenchemengine.manager module#
- class olorenchemengine.manager.BaseModelManager(dataset: BaseDataset, metrics: List[str], file_path: str = None, primary_metric: str = None, verbose=True, log=True)#
Bases:
BaseClass
BaseModelManager is the base class for all model managers.
- Parameters:
dataset (BaseDataset) – The dataset to use for model development.
metrics (list[str]) – The metrics to track, e.g. ROC AUC, Root Mean Squared Error.
file_path (str) – The path to save the model manager to. Optional.
- property direction#
- get_dataset()#
- get_model_database()#
- primary_metric()#
- class olorenchemengine.manager.FirebaseModelManager(dataset: BaseDataset, metrics: List[str], uid: str, primary_metric: str = None, file_path: str = None, log=True)#
Bases:
BaseModelManager
FirebaseModelManager is a ModelManager that saves model parameters and performances to a Firebase database.
A Firebase service account key in oce.CONFIG is required for database access.
Model information is saved to a collection called ‘models’ in the database. For each document, the following is saved:
uid: the user id of the user associated with the model
did: the dataset_id of the dataset on which the model was trained
model_parameters: parameters of the BaseModel oce object
model_name
model_status
fit_time
metrics: model training metrics
Dataset information is saved to a collection called ‘datasets’ in the database. For each document, the following is saved:
dataset: map representation of the BaseDataset oce object
hashed_dataset: md5 hash of the dataset data
uid: the user id of the user associated with the dataset
- Parameters:
dataset (BaseDataset) – The dataset to use for model development.
metrics (list[str]) – The metrics to track, e.g. ROC AUC, Root Mean Squared Error.
file_path (str) – The path to save the model manager to.
uid (str) – The user id associated with the model manager
- class olorenchemengine.manager.ModelManager(dataset: BaseDataset, metrics: List[str], file_path: str = None, primary_metric: str = None, verbose=True, log=True)#
Bases:
BaseModelManager
ModelManager is the class that tracks model development against a specified dataset. It is responsible for saving parameter settings and metrics.
- Parameters:
dataset (BaseDataset) – The dataset to use for model development.
metrics (list[str]) – The metrics to track, e.g. ROC AUC, Root Mean Squared Error.
file_path (str) – The path to save the model manager to. Optional.
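A minimal instantiation sketch (the file path is a placeholder; ExampleDataset is listed in the module contents below):

import olorenchemengine as oce

dataset = oce.ExampleDataset()
manager = oce.ModelManager(dataset, metrics = ["Root Mean Squared Error"], file_path = "model_manager.oce")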
- class olorenchemengine.manager.SheetsModelManager(dataset: BaseDataset, metrics: List[str], file_path: str = None, primary_metric: str = None, name: str = 'SheetsModelManager', email: str = '', log=True)#
Bases:
BaseModelManager
SheetsModelManager is the class that tracks model development against a specified dataset on Google Sheets. It is responsible for saving parameter settings and metrics.
- Parameters:
dataset (BaseDataset) – The dataset to use for model development.
metrics (list[str]) – The metrics to track, e.g. ROC AUC, Root Mean Squared Error.
name (str) – The name of the Google Sheets to save this to. Optional.
email (str) – The email to share the results to. Optional, Default is share to anyone with the link.
olorenchemengine.reduction module#
- class olorenchemengine.reduction.FactorAnalysis(*args, **kwargs)#
Bases:
BaseSKLearnReduction
Wrapper for sklearn FactorAnalysis
- class olorenchemengine.reduction.PCA(*args, **kwargs)#
Bases:
BaseSKLearnReduction
Wrapper for sklearn PCA
olorenchemengine.representations module#
A library of various molecular representations.
- class olorenchemengine.representations.AtomFeaturizer(log=True)#
Bases:
BaseClass
Abstract class for atom featurizers, which create a vector representation for a single atom.
- length(self) int #
returns the length of the atom vector representation, to be implemented by subclasses
- convert(self, atom
Chem.Atom) -> np.ndarray: converts a single Chem.Atom string to a vector representation, to be implemented by subclasses
- abstract convert(atom: _MockObject.Chem.Atom) _MockObject.ndarray #
- class olorenchemengine.representations.BaseCompoundVecRepresentation(normalize=False, **kwargs)#
Bases:
BaseVecRepresentation
Computes a vector representation from each structure.
- Parameters:
normalize (bool) – whether to normalize the computed vector representation
- class olorenchemengine.representations.BondFeaturizer(log=True)#
Bases:
BaseClass
Abstract class for bond featurizers, which create a vector representation for a single bond.
- length(self) int #
returns the length of the bond vector representation, to be implemented by subclasses
- convert(self, bond
Chem.Bond) -> np.ndarray: converts a single Chem.Bond string to a vector representation, to be implemented by subclasses
- abstract convert(bond: _MockObject.Chem.Bond) _MockObject.ndarray #
- class olorenchemengine.representations.ConcatenatedAtomFeaturizers(atom_featurizers: List[AtomFeaturizer])#
Bases:
AtomFeaturizer
Concatenates multiple atom featurizers into a single vector representation.
- length(self) int #
returns the length of the atom vector representation, to be implemented by subclasses
- convert(self, atom
Chem.Atom) -> np.ndarray: converts a single Chem.Atom string to a vector representation, to be implemented by subclasses
- convert(atom: _MockObject.Chem.Atom) _MockObject.ndarray #
- class olorenchemengine.representations.ConcatenatedBondFeaturizers(bond_featurizers: List[BondFeaturizer])#
Bases:
BondFeaturizer
Concatenates multiple bond featurizers into a single vector representation.
- length(self) int #
returns the length of the bond vector representation, to be implemented by subclasses
- convert(self, bond
Chem.Bond) -> np.ndarray: converts a single Chem.Bond string to a vector representation, to be implemented by subclasses
- convert(bond: _MockObject.Chem.Bond) _MockObject.ndarray #
- class olorenchemengine.representations.ConcatenatedStructVecRepresentation(rep1: BaseCompoundVecRepresentation, rep2: BaseCompoundVecRepresentation, log=True, **kwargs)#
Bases:
BaseCompoundVecRepresentation
Creates a structure vector representation by concatenating multiple representations.
DEPRECATED; use ConcatenatedVecRepresentation instead.
- Parameters:
rep1 (BaseVecRepresentation) – first representation to concatenate
rep2 (BaseVecRepresentation) – second representation to concatenate
log (bool) – whether to log the representations or not
- class olorenchemengine.representations.DatasetFeatures(*args, collinear_thresh=1.01, scale=<olorenchemengine.internal.StandardScaler object>, names=None, log=True, **kwargs)#
Bases:
BaseVecRepresentation
Selects features from the input dataset as the vector representation
- convert(X, **kwargs)#
BaseVecRepresentation’s convert returns a list of numpy arrays.
- class olorenchemengine.representations.DescriptastorusDescriptor(name, *args, log=True, scale=None, **kwargs)#
Bases:
BaseCompoundVecRepresentation
Wrapper for DescriptaStorus descriptors (https://github.com/bp-kelley/descriptastorus)
- Parameters:
name (str) – the name of the DescriptaStorus descriptor set to use; see available_descriptors for the options
log (bool) – whether to log the representation or not
- classmethod AllInstances()#
AllInstances returns a list of all standard instances of all subclasses of BaseClass.
Standard instances means that all required parameters for instantiation of the subclasses are set with canonical values.
- available_descriptors = ['atompaircounts', 'morgan3counts', 'morganchiral3counts', 'morganfeature3counts', 'rdkit2d', 'rdkit2dnormalized', 'rdkitfpbits']#
- class olorenchemengine.representations.FragmentIndicator(log=True, **kwargs)#
Bases:
BaseCompoundVecRepresentation
Indicator variables for all fragments in rdkit.Chem.Fragments
- class olorenchemengine.representations.GobbiPharma2D(normalize=False, **kwargs)#
Bases:
BaseCompoundVecRepresentation
2D Gobbi pharmacophore descriptor (implemented in RDKit, from https://doi.org/10.1002/(SICI)1097-0290(199824)61:1<47::AID-BIT9>3.0.CO;2-Z)
- class olorenchemengine.representations.GobbiPharma3D(normalize=False, **kwargs)#
Bases:
BaseCompoundVecRepresentation
3D Gobbi pharmacophore descriptor (implemented in RDKit, from https://doi.org/10.1002/(SICI)1097-0290(199824)61:1<47::AID-BIT9>3.0.CO;2-Z)
- class olorenchemengine.representations.LipinskiDescriptor(log=True, **kwargs)#
Bases:
BaseCompoundVecRepresentation
Wrapper for Lipinski descriptors (https://www.rdkit.org/docs/RDKit_Book.html#Lipinski_Descriptors)
- Parameters:
log (bool) – whether to log the representations or not
- class olorenchemengine.representations.MACCSKeys#
Bases:
BaseCompoundVecRepresentation
Calculate MACCS (Molecular ACCess System) Keys fingerprint.
Durant, Joseph L., et al. “Reoptimization of MDL keys for use in drug discovery.” Journal of chemical information and computer sciences 42.6 (2002): 1273-1280.
- class olorenchemengine.representations.MCSClusterRep(dataset: BaseDataset, *args, eval_set='train', timeout: int = 5, threshold: float = 0.9, cached=False, log=True, **kwargs)#
Bases:
BaseCompoundVecRepresentation
Clusters a train set of compounds and then finds the maximum common substructure (MCS) within each set. The presence of each cluster’s MCS is used as a feature
- class olorenchemengine.representations.ModelAsRep(model: Union[BaseModel, str], name='ModelAsRep', download_public_file=False, log=True, **kwargs)#
Bases:
BaseCompoundVecRepresentation
Uses a trained model itself as a representation.
If we are trying to predict property A, and there is a highly related property B with a lot of data, we could train a model on property B and use that model with ModelAsRep as a representation for property A.
- Parameters:
model (BaseModel, str) – A trained model to be used as the representation, either a BaseModel object or a path to a saved model
download_public_file (bool, optional) – If True, will download the specified model from OCE’s public warehouse of models. Defaults to False.
name (str) – Name of the property the passed model predicts, which is useful for clear save files/interpretability visualizations. Optional.
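A minimal sketch (the model file name is a placeholder; per the parameters above, a path to a saved model may be passed in place of a BaseModel object):

import olorenchemengine as oce

rep = oce.ModelAsRep("related_property_model.oce", name = "RelatedProperty")
model = oce.RandomForestModel(representation = rep, n_estimators = 1000)
model.fit(train["Drug"], train["Y"])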
- class olorenchemengine.representations.MordredDescriptor(descriptor_set: Union[str, list] = '2d', log: bool = True, normalize: bool = False, **kwargs)#
Bases:
BaseCompoundVecRepresentation
Wrapper for Mordred descriptors (https://github.com/mordred-descriptor/mordred)
- Parameters:
descriptor_set (Union[str, list]) – the Mordred descriptor set to use; defaults to "2d"
log (bool) – whether to log the representation or not
normalize (bool) – whether to normalize the computed descriptors
- convert(Xs, ys=None, **kwargs)#
Computes a vector representation from each structure in Xs.
- convert_full(Xs, ys=None, **kwargs)#
Convert list of SMILES to descriptors in the form of a numpy array.
- class olorenchemengine.representations.MorganVecRepresentation(radius=2, nbits=1024, scale=None, log=True, **kwargs)#
Bases:
BaseCompoundVecRepresentation
- info(smiles)#
- class olorenchemengine.representations.NoisyVec(rep: BaseVecRepresentation, *args, a_std=0.1, m_std=0.1, **kwargs)#
Bases:
BaseVecRepresentation
Adds noise to a given BaseVecRepresentation
- Parameters:
rep (BaseVecRepresentation) – BaseVecRepresentation to add noise to
a_std (float) – standard deviation of the additive noise. Defaults to 0.1.
m_std (float) – standard deviation of the multiplicative noise. Defaults to 0.1.
names (List[str]) – list of the names of the features in the vector representation, optional.
Example
import olorenchemengine as oce

model = oce.RandomForestModel(representation = oce.'''BaseCompoundVecRepresentation(Params)''', n_estimators = 1000)
model.fit(train["Drug"], train["Y"])
model.predict(test["Drug"])
- class olorenchemengine.representations.OGBAtomFeaturizer#
Bases:
AtomFeaturizer
Creates a vector representation for a single atom using the Open Graph Benchmark’s atom_to_feature_vector function.
- convert(atom: _MockObject.Chem.Atom)#
- property length#
- class olorenchemengine.representations.OGBBondFeaturizer#
Bases:
BondFeaturizer
Creates a vector representation for a single bond using the Open Graph Benchmark’s bond_to_feature_vector function.
- convert(bond: _MockObject.Chem.Bond)#
- property length#
- class olorenchemengine.representations.OlorenCheckpoint(model_path: str, num_tasks: int = 2048, log: bool = True, **kwargs)#
Bases:
BaseCompoundVecRepresentation
Use OlorenVec from checkpoint as a molecular representation
- Parameters:
model_path (str) – path to the OlorenVec checkpoint file
num_tasks (int) – number of tasks of the checkpointed model
log (bool) – whether to log the representation or not
- classmethod AllInstances()#
AllInstances returns a list of all standard instances of all subclasses of BaseClass.
Standard instances means that all required parameters for instantiation of the subclasses are set with canonical values.
- molecule2graph(mol, include_mol=False)#
Convert a molecule to a PyG graph with features and labels
- Parameters:
mol (rdkit.Chem.rdchem.Mol) – molecule to convert
include_mol (bool, optional) – Whether or not include the molecule in the graph. Defaults to False.
- Returns:
PyG graph
- Return type:
graph
- smiles2pyg(smiles_str, y, morgan_params={'nBits': 1024, 'radius': 2})#
Convert a SMILES string to a PyG graph with features and labels
- class olorenchemengine.representations.PeptideDescriptors1(log=True, **kwargs)#
- class olorenchemengine.representations.PubChemFingerprint#
Bases:
BaseCompoundVecRepresentation
PubChem Fingerprint
Implemented as a fingerprint which runs locally (rather than by calling the PubChem Fingerprint (PCFP) webservice), using RDKit to calculate the fingerprint.
Specs described in ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt. Search patterns from https://bitbucket.org/caodac/pcfp/src/master/src/tripod/fingerprint/PCFP.java.
- class olorenchemengine.representations.PubChemFingerprint_local#
Bases:
BaseCompoundVecRepresentation
PubChem Fingerprint
Implemented as a fingerprint which runs locally (rather than by calling the PubChem Fingerprint (PCFP) webservice), using RDKit to calculate the fingerprint.
On a validation set of 400 compounds from the FDA Orange Book, PubChemFingerprint_local matches the PubChem server-based version on 331/400 compounds, and is within 1 bit on 360/400 compounds. There are however 28/400 compounds where it is between 50 and 100 bits off.
Specs described in ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt. Search patterns from https://bitbucket.org/caodac/pcfp/src/master/src/tripod/fingerprint/PCFP.java.
- class olorenchemengine.representations.TorchGeometricGraph(atom_featurizer: ~olorenchemengine.representations.AtomFeaturizer = <olorenchemengine.representations.OGBAtomFeaturizer object>, bond_featurizer: ~olorenchemengine.representations.BondFeaturizer = <olorenchemengine.representations.OGBBondFeaturizer object>, **kwargs)#
Bases:
BaseRepresentation
Representation which returns torch_geometric.data.Data objects.
- Parameters:
atom_featurizer (AtomFeaturizer) – featurizer for atoms
bond_featurizer (BondFeaturizer) – featurizer for bonds
- _convert(self, smiles: str, y: Any = None) -> Data: converts a single SMILES string to a torch_geometric.data.Data object
- convert(Xs: Union[list, _MockObject.DataFrame, dict, str], ys: Optional[Union[list, _MockObject.Series, _MockObject.ndarray]] = None, **kwargs) List[Any] #
Converts input data to a list of representations :param Xs: input data :type Xs: Union[list, pd.DataFrame, dict, str] :param ys: target values of the input data :type ys: Union[list, pd.Series, np.ndarray]=None
- Returns:
list of representations of the input data
- Return type:
List[Any]
- property dimensions#
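A minimal conversion sketch, using the default OGB featurizers from the signature above (assuming TorchGeometricGraph is exported at the package top level):

import olorenchemengine as oce

rep = oce.TorchGeometricGraph()
graphs = rep.convert(["CCO", "c1ccccc1"])  # list of torch_geometric.data.Data objects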
- olorenchemengine.representations.countAnyRing(mol, rings, size)#
- olorenchemengine.representations.countAromaticRing(mol, rings)#
- olorenchemengine.representations.countHeteroAromaticRing(mol, rings)#
- olorenchemengine.representations.countHeteroInRing(mol, rings, size)#
- olorenchemengine.representations.countNitrogenInRing(mol, rings, size)#
- olorenchemengine.representations.countSaturatedOrAromaticCarbonOnlyRing(mol, rings, size)#
- olorenchemengine.representations.countSaturatedOrAromaticHeteroContainingRing(mol, rings, size)#
- olorenchemengine.representations.countSaturatedOrAromaticNitrogenContainingRing(mol, rings, size)#
- olorenchemengine.representations.countUnsaturatedCarbonOnlyRing(mol, rings, size)#
- olorenchemengine.representations.countUnsaturatedHeteroContainingRing(mol, rings, size)#
- olorenchemengine.representations.countUnsaturatedNitrogenContainingRing(mol, rings, size)#
- olorenchemengine.representations.get_valid_combinations(sets)#
- olorenchemengine.representations.isAromaticRing(mol, atoms)#
- olorenchemengine.representations.isCarbonOnlyRing(mol, atoms)#
- olorenchemengine.representations.isRingSaturated(mol, atoms)#
- olorenchemengine.representations.isRingUnsaturated(mol, atoms, all_rings)#
olorenchemengine.splitters module#
For creating splits on the data
- class olorenchemengine.splitters.BaseSplitter(split_proportions=[0.8, 0.1, 0.1], log=True)#
Bases:
BaseDatasetTransform
Base class for all splitters.
- Parameters:
split_proportions (tuple[int]) – Tuple of train/val/test proportions of data to split into.
log (bool) – Whether to log the data or not.
- abstract split(data, *args, **kwargs)#
Split data into train/val/test sets.
- Parameters:
data (pandas.DataFrame) – Dataset to split, must have a structure column.
- Returns:
Tuple of training, validation, and testing dataframes.
- Return type:
(tuple)
- transform(dataset: BaseDataset, *args, **kwargs) BaseDataset #
Applies a transformation onto the inputted BaseDataset.
- Parameters:
dataset (BaseDataset) – The dataset to transform.
- class olorenchemengine.splitters.DateSplitter(log=True, **kwargs)#
Bases:
BaseSplitter
Split data into train/val/test sets by date range.
- Parameters:
split_proportions (tuple[int]) – Tuple of train/val/test proportions of data to split into.
log (bool) – Whether to log the data or not.
- split(data, date_col)#
Return array of train/val/test dataframes in format [train, val, test].
Example
import olorenchemengine as oce

df = pd.read_csv("Your Dataset")
dataset = (
    oce.BaseDataset(data = df.to_csv(), structure_col = "SMILES COLUMN", property_col = "PROPERTY COLUMN")
    + oce.DateSplitter(split_proportions = [0.8, 0.1, 0.1], date_col = "DATE COLUMN")
)
#OR
train, val, test = oce.DateSplitter(split_proportions = [0.8, 0.1, 0.1], date_col = "DATE COLUMN").split(df)
- split(data, date_col, *args, **kwargs)#
Split data into train/val/test sets.
- Parameters:
data (pandas.DataFrame) – Dataset to split, must have a structure column.
- Returns:
Tuple of training, validation, and testing dataframes.
- Return type:
(tuple)
- class olorenchemengine.splitters.PropertySplit(property_col, threshold=None, noise=0.1, categorical=False, log=True, **kwargs)#
Bases:
BaseSplitter
Split molecules into train/val/test based on user-defined property.
- Parameters:
property_col (string) – column in dataset with property values to split data on
threshold (float, optional) – user-defined value to split the data on. If set to None (default), the threshold will be determined from split_proportions. The user defines a single threshold for a train/test split.
noise (float) – random noise to add to the dataset before splitting. Note: data is min-max scaled to the [0, 1] range before noise is introduced.
categorical (bool) – Set True to convert property values to categorical format ([0, 1, 2]) based on the threshold.
split(data): Return array of train/val/test dataframes in format [train, val, test].
Example

import olorenchemengine as oce

df = pd.read_csv("Your Dataset")
dataset = (
    oce.BaseDataset(data = df.to_csv(), structure_col = "SMILES COLUMN", property_col = "PROPERTY COLUMN")
    + oce.PropertySplit(split_proportions = [0.8, 0.1, 0.1], property_col = "PROPERTY COLUMN", threshold = 0.5, noise = 0.1, categorical = False)
)
#OR
train, val, test = oce.PropertySplit(split_proportions = [0.8, 0.1, 0.1], property_col = "PROPERTY COLUMN", threshold = 0.5, noise = 0.1, categorical = False).split(df)
- split(data: _MockObject.DataFrame, *args, **kwargs)#
Split data into train/val/test sets.
- Parameters:
data (pandas.DataFrame) – Dataset to split, must have a structure column.
- Returns:
Tuple of training, validation, and testing dataframes.
- Return type:
(tuple)
- class olorenchemengine.splitters.RandomSplit(log=True, **kwargs)#
Bases:
BaseSplitter
Split data randomly into train/val/test sets.
- Parameters:
data (pandas.DataFrame) – Dataset to split.
split_proportions (tuple[int]) – Tuple of train/val/test proportions of data to split into.
log (bool) – Whether to log the data or not.
- split(data)#
Return array of train/val/test dataframes in format [train, val, test].
Example
import olorenchemengine as oce

df = pd.read_csv("Your Dataset")
dataset = (
    oce.BaseDataset(data = df.to_csv(), structure_col = "SMILES COLUMN", property_col = "PROPERTY COLUMN")
    + oce.RandomSplit(split_proportions = [0.8, 0.1, 0.1])
)
#OR
train, val, test = oce.RandomSplit(split_proportions = [0.8, 0.1, 0.1]).split(df)
- split(data, *args, **kwargs)#
Split data into train/val/test sets.
- Parameters:
data (pandas.DataFrame) – Dataset to split, must have a structure column.
- Returns:
Tuple of training, validation, and testing dataframes.
- Return type:
(tuple)
- class olorenchemengine.splitters.ScaffoldSplit(scaffold_filter_threshold: int = 0, split_type='murcko', log=True, **kwargs)#
Bases:
BaseSplitter
Split data into train/val/test sets by scaffold. Makes sure that the same Bemis-Murcko scaffold is not used in both train and test.
- Parameters:
scaffold_filter_threshold (float) – Threshold for minimum number of compounds per scaffold class for a scaffold class to be included.
split_proportions (tuple[int]) – Tuple of train/val/test proportions of data to split into.
split_type (string) – type of split. "murcko": split data by Bemis-Murcko scaffold. "kmeans_murcko": split data by k-means clustering of Murcko scaffolds.
- split(data, structure_col)#
Return array of train/val/test dataframes in format [train, val, test].
Example
import olorenchemengine as oce

df = pd.read_csv("Your Dataset")
dataset = (
    oce.BaseDataset(data = df.to_csv(), structure_col = "SMILES COLUMN", property_col = "PROPERTY COLUMN")
    + oce.ScaffoldSplit(split_proportions = [0.8, 0.1, 0.1], scaffold_filter_threshold = 5, split_type = "murcko")
)
#OR
train, val, test = oce.ScaffoldSplit(split_proportions = [0.8, 0.1, 0.1], scaffold_filter_threshold = 5, split_type = "murcko").split(df, structure_col = "SMILES COLUMN")
- split(data: _MockObject.DataFrame, *args, structure_col: str = 'Smiles', **kwargs)#
Split data into train/val/test sets.
- Parameters:
data (pandas.DataFrame) – Dataset to split, must have a structure column.
- Returns:
Tuple of training, validation, and testing dataframes.
- Return type:
(tuple)
- class olorenchemengine.splitters.StratifiedSplitter(value_col, log=True, **kwargs)#
Bases:
BaseSplitter
Split data into train/val/test sets stratified by a value column (generally the label).
- Parameters:
value_col (str) – The column to stratify the split by, generally the label column.
log (bool) – Whether to log the data or not.
- split(data)#
Return array of train/val/test dataframes in format [train, val, test].
Example
import olorenchemengine as oce

df = pd.read_csv("Your Dataset")
dataset = (
    oce.BaseDataset(data = df.to_csv(), structure_col = "SMILES COLUMN", property_col = "PROPERTY COLUMN")
    + oce.StratifiedSplitter(split_proportions = [0.8, 0.1, 0.1], value_col = "PROPERTY COLUMN")
)
#OR
train, val, test = oce.StratifiedSplitter(split_proportions = [0.8, 0.1, 0.1], value_col = "PROPERTY COLUMN").split(df)
- split(data, *args, **kwargs)#
Split data into train/val/test sets.
- Parameters:
data (pandas.DataFrame) – Dataset to split, must have a structure column.
- Returns:
Tuple of training, validation, and testing dataframes.
- Return type:
(tuple)
- class olorenchemengine.splitters.dc_ScaffoldSplit(log=True, **kwargs)#
Bases:
BaseSplitter
Split data into train/val/test sets by scaffold using DeepChem implementation. https://deepchem.readthedocs.io/en/latest/api_reference/splitters.html#scaffoldsplitter
- Parameters:
split_proportions (tuple[int]) – Tuple of train/val/test proportions of data to split into.
- split(data, structure_col)#
Return array of train/val/test dataframes in format [train, val, test].
Example
import olorenchemengine as oce

df = pd.read_csv("Your Dataset")
dataset = (
    oce.BaseDataset(data = df.to_csv(), structure_col = "SMILES COLUMN", property_col = "PROPERTY COLUMN")
    + oce.dc_ScaffoldSplit(split_proportions = [0.8, 0.1, 0.1])
)
#OR
train, val, test = oce.dc_ScaffoldSplit(split_proportions = [0.8, 0.1, 0.1]).split(df, structure_col = "SMILES COLUMN")
- split(data: _MockObject.DataFrame, *args, structure_col: str = 'Smiles', **kwargs)#
Split data into train/val/test sets.
- Parameters:
data (pandas.DataFrame) – Dataset to split, must have a structure column.
- Returns:
Tuple of training, validation, and testing dataframes.
- Return type:
(tuple)
- class olorenchemengine.splitters.gg_ScaffoldSplit(log=True, **kwargs)#
Bases:
BaseSplitter
Split data into train/val/test sets by scaffold using implementation from https://www.nature.com/articles/s42256-021-00438-4, https://github.com/PaddlePaddle/PaddleHelix
- Parameters:
split_proportions (tuple[int]) – Tuple of train/val/test proportions of data to split into.
- split(data, structure_col)#
Return array of train/val/test dataframes in format [train, val, test].
Example
import olorenchemengine as oce

df = pd.read_csv("Your Dataset")
dataset = (
    oce.BaseDataset(data = df.to_csv(), structure_col = "SMILES COLUMN", property_col = "PROPERTY COLUMN")
    + oce.gg_ScaffoldSplit(split_proportions = [0.8, 0.1, 0.1])
)
#OR
train, val, test = oce.gg_ScaffoldSplit(split_proportions = [0.8, 0.1, 0.1]).split(df, structure_col = "SMILES COLUMN")
- generate_scaffold(smiles, include_chirality=False)#
Obtain Bemis-Murcko scaffold from smiles.
- Parameters:
smiles – smiles sequence
include_chirality – Default=False
- Returns:
the scaffold of the given smiles.
- gg_split(dataset, frac_train=None, frac_valid=None, frac_test=None, structure_col='smiles')#
- Parameters:
dataset (InMemoryDataset) – the dataset to split. Make sure each element in the dataset has key “smiles” which will be used to calculate the scaffold.
frac_train (float) – the fraction of data to be used for the train split.
frac_valid (float) – the fraction of data to be used for the valid split.
frac_test (float) – the fraction of data to be used for the test split.
- split(data: _MockObject.DataFrame, *args, structure_col: str = 'smiles', **kwargs)#
Split data into train/val/test sets.
- Parameters:
data (pandas.DataFrame) – Dataset to split, must have a structure column.
- Returns:
Tuple of training, validation, and testing dataframes.
- Return type:
(tuple)
olorenchemengine.uncertainty module#
Techniques for quantifying uncertainty and estimating confidence intervals for all oce models.
- class olorenchemengine.uncertainty.ADAN(criterion: str = 'Category', rep: BaseCompoundVecRepresentation = None, dim_reduction: str = 'pls', explvar: float = 0.8, threshold: float = 0.95, log=True, **kwargs)#
Bases:
BaseErrorModel
Applicability Domain Analysis
ADAN is an error model that predicts error bars based on one or multiple ADAN categories, from: Applicability Domain Analysis (ADAN): A Robust Method for Assessing the Reliability of Drug Property Predictions.
- Parameters:
criterion (str) – the ADAN criterion or criteria to use
rep (BaseCompoundVecRepresentation) – the representation to use; if None, uses the representation of the BaseModel object
dim_reduction ({"pls", "pca"}) – the dimensionality reduction to use
explvar (float) – the number of dimensionality reduction components to keep, as a proportion of total variance
threshold (float) – the threshold for a query to be considered within its standard range
Example
import olorenchemengine as oce

model = oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000)
model.fit_cv(train["Drug"], train["Y"], error_model = oce.ADAN("E_raw"))
model.predict(test["Drug"], return_ci = True)
- DModX(X: _MockObject.ndarray, Xp: _MockObject.ndarray) _MockObject.ndarray #
Computes the distance to the model (DmodX).
Computes the distance between a datapoint and the PLS model plane. See <https://www.jmp.com/support/help/en/15.2/index.shtml#page/jmp/dmodx-calculation.shtml> for more details about the statistic.
- Parameters:
X (np.ndarray) – queries
Xp (np.ndarray) – queries transformed into latent space
- SDEP(Xp: _MockObject.ndarray, n_drop: int = 0, neighbor_thresh: float = 0.05) _MockObject.ndarray #
Computes the standard deviation error of predictions (SDEP).
Computes the standard deviation training error of the neighbor_thresh fraction of closest training queries to each query in Xp in latent space.
- calculate_full(X, standardize: bool = True)#
Calculates complete confidence scores for visualization.
- preprocess(X, y=None)#
Preprocesses data into the appropriate representation.
- class olorenchemengine.uncertainty.AggregateErrorModel(*error_models: ~olorenchemengine.base_class.BaseErrorModel, reduction: ~olorenchemengine.base_class.BaseReduction = <olorenchemengine.reduction.FactorAnalysis object>, log=True, **kwargs)#
Bases:
BaseErrorModel
AggregateErrorModel estimates uncertainty by aggregating uncertainty scores from several different BaseErrorModels.
- Parameters:
error_models (BaseErrorModel) – the error models to aggregate
reduction (BaseReduction) – the reduction used to aggregate the scores. Must output 1 component. Default FactorAnalysis().
Example
import olorenchemengine as oce

model = oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000)
model.fit(train["Drug"], train["Y"])
error_model = oce.AggregateErrorModel(error_models = [oce.TargetDistDC(), oce.TrainDistDC()])
error_model.build(model, train["Drug"], train["Y"])
error_model.fit(valid["Drug"], valid["Y"])
error_model.score(test["Drug"])
- calculate(X: Union[_MockObject.DataFrame, _MockObject.ndarray, list, _MockObject.Series], y_pred: _MockObject.ndarray) _MockObject.ndarray #
Computes aggregate error model score from inputs.
- Parameters:
X – features, smiles
y_pred – predicted values
- fit(X: Union[_MockObject.DataFrame, _MockObject.ndarray, list, _MockObject.Series], y: Union[_MockObject.ndarray, list, _MockObject.Series], **kwargs)#
Fits confidence scores to an external dataset
- Parameters:
X (array-like) – features, smiles
y (array-like) – true values
- Returns:
plotly figure of fitted model against validation dataset
- class olorenchemengine.uncertainty.BaseEnsembleModel(ensemble_model=None, n_ensembles=16, log=True, **kwargs)#
Bases:
BaseErrorModel
BaseEnsembleModel is the base class for error models that estimate uncertainty based on the variance of an ensemble of models.
- calculate(X, y_pred)#
To be implemented by the child class; calculates confidence scores from inputs.
- Parameters:
X – features, list of SMILES
y_pred (1-dimensional np.ndarray) – predicted values
- Returns:
scores (1-dimensional np.ndarray)
- class olorenchemengine.uncertainty.BaseFingerprintModel(radius=2, log=True, **kwargs)#
Bases:
BaseErrorModel
Morgan fingerprint-based error models.
BaseFingerprintModel is the base class for error models that require the computation of Morgan fingerprints.
- class olorenchemengine.uncertainty.BaseKernelError(kernel='power', h=3, log=True, **kwargs)#
Bases:
BaseFingerprintModel
Base class for kernel methods of uncertainty quantification.
- class olorenchemengine.uncertainty.BootstrapEnsemble(ensemble_model=None, n_ensembles=12, bootstrap_size=0.25, log=True, **kwargs)#
Bases:
BaseEnsembleModel
Variance of an ensemble of bootstrapped models
BootstrapEnsemble estimates uncertainty based on the variance of several models trained on bootstrapped samples of the training data.
- Parameters:
ensemble_model (BaseModel) – the model used to build the ensemble
n_ensembles (int) – the number of models in the ensemble
bootstrap_size (float) – the size of each bootstrap sample, as a fraction of the training data
Example
import olorenchemengine as oce

model = oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000)
model.fit_cv(train["Drug"], train["Y"], error_model = oce.BootstrapEnsemble(n_ensembles = 10))
model.predict(test["Drug"], return_ci = True)
- class olorenchemengine.uncertainty.KNNSimilarity(*args, **kwargs)#
Bases:
BaseDepreceated
- class olorenchemengine.uncertainty.KernelDistanceError(kernel='power', h=3, weighted=True, log=True, **kwargs)#
Bases:
BaseKernelError
Kernel distance error model.
KernelDistanceError uses an average of kernel distances to each molecule in the training set as the covariate for estimating confidence intervals. The distance function used is 1 - Tanimoto Similarity.
- Parameters:
kernel (str) – Kernel used as a weight-function. Default "power".
h (int or float) – Bandwidth for most kernels; number of nearest neighbors for the nearest_neighbor kernel.
weighted (bool) – If True, returns a kernel-weighted average of Tanimoto similarity. If False, returns an average kernel distance.
Example
# 5-nearest neighbor mean
error_model = oce.KernelDistanceError(kernel = "nearest_neighbor", h = 5, weighted = True)

# Sum of Distance-weighted Contributions (SDC)
error_model = oce.KernelDistanceError(kernel = "sdc", h = 3, weighted = False)
- calculate(X, y_pred)#
To be implemented by the child class; calculates confidence scores from inputs.
- Parameters:
X – features, list of SMILES
y_pred (1-dimensional np.ndarray) – predicted values
- Returns:
scores (1-dimensional np.ndarray)
- class olorenchemengine.uncertainty.KernelRegressionError(kernel='power', h=3, predictor='property', log=True, **kwargs)#
Bases:
BaseKernelError
Kernel regression error model.
KernelRegressionError uses a kernel-weighted average of prediction errors as the covariate for estimating confidence intervals. It is inspired by the Nadaraya-Watson estimator, which generates a regression using a kernel-weighted average. The distance function used is 1 - Tanimoto Similarity.
This is the recommended error model for general purposes and models.
- Parameters:
kernel (str) – Kernel used as a weight-function. Default "power".
h (int or float) – Bandwidth for most kernels; number of nearest neighbors for the nearest_neighbor kernel.
predictor (str, {"property", "residual"}) – Error predictor being estimated.
Example
import olorenchemengine as oce

model = oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000)
model.fit_cv(train["Drug"], train["Y"], error_model = oce.KernelRegressionError())
model.predict(test["Drug"], return_ci = True)
- calculate(X, y_pred)#
To be implemented by the child class; calculates confidence scores from inputs.
- Parameters:
X – features, list of SMILES
y_pred (1-dimensional np.ndarray) – predicted values
- Returns:
scores (1-dimensional np.ndarray)
- class olorenchemengine.uncertainty.Naive(log=True, **kwargs)#
Bases:
BaseErrorModel
Static confidence intervals
Naive is an error model that predicts a uniform confidence interval based on the errors of the fitting dataset. Used exclusively for benchmarking error models.
- calculate(X, y_pred)#
To be implemented by the child class; calculates confidence scores from inputs.
- Parameters:
X – features, list of SMILES
y_pred (1-dimensional np.ndarray) – predicted values
- Returns:
scores (1-dimensional np.ndarray)
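A minimal benchmarking sketch, mirroring the fit_cv pattern used by the other error models in this module (assuming Naive is exported at the package top level):

import olorenchemengine as oce

model = oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000)
model.fit_cv(train["Drug"], train["Y"], error_model = oce.Naive())
model.predict(test["Drug"], return_ci = True)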
- class olorenchemengine.uncertainty.Predicted(log=True, **kwargs)#
Bases:
BaseErrorModel
Predicted value
Predicted is an error model that predicts error bars based on only the predicted value of a molecule.
Example
import olorenchemengine as oce

model = oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000)
model.fit_cv(train["Drug"], train["Y"], error_model = oce.AggregateErrorModel([oce.SDC(), oce.Predicted()]))
model.predict(test["Drug"], return_ci = True)
- calculate(X, y_pred)#
To be implemented by the child class; calculates confidence scores from inputs.
- Parameters:
X – features, list of SMILES
y_pred (1-dimensional np.ndarray) – predicted values
- Returns:
scores (1-dimensional np.ndarray)
- class olorenchemengine.uncertainty.RandomForestEnsemble(log=True, **kwargs)#
Bases:
BaseEnsembleModel
Ensemble of random forests
RandomForestEnsemble estimates uncertainty based on the variance of several random forest models initialized to different random states.
- Parameters:
ensemble_model (BaseModel) – the model used to build the ensemble
n_ensembles (int) – the number of models in the ensemble
Example
import olorenchemengine as oce

model = oce.RandomForestModel(representation = oce.MorganVecRepresentation(radius=2, nbits=2048), n_estimators = 1000)
model.fit_cv(train["Drug"], train["Y"], error_model = oce.RandomForestEnsemble(n_ensembles = 10))
model.predict(test["Drug"], return_ci = True)
- class olorenchemengine.uncertainty.SDC(*args, **kwargs)#
Bases:
BaseDepreceated
- class olorenchemengine.uncertainty.TargetDistDC(*args, **kwargs)#
Bases:
BaseDepreceated
- class olorenchemengine.uncertainty.TrainDistDC(*args, **kwargs)#
Bases:
BaseDepreceated
Module contents#
- olorenchemengine.BACEDataset()#
- olorenchemengine.ExampleDataFrame()#
- olorenchemengine.ExampleDataset()#
- olorenchemengine.MISSING_DEPENDENCIES()#
- olorenchemengine.create_config_default_param(param: str, value: Union[str, int, float, bool])#
Create a default configuration parameter.
- Parameters:
param – the parameter to create.
value – the value to set the parameter to.
- olorenchemengine.online(session_url='https://aws.chemengine.org')#
- olorenchemengine.remove_config_param(param: str)#
Remove a configuration parameter.
- Parameters:
param – the parameter to remove.
- olorenchemengine.set_config_param(param: str, value: Union[str, int, float, bool])#
Set a configuration parameter.
- Parameters:
param – the parameter to set.
value – the value to set the parameter to.
- olorenchemengine.test_oce()#
Convenience function to test all functions of the oce package.
- olorenchemengine.update_config()#
Update the configuration file.
This function is called when a new parameter is added to the configuration file.
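A minimal sketch of the configuration helpers above (the parameter name is a placeholder, not a documented key):

import olorenchemengine as oce

oce.create_config_default_param("MY_PARAM", 0)  # default value used when the parameter is unset
oce.set_config_param("MY_PARAM", 1)
oce.remove_config_param("MY_PARAM")
oce.update_config()                             # refresh the configuration file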