olorenchemengine.external.mol2vec package#

Submodules#

olorenchemengine.external.mol2vec.main module#

class olorenchemengine.external.mol2vec.main.Mol2Vec#

Bases: BaseVecRepresentation

olorenchemengine.external.mol2vec.operations module#

BSD 3-Clause License

Copyright (c) 2017, Mol2vec developers All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  • Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

class olorenchemengine.external.mol2vec.operations.DfVec(vec)#

Bases: object

Helper class to store vectors in a pandas DataFrame

Parameters:

vec (np.array) –

class olorenchemengine.external.mol2vec.operations.MolSentence(sentence)#

Bases: object

Class for storing mol sentences in pandas DataFrame

contains(word)#

The contains (and __contains__) method enables expressions such as ‘Word’ in MolSentence.
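The membership behaviour described above can be sketched as follows. This is a minimal illustration, not the library class: `SentenceBox` and its attribute names are hypothetical, and the word strings are made-up identifiers.

```python
class SentenceBox:
    """Hypothetical stand-in for a sentence wrapper; not the library class."""

    def __init__(self, sentence):
        # Store the list of words (substructure identifiers as strings).
        self.sentence = list(sentence)

    def contains(self, word):
        # Explicit membership check on the stored words.
        return word in self.sentence

    # Aliasing the method lets Python's `in` operator use the same check.
    __contains__ = contains


box = SentenceBox(["2246728737", "864662311"])
has_word = "864662311" in box   # goes through __contains__
missing = box.contains("0")     # direct method call
```

Aliasing `__contains__` to `contains` keeps a single implementation behind both the method call and the `in` operator.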

olorenchemengine.external.mol2vec.operations.featurize(in_file, out_file, model_path, r, uncommon=None)#

Featurize molecules from an SDF or SMI file. SMILES are regenerated with RDKit to get canonical SMILES without chirality information.

Parameters:
  • in_file (str) – Input SDF, SMI, ISM (or GZ)

  • out_file (str) – Output csv

  • model_path (str) – File path to pre-trained Gensim word2vec model

  • r (int) – Radius of morgan fingerprint

  • uncommon (str) – String used to replace uncommon words/identifiers while training. The vector obtained for ‘uncommon’ will be used to encode new (unseen) identifiers

olorenchemengine.external.mol2vec.operations.generate_corpus(in_file, out_file, r, sentence_type='alt', n_jobs=1)#

Generates corpus file from sdf

Parameters:
  • in_file (str) – Input sdf

  • out_file (str) – Outfile name prefix, suffix is either _r0, _r1, etc. or _alt_r1 (max radius in alt sentence)

  • r (int) – Radius of morgan fingerprint

  • sentence_type (str) – Options:

    ‘all’ - generates all corpus files for all types of sentences,

    ‘alt’ - generates a corpus file with only combined alternating sentence,

    ‘individual’ - generates corpus files for each radius

  • n_jobs (int) – Number of cores to use (only ‘alt’ sentence type is parallelized)

olorenchemengine.external.mol2vec.operations.insert_unk(corpus, out_corpus, threshold=3, uncommon='UNK')#

Handling of uncommon “words” (i.e. identifiers). It finds all least common identifiers (defined by threshold) and replaces them by ‘uncommon’ string.

Parameters:
  • corpus (str) – Input corpus file

  • out_corpus (str) – Outfile corpus file

  • threshold (int) – Number of identifier occurrences to consider it uncommon

  • uncommon (str) – String to use to replace uncommon words/identifiers
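The replacement step can be sketched on an in-memory corpus. This is a minimal illustration under stated assumptions: `replace_uncommon` is a hypothetical helper that works on a list of lines instead of files, and it treats a word as uncommon when its count is less than or equal to the threshold, which may differ from the comparison the library uses.

```python
from collections import Counter


def replace_uncommon(lines, threshold=3, uncommon="UNK"):
    """Replace words that occur at most `threshold` times with `uncommon`.

    Assumption: "uncommon" means count <= threshold; the library may use a
    different comparison.
    """
    # One sentence per line, words separated by whitespace.
    tokenized = [line.split() for line in lines]
    # Count every word across the whole corpus.
    counts = Counter(word for words in tokenized for word in words)
    rare = {word for word, n in counts.items() if n <= threshold}
    # Rebuild each line, swapping rare words for the placeholder.
    return [
        " ".join(uncommon if word in rare else word for word in words)
        for words in tokenized
    ]


corpus = ["a b c", "a b", "a"]
cleaned = replace_uncommon(corpus, threshold=1)
```

Here only `c` occurs once, so it is the only word replaced when `threshold=1`.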

olorenchemengine.external.mol2vec.operations.mol2alt_sentence(mol, radius)#

Same as mol2sentence() except it only returns the alternating sentence. Calculates ECFP (Morgan fingerprint) and returns identifiers of substructures from all radii combined as a ‘sentence’.

NOTE: Words are ALWAYS reordered according to atom order in the input mol object.

NOTE: Due to the way Morgan FPs are generated, the number of identifiers at each radius is smaller

Parameters:
  • mol (rdkit.Chem.rdchem.Mol) –

  • radius (float) – Fingerprint radius

Returns:

alternating sentence (identifiers from all radii combined)

Return type:

list

olorenchemengine.external.mol2vec.operations.mol2sentence(mol, radius)#

Calculates ECFP (Morgan fingerprint) and returns identifiers of substructures as ‘sentence’ (string). Returns a tuple with 1) a list with sentence for each radius and 2) a sentence with identifiers from all radii combined.

NOTE: Words are ALWAYS reordered according to atom order in the input mol object.

NOTE: Due to the way Morgan FPs are generated, the number of identifiers at each radius is smaller

Parameters:
  • mol (rdkit.Chem.rdchem.Mol) –

  • radius (float) – Fingerprint radius

Returns:

  • identifier sentence – List with sentences for each radius

  • alternating sentence – Sentence (list) with identifiers from all radii combined
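The difference between the per-radius sentences and the alternating sentence can be sketched without RDKit. This is a toy illustration: `build_sentences` and the `identifiers` layout (atom index mapped to a dict of radius to identifier) are hypothetical stand-ins for what a Morgan fingerprint calculation would provide, and the integer values are made up.

```python
def build_sentences(identifiers, radius):
    """Build per-radius and alternating sentences from precomputed identifiers.

    `identifiers` maps atom index -> {radius: identifier}; this layout is a
    hypothetical stand-in for the output of a Morgan fingerprint calculation.
    """
    radii = range(radius + 1)
    atoms = sorted(identifiers)  # keep words in atom order

    # One sentence per radius: identifiers found at that radius, atom by atom.
    per_radius = [
        [str(identifiers[a][r]) for a in atoms if r in identifiers[a]]
        for r in radii
    ]
    # Alternating sentence: for each atom, its identifiers at radius 0, 1, ...
    alternating = [
        str(identifiers[a][r])
        for a in atoms
        for r in radii
        if r in identifiers[a]
    ]
    return per_radius, alternating


# Atom 1 has no identifier at radius 1 in this toy input.
toy = {0: {0: 11, 1: 12}, 1: {0: 21}, 2: {0: 31, 1: 32}}
per_radius, alternating = build_sentences(toy, radius=1)
```

In the toy input the radius-1 sentence is shorter than the radius-0 sentence, and the alternating sentence interleaves radii while following atom order.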

olorenchemengine.external.mol2vec.operations.remove_salts_solvents(smiles, hac=3)#

Remove solvents and ions that have at most ‘hac’ heavy atoms. This function removes any fragment in the molecule whose number of heavy atoms is <= ‘hac’, so a removed fragment might not be an actual solvent or salt

Parameters:
  • smiles (str) – SMILES

  • hac (int) – Max number of heavy atoms

Returns:

smiles

Return type:

str

olorenchemengine.external.mol2vec.operations.sentences2vec(sentences, model, unseen=None)#

Generate vectors for each sentence (list) in a list of sentences. Vector is simply a sum of vectors for individual words.

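Parameters:
  • sentences (list) – List of sentences, each a list of words

  • model (word2vec.Word2Vec) – Trained word2vec model that supplies the word vectors

  • unseen (str) – Keyword for unseen words; if None, unseen words are skipped

Return type: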

np.array
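The summation can be sketched with plain Python containers. This is a minimal illustration under stated assumptions: `sum_sentence_vectors` is a hypothetical helper, a dict of word to list of floats stands in for the trained word2vec model, and unknown words are assumed to either fall back to the vector stored under the `unseen` key or be skipped when `unseen` is None.

```python
def sum_sentence_vectors(sentences, vectors, unseen=None):
    """Sum word vectors per sentence.

    `vectors` is a plain dict (word -> list of floats) standing in for a
    trained word2vec model; `unseen` names the fallback key, if any.
    """
    dim = len(next(iter(vectors.values())))
    result = []
    for sentence in sentences:
        total = [0.0] * dim
        for word in sentence:
            if word in vectors:
                vec = vectors[word]
            elif unseen is not None:
                vec = vectors[unseen]  # fallback vector for unknown words
            else:
                continue  # no fallback given: skip unknown words
            total = [t + v for t, v in zip(total, vec)]
        result.append(total)
    return result


toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 2.0], "UNK": [0.5, 0.5]}
summed = sum_sentence_vectors([["a", "b"], ["a", "x"]], toy_vectors, unseen="UNK")
skipped = sum_sentence_vectors([["a", "x"]], toy_vectors)
```

With a real model the per-word lookups would return NumPy arrays, so the element-wise addition would not need the explicit `zip`.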

olorenchemengine.external.mol2vec.operations.train_word2vec_model(infile_name, outfile_name=None, vector_size=100, window=10, min_count=3, n_jobs=1, method='skip-gram', **kwargs)#

Trains a word2vec (Mol2vec, ProtVec) model on a corpus file extracted from molecule/protein sequences. The corpus file is treated as a LineSentence corpus (one sentence = one line, words separated by whitespace).

Parameters:
  • infile_name (str) – Corpus file, e.g. proteins split in n-grams or compound identifier

  • outfile_name (str) – Name of output file where word2vec model should be saved

  • vector_size (int) – Number of dimensions of vector

  • window (int) – Number of words considered as context

  • min_count (int) – Number of occurrences a word should have to be considered in training

  • n_jobs (int) – Number of cpu cores used for calculation

  • method (str) – Method to use in model training. Options: cbow and skip-gram (default: skip-gram)

Return type:

word2vec.Word2Vec

Module contents#