olorenchemengine.external.mol2vec package#
Submodules#
olorenchemengine.external.mol2vec.main module#
- class olorenchemengine.external.mol2vec.main.Mol2Vec#
Bases:
BaseVecRepresentation
olorenchemengine.external.mol2vec.operations module#
BSD 3-Clause License
Copyright (c) 2017, Mol2vec developers All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- class olorenchemengine.external.mol2vec.operations.DfVec(vec)#
Bases:
object
Helper class to store vectors in a pandas DataFrame
- Parameters:
vec (np.array) –
- class olorenchemengine.external.mol2vec.operations.MolSentence(sentence)#
Bases:
object
Class for storing mol sentences in pandas DataFrame
- contains(word)#
The contains (and __contains__) method enables usage of “‘Word’ in MolSentence”
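The membership behaviour described above can be illustrated with a small stand-in class. This is only a sketch: SentenceSketch and its internal set are hypothetical and are not the library's implementation, and the identifier strings are made up.

```python
class SentenceSketch:
    """Hypothetical stand-in for a class that wraps a mol sentence."""

    def __init__(self, sentence):
        # Keep the words in their original order.
        self.sentence = list(sentence)
        # Assumption: a set of the words is kept for fast membership checks.
        self._words = set(self.sentence)

    def contains(self, word):
        """Return True if the word occurs in the sentence."""
        return word in self._words

    def __contains__(self, word):
        # Delegating to contains() is what makes `word in obj` work.
        return self.contains(word)

    def __len__(self):
        return len(self.sentence)


# Made-up identifier strings, for illustration only.
s = SentenceSketch(["2246728737", "864662311"])
found = "864662311" in s
missing = "12345" in s
```

Defining __contains__ lets callers write a plain `in` test instead of calling the method by name.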
- olorenchemengine.external.mol2vec.operations.featurize(in_file, out_file, model_path, r, uncommon=None)#
Featurize mols in SDF, SMI. SMILES are regenerated with RDKit to get canonical SMILES without chirality information.
- Parameters:
in_file (str) – Input SDF, SMI, ISM (or GZ)
out_file (str) – Output csv
model_path (str) – File path to pre-trained Gensim word2vec model
r (int) – Radius of morgan fingerprint
uncommon (str) – String used to replace uncommon words/identifiers while training. The vector obtained for ‘uncommon’ will be used to encode new (unseen) identifiers
- olorenchemengine.external.mol2vec.operations.generate_corpus(in_file, out_file, r, sentence_type='alt', n_jobs=1)#
Generates corpus file from sdf
- Parameters:
in_file (str) – Input sdf
out_file (str) – Outfile name prefix, suffix is either _r0, _r1, etc. or _alt_r1 (max radius in alt sentence)
r (int) – Radius of morgan fingerprint
sentence_type (str) – Options: ‘all’ - generates all corpus files for all types of sentences, ‘alt’ - generates a corpus file with only the combined alternating sentence, ‘individual’ - generates corpus files for each radius
n_jobs (int) – Number of cores to use (only ‘alt’ sentence type is parallelized)
- olorenchemengine.external.mol2vec.operations.insert_unk(corpus, out_corpus, threshold=3, uncommon='UNK')#
Handling of uncommon “words” (i.e. identifiers). It finds all least common identifiers (defined by threshold) and replaces them with the ‘uncommon’ string.
- Parameters:
corpus (str) – Input corpus file
out_corpus (str) – Output corpus file
threshold (int) – Number of identifier occurrences to consider it uncommon
uncommon (str) – String to use to replace uncommon words/identifiers
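The replacement of uncommon identifiers can be sketched on in-memory sentences rather than corpus files. The helper name replace_uncommon is hypothetical, and the cutoff is an assumption: here an identifier counts as uncommon when it occurs at most threshold times, which may differ from the library's exact comparison.

```python
from collections import Counter


def replace_uncommon(sentences, threshold=3, uncommon="UNK"):
    """Replace rarely seen identifiers with a placeholder string.

    Assumption: an identifier is rare when it occurs <= threshold times
    across the whole corpus.
    """
    # Count every identifier across all sentences.
    counts = Counter(word for sentence in sentences for word in sentence)
    rare = {word for word, n in counts.items() if n <= threshold}
    # Rebuild each sentence, swapping rare identifiers for the placeholder.
    return [[uncommon if w in rare else w for w in s] for s in sentences]


# Toy corpus: "a" occurs 5 times, "b" twice, "c" once.
corpus = [["a", "b", "a"], ["a", "c", "a"], ["b", "a"]]
cleaned = replace_uncommon(corpus, threshold=2)
```

With threshold=2, both "b" and "c" are replaced while "a" is kept.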
- olorenchemengine.external.mol2vec.operations.mol2alt_sentence(mol, radius)#
Same as mol2sentence() except it only returns the alternating sentence. Calculates ECFP (Morgan fingerprint) and returns identifiers of substructures as ‘sentence’ (string). Returns a tuple with 1) a list with sentence for each radius and 2) a sentence with identifiers from all radii combined. NOTE: Words are ALWAYS reordered according to atom order in the input mol object. NOTE: Due to the way Morgan FPs are generated, the number of identifiers at each radius is smaller
- Parameters:
mol (rdkit.Chem.rdchem.Mol) –
radius (float) – Fingerprint radius
- Returns:
list – Alternating sentence (identifiers from all radii combined)
- olorenchemengine.external.mol2vec.operations.mol2sentence(mol, radius)#
Calculates ECFP (Morgan fingerprint) and returns identifiers of substructures as ‘sentence’ (string). Returns a tuple with 1) a list with sentence for each radius and 2) a sentence with identifiers from all radii combined. NOTE: Words are ALWAYS reordered according to atom order in the input mol object. NOTE: Due to the way Morgan FPs are generated, the number of identifiers at each radius is smaller
- Parameters:
mol (rdkit.Chem.rdchem.Mol) –
radius (float) – Fingerprint radius
- Returns:
identifier sentence – List with sentences for each radius
alternating sentence – Sentence (list) with identifiers from all radii combined
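How the per-radius sentences and the alternating sentence relate can be sketched without RDKit. The function below is a hypothetical illustration: it takes a dictionary shaped like the bit information a Morgan fingerprint reports (identifier mapped to pairs of atom index and radius), and the identifier values are made up.

```python
def sentences_from_bit_info(bit_info, n_atoms, radius):
    """Build per-radius sentences and one alternating sentence.

    bit_info maps identifier -> iterable of (atom_index, radius) pairs.
    This is a sketch of the ordering logic, not the library's code.
    """
    radii = list(range(radius + 1))
    # One slot per atom and radius; None means no identifier was recorded.
    per_atom = {atom: {r: None for r in radii} for atom in range(n_atoms)}
    for identifier, hits in bit_info.items():
        for atom_idx, r in hits:
            per_atom[atom_idx][r] = identifier

    # One sentence per radius, words ordered by atom index.
    identifier_sentences = [
        [str(per_atom[a][r]) for a in per_atom if per_atom[a][r] is not None]
        for r in radii
    ]
    # Alternating sentence: for each atom in order, its words at radius 0, 1, ...
    alternating = [
        str(per_atom[a][r])
        for a in per_atom
        for r in radii
        if per_atom[a][r] is not None
    ]
    return identifier_sentences, alternating


# Made-up identifiers for a 3-atom molecule; atom 2 has no radius-1 entry.
bit_info = {
    101: ((0, 0), (2, 0)),
    202: ((1, 0),),
    303: ((0, 1),),
    404: ((1, 1),),
}
per_radius, alt = sentences_from_bit_info(bit_info, n_atoms=3, radius=1)
```

In this toy input the radius-1 sentence is shorter than the radius-0 sentence, which mirrors the note about fewer identifiers at larger radii.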
- olorenchemengine.external.mol2vec.operations.remove_salts_solvents(smiles, hac=3)#
Remove solvents and ions that have at most ‘hac’ heavy atoms. This function removes any fragment in the molecule whose number of heavy atoms is <= ‘hac’, even if the fragment is not an actual solvent or salt
- olorenchemengine.external.mol2vec.operations.sentences2vec(sentences, model, unseen=None)#
Generate vectors for each sentence (list) in a list of sentences. Vector is simply a sum of vectors for individual words.
- Parameters:
sentences (list, array) – List with sentences
model (word2vec.Word2Vec) – Gensim word2vec model
unseen (None, str) – Keyword for unseen words. If None, those words are skipped. See https://stats.stackexchange.com/questions/163005/how-to-set-the-dictionary-for-text-analysis-using-neural-networks/163032#163032
- Return type:
np.array
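The summation described above can be sketched with plain Python lists. This is an illustration under stated assumptions: sum_sentence_vectors is a hypothetical helper, and a dictionary of word vectors stands in for the Gensim model.

```python
def sum_sentence_vectors(sentences, vectors, unseen=None):
    """Sum word vectors per sentence.

    vectors is a plain dict (word -> list of floats) standing in for a
    trained model. Unknown words are skipped when unseen is None,
    otherwise they contribute the vector stored under the unseen keyword.
    """
    dim = len(next(iter(vectors.values())))
    result = []
    for sentence in sentences:
        total = [0.0] * dim
        for word in sentence:
            if word in vectors:
                vec = vectors[word]
            elif unseen is not None:
                vec = vectors[unseen]
            else:
                continue  # skip unknown words entirely
            total = [t + v for t, v in zip(total, vec)]
        result.append(total)
    return result


# Toy 2-dimensional vectors; "z" is not in the vocabulary.
vectors = {"a": [1.0, 0.0], "b": [0.0, 2.0], "UNK": [0.5, 0.5]}
skipped = sum_sentence_vectors([["a", "b", "z"]], vectors)
replaced = sum_sentence_vectors([["a", "b", "z"]], vectors, unseen="UNK")
```

Passing an unseen keyword makes unknown words contribute a learned placeholder vector instead of being dropped.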
- olorenchemengine.external.mol2vec.operations.train_word2vec_model(infile_name, outfile_name=None, vector_size=100, window=10, min_count=3, n_jobs=1, method='skip-gram', **kwargs)#
Trains word2vec (Mol2vec, ProtVec) model on a corpus file extracted from molecule/protein sequences. The corpus file is treated as a LineSentence corpus (one sentence = one line, words separated by whitespace)
- Parameters:
infile_name (str) – Corpus file, e.g. proteins split in n-grams or compound identifier
outfile_name (str) – Name of output file where word2vec model should be saved
vector_size (int) – Number of dimensions of vector
window (int) – Number of words considered as context
min_count (int) – Number of occurrences a word should have to be considered in training
n_jobs (int) – Number of cpu cores used for calculation
method (str) – Method to use in model training. Options: cbow and skip-gram (default: skip-gram)
- Return type:
word2vec.Word2Vec
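The corpus layout and the effect of min_count can be sketched with the standard library alone. The helpers read_line_sentences and vocabulary are hypothetical, and no model is trained here; the sketch only shows how one line maps to one sentence and which words would meet the minimum count.

```python
import os
import tempfile
from collections import Counter


def read_line_sentences(path):
    """Read a corpus file: one sentence per line, words split on whitespace."""
    with open(path, encoding="utf-8") as handle:
        return [line.split() for line in handle if line.strip()]


def vocabulary(sentences, min_count=3):
    """Return the words that occur at least min_count times in the corpus."""
    counts = Counter(w for s in sentences for w in s)
    return {w for w, n in counts.items() if n >= min_count}


# Write a tiny two-line corpus of made-up identifiers, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    corpus_path = os.path.join(tmp, "corpus.txt")
    with open(corpus_path, "w", encoding="utf-8") as handle:
        handle.write("101 303 202\n101 202 101\n")
    sentences = read_line_sentences(corpus_path)

vocab = vocabulary(sentences, min_count=3)
```

Only "101" occurs three times in this toy corpus, so it is the only word that would remain with min_count=3.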