3. Featurization for Machine Learning
import pyrfume

# Load the data we created in the last chapter
molecules = pyrfume.load_data("my_data/molecules.csv")
cids = molecules.index.tolist()
Many applications, such as predictive models, propose that the molecular structure of an odorant is causal to neural activity or behavior (a reasonable proposition). Training these models requires that each molecule be represented by a vector of predictors, or features, which in some way describe its structure quantitatively. Traditionally the commercial chemoinformatic feature-calculation software Dragon has been used for this, but because Dragon is commercial software this creates challenges for replication and reproducibility. Some have found that open-source feature-calculation packages such as Mordred perform as well, or nearly as well.
Feature calculation on molecules involves several steps, including: 1) expression of the structure in a standard representation (such as SMILES or InChI); 2) conversion of that structure to a molecule object in memory (e.g. an rdkit Mol); 3) cleanup such as addition of hydrogens (omitted from SMILES by design) and removal of salts; 4) embedding in 3 dimensions (without violating steric or electrostatic principles); 5) feature calculation on this 3D structure. Pyrfume abstracts all of this away into one step.
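For reference, the individual steps that Pyrfume wraps can be sketched directly with RDKit; this is a minimal sketch (hexanal is used as an arbitrary example molecule, and salt removal is skipped because this SMILES has no salts):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Step 2: parse a SMILES string into an in-memory Mol object
mol = Chem.MolFromSmiles("CCCCCC=O")  # hexanal

# Step 3: add explicit hydrogens (omitted from SMILES by design)
mol = Chem.AddHs(mol)

# Step 4: embed the molecule in 3D and relax the geometry
AllChem.EmbedMolecule(mol, randomSeed=0)
AllChem.MMFFOptimizeMolecule(mol)

# Step 5 (feature calculation) is what a package like Mordred
# then performs on this 3D structure.
```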
smiles = molecules["IsomericSMILES"].tolist()
smiles
['CC1=CC[C@@H](CC1)C(=C)C',
'CC(=O)C1=CC=CC=C1',
'CC1=CC[C@@H](CC1=O)C(=C)C',
'CCCCCC=O',
'CC/C=C\\CC=O']
from pyrfume.features import smiles_to_mordred
mordred_features = smiles_to_mordred(smiles)
Computing Mordred features...
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00, 2.30it/s]
There are 5 molecules and 1826 features
The code above implements all of the steps required for featurization (except for obtaining the SMILES string, which was done in Part 2). One can also obtain a SMILES string directly by calling pyrfume.odorants.cids_to_smiles.
mordred_features
ABC | ABCGG | nAcid | nBase | SpAbs_A | SpMax_A | SpDiam_A | SpAD_A | SpMAD_A | LogEE_A | ... | SRW10 | TSRW10 | MW | AMW | WPath | WPol | Zagreb1 | Zagreb2 | mZagreb1 | mZagreb2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CC1=CC[C@@H](CC1)C(=C)C | 7.358797 | 6.909742 | 0 | 0 | 11.936238 | 2.236068 | 4.472136 | 11.936238 | 1.193624 | 3.197236 | ... | 8.815964 | 39.140584 | 136.125201 | 5.235585 | 120 | 11 | 46.0 | 50.0 | 4.333333 | 2.277778 |
CC(=O)C1=CC=CC=C1 | 6.542301 | 6.236096 | 0 | 0 | 11.189957 | 2.193993 | 4.387987 | 11.189957 | 1.243329 | 3.089765 | ... | 8.590258 | 37.289972 | 120.057515 | 7.062207 | 88 | 9 | 40.0 | 43.0 | 3.472222 | 2.111111 |
CC1=CC[C@@H](CC1=O)C(=C)C | 8.134854 | 7.731889 | 0 | 0 | 13.152542 | 2.292456 | 4.584911 | 13.152542 | 1.195686 | 3.294652 | ... | 9.071423 | 40.991311 | 150.104465 | 6.004179 | 152 | 14 | 52.0 | 58.0 | 5.194444 | 2.472222 |
CCCCCC=O | 4.242641 | 4.53095 | 0 | 0 | 8.054679 | 1.847759 | 3.695518 | 8.054679 | 1.150668 | 2.739193 | ... | 6.900731 | 30.25721 | 100.088815 | 5.267832 | 56 | 4 | 22.0 | 20.0 | 3.25 | 2.0 |
CC/C=C\CC=O | 4.242641 | 4.53095 | 0 | 0 | 8.054679 | 1.847759 | 3.695518 | 8.054679 | 1.150668 | 2.739193 | ... | 6.900731 | 30.25721 | 98.073165 | 5.76901 | 56 | 4 | 22.0 | 20.0 | 3.25 | 2.0 |
5 rows × 1826 columns
The mordred_features Pandas dataframe has (for the current version of Mordred, installed by Pyrfume) 1826 computed physicochemical features for each molecule. What do these features mean? That’s a question for the computational chemists. As investigators asking how molecular structure might explain or be represented in brain activity or behavior, we just want a set of predictors with a track record of success in structure-based predictive models.
It has been shown previously that the physics-based features of Dragon (or of Mordred, which implements similar calculations) are good at predicting some kinds of structure-driven outcomes, but not others. For example, in Keller et al 2017 (Science) it was shown that such features are good for predicting perceived intensity, but not as good at predicting perceived “bakery” smell. That paper showed that the latter was better predicted by a template-matching approach – molecules that smell like “bakery” are best predicted by asking whether they are structurally similar to other molecules known to smell like “bakery”.
from pyrfume.features import smiles_to_morgan
morgan_features = smiles_to_morgan(smiles)
morgan_features
10565946 | 56091288 | 118067289 | 121328373 | 153234704 | 180311863 | 226893386 | 245578872 | 256538328 | 269476445 | ... | 3910734833 | 3967719433 | 3983702751 | 3990544841 | 4056449865 | 4084831613 | 4166400509 | 4175270308 | 4209952873 | 4277593716 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CC1=CC[C@@H](CC1)C(=C)C | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 16.0 |
CC(=O)C1=CC=CC=C1 | 0.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 |
CC1=CC[C@@H](CC1=O)C(=C)C | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 14.0 |
CCCCCC=O | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 12.0 |
CC/C=C\CC=O | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 |
5 rows × 105 columns
The dataframe above contains fragment counts, i.e. the number of times a given sub-molecular pattern (e.g. an acetyl group, a benzene ring, or any other possible substructure) occurs within each molecule (represented by a SMILES string). Each substructure has a unique “hash”, which is simply a long integer that can be resolved back to that substructure (also called a fingerprint). The larger and more diverse the set of molecules we want to featurize, the larger the number of potential substructures, so the number of predictors can become quite large (here, since we are only working with 5 molecules, there are only 105 unique substructures, up to some maximum size).
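The hashing idea can be illustrated with a toy sketch that counts character n-grams of a SMILES string, keyed by an integer hash. This is only an analogy: real Morgan fingerprints hash circular substructures of the molecular graph, not text n-grams.

```python
import zlib
from collections import Counter

def toy_fragment_counts(smiles: str, n: int = 3) -> Counter:
    """Count character n-grams of a SMILES string, keyed by an integer hash.

    A crude stand-in for Morgan fragment counting: each n-gram plays the
    role of a substructure, and its CRC32 hash plays the role of the
    fingerprint bit/column identifier.
    """
    grams = (smiles[i:i + n] for i in range(len(smiles) - n + 1))
    return Counter(zlib.crc32(g.encode()) for g in grams)

counts = toy_fragment_counts("CCCCCC=O")  # hexanal
```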
These features can be used directly, but they still don’t tell us how similar two molecules are.
from pyrfume.features import smiles_to_morgan_sim
morgan_sim_features = smiles_to_morgan_sim(smiles, smiles)
morgan_sim_features
CC1=CC[C@@H](CC1)C(=C)C | CC(=O)C1=CC=CC=C1 | CC1=CC[C@@H](CC1=O)C(=C)C | CCCCCC=O | CC/C=C\CC=O | |
---|---|---|---|---|---|
CC1=CC[C@@H](CC1)C(=C)C | 1.000000 | 0.272000 | 0.680556 | 0.360656 | 0.372881 |
CC(=O)C1=CC=CC=C1 | 0.272000 | 1.000000 | 0.308943 | 0.297030 | 0.309278 |
CC1=CC[C@@H](CC1=O)C(=C)C | 0.680556 | 0.308943 | 1.000000 | 0.383333 | 0.396552 |
CCCCCC=O | 0.360656 | 0.297030 | 0.383333 | 1.000000 | 0.574468 |
CC/C=C\CC=O | 0.372881 | 0.309278 | 0.396552 | 0.574468 | 1.000000 |
With smiles_to_morgan_sim we have the option of a second argument: a list of SMILES against which to compare the original SMILES. The result is a measure of similarity between two molecules, defined as their similarity in substructure counts.
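One common count-based similarity is the Tanimoto coefficient on fragment counts; here is a pure-Python sketch of that idea (not necessarily the exact metric Pyrfume uses internally):

```python
def tanimoto_counts(a: dict, b: dict) -> float:
    """Tanimoto similarity between two hash -> count fragment dictionaries."""
    keys = set(a) | set(b)
    shared = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    total = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return shared / total if total else 0.0

# Identical count vectors score 1.0; disjoint ones score 0.0.
sim = tanimoto_counts({101: 2, 202: 1}, {101: 1, 303: 1})  # 1 shared / 4 total = 0.25
```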
But rather than restrict the similarity to only those molecules we want to predict, why not compute the similarity to other known odorous (or non-odorous) molecules?
from pyrfume.odorants import all_smiles
reference_smiles = all_smiles()
# smiles_to_morgan_sim(smiles, reference_smiles)
The code above (after uncommenting the last line) will compute a dataframe of similarities between your molecules of interest and several thousand known odorants curated through the Pyrfume project. This list consists of odorants used in >40 notable journal articles and industrial databases.
Because this list is so large, there is a good chance that all of the molecules you are using for your project are already in it. This means that the feature values can be computed once and you can simply look them up. These pre-computed values have already been filtered to remove non-informative features (e.g. those with zero variance) and features with a large fraction of missing values (some features, especially for Mordred, are only computable for esoteric molecules). Furthermore, the remaining missing values have been filled in with KNN imputation. This means that they are “ready to go” for machine learning applications (which typically require finite, non-missing values in all predictors).
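This filter-then-impute step can be reproduced on your own feature tables; below is a minimal sketch with pandas and scikit-learn (the toy column names and values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy feature table: one zero-variance column and one missing value
X = pd.DataFrame({
    "const": [1.0, 1.0, 1.0, 1.0],     # non-informative: zero variance
    "f1":    [0.1, 0.2, 0.3, 0.4],
    "f2":    [1.0, np.nan, 3.0, 4.0],  # has a missing value
})

# Drop zero-variance columns (pandas var() skips NaNs by default)
X = X.loc[:, X.var() > 0]

# Fill remaining missing values from the nearest rows in feature space
imputer = KNNImputer(n_neighbors=2)
X_filled = pd.DataFrame(imputer.fit_transform(X),
                        columns=X.columns, index=X.index)
```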
morgan_sim = pyrfume.load_data("morgan/features_sim.csv", cids=cids)
mordred = pyrfume.load_data("mordred/features.csv", cids=cids)
These files are quite large (thousands of molecules (rows) by thousands of features (columns)), but we’ve selected only the rows (feature vectors) for the 5 CIDs of interest.
Because every stored value in Pyrfume is indexed by CID, we can access the molecules we care about (that we want to get features for) using CIDs as keys. We can then join them to produce one final set of features for machine learning applications.
features = mordred.join(morgan_sim)
features
ABC | ABCGG | nAcid | nBase | SpAbs_A | SpMax_A | SpDiam_A | SpAD_A | SpMAD_A | LogEE_A | ... | CC(=O)O.C1=CC=C2C(=C1)C(=CC3=C2N=C4C=CC(=N)C=C4O3)N | CCC(C)N1CCC2(CC1)NC3=C4C5=C(C(=NC(=O)/C(=C/C=C/[C@@H]([C@@H]([C@H]([C@H]([C@H]([C@@H]([C@@H]([C@H](/C=C/O[C@@]6(C(=O)C4=C(O6)C(=C5O)C)C)OC)C)OC(=O)C)C)O)C)O)C)/C)C3=N2)O | C[C@H]1/C=C/C=C(\C(=O)NC2=C(C(=C3C(=C2O)C(=C(C4=C3C(=O)[C@](O4)(O/C=C/[C@@H]([C@H]([C@H]([C@@H]([C@@H]([C@@H]([C@H]1O)C)O)C)OC(=O)C)C)OC)C)C)O)O)/C=N\N5CCN(CC5)C)/C | C[C@H]1/C=C/C=C(\C(=O)NC2=C(C(=C3C(=C2O)C(=C(C4=C3C(=O)[C@](O4)(O/C=C/[C@@H]([C@H]([C@H]([C@@H]([C@@H]([C@@H]([C@H]1O)C)O)C)OC(=O)C)C)OC)C)C)O)O)/C=N\N5CCN(CC5)C6CCCC6)/C | COC1=CC(=CC(=C1O)OC)/C=N\N=C/C2=CC(=C(C(=C2)OC)O)OC | C[N+]1=CC=CC=C1/C=N\O.[Cl-] | COC(=O)N/N=C\C1=[N+](C2=CC=CC=C2[N+](=C1)[O-])[O-] | C1=CC(=C(C(=C1)Cl)CC(=O)NC(=N)N)Cl.Cl | C1(=C(N=C(C(=N1)Cl)N)N)C(=O)NC(=N)N.O.O.Cl | CCCCCCCOC1=C(C=C(C=C1)CC=C)OC | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CID | |||||||||||||||||||||
6184 | 4.242641 | 4.530950 | 0.0 | 0.0 | 8.054679 | 1.847759 | 3.695518 | 8.054679 | 1.150668 | 2.739193 | ... | 0.124294 | 0.144681 | 0.141907 | 0.133056 | 0.267281 | 0.270270 | 0.226415 | 0.232558 | 0.174603 | 0.385787 |
7410 | 6.542301 | 6.236096 | 0.0 | 0.0 | 11.189957 | 2.193993 | 4.387987 | 11.189957 | 1.243329 | 3.089765 | ... | 0.288889 | 0.114165 | 0.123348 | 0.115702 | 0.254545 | 0.456140 | 0.395062 | 0.409091 | 0.232558 | 0.250000 |
16724 | 8.134854 | 7.731889 | 0.0 | 0.0 | 13.152542 | 2.292456 | 4.584911 | 13.152542 | 1.195686 | 3.294652 | ... | 0.150754 | 0.174797 | 0.190275 | 0.178926 | 0.251046 | 0.285714 | 0.243094 | 0.251656 | 0.189189 | 0.283105 |
440917 | 7.358797 | 6.909742 | 0.0 | 0.0 | 11.936238 | 2.236068 | 4.472136 | 11.936238 | 1.193624 | 3.197236 | ... | 0.139303 | 0.174089 | 0.185263 | 0.178218 | 0.257261 | 0.266667 | 0.218579 | 0.222222 | 0.160000 | 0.289593 |
643941 | 4.242641 | 4.530950 | 0.0 | 0.0 | 8.054679 | 1.847759 | 3.695518 | 8.054679 | 1.150668 | 2.739193 | ... | 0.115607 | 0.103004 | 0.111857 | 0.104822 | 0.225352 | 0.280374 | 0.245161 | 0.256000 | 0.196721 | 0.279793 |
5 rows × 11144 columns
We now have one giant dataframe with all 10,000+ features (all Mordred physicochemical features and all Morgan fingerprint similarity features), indexed by CID. You are now ready for prediction on targets (receptor activation, glomerular imaging data, PCx firing rates, human perception, animal behavior, etc.) using your favorite ML framework (scikit-learn, pytorch, etc.).
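As a sketch of that last step, here is a toy regression with scikit-learn, using random numbers in place of the real feature dataframe and a made-up target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))                      # stand-in for `features`
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=50)   # made-up target

# Cross-validated R^2 for a simple off-the-shelf model
model = RandomForestRegressor(n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
```

With real data, `X` would be the joined feature dataframe and `y` a column of measured responses indexed by the same CIDs.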
Pyrfume also supports some additional featurizations (e.g. NSPDK) and is working on supporting more (e.g. from graph convolutional networks or auto-encoders).