3. Featurization for Machine Learning

import pyrfume

molecules = pyrfume.load_data(
    "my_data/molecules.csv"
)  # Load the data we created in the last chapter
cids = molecules.index.tolist()

Many applications, such as predictive models, propose that the molecular structure of an odorant is causal to neural activity or behavior (a reasonable proposition). Training these models requires that each molecule be represented by a vector of predictors, or features, which describe its structure quantitatively in some way. Traditionally the commercial chemoinformatic feature-calculation software Dragon has been used for this, but because Dragon is closed-source commercial software, this creates challenges for replication and reproducibility. Some have found that open-source feature-calculation packages such as Mordred perform as well, or nearly as well.

Feature calculation on molecules involves several steps: 1) expression of the structure in a standard representation (such as SMILES or InChI); 2) conversion of that representation to a molecule object in memory (e.g. an rdkit Mol); 3) cleanup of that object, such as adding hydrogens (omitted from SMILES by design) and removing salts; 4) embedding in 3 dimensions (without violating steric or electrostatic principles); and 5) feature calculation on the resulting 3D structure. Pyrfume abstracts all of this away into a single step.
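
For intuition, here is a rough sketch of those steps done by hand with RDKit. This is only illustrative, not Pyrfume's internal implementation, and the molecule and embedding settings are arbitrary choices.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.SaltRemover import SaltRemover

smi = "CC(=O)C1=CC=CC=C1"                 # acetophenone, one of the molecules below
mol = Chem.MolFromSmiles(smi)             # step 2: parse the standard representation
mol = SaltRemover().StripMol(mol)         # step 3: remove salts/counterions (none here)
mol = Chem.AddHs(mol)                     # step 3: add the hydrogens implicit in SMILES
AllChem.EmbedMolecule(mol, randomSeed=0)  # step 4: generate a 3D conformer
AllChem.MMFFOptimizeMolecule(mol)         # step 4: relax the geometry with a force field
# step 5 would pass `mol` to a descriptor calculator such as Mordred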

smiles = molecules["IsomericSMILES"].tolist()
smiles
['CC1=CC[C@@H](CC1)C(=C)C',
 'CC(=O)C1=CC=CC=C1',
 'CC1=CC[C@@H](CC1=O)C(=C)C',
 'CCCCCC=O',
 'CC/C=C\\CC=O']
from pyrfume.features import smiles_to_mordred

mordred_features = smiles_to_mordred(smiles)
Computing Mordred features...
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  2.30it/s]
There are 5 molecules and 1826 features

The code above implements all of the steps required for featurization (except for obtaining the SMILES strings, which was done in Part 2). One can also obtain SMILES strings directly from CIDs by calling pyrfume.odorants.cids_to_smiles.
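
For example, the snippet below queries PubChem for the SMILES of our CIDs. The return format is assumed here to be a mapping from each CID to its SMILES string; check your pyrfume version's documentation.

from pyrfume.odorants import cids_to_smiles

cid_smiles = cids_to_smiles(cids)  # assumed: {cid: smiles, ...}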

mordred_features
ABC ABCGG nAcid nBase SpAbs_A SpMax_A SpDiam_A SpAD_A SpMAD_A LogEE_A ... SRW10 TSRW10 MW AMW WPath WPol Zagreb1 Zagreb2 mZagreb1 mZagreb2
CC1=CC[C@@H](CC1)C(=C)C 7.358797 6.909742 0 0 11.936238 2.236068 4.472136 11.936238 1.193624 3.197236 ... 8.815964 39.140584 136.125201 5.235585 120 11 46.0 50.0 4.333333 2.277778
CC(=O)C1=CC=CC=C1 6.542301 6.236096 0 0 11.189957 2.193993 4.387987 11.189957 1.243329 3.089765 ... 8.590258 37.289972 120.057515 7.062207 88 9 40.0 43.0 3.472222 2.111111
CC1=CC[C@@H](CC1=O)C(=C)C 8.134854 7.731889 0 0 13.152542 2.292456 4.584911 13.152542 1.195686 3.294652 ... 9.071423 40.991311 150.104465 6.004179 152 14 52.0 58.0 5.194444 2.472222
CCCCCC=O 4.242641 4.53095 0 0 8.054679 1.847759 3.695518 8.054679 1.150668 2.739193 ... 6.900731 30.25721 100.088815 5.267832 56 4 22.0 20.0 3.25 2.0
CC/C=C\CC=O 4.242641 4.53095 0 0 8.054679 1.847759 3.695518 8.054679 1.150668 2.739193 ... 6.900731 30.25721 98.073165 5.76901 56 4 22.0 20.0 3.25 2.0

5 rows × 1826 columns

The mordred_features Pandas dataframe contains (for the version of Mordred currently installed by Pyrfume) 1826 computed physicochemical features for each molecule. What do these features mean? That’s a question for the computational chemists. As investigators asking how molecular structure might explain or be represented in brain activity or behavior, we just want a set of predictors with a track record of success in structure-based predictive models.

It has been shown previously that the physics-based features of Dragon (or of Mordred, which implements similar calculations) are good at predicting some kinds of structure-driven outcomes, but not others. For example, Keller et al. (2017, Science) showed that such features are good at predicting perceived intensity, but not as good at predicting perceived “bakery” smell. That paper showed that the latter was better predicted by a template-matching approach: molecules that smell like “bakery” are best predicted by asking whether they are structurally similar to other molecules known to smell like “bakery”.

from pyrfume.features import smiles_to_morgan

morgan_features = smiles_to_morgan(smiles)
morgan_features
10565946 56091288 118067289 121328373 153234704 180311863 226893386 245578872 256538328 269476445 ... 3910734833 3967719433 3983702751 3990544841 4056449865 4084831613 4166400509 4175270308 4209952873 4277593716
CC1=CC[C@@H](CC1)C(=C)C 0.0 1.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 16.0
CC(=O)C1=CC=CC=C1 0.0 5.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 8.0
CC1=CC[C@@H](CC1=O)C(=C)C 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 ... 1.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 14.0
CCCCCC=O 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 12.0
CC/C=C\CC=O 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 10.0

5 rows × 105 columns

The dataframe above contains fragment counts, i.e. the number of times, within each molecule (represented by its SMILES string), that a given sub-molecular pattern (e.g. an acetyl group, a benzene ring, or any other possible substructure) occurs. Each substructure has a unique “hash”, a long integer that can be resolved back to that substructure (the set of hashed counts is also called a fingerprint). The larger and more diverse the set of molecules we want to featurize, the more potential substructures there are, so the number of predictors can become quite large (here, since we are only working with 5 molecules, there are only 105 unique substructures up to some maximum size).
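
Under the hood these counts come from hashed circular (Morgan) fingerprints. A minimal RDKit sketch of the idea is below; the radius used here is an arbitrary choice and may not match Pyrfume's settings.

from rdkit import Chem
from rdkit.Chem import AllChem

# Count-based Morgan fingerprint for hexanal: keys are substructure hashes,
# values are how many times each substructure occurs in the molecule.
mol = Chem.MolFromSmiles("CCCCCC=O")
fp = AllChem.GetMorganFingerprint(mol, 2)  # radius 2 is an assumption
print(fp.GetNonzeroElements())             # e.g. {hash_1: count_1, hash_2: count_2, ...}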

These features can be used directly, but they still don’t tell us how similar two molecules are.

from pyrfume.features import smiles_to_morgan_sim

morgan_sim_features = smiles_to_morgan_sim(smiles, smiles)
morgan_sim_features
CC1=CC[C@@H](CC1)C(=C)C CC(=O)C1=CC=CC=C1 CC1=CC[C@@H](CC1=O)C(=C)C CCCCCC=O CC/C=C\CC=O
CC1=CC[C@@H](CC1)C(=C)C 1.000000 0.272000 0.680556 0.360656 0.372881
CC(=O)C1=CC=CC=C1 0.272000 1.000000 0.308943 0.297030 0.309278
CC1=CC[C@@H](CC1=O)C(=C)C 0.680556 0.308943 1.000000 0.383333 0.396552
CCCCCC=O 0.360656 0.297030 0.383333 1.000000 0.574468
CC/C=C\CC=O 0.372881 0.309278 0.396552 0.574468 1.000000

With smiles_to_morgan_sim we supply a second argument: a list of SMILES to compare the original SMILES against. The result is a measure of similarity between each pair of molecules, defined in terms of their substructure counts.
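
The reference list does not have to equal the query list. The sketch below compares our five molecules against two other common odorants, benzaldehyde and vanillin (their SMILES are written out by hand here); presumably the result has one row per query molecule and one column per reference molecule, as above.

# Compare our five molecules against two external reference odorants.
reference = ["C1=CC=C(C=C1)C=O",        # benzaldehyde
             "COC1=C(C=C(C=C1)C=O)O"]   # vanillin
smiles_to_morgan_sim(smiles, reference)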

But rather than restrict the similarity calculation to only those molecules we want to predict, why not compute the similarity to other known odorous (or non-odorous) molecules?

from pyrfume.odorants import all_smiles

reference_smiles = all_smiles()
# smiles_to_morgan_sim(smiles, reference_smiles)

The code above (after uncommenting the last line) will compute a dataframe of similarities between your molecules of interest and several thousand known odorants curated through the Pyrfume project. This list consists of odorants used in more than 40 notable journal articles and industrial databases.

Because this list is so large, there is a good chance that all of the molecules you are using for your project are already in it. This means that the feature values can be computed once and you can simply look them up. These pre-computed values have already been filtered to remove non-informative (e.g. zero-variance) features and features with a large fraction of missing values (some features, especially in Mordred, are only computable for certain esoteric molecules). The remaining missing values have been filled in with KNN imputation. This means they are “ready to go” for machine learning applications (which typically require finite, non-missing values for all predictors).
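
If you ever need to apply similar cleaning to features you compute yourself, a rough scikit-learn sketch is shown below. This is not Pyrfume's exact pipeline; the missing-value threshold and neighbor count are arbitrary assumptions.

import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import KNNImputer

def clean_features(df: pd.DataFrame, max_missing_frac: float = 0.3) -> pd.DataFrame:
    """Drop mostly-missing and zero-variance features, then KNN-impute the rest."""
    df = df.apply(pd.to_numeric, errors="coerce")          # force numeric values
    df = df.loc[:, df.isna().mean() <= max_missing_frac]   # drop mostly-missing features
    imputed = KNNImputer(n_neighbors=5).fit_transform(df)  # fill remaining gaps
    df = pd.DataFrame(imputed, index=df.index, columns=df.columns)
    keep = VarianceThreshold().fit(df).get_support()       # flag zero-variance features
    return df.loc[:, keep]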

morgan_sim = pyrfume.load_data("morgan/features_sim.csv", cids=cids)
mordred = pyrfume.load_data("mordred/features.csv", cids=cids)

These files are quite large, with thousands of molecules (rows) and thousands of features (columns), but by passing cids we’ve selected only the rows (feature vectors) for the 5 CIDs of interest.

Because every stored value in Pyrfume is indexed by CID, we can look up features for exactly the molecules we care about using CIDs as keys. We can then join the two feature sets to produce one final table of features for machine learning applications.

features = mordred.join(morgan_sim)
features
ABC ABCGG nAcid nBase SpAbs_A SpMax_A SpDiam_A SpAD_A SpMAD_A LogEE_A ... CC(=O)O.C1=CC=C2C(=C1)C(=CC3=C2N=C4C=CC(=N)C=C4O3)N CCC(C)N1CCC2(CC1)NC3=C4C5=C(C(=NC(=O)/C(=C/C=C/[C@@H]([C@@H]([C@H]([C@H]([C@H]([C@@H]([C@@H]([C@H](/C=C/O[C@@]6(C(=O)C4=C(O6)C(=C5O)C)C)OC)C)OC(=O)C)C)O)C)O)C)/C)C3=N2)O C[C@H]1/C=C/C=C(\C(=O)NC2=C(C(=C3C(=C2O)C(=C(C4=C3C(=O)[C@](O4)(O/C=C/[C@@H]([C@H]([C@H]([C@@H]([C@@H]([C@@H]([C@H]1O)C)O)C)OC(=O)C)C)OC)C)C)O)O)/C=N\N5CCN(CC5)C)/C C[C@H]1/C=C/C=C(\C(=O)NC2=C(C(=C3C(=C2O)C(=C(C4=C3C(=O)[C@](O4)(O/C=C/[C@@H]([C@H]([C@H]([C@@H]([C@@H]([C@@H]([C@H]1O)C)O)C)OC(=O)C)C)OC)C)C)O)O)/C=N\N5CCN(CC5)C6CCCC6)/C COC1=CC(=CC(=C1O)OC)/C=N\N=C/C2=CC(=C(C(=C2)OC)O)OC C[N+]1=CC=CC=C1/C=N\O.[Cl-] COC(=O)N/N=C\C1=[N+](C2=CC=CC=C2[N+](=C1)[O-])[O-] C1=CC(=C(C(=C1)Cl)CC(=O)NC(=N)N)Cl.Cl C1(=C(N=C(C(=N1)Cl)N)N)C(=O)NC(=N)N.O.O.Cl CCCCCCCOC1=C(C=C(C=C1)CC=C)OC
CID
6184 4.242641 4.530950 0.0 0.0 8.054679 1.847759 3.695518 8.054679 1.150668 2.739193 ... 0.124294 0.144681 0.141907 0.133056 0.267281 0.270270 0.226415 0.232558 0.174603 0.385787
7410 6.542301 6.236096 0.0 0.0 11.189957 2.193993 4.387987 11.189957 1.243329 3.089765 ... 0.288889 0.114165 0.123348 0.115702 0.254545 0.456140 0.395062 0.409091 0.232558 0.250000
16724 8.134854 7.731889 0.0 0.0 13.152542 2.292456 4.584911 13.152542 1.195686 3.294652 ... 0.150754 0.174797 0.190275 0.178926 0.251046 0.285714 0.243094 0.251656 0.189189 0.283105
440917 7.358797 6.909742 0.0 0.0 11.936238 2.236068 4.472136 11.936238 1.193624 3.197236 ... 0.139303 0.174089 0.185263 0.178218 0.257261 0.266667 0.218579 0.222222 0.160000 0.289593
643941 4.242641 4.530950 0.0 0.0 8.054679 1.847759 3.695518 8.054679 1.150668 2.739193 ... 0.115607 0.103004 0.111857 0.104822 0.225352 0.280374 0.245161 0.256000 0.196721 0.279793

5 rows × 11144 columns

We now have one giant dataframe with all 10,000+ features (all Mordred physicochemical features and all Morgan fingerprint similarity features), indexed by CID. You are now ready for prediction on targets (receptor activation, glomerular imaging data, PCx firing rates, human perception, animal behavior, etc.) using your favorite ML framework (scikit-learn, PyTorch, etc.).
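
As a final sketch, assuming you had a target vector aligned to the same CIDs, fitting a scikit-learn model could look like the code below. The target values are made-up placeholders, and a real application would of course involve far more than 5 molecules.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder targets: pretend these are, e.g., perceived-intensity ratings
# for our 5 CIDs, in the same row order as `features`.
y = np.array([4.2, 6.1, 5.0, 3.3, 3.9])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(features.values, y)               # rows of `features` are aligned with `y` by CID
predictions = model.predict(features.values)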

Pyrfume also supports some additional featurizations (e.g. NSPDK) and is working on supporting more (e.g. representations from graph convolutional networks or autoencoders).