3. Featurization for Machine Learning
import pyrfume

# Load the data we created in the last chapter
molecules = pyrfume.load_data("my_data/molecules.csv")
cids = molecules.index.tolist()
Many applications, such as predictive models, propose that the molecular structure of an odorant is causal to neural activity or behavior (a reasonable proposition). Training these models requires that each molecule be represented by a vector of predictors, or features, which in some way describe its structure quantitatively. Traditionally the commercial chemoinformatic feature-calculation software Dragon has been used for this, but because Dragon is commercial software this creates challenges for replication and reproducibility. Some have found that open-source feature-calculation packages such as Mordred perform as well, or nearly as well.
Feature calculation on molecules involves several steps, including: 1) expression of the structure in a standard representation (such as SMILES or InChI); 2) conversion of that structure to a molecule object in memory (e.g. an rdkit Mol); 3) cleanup such as addition of hydrogens (omitted from SMILES by design) and removal of salts; 4) embedding in 3 dimensions (without violating steric or electrostatic principles); 5) feature calculation on this 3D structure. Pyrfume abstracts all of this away into one step.
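For reference, the individual steps that Pyrfume wraps can be sketched directly with RDKit; this is a minimal sketch (hexanal is used as an arbitrary example molecule, and salt removal is skipped because this SMILES has no salts):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Step 2: parse a SMILES string into an in-memory Mol object
mol = Chem.MolFromSmiles("CCCCCC=O")  # hexanal

# Step 3: add explicit hydrogens (omitted from SMILES by design)
mol = Chem.AddHs(mol)

# Step 4: embed the molecule in 3D and relax the geometry
AllChem.EmbedMolecule(mol, randomSeed=0)
AllChem.MMFFOptimizeMolecule(mol)

# Step 5 (feature calculation) is what a package like Mordred
# then performs on this 3D structure.
```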
smiles = molecules["IsomericSMILES"].tolist()
smiles
['CC1=CC[C@@H](CC1)C(=C)C',
'CC(=O)C1=CC=CC=C1',
'CC1=CC[C@@H](CC1=O)C(=C)C',
'CCCCCC=O',
'CC/C=C\\CC=O']
from pyrfume.features import smiles_to_mordred
mordred_features = smiles_to_mordred(smiles)
Computing Mordred features...
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00, 2.30it/s]
There are 5 molecules and 1826 features
The code above implements all of the steps required for featurization (except for obtaining the SMILES string, which was done in Part 2). One can also obtain a SMILES string directly by calling pyrfume.odorants.cids_to_smiles.
mordred_features
ABC | ABCGG | nAcid | nBase | SpAbs_A | SpMax_A | SpDiam_A | SpAD_A | SpMAD_A | LogEE_A | ... | SRW10 | TSRW10 | MW | AMW | WPath | WPol | Zagreb1 | Zagreb2 | mZagreb1 | mZagreb2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CC1=CC[C@@H](CC1)C(=C)C | 7.358797 | 6.909742 | 0 | 0 | 11.936238 | 2.236068 | 4.472136 | 11.936238 | 1.193624 | 3.197236 | ... | 8.815964 | 39.140584 | 136.125201 | 5.235585 | 120 | 11 | 46.0 | 50.0 | 4.333333 | 2.277778 |
CC(=O)C1=CC=CC=C1 | 6.542301 | 6.236096 | 0 | 0 | 11.189957 | 2.193993 | 4.387987 | 11.189957 | 1.243329 | 3.089765 | ... | 8.590258 | 37.289972 | 120.057515 | 7.062207 | 88 | 9 | 40.0 | 43.0 | 3.472222 | 2.111111 |
CC1=CC[C@@H](CC1=O)C(=C)C | 8.134854 | 7.731889 | 0 | 0 | 13.152542 | 2.292456 | 4.584911 | 13.152542 | 1.195686 | 3.294652 | ... | 9.071423 | 40.991311 | 150.104465 | 6.004179 | 152 | 14 | 52.0 | 58.0 | 5.194444 | 2.472222 |
CCCCCC=O | 4.242641 | 4.53095 | 0 | 0 | 8.054679 | 1.847759 | 3.695518 | 8.054679 | 1.150668 | 2.739193 | ... | 6.900731 | 30.25721 | 100.088815 | 5.267832 | 56 | 4 | 22.0 | 20.0 | 3.25 | 2.0 |
CC/C=C\CC=O | 4.242641 | 4.53095 | 0 | 0 | 8.054679 | 1.847759 | 3.695518 | 8.054679 | 1.150668 | 2.739193 | ... | 6.900731 | 30.25721 | 98.073165 | 5.76901 | 56 | 4 | 22.0 | 20.0 | 3.25 | 2.0 |
5 rows × 1826 columns
The mordred_features Pandas dataframe has (for the current version of Mordred, installed by Pyrfume) 1826 computed physicochemical features for each molecule. What do these features mean? That’s a question for the computational chemists. As investigators asking how molecular structure might explain or be represented in brain activity or behavior, we just want a set of predictors with a track record of success in structure-based predictive models.
It has been shown previously that the physics-based features of Dragon (or of Mordred, which implements similar calculations) are good at predicting some kinds of structure-driven outcomes, but not others. For example, in Keller et al 2017 (Science) it was shown that such features are good for predicting perceived intensity, but not as good at predicting perceived “bakery” smell. That paper showed that the latter was better predicted by a template-matching approach – molecules that smell like “bakery” are best predicted by asking whether they are structurally similar to other molecules known to smell like “bakery”.
from pyrfume.features import smiles_to_morgan
morgan_features = smiles_to_morgan(smiles)
morgan_features
10565946 | 56091288 | 118067289 | 121328373 | 153234704 | 180311863 | 226893386 | 245578872 | 256538328 | 269476445 | ... | 3910734833 | 3967719433 | 3983702751 | 3990544841 | 4056449865 | 4084831613 | 4166400509 | 4175270308 | 4209952873 | 4277593716 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CC1=CC[C@@H](CC1)C(=C)C | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 16.0 |
CC(=O)C1=CC=CC=C1 | 0.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 |
CC1=CC[C@@H](CC1=O)C(=C)C | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 14.0 |
CCCCCC=O | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 12.0 |
CC/C=C\CC=O | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 |
5 rows × 105 columns
The dataframe above contains fragment counts, i.e. the number of times a given sub-molecular pattern (e.g. an acetyl group, a benzene ring, or any other possible substructure) occurs within each molecule (represented by a SMILES string). Each substructure has a unique “hash”, which is simply a long integer that can be resolved back to that substructure (also called a fingerprint). The larger and more diverse the set of molecules we want to featurize, the larger the number of potential substructures, so the number of predictors can become quite large (here, since we are only working with 5 molecules, there are only 105 unique substructures, up to some maximum size).
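The hashing idea can be illustrated with a toy sketch that counts character n-grams of a SMILES string, keyed by an integer hash. This is only an analogy: real Morgan fingerprints hash circular substructures of the molecular graph, not text n-grams.

```python
import zlib
from collections import Counter

def toy_fragment_counts(smiles: str, n: int = 3) -> Counter:
    """Count character n-grams of a SMILES string, keyed by an integer hash.

    A crude stand-in for Morgan fragment counting: each n-gram plays the
    role of a substructure, and its CRC32 hash plays the role of the
    fingerprint bit/column identifier.
    """
    grams = (smiles[i:i + n] for i in range(len(smiles) - n + 1))
    return Counter(zlib.crc32(g.encode()) for g in grams)

counts = toy_fragment_counts("CCCCCC=O")  # hexanal
```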
These features can be used directly, but they still don’t tell us how similar two molecules are.
from pyrfume.features import smiles_to_morgan_sim
morgan_sim_features = smiles_to_morgan_sim(smiles, smiles)
morgan_sim_features
CC1=CC[C@@H](CC1)C(=C)C | CC(=O)C1=CC=CC=C1 | CC1=CC[C@@H](CC1=O)C(=C)C | CCCCCC=O | CC/C=C\CC=O | |
---|---|---|---|---|---|
CC1=CC[C@@H](CC1)C(=C)C | 1.000000 | 0.272000 | 0.680556 | 0.360656 | 0.372881 |
CC(=O)C1=CC=CC=C1 | 0.272000 | 1.000000 | 0.308943 | 0.297030 | 0.309278 |
CC1=CC[C@@H](CC1=O)C(=C)C | 0.680556 | 0.308943 | 1.000000 | 0.383333 | 0.396552 |
CCCCCC=O | 0.360656 | 0.297030 | 0.383333 | 1.000000 | 0.574468 |
CC/C=C\CC=O | 0.372881 | 0.309278 | 0.396552 | 0.574468 | 1.000000 |
With smiles_to_morgan_sim we have the option of a second argument: a list of SMILES against which to compare the original SMILES. The result is a measure of similarity between two molecules, defined as their similarity in substructure counts.
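One common count-based similarity is the Tanimoto coefficient on fragment counts; here is a pure-Python sketch of that idea (not necessarily the exact metric Pyrfume uses internally):

```python
def tanimoto_counts(a: dict, b: dict) -> float:
    """Tanimoto similarity between two hash -> count fragment dictionaries."""
    keys = set(a) | set(b)
    shared = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    total = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return shared / total if total else 0.0

# Identical count vectors score 1.0; disjoint ones score 0.0.
sim = tanimoto_counts({101: 2, 202: 1}, {101: 1, 303: 1})  # 1 shared / 4 total = 0.25
```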
But rather than restrict the similarity to only those molecules we want to predict, why not compute the similarity to other known odorous (or non-odorous) molecules?
from pyrfume.odorants import all_smiles
reference_smiles = all_smiles()
# smiles_to_morgan_sim(smiles, reference_smiles)
The code above (after uncommenting the last line) will compute a dataframe of similarities between your molecules of interest and several thousand known odorants curated through the Pyrfume project. This list consists of odorants used in >40 notable journal articles and industrial databases.
Because this list is so large, there is a good chance that all of the molecules you are using for your project are already in it. This means that the feature values can be computed once and you can simply look them up. These pre-computed values have already been filtered to remove non-informative features (e.g. those with zero variance) and features with a large fraction of missing values (some features, especially for Mordred, are only computable for esoteric molecules). Furthermore, the remaining missing values have been filled in with KNN imputation. This means that they are “ready to go” for machine learning applications (which typically require finite, non-missing values in all predictors).
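This filter-then-impute step can be reproduced on your own feature tables; below is a minimal sketch with pandas and scikit-learn (the toy column names and values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy feature table: one zero-variance column and one missing value
X = pd.DataFrame({
    "const": [1.0, 1.0, 1.0, 1.0],     # non-informative: zero variance
    "f1":    [0.1, 0.2, 0.3, 0.4],
    "f2":    [1.0, np.nan, 3.0, 4.0],  # has a missing value
})

# Drop zero-variance columns (pandas var() skips NaNs by default)
X = X.loc[:, X.var() > 0]

# Fill remaining missing values from the nearest rows in feature space
imputer = KNNImputer(n_neighbors=2)
X_filled = pd.DataFrame(imputer.fit_transform(X),
                        columns=X.columns, index=X.index)
```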
morgan_sim = pyrfume.load_data("morgan/features_sim.csv", cids=cids)
mordred = pyrfume.load_data("mordred/features.csv", cids=cids)
These files are quite large (thousands of molecules (rows) by thousands of features (columns)), but we’ve selected only the rows (feature vectors) for the 5 CIDs of interest.
Because every stored value in Pyrfume is indexed by CID, we can access the molecules we care about (that we want to get features for) using CIDs as keys. We can then join them to produce one final set of features for machine learning applications.
features = mordred.join(morgan_sim)
features
ABC | ABCGG | nAcid | nBase | SpAbs_A | SpMax_A | SpDiam_A | SpAD_A | SpMAD_A | LogEE_A | ... | CC(=O)O.C1=CC=C2C(=C1)C(=CC3=C2N=C4C=CC(=N)C=C4O3)N | CCC(C)N1CCC2(CC1)NC3=C4C5=C(C(=NC(=O)/C(=C/C=C/[C@@H]([C@@H]([C@H]([C@H]([C@H]([C@@H]([C@@H]([C@H](/C=C/O[C@@]6(C(=O)C4=C(O6)C(=C5O)C)C)OC)C)OC(=O)C)C)O)C)O)C)/C)C3=N2)O | C[C@H]1/C=C/C=C(\C(=O)NC2=C(C(=C3C(=C2O)C(=C(C4=C3C(=O)[C@](O4)(O/C=C/[C@@H]([C@H]([C@H]([C@@H]([C@@H]([C@@H]([C@H]1O)C)O)C)OC(=O)C)C)OC)C)C)O)O)/C=N\N5CCN(CC5)C)/C | C[C@H]1/C=C/C=C(\C(=O)NC2=C(C(=C3C(=C2O)C(=C(C4=C3C(=O)[C@](O4)(O/C=C/[C@@H]([C@H]([C@H]([C@@H]([C@@H]([C@@H]([C@H]1O)C)O)C)OC(=O)C)C)OC)C)C)O)O)/C=N\N5CCN(CC5)C6CCCC6)/C | COC1=CC(=CC(=C1O)OC)/C=N\N=C/C2=CC(=C(C(=C2)OC)O)OC | C[N+]1=CC=CC=C1/C=N\O.[Cl-] | COC(=O)N/N=C\C1=[N+](C2=CC=CC=C2[N+](=C1)[O-])[O-] | C1=CC(=C(C(=C1)Cl)CC(=O)NC(=N)N)Cl.Cl | C1(=C(N=C(C(=N1)Cl)N)N)C(=O)NC(=N)N.O.O.Cl | CCCCCCCOC1=C(C=C(C=C1)CC=C)OC | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CID | |||||||||||||||||||||
6184 | 4.242641 | 4.530950 | 0.0 | 0.0 | 8.054679 | 1.847759 | 3.695518 | 8.054679 | 1.150668 | 2.739193 | ... | 0.124294 | 0.144681 | 0.141907 | 0.133056 | 0.267281 | 0.270270 | 0.226415 | 0.232558 | 0.174603 | 0.385787 |
7410 | 6.542301 | 6.236096 | 0.0 | 0.0 | 11.189957 | 2.193993 | 4.387987 | 11.189957 | 1.243329 | 3.089765 | ... | 0.288889 | 0.114165 | 0.123348 | 0.115702 | 0.254545 | 0.456140 | 0.395062 | 0.409091 | 0.232558 | 0.250000 |
16724 | 8.134854 | 7.731889 | 0.0 | 0.0 | 13.152542 | 2.292456 | 4.584911 | 13.152542 | 1.195686 | 3.294652 | ... | 0.150754 | 0.174797 | 0.190275 | 0.178926 | 0.251046 | 0.285714 | 0.243094 | 0.251656 | 0.189189 | 0.283105 |
440917 | 7.358797 | 6.909742 | 0.0 | 0.0 | 11.936238 | 2.236068 | 4.472136 | 11.936238 | 1.193624 | 3.197236 | ... | 0.139303 | 0.174089 | 0.185263 | 0.178218 | 0.257261 | 0.266667 | 0.218579 | 0.222222 | 0.160000 | 0.289593 |
643941 | 4.242641 | 4.530950 | 0.0 | 0.0 | 8.054679 | 1.847759 | 3.695518 | 8.054679 | 1.150668 | 2.739193 | ... | 0.115607 | 0.103004 | 0.111857 | 0.104822 | 0.225352 | 0.280374 | 0.245161 | 0.256000 | 0.196721 | 0.279793 |
5 rows × 11144 columns
We now have one giant dataframe with all 10,000+ features (all Mordred physicochemical features and all Morgan fingerprint similarity features), indexed by CID. You are now ready for prediction on targets (receptor activation, glomerular imaging data, PCx firing rates, human perception, animal behavior, etc.) using your favorite ML framework (scikit-learn, pytorch, etc.).
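As a sketch of that last step, here is a toy regression with scikit-learn, using random numbers in place of the real feature dataframe and a made-up target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))                      # stand-in for `features`
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=50)   # made-up target

# Cross-validated R^2 for a simple off-the-shelf model
model = RandomForestRegressor(n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
```

With real data, `X` would be the joined feature dataframe and `y` a column of measured responses indexed by the same CIDs.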
Pyrfume also supports some additional featurizations (e.g. NSPDK) and is working on supporting more (e.g. from graph convolutional networks or auto-encoders).