1. Standardizing your own data

You may already have your own data and want to link the odorants in it with those in other datasets, or simply run analyses that require your odorants to be well-described or featurized.

!pip install -q pyrfume  # Install pyrfume if it is not already installed
import pandas as pd
import pyrfume

Pyrfume operates under the principle that the proper identifier for a single odorant molecule (e.g. d-Limonene) is the PubChem compound ID (440917), and that the proper identifier for a single (known) mixture (e.g. light mineral oil) is the PubChem substance ID (402315722).

  • A PubChem compound ID uniquely identifies a molecular structure (unlike a CAS registry number).

  • A given structure resolves to only one PubChem ID (unlike a SMILES string, which depends on the implementation that generated it).

  • PubChem itself is indexed by these IDs and provides a wealth of additional records covering experimental data, computable properties, safety information, and other externally linked data.

To get access to all of this information, and to link the same molecule across datasets, the first step is to obtain PubChem compound IDs (henceforth, CIDs) for the molecules in question.

names = ['d-limonene', '98-86-2', '(+)-carvone', 'CCCCCC=O', 'GXANMBISFKBPEX-ARJAWSKDSA-N']

Above we have 5 different molecules, represented with a mix of names (with different annotations), CAS numbers, SMILES strings, and InChIKeys. Your data may use one of these formats, a mix of them, or some other format entirely. The PubChem identifier exchange service does a good job of converting between (some of) these formats, or of identifying potential CIDs. Pyrfume does the extra work of auto-identifying the type of each identifier, checking for alternative conversions, and reporting names that did not match or had multiple matches.
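
To give a feel for what identifying an identifier's type can involve, here is a minimal sketch. It is not Pyrfume's implementation: the helper names `is_valid_cas` and `guess_identifier_type` are hypothetical, and the sketch only separates CAS numbers (by format and check digit) and InChIKeys (by format) from everything else.

```python
import re

# Hypothetical helpers for illustration; these are not part of pyrfume.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def is_valid_cas(identifier: str) -> bool:
    """Return True if the string has CAS format and a correct check digit."""
    match = CAS_PATTERN.match(identifier)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... from right to left; the sum modulo 10
    # must equal the final (check) digit.
    total = sum(w * int(d) for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def guess_identifier_type(identifier: str) -> str:
    """Classify a string as 'cas', 'inchikey', or 'name_or_smiles'."""
    if is_valid_cas(identifier):
        return "cas"
    if INCHIKEY_PATTERN.match(identifier):
        return "inchikey"
    # Telling names from SMILES strings reliably needs a chemistry toolkit,
    # so this sketch leaves them in one bucket.
    return "name_or_smiles"


names = ['d-limonene', '98-86-2', '(+)-carvone', 'CCCCCC=O',
         'GXANMBISFKBPEX-ARJAWSKDSA-N']
guessed = {name: guess_identifier_type(name) for name in names}
```

A pre-check like this can be handy for spotting malformed CAS numbers in your own data before sending anything to PubChem.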

from pyrfume import get_cids
cids = get_cids(names)

The process above can be a little slow (resolving only a few identifiers per second) because the PubChem database itself is not indexed by most of these identifier types (only by CIDs and InChIKeys). Still, it returns a dictionary that maps each original identifier to a unique identifier (CID):

cids
{'d-limonene': 440917,
 '98-86-2': 7410,
 '(+)-carvone': 16724,
 'CCCCCC=O': 6184,
 'GXANMBISFKBPEX-ARJAWSKDSA-N': 643941}

The dictionary looks a bit nicer as a pandas Series:

cids = pd.Series(cids)
cids
d-limonene                     440917
98-86-2                          7410
(+)-carvone                     16724
CCCCCC=O                         6184
GXANMBISFKBPEX-ARJAWSKDSA-N    643941
dtype: int64
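
With the identifier-to-CID mapping in hand, attaching CIDs to your own table is a one-line lookup. The sketch below assumes a dataset with an `odorant` column; the column names, the ratings, and the 'mystery odorant' entry are all made up for illustration.

```python
import pandas as pd

# Identifier -> CID mapping, as obtained above
cids = pd.Series({'d-limonene': 440917, '98-86-2': 7410, '(+)-carvone': 16724,
                  'CCCCCC=O': 6184, 'GXANMBISFKBPEX-ARJAWSKDSA-N': 643941})

# Hypothetical dataset: column names and values are invented for this example
my_data = pd.DataFrame({
    'odorant': ['d-limonene', 'CCCCCC=O', 'mystery odorant', 'd-limonene'],
    'intensity': [3.5, 6.0, 2.0, 4.0],
})

# Look up the CID for each row; identifiers missing from the mapping become <NA>.
# The nullable 'Int64' dtype keeps CIDs as integers despite the missing value.
my_data['CID'] = my_data['odorant'].map(cids).astype('Int64')

# Identifiers that still need to be resolved by hand
unmatched = my_data.loc[my_data['CID'].isna(), 'odorant'].unique().tolist()
```

Listing the unmatched identifiers explicitly makes it harder for rows without a CID to be silently dropped in later merges.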

Now that you have unique identifiers, you can access a lot more information:

from pyrfume import from_cids
info = from_cids(cids.values)
Retrieving 0 through 4

That part was quite fast and scales very well, because PubChem is indexed by CID. Pyrfume runs this in batches of 100 CIDs, and each batch takes about 1 second.
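
The batching idea can be sketched in a few lines. This is not Pyrfume's code: `batches` and `property_url` are hypothetical helpers, and the URL follows the PubChem PUG REST pattern for compound properties as I understand it, so check the PubChem documentation before relying on it. No request is sent here.

```python
from typing import Iterator, List, Sequence

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"  # PubChem PUG REST base URL


def batches(cids: Sequence[int], size: int = 100) -> Iterator[List[int]]:
    """Yield consecutive chunks of at most `size` CIDs."""
    for start in range(0, len(cids), size):
        yield list(cids[start:start + size])


def property_url(batch: Sequence[int],
                 properties: Sequence[str] = ("MolecularWeight",
                                              "IsomericSMILES",
                                              "IUPACName")) -> str:
    """Build one property-lookup URL covering a whole batch of CIDs."""
    cid_part = ",".join(str(cid) for cid in batch)
    return f"{PUG_REST}/compound/cid/{cid_part}/property/{','.join(properties)}/JSON"


# 250 CIDs split into batches of 100, 100, and 50: three requests in total
many_cids = list(range(1, 251))
batch_sizes = [len(batch) for batch in batches(many_cids)]
example_url = property_url([440917, 7410])
```

At roughly one second per batch, the cost grows with the number of batches rather than the number of molecules, which is why this step scales so much better than the identifier lookup.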

molecules = pd.DataFrame(info).set_index('CID')
molecules
        MolecularWeight  IsomericSMILES             IUPACName                                          name
CID
440917  136.23           CC1=CC[C@@H](CC1)C(=C)C    (4R)-1-methyl-4-prop-1-en-2-ylcyclohexene          d-limonene
7410    120.15           CC(=O)C1=CC=CC=C1          1-phenylethanone                                   acetophenone
16724   150.22           CC1=CC[C@@H](CC1=O)C(=C)C  (5S)-2-methyl-5-prop-1-en-2-ylcyclohex-2-en-1-one  d-carvone
6184    100.16           CCCCCC=O                   hexanal                                            hexanal
643941  98.14            CC/C=C\CC=O                (Z)-hex-3-enal                                     cis-3-hexenal

The table above contains the original set of molecules, indexed by CID, along with some other useful identifiers that (unlike CAS numbers or InChIKeys) actually tell you something about the molecule just by looking at them. The “IsomericSMILES” column is a standardized SMILES string computed using the same software (on PubChem) for every molecule. The “IUPACName” column is, similarly, a standardized nomenclature for molecule names. The “name” column is simply the most common name (sometimes a trade name) of the molecule, as you might see it in a publication. CID, IsomericSMILES, and IUPACName all uniquely describe the molecule. If you have multiple datasets from multiple sources and want to integrate them, you can use stock pandas functions for merging and/or concatenating data.
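
As a minimal sketch of that kind of integration, the example below merges two small tables on CID and then attaches molecule names. The measurement columns and their values are hypothetical; only the CIDs and names come from the table above.

```python
import pandas as pd

# A slice of the molecules table, indexed by CID
molecules = pd.DataFrame(
    {'name': ['d-limonene', 'acetophenone', 'hexanal']},
    index=pd.Index([440917, 7410, 6184], name='CID'),
)

# Hypothetical measurements from two different sources (values invented)
dataset_a = pd.DataFrame({'CID': [440917, 6184], 'intensity': [3.5, 6.0]})
dataset_b = pd.DataFrame({'CID': [6184, 7410], 'pleasantness': [2.0, 5.5]})

# An outer merge keeps every molecule that appears in either dataset;
# measurements missing from one source show up as NaN
combined = dataset_a.merge(dataset_b, on='CID', how='outer')

# Attach the standardized molecule information by CID
annotated = combined.join(molecules, on='CID')
```

Use `how='inner'` instead if you only want molecules measured in both datasets.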

This representation for a set of molecules will recur again and again in Part 4, when looking at external datasets.

Now that you have the molecules from your data in a standard format, save them to disk for future use:

pyrfume.save_data(molecules, 'my_data/molecules.csv')

You can load them back again with:

molecules = pyrfume.load_data('my_data/molecules.csv')
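
If you also need to read or write such a file outside of Pyrfume, plain pandas can do a similar round trip. The sketch below is not how Pyrfume stores data internally; it only shows that the CID index needs to be restored explicitly when reading the CSV back, and it uses a temporary directory as a stand-in for your own path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# A slice of the molecules table, indexed by CID
molecules = pd.DataFrame(
    {'MolecularWeight': [136.23, 120.15],
     'name': ['d-limonene', 'acetophenone']},
    index=pd.Index([440917, 7410], name='CID'),
)

# Temporary directory used only so the example cleans up after itself
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'molecules.csv'
    molecules.to_csv(path)                         # the index is written as the 'CID' column
    reloaded = pd.read_csv(path, index_col='CID')  # restore CID as the index
```

Without `index_col='CID'`, the CIDs would come back as an ordinary column and lookups by CID would no longer work the same way.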

You can change the location that Pyrfume uses for its local copy of the data archives with pyrfume.set_data_path.