5. Published Data: Working with the Pyrfume Datasets

import pyrfume

Whether you want to build predictive models or simply organize data, it is essential to integrate across datasets and, wherever possible, to bring the largest datasets to bear on the problem. Most breakthroughs come not from new algorithms but from new, large datasets.

bushdid_molecules = pyrfume.load_data("bushdid_2014/molecules.csv")
bushdid_molecules.head()
CID  MolecularWeight  IsomericSMILES    IUPACName             name          Odorant name  C.A.S.    % odorant  Solvent
176  60.05            CC(=O)O           acetic acid           acetic acid   acetic acid   64-19-7   10.00      mineral oil
177  44.05            CC=O              acetaldehyde          acetaldehyde  acetaldehyde  75-07-0   5.00       water
179  88.11            CC(C(=O)C)O       3-hydroxybutan-2-one  acetoin       acetoin       513-86-0  0.10       1,2-propanediol
180  58.08            CC(=O)C           propan-2-one          acetone       propan-2-one  67-64-1   25.00      water
240  106.12           C1=CC=C(C=C1)C=O  benzaldehyde          benzaldehyde  benzaldehyde  100-52-7  0.25       mineral oil

The above shows the first 5 molecules (all 128 are in the full dataframe) from Bushdid et al., 2014, which examined the perceptual discriminability of random mixtures in humans. Even though this data was not on your disk, Pyrfume fell back to loading it remotely; future loads of the same data will come from your disk for speed. The first thing to note is that the index (CID) and the first 4 columns (MolecularWeight, IsomericSMILES, IUPACName, name) are structured identically to what we generated in Part 2 from our own data. ALL Pyrfume datasets have this structure, whether they were obtained from supplemental figures and tables, Excel files or PDFs, industrial databases, books, or papyrus scrolls. Additional columns (such as the final 4 shown above) may also be present, on a case-by-case basis, depending on what the authors chose to include in their source materials.
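Because the layout is shared, separating the standard columns from the dataset-specific extras takes only a few lines. The sketch below is illustrative only: `STANDARD_COLUMNS` and `split_columns` are hypothetical names, not part of Pyrfume, and the column list is simply read off the table above. Two rows are rebuilt by hand so the example runs without network access.

```python
import pandas as pd

# Assumed standard columns, read off the table above.
# This is NOT an official Pyrfume constant.
STANDARD_COLUMNS = ["MolecularWeight", "IsomericSMILES", "IUPACName", "name"]


def split_columns(molecules: pd.DataFrame):
    """Return (standard, extras) sub-tables of a molecules dataframe."""
    present = [c for c in STANDARD_COLUMNS if c in molecules.columns]
    standard = molecules[present]
    extras = molecules.drop(columns=present)
    return standard, extras


# Two rows copied from the bushdid_2014 output shown above.
sample = pd.DataFrame(
    {
        "MolecularWeight": [60.05, 44.05],
        "IsomericSMILES": ["CC(=O)O", "CC=O"],
        "IUPACName": ["acetic acid", "acetaldehyde"],
        "name": ["acetic acid", "acetaldehyde"],
        "Odorant name": ["acetic acid", "acetaldehyde"],
        "C.A.S.": ["64-19-7", "75-07-0"],
        "% odorant": [10.00, 5.00],
        "Solvent": ["mineral oil", "water"],
    },
    index=pd.Index([176, 177], name="CID"),
)

standard, extras = split_columns(sample)
print("standard:", list(standard.columns))
print("extras:  ", list(extras.columns))
```

With Pyrfume installed, the same function could presumably be applied to the dataframe returned by `pyrfume.load_data("bushdid_2014/molecules.csv")`.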

What else has the Pyrfume project extracted from this data source?

pyrfume.show_files("bushdid_2014")
{'behavior.csv': 'Triangle test results',
 'molecules.csv': 'Molecules used',
 'stimuli.csv': 'Maps stimulus (Test UID) to composition of mixtures used'}

The above shows a manifest of the (processed) files available. Curated file names are simple and memorable (typically “molecules”, “behavior”, “stimuli”, and so on), which means you will often not even need to examine the manifest before retrieving the files you care about. Importantly, every data archive contains at least one Python script (main.py) that provides the full processing workflow, going from the raw data in the original source to the cleaned, standardized, organized, and mutually compatible datasets provided by Pyrfume.
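Since the manifest is keyed by file name, it can be turned into the path strings that `pyrfume.load_data` expects. The following is a minimal sketch that assumes `pyrfume.show_files` returns a plain dict like the one displayed above; `csv_paths` is a hypothetical helper, and the manifest is copied in by hand so the snippet runs offline.

```python
def csv_paths(archive: str, manifest: dict) -> dict:
    """Map each CSV file name in a manifest to its 'archive/file' path."""
    return {
        name: f"{archive}/{name}"
        for name in manifest
        if name.endswith(".csv")
    }


# Manifest copied from the output above.
manifest = {
    "behavior.csv": "Triangle test results",
    "molecules.csv": "Molecules used",
    "stimuli.csv": "Maps stimulus (Test UID) to composition of mixtures used",
}

paths = csv_paths("bushdid_2014", manifest)
for name, path in paths.items():
    print(f"{path}: {manifest[name]}")

# With pyrfume installed, each path could then be loaded in turn, e.g.
#   tables = {name: pyrfume.load_data(path) for name, path in paths.items()}
```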

bushdid_stimuli = pyrfume.load_data("bushdid_2014/stimuli.csv")
bushdid_behavior = pyrfume.load_data("bushdid_2014/behavior.csv")

Information about each stimulus (in this case a mixture), including the CIDs of its components (not provided in the original source data), is given in the stimuli file.

bushdid_stimuli.head()
Stimulus  Answer  Components in mixtures  Components that differ  % mixture overlap  Stimulus dilution  Molecule 1  Molecule 2  Molecule 3  Molecule 4  Molecule 5  ...  Molecule 21  Molecule 22  Molecule 23  Molecule 24  Molecule 25  Molecule 26  Molecule 27  Molecule 28  Molecule 29  Molecule 30
1         right   30                      30                      0.0                1.00               12232       7731        7888        7966        7848        ...  8103         7344         8051         7921         460          7583         11002        12367        6590         7991
1         wrong   30                      30                      0.0                0.25               440917      5281515     8030        3314        31272       ...  6561         22386        11509        443162       8797         16666        7749         7793         7799         7762
1         wrong   30                      30                      0.0                0.50               440917      5281515     8030        3314        31272       ...  6561         22386        11509        443162       8797         16666        7749         7793         7799         7762
2         right   10                      4                       60.0               0.50               3314        62433       5281515     7749        6561        ...  0            0            0            0            0            0            0            0            0            0
2         wrong   10                      4                       60.0               0.25               3314        62433       5281515     7749        6561        ...  0            0            0            0            0            0            0            0            0            0

5 rows × 35 columns
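The wide "Molecule N" layout is often easier to work with as one list of CIDs per row. The sketch below assumes that a value of 0 is padding for mixtures with fewer than 30 components, which is what the 10-component rows above suggest but is not stated in the source. `component_cids` is a hypothetical helper, and the sample keeps only four of the molecule columns from two of the rows shown.

```python
import pandas as pd


def component_cids(stimuli: pd.DataFrame) -> pd.Series:
    """Collect the non-zero 'Molecule N' entries of each row into a list."""
    molecule_cols = [c for c in stimuli.columns if c.startswith("Molecule ")]
    cids = [
        [int(cid) for cid in row if cid != 0]  # assumption: 0 means "no molecule"
        for row in stimuli[molecule_cols].to_numpy()
    ]
    return pd.Series(cids, index=stimuli.index, name="component_cids")


# Subset of two rows and four molecule columns copied from the output above.
sample = pd.DataFrame(
    {
        "Answer": ["right", "right"],
        "Stimulus dilution": [1.00, 0.50],
        "Molecule 1": [12232, 3314],
        "Molecule 2": [7731, 62433],
        "Molecule 21": [8103, 0],
        "Molecule 22": [7344, 0],
    },
    index=pd.Index([1, 2], name="Stimulus"),
)

mixtures = component_cids(sample)
print(mixtures)
```

Each resulting CID can then be looked up in the molecules table, since that table is indexed by CID.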

Finally, the human behavioral results for each mixture are provided in the behavior file. How one chooses to join these tables to produce prediction targets or otherwise explore the data is up to individual taste, but with standards in place it is much less work than it would have been with the source data alone!
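As one possible way to join the two tables, the sketch below computes the fraction of correct responses per stimulus and attaches it to the test-level columns of the stimuli table. It is only an illustration of one choice among many. The frames are rebuilt by hand from the rows shown in this section (so only stimulus 1 has behavioral data here), and the names `fraction_correct`, `test_info`, and `targets` are illustrative.

```python
import pandas as pd

# Rows copied from the bushdid_behavior.head() output in this section.
behavior = pd.DataFrame(
    {"Subject": [1, 2, 3, 4, 5], "Correct": [False, True, False, True, True]},
    index=pd.Index([1, 1, 1, 1, 1], name="Stimulus"),
)

# Test-level columns copied from the bushdid_stimuli.head() output in this section.
stimuli = pd.DataFrame(
    {
        "Answer": ["right", "wrong", "wrong", "right", "wrong"],
        "Components in mixtures": [30, 30, 30, 10, 10],
        "Components that differ": [30, 30, 30, 4, 4],
        "% mixture overlap": [0.0, 0.0, 0.0, 60.0, 60.0],
    },
    index=pd.Index([1, 1, 1, 2, 2], name="Stimulus"),
)

# Fraction of subjects who answered correctly, per stimulus.
fraction_correct = (
    behavior.groupby(level="Stimulus")["Correct"].mean().rename("fraction_correct")
)

# The stimuli table repeats each stimulus once per row, so keep one row of
# the columns that are constant within a stimulus.
test_cols = ["Components in mixtures", "Components that differ", "% mixture overlap"]
test_info = stimuli[test_cols].groupby(level="Stimulus").first()

# Inner join keeps only stimuli that have behavioral data in this sample.
targets = test_info.join(fraction_correct, how="inner")
print(targets)
```

An inner join is used so that stimuli without behavioral rows in this small sample are dropped rather than filled with missing values; with the full files, a left join may be preferable for spotting gaps.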

bushdid_behavior.head()
Stimulus  Subject  Correct
1         1        False
1         2        True
1         3        False
1         4        True
1         5        True

Many other datasets are available now, with several dozen more ready for release in the next 12 months. Licensing issues stand in the way of some, but the Pyrfume maintainers are working to resolve them.

pyrfume.list_archives()
['abraham_2012',
 'arctander_1960',
 'aromadb',
 'arshamian_2022',
 'burton_2022',
 'bushdid_2014',
 'chae_2019',
 'dravnieks_1985',
 'embedding',
 'flavordb',
 'flavornet',
 'foodb',
 'foodcomex',
 'fragrancedb',
 'freesolve',
 'goodscents',
 'gras',
 'haddad_2008',
 'ifra_2019',
 'keller_2012',
 'keller_2016',
 'knapsack',
 'leffingwell',
 'leon',
 'ma_2012',
 'ma_2021',
 'mainland_2015',
 'manoel_2021',
 'mayhew_2022',
 'molecules',
 'mordred',
 'morgan',
 'nakayama_2022',
 'nat_geo_1986',
 'nhanes_2014',
 'optical_rotation',
 'prediction_targets',
 'prestwick',
 'qian_2022',
 'ravia_2020',
 'scott_2014',
 'sharma_2021a',
 'sharma_2021b',
 'sigma_2014',
 'snitz_2013',
 'snitz_2019',
 'soh_2013',
 'superscent',
 't3db',
 'tools',
 'wakayama_2019',
 'weiss_2012']
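Because every archive's molecules file shares the same CID index and leading columns, molecule tables from different archives can be stacked and de-duplicated directly. The sketch below is a toy illustration: the two frames are stand-ins built from the bushdid_2014 rows shown earlier rather than two real archives, and `combine_molecules` and the column list are assumptions, not Pyrfume API.

```python
import pandas as pd

# Assumed shared columns, read off the molecules table shown earlier.
SHARED = ["MolecularWeight", "IsomericSMILES", "IUPACName", "name"]


def combine_molecules(tables: dict) -> pd.DataFrame:
    """Stack molecule tables and keep one row per CID."""
    stacked = pd.concat([table[SHARED] for table in tables.values()])
    return stacked[~stacked.index.duplicated(keep="first")].sort_index()


def make_table(rows: dict) -> pd.DataFrame:
    """Build a small molecules-style table indexed by CID."""
    frame = pd.DataFrame.from_dict(rows, orient="index", columns=SHARED)
    frame.index.name = "CID"
    return frame


# Toy stand-ins for two archives; CID 179 appears in both.
toy_a = make_table({
    176: [60.05, "CC(=O)O", "acetic acid", "acetic acid"],
    177: [44.05, "CC=O", "acetaldehyde", "acetaldehyde"],
    179: [88.11, "CC(C(=O)C)O", "3-hydroxybutan-2-one", "acetoin"],
})
toy_b = make_table({
    179: [88.11, "CC(C(=O)C)O", "3-hydroxybutan-2-one", "acetoin"],
    180: [58.08, "CC(=O)C", "propan-2-one", "acetone"],
    240: [106.12, "C1=CC=C(C=C1)C=O", "benzaldehyde", "benzaldehyde"],
})

combined = combine_molecules({"toy_a": toy_a, "toy_b": toy_b})
print(combined)
```

With real archives, the dict values would presumably be the dataframes returned by `pyrfume.load_data("<archive>/molecules.csv")` for each archive of interest.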