5. Published Data: Working with the Pyrfume Datasets

import pyrfume

Whether you want to build predictive models or simply organize data, it is essential to integrate across datasets and, wherever possible, to bring the largest datasets to bear on the problem. Most breakthroughs don’t come from new algorithms but from new, large datasets.

bushdid_molecules = pyrfume.load_data("bushdid_2014/molecules.csv")
bushdid_molecules.head()

	MolecularWeight	IsomericSMILES	IUPACName	name	Odorant name	C.A.S.	% odorant	Solvent
CID
176	60.05	CC(=O)O	acetic acid	acetic acid	acetic acid	64-19-7	10.00	mineral oil
177	44.05	CC=O	acetaldehyde	acetaldehyde	acetaldehyde	75-07-0	5.00	water
179	88.11	CC(C(=O)C)O	3-hydroxybutan-2-one	acetoin	acetoin	513-86-0	0.10	1,2-propanediol
180	58.08	CC(=O)C	propan-2-one	acetone	propan-2-one	67-64-1	25.00	water
240	106.12	C1=CC=C(C=C1)C=O	benzaldehyde	benzaldehyde	benzaldehyde	100-52-7	0.25	mineral oil

The output above shows the first 5 molecules (the full dataframe contains all 128) from Bushdid et al., 2014, which examined the perceptual discriminability of random mixtures in humans. Even though this data was not on your disk, Pyrfume fell back to loading it remotely; future loads of the same data will come from your disk for speed. The first thing to note is that the index and the first 4 columns are structured identically to what we generated in Part 2 from our own data. ALL Pyrfume datasets have this structure, whether they were obtained from supplemental figures and tables, Excel files or PDFs, industrial databases, books, or papyrus scrolls. Additional columns (such as the final 4 shown above) may also be present on a case-by-case basis, depending on what the authors chose to include in their source materials.
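Because molecule files share a common layout, a quick structural check can separate the shared columns from the source-specific extras. The sketch below is a minimal illustration, not part of Pyrfume: it rebuilds three rows of the table above by hand so that it runs offline, and both the `STANDARD_COLUMNS` list (the four columns that precede the source-specific ones in that table) and the `split_columns` helper are assumptions made for this example.

```python
import pandas as pd

# Assumption inferred from the table above: the columns that come before
# the source-specific ones. This list is not defined by Pyrfume itself.
STANDARD_COLUMNS = ["MolecularWeight", "IsomericSMILES", "IUPACName", "name"]

# Three rows copied by hand from the output above, so the example runs offline.
molecules = pd.DataFrame(
    {
        "MolecularWeight": [60.05, 44.05, 88.11],
        "IsomericSMILES": ["CC(=O)O", "CC=O", "CC(C(=O)C)O"],
        "IUPACName": ["acetic acid", "acetaldehyde", "3-hydroxybutan-2-one"],
        "name": ["acetic acid", "acetaldehyde", "acetoin"],
        "Odorant name": ["acetic acid", "acetaldehyde", "acetoin"],
        "C.A.S.": ["64-19-7", "75-07-0", "513-86-0"],
        "% odorant": [10.0, 5.0, 0.1],
        "Solvent": ["mineral oil", "water", "1,2-propanediol"],
    },
    index=pd.Index([176, 177, 179], name="CID"),
)


def split_columns(df):
    """Return (standard, extra) column names, preserving their order."""
    standard = [c for c in df.columns if c in STANDARD_COLUMNS]
    extra = [c for c in df.columns if c not in STANDARD_COLUMNS]
    return standard, extra


standard, extra = split_columns(molecules)
print("index:", molecules.index.name)
print("standard:", standard)
print("extra:", extra)
```

The same helper could be applied to a dataframe returned by `pyrfume.load_data` to see which columns a given source adds on top of the shared ones.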

What else has the Pyrfume project extracted from this data source?

pyrfume.show_files("bushdid_2014")

{'behavior.csv': 'Triangle test results',
 'molecules.csv': 'Molecules used',
 'stimuli.csv': 'Maps stimulus (Test UID) to composition of mixtures used'}

The output above is a manifest of the (processed) files available. Curated file names are simple and memorable (typically “molecules”, “behavior”, “stimuli”, and so on), which means you will often not even need to examine the manifest before retrieving the files you care about. Importantly, every data archive contains at least one Python script (main.py) that provides the full processing workflow, from the raw data in the original source to the cleaned, standardized, organized, and mutually compatible datasets provided by Pyrfume.
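Since the manifest maps file names to descriptions, it can also drive a loop that loads every file in an archive at once. The following is a rough sketch under stated assumptions: `load_archive` is a hypothetical helper (not a Pyrfume function), the manifest is taken to be a plain dict like the one printed above, and the loader is passed in as an argument so the example runs without network access.

```python
from pathlib import PurePosixPath

# Manifest copied from the output above.
manifest = {
    "behavior.csv": "Triangle test results",
    "molecules.csv": "Molecules used",
    "stimuli.csv": "Maps stimulus (Test UID) to composition of mixtures used",
}


def load_archive(archive, manifest, loader):
    """Load every file named in `manifest`, keyed by file name without extension.

    `loader` is any callable that accepts a relative path such as
    "bushdid_2014/molecules.csv" and returns the loaded table.
    """
    return {
        PurePosixPath(file_name).stem: loader(f"{archive}/{file_name}")
        for file_name in manifest
    }


# Stand-in loader that only echoes the path it was asked for,
# so the sketch can be run offline.
tables = load_archive("bushdid_2014", manifest, loader=lambda path: path)
print(tables)
```

With Pyrfume installed, passing `pyrfume.load_data` as `loader` should return the three dataframes instead of the echoed paths, assuming it accepts each listed path in the same way as the calls in this section.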

bushdid_stimuli = pyrfume.load_data("bushdid_2014/stimuli.csv")
bushdid_behavior = pyrfume.load_data("bushdid_2014/behavior.csv")

Information about each stimulus (in this case a mixture), including the CIDs of its components (not provided in the original source data), is given in the stimuli file.

bushdid_stimuli.head()

	Answer	Components in mixtures	Components that differ	% mixture overlap	Stimulus dilution	Molecule 1	Molecule 2	Molecule 3	Molecule 4	Molecule 5	...	Molecule 21	Molecule 22	Molecule 23	Molecule 24	Molecule 25	Molecule 26	Molecule 27	Molecule 28	Molecule 29	Molecule 30
Stimulus
1	right	30	30	0.0	1.00	12232	7731	7888	7966	7848	...	8103	7344	8051	7921	460	7583	11002	12367	6590	7991
1	wrong	30	30	0.0	0.25	440917	5281515	8030	3314	31272	...	6561	22386	11509	443162	8797	16666	7749	7793	7799	7762
1	wrong	30	30	0.0	0.50	440917	5281515	8030	3314	31272	...	6561	22386	11509	443162	8797	16666	7749	7793	7799	7762
2	right	10	4	60.0	0.50	3314	62433	5281515	7749	6561	...	0	0	0	0	0	0	0	0	0	0
2	wrong	10	4	60.0	0.25	3314	62433	5281515	7749	6561	...	0	0	0	0	0	0	0	0	0	0

5 rows × 35 columns
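The wide "Molecule N" layout above can be collapsed into one list of component CIDs per row, which is often easier to work with. The sketch below makes two assumptions that are not confirmed by the source: that a value of 0 marks an unused slot (suggested by the 10-component rows, whose later molecule columns are all 0), and that every component column name starts with "Molecule ". The `mixture_components` helper is hypothetical, and the small dataframe contains only columns visible in the truncated output.

```python
import pandas as pd

# Two rows from the table above, restricted to columns that are visible
# in the truncated output (the elided columns are deliberately left out).
stimuli = pd.DataFrame(
    {
        "Components in mixtures": [30, 10],
        "Molecule 1": [12232, 3314],
        "Molecule 2": [7731, 62433],
        "Molecule 3": [7888, 5281515],
        "Molecule 4": [7966, 7749],
        "Molecule 5": [7848, 6561],
        "Molecule 21": [8103, 0],
        "Molecule 22": [7344, 0],
    },
    index=pd.Index([1, 2], name="Stimulus"),
)


def mixture_components(df):
    """Return a Series holding, for each row, the list of non-zero CIDs."""
    mol_cols = [c for c in df.columns if c.startswith("Molecule ")]
    rows = df[mol_cols].to_numpy()
    # Assumption: 0 is padding for unused slots, not a real CID.
    lists = [[int(cid) for cid in row if cid != 0] for row in rows]
    return pd.Series(lists, index=df.index, name="CIDs")


components = mixture_components(stimuli)
print(components)
```

Because the result keeps the original index, it lines up row for row with the stimuli table it was derived from.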

Finally, the human behavioral results for each mixture are provided in the behavior file. How one chooses to join these tables to produce prediction targets or otherwise explore the data is a matter of individual taste, but with standards in place it is much less work than it would have been with the source data alone!

bushdid_behavior.head()

	Subject	Correct
Stimulus
1	1	False
1	2	True
1	3	False
1	4	True
1	5	True
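One simple way to turn the behavior table into a prediction target is to aggregate over subjects, giving the fraction of correct responses per stimulus. This is only one possible choice, sketched here with the five rows shown above entered by hand; the column names `fraction_correct` and `n_subjects` are made up for this example.

```python
import pandas as pd

# The five rows shown above, entered by hand so the example runs offline.
behavior = pd.DataFrame(
    {
        "Subject": [1, 2, 3, 4, 5],
        "Correct": [False, True, False, True, True],
    },
    index=pd.Index([1, 1, 1, 1, 1], name="Stimulus"),
)

# Group on the Stimulus index; the mean of a boolean column is the
# fraction of True values, and count gives the number of responses.
summary = behavior.groupby(level="Stimulus")["Correct"].agg(
    fraction_correct="mean",
    n_subjects="count",
)
print(summary)
```

For these five rows the fraction correct is 3 out of 5. When interpreting such values, keep in mind that guessing in a triangle test is expected to succeed one time in three.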

Many other datasets are available now, with several dozen additional datasets ready for release in the next 12 months. Licensing issues stand in the way of some, but the Pyrfume maintainers are working to resolve them.

pyrfume.list_archives()

['abraham_2012',
 'arctander_1960',
 'aromadb',
 'arshamian_2022',
 'burton_2022',
 'bushdid_2014',
 'chae_2019',
 'dravnieks_1985',
 'embedding',
 'flavordb',
 'flavornet',
 'foodb',
 'foodcomex',
 'fragrancedb',
 'freesolve',
 'goodscents',
 'gras',
 'haddad_2008',
 'ifra_2019',
 'keller_2012',
 'keller_2016',
 'knapsack',
 'leffingwell',
 'leon',
 'ma_2012',
 'ma_2021',
 'mainland_2015',
 'manoel_2021',
 'mayhew_2022',
 'molecules',
 'mordred',
 'morgan',
 'nakayama_2022',
 'nat_geo_1986',
 'nhanes_2014',
 'optical_rotation',
 'prediction_targets',
 'prestwick',
 'qian_2022',
 'ravia_2020',
 'scott_2014',
 'sharma_2021a',
 'sharma_2021b',
 'sigma_2014',
 'snitz_2013',
 'snitz_2019',
 'soh_2013',
 'superscent',
 't3db',
 'tools',
 'wakayama_2019',
 'weiss_2012']
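With many archives using the same molecule layout, a natural first integration step is to pool molecule tables from several sources and keep one row per CID. The sketch below illustrates that idea with two small hand-built dataframes: the first holds rows from the Bushdid table above, and the second is a hypothetical stand-in for another archive's molecules file. `combine_molecules` is an illustrative helper, not a Pyrfume function, and not every archive listed above necessarily provides a molecules file.

```python
import pandas as pd


def combine_molecules(frames):
    """Stack molecule tables and keep the first row seen for each CID."""
    combined = pd.concat(frames)
    return combined[~combined.index.duplicated(keep="first")].sort_index()


# Rows taken from the Bushdid molecules table shown earlier in this section.
first = pd.DataFrame(
    {
        "MolecularWeight": [60.05, 44.05],
        "IsomericSMILES": ["CC(=O)O", "CC=O"],
        "IUPACName": ["acetic acid", "acetaldehyde"],
        "name": ["acetic acid", "acetaldehyde"],
    },
    index=pd.Index([176, 177], name="CID"),
)

# Hypothetical second source that overlaps with the first on CID 177.
second = pd.DataFrame(
    {
        "MolecularWeight": [44.05, 58.08],
        "IsomericSMILES": ["CC=O", "CC(=O)C"],
        "IUPACName": ["acetaldehyde", "propan-2-one"],
        "name": ["acetaldehyde", "acetone"],
    },
    index=pd.Index([177, 180], name="CID"),
)

combined = combine_molecules([first, second])
print(combined)
```

In practice the input frames would come from calls such as `pyrfume.load_data("bushdid_2014/molecules.csv")`, restricted to the shared columns before pooling.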
