5. Published Data: Working with the Pyrfume Datasets
import pyrfume
Whether you want to build predictive models or simply organize data, it is essential to integrate across datasets and, wherever possible, to bring the largest datasets to bear on the problem. Most breakthroughs come not from new algorithms but from new, large datasets.
bushdid_molecules = pyrfume.load_data("bushdid_2014/molecules.csv")
bushdid_molecules.head()
| CID | MolecularWeight | IsomericSMILES | IUPACName | name | Odorant name | C.A.S. | % odorant | Solvent |
|---|---|---|---|---|---|---|---|---|
| 176 | 60.05 | CC(=O)O | acetic acid | acetic acid | acetic acid | 64-19-7 | 10.00 | mineral oil |
| 177 | 44.05 | CC=O | acetaldehyde | acetaldehyde | acetaldehyde | 75-07-0 | 5.00 | water |
| 179 | 88.11 | CC(C(=O)C)O | 3-hydroxybutan-2-one | acetoin | acetoin | 513-86-0 | 0.10 | 1,2-propanediol |
| 180 | 58.08 | CC(=O)C | propan-2-one | acetone | propan-2-one | 67-64-1 | 25.00 | water |
| 240 | 106.12 | C1=CC=C(C=C1)C=O | benzaldehyde | benzaldehyde | benzaldehyde | 100-52-7 | 0.25 | mineral oil |
The above shows the first 5 molecules (all 128 are in the full dataframe) from Bushdid et al., 2014, which examined the perceptual discriminability of random mixtures in humans. Even though this data was not on your disk, Pyrfume fell back to loading it remotely; future loads of the same data will come from your disk for speed. The first thing to note is that the index and the first 4 columns are structured identically to what we generated in Part 2 from our own data. All Pyrfume datasets have this structure, whether they were obtained from supplemental figures and tables, Excel files or PDFs, industrial databases, books, or papyrus scrolls. Additional columns (such as the final 4 shown above) may also be present on a case-by-case basis, depending on what the authors chose to include in their source materials.
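The "load from disk if present, otherwise fetch and save" behavior described above is a common caching pattern. The sketch below illustrates the general idea only; it is not Pyrfume's actual implementation. `load_with_cache`, `fake_fetch`, and the temporary cache directory are hypothetical names introduced for illustration, and the fake fetcher stands in for a network download.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def load_with_cache(rel_path: str, cache_dir: Path,
                    fetch_remote: Callable[[str], str]) -> str:
    """Return the text of `rel_path`, preferring a local copy.

    Hypothetical helper: illustrates the pattern, not Pyrfume's internals.
    """
    local = cache_dir / rel_path
    if local.exists():
        # Fast path: an earlier call already saved the file to disk.
        return local.read_text()
    # Slow path: fetch once, then save so that later loads are local.
    text = fetch_remote(rel_path)
    local.parent.mkdir(parents=True, exist_ok=True)
    local.write_text(text)
    return text


# Stand-in for a remote download; it records every path it is asked for.
calls: List[str] = []


def fake_fetch(rel_path: str) -> str:
    calls.append(rel_path)
    return "CID,name\n176,acetic acid\n"


cache = Path(tempfile.mkdtemp())
first = load_with_cache("bushdid_2014/molecules.csv", cache, fake_fetch)
second = load_with_cache("bushdid_2014/molecules.csv", cache, fake_fetch)
print(len(calls))  # number of times the fake remote was contacted
```

The second call never reaches the fetcher because the file written by the first call is found on disk.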
What else has the Pyrfume project extracted from this data source?
pyrfume.show_files("bushdid_2014")
{'behavior.csv': 'Triangle test results',
'molecules.csv': 'Molecules used',
'stimuli.csv': 'Maps stimulus (Test UID) to composition of mixtures used'}
The above shows a manifest of the (processed) files available. Curated file names are simple and memorable (typically “molecules”, “behavior”, “stimuli”, etc.), which means you will often not need to examine the manifest before retrieving the files you care about. Importantly, every data archive contains at least one Python script (main.py), which provides the full processing workflow from the raw data in the original source to the cleaned, standardized, organized, and mutually compatible datasets provided by Pyrfume.
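Because file names are standardized, a whole archive can be loaded in one pass. The following is a minimal sketch that assumes the manifest is a dict mapping file names to descriptions, as the output above suggests. `load_archive` is a hypothetical helper, not part of Pyrfume, and the loader is passed in as an argument so that the sketch runs offline with a stub.

```python
from typing import Any, Callable, Dict


def load_archive(archive: str, manifest: Dict[str, str],
                 loader: Callable[[str], Any]) -> Dict[str, Any]:
    """Load every CSV listed in a manifest, keyed by its short name."""
    tables = {}
    for file_name in manifest:
        if file_name.endswith(".csv"):
            short_name = file_name[: -len(".csv")]
            tables[short_name] = loader(f"{archive}/{file_name}")
    return tables


# Manifest copied from the output above.
manifest = {
    "behavior.csv": "Triangle test results",
    "molecules.csv": "Molecules used",
    "stimuli.csv": "Maps stimulus (Test UID) to composition of mixtures used",
}

# A stub loader that simply echoes the requested path.
tables = load_archive("bushdid_2014", manifest, loader=lambda path: path)
print(sorted(tables))
```

In a live session one would pass `loader=pyrfume.load_data` in place of the stub.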
bushdid_stimuli = pyrfume.load_data("bushdid_2014/stimuli.csv")
bushdid_behavior = pyrfume.load_data("bushdid_2014/behavior.csv")
Information about each stimulus (in this case a mixture), including the CID of each component (not provided in the original source data), is given in the stimuli file.
bushdid_stimuli.head()
| Stimulus | Answer | Components in mixtures | Components that differ | % mixture overlap | Stimulus dilution | Molecule 1 | Molecule 2 | Molecule 3 | Molecule 4 | Molecule 5 | ... | Molecule 21 | Molecule 22 | Molecule 23 | Molecule 24 | Molecule 25 | Molecule 26 | Molecule 27 | Molecule 28 | Molecule 29 | Molecule 30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | right | 30 | 30 | 0.0 | 1.00 | 12232 | 7731 | 7888 | 7966 | 7848 | ... | 8103 | 7344 | 8051 | 7921 | 460 | 7583 | 11002 | 12367 | 6590 | 7991 |
| 1 | wrong | 30 | 30 | 0.0 | 0.25 | 440917 | 5281515 | 8030 | 3314 | 31272 | ... | 6561 | 22386 | 11509 | 443162 | 8797 | 16666 | 7749 | 7793 | 7799 | 7762 |
| 1 | wrong | 30 | 30 | 0.0 | 0.50 | 440917 | 5281515 | 8030 | 3314 | 31272 | ... | 6561 | 22386 | 11509 | 443162 | 8797 | 16666 | 7749 | 7793 | 7799 | 7762 |
| 2 | right | 10 | 4 | 60.0 | 0.50 | 3314 | 62433 | 5281515 | 7749 | 6561 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | wrong | 10 | 4 | 60.0 | 0.25 | 3314 | 62433 | 5281515 | 7749 | 6561 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 35 columns
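To work with mixture composition, it can help to collapse the "Molecule N" columns of a row into a list of CIDs. The sketch below assumes that a value of 0 is padding for mixtures with fewer than 30 components, which the trailing zeros in the 10-component rows above suggest but which the source does not state. `component_cids` is a hypothetical helper, and the small dataframe contains only a few values copied from the visible columns of the output above.

```python
import pandas as pd

# Two rows copied from the output above, restricted to a handful of the
# visible "Molecule N" columns (the elided columns are left out here).
stimuli = pd.DataFrame(
    {
        "Molecule 1": [12232, 3314],
        "Molecule 2": [7731, 62433],
        "Molecule 3": [7888, 5281515],
        "Molecule 29": [6590, 0],
        "Molecule 30": [7991, 0],
    },
    index=pd.Index([1, 2], name="Stimulus"),
)


def component_cids(row: pd.Series) -> list:
    """Return the non-zero CIDs from the 'Molecule N' columns of one row.

    Assumes 0 marks an unused slot (an assumption, not documented here).
    """
    molecule_cols = [c for c in row.index if c.startswith("Molecule ")]
    return [int(cid) for cid in row[molecule_cols] if cid != 0]


components = {stim: component_cids(row) for stim, row in stimuli.iterrows()}
print(components[2])
```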
Finally, the human behavioral results for each mixture are provided in the behavior file. How one chooses to join these tables to produce prediction targets or otherwise explore the data is up to individual taste, but with standards in place it is much less work than it would have been with the source data alone!
bushdid_behavior.head()
| Stimulus | Subject | Correct |
|---|---|---|
| 1 | 1 | False |
| 1 | 2 | True |
| 1 | 3 | False |
| 1 | 4 | True |
| 1 | 5 | True |
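One possible prediction target is the fraction of subjects who answered correctly for each stimulus. The sketch below is one choice among many, not a method prescribed by the source. It rebuilds the five rows shown above by hand so that it runs without a download, and the name `FractionCorrect` is introduced here for illustration.

```python
import pandas as pd

# The five rows of the behavior table shown above.
behavior = pd.DataFrame(
    {
        "Subject": [1, 2, 3, 4, 5],
        "Correct": [False, True, False, True, True],
    },
    index=pd.Index([1, 1, 1, 1, 1], name="Stimulus"),
)

# Average the boolean outcomes within each stimulus to get a success rate.
fraction_correct = (
    behavior.groupby(level="Stimulus")["Correct"]
    .mean()
    .rename("FractionCorrect")
)
print(fraction_correct)
```

Because the result is indexed by Stimulus, it could then be combined with columns from the stimuli table on that shared index.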
Many other datasets are available now, with several dozen more expected to be ready for release in the next 12 months. Licensing issues stand in the way of some, but the Pyrfume maintainers are working to resolve them.
pyrfume.list_archives()
['abraham_2012',
'arctander_1960',
'aromadb',
'arshamian_2022',
'burton_2022',
'bushdid_2014',
'chae_2019',
'dravnieks_1985',
'embedding',
'flavordb',
'flavornet',
'foodb',
'foodcomex',
'fragrancedb',
'freesolve',
'goodscents',
'gras',
'haddad_2008',
'ifra_2019',
'keller_2012',
'keller_2016',
'knapsack',
'leffingwell',
'leon',
'ma_2012',
'ma_2021',
'mainland_2015',
'manoel_2021',
'mayhew_2022',
'molecules',
'mordred',
'morgan',
'nakayama_2022',
'nat_geo_1986',
'nhanes_2014',
'optical_rotation',
'prediction_targets',
'prestwick',
'qian_2022',
'ravia_2020',
'scott_2014',
'sharma_2021a',
'sharma_2021b',
'sigma_2014',
'snitz_2013',
'snitz_2019',
'soh_2013',
'superscent',
't3db',
'tools',
'wakayama_2019',
'weiss_2012']