Foundry Datasets

Describes the metadata for each Foundry dataset

Foundry Datasets are composed of two key components: data and descriptive metadata. To make the data easily consumable, the data (consisting of files) should be assembled following the supported structures. The metadata description allows tracking of high-level information (e.g., authors, associated institutions, licenses, data location), as well as information on how to operate on the datasets (e.g., how to load the data, training/test splits).

Data

Example - Record-Based Data

Tabular Data

For tabular data, columns should represent the different keys of the data, and rows should represent individual records.

Supported tabular data types currently include JSON, csv, and xlsx.

In this example, we show how to describe a JSON record-based dataset, where each record is a valid JSON object in a JSON list or a line in a line-delimited JSON file.

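| feature_1 | feature_2 | material_type | band_gap |
| --------- | --------- | ------------- | -------- |
| 0.10      | 0.52      |               | 1.40     |
| 0.34      | 0.910     |               | 0.73     |
| ...       | ...       | ...           | ...      |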

For this example dataset the Key object could be:

{
	"short_name": "oqmd-bandgaps",
	"data_type": "tabular",
	"task_type": ["supervised"],
	"domain": ["materials science"],
	"n_items": 29197,
	"splits": [{
		"type": "train",
		"path": "foundry_dataframe.json",
		"label": "train"
	}],
	"keys": [{
			"key": ["reference"],
			"type": "input",
			"units": "",
			"description": "source publication of the bandgap value"
		}, {
			"key": ["icsd_id"],
			"type": "input",
			"units": "",
			"description": "corresponding id in ICSD of this compound"
		}, {
			"key": ["structure"],
			"type": "input",
			"units": "",
			"description": "the structure of this compound"
		}, {
			"key": ["composition"],
			"type": "input",
			"units": "",
			"description": "reduced composition of this compound"
		}, {
			"key": ["comments"],
			"type": "input",
			"units": "",
			"description": "Additional information about this bandgap measurement"
		}, {
			"key": ["bandgap type"],
			"type": "input",
			"units": "",
			"description": "the type of the bandgap, e.g., direct or indirect"
		}, {
			"key": ["comp method"],
			"type": "input",
			"units": "",
			"description": "functional used to calculate the bandgap"
		}, {
			"key": ["space group"],
			"type": "input",
			"units": "",
			"description": "the space group of this compound"
		},
		{
			"key": ["bandgap value (eV)"],
			"type": "output",
			"units": "eV",
			"description": "value of the bandgap"
		}
	]
}

For the example table above, the keys entries would be:

"keys":[{
		 	"key": "feature_1",
			"type": "input",
			"units": None,
			"description": "This is feature 1"
		},{
			"key": "feature_2",
			"type": "input",
			"units": None,
			"description": "This is feature 2"
		},{
			"key": "material_type",
			"type": "input",
			"units": None,
			"description": "This is the material type",
			"labels":["perovskite","not perovskite"]
		},{
			"key": "band_gap",
			"type": "target",
			"units": "eV",
			"description": "This is the simulated band gap in eV"
		}
]
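As a rough illustration of how the "type" field of each key can be used, the sketch below splits records into inputs and targets. The helper names `key_names` and `split_by_key_type` are hypothetical and are not part of the Foundry SDK. The records are illustrative values for the example columns, and the code accepts "key" given either as a string or as a list, since both forms appear on this page.

```python
def key_names(entry):
    """Return the names in a key entry, whether "key" is a string or a list."""
    name = entry["key"]
    return [name] if isinstance(name, str) else list(name)


def split_by_key_type(records, keys):
    """Split each record into an inputs dict and a targets dict.

    Hypothetical helper for illustration; not Foundry's own loader.
    """
    input_names = [n for k in keys if k["type"] == "input" for n in key_names(k)]
    target_names = [n for k in keys if k["type"] == "target" for n in key_names(k)]
    inputs = [{n: rec[n] for n in input_names} for rec in records]
    targets = [{n: rec[n] for n in target_names} for rec in records]
    return inputs, targets


# Illustrative key entries and records (values chosen for the example only)
keys = [
    {"key": "feature_1", "type": "input"},
    {"key": "feature_2", "type": "input"},
    {"key": "band_gap", "type": "target", "units": "eV"},
]
records = [
    {"feature_1": 0.10, "feature_2": 0.52, "band_gap": 1.40},
    {"feature_1": 0.34, "feature_2": 0.910, "band_gap": 0.73},
]

X, y = split_by_key_type(records, keys)
```

Here `X` holds only the columns marked "input" and `y` only those marked "target", one dict per record.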

This tabular data file should be saved in the base directory as foundry_dataframe.json.
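The two JSON layouts described earlier (a single JSON list of records, or one JSON object per line) can both be read with the Python standard library. The following is a minimal sketch, not Foundry's own loader. The helper name `load_records` is hypothetical, and the sketch assumes a file that starts with `[` is a JSON list and that anything else is line-delimited JSON.

```python
import json
from pathlib import Path


def load_records(path):
    """Read records from a JSON list file or a line-delimited JSON file.

    Hypothetical helper for illustration; not part of the Foundry SDK.
    """
    text = Path(path).read_text(encoding="utf-8").strip()
    if not text:
        return []
    if text.startswith("["):
        # The whole file is one JSON list of objects
        records = json.loads(text)
    else:
        # Line-delimited JSON: one object per non-empty line
        records = [json.loads(line) for line in text.splitlines() if line.strip()]
    if not all(isinstance(rec, dict) for rec in records):
        raise ValueError("every record must be a JSON object")
    return records
```

Called as `load_records("foundry_dataframe.json")`, this returns a list of dicts, one per record, regardless of which of the two layouts the file uses.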

TODO: Write a general pandas reader that tries csv, JSON, and xlsx when opening the file.

Hierarchical Data

Foundry also supports data in hierarchical data formats (e.g., HDF5). In this case, features and outputs can be represented with / notation. For example, if the features of a dataset are located in arrays stored at /data/arr1 and /other_data/arr2, while the outputs are in /data/band_gaps, the Key object would be:

{
		"short_name": "segmentation-dev",
		"data_type": "hdf5",
		"task_type": ["unsupervised", "segmentation"],
		"domain": ["materials science", "chemistry"],
		"n_items": 100,
		"splits": [{
			"type": "train",
			"path": "foundry.hdf5",
			"label": "train"
		}],
		"keys": [{
			"key": ["train/input"],
			"type": "input",
			"description": "input, unlabeled images"
		}, {
			"key": ["train/output"],
			"type": "target",
			"description": "target, labeled images"
		}]
	}

"keys":[{
			"key": "/data/arr1",
			"type": "input",
			"units": None,
			"description": "This is an array containing input data"
		},{
			"key": "/other_data/arr2",
			"type": "input",
			"units": None,
			"description": "This is another array containing input data"
		},{
		  "key": "/data/band_gaps",
			"type": "target",
			"units": "eV",
			"description": "This is the simulated band gap in eV"
		}
]
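To make the / notation concrete, the sketch below resolves such paths against nested dictionaries, which stand in here for the groups and arrays of an HDF5 file. The helper `resolve_path` and the array values are hypothetical and only for illustration; this is not how Foundry itself reads HDF5 files.

```python
def resolve_path(root, path):
    """Follow a "/"-separated path through nested mappings."""
    node = root
    for part in path.strip("/").split("/"):
        node = node[part]
    return node


# Nested dicts standing in for an HDF5 file (illustrative values only)
store = {
    "data": {"arr1": [1.0, 2.0], "band_gaps": [1.40, 0.73]},
    "other_data": {"arr2": [3.0, 4.0]},
}

keys = [
    {"key": "/data/arr1", "type": "input"},
    {"key": "/other_data/arr2", "type": "input"},
    {"key": "/data/band_gaps", "type": "target", "units": "eV"},
]

# Collect arrays by role, using each key's path to find its data
inputs = {k["key"]: resolve_path(store, k["key"]) for k in keys if k["type"] == "input"}
targets = {k["key"]: resolve_path(store, k["key"]) for k in keys if k["type"] == "target"}
```

Each key's path selects one array, and its "type" decides whether that array is treated as an input or a target.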

Descriptive Metadata

DataCite Metadata (object): All datasets can be described using metadata in compliance with the DataCite metadata format. This metadata captures high-level information about the dataset (e.g., authors, associated institutions, licenses). Many of these fields have helper functions in the SDK to make it easier to match the DataCite schema.

Keys (object): Key objects provide a mapping that allows Foundry to read data from the underlying data structure into usable Python objects. Key objects have the following properties:

key (str): A name mapping to a column name (e.g., for csv files) or a key within a data structure (e.g., for HDF5 files).
type (str): The type of key this entry represents. Currently supported types are ["input", "target"].
units (str) [optional]: The scientific units associated with a key. Default: None
description (str) [optional]: A free-text description of the key. Default: None
labels (list of str) [optional]: A list of strings mapped to integers in a key column.
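As a small illustration of the labels property, the sketch below turns integer codes in a column back into their label strings. It assumes that an integer i refers to the i-th entry of the labels list; that positional convention is an assumption made for this example, and `decode_labels` is a hypothetical helper rather than an SDK function.

```python
def decode_labels(values, labels):
    """Map integer codes to label strings by position (assumed convention)."""
    return [labels[v] for v in values]


labels = ["perovskite", "not perovskite"]
decoded = decode_labels([0, 1, 0], labels)
# decoded == ["perovskite", "not perovskite", "perovskite"]
```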

short_name (str): A unique name associated with this dataset, used to make loading easier.

type (str): A hint to Foundry on how to map the keys into loading operations. Options: ["tabular", "hdf5"].

"foundry": {
	"dc": {},
	"keys": [{
			"type": "input",
			"key": "feature_1",
			"units": "",
			"description": "This is an input"
		},
		{
			"type": "target",
			"key": "band_gap",
			"units": "eV",
			"description": "This is the simulated band gap in eV",
			"labels": []
		}
	],
	"short_name": "my_short_name",
	"type": "tabular"
}
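The constraints stated above (a short name, a type of "tabular" or "hdf5", and key types of "input" or "target") can be checked before a dataset is described. The following is a minimal sketch under those stated constraints only. `validate_foundry_metadata` is a hypothetical helper, not an SDK function, and Foundry may enforce additional rules.

```python
ALLOWED_DATA_TYPES = ("tabular", "hdf5")
ALLOWED_KEY_TYPES = ("input", "target")


def validate_foundry_metadata(meta):
    """Return a list of problems found in a foundry metadata dict.

    Hypothetical helper; checks only the constraints stated on this page.
    """
    problems = []
    short_name = meta.get("short_name")
    if not isinstance(short_name, str) or not short_name:
        problems.append("short_name must be a non-empty string")
    if meta.get("type") not in ALLOWED_DATA_TYPES:
        problems.append(f"type must be one of {ALLOWED_DATA_TYPES}")
    for i, entry in enumerate(meta.get("keys", [])):
        if entry.get("type") not in ALLOWED_KEY_TYPES:
            problems.append(f"keys[{i}].type must be one of {ALLOWED_KEY_TYPES}")
    return problems


metadata = {
    "dc": {},
    "keys": [
        {"type": "input", "units": "", "description": "This is an input"},
        {"type": "target", "units": "eV"},
    ],
    "short_name": "my_short_name",
    "type": "tabular",
}
problems = validate_foundry_metadata(metadata)
```

An empty list means none of the checked constraints were violated; otherwise each string names one problem.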


