Foundry Datasets
Describes the data and metadata associated with each Foundry dataset
Foundry Datasets comprise two key components: data and descriptive metadata. To make the data easily consumable, the data (consisting of files) should be assembled following the supported structures. The metadata description tracks high-level information (e.g., authors, associated institutions, licenses, data location) as well as information on how to operate on the datasets (e.g., how to load the data, training/test splits).
Data
Example - Record-Based Data
Tabular Data
For tabular data, columns represent the different keys of the data, and rows represent individual records.
In this example, we showcase how to describe a JSON record-based dataset where each record is a valid JSON object, stored either as an element of a JSON list or as one line of a line-delimited JSON file.
| feature_1 | feature_2 | material_type | band_gap |
| --- | --- | --- | --- |
| 0.10 | 0.52 | 1 | 1.40 |
| 0.34 | 0.910 | 0 | 0.73 |
| ... | ... | ... | ... |
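The two record layouts mentioned above (a single JSON list, or one JSON object per line) can be read with the standard library alone. The following is a minimal sketch, not Foundry's own loader; the function name `parse_records` is hypothetical, and the sample strings simply reuse the two rows from the table.

```python
import json


def parse_records(text):
    """Parse record-based JSON given either as one JSON list or as JSON Lines.

    Hypothetical helper for illustration; not part of the Foundry SDK.
    """
    stripped = text.strip()
    if stripped.startswith("["):
        # A single JSON list whose elements are the records.
        return json.loads(stripped)
    # Otherwise treat every non-empty line as one JSON object.
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]


# The same two records, once as a JSON list and once as JSON Lines.
json_list = (
    '[{"feature_1": 0.10, "feature_2": 0.52, "material_type": 1, "band_gap": 1.40},'
    ' {"feature_1": 0.34, "feature_2": 0.910, "material_type": 0, "band_gap": 0.73}]'
)
json_lines = (
    '{"feature_1": 0.10, "feature_2": 0.52, "material_type": 1, "band_gap": 1.40}\n'
    '{"feature_1": 0.34, "feature_2": 0.910, "material_type": 0, "band_gap": 0.73}\n'
)

records_from_list = parse_records(json_list)
records_from_lines = parse_records(json_lines)
```

Both calls yield a list of dictionaries, one per row, with the column names as dictionary keys.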
For a tabular dataset like this one, the metadata, including its Key objects, could be:
{
"short_name": "oqmd-bandgaps",
"data_type": "tabular",
"task_type": ["supervised"],
"domain": ["materials science"],
"n_items": 29197,
"splits": [{
"type": "train",
"path": "foundry_dataframe.json",
"label": "train"
}],
"keys": [{
"key": ["reference"],
"type": "input",
"units": "",
"description": "source publication of the bandgap value"
}, {
"key": ["icsd_id"],
"type": "input",
"units": "",
"description": "corresponding id in ICSD of this compound"
}, {
"key": ["structure"],
"type": "input",
"units": "",
"description": "the structure of this compound"
}, {
"key": ["composition"],
"type": "input",
"units": "",
"description": "reduced composition of this compound"
}, {
"key": ["comments"],
"type": "input",
"units": "",
"description": "Additional information about this bandgap measurement"
}, {
"key": ["bandgap type"],
"type": "input",
"units": "",
"description": "the type of the bandgap, e.g., direct or indirect"
}, {
"key": ["comp method"],
"type": "input",
"units": "",
"description": "functional used to calculate the bandgap"
}, {
"key": ["space group"],
"type": "input",
"units": "",
"description": "the space group of this compound"
},
{
"key": ["bandgap value (eV)"],
"type": "output",
"units": "eV",
"description": "value of the bandgap"
}
]
}
TODO: Write a general pandas reader that tries CSV, JSON, and XLSX when opening files.
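To illustrate how the "keys" entries in such metadata might be used, the sketch below separates one record into inputs and outputs according to each entry's "type". It is an assumption about usage, not the Foundry SDK's implementation: the helpers `columns_of_type` and `split_record` are hypothetical, only three of the keys from the example are included, and the record values are made-up placeholders rather than real data.

```python
# A trimmed copy of the "keys" section from the example metadata.
metadata = {
    "keys": [
        {"key": ["composition"], "type": "input"},
        {"key": ["space group"], "type": "input"},
        {"key": ["bandgap value (eV)"], "type": "output"},
    ]
}


def columns_of_type(metadata, key_type):
    """Collect the column names of all key entries with the given type."""
    names = []
    for entry in metadata["keys"]:
        if entry["type"] == key_type:
            names.extend(entry["key"])
    return names


def split_record(record, metadata):
    """Split one record into (inputs, outputs) using the key types."""
    inputs = {name: record[name] for name in columns_of_type(metadata, "input")}
    outputs = {name: record[name] for name in columns_of_type(metadata, "output")}
    return inputs, outputs


# Placeholder values for illustration only; not taken from any dataset.
record = {"composition": "A2B", "space group": "P1", "bandgap value (eV)": 1.5}
inputs, outputs = split_record(record, metadata)
```

Because each "key" is a list, a single entry could in principle name several columns; the helper flattens them into one list of names.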
Hierarchical Data
Foundry also supports data from hierarchical data formats (e.g., HDF5). In this case, features and outputs can be represented with / notation. For example, if the features of a dataset are located in arrays stored at /data/arr1 and /other_data/arr2 while the outputs are in /data/band_gaps, the Key objects would reference those paths as their keys.
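As a hedged sketch of what this could look like, the snippet below writes the three paths as key entries and resolves them against a nested dictionary standing in for an HDF5 file. The entry layout mirrors the tabular example and is an assumption, as is the `resolve` helper; the array values reuse the numbers from the tabular example. With a real HDF5 file, a library such as h5py would perform the path lookup instead.

```python
# Assumed key entries for the hierarchical case, one per path.
keys = [
    {"key": ["/data/arr1"], "type": "input"},
    {"key": ["/other_data/arr2"], "type": "input"},
    {"key": ["/data/band_gaps"], "type": "output"},
]

# A nested dict used as a stand-in for the groups and arrays of an HDF5 file.
store = {
    "data": {"arr1": [0.10, 0.34], "band_gaps": [1.40, 0.73]},
    "other_data": {"arr2": [0.52, 0.910]},
}


def resolve(tree, path):
    """Follow a '/'-separated path through nested dictionaries."""
    node = tree
    for part in path.strip("/").split("/"):
        node = node[part]
    return node


# Gather arrays by key type, keyed by their full path.
inputs = {
    path: resolve(store, path)
    for entry in keys if entry["type"] == "input"
    for path in entry["key"]
}
outputs = {
    path: resolve(store, path)
    for entry in keys if entry["type"] == "output"
    for path in entry["key"]
}
```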
Descriptive Metadata
DataCite Metadata (object): All datasets can be described using metadata in compliance with the DataCite metadata format. This metadata captures high-level information about the dataset, such as its authors, associated institutions, and licenses. The SDK provides helper functions that make it easier to match the DataCite schema.
Keys (object): Key objects provide a mapping that allows Foundry to read data from the underlying data structure into usable Python objects. Key objects have the following properties:
- key (str): A name mapping to a column name (e.g., for CSV files) or a key within a data structure (e.g., for HDF5 files).
- type (str): The type of key this entry represents. Currently supported types are ["input", "target"].
- units (str) [optional]: The scientific units associated with a key. Default: None.
- description (str) [optional]: A free-text description of the key. Default: None.
- labels (list of str) [optional]: A list of strings mapped to integers in a key column.
short_name (str): A unique name associated with this dataset, used to make loading easier.
type (str): A hint to Foundry on how to map the keys into loading operations. Options: ["tabular", "hdf5"].
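The Key object properties listed above can be modeled as a small data class, which makes the optional fields and their defaults explicit. This is an illustrative sketch, not the SDK's own class. It stores `key` as a list of names to follow the JSON example, and it assumes that the label at index i corresponds to the integer i in the column; the label strings `class_0` and `class_1` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Key:
    """Illustrative model of a Key object; not the Foundry SDK's class."""

    key: List[str]                      # column name(s) or path(s)
    type: str                           # e.g. "input" or "target"
    units: Optional[str] = None         # optional, defaults to None
    description: Optional[str] = None   # optional, defaults to None
    labels: Optional[List[str]] = None  # optional list of label strings

    def decode(self, value):
        """Translate an integer code into its label, if labels are defined.

        Assumes the label at index i corresponds to the integer i.
        """
        if self.labels is None:
            return value
        return self.labels[value]


# Hypothetical labels for the integer-coded material_type column.
material_type = Key(
    key=["material_type"],
    type="input",
    labels=["class_0", "class_1"],
)
band_gap = Key(key=["band_gap"], type="target", units="eV")

decoded = material_type.decode(1)
unchanged = band_gap.decode(1.40)
```

Keys without labels pass values through unchanged, so the same `decode` call can be applied to every column.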