Describing Datasets
An introduction to how Foundry datasets and their metadata are defined
Foundry datasets consist of two key components: data and descriptive metadata. To make the data easily consumable, the data (consisting of files) should be assembled following the supported structures. The metadata description tracks high-level information (e.g., authors, associated institutions, licenses, data location) as well as information on how to operate on the dataset (e.g., how to load the data, training/test splits).
Data
Example - Record-Based Data
Tabular Data
For tabular data, columns should represent the different keys of the data, and rows the individual records.
Supported tabular data types currently include .json, .jsonl, and .csv.
In this example, we showcase how to describe a JSON record-based dataset where each record is a valid JSON object in a JSON list or a line in a JSON line delimited file.
feature_1    feature_2    material_type    band_gap
0.10         0.52         1                1.40
0.34         0.910        0                0.73
...          ...          ...              ...
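For the JSON-lines form described above, each row of this table would be one JSON object per line. A minimal sketch of parsing a single record (the field values come from the first row of the example table):

```python
import json

# One record of the example dataset, serialized as a single
# JSON-lines entry: each line is a standalone JSON object.
line = '{"feature_1": 0.10, "feature_2": 0.52, "material_type": 1, "band_gap": 1.40}'
record = json.loads(line)
```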
For this example dataset, the Key Python objects could be:
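A minimal sketch of the Key objects as Python dictionaries, one per column of the example table. The units and labels values are illustrative assumptions, not part of the example data:

```python
# Key objects for the tabular example above.
# `units` and `labels` values are illustrative assumptions.
keys = [
    {"key": "feature_1", "type": "input"},
    {"key": "feature_2", "type": "input"},
    {"key": "material_type", "type": "input", "labels": ["class_0", "class_1"]},
    {"key": "band_gap", "type": "target", "units": "eV"},
]
```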
Hierarchical Data
Foundry also supports data from hierarchical data formats (e.g., HDF5). In this case, features and outputs can be represented with / notation. For example, if the features of a dataset are located in arrays stored at /data/arr1 and /other_data/arr2 while the outputs are in /data/labeled, the keys list may contain the following Key objects:
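A sketch of such a keys list, using the paths above:

```python
# Key objects addressing arrays inside an HDF5 file via / path notation.
keys = [
    {"key": "/data/arr1", "type": "input"},
    {"key": "/other_data/arr2", "type": "input"},
    {"key": "/data/labeled", "type": "target"},
]
```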
and the dataset may be described in whole as:
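A minimal sketch of a whole-dataset description for this HDF5 case. The short_name, split path, and n_items values here are assumptions for illustration:

```python
# Hypothetical full description of the HDF5 dataset; `short_name`,
# the split `path`, and `n_items` are assumed values.
dataset_metadata = {
    "short_name": "example_hdf5",
    "data_type": "hdf5",
    "n_items": 1000,
    "splits": [
        {"type": "train", "path": "foundry_dataset.h5", "label": "Training set"},
    ],
    "keys": [
        {"key": "/data/arr1", "type": "input"},
        {"key": "/other_data/arr2", "type": "input"},
        {"key": "/data/labeled", "type": "target"},
    ],
}
```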
Descriptive Metadata
Metadata in Foundry comprehensively describe datasets using a combination of the DataCite metadata schema and our own Foundry schema.
Metadata schema
DataCite Metadata (object): All datasets can be described using metadata in compliance with the DataCite metadata format. This metadata captures core elements needed for dataset citation and discovery, such as author names, institutions, associated abstracts, and more. Many of these capabilities have helper functions in the SDK to make it easier to match the DataCite schema.
For a full list of the metadata keys in DataCite, see their Metadata Schema 4.4 documentation.
Keys list (list of objects) [required]:
Key objects provide a mapping that allows Foundry to read data from the underlying data structure into usable Python objects. Key objects have the following properties:
key (str) [required]:
A name mapping to a column name (e.g., for csv files) or a key within a data structure (e.g., for HDF5 files)
type (str) [required]:
The type of key this entry represents. Currently supported types are ["input", "target"]
units (str) [optional]:
The scientific units associated with a key. Default: None
description (str) [optional]:
A free-text description of the key. Default: None
labels (list of str) [optional]:
A list of strings mapped to integers in a key column
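For instance, a key whose integer column encodes categories could carry labels that name those categories. The label names below are illustrative assumptions:

```python
# A Key whose column stores integers 0/1; `labels` maps those integers
# to human-readable names. The names here are hypothetical.
material_type_key = {
    "key": "material_type",
    "type": "input",
    "description": "integer-coded category of the material",
    "labels": ["metal", "insulator"],  # 0 -> "metal", 1 -> "insulator" (assumed)
}
```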
splits (list of objects) [required]:
split objects provide a convenient way to specify various data splits, or subsets, of the dataset. split objects have the following properties:
type (str) [required]:
The type of split (e.g., "train", "test", "validation")
path (str) [required]:
A path to the file or folder containing the split data
label (str) [optional]:
A descriptive name for the split, if required. Default: None
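A sketch of a splits list for a tabular dataset; the file paths are assumptions for illustration:

```python
# Hypothetical splits for a tabular dataset; the `path` values are assumed.
splits = [
    {"type": "train", "path": "train.csv", "label": "Training set"},
    {"type": "test", "path": "test.csv", "label": "Test set"},
]
```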
short_name (str) [required]:
A unique, human-readable name associated with this dataset that makes loading and finding the dataset simple.
data_type (str) [required]:
The type of data provided. This gives Foundry a hint on how to map the keys into loading operations. Options: ["tabular", "hdf5"]
task_type (str) [optional]:
The type of process or analytical task the dataset is meant for. For example, "classification", "supervised", etc.
domain (str) [optional]:
The science domain of the dataset. For example, "materials science".
n_items (number) [optional]:
The number of items within the dataset.
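Putting the pieces together, a complete description of the tabular example dataset might look like the following sketch. The short_name, task_type, n_items, and split path values are assumptions for illustration:

```python
# Hypothetical complete description of the tabular example dataset.
# `short_name`, `task_type`, `n_items`, and the split `path` are assumed.
dataset_metadata = {
    "short_name": "example_band_gaps",
    "data_type": "tabular",
    "task_type": "regression",
    "domain": "materials science",
    "n_items": 100,
    "splits": [
        {"type": "train", "path": "train.csv", "label": "Training set"},
    ],
    "keys": [
        {"key": "feature_1", "type": "input"},
        {"key": "feature_2", "type": "input"},
        {"key": "material_type", "type": "input"},
        {"key": "band_gap", "type": "target", "units": "eV"},
    ],
}
```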
Example usage
For loading datasets, see Getting Started - Loading Data.
For a full example of exploring data using Foundry, see our Jupyter notebook examples.