Publishing Datasets
Information on how to publish datasets
To be published, a dataset must 1) adhere to one of the specified Foundry dataset shapes (see Shaping Datasets), and 2) be described with the required information (see Describing Datasets). Together, the dataset shape and description make it easier for researchers to reuse the dataset.
Examples
Skip to the publication example notebook.
Shaping Datasets
For a general dataset to be translated into a usable Foundry dataset, it should follow one of the prescribed shapes. It should also be described by a Key object, which provides a mapping that allows Foundry to read data from the underlying data structure into usable Python objects (see Describing Datasets for more info).
Tabular Data
Tabular data should be in a form where columns represent the different keys of the data and rows represent individual entries.
feature_1    feature_2    material_type    band_gap
0.10         0.52         1                1.40
0.34         0.910        0                0.73
...          ...          ...              ...
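As a rough illustration of this shape, the sketch below parses the example table from CSV text using only the Python standard library. The CSV text and variable names are hypothetical, and Foundry's own loading code is not shown here.

```python
import csv
import io

# Hypothetical CSV text holding the example table above.
csv_text = """feature_1,feature_2,material_type,band_gap
0.10,0.52,1,1.40
0.34,0.910,0,0.73
"""

reader = csv.DictReader(io.StringIO(csv_text))
columns = reader.fieldnames  # one key per column
rows = [
    {name: float(value) for name, value in row.items()}
    for row in reader  # one dict per entry (row)
]

print(columns)
print(rows[0]["band_gap"])
```

Each element of `rows` is one entry, and each name in `columns` is a candidate key for the dataset description.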
For this example dataset the keys list could be:
"keys": [{
    "key": "feature_1",
    "type": "input",
    "units": None,
    "description": "This is feature 1"
},{
    "key": "feature_2",
    "type": "input",
    "units": None,
    "description": "This is feature 2"
},{
    "key": "material_type",
    "type": "input",
    "units": None,
    "description": "This is the material type",
    "labels": ["perovskite", "not perovskite"]
},{
    "key": "band_gap",
    "type": "target",
    "units": "eV",
    "description": "This is the simulated band gap in eV"
}]
Hierarchical Data
Foundry also supports data from hierarchical data formats (e.g., HDF5). In this case features and outputs can be represented with / notation. For example, if the features of a dataset are located in an array stored in /data/arr1 and /other_data/arr2 while the outputs are in /data/band_gaps, the Key object would be:
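A sketch of what such a keys list might look like is given below, following the same pattern as the tabular example. The units and descriptions are placeholders, and the `resolve` helper and nested dictionary are hypothetical stand-ins for an HDF5 file, included only to show how the `/` notation addresses nested data.

```python
# Keys list for the hierarchical example; the key names come from the paths
# in the text, while units and descriptions are placeholder assumptions.
keys = [
    {"key": "/data/arr1", "type": "input", "units": None,
     "description": "Feature array 1"},
    {"key": "/other_data/arr2", "type": "input", "units": None,
     "description": "Feature array 2"},
    {"key": "/data/band_gaps", "type": "target", "units": "eV",
     "description": "Band gaps"},
]

# Nested dict used as a stand-in for an HDF5 file (hypothetical values).
fake_file = {
    "data": {"arr1": [0.10, 0.34], "band_gaps": [1.40, 0.73]},
    "other_data": {"arr2": [0.52, 0.910]},
}


def resolve(tree, path):
    """Follow a /-separated path through nested dicts (hypothetical helper)."""
    node = tree
    for part in path.strip("/").split("/"):
        node = node[part]
    return node


# Group the resolved arrays by key type.
inputs = {k["key"]: resolve(fake_file, k["key"]) for k in keys if k["type"] == "input"}
targets = {k["key"]: resolve(fake_file, k["key"]) for k in keys if k["type"] == "target"}
```

The same key names would address groups and datasets in a real HDF5 file; the nested dictionary is used here only so the example runs without extra dependencies.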
Describing Datasets
DataCite Metadata (object): All datasets can be described using metadata that complies with the DataCite metadata format. This metadata captures bibliographic information about the dataset, such as its title, creators, publisher, and publication year. Many of these fields have helper functions in the SDK that make it easier to match the DataCite schema.
Keys (list[Key]): Key objects provide a mapping that allows Foundry to read data from the underlying data structure into usable Python objects. Individual Key objects have the following properties:
    key (str): A name mapping to a column name (e.g., for CSV files) or a key within a data structure (e.g., for HDF5 files)
    type (str): The type of key this entry represents. Currently supported types are ["input", "target"]
    units (str) [optional]: The scientific units associated with a key. Default: None
    description (str) [optional]: A free-text description of the key. Default: None
    labels (list[str]) [optional]: A list of strings mapped to integers in a key column
Splits (list[Split]): Split objects provide a way for users to specify which data should be included as test, train, or other user-defined splits. Individual Split objects have the following properties:
    type (str): A name mapping to a column name (e.g., for CSV files) or a key within a data structure (e.g., for HDF5 files)
    path (str): The full filepath to the dataset file or directory that contains the split
    label (str): A label to assign to this split
short_name (str): A unique name associated with this dataset, used to make loading easier.
type (str): A hint to Foundry on how to map the keys into loading operations. Options: ["tabular", "hdf5"]
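Putting the pieces together, a metadata structure for the tabular example might be assembled as in the sketch below. The top-level layout, the split values, the short name, and the `check_metadata` helper are all assumptions made for illustration; they are not taken from the Foundry SDK, so check them against the SDK before publishing.

```python
# Hypothetical metadata for the tabular example; field names follow the
# properties listed above, but the overall layout is an assumption.
metadata = {
    "keys": [
        {"key": "feature_1", "type": "input"},
        {"key": "feature_2", "type": "input"},
        {"key": "material_type", "type": "input",
         "labels": ["perovskite", "not perovskite"]},
        {"key": "band_gap", "type": "target", "units": "eV"},
    ],
    "splits": [
        # Placeholder split entries; types, paths, and labels are made up.
        {"type": "train", "path": "train.csv", "label": "train"},
        {"type": "test", "path": "test.csv", "label": "test"},
    ],
    "short_name": "example_band_gaps",
    "type": "tabular",
}

ALLOWED_KEY_TYPES = {"input", "target"}
ALLOWED_DATASET_TYPES = {"tabular", "hdf5"}


def check_metadata(md):
    """Return a list of problems found (hypothetical helper, not SDK code)."""
    problems = []
    if md.get("type") not in ALLOWED_DATASET_TYPES:
        problems.append(f"unknown dataset type: {md.get('type')!r}")
    if not md.get("short_name"):
        problems.append("short_name is missing")
    for key in md.get("keys", []):
        if key.get("type") not in ALLOWED_KEY_TYPES:
            problems.append(f"key {key.get('key')!r} has unsupported type")
    for split in md.get("splits", []):
        for field in ("type", "path", "label"):
            if field not in split:
                problems.append(f"split is missing {field!r}")
    return problems


print(check_metadata(metadata))
```

A local check like this only catches simple mistakes, such as a misspelled key type, before a publication is attempted; it does not replace any validation performed during publication.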
Publishing
Once your dataset is in the proper shape, and you have created the associated metadata structure, you can publish to Foundry! An example is shown below.
Currently, you can publish any dataset you have stored on a Globus endpoint or Google Drive. In the following, assume your previously defined metadata are stored in metadata:
The publish() method returns a result object that you can inspect for information about the state of the publication. For the above publication, res would have the format:
Future Work
Add support for wildcard key type specifications
Add link to example publication