Publishing Datasets
Information on how to publish datasets
In order to publish datasets, the datasets must 1) adhere to specified Foundry dataset shapes (see here), and 2) be described with required information (see here). Together, the dataset shape and description enable researchers to reuse the datasets more easily.
When a dataset is published, you will receive a Digital Object Identifier (DOI) so that the research artifact can be cited.
Jupyter Notebook Publishing Guide
We created a notebook that walks you through the publication process. Just fill in the notebook with your data's details and you can publish right from there.
Skip to the publication guide notebook.
Shaping Datasets
For a general dataset to be translated into a usable Foundry dataset, it should follow one of the prescribed shapes. It should also be described by a set of Key objects, which provide a description of the data and a mapping that allows Foundry to read data into usable Python objects (see Describing Datasets for more info).
Tabular Data
Tabular data should be in a form where columns represent the different keys of the data and rows represent individual entries. For example:
| feature_1 | feature_2 | material_type | band_gap |
| --- | --- | --- | --- |
| 0.10 | 0.52 | 1 | 1.40 |
| 0.34 | 0.910 | 0 | 0.73 |
| ... | ... | ... | ... |
For this example dataset, the keys list could be:
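A minimal sketch of such a keys list for the table above, written as plain Python dictionaries that use the Key properties listed under Describing Datasets (key, type, units, description, labels). The descriptions, the eV unit, the label strings, and the reading of labels as "position i names integer value i" are assumptions made for illustration.

```python
# Illustrative keys list for the example table above.
# Descriptions, units, and label strings are assumptions made for this sketch.
keys = [
    {"key": "feature_1", "type": "input", "units": None,
     "description": "First input feature"},
    {"key": "feature_2", "type": "input", "units": None,
     "description": "Second input feature"},
    {"key": "material_type", "type": "input", "units": None,
     "description": "Material class encoded as an integer",
     # Assumed convention: labels[i] names the integer value i in the column.
     "labels": ["class_0", "class_1"]},
    {"key": "band_gap", "type": "target", "units": "eV",
     "description": "Band gap of the material"},
]

# Group column names by the role each key declares.
input_keys = [k["key"] for k in keys if k["type"] == "input"]
target_keys = [k["key"] for k in keys if k["type"] == "target"]
print(input_keys)
print(target_keys)
```

Grouping by type shows how the same list tells a loader which columns are model inputs and which are targets.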
Don't forget to specify the tabular data filename and path in the submitted metadata. This can be done in a split; see the section on Describing Datasets.
Hierarchical Data
Foundry also supports data from hierarchical data formats (e.g., HDF5). In this case, features and outputs can be represented with / notation. For example, if the features of a dataset are located in arrays stored at /data/arr1 and /other_data/arr2 while the outputs are in /data/band_gaps, the Key objects would be:
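A minimal sketch of the keys for this hierarchical layout, using the three paths named above. To keep the example free of dependencies, a nested dictionary stands in for the HDF5 file and a small hypothetical helper, resolve, walks the / path; the array values and descriptions are placeholders.

```python
from functools import reduce

# Keys that address arrays by their '/' path inside the hierarchical file.
keys = [
    {"key": "/data/arr1", "type": "input", "description": "Feature array 1"},
    {"key": "/other_data/arr2", "type": "input", "description": "Feature array 2"},
    {"key": "/data/band_gaps", "type": "target", "units": "eV",
     "description": "Band gaps"},
]

# Stand-in for an HDF5 file: nested dicts play the role of groups,
# lists play the role of arrays. Values are placeholders.
fake_file = {
    "data": {"arr1": [0.10, 0.34], "band_gaps": [1.40, 0.73]},
    "other_data": {"arr2": [0.52, 0.91]},
}


def resolve(tree, path):
    """Walk a '/'-separated path through nested dicts (hypothetical helper)."""
    parts = [p for p in path.split("/") if p]
    return reduce(lambda node, part: node[part], parts, tree)


inputs = {k["key"]: resolve(fake_file, k["key"]) for k in keys if k["type"] == "input"}
targets = {k["key"]: resolve(fake_file, k["key"]) for k in keys if k["type"] == "target"}
print(sorted(inputs))
print(targets)
```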
Describing Datasets
DataCite Metadata (object): All datasets can be described using metadata in compliance with the DataCite metadata format. This metadata captures . Many of these capabilities have helper functions in the SDK, to make it easier to match the DataCite schema
Keys (list[Key]): Key objects provide a mapping that allows Foundry to read data from the underlying data structure into usable Python objects. Individual Key objects have the following properties:
  key (str) [required]: A name mapping to a column name (e.g., for CSV files) or key within a data structure (e.g., for HDF5 files)
  type (str) [required]: The type of key this entry represents. Currently supported types are ["input", "target"]
  units (str) [optional]: The scientific units associated with a key. Default: None
  description (str) [optional]: A free text description of the key. Default: None
  labels (list[str]) [optional]: A list of strings mapped to integers in a key column
Splits (list[Split]) [required]: Split objects provide a way for users to specify which data should be included as test, train, or other user defined splits. Individual Split objects have the following properties:
  type (str) [required]: A split type, e.g., the Foundry special split types of train, test, and validation. These special split types may be handled differently than custom split types defined by users.
  path (str) [required]: The full filepath to the dataset file or directory that contains the split
  label (str): A label to assign to this split
short_name (str) [required]: Short name is a unique name associated with this dataset to make loading and .
type (str) [required]: The type provides a hint to Foundry on how to map the keys into loading operations. Options ["tabular","hdf5"]
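The required and optional markers above can be turned into a quick pre-submission check. The following is a sketch only: check_metadata is a hypothetical helper, not part of the Foundry SDK, and the top-level field names keys and splits are assumed from the property names in the list above.

```python
KEY_TYPES = ("input", "target")
DATASET_TYPES = ("tabular", "hdf5")


def check_metadata(metadata):
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    # Top-level fields marked [required] in the list above.
    for field in ("splits", "short_name", "type"):
        if field not in metadata:
            problems.append(f"missing required field: {field}")
    if "type" in metadata and metadata["type"] not in DATASET_TYPES:
        problems.append(f"unknown dataset type: {metadata['type']}")

    # Each Key needs 'key' and 'type', and 'type' must be a supported value.
    for i, key in enumerate(metadata.get("keys", [])):
        for field in ("key", "type"):
            if field not in key:
                problems.append(f"keys[{i}] missing required field: {field}")
        if "type" in key and key["type"] not in KEY_TYPES:
            problems.append(f"keys[{i}] has unsupported type: {key['type']}")

    # Each Split needs 'type' and 'path'; 'label' is not marked required.
    for i, split in enumerate(metadata.get("splits", [])):
        for field in ("type", "path"):
            if field not in split:
                problems.append(f"splits[{i}] missing required field: {field}")

    return problems


example = {
    "short_name": "example_dataset",  # hypothetical name
    "type": "tabular",
    "keys": [{"key": "band_gap", "type": "target", "units": "eV"}],
    "splits": [{"type": "train", "path": "train.csv"}],  # hypothetical path
}
print(check_metadata(example))
```

Returning a list of problems rather than raising on the first one lets you fix everything in a single pass before submitting.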
Publishing
Before continuing, be sure that you have 1) signed up for a free Globus account and 2) joined this Globus group.
Once your dataset is in the proper shape, and you have created the associated metadata structure, you can publish to Foundry! One example of a complete set of metadata to describe a dataset is shown below.
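As a hedged illustration, a metadata structure for the tabular example might look like the sketch below. The short name, file paths, descriptions, and the top-level field names keys and splits are assumptions for this example, and the DataCite portion is omitted because its fields are not listed on this page.

```python
# Illustrative metadata for a tabular dataset; all concrete values are hypothetical.
metadata = {
    "short_name": "example_band_gaps",
    "type": "tabular",
    "keys": [
        {"key": "feature_1", "type": "input", "units": None,
         "description": "First input feature"},
        {"key": "feature_2", "type": "input", "units": None,
         "description": "Second input feature"},
        {"key": "band_gap", "type": "target", "units": "eV",
         "description": "Band gap of the material"},
    ],
    "splits": [
        # 'train' and 'test' are two of the special split types named above.
        {"type": "train", "path": "train.csv", "label": "train"},
        {"type": "test", "path": "test.csv", "label": "test"},
    ],
}

split_types = [s["type"] for s in metadata["splits"]]
print(split_types)
```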
Currently, you can publish any dataset you have stored on a Globus endpoint or Google Drive. In the following, assume your previously defined metadata are stored in metadata:
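A sketch of what the submission could look like over HTTPS. Only publish_dataset() and the https_data_path argument are named on this page; the no-argument Foundry() constructor, the positional order of metadata, title, and authors, and the wrapper function publish_over_https are assumptions, so check them against the SDK version you have installed.

```python
def publish_over_https(metadata, title, authors, https_data_path):
    """Submit a dataset for publication (sketch; needs the foundry package and a Globus login)."""
    # Imported lazily so that this sketch can be loaded without the SDK installed.
    from foundry import Foundry

    f = Foundry()  # assumed: default constructor triggers the Globus login flow
    # Assumed argument order; https_data_path is the keyword named in this guide.
    res = f.publish_dataset(metadata, title, authors, https_data_path=https_data_path)
    return res


# Hypothetical usage (not executed here, since it contacts external services):
# res = publish_over_https(
#     metadata,
#     title="Example band gap dataset",
#     authors=["A. Author", "B. Author"],
#     https_data_path="<HTTPS URL of your dataset>",
# )
```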
The publish_dataset() method returns a result object that you can inspect for information about the state of the publication. For the above publication, res would have the format:
Note that for large datasets, or for datasets that you would like to upload faster than HTTPS allows, you can create a Globus Transfer. Instead of specifying https_data_path, use globus_data_source:
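One way to keep the two options from being mixed up is a small helper that accepts exactly one of them and returns the matching keyword argument. This is a sketch: data_location is a hypothetical helper, and treating the two arguments as mutually exclusive is this example's reading of "instead of" above, not documented SDK behavior.

```python
def data_location(https_data_path=None, globus_data_source=None):
    """Return the single keyword argument that says where the data lives."""
    # Exactly one of the two locations must be given.
    if (https_data_path is None) == (globus_data_source is None):
        raise ValueError("provide exactly one of https_data_path or globus_data_source")
    if globus_data_source is not None:
        return {"globus_data_source": globus_data_source}
    return {"https_data_path": https_data_path}


# Placeholder value; substitute the Globus source URL of your own dataset.
location = data_location(globus_data_source="<Globus source URL of your dataset>")
print(location)

# Hypothetical usage with the SDK (argument order assumed, not executed here):
# res = f.publish_dataset(metadata, title, authors, **location)
```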
More information about how to get the Globus source URL to the dataset you would like to publish can be found in our Publishing Guide.
Once the dataset is submitted, there is a manual curation step required to maintain dataset standards. This will take additional time.
Future Work
Add support for wildcard key type specifications
Add example showing how to describe an image-containing folder