Publishing Datasets
Information on how to publish datasets
To publish a dataset, it must 1) adhere to one of the specified Foundry dataset shapes (see below) and 2) be described with the required information (see Describing Datasets). Together, the dataset shape and description enable researchers to reuse the dataset more easily.
When a dataset is published, you will receive a Digital Object Identifier (DOI) to enable citation of the research artifact.
We created a notebook that walks you through the publication process. Just fill in the notebook with your data's details and you can publish right from there.
Skip to the publication guide notebook.
For a general dataset to be translated into a usable Foundry dataset, it should follow one of the prescribed shapes. It should also be described by a set of `Key` objects, which provide a description of the data and a mapping that allows Foundry to read the data into usable Python objects (see Describing Datasets for more info).
Tabular data should be in a form where columns represent the different keys of the data and rows represent individual entries. For example:

| feature_1 | feature_2 | material_type | band_gap |
| --------- | --------- | ------------- | -------- |
| 0.10      | 0.52      | 1             | 1.40     |
| 0.34      | 0.910     | 0             | 0.73     |
| ...       | ...       | ...           | ...      |
For this example dataset, the `keys` list could be:
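A sketch of what the `keys` list might look like for this table (units, descriptions, and the label mapping are illustrative assumptions, not part of the original example):

```python
keys = [
    {"key": "feature_1", "type": "input",
     "description": "first input feature"},           # illustrative description
    {"key": "feature_2", "type": "input",
     "description": "second input feature"},          # illustrative description
    {"key": "material_type", "type": "input",
     "labels": ["metal", "insulator"],                # assumed 0/1 label mapping
     "description": "material class encoded as an integer"},
    {"key": "band_gap", "type": "target", "units": "eV",  # assumed units
     "description": "band gap of the material"},
]
```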
Don't forget to specify the tabular data filename and path in the submitted metadata. This can be done in a split; see the section on Describing Datasets.
Foundry also supports data from hierarchical data formats (e.g., HDF5). In this case, features and outputs can be represented with `/` notation. For example, if the features of a dataset are located in arrays stored at `/data/arr1` and `/other_data/arr2` while the outputs are in `/data/band_gaps`, the `Key` objects would be:
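A sketch following the `Key` properties described in Describing Datasets below (the paths come from the example above; units and descriptions are illustrative):

```python
keys = [
    {"key": "/data/arr1", "type": "input",
     "description": "first feature array"},
    {"key": "/other_data/arr2", "type": "input",
     "description": "second feature array"},
    {"key": "/data/band_gaps", "type": "target", "units": "eV",  # assumed units
     "description": "band gap values"},
]
```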
DataCite Metadata (object): All datasets are described using metadata in compliance with the DataCite metadata format. This metadata meets basic standards and adheres to a uniform, consistent schema. The SDK provides helper functions for many of these fields to make it easier to match the DataCite schema.
Keys (list[Key]): `Key` objects provide a mapping that allows Foundry to read data from the underlying data structure into usable Python objects. Individual `Key` objects have the following properties:

* `key` (str) [required]: A name mapping to a column name (e.g., for CSV files) or a key within a data structure (e.g., for HDF5 files)
* `type` (str) [required]: The type of key this entry represents. Currently supported types are ["input", "target"]
* `units` (str) [optional]: The scientific units associated with a key. Default: None
* `description` (str) [optional]: A free-text description of the key. Default: None
* `labels` (list[str]) [optional]: A list of strings mapped to integers in a key column
Splits (list[Split]) [required]: `Split` objects provide a way for users to specify which data should be included as test, train, or other user-defined splits. Individual `Split` objects have the following properties:

* `type` (str) [required]: A split type, e.g., the Foundry special split types of `train`, `test`, and `validation`. These special split types may be handled differently than custom split types defined by users.
* `path` (str) [required]: The full filepath to the dataset file or directory that contains the split
* `label` (str): A label to assign to this split
short_name (str) [required]: Short name is a unique name associated with this dataset to make loading and referencing it easier.

type (str) [required]: The type provides a hint to Foundry on how to map the keys into loading operations. Options: ["tabular", "hdf5"]
Before continuing, be sure that you have 1) signed up for a free Globus account and 2) joined this Globus group.
Once your dataset is in the proper shape, and you have created the associated metadata structure, you can publish to Foundry! One example of a complete set of metadata to describe a dataset is shown below.
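A sketch of such a complete metadata structure, assembled from the fields described above (all values are illustrative, and exact field names should be verified against the current Foundry schema):

```python
# Illustrative sketch only: values and some field names are assumptions
# based on the descriptions above, not a canonical record.
datacite_data = {
    "titles": [{"title": "Example Band Gap Dataset"}],
    "creators": [{"creatorName": "Researcher, Example"}],
    "publisher": "Materials Data Facility",
    "publicationYear": "2024",
    "resourceType": {"resourceTypeGeneral": "Dataset"},
}

metadata = {
    "keys": [
        {"key": "feature_1", "type": "input"},
        {"key": "feature_2", "type": "input"},
        {"key": "material_type", "type": "input",
         "labels": ["metal", "insulator"]},               # assumed label mapping
        {"key": "band_gap", "type": "target", "units": "eV"},  # assumed units
    ],
    "splits": [
        {"type": "train", "path": "train.csv", "label": "train"},
        {"type": "test", "path": "test.csv", "label": "test"},
    ],
    "short_name": "example_band_gaps",
    "type": "tabular",
}
```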
Now that we have the metadata and DataCite information contained in the JSON objects we created above, we can create an instance of a `FoundryDataset` object. This serves as a container to hold and organize all of the data as well as the metadata for the dataset. We just need one additional bit of information: a dataset name that we can use to reference the dataset.
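A minimal sketch of the instantiation, assuming the constructor accepts the dataset name and the two metadata objects (the import path and keyword argument names here are assumptions; consult the SDK documentation):

```python
from foundry.foundry_dataset import FoundryDataset  # import path may vary by version

dataset = FoundryDataset(
    dataset_name="example_band_gaps",  # the name used to reference the dataset
    datacite_entry=datacite_data,      # DataCite metadata (assumed keyword name)
    foundry_schema=metadata,           # keys/splits/short_name/type (assumed keyword name)
)
```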
If your metadata is correct, you'll receive a confirmation message.
Now that we have a `FoundryDataset` object instantiated, we need to give it some data! To do so, we call the `add_data()` method on the `FoundryDataset` instance and pass it a keyword argument `local_data_path` that references a path to a local file or folder containing the data we want to publish.
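For example (the path below is a placeholder):

```python
# local_data_path points at the local file or folder holding the dataset;
# this particular path is a placeholder.
dataset.add_data(local_data_path="./data/example_band_gaps/")
```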
To publish, we call the `publish_dataset()` method on our `foundry` object and provide it with our `FoundryDataset`.
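A sketch, assuming `f` is an authenticated `Foundry` client (constructor options are an assumption; see the SDK docs):

```python
from foundry import Foundry

f = Foundry()  # authenticates via Globus; options vary by SDK version
res = f.publish_dataset(dataset)
```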
We can inspect the entire `res` object for more detailed information about the publication, including the `source_id` we use below to check the submission status.
Note that for large datasets, or for datasets that you would like to upload faster than HTTPS allows, you can create a Globus Transfer. Instead of specifying `https_data_path`, use `globus_data_source`:
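A sketch, assuming `globus_data_source` is passed to `add_data()` in place of the local path (the URL below is a placeholder):

```python
# The Globus source URL is a placeholder; see the Publishing Guide for how
# to obtain the real URL for your data.
dataset.add_data(
    globus_data_source="https://app.globus.org/file-manager?origin_id=<endpoint_id>&origin_path=/path/to/data/"
)
```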
More information about how to get the Globus source URL to the dataset you would like to publish can be found in our Publishing Guide.
Once the dataset is submitted, there is a manual curation step required to maintain dataset standards. This will take additional time.
When you submit your dataset, it will be transferred to the Foundry data store, and any necessary metadata will be extracted to make your dataset searchable.
It then enters a human curation phase, where our team checks that the contents are safe and properly formatted. Please note that curation can take up to a day, or longer around weekends and holidays.
We can use the `source_id` of the `res` result to check the status of our submission. The `source_id` is a unique identifier based on the title and version of your data package.
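A sketch, assuming `source_id` is a field of `res` and that the client exposes a `check_status` helper as in MDF Connect (both are assumptions; check the SDK):

```python
source_id = res["source_id"]  # exact location within res may differ
f.check_status(source_id)     # assumed helper that reports submission status
```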
To re-publish a dataset, pass `update=True` to `f.publish_dataset()`:
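For example, reusing the `f` and `dataset` objects from above:

```python
# update=True tells Foundry this is a new version of an existing dataset
res = f.publish_dataset(dataset, update=True)
```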