Publishing Datasets

Information on how to publish datasets

To publish a dataset, it must 1) adhere to one of the specified Foundry dataset shapes (see Shaping Datasets below), and 2) be described with the required metadata (see Describing Datasets below). Together, the dataset shape and description enable researchers to reuse the dataset more easily.

When a dataset is published, you will receive a Digital Object Identifier (DOI) to enable citation of the research artifact.

Jupyter Notebook Publishing Guide

We created a notebook that walks you through the publication process. Just fill in the notebook with your data's details and you can publish right from there.

Skip to the publication guide notebook.

Shaping Datasets

For a general dataset to be translated into a usable Foundry dataset, it should follow one of the prescribed shapes. It should also be described by a set of Key objects, which provide a description of the data and a mapping that allows Foundry to read the data into usable Python objects (see Describing Datasets for more info).

Tabular Data

Tabular data should be in a form where columns represent the different keys of the data and rows represent individual entries. For example:

| feature_1 | feature_2 | material_type | band_gap |
| --------- | --------- | ------------- | -------- |
| 0.10      | 0.52      | 1             | 1.40     |
| 0.34      | 0.910     | 0             | 0.73     |
| ...       | ...       | ...           | ...      |
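As a sketch of what this shape means in practice, the following uses Python's standard csv module (not Foundry itself) on an in-memory copy of the example table; the data values are the illustrative ones from the table above:

```python
import csv
import io

# In-memory CSV matching the example table above
csv_text = """feature_1,feature_2,material_type,band_gap
0.10,0.52,1,1.40
0.34,0.910,0,0.73
"""

# Each column header becomes a dataset key; each row is one entry
rows = list(csv.DictReader(io.StringIO(csv_text)))

print(list(rows[0].keys()))  # the column names, i.e., the candidate keys
print(rows[0]["band_gap"])   # one value of the "band_gap" target for entry 0
```

Columns named in the file are exactly the names you will list in the `keys` metadata below.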

For this example dataset, the keys list could be:

"keys": [{
	"key": ["feature_1"],
	"type": "input",
	"units": "",
	"description": "This is feature 1"
},{
	"key": ["feature_2"],
	"type": "input",
	"units": "",
	"description": "This is feature 2"
},{
	"key": ["material_type"],
	"type": "input",
	"description": "This is the material type",
	"classes": ["perovskite", "not perovskite"]
},{
	"key": ["band_gap"],
	"type": "target",
	"units": "eV",
	"description": "This is the simulated band gap in eV"
}]

Don't forget to specify the tabular data filename and path in the submitted metadata. This is done in a split; see the section on Describing Datasets.

Hierarchical Data

Foundry also supports data from hierarchical data formats (e.g., HDF5). In this case, features and outputs can be represented with / notation. For example, if the features of a dataset are located in arrays stored in /data/arr1 and /other_data/arr2 while the outputs are in /data/band_gaps, the keys list would be:

"keys": [{
	"key": ["/data/arr1"],
	"type": "input",
	"units": null,
	"description": "This is an array containing input data"
}, {
	"key": ["/other_data/arr2"],
	"type": "input",
	"units": null,
	"description": "This is another array containing input data"
}, {
	"key": ["/data/band_gaps"],
	"type": "target",
	"units": "eV",
	"description": "This is the simulated band gap in eV"
}]
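To illustrate how the slash-delimited paths address a hierarchy, here is a small self-contained sketch in which a nested dict stands in for the HDF5 file (Foundry itself reads real HDF5 files; the `resolve` helper is hypothetical, for illustration only):

```python
# A nested dict standing in for the HDF5 hierarchy described above
hdf5_like = {
    "data": {"arr1": [0.1, 0.2], "band_gaps": [1.4, 0.9]},
    "other_data": {"arr2": [0.5, 0.7]},
}

def resolve(store, key):
    """Walk a '/'-separated key such as '/data/arr1' down the hierarchy."""
    node = store
    for part in key.strip("/").split("/"):
        node = node[part]
    return node

print(resolve(hdf5_like, "/data/band_gaps"))  # [1.4, 0.9]
print(resolve(hdf5_like, "/other_data/arr2"))  # [0.5, 0.7]
```

Each `key` entry in the metadata is such a path, so the loader can find every input and target array inside the file.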

Describing Datasets

DataCite Metadata (object): All datasets can be described using metadata in compliance with the DataCite metadata format. This metadata captures bibliographic information such as titles, authors, publisher, and publication year. Many of these capabilities have helper functions in the SDK to make it easier to match the DataCite schema.

Keys (list[Key]): Key objects provide a mapping that allows Foundry to read data from the underlying data structure into usable Python objects. Individual Key objects have the following properties:

  • key (str) [required]: A name mapping to a column name (e.g., for CSV files) or a key within a data structure (e.g., for HDF5 files)

  • type (str) [required]: The type of key this entry represents. Currently supported types are ["input", "target"]

  • units (str) [optional]: The scientific units associated with a key. Default: None

  • description (str) [optional]: A free-text description of the key. Default: None

  • labels (list[str]) [optional]: A list of strings mapped to integers in a key column

# An example keys object

"keys":[{
    "key": ["band_gaps"],
    "type": "target",
    "units": "eV",
    "description": "This is the simulated band gap in eV"
}]

Splits (list[Split]) [required]: Split objects provide a way for users to specify which data should be included as test, train, or other user-defined splits. Individual Split objects have the following properties:

  • type (str) [required]: A split type, e.g., the Foundry special split types of train, test, and validation. These special split types may be handled differently than custom split types defined by users.

  • path (str) [required]: The full filepath to the dataset file or directory that contains the split

  • label (str) [optional]: A label to assign to this split

"splits": [{
    "type": "train",
    "path": "g4mp2_data.json", # Specify the filename and path of the source file
    "label": "train"           # A text label for the split
}]

short_name (str) [required]: A short, unique name associated with this dataset to make loading and referencing it easier.

type (str) [required]: The type provides a hint to Foundry on how to map the keys into loading operations. Options: ["tabular", "hdf5"]

Publishing

Before continuing, be sure that you have 1) signed up for a free Globus account and 2) joined this Globus group.

Once your dataset is in the proper shape, and you have created the associated metadata structure, you can publish to Foundry! One example of a complete set of metadata to describe a dataset is shown below.

{
	"splits": [{
		"type": "train",
		"path": "g4mp2_data.json",
		"label": "train"
	}],
	"keys": [{
			"type": "input",
			"key": ["feature_1"],
			"units": "",
			"description": "This is an input"
		},
		{
			"type": "target",
			"key": ["band_gap"],
			"units": "eV",
			"description": "Bandgap of the material"
		}
	],
	"short_name": "my_short_name",
	"data_type": "tabular",
	"task_type": ["supervised"],
	"domain": ["materials science"]
}
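In Python, this metadata is simply a dict with the same structure, which you can build directly before calling the publishing code that follows:

```python
# The example metadata above, expressed as a Python dict
metadata = {
    "splits": [{"type": "train", "path": "g4mp2_data.json", "label": "train"}],
    "keys": [
        {"type": "input", "key": ["feature_1"], "units": "",
         "description": "This is an input"},
        {"type": "target", "key": ["band_gap"], "units": "eV",
         "description": "Bandgap of the material"},
    ],
    "short_name": "my_short_name",
    "data_type": "tabular",
    "task_type": ["supervised"],
    "domain": ["materials science"],
}

print(metadata["short_name"])  # my_short_name
```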

Currently, you can publish any dataset you have stored on a Globus endpoint or Google Drive. In the following, assume your previously defined metadata are stored in metadata:

from foundry import Foundry

# path to the data for HTTPS upload
# NOTE: if uploading via Globus Connect Client, you'll want to specify `globus_data_source` instead 
https_data_path = "~/Documents/data/g4mp2_data"

# full title of dataset
title = "Scourtas example bandgap dataset"

# authors to list 
authors = ["A. Scourtas", "B. Blaiszik"]

# shorthand title (optional)
short_name = "example_AS_bandgap"

# affiliations of authors (optional)
affiliations = ["Globus Labs, UChicago"]

# publisher of the data (optional)
publisher = "Materials Data Facility"

# publication year (optional)
publication_year = 2023


f = Foundry()
res = f.publish_dataset(metadata, title, authors, https_data_path=https_data_path, short_name=short_name)

The publish_dataset() method returns a result object that you can inspect for information about the state of the publication. For the above publication, res would have the format:

{
 'error': None,
 'source_id': '_test_example_bandgap_v1.1',
 'status_code': 202,
 'success': True
}
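Given a result of this shape, a caller might check it along these lines (a minimal sketch against the example values above, not SDK code):

```python
# The example result object shown above
res = {
    "error": None,
    "source_id": "_test_example_bandgap_v1.1",
    "status_code": 202,
    "success": True,
}

# Branch on success; a 202 status means the submission was accepted for processing
if res["success"]:
    print(f"Submitted as {res['source_id']} (HTTP {res['status_code']})")
else:
    print(f"Publication failed: {res['error']}")
```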

Note that for large datasets, or for datasets that you would like to upload faster than HTTPS allows, you can create a Globus Transfer. Instead of specifying https_data_path, use globus_data_source:

# Globus endpoint URL where your dataset is located
globus_data_source = "https://app.globus.org/file-manager?origin_id=e38ee745-6d04-11e5-ba46-22000b92c6ec&origin_path=%2Ffoundry%2F_test_blaiszik_foundry_bandgap_v1.2%2F"

More information about how to get the Globus source URL to the dataset you would like to publish can be found in our Publishing Guide.

Once the dataset is submitted, there is a manual curation step required to maintain dataset standards. This will take additional time.

Future Work

  • Add support for wildcard key type specifications

  • Add example showing how to describe an image-containing folder