Using Datasets
We'll take you from importing Foundry all the way to seeing your data.
We have example notebooks using materials science data that illustrate how to load data or publish datasets using Foundry. These notebooks are compatible with Google Colab and with Jupyter Notebook. However, some of the datasets are quite large and cannot be loaded into Google Colab without the Pro version.
The Foundry client provides access to all of the methods described here for listing, loading, and publishing datasets and models. The code below will create a Foundry client.
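A minimal sketch of creating the client, assuming the package exposes a Foundry class importable from foundry. The create_client helper is a hypothetical wrapper added here for illustration, not part of Foundry itself; any keyword options it accepts are passed straight through to the constructor.

```python
def create_client(**options):
    """Return a Foundry client built with the given keyword options.

    Hypothetical helper: it only forwards its arguments to Foundry().
    """
    # Imported lazily so the package is only required when a client is created.
    from foundry import Foundry

    return Foundry(**options)


# Usage (the first call is expected to start an authentication flow):
# f = create_client()
```

The later examples refer to the resulting client as f, matching the name used in this guide.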
If you are running your script on cloud resources (e.g. Google Colab, Binder), see: Using Foundry on Cloud Computing Resources
To show all available Foundry datasets, you can use the Foundry list() method as follows. The method returns a pandas DataFrame with details on the available datasets.
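As a sketch of the listing step, the hypothetical helper below calls list() on a client and returns only the first few rows. It assumes, as stated above, that list() returns a pandas DataFrame, so head() is available on the result.

```python
def preview_datasets(client, limit=5):
    """Return the first `limit` rows of the dataset listing.

    Hypothetical helper; `client` is any object with a list() method
    that returns a DataFrame-like result.
    """
    listing = client.list()  # details on the available datasets
    return listing.head(limit)


# Usage, with f being a Foundry client:
# preview_datasets(f, limit=10)
```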
The Foundry client can access a dataset by its source_id or by its digital object identifier (DOI), for example "foundry_wei_atom_locating_benchmark" (a source_id) or "10.18126/e73h-3w6n" (a DOI). You can retrieve the source_id from the list() method.
The load() method remotely loads the metadata (e.g., data location, data keys, etc.) and downloads the data to local storage if it is not already cached. Data can be downloaded via HTTPS without additional setup (set download to True and globus to False) or, more efficiently, with a Globus endpoint set up on your machine (set download to False and globus to True).
All datasets are accessible via HTTPS and Globus by authenticated or anonymous download. Using the load function, simply set globus=True to use Globus and globus=False to use HTTPS.
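The choice between HTTPS and Globus can be sketched as a small wrapper around load(). The load_dataset helper and its use_globus parameter are hypothetical names introduced for illustration; only the load() method and its globus argument come from the text above.

```python
def load_dataset(client, identifier, use_globus=False):
    """Load a dataset by source_id or DOI, choosing the transfer method.

    use_globus=False requests an HTTPS download;
    use_globus=True requests a Globus transfer.
    """
    return client.load(identifier, globus=use_globus)


# Usage, with f being a Foundry client:
# f = load_dataset(f, "10.18126/e73h-3w6n")                    # HTTPS
# f = load_dataset(f, "10.18126/e73h-3w6n", use_globus=True)   # Globus
```

Defaulting to HTTPS keeps the sketch usable on machines without a Globus endpoint, at the cost of slower transfers for large datasets.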
The image below is what f looks like when printed in a notebook. This table contains the dataset's metadata.
Once the data are available locally, access them with the load_data() method. load_data() lets you load data from a specific split defined for the dataset; here we use train.
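A sketch of picking one split out of the loaded data. It assumes that load_data() returns a mapping keyed by split name and that each value is an (inputs, targets) pair; that return shape is an assumption and may differ between Foundry versions. The get_split helper is a hypothetical name.

```python
def get_split(client, split="train"):
    """Return the data stored under one split of the loaded dataset.

    Assumes client.load_data() returns a mapping keyed by split name.
    """
    data = client.load_data()
    if split not in data:
        raise KeyError(f"split {split!r} not found; available: {sorted(data)}")
    return data[split]


# Usage, with f being a client on which load() has already been called:
# X, y = get_split(f, "train")
```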
Foundry works with common cloud computing providers (e.g., the NSF-sponsored Jetstream and Google Colab). On these resources, simply add the following arguments to use a cloud-compatible authentication flow.
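A minimal sketch of a cloud-friendly client. The argument names no_browser and no_local_server are an assumption based on the Foundry constructor as commonly documented; check them against the version you have installed. The create_cloud_client helper is hypothetical.

```python
# Assumed constructor options: skip opening a local browser and skip the
# local callback server, neither of which is available on hosted notebooks.
CLOUD_AUTH_OPTIONS = {"no_browser": True, "no_local_server": True}


def create_cloud_client():
    """Return a Foundry client configured for hosted notebook environments."""
    # Imported lazily so the package is only required when a client is created.
    from foundry import Foundry

    return Foundry(**CLOUD_AUTH_OPTIONS)


# Usage on Google Colab or similar:
# f = create_cloud_client()
```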
When downloading data on these resources, set globus=False in the load call to download via HTTPS.
This method may be slow for large datasets and for datasets with many files.