HuggingFace Integration

Export Foundry datasets to HuggingFace Hub to increase visibility and enable discovery by the broader ML community.

Installation

pip install foundry-ml[huggingface]

Quick Start

Python API

from foundry import Foundry
from foundry.integrations.huggingface import push_to_hub

# Get a dataset
f = Foundry()
dataset = f.search("band gap", limit=1).iloc[0].FoundryDataset

# Export to HuggingFace Hub
url = push_to_hub(
    dataset,
    repo_id="your-username/dataset-name",
    token="hf_xxxxx"  # Or set HF_TOKEN env var
)
print(f"Published at: {url}")

CLI
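The CLI invocation was not captured here. As a sketch only — the `foundry` subcommand and flag names below are assumptions mirroring the Python API, not confirmed CLI syntax:

```shell
# Hypothetical CLI invocation mirroring push_to_hub's parameters
foundry huggingface push \
    --repo-id your-username/dataset-name \
    --token hf_xxxxx   # or set HF_TOKEN instead
```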

What Gets Created

When you export a dataset, Foundry creates:

1. Data Files

The dataset is converted to HuggingFace's format (Parquet/Arrow), with all splits preserved.
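Because the data lands in the standard Hub format, an exported dataset can be read back with the `datasets` library. A sketch, with a placeholder repository name:

```python
from datasets import load_dataset

# Load the exported dataset from the Hub; split names match the Foundry splits
ds = load_dataset("your-username/dataset-name")
print(ds)
```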

2. Dataset Card (README.md)

A comprehensive README is auto-generated from the Foundry metadata.

3. Metadata

HuggingFace-compatible metadata including:

  • License information

  • Task categories

  • Tags for discoverability

  • Size information
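As an illustration, the generated card's YAML front matter might look like the following. Field values here are placeholders; the actual fields depend on the dataset's Foundry metadata:

```yaml
---
license: cc-by-4.0
task_categories:
  - tabular-regression
tags:
  - materials-science
  - foundry
size_categories:
  - 10K<n<100K
---
```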

API Reference

push_to_hub

Parameters:

Parameter  Type            Required  Description
---------  --------------  --------  -----------
dataset    FoundryDataset  Yes       Dataset from Foundry
repo_id    str             Yes       HuggingFace repository ID
token      str             No        API token (uses cached if not provided)
private    bool            No        Create private repository
split      str             No        Export specific split only

Returns: URL of the created dataset
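Putting the parameter table together, the implied call signature can be sketched as follows; the default values are assumptions inferred from the "Required" column:

```python
# Sketch of the signature implied by the parameter table; defaults are assumptions
def push_to_hub(
    dataset,                # FoundryDataset to export (required)
    repo_id: str,           # e.g. "your-username/dataset-name" (required)
    token: str = None,      # API token; falls back to cached login / HF_TOKEN
    private: bool = False,  # create the repository as private
    split: str = None,      # export only this split; None exports all splits
) -> str:
    """Export a Foundry dataset to the HuggingFace Hub and return its URL."""
    ...
```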

Author Attribution

Important: The authors listed on HuggingFace come from the original DataCite metadata, not the person pushing. This preserves proper scientific attribution.

Examples

Export All Splits
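A minimal end-to-end sketch; the repository name is a placeholder. Omitting `split` exports every split:

```python
from foundry import Foundry
from foundry.integrations.huggingface import push_to_hub

f = Foundry()
dataset = f.search("band gap", limit=1).iloc[0].FoundryDataset

# No `split` argument: all splits are exported to one repository
url = push_to_hub(dataset, repo_id="your-username/dataset-name")
```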

Export Single Split
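To export only one split, pass the `split` parameter; a sketch with a placeholder repository name:

```python
from foundry import Foundry
from foundry.integrations.huggingface import push_to_hub

f = Foundry()
dataset = f.search("band gap", limit=1).iloc[0].FoundryDataset

# Export only the training split
url = push_to_hub(dataset, repo_id="your-username/dataset-name", split="train")
```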

Private Repository
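Passing `private=True` creates the repository as private on the Hub; a sketch with a placeholder repository name:

```python
from foundry import Foundry
from foundry.integrations.huggingface import push_to_hub

f = Foundry()
dataset = f.search("band gap", limit=1).iloc[0].FoundryDataset

# The repository is created private; only you (and collaborators) can see it
url = push_to_hub(dataset, repo_id="your-username/dataset-name", private=True)
```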

Using Environment Variable
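With `HF_TOKEN` set in the environment, the `token` argument can be omitted entirely:

```shell
# Set the token once in your shell; push_to_hub reads HF_TOKEN automatically
export HF_TOKEN=hf_xxxxx
```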

CLI Options
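The original options listing is missing here. Assuming the CLI mirrors the Python parameters one-to-one, the flags might look like the following — hypothetical, not confirmed syntax:

```shell
# Hypothetical flags mirroring push_to_hub's parameters
foundry huggingface push \
    --repo-id your-username/dataset-name \
    --token hf_xxxxx \
    --private \
    --split train
```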

Best Practices

Repository Naming

Use descriptive, lowercase names with hyphens:

  • Good: materials-science/oqmd-band-gaps

  • Bad: my_dataset_v1
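As a quick sanity check, the convention above (lowercase, hyphens, a namespace/name pair) can be expressed as a regex. This helper is purely illustrative and not part of the Foundry API:

```python
import re

# namespace/name, each part starting alphanumeric, lowercase + digits + hyphens
REPO_ID_RE = re.compile(r"^[a-z0-9][a-z0-9-]*/[a-z0-9][a-z0-9-]*$")

def is_good_repo_id(repo_id: str) -> bool:
    """Return True if repo_id follows the lowercase-with-hyphens convention."""
    return bool(REPO_ID_RE.match(repo_id))

print(is_good_repo_id("materials-science/oqmd-band-gaps"))  # True
print(is_good_repo_id("my_dataset_v1"))  # False: underscores, no namespace
```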

Organization

Consider creating an organization for your lab/group:

  • your-lab/dataset-1

  • your-lab/dataset-2

Documentation

The auto-generated README is a starting point. Consider adding:

  • More detailed description

  • Example usage code

  • Related papers

  • Acknowledgments

Troubleshooting

Authentication Failed
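If authentication fails, re-authenticate or supply the token explicitly; `huggingface-cli login` is the standard Hub login command:

```shell
# Re-authenticate with the Hub (stores a cached token)
huggingface-cli login

# Or, for non-interactive environments, provide the token via HF_TOKEN
export HF_TOKEN=hf_xxxxx
```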

Repository Already Exists

HuggingFace won't overwrite existing repos by default. Either:

  1. Use a different repo name

  2. Delete the existing repo first

  3. Use the HuggingFace web interface to update

Large Datasets

For very large datasets (>10GB), the upload may take time. Consider:

  • Exporting specific splits: split="train"

  • Using a stable internet connection

  • Running in a cloud environment
