# HuggingFace Integration
Export Foundry datasets to HuggingFace Hub to increase visibility and enable discovery by the broader ML community.
## Installation

```bash
pip install foundry-ml[huggingface]
```

## Quick Start

### Python API
```python
from foundry import Foundry
from foundry.integrations.huggingface import push_to_hub

# Get a dataset
f = Foundry()
dataset = f.search("band gap", limit=1).iloc[0].FoundryDataset

# Export to HuggingFace Hub
url = push_to_hub(
    dataset,
    repo_id="your-username/dataset-name",
    token="hf_xxxxx",  # Or set the HF_TOKEN env var
)
print(f"Published at: {url}")
```

### CLI
## What Gets Created
When you export a dataset, Foundry creates:
### 1. Data Files

The dataset is converted to HuggingFace's format (Parquet/Arrow) with all splits preserved:
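Once published, the files can be loaded back with the standard `datasets` library; a minimal sketch, where the repo ID is a placeholder:

```python
from datasets import load_dataset

# Each Foundry split becomes a split in the Hub dataset
ds = load_dataset("your-username/dataset-name")
print(ds)  # DatasetDict showing the preserved splits
```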
### 2. Dataset Card (README.md)

A comprehensive README is auto-generated from the Foundry metadata:
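If you want to inspect the generated card programmatically, the `huggingface_hub` client can fetch it. This sketch assumes the dataset has already been pushed and uses a placeholder repo ID:

```python
from huggingface_hub import DatasetCard

# Download and parse the auto-generated README.md from the Hub
card = DatasetCard.load("your-username/dataset-name")
print(card.data)  # structured metadata (license, tags, ...)
print(card.text)  # the README body
```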
### 3. Metadata

HuggingFace-compatible metadata, including:

- License information
- Task categories
- Tags for discoverability
- Size information
## API Reference

### push_to_hub

**Parameters:**

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `dataset` | `FoundryDataset` | Yes | Dataset from Foundry |
| `repo_id` | `str` | Yes | HuggingFace repository ID |
| `token` | `str` | No | API token (uses cached if not provided) |
| `private` | `bool` | No | Create private repository |
| `split` | `str` | No | Export specific split only |

**Returns:** URL of the created dataset
## Author Attribution

**Important:** The authors listed on HuggingFace come from the original DataCite metadata, not from the person pushing the dataset. This preserves proper scientific attribution.
## Examples
### Export All Splits
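By default, every split in the Foundry dataset is exported. A minimal sketch, reusing the `dataset` object from the Quick Start (the repo ID is a placeholder):

```python
# All splits (e.g. train/test) are exported when `split` is omitted
url = push_to_hub(dataset, repo_id="your-username/oqmd-band-gaps")
```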
### Export Single Split
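Passing `split` restricts the export to a single split, per the parameter table above; the split name here is illustrative:

```python
# Export only the training split
url = push_to_hub(
    dataset,
    repo_id="your-username/oqmd-band-gaps-train",
    split="train",
)
```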
### Private Repository
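Setting `private=True` creates the repository as private, so only you and collaborators you add can see it. A sketch with a placeholder repo ID:

```python
# The repo can be made public later from the Hub settings page
url = push_to_hub(
    dataset,
    repo_id="your-username/internal-dataset",
    private=True,
)
```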
### Using Environment Variable
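With the `HF_TOKEN` environment variable set (as noted in the Quick Start), the `token` argument can be omitted. A sketch:

```python
import os

# Normally you would export HF_TOKEN in your shell instead
os.environ["HF_TOKEN"] = "hf_xxxxx"

url = push_to_hub(dataset, repo_id="your-username/dataset-name")
```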
## CLI Options
## Best Practices
### Repository Naming

Use descriptive, lowercase names with hyphens:

- Good: `materials-science/oqmd-band-gaps`
- Bad: `my_dataset_v1`
### Organization

Consider creating an organization for your lab/group:

```
your-lab/dataset-1
your-lab/dataset-2
```
### Documentation

The auto-generated README is a starting point. Consider adding:

- A more detailed description
- Example usage code
- Related papers
- Acknowledgments
## Troubleshooting
### Authentication Failed
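If pushes fail with an authentication error, check that a valid token is available, either via `HF_TOKEN` or a cached login. One way to cache a token, using the `huggingface_hub` client library:

```python
from huggingface_hub import login

# Stores the token locally so later pushes can reuse it
login(token="hf_xxxxx")
```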
### Repository Already Exists

HuggingFace won't overwrite existing repos by default. Either:

- Use a different repo name
- Delete the existing repo first (see the sketch after this list)
- Use the HuggingFace web interface to update
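If you do need to delete the existing repository programmatically, `huggingface_hub` provides `delete_repo`; note that deletion is irreversible. The repo ID below is a placeholder:

```python
from huggingface_hub import delete_repo

# Permanently removes the dataset repo from the Hub
delete_repo(repo_id="your-username/dataset-name", repo_type="dataset")
```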
### Large Datasets

For very large datasets (>10GB), the upload may take time. Consider:

- Exporting specific splits: `split="train"`
- Using a stable internet connection
- Running in a cloud environment