Overview
Foundry-ML is a Python library that simplifies access to machine learning-ready datasets in materials science and chemistry. It provides a unified interface to discover, load, and use curated scientific datasets.
What is Foundry?
Foundry serves as a bridge between data producers (researchers who create datasets) and data consumers (researchers who use datasets for ML). It standardizes how datasets are:
Discovered - Search by keyword, browse catalogs, or get by DOI
Described - Rich metadata including field descriptions, units, and citations
Delivered - Automatic download, caching, and format conversion
Key Features
For Data Users
from foundry import Foundry
# Connect and search
f = Foundry()
results = f.search("band gap", limit=5)
# Load a dataset
dataset = results.iloc[0].FoundryDataset
X, y = dataset.get_as_dict()['train']
# Understand the data
schema = dataset.get_schema()
print(schema['fields']) # What columns exist and what they meanFor AI Agents
Foundry includes an MCP (Model Context Protocol) server that enables AI assistants like Claude to discover and use datasets programmatically:
For Data Publishers
Share your datasets with the community using standardized metadata:
Architecture
Core Concepts
Datasets
A Foundry dataset contains:
Data files - The actual data (JSON, CSV, HDF5, etc.)
Schema - Description of fields, types, and splits
Metadata - DataCite-compliant citation information
Splits
Datasets are organized into splits (e.g., train, test, validation) with input/target pairs:
Keys (Fields)
Each field in a dataset has:
Name - The column/field identifier
Type -
inputortargetDescription - What the field represents
Units - Physical units (if applicable)
Ecosystem Integration
Foundry integrates with the broader ML ecosystem:
PyTorch
dataset.get_as_torch()
TensorFlow
dataset.get_as_tensorflow()
HuggingFace Hub
Export datasets for broader visibility
MCP Server
AI agent access
CLI
Terminal-based workflows
Next Steps
Installation - Get Foundry installed
Quick Start - Load your first dataset in 5 minutes
Searching for Datasets - Find the right data
Last updated
Was this helpful?