Foundry-ML is a Python library that simplifies access to machine learning-ready datasets in materials science and chemistry. It provides a unified interface to discover, load, and use curated scientific datasets.
What is Foundry?
Foundry serves as a bridge between data producers (researchers who create datasets) and data consumers (researchers who use datasets for ML). It standardizes how datasets are:
Discovered - Search by keyword, browse catalogs, or get by DOI
Described - Rich metadata including field descriptions, units, and citations
Delivered - Automatic download, caching, and format conversion
Key Features
For Data Users
    from foundry import Foundry

    # Connect and search
    f = Foundry()
    results = f.search("band gap", limit=5)

    # Load a dataset
    dataset = results.iloc[0].FoundryDataset
    X, y = dataset.get_as_dict()['train']

    # Understand the data
    schema = dataset.get_schema()
    print(schema['fields'])  # What columns exist and what they mean
For AI Agents
Foundry includes an MCP (Model Context Protocol) server that lets AI assistants such as Claude discover and use datasets programmatically.
For Data Publishers
Share your datasets with the community using standardized metadata.
Architecture
Core Concepts
Datasets
A Foundry dataset contains:
Data files - The actual data (JSON, CSV, HDF5, etc.)
Schema - Description of fields, types, and splits
Metadata - DataCite-compliant citation information
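To make the three parts concrete, here is a minimal sketch of how such a record could be modeled in plain Python. It does not use the Foundry API: the dictionary layout, the key names, the `REQUIRED_CITATION_FIELDS` tuple, and the `missing_citation_fields` helper are all assumptions made for illustration, loosely following DataCite's mandatory properties. The DOI and field values are placeholders, not real records.

```python
# Hypothetical record layout; not Foundry's actual internal representation.
dataset_record = {
    # Data files: the actual data
    "data_files": ["example_data.json"],
    # Schema: description of fields, types, and splits
    "schema": {
        "fields": [
            {"name": "composition", "role": "input", "type": "string",
             "units": None, "description": "Chemical formula"},
            {"name": "band_gap", "role": "target", "type": "float",
             "units": "eV", "description": "Band gap of the material"},
        ],
        "splits": ["train", "test"],
    },
    # Metadata: citation information (placeholder values)
    "metadata": {
        "identifier": "10.0000/placeholder-doi",
        "creators": ["Example, Author"],
        "titles": ["Example band gap dataset"],
        "publisher": "Example Publisher",
        "publicationYear": 2024,
        "resourceType": "Dataset",
    },
}

# Assumed key names for the properties DataCite treats as mandatory.
REQUIRED_CITATION_FIELDS = (
    "identifier", "creators", "titles",
    "publisher", "publicationYear", "resourceType",
)


def missing_citation_fields(metadata):
    """Return the required citation fields that are absent or empty."""
    return [key for key in REQUIRED_CITATION_FIELDS if not metadata.get(key)]


if __name__ == "__main__":
    print(missing_citation_fields(dataset_record["metadata"]))
```

A check like this is one way a publisher might catch incomplete citation information before sharing a dataset; the real validation rules are whatever Foundry and DataCite define.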
Splits
Datasets are organized into splits (e.g., train, test, validation) with input/target pairs:
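The split layout can be pictured as a mapping from split name to an (inputs, targets) pair, which matches the `X, y = ...['train']` unpacking shown earlier. The sketch below uses plain Python lists and made-up field names (`composition`, `band_gap`); the exact container types Foundry returns are not specified here, and the `split_sizes` helper is hypothetical.

```python
# Hypothetical example data: each split maps to an (inputs, targets) pair.
splits = {
    "train": (
        {"composition": ["Si", "GaAs", "Cu"]},   # inputs
        {"band_gap": [1.1, 1.4, 0.0]},           # targets
    ),
    "test": (
        {"composition": ["Ge", "ZnO"]},
        {"band_gap": [0.7, 3.3]},
    ),
}


def split_sizes(splits):
    """Return the number of samples per split, checking inputs and targets align."""
    sizes = {}
    for name, (inputs, targets) in splits.items():
        # Collect the length of every input and target column in this split.
        lengths = {len(column) for column in list(inputs.values()) + list(targets.values())}
        if len(lengths) != 1:
            raise ValueError(f"split {name!r} has columns of unequal length")
        sizes[name] = lengths.pop()
    return sizes


if __name__ == "__main__":
    X, y = splits["train"]  # same unpacking pattern as a loaded dataset
    print(split_sizes(splits))
```

Keeping inputs and targets as separate objects within each split makes it hard to leak target columns into model features by accident.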