Dataset Schemas

Schemas describe what data a dataset contains, helping you understand before you load.

Getting the Schema

from foundry import Foundry

f = Foundry()
dataset = f.search("band gap", limit=1).iloc[0].FoundryDataset

schema = dataset.get_schema()

Schema Structure

{
    'name': 'foundry_oqmd_band_gaps_v1.1',
    'title': 'OQMD Band Gaps Dataset',
    'doi': '10.18126/abc123',
    'description': 'Band gaps calculated using DFT...',
    'data_type': 'tabular',
    'fields': [...],
    'splits': [...]
}

Fields

Fields describe each column/feature in the dataset:

for field in schema['fields']:
    print(f"Name: {field['name']}")
    print(f"Role: {field['role']}")  # 'input' or 'target'
    print(f"Description: {field['description']}")
    print(f"Units: {field['units']}")
    print("---")

Example output:

Name: composition
Role: input
Description: Chemical composition formula
Units: None
---
Name: band_gap
Role: target
Description: DFT-calculated band gap
Units: eV
---

Splits

Splits show how data is divided:

for split in schema['splits']:
    print(f"{split['name']}: {split.get('type', 'data')}")

Example output:

train: train
test: test

Data Types

The data_type field indicates the format:

Type

Description

tabular

Rows and columns (most common)

hierarchical

Nested/tree structure

image

Image data

Using Schema Information

Filter by Field Role

input_fields = [f for f in schema['fields'] if f['role'] == 'input']
target_fields = [f for f in schema['fields'] if f['role'] == 'target']

print(f"Inputs: {[f['name'] for f in input_fields]}")
print(f"Targets: {[f['name'] for f in target_fields]}")

Check Units

for field in schema['fields']:
    if field['units']:
        print(f"{field['name']}: {field['units']}")

Include Schema with Data

result = dataset.get_as_dict(include_schema=True)

data = result['data']
schema = result['schema']

# Now you have both together
X, y = data['train']
print(f"Loading {schema['name']}...")

CLI Schema

foundry schema 10.18126/abc123

Output:

Dataset: foundry_oqmd_band_gaps_v1.1
Title: OQMD Band Gaps Dataset
DOI: 10.18126/abc123
Data Type: tabular

Fields:
  [input ] composition: Chemical composition formula
  [target] band_gap: DFT-calculated band gap (eV)

Splits:
  - train
  - test

Best Practices

Always Check Schema First

# Before loading (no download)
schema = dataset.get_schema()
print(f"Fields: {len(schema['fields'])}")
print(f"Splits: {[s['name'] for s in schema['splits']]}")

# If it looks right, load
data = dataset.get_as_dict()

Validate Data Against Schema

schema = dataset.get_schema()
data = dataset.get_as_dict()

X, y = data['train']

input_names = [f['name'] for f in schema['fields'] if f['role'] == 'input']
for name in input_names:
    if name not in X:
        print(f"Warning: {name} not in data")

Document Your Usage

schema = dataset.get_schema()
print(f"""
Using dataset: {schema['name']}
DOI: {schema['doi']}

Features used:
{chr(10).join(f"- {f['name']}: {f['description']}" for f in schema['fields'] if f['role'] == 'input')}

Target:
{chr(10).join(f"- {f['name']}: {f['description']}" for f in schema['fields'] if f['role'] == 'target')}
""")

PreviousUsing with ML Frameworks NextCommand Line Interface

Last updated 2 months ago

Was this helpful?

hashtagGetting the Schema

hashtagSchema Structure

hashtagFields

hashtagSplits

hashtagData Types

hashtagUsing Schema Information

hashtagFilter by Field Role

hashtagCheck Units

hashtagInclude Schema with Data

hashtagCLI Schema

hashtagBest Practices

hashtagAlways Check Schema First

hashtagValidate Data Against Schema

hashtagDocument Your Usage