foundry.foundry_cache

module foundry.foundry_cache


class FoundryCache

The FoundryCache manages the local storage of FoundryDataset objects.

method __init__

__init__(forge_client: Forge, transfer_client: Any, local_cache_dir: str = None)

Initializes a FoundryCache object.

Args:

  • forge_client (Forge): The Forge client object.

  • transfer_client (Any): The transfer client object.

  • local_cache_dir (str, optional): The local cache directory. Defaults to None. If not specified, falls back to the environment variable 'FOUNDRY_LOCAL_CACHE_DIR' if it is set, and otherwise to './data/'.
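The fallback described for local_cache_dir can be sketched as below. This is an illustration only, not the library's code: `resolve_cache_dir` is a hypothetical helper, and the precedence (explicit argument, then environment variable, then './data/') is an assumption read off the description above.

```python
import os
from typing import Optional


def resolve_cache_dir(local_cache_dir: Optional[str] = None) -> str:
    """Hypothetical helper: pick the cache directory in the documented order."""
    if local_cache_dir is not None:
        # An explicit argument wins over everything else (assumed precedence).
        return local_cache_dir
    # Otherwise use the environment variable if set, else the './data/' default.
    return os.environ.get("FOUNDRY_LOCAL_CACHE_DIR", "./data/")
```

Setting FOUNDRY_LOCAL_CACHE_DIR once in the shell is a convenient way to share a cache location across scripts without passing the argument each time.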


method clear_cache

clear_cache(dataset_name: str = None)

Deletes locally stored datasets: either a single named dataset or all of them.

Arguments:

  • dataset_name (str, optional): Name of a specific dataset to delete. If omitted, all datasets will be erased. Defaults to None.
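A minimal sketch of the two deletion modes follows. It is not the library's implementation: `clear_cache_sketch` is a hypothetical function, and it assumes each dataset lives in a subdirectory of the cache directory named after the dataset.

```python
import shutil
from pathlib import Path
from typing import Optional


def clear_cache_sketch(cache_dir: str, dataset_name: Optional[str] = None) -> None:
    """Hypothetical stand-in for clear_cache, under the layout assumed above."""
    root = Path(cache_dir)
    if dataset_name is not None:
        # Remove only the named dataset's folder; ignore it if already absent.
        shutil.rmtree(root / dataset_name, ignore_errors=True)
        return
    if not root.is_dir():
        return
    # No name given: remove everything inside the cache directory.
    for entry in root.iterdir():
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
```

Because the no-argument form erases every cached dataset, passing a name is the safer default in scripts.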


method download_to_cache

download_to_cache(
    dataset_name: str,
    splits: List[FoundrySplit] = None,
    use_globus: bool = False,
    interval: int = 10,
    parallel_https: int = 4,
    verbose: bool = False,
    transfer_client=None
)

Checks if the data is downloaded, and if not, downloads the data from source to local storage.

Args:

  • dataset_name (str): Name of the dataset (equivalent to source_id in MDF).

  • splits (List[FoundrySplit], optional): List of splits in the dataset. Defaults to None.

  • use_globus (bool, optional): If True, use Globus to download the data; otherwise, try HTTPS. Defaults to False.

  • interval (int, optional): Time to wait between checks of the Globus transfer status. Defaults to 10.

  • parallel_https (int, optional): Number of files to download in parallel if using HTTPS. Defaults to 4.

  • verbose (bool, optional): Produce more debug messages to screen. Defaults to False.

  • transfer_client (Any, optional): The transfer client object. Defaults to None.

Returns:

  • FoundryCache: The FoundryCache object.
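The check-then-download behaviour and the returned cache object can be illustrated with a toy class. `CacheSketch` is a hypothetical stand-in, not the real FoundryCache; it only records which transport would have been used.

```python
class CacheSketch:
    """Hypothetical stand-in that mimics the control flow described above."""

    def __init__(self):
        self.local = set()        # names of datasets already "on disk"
        self.transport_log = []   # transport chosen for each real download

    def download_to_cache(self, dataset_name, use_globus=False):
        if dataset_name in self.local:
            # Already cached: skip the download entirely.
            return self
        # Choose the transport the same way the use_globus flag is documented.
        self.transport_log.append("globus" if use_globus else "https")
        self.local.add(dataset_name)
        # Returning the cache object allows calls to be chained.
        return self
```

Calling the method twice for the same dataset name triggers only one download in this sketch, which is the point of checking the cache first.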


method download_via_globus

download_via_globus(dataset_name: str, interval: int)

Downloads selected dataset over Globus.

Args:

  • dataset_name (str): Name of the dataset (equivalent to source_id in MDF).

  • interval (int): Time to wait between checks of the Globus transfer status.
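The role of interval can be sketched as a plain polling loop. This is an assumption about the general pattern, not the library's code: `wait_for_transfer` and its `is_done` callback are hypothetical, and no Globus calls are made.

```python
import time
from typing import Callable


def wait_for_transfer(
    is_done: Callable[[], bool],
    interval: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll is_done() until it returns True, pausing `interval` between checks.

    Returns the number of status checks performed.
    """
    checks = 1
    while not is_done():
        sleep(interval)  # wait before asking for the status again
        checks += 1
    return checks
```

A larger interval means fewer status requests but a longer delay before a finished transfer is noticed.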


method download_via_http

download_via_http(
    dataset_name: str,
    parallel_https: int,
    verbose: bool,
    transfer_client: Any
)

Downloads selected dataset from MDF over HTTP.

Args:

  • dataset_name (str): Name of the dataset (equivalent to source_id in MDF).

  • parallel_https (int): Number of threads to use for downloading.

  • verbose (bool): Produce more debug messages to screen.

  • transfer_client (Any): The transfer client object.
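The effect of parallel_https can be sketched with a thread pool from the standard library. The names `fetch_all` and `fetch_one` are hypothetical, and whether the library uses `ThreadPoolExecutor` internally is not stated in this reference; the sketch only shows the general technique of downloading several files concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def fetch_all(
    urls: List[str],
    fetch_one: Callable[[str], bytes],
    parallel_https: int = 4,
) -> Dict[str, bytes]:
    """Fetch every URL using at most `parallel_https` worker threads."""
    with ThreadPoolExecutor(max_workers=parallel_https) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch_one, urls)))
```

In practice `fetch_one` would perform the HTTPS request for a single file; here it is left as a parameter so the sketch stays independent of any particular HTTP client.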


method get_keys

get_keys(
    foundry_schema: FoundrySchema,
    type: str = None,
    as_object: bool = False
)

Gets the keys for a Foundry dataset.

Arguments:

  • foundry_schema (FoundrySchema): The schema from MDF that contains the keys

  • type (str, optional): The type of key to be returned, e.g., "input" or "target". Defaults to None.

  • as_object (bool, optional): When False, returns the keys as a list of strings; when True, returns the full key objects. Defaults to False.

Returns:

  • list: String representations of the keys if as_object is False; otherwise, the full key objects.
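The filtering and the two return forms can be illustrated with stand-in types. `KeySketch` and `get_keys_sketch` are hypothetical; in particular, the assumption that each key object carries a list of names in `key` and a category in `type` is made for illustration and is not confirmed by this reference.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KeySketch:
    """Hypothetical key object: a list of column names plus a category."""
    key: List[str]
    type: str


def get_keys_sketch(keys: List[KeySketch], type: Optional[str] = None,
                    as_object: bool = False) -> list:
    # Keep every key when no type is given, otherwise only matching ones.
    selected = [k for k in keys if type is None or k.type == type]
    if as_object:
        return selected
    # Flatten the per-key name lists into one list of strings.
    return [name for k in selected for name in k.key]
```

For example, with one "input" key holding two names and one "target" key holding one name, asking for type="input" yields just the two input names.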


method load_as_dict

load_as_dict(
    split: str,
    dataset_name: str,
    foundry_schema: FoundrySchema,
    use_globus: bool,
    interval: int,
    parallel_https: int,
    verbose: bool,
    transfer_client: Any,
    as_hdf5: bool
)

Load in the data associated with the prescribed dataset.

Args:

  • split (str): Name of the split to load.

  • dataset_name (str): Name of the dataset (equivalent to source_id in MDF).

  • foundry_schema (FoundrySchema, optional): Schema element as obtained from MDF. Defaults to None.

  • use_globus (bool, optional): If True, use Globus to download the data; otherwise, try HTTPS. Defaults to False.

  • interval (int, optional): Time to wait between checks of the Globus transfer status. Defaults to 10.

  • parallel_https (int, optional): Number of files to download in parallel if using HTTPS. Defaults to 4.

  • verbose (bool, optional): Produce more debug messages to screen. Defaults to False.

  • transfer_client (Any, optional): The transfer client object. Defaults to None.

  • as_hdf5 (bool, optional): If True and dataset is in HDF5 format, keep data in HDF5 format. Defaults to False.

Returns:

  • dict: A labeled dictionary of tuples.


method load_as_tensorflow

load_as_tensorflow(
    split: str,
    dataset_name: str,
    foundry_schema: FoundrySchema,
    use_globus: bool,
    interval: int,
    parallel_https: int,
    verbose: bool,
    transfer_client: Any,
    as_hdf5: bool
)

Converts a Foundry dataset to a TensorFlow Sequence.

Arguments:

  • split (str): Split to create the TensorFlow Sequence on. Defaults to None.

Returns:

  • TensorflowSequence: TensorFlow Sequence of all the data from the specified split.


method load_as_torch

load_as_torch(
    split: str,
    dataset_name: str,
    foundry_schema: FoundrySchema,
    use_globus: bool,
    interval: int,
    parallel_https: int,
    verbose: bool,
    transfer_client: Any,
    as_hdf5: bool
)

Converts a Foundry dataset to a PyTorch Dataset.

Arguments:

  • split (str): Split to create the PyTorch Dataset on. Defaults to None.

Returns:

  • TorchDataset: PyTorch Dataset of all the data from the specified split.
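For readers unfamiliar with what a map-style PyTorch dataset provides, the sketch below shows the protocol in plain Python: `__len__` plus index-based `__getitem__`. `PairDatasetSketch` is hypothetical and does not import torch; that each item is an (input, target) pair is an assumption, not something this reference states about TorchDataset.

```python
class PairDatasetSketch:
    """Hypothetical map-style dataset: item i is the pair (inputs[i], targets[i])."""

    def __init__(self, inputs, targets):
        if len(inputs) != len(targets):
            raise ValueError("inputs and targets must have the same length")
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        # Number of samples in the split.
        return len(self.inputs)

    def __getitem__(self, index):
        # Return one sample as an (input, target) pair.
        return self.inputs[index], self.targets[index]
```

An object with these two methods is what a data loader needs in order to index, shuffle, and batch samples.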


method validate_local_dataset_storage

validate_local_dataset_storage(
    dataset_name: str,
    splits: List[FoundrySplit] = None
)

Verifies that the local storage location exists and all expected files are present.

Args:

  • dataset_name (str): Name of the dataset (equivalent to source_id in MDF).

  • splits (List[FoundrySplit], optional): Labels of splits to be loaded. Defaults to None.

Returns:

  • bool: True if the dataset exists and contains all the desired files; False otherwise.
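The check described above can be sketched with pathlib. `validate_storage_sketch` is a hypothetical function; it assumes one subdirectory per dataset and takes the expected file names as a plain list, whereas the real method derives them from FoundrySplit objects in a way this reference does not spell out.

```python
from pathlib import Path
from typing import Iterable, Optional


def validate_storage_sketch(
    cache_dir: str,
    dataset_name: str,
    expected_files: Optional[Iterable[str]] = None,
) -> bool:
    """Return True if the dataset folder exists and holds every expected file."""
    dataset_dir = Path(cache_dir) / dataset_name
    if not dataset_dir.is_dir():
        return False
    # With no expected files, the existence of the folder is sufficient.
    return all((dataset_dir / name).is_file() for name in (expected_files or []))
```

A False result is the signal that the data still needs to be downloaded before it can be loaded.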

