Dataset Containers

MLDatasets.jl contains several reusable data containers for accessing datasets in common storage formats. This feature is a work-in-progress and subject to change.

MLDatasets.FileDataset — Type
FileDataset([loadfn = FileIO.load,] paths)
FileDataset([loadfn = FileIO.load,] dir, pattern = "*", depth = 4)

Wrap a set of file paths as a dataset (traversed in the same order as paths). Alternatively, specify a dir and collect all paths that match a glob pattern (recursively globbing by depth). The glob order determines the traversal order.
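A minimal sketch of both constructors. The directory path and glob pattern here are hypothetical placeholders; `FileDataset` loads each matched file lazily when the observation is accessed.

```julia
using MLDatasets
using FileIO

# Wrap a directory of files matching a glob pattern (hypothetical path).
# Each observation is loaded on access with FileIO.load by default.
dataset = FileDataset("path/to/images", "*.png")

# An explicit list of paths works the same way, traversed in the given order:
dataset = FileDataset(["a.png", "b.png"])

# A custom loading function can be passed as the first argument,
# e.g. reading raw bytes instead of decoding the image:
rawdataset = FileDataset(read, "path/to/images", "*.png")
```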

MLDatasets.CachedDataset — Type
CachedDataset(source, cachesize = numobs(source))
CachedDataset(source, cacheidx = 1:numobs(source))
CachedDataset(source, cacheidx, cache)

Wrap a source data container and cache cachesize samples in memory. This can improve read speeds when source is a lazy data container and system memory is large enough to hold a sizeable chunk of it.

By default the observation indices 1:cachesize are cached. You can manually pass in a set of cacheidx as well.

See also make_cache for customizing the default cache creation for source.
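A short usage sketch, assuming a lazy `FileDataset` as the source (the directory path is a hypothetical placeholder):

```julia
using MLDatasets

# A lazy container: files are only read from disk on access.
files = FileDataset("path/to/images", "*.png")

# Keep the first 100 observations in memory; subsequent reads of
# those indices hit the cache instead of the filesystem.
cached = CachedDataset(files, 100)

# Equivalently, pass the indices to cache explicitly:
cached = CachedDataset(files, 1:100)
```

Indices outside the cached set fall back to reading from `source`, so the cache is transparent to downstream code.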

MLDatasets.make_cache — Function
make_cache(source, cacheidx)

Return an in-memory copy of source at the observation indices cacheidx. Defaults to getobs(source, cacheidx).
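A sketch of customizing the cache for a user-defined container. `LazyRows` and its `getobs` method are hypothetical; the point is that adding a `make_cache` method controls what `CachedDataset` stores in memory for that source type.

```julia
import MLDatasets: make_cache
using MLUtils: getobs

# A hypothetical lazy container whose observations are expensive to load.
struct LazyRows
    paths::Vector{String}
end

# The default is equivalent to:
#   make_cache(source, cacheidx) = getobs(source, cacheidx)

# Override for LazyRows to collect observations one at a time,
# e.g. to avoid materializing a large batch all at once:
function make_cache(source::LazyRows, cacheidx)
    [getobs(source, i) for i in cacheidx]
end
```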
