Dataset Containers
MLDatasets.jl contains several reusable data containers for accessing datasets in common storage formats. This feature is a work-in-progress and subject to change.
MLDatasets.FileDataset — Type

    FileDataset([loadfn = FileIO.load,] paths)
    FileDataset([loadfn = FileIO.load,] dir, pattern = "*", depth = 4)

Wrap a set of file `paths` as a dataset (traversed in the same order as `paths`). Alternatively, specify a `dir` and collect all paths that match a glob `pattern` (recursively globbing by `depth`). The glob order determines the traversal order.
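A minimal usage sketch, assuming a hypothetical directory `data/images` containing PNG files (loadable by the default `FileIO.load`) and a hypothetical `data/texts` directory for the custom-loader variant:

```julia
using MLDatasets, MLUtils

# Collect every PNG under data/images (up to the default glob depth of 4).
images = FileDataset("data/images", "*.png")
numobs(images)           # number of matching files
img = getobs(images, 1)  # loads the first file on demand via FileIO.load

# A custom loader can be passed instead of FileIO.load,
# e.g. to read each file as a raw string:
texts = FileDataset(path -> read(path, String), "data/texts", "*.txt")
```

Nothing is read from disk until `getobs` is called, which is what makes this container lazy.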
MLDatasets.CachedDataset — Type

    CachedDataset(source, cachesize = numobs(source))
    CachedDataset(source, cacheidx = 1:numobs(source))
    CachedDataset(source, cacheidx, cache)
Wrap a `source` data container and cache `cachesize` samples in memory. This can be useful for improving read speeds when `source` is a lazy data container, but your system memory is large enough to store a sizeable chunk of it.

By default the observation indices `1:cachesize` are cached. You can manually pass in a set of `cacheidx` as well.

See also `make_cache` for customizing the default cache creation for `source`.
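A sketch of the intended pattern, again assuming a hypothetical `data/images` directory: wrap a lazy `FileDataset` so that frequently used observations are served from memory instead of disk.

```julia
using MLDatasets, MLUtils

# Lazy container: each getobs hits the filesystem.
files = FileDataset("data/images", "*.png")

# Keep the first 100 observations (indices 1:100) in memory.
cached = CachedDataset(files, 100)

getobs(cached, 1)    # served from the in-memory cache
getobs(cached, 200)  # not cached; falls back to reading from `files`
```

Indices outside `cacheidx` are still valid; they are simply forwarded to the underlying `source`.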
MLDatasets.make_cache — Function

    make_cache(source, cacheidx)

Return an in-memory copy of `source` at observation indices `cacheidx`. Defaults to `getobs(source, cacheidx)`.
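One reason to overload `make_cache` is to transform observations while caching them. The sketch below uses a hypothetical custom container `MyLazyData` and assumes the cache only needs to support `getobs`-style indexing (here, a plain `Vector`); it is an illustration, not the library's canonical extension point usage.

```julia
using MLDatasets, MLUtils

# Hypothetical lazy container: each observation is the contents of one file.
struct MyLazyData
    paths::Vector{String}
end
MLUtils.numobs(d::MyLazyData) = length(d.paths)
MLUtils.getobs(d::MyLazyData, i::Integer) = read(d.paths[i], String)

# Customize caching for this type: pre-process observations as they are
# cached, so cached reads skip the work as well as the disk access.
MLDatasets.make_cache(d::MyLazyData, cacheidx) =
    [uppercase(getobs(d, i)) for i in cacheidx]
```

With this overload in place, `CachedDataset(MyLazyData(paths), n)` would populate its cache via the custom method instead of the default `getobs(source, cacheidx)`.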