MLDatasets.jl's Documentation
This package represents a community effort to provide a common interface for accessing common Machine Learning (ML) datasets. In contrast to other data-related Julia packages, the focus of MLDatasets.jl
is specifically on downloading, unpacking, and accessing benchmark dataset. Functionality for the purpose of data processing or visualization is only provided to a degree that is special to some dataset.
This package is a part of the JuliaML
ecosystem.
Installation
To install MLDatasets.jl
, start up Julia and type the following code snippet into the REPL. It makes use of the native Julia package manager.
Pkg.add("MLDatasets")
Available Datasets
Datasets are grouped into different categories. Click on the links below for a full list of datasets available in each category.
- Graph Datasets - datasets with an underlying graph structure: Cora, PubMed, CiteSeer, ...
- Miscellaneuous Datasets - datasets that do not fall into any of the other categories: Iris, BostonHousing, ...
- Text Datasets - datasets for language models.
- Vision Datasets - vision related datasets such as MNIST, CIFAR10, CIFAR100, ...
Basic Usage
The way MLDatasets.jl
is organized is that each dataset is its own type. Where possible, those types share a common interface (fields and methods).
Once a dataset has been instantiated, e.g. by dataset = MNIST()
, an observation i
can be retrieved using the indexing syntax dataset[i]
. By indexing with no arguments, dataset[:]
, the whole set of observations is collected. The total number of observations is given by length(dataset)
.
For example you can load the training set of the MNIST
database of handwritten digits using the following commands:
julia> using MLDatasets
julia> trainset = MNIST(:train)
dataset MNIST:
metadata => Dict{String, Any} with 3 entries
split => :train
features => 28×28×60000 Array{Float32, 3}
targets => 60000-element Vector{Int64}
julia> length(trainset)
60000
julia> trainset[1] # return first observation as a NamedTuple
(features = Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0],
targets = 5)
julia> X_train, y_train = trainset[:] # return all observations
(features = [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0;;; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0;;; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0;;; … ;;; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0;;; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0;;; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0],
targets = [5, 0, 4, 1, 9, 2, 1, 3, 1, 4 … 9, 2, 9, 5, 1, 8, 3, 5, 6, 8])
julia> summary(X_train)
"28×28×60000 Array{Float32, 3}"
Input features are commonly denoted by features
, while classification labels and regression targets are denoted by targets
.
julia> using MLDatasets, DataFrames
julia> iris = Iris()
dataset Iris:
metadata => Dict{String, Any} with 4 entries
features => 150×4 DataFrame
targets => 150×1 DataFrame
dataframe => 150×5 DataFrame
julia> iris.features
150×4 DataFrame
Row │ sepallength sepalwidth petallength petalwidth
│ Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────
1 │ 5.1 3.5 1.4 0.2
2 │ 4.9 3.0 1.4 0.2
3 │ 4.7 3.2 1.3 0.2
4 │ 4.6 3.1 1.5 0.2
5 │ 5.0 3.6 1.4 0.2
6 │ 5.4 3.9 1.7 0.4
7 │ 4.6 3.4 1.4 0.3
8 │ 5.0 3.4 1.5 0.2
9 │ 4.4 2.9 1.4 0.2
⋮ │ ⋮ ⋮ ⋮ ⋮
142 │ 6.9 3.1 5.1 2.3
143 │ 5.8 2.7 5.1 1.9
144 │ 6.8 3.2 5.9 2.3
145 │ 6.7 3.3 5.7 2.5
146 │ 6.7 3.0 5.2 2.3
147 │ 6.3 2.5 5.0 1.9
148 │ 6.5 3.0 5.2 2.0
149 │ 6.2 3.4 5.4 2.3
150 │ 5.9 3.0 5.1 1.8
132 rows omitted
julia> iris.targets
150×1 DataFrame
Row │ class
│ String15
─────┼────────────────
1 │ Iris-setosa
2 │ Iris-setosa
3 │ Iris-setosa
4 │ Iris-setosa
5 │ Iris-setosa
6 │ Iris-setosa
7 │ Iris-setosa
8 │ Iris-setosa
9 │ Iris-setosa
⋮ │ ⋮
142 │ Iris-virginica
143 │ Iris-virginica
144 │ Iris-virginica
145 │ Iris-virginica
146 │ Iris-virginica
147 │ Iris-virginica
148 │ Iris-virginica
149 │ Iris-virginica
150 │ Iris-virginica
132 rows omitted
MLUtils compatibility
MLDatasets.jl guarantees compatibility with the getobs and numobs interface defined in MLUtils.jl. In practice, applying getobs
and numobs
on datasets is equivalent to applying indexing and length
.
Conditional module loading
MLDatasets.jl relies on many different packages in order to load and process the diverse type of datasets it supports. Most likely, any single user of the library will use a limited subset of these functionalities. In order to reduce the time taken by using MLDatasets
in users' code, we use a lazy import system that defers the import of packages inside MLDatasets.jl as much as possible. For some of the packages, some manual intervention is needed from the user. As an example, the following code will produce an error:
julia> using MLDataset
julia> MNIST(); # fine, MNIST doesn't require DataFrames
julia> Iris() # ERROR: Add `import DataFrames` or `using DataFrames` to your code to unlock this functionality.
We can easily fix the error with an additional import as recommended by the error message:
julia> using MLDataset, DataFrames
julia> Iris()
dataset Iris:
metadata => Dict{String, Any} with 4 entries
features => 150×4 DataFrame
targets => 150×1 DataFrame
dataframe => 150×5 DataFrame
Download location
MLDatasets.jl is built on top of the package DataDeps.jl
. To load the data, the package looks for the necessary files in various locations (see DataDeps.jl
for more information on how to configure such defaults). If the data can't be found in any of those locations, then the package will trigger a download dialog to ~/.julia/datadeps/<DATASETNAME>
. To overwrite this on a case by case basis, it is possible to specify a data directory directly in the dataset constructor (e.g. MNIST(dir = <directory>)
).
In order to download datasets without having to manually confirm the download, you can set to true the following environmental variable:
ENV["DATADEPS_ALWAYS_ACCEPT"] = true