Miscellaneous Datasets
Documentation
MLDatasets.BostonHousing — Type
BostonHousing(; as_df = true, dir = nothing)
The classical Boston Housing tabular dataset.
Sources:
(a) Origin: This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
(b) Creator: Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.
(c) Date: July 7, 1993
Number of Instances: 506
Number of Attributes: 13 continuous attributes (including target attribute "MEDV"), 1 binary-valued attribute.
Arguments
If as_df = true, load the data as dataframes instead of plain arrays.
You can pass a specific dir to choose where the dataset is loaded from or downloaded to; otherwise the default directory is used.
Fields
metadata: A dictionary containing additional information on the dataset.
features: The data features. An array if as_df=false, otherwise a dataframe.
targets: The targets for supervised learning. An array if as_df=false, otherwise a dataframe.
dataframe: A dataframe containing both features and targets. It is nothing if as_df=false, otherwise a dataframe.
Methods
dataset[i]: Return observation(s) i as a named tuple of features and targets.
dataset[:]: Return all observations as a named tuple of features and targets.
length(dataset): Number of observations.
Examples
julia> using MLDatasets: BostonHousing
julia> dataset = BostonHousing()
BostonHousing:
metadata => Dict{String, Any} with 5 entries
features => 506×13 DataFrame
targets => 506×1 DataFrame
dataframe => 506×14 DataFrame
julia> dataset[1:5][1]
5×13 DataFrame
 Row │ CRIM     ZN       INDUS    CHAS   NOX      RM       AGE      DIS      RAD    TAX    PTRATIO  B        LSTAT
     │ Float64  Float64  Float64  Int64  Float64  Float64  Float64  Float64  Int64  Int64  Float64  Float64  Float64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 0.00632     18.0     2.31      0    0.538    6.575     65.2     4.09      1    296     15.3    396.9     4.98
   2 │ 0.02731      0.0     7.07      0    0.469    6.421     78.9   4.9671      2    242     17.8    396.9     9.14
   3 │ 0.02729      0.0     7.07      0    0.469    7.185     61.1   4.9671      2    242     17.8   392.83     4.03
   4 │ 0.03237      0.0     2.18      0    0.458    6.998     45.8   6.0622      3    222     18.7   394.63     2.94
   5 │ 0.06905      0.0     2.18      0    0.458    7.147     54.2   6.0622      3    222     18.7    396.9     5.33
julia> dataset[1:5][2]
5×1 DataFrame
 Row │ MEDV
     │ Float64
─────┼─────────
   1 │    24.0
   2 │    21.6
   3 │    34.7
   4 │    33.4
   5 │    36.2
julia> X, y = BostonHousing(as_df=false)[:]
([0.00632 0.02731 … 0.10959 0.04741; 18.0 0.0 … 0.0 0.0; … ; 396.9 396.9 … 393.45 396.9; 4.98 9.14 … 6.48 7.88], [24.0 21.6 … 22.0 11.9])
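Judging from the output above, as_df=false returns the features as a 13×506 matrix with one column per observation, and the targets as a 1×506 row matrix. The following is a minimal sketch of moving between that layout and the one-row-per-observation layout that many tabular tools expect. It uses a small stand-in matrix built from the first five rows shown above (CRIM, RM and LSTAT only), so it runs without downloading anything; all variable names are illustrative.

```julia
# Stand-in for `X, y = BostonHousing(as_df=false)[:]`, restricted to three
# features (rows: CRIM, RM, LSTAT) and the first five observations (columns).
X = [0.00632 0.02731 0.02729 0.03237 0.06905
     6.575   6.421   7.185   6.998   7.147
     4.98    9.14    4.03    2.94    5.33]
y = [24.0 21.6 34.7 33.4 36.2]   # MEDV as a 1×5 row matrix

# Each column of X is one observation.
nfeatures, nobs = size(X)

# Select observations 2 and 3 by slicing columns.
X23, y23 = X[:, 2:3], y[:, 2:3]

# Convert to one row per observation, with the targets as a plain vector.
Xrows = permutedims(X)   # 5×3 matrix
yvec = vec(y)            # 5-element vector
```

With the real dataset, the stand-in definitions of X and y would be replaced by the `BostonHousing(as_df=false)[:]` call shown above.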
MLDatasets.Iris — Type
Iris(; as_df = true, dir = nothing)
Fisher's classic iris dataset.
Measurements from 3 different species of iris: setosa, versicolor and virginica. There are 50 examples of each species.
There are 4 measurements for each example: sepal length, sepal width, petal length and petal width. The measurements are in centimeters.
The module retrieves the data from the UCI Machine Learning Repository.
NOTE: no pre-defined train-test split for this dataset.
Arguments
If as_df = true, load the data as dataframes instead of plain arrays.
You can pass a specific dir to choose where the dataset is loaded from or downloaded to; otherwise the default directory is used.
Fields
metadata: A dictionary containing additional information on the dataset.
features: The data features. An array if as_df=false, otherwise a dataframe.
targets: The targets for supervised learning. An array if as_df=false, otherwise a dataframe.
dataframe: A dataframe containing both features and targets. It is nothing if as_df=false, otherwise a dataframe.
Methods
dataset[i]: Return observation(s) i as a named tuple of features and targets.
dataset[:]: Return all observations as a named tuple of features and targets.
length(dataset): Number of observations.
Examples
julia> using MLDatasets: Iris
julia> dataset = Iris()
Iris:
metadata => Dict{String, Any} with 4 entries
features => 150×4 DataFrame
targets => 150×1 DataFrame
dataframe => 150×5 DataFrame
julia> dataset[1:2]
(2×4 DataFrame
 Row │ sepallength  sepalwidth  petallength  petalwidth
     │ Float64      Float64     Float64      Float64
─────┼──────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2
   2 │         4.9         3.0          1.4         0.2, 2×1 DataFrame
 Row │ class
     │ String15
─────┼─────────────
   1 │ Iris-setosa
   2 │ Iris-setosa)
julia> X, y = Iris(as_df=false)[:]
([5.1 4.9 … 6.2 5.9; 3.5 3.0 … 3.4 3.0; 1.4 1.4 … 5.4 5.1; 0.2 0.2 … 2.3 1.8], InlineStrings.String15["Iris-setosa" "Iris-setosa" … "Iris-virginica" "Iris-virginica"])
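Since the dataset ships without a train-test split, one has to be made by hand. The sketch below shows one simple way to do it with a random shuffle. `split_observations` is a hypothetical helper, not part of MLDatasets, and the data is a random stand-in with the same layout as the as_df=false output above (4×150 features, 1×150 labels), so the block runs offline.

```julia
using Random

# Hypothetical helper, not part of MLDatasets: shuffle the observations and
# split them into a training and a test part. Expects one column per observation.
function split_observations(X, y; at = 0.8, rng = Random.default_rng())
    n = size(X, 2)
    idx = shuffle(rng, 1:n)
    ntrain = round(Int, at * n)
    train, test = idx[1:ntrain], idx[ntrain+1:end]
    labels = vec(y)   # accepts both a vector and a 1×n row matrix
    return (X[:, train], labels[train]), (X[:, test], labels[test])
end

# Stand-in data laid out like `Iris(as_df=false)[:]`.
X = rand(4, 150)
y = reshape(repeat(["Iris-setosa", "Iris-versicolor", "Iris-virginica"], inner = 50), 1, 150)

(Xtrain, ytrain), (Xtest, ytest) = split_observations(X, y; rng = MersenneTwister(42))
```

This split is purely random and not stratified, so the three species are not guaranteed to appear in equal proportions in each part. Passing a seeded rng makes the split reproducible.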
MLDatasets.Mutagenesis — Type
Mutagenesis(; split=:train, dir=nothing)
Mutagenesis(split; dir=nothing)
The Mutagenesis dataset comprises 188 molecules trialed for mutagenicity on Salmonella typhimurium, available from relational.fit.cvut.cz and CTUAvastLab/datasets.
Set split to :train, :val, :test, or :all to select the training, validation, or test partition, or the whole dataset, respectively. The indexes field in the result contains the indexes of the partition in the full dataset.
Website: https://relational.fit.cvut.cz/dataset/Mutagenesis
License: CC0
julia> using MLDatasets: Mutagenesis
julia> dataset = Mutagenesis(:train)
Mutagenesis dataset:
split : train
indexes : 100-element Vector{Int64}
features : 100-element Vector{Dict{Symbol, Any}}
targets : 100-element Vector{Int64}
julia> dataset[1].features
Dict{Symbol, Any} with 5 entries:
:lumo => -1.246
:inda => 0
:logp => 4.23
:ind1 => 1
:atoms => Dict{Symbol, Any}[Dict(:element=>"c", :bonds=>Dict{Symbol, Any}[Dict(:element=>"c", :bond_type=>7, :charge=>-0.117, :atom_type=>22), Dict(:element=>"h", :bond_type=>1, :charge=>0.142, :atom_type=>3)…
julia> dataset[1].targets
1
julia> dataset = Mutagenesis(:all)
Mutagenesis dataset:
split : all
indexes : 188-element Vector{Int64}
features : 188-element Vector{Dict{Symbol, Any}}
targets : 188-element Vector{Int64}
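Each sample is a nested dictionary rather than a flat feature vector, so it usually has to be traversed before it can be fed to a model. The sketch below walks that structure on a stand-in sample containing only the keys and values visible in the truncated output above; real samples hold many more atoms and may carry keys that are cut off there. `element_counts` and `bond_entries` are hypothetical helpers, not part of MLDatasets.

```julia
# Stand-in sample with the keys visible in the output above.
sample = Dict{Symbol, Any}(
    :lumo => -1.246,
    :inda => 0,
    :logp => 4.23,
    :ind1 => 1,
    :atoms => Dict{Symbol, Any}[
        Dict{Symbol, Any}(
            :element => "c",
            :bonds => Dict{Symbol, Any}[
                Dict{Symbol, Any}(:element => "c", :bond_type => 7, :charge => -0.117, :atom_type => 22),
                Dict{Symbol, Any}(:element => "h", :bond_type => 1, :charge => 0.142, :atom_type => 3)
            ]
        )
    ]
)

# Hypothetical helper: tally how often each element appears among the atoms.
function element_counts(sample)
    counts = Dict{String, Int}()
    for atom in sample[:atoms]
        counts[atom[:element]] = get(counts, atom[:element], 0) + 1
    end
    return counts
end

# Hypothetical helper: total number of bond entries listed across all atoms.
bond_entries(sample) = sum(length(atom[:bonds]) for atom in sample[:atoms]; init = 0)

counts = element_counts(sample)
nbonds = bond_entries(sample)
```

On a loaded dataset the same helpers would be applied to dataset[i].features instead of the stand-in.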
MLDatasets.Titanic — Type
Titanic(; as_df = true, dir = nothing)
The Titanic dataset, describing the survival of passengers on the Titanic ship.
Arguments
If as_df = true, load the data as dataframes instead of plain arrays.
You can pass a specific dir to choose where the dataset is loaded from or downloaded to; otherwise the default directory is used.
Fields
metadata: A dictionary containing additional information on the dataset.
features: The data features. An array if as_df=false, otherwise a dataframe.
targets: The targets for supervised learning. An array if as_df=false, otherwise a dataframe.
dataframe: A dataframe containing both features and targets. It is nothing if as_df=false, otherwise a dataframe.
Methods
dataset[i]: Return observation(s) i as a named tuple of features and targets.
dataset[:]: Return all observations as a named tuple of features and targets.
length(dataset): Number of observations.
Examples
julia> using MLDatasets: Titanic
julia> using DataFrames
julia> dataset = Titanic()
Titanic:
metadata => Dict{String, Any} with 5 entries
features => 891×11 DataFrame
targets => 891×1 DataFrame
dataframe => 891×12 DataFrame
julia> describe(dataset.dataframe)
12×7 DataFrame
 Row │ variable     mean      min                  median   max                          nmissing  eltype
     │ Symbol       Union…    Any                  Union…   Any                          Int64     Type
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ PassengerId  446.0     1                    446.0    891                                 0  Int64
   2 │ Survived     0.383838  0                    0.0      1                                   0  Int64
   3 │ Pclass       2.30864   1                    3.0      3                                   0  Int64
   4 │ Name                   Abbing, Mr. Anthony           van Melkebeke, Mr. Philemon         0  String
   5 │ Sex                    female                        male                                0  String7
   6 │ Age          29.6991   0.42                 28.0     80.0                              177  Union{Missing, Float64}
   7 │ SibSp        0.523008  0                    0.0      8                                   0  Int64
   8 │ Parch        0.381594  0                    0.0      6                                   0  Int64
   9 │ Ticket                 110152                        WE/P 5735                           0  String31
  10 │ Fare         32.2042   0.0                  14.4542  512.329                             0  Float64
  11 │ Cabin                  A10                           T                                 687  Union{Missing, String15}
  12 │ Embarked               C                             S                                   2  Union{Missing, String1}
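The summary above reports missing entries in Age, Cabin and Embarked, so these columns need some care before numerical work. The sketch below shows one common way to inspect and fill missing values using only Base Julia and the Statistics standard library. The `age` vector is a small made-up stand-in for such a column, not data taken from the dataset.

```julia
using Statistics: mean, median

# Made-up stand-in for a column with missing entries, such as Age above.
age = [22.0, 38.0, missing, 35.0, missing, 54.0]

nmissing = count(ismissing, age)        # number of missing entries
age_mean = mean(skipmissing(age))       # mean over the observed values
age_median = median(skipmissing(age))   # median over the observed values

# Replace missing entries with the median of the observed values.
age_filled = coalesce.(age, age_median)
```

Median imputation is only one option. Dropping incomplete rows or adding an indicator for missingness are equally reasonable choices, depending on the model.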
MLDatasets.Wine — Type
Wine(; as_df = true, dir = nothing)
The UCI Wine dataset.
These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
The data source is the UCI Machine Learning Repository, where further details can be found.
Arguments
If as_df = true, load the data as dataframes instead of plain arrays.
You can pass a specific dir to choose where the dataset is loaded from or downloaded to; otherwise the default directory is used.
Fields
metadata: A dictionary containing additional information on the dataset.
features: The data features. An array if as_df=false, otherwise a dataframe.
targets: The targets for supervised learning. An array if as_df=false, otherwise a dataframe.
dataframe: A dataframe containing both features and targets. It is nothing if as_df=false, otherwise a dataframe.
Methods
dataset[i]: Return observation(s) i as a named tuple of features and targets.
dataset[:]: Return all observations as a named tuple of features and targets.
length(dataset): Number of observations.
Examples
julia> using MLDatasets: Wine
julia> using DataFrames
julia> dataset = Wine()
dataset Wine:
metadata => Dict{String, Any} with 5 entries
features => 178×13 DataFrame
targets => 178×1 DataFrame
dataframe => 178×14 DataFrame
julia> describe(dataset.dataframe)
14×7 DataFrame
 Row │ variable              mean      min    median   max    nmissing  eltype
     │ Symbol                Float64   Real   Float64  Real   Int64     DataType
─────┼────────────────────────────────────────────────────────────────────────────────
   1 │ Wine                    1.9382      1      2.0      3         0  Int64
   2 │ Alcohol                13.0006  11.03    13.05  14.83         0  Float64
   3 │ Malic.acid             2.33635   0.74    1.865    5.8         0  Float64
   4 │ Ash                    2.36652   1.36     2.36   3.23         0  Float64
   5 │ Acl                    19.4949   10.6     19.5   30.0         0  Float64
   6 │ Mg                     99.7416     70     98.0    162         0  Int64
   7 │ Phenols                2.29511   0.98    2.355   3.88         0  Float64
   8 │ Flavanoids             2.02927   0.34    2.135   5.08         0  Float64
   9 │ Nonflavanoid.phenols  0.361854   0.13     0.34   0.66         0  Float64
  10 │ Proanth                 1.5909   0.41    1.555   3.58         0  Float64
  11 │ Color.int              5.05809   1.28     4.69   13.0         0  Float64
  12 │ Hue                   0.957449   0.48    0.965   1.71         0  Float64
  13 │ OD                     2.61169   1.27     2.78    4.0         0  Float64
  14 │ Proline                746.893    278    673.5   1680         0  Int64
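The summary shows features on very different scales (Hue stays below 2 while Proline reaches 1680), which matters for distance-based or gradient-based methods. The sketch below z-scores each feature. It assumes the array layout has one column per observation, as in the as_df=false outputs of the other datasets on this page. `standardize` is a hypothetical helper, not part of MLDatasets, and the matrix holds made-up values.

```julia
using Statistics: mean, std

# Hypothetical helper: z-score each feature (row) across observations (columns).
function standardize(X)
    μ = mean(X; dims = 2)
    σ = std(X; dims = 2)
    return (X .- μ) ./ σ
end

# Made-up stand-in with two features on very different scales,
# one column per observation.
X = [13.2   12.4  14.1   13.7
     1050.0 520.0 1480.0 735.0]

Z = standardize(X)
```

After this step every row of Z has mean zero and unit standard deviation. A constant feature has zero standard deviation and would produce NaN, so such rows need to be dropped or handled separately.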