Miscellaneous Datasets

Documentation

MLDatasets.BostonHousing - Type
BostonHousing(; as_df = true, dir = nothing)

The classical Boston Housing tabular dataset.

Sources:

  • Origin: This dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University.
  • Creator: Harrison, D. and Rubinfeld, D.L., 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol. 5, 81-102, 1978.
  • Date: July 7, 1993

Number of Instances: 506

Number of Attributes: 13 continuous attributes (including target attribute "MEDV"), 1 binary-valued attribute.

Arguments

  • If as_df = true, load the data as dataframes instead of plain arrays.

  • You can pass a specific dir from which to load the dataset or into which to download it; otherwise the default directory is used.

Fields

  • metadata: A dictionary containing additional information on the dataset.
  • features: The data features. An array if as_df=false, otherwise a dataframe.
  • targets: The targets for supervised learning. An array if as_df=false, otherwise a dataframe.
  • dataframe: A dataframe containing both features and targets. It is nothing if as_df=false, otherwise a dataframe.

Methods

  • dataset[i]: Return observation(s) i as a named tuple of features and targets.
  • dataset[:]: Return all observations as a named tuple of features and targets.
  • length(dataset): Number of observations.

Examples

julia> using MLDatasets: BostonHousing

julia> dataset = BostonHousing()
BostonHousing:
  metadata => Dict{String, Any} with 5 entries
  features => 506×13 DataFrame
  targets => 506×1 DataFrame
  dataframe => 506×14 DataFrame


julia> dataset[1:5][1]
5×13 DataFrame
 Row │ CRIM     ZN       INDUS    CHAS   NOX      RM       AGE      DIS      RAD    TAX    PTRATIO  B        LSTAT   
     │ Float64  Float64  Float64  Int64  Float64  Float64  Float64  Float64  Int64  Int64  Float64  Float64  Float64 
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 0.00632     18.0     2.31      0    0.538    6.575     65.2   4.09        1    296     15.3   396.9      4.98
   2 │ 0.02731      0.0     7.07      0    0.469    6.421     78.9   4.9671      2    242     17.8   396.9      9.14
   3 │ 0.02729      0.0     7.07      0    0.469    7.185     61.1   4.9671      2    242     17.8   392.83     4.03
   4 │ 0.03237      0.0     2.18      0    0.458    6.998     45.8   6.0622      3    222     18.7   394.63     2.94
   5 │ 0.06905      0.0     2.18      0    0.458    7.147     54.2   6.0622      3    222     18.7   396.9      5.33

julia> dataset[1:5][2]
5×1 DataFrame
 Row │ MEDV    
     │ Float64 
─────┼─────────
   1 │    24.0
   2 │    21.6
   3 │    34.7
   4 │    33.4
   5 │    36.2

julia> X, y = BostonHousing(as_df=false)[:]
([0.00632 0.02731 … 0.10959 0.04741; 18.0 0.0 … 0.0 0.0; … ; 396.9 396.9 … 393.45 396.9; 4.98 9.14 … 6.48 7.88], [24.0 21.6 … 22.0 11.9])
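
The examples above index the returned value by position ([1] and [2]). Since the Methods section describes it as a named tuple of features and targets, access by name should work as well. The sketch below uses a small stand-in named tuple (field names features and targets are assumed from that description, values copied from the tables above) so that it runs without downloading the dataset:

```julia
# Stand-in for the value returned by `dataset[i]` with as_df=false.
# Assumed field names: `features` and `targets`.
obs = (features = [0.00632 0.02731; 18.0 0.0],  # 2 features × 2 observations
       targets  = [24.0 21.6])                  # 1 × 2 row of MEDV values

# Positional access, as in `dataset[1:5][1]` and `dataset[1:5][2]`.
X_pos, y_pos = obs[1], obs[2]

# Access by field name.
X_named, y_named = obs.features, obs.targets

# Destructuring, as in `X, y = BostonHousing(as_df=false)[:]`.
X, y = obs
```

Access by name tends to be more robust than positional indexing if the order of fields ever changes.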
MLDatasets.Iris - Type
Iris(; as_df = true, dir = nothing)

Fisher's classic iris dataset.

Measurements from 3 different species of iris: setosa, versicolor and virginica. There are 50 examples of each species.

There are 4 measurements for each example: sepal length, sepal width, petal length and petal width. The measurements are in centimeters.

The module retrieves the data from the UCI Machine Learning Repository.

NOTE: This dataset has no pre-defined train-test split.

Arguments

  • If as_df = true, load the data as dataframes instead of plain arrays.

  • You can pass a specific dir from which to load the dataset or into which to download it; otherwise the default directory is used.

Fields

  • metadata: A dictionary containing additional information on the dataset.
  • features: The data features. An array if as_df=false, otherwise a dataframe.
  • targets: The targets for supervised learning. An array if as_df=false, otherwise a dataframe.
  • dataframe: A dataframe containing both features and targets. It is nothing if as_df=false, otherwise a dataframe.

Methods

  • dataset[i]: Return observation(s) i as a named tuple of features and targets.
  • dataset[:]: Return all observations as a named tuple of features and targets.
  • length(dataset): Number of observations.

Examples

julia> dataset = Iris()
Iris:
  metadata => Dict{String, Any} with 4 entries
  features => 150×4 DataFrame
  targets => 150×1 DataFrame
  dataframe => 150×5 DataFrame


julia> dataset[1:2]
(2×4 DataFrame
 Row │ sepallength  sepalwidth  petallength  petalwidth 
     │ Float64      Float64     Float64      Float64    
─────┼──────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2
   2 │         4.9         3.0          1.4         0.2, 2×1 DataFrame
 Row │ class       
     │ String15    
─────┼─────────────
   1 │ Iris-setosa
   2 │ Iris-setosa)

julia> X, y = Iris(as_df=false)[:]
([5.1 4.9 … 6.2 5.9; 3.5 3.0 … 3.4 3.0; 1.4 1.4 … 5.4 5.1; 0.2 0.2 … 2.3 1.8], InlineStrings.String15["Iris-setosa" "Iris-setosa" … "Iris-virginica" "Iris-virginica"])
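
Because the dataset comes without a pre-defined train-test split, one has to be created by hand. The following is a minimal sketch of a random split, not part of MLDatasets: split_observations is a hypothetical helper, and the arrays are stand-ins with the same layout as the output of Iris(as_df=false)[:] shown above (4×150 features, 1×150 labels, observations along the second dimension).

```julia
using Random

# Hypothetical helper, not part of MLDatasets: randomly split observations
# stored along the second dimension into a training and a test part.
function split_observations(X::AbstractMatrix, y::AbstractMatrix;
                            at::Real = 0.8,
                            rng::AbstractRNG = Random.default_rng())
    n = size(X, 2)
    size(y, 2) == n || throw(DimensionMismatch("X and y must have the same number of columns"))
    perm = randperm(rng, n)                 # shuffled observation indices
    ntrain = round(Int, at * n)             # number of training observations
    train_idx, test_idx = perm[1:ntrain], perm[ntrain+1:end]
    return (X[:, train_idx], y[:, train_idx]), (X[:, test_idx], y[:, test_idx])
end

# Stand-in arrays; replace with `X, y = Iris(as_df=false)[:]` for the real data.
X = rand(MersenneTwister(0), 4, 150)
y = reshape(repeat(["Iris-setosa", "Iris-versicolor", "Iris-virginica"], inner = 50), 1, :)

(Xtrain, ytrain), (Xtest, ytest) = split_observations(X, y; at = 0.8, rng = MersenneTwister(42))
```

Passing an explicit rng keeps the split reproducible. The sketch does not stratify by class, so the three species may end up unevenly represented in the two parts.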
MLDatasets.Mutagenesis - Type
Mutagenesis(; split=:train, dir=nothing)
Mutagenesis(split; dir=nothing)

The Mutagenesis dataset comprises 188 molecules trialed for mutagenicity on Salmonella typhimurium, available from relational.fit.cvut.cz and CTUAvastLab/datasets.

Set split to :train, :val, :test, or :all to select the training, validation, or test partition, or the whole dataset, respectively. The indexes field in the result contains the indexes of the partition in the full dataset.

Website: https://relational.fit.cvut.cz/dataset/Mutagenesis

License: CC0

julia> using MLDatasets: Mutagenesis

julia> dataset = Mutagenesis(:train)
Mutagenesis dataset:
  split : train
  indexes : 100-element Vector{Int64}
  features : 100-element Vector{Dict{Symbol, Any}}
  targets : 100-element Vector{Int64}

julia> dataset[1].features
Dict{Symbol, Any} with 5 entries:
  :lumo  => -1.246
  :inda  => 0
  :logp  => 4.23
  :ind1  => 1
  :atoms => Dict{Symbol, Any}[Dict(:element=>"c", :bonds=>Dict{Symbol, Any}[Dict(:element=>"c", :bond_type=>7, :charge=>-0.117, :atom_type=>22), Dict(:element=>"h", :bond_type=>1, :charge=>0.142, :atom_type=>3)…

julia> dataset[1].targets
1

julia> dataset = Mutagenesis(:all)
Mutagenesis dataset:
  split : all
  indexes : 188-element Vector{Int64}
  features : 188-element Vector{Dict{Symbol, Any}}
  targets : 188-element Vector{Int64}
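
Each observation's features are a nested dictionary, as the dataset[1].features output above shows. The sketch below walks such a structure using a hand-built stand-in: it relies only on the keys visible in that truncated output (:atoms, :element, :bonds), and the second atom is made up for illustration.

```julia
# Stand-in with the same nesting as `dataset[1].features`; the first atom is
# copied from the output above, the second one is invented.
features = Dict{Symbol, Any}(
    :lumo => -1.246, :inda => 0, :logp => 4.23, :ind1 => 1,
    :atoms => Dict{Symbol, Any}[
        Dict{Symbol, Any}(:element => "c", :bonds => Dict{Symbol, Any}[
            Dict{Symbol, Any}(:element => "c", :bond_type => 7, :charge => -0.117, :atom_type => 22),
            Dict{Symbol, Any}(:element => "h", :bond_type => 1, :charge => 0.142, :atom_type => 3)
        ]),
        Dict{Symbol, Any}(:element => "h", :bonds => Dict{Symbol, Any}[])
    ]
)

# Count atoms per element symbol.
element_counts = Dict{String, Int}()
for atom in features[:atoms]
    element_counts[atom[:element]] = get(element_counts, atom[:element], 0) + 1
end

# Total number of bond entries listed across all atoms.
n_bonds = sum(length(atom[:bonds]) for atom in features[:atoms])
```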
MLDatasets.Titanic - Type
Titanic(; as_df = true, dir = nothing)

The Titanic dataset, describing the survival of passengers on the Titanic ship.

Arguments

  • If as_df = true, load the data as dataframes instead of plain arrays.

  • You can pass a specific dir from which to load the dataset or into which to download it; otherwise the default directory is used.

Fields

  • metadata: A dictionary containing additional information on the dataset.
  • features: The data features. An array if as_df=false, otherwise a dataframe.
  • targets: The targets for supervised learning. An array if as_df=false, otherwise a dataframe.
  • dataframe: A dataframe containing both features and targets. It is nothing if as_df=false, otherwise a dataframe.

Methods

  • dataset[i]: Return observation(s) i as a named tuple of features and targets.
  • dataset[:]: Return all observations as a named tuple of features and targets.
  • length(dataset): Number of observations.

Examples

julia> using MLDatasets: Titanic

julia> using DataFrames

julia> dataset = Titanic()
Titanic:
  metadata => Dict{String, Any} with 5 entries
  features => 891×11 DataFrame
  targets => 891×1 DataFrame
  dataframe => 891×12 DataFrame


julia> describe(dataset.dataframe)
12×7 DataFrame
 Row │ variable     mean      min                  median   max                          nmissing  eltype                   
     │ Symbol       Union…    Any                  Union…   Any                          Int64     Type                     
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ PassengerId  446.0     1                    446.0    891                                 0  Int64
   2 │ Survived     0.383838  0                    0.0      1                                   0  Int64
   3 │ Pclass       2.30864   1                    3.0      3                                   0  Int64
   4 │ Name                   Abbing, Mr. Anthony           van Melkebeke, Mr. Philemon         0  String
   5 │ Sex                    female                        male                                0  String7
   6 │ Age          29.6991   0.42                 28.0     80.0                              177  Union{Missing, Float64}
   7 │ SibSp        0.523008  0                    0.0      8                                   0  Int64
   8 │ Parch        0.381594  0                    0.0      6                                   0  Int64
   9 │ Ticket                 110152                        WE/P 5735                           0  String31
  10 │ Fare         32.2042   0.0                  14.4542  512.329                             0  Float64
  11 │ Cabin                  A10                           T                                 687  Union{Missing, String15}
  12 │ Embarked               C                             S                                   2  Union{Missing, String1}
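
The nmissing column above shows that Age, Cabin, and Embarked contain missing values, so their element types are unions with Missing. The following sketch shows one possible way to handle such a column using only Base and the Statistics standard library. The vector is a stand-in with illustrative values, not data read from the dataset, and mean imputation is just one option among many.

```julia
using Statistics

# Stand-in for a column such as `Age` (eltype Union{Missing, Float64}).
age = Union{Missing, Float64}[22.0, missing, 38.0, 26.0, missing]

n_missing  = count(ismissing, age)       # number of missing entries
mean_age   = mean(skipmissing(age))      # statistic over observed values only
age_filled = coalesce.(age, mean_age)    # replace each missing with the observed mean
```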
MLDatasets.Wine - Type
Wine(; as_df = true, dir = nothing)

The UCI Wine dataset.

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

The data source is the UCI Machine Learning Repository, where further details can be found.

Arguments

  • If as_df = true, load the data as dataframes instead of plain arrays.

  • You can pass a specific dir from which to load the dataset or into which to download it; otherwise the default directory is used.

Fields

  • metadata: A dictionary containing additional information on the dataset.
  • features: The data features. An array if as_df=false, otherwise a dataframe.
  • targets: The targets for supervised learning. An array if as_df=false, otherwise a dataframe.
  • dataframe: A dataframe containing both features and targets. It is nothing if as_df=false, otherwise a dataframe.

Methods

  • dataset[i]: Return observation(s) i as a named tuple of features and targets.
  • dataset[:]: Return all observations as a named tuple of features and targets.
  • length(dataset): Number of observations.

Examples

julia> using MLDatasets: Wine

julia> using DataFrames

julia> dataset = Wine()
dataset Wine:
  metadata   =>    Dict{String, Any} with 5 entries
  features   =>    178×13 DataFrame
  targets    =>    178×1 DataFrame
  dataframe  =>    178×14 DataFrame


julia> describe(dataset.dataframe)
14×7 DataFrame
 Row │ variable              mean        min     median   max      nmissing  eltype   
     │ Symbol                Float64     Real    Float64  Real     Int64     DataType 
─────┼────────────────────────────────────────────────────────────────────────────────
   1 │ Wine                    1.9382      1       2.0       3            0  Int64
   2 │ Alcohol                13.0006     11.03   13.05     14.83         0  Float64
   3 │ Malic.acid              2.33635     0.74    1.865     5.8          0  Float64
   4 │ Ash                     2.36652     1.36    2.36      3.23         0  Float64
   5 │ Acl                    19.4949     10.6    19.5      30.0          0  Float64
   6 │ Mg                     99.7416     70      98.0     162            0  Int64
   7 │ Phenols                 2.29511     0.98    2.355     3.88         0  Float64
   8 │ Flavanoids              2.02927     0.34    2.135     5.08         0  Float64
   9 │ Nonflavanoid.phenols    0.361854    0.13    0.34      0.66         0  Float64
  10 │ Proanth                 1.5909      0.41    1.555     3.58         0  Float64
  11 │ Color.int               5.05809     1.28    4.69     13.0          0  Float64
  12 │ Hue                     0.957449    0.48    0.965     1.71         0  Float64
  13 │ OD                      2.61169     1.27    2.78      4.0          0  Float64
  14 │ Proline               746.893     278     673.5    1680            0  Int64