This package represents a community effort to provide a common interface for accessing common Machine Learning (ML) datasets. In contrast to other data-related Julia packages, the focus of MLDatasets.jl is specifically on downloading, unpacking, and accessing benchmark datasets. Functionality for data processing or visualization is provided only to the degree that it is specific to a particular dataset.
To install MLDatasets.jl, start up Julia and type the following code snippet into the REPL. It makes use of the native Julia package manager.
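A minimal sketch of the installation step, assuming a Julia version that ships the `Pkg` standard library (1.0 or later):

```julia
# Load the built-in package manager and install the latest tagged release.
import Pkg
Pkg.add("MLDatasets")
```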
Additionally, if you encounter any sudden issues, or if you would like to contribute to the package, you can manually choose to be on the latest (untagged) version.
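One way to do this with the current package manager is `Pkg.develop`, which tracks a local clone of the repository instead of a tagged release. This is a sketch under the assumption that a development checkout is what you want; it is not necessarily the command the package authors had in mind.

```julia
import Pkg

# Clone the repository (by default into ~/.julia/dev/MLDatasets) and use
# that checkout, which follows the repository's default branch.
Pkg.develop("MLDatasets")

# To return to the latest tagged release later:
# Pkg.free("MLDatasets")
```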
MLDatasets.jl is organized so that each dataset has its own dedicated sub-module. Where possible, those sub-modules share a common interface for interacting with the datasets. For example, you can load the training set and the test set of the MNIST database of handwritten digits using the following commands:

    using MLDatasets

    train_x, train_y = MNIST.traindata()
    test_x, test_y = MNIST.testdata()

To load the data, the package looks for the necessary files in various locations (see DataDeps.jl for more information on how to configure such defaults). If the data can't be found in any of those locations, the package will trigger a download dialog to `~/.julia/datadeps/MNIST`. To override this on a case-by-case basis, it is possible to specify a data directory directly in `traindata(dir = <directory>)` and `testdata(dir = <directory>)`.
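As a short illustration of the `dir` keyword, the sketch below passes an explicit directory to both loaders. It assumes the `traindata`/`testdata` interface described above; the path stored in `data_dir` is hypothetical and should point to wherever the MNIST files are (or should be) stored.

```julia
using MLDatasets

# Hypothetical location; replace it with your own data directory.
data_dir = joinpath(homedir(), "data", "mnist")

# Both calls use data_dir instead of the default DataDeps location.
train_x, train_y = MNIST.traindata(dir = data_dir)
test_x, test_y = MNIST.testdata(dir = data_dir)
```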
Each dataset has its own dedicated sub-module, so the documentation is organized the same way. Find below a list of available datasets and their documentation.
This package provides a variety of common benchmark datasets for the purpose of image classification.
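
| Dataset | Classes | Train tensor | Train labels | Test tensor | Test labels |
|:--:|:--:|:--:|:--:|:--:|:--:|
| CIFAR-100 | 100 (20) | 32x32x3x50000 | 50000 (x2) | 32x32x3x10000 | 10000 (x2) |

(*) Note that the SVHN-2 dataset provides an additional 531131 observations aside from the training and test sets.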
The PTBLM dataset consists of Penn Treebank sentences for language modeling, available from tomsercu/lstm. Unknown words are replaced with `<unk>` so that the total vocabulary size becomes 10000.
This is the first sentence of the PTBLM dataset.

    x, y = PTBLM.traindata()

    x[1]  # ["no", "it", "was", "n't", "black", "monday"]
    y[1]  # ["it", "was", "n't", "black", "monday", "<eos>"]

MLDatasets adds the special word `<eos>` to the end of each sentence in `y`.
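The relationship between `x` and `y` in the example can be reproduced without downloading anything. The sketch below assumes, as the example sentence suggests, that each target sequence is the input sentence shifted by one word with `<eos>` appended. `next_word_targets` is a hypothetical helper written for illustration and is not part of the package.

```julia
# Hypothetical helper (not part of MLDatasets.jl): build the next-word
# targets for one tokenized sentence by dropping the first word and
# appending the end-of-sentence marker.
next_word_targets(sentence::Vector{String}) = vcat(sentence[2:end], "<eos>")

x1 = ["no", "it", "was", "n't", "black", "monday"]
y1 = next_word_targets(x1)
```

Each position in `y1` holds the word that follows the word at the same position in `x1`, which is the usual setup for next-word prediction.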
The UD_English dataset (Universal Dependencies English Web Treebank) is an annotated corpus of morphological features, POS tags, and syntactic trees. The dataset follows the CoNLL-style format.

    traindata = UD_English.traindata()
    devdata = UD_English.devdata()
    testdata = UD_English.testdata()

| | Train x | Train y | Test x | Test y |
|:--:|:--:|:--:|:--:|:--:|
| PTBLM | 42068 | 42068 | 3761 | 3761 |
| UD_English | 12543 | - | 2077 | - |