API Reference
Core API
These functions are defined in MLCore.jl.
MLCore.getobs — Function
getobs(data, [idx])
Return the observations corresponding to the observation index idx.
The index idx is an integer with values in the range 1:numobs(data). Types can optionally support idx being an array of integers.
If data does not have getobs defined, then in the case of Tables.istable(data) == true returns the row(s) in position idx, otherwise returns data[idx].
Authors of custom data containers should implement Base.getindex for their type instead of getobs. getobs should only be implemented for types where there is a difference between getobs and Base.getindex (such as multi-dimensional arrays).
The returned observation(s) should be in the form intended to be passed as-is to some learning algorithm. There is no strict interface requirement on what this "actual data" must look like; the author of each custom data container can make this decision. The output should be consistent whether idx is a scalar or a vector.
By default, getobs supports nested combinations of arrays, tuples, named tuples, and dictionaries.
The return value of getobs should always be a materialized object, not a view, although it can be a reference to the original data.
If the argument idx is not provided, getobs(data) should return a materialized version of the data.
Examples
julia> x = (a = [1, 2, 3], b = rand(6, 3));
julia> getobs(x, 2) == (a = 2, b = x.b[:, 2])
true
julia> getobs(x, [1, 3]) == (a = [1, 3], b = x.b[:, [1, 3]])
true
julia> x = Dict(:a => [1, 2, 3], :b => rand(6, 3));
julia> getobs(x, 2) == Dict(:a => 2, :b => x[:b][:, 2])
true
julia> getobs(x, [1, 3]) == Dict(:a => [1, 3], :b => x[:b][:, [1, 3]])
true
julia> struct DummyDataset end
julia> MLCore.numobs(d::DummyDataset) = 10
julia> MLCore.getobs(d::DummyDataset) = [1:10;]
julia> MLCore.getobs(d::DummyDataset, i::Int) = 0 < i <= numobs(d) ? i : throw(ArgumentError("Index out of bounds"))
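The recommendation to implement Base.getindex and Base.length rather than getobs and numobs can be sketched as follows. SquaresDataset is a hypothetical type invented for this illustration; the sketch assumes only the documented fallbacks (getobs falls back to data[idx], numobs to length(data)).

```julia
using MLUtils

# Hypothetical container: observation i is the square of i.
struct SquaresDataset
    n::Int
end

# Only the Base interface is implemented; getobs/numobs are not overloaded.
Base.length(d::SquaresDataset) = d.n
Base.getindex(d::SquaresDataset, i::Integer) = i^2
Base.getindex(d::SquaresDataset, idxs::AbstractVector{<:Integer}) = [d[i] for i in idxs]

d = SquaresDataset(5)
numobs(d)          # falls back to length(d)
getobs(d, 3)       # falls back to d[3]
getobs(d, [1, 2])  # falls back to d[[1, 2]]
```

Supporting a vector of indices is optional, but it lets the container be used for mini-batches as well as single observations.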
MLCore.getobs! — Function
getobs!(buffer, data, idx)
In-place version of getobs(data, idx). If this method is defined for the type of data, then buffer should be used to store the result, instead of allocating a dedicated object.
Implementing this function is optional. If no such method is provided for the type of data, then buffer will be ignored and the result of getobs returned. This could be because the type of data may not lend itself to the concept of copy!.
Custom implementations of getobs! should be consistent with getobs in terms of the output format, that is getobs!(buffer, data, idx) == getobs(data, idx).
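A minimal sketch of the consistency requirement, assuming the array method of getobs! writes into a preallocated buffer of matching size. The variable names are illustrative only.

```julia
using MLUtils

X = reshape(collect(1.0:12.0), 3, 4)   # 4 observations with 3 features each
buffer = zeros(3, 2)                    # room for a batch of 2 observations

out = getobs!(buffer, X, [2, 4])

# The in-place result agrees with the allocating version.
out == getobs(X, [2, 4])
```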
MLCore.numobs — Function
numobs(data)
Return the total number of observations contained in data.
If data does not have numobs defined, then in the case of Tables.istable(data) == true returns the number of rows, otherwise returns length(data).
Authors of custom data containers should implement Base.length for their type instead of numobs. numobs should only be implemented for types where there is a difference between numobs and Base.length (such as multi-dimensional arrays).
By default, numobs supports nested combinations of arrays, tuples, named tuples, and dictionaries.
See also getobs.
Examples
julia> x = (a = [1, 2, 3], b = ones(6, 3)); # named tuples
julia> numobs(x)
3
julia> x = Dict(:a => [1, 2, 3], :b => ones(6, 3)); # dictionaries
julia> numobs(x)
3
All internal containers must have the same number of observations:
julia> x = (a = [1, 2, 3, 4], b = ones(6, 3));
julia> numobs(x)
ERROR: DimensionMismatch: All data containers must have the same number of observations.
Stacktrace:
[1] _check_numobs_error()
@ MLCore ~/.julia/dev/MLCore/src/observation.jl:176
[2] _check_numobs
@ ~/.julia/dev/MLCore/src/observation.jl:185 [inlined]
[3] numobs(data::@NamedTuple{a::Vector{Int64}, b::Matrix{Float64}})
@ MLCore ~/.julia/dev/MLCore/src/observation.jl:190
[4] top-level scope
@ REPL[13]:1
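Multi-dimensional arrays are the case where numobs and Base.length differ. A short sketch, relying on the convention that the last array dimension is the observation dimension:

```julia
using MLUtils

A = rand(6, 3)        # 3 observations with 6 features each
length(A)             # number of elements
numobs(A)             # number of observations (size of the last dimension)

B = rand(2, 3, 4)     # 4 observations, each a 2×3 array
numobs(B)
```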
Lazy Transforms
MLUtils.filterobs — Function
filterobs(f, data)
Return a subset of data container data including all indices i for which f(getobs(data, i)) === true.
data = 1:10
numobs(data) == 10
fdata = filterobs(>(5), data)
numobs(fdata) == 5
MLUtils.groupobs — Function
groupobs(f, data)
Split data container data into different data containers, grouping observations by f(obs).
data = -10:10
datas = groupobs(>(0), data)
length(datas) == 2
MLUtils.joinobs — Function
joinobs(datas...)
Concatenate data containers datas.
data1, data2 = 1:10, 11:20
jdata = joinobs(data1, data2)
getobs(jdata, 15) == 15
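The filterobs and joinobs snippets above can be combined into one runnable sketch. It assumes only that the returned containers support numobs and getobs with integer indices.

```julia
using MLUtils

# joinobs: observations of data2 follow those of data1
data1, data2 = 1:10, 11:20
jdata = joinobs(data1, data2)
numobs(jdata)       # 20
getobs(jdata, 15)   # 15, taken from data2

# filterobs: keep only observations for which the predicate is true
fdata = filterobs(>(5), 1:10)
[getobs(fdata, i) for i in 1:numobs(fdata)]   # 6, 7, 8, 9, 10
```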
MLUtils.mapobs — Function
mapobs(f, data; batched=:auto)
Lazily map f over the observations in a data container data. Returns a new data container mdata that can be indexed and has a length. Indexing triggers the transformation f.
The batched keyword argument controls the behavior of mdata[idx] and mdata[idxs] where idx is an integer and idxs is a vector of integers:
- batched=:auto (default). Let f handle the two cases. Calls f(getobs(data, idx)) and f(getobs(data, idxs)).
- batched=:never. The function f is always called on a single observation. Calls f(getobs(data, idx)) and [f(getobs(data, idx)) for idx in idxs].
- batched=:always. The function f is always called on a batch of observations. Calls getobs(f(getobs(data, [idx])), 1) and f(getobs(data, idxs)).
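The difference between batched=:auto and batched=:never shows up when f treats scalars and vectors differently. A small sketch; the function f below is hypothetical and chosen only to make the two code paths distinguishable.

```julia
using MLUtils

data = [1, 2, 3, 4]

# Hypothetical transformation with distinct scalar and vector behavior.
f(x::Integer) = 10x
f(x::AbstractVector) = sum(x)

auto  = mapobs(f, data)                     # batched=:auto
never = mapobs(f, data; batched = :never)

auto[2]          # f(2)
auto[[1, 3]]     # f([1, 3]): f sees the whole batch
never[[1, 3]]    # [f(1), f(3)]: f sees one observation at a time
```

With :never, f only needs a method for a single observation, which is often the simpler contract for per-sample transformations.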
Examples
julia> data = (a=[1,2,3], b=[1,2,3]);
julia> mdata = mapobs(data) do x
(c = x.a .+ x.b, d = x.a .- x.b)
end
mapobs(#25, (a = [1, 2, 3], b = [1, 2, 3]); batched=:auto))
julia> mdata[1]
(c = 2, d = 0)
julia> mdata[1:2]
(c = [2, 4], d = [0, 0])
mapobs(fs, data)
Lazily map each function in tuple fs
over the observations in data container data
. Returns a tuple of transformed data containers.
mapobs(namedfs::NamedTuple, data)
Map a NamedTuple of functions over data, turning it into a data container of NamedTuples. Field syntax can be used to select a column of the resulting data container.
data = 1:10
nameddata = mapobs((x = sqrt, y = log), data)
getobs(nameddata, 10) == (x = sqrt(10), y = log(10))
getobs(nameddata.x, 10) == sqrt(10)
mapobs(f, d::DataLoader)
Return a new dataloader based on d
that applies f
at each iteration.
Examples
julia> X = ones(3, 6);
julia> function f(x)
@show x
return x
end
f (generic function with 1 method)
julia> d = DataLoader(X, batchsize=2, collate=false);
julia> d = mapobs(f, d);
julia> for x in d
@assert size(x) == (2,)
@assert size(x[1]) == (3,)
end
x = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
x = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
x = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
julia> d2 = DataLoader(X, batchsize=2, collate=true);
julia> d2 = mapobs(f, d2);
julia> for x in d2
@assert size(x) == (3, 2)
end
x = [1.0 1.0; 1.0 1.0; 1.0 1.0]
x = [1.0 1.0; 1.0 1.0; 1.0 1.0]
x = [1.0 1.0; 1.0 1.0; 1.0 1.0]
MLUtils.shuffleobs — Function
shuffleobs([rng], data)
Return a version of the dataset data that contains all the original observations in a random order.
The values of data itself are not copied. Instead only the indices are shuffled. This function calls obsview to accomplish that, which means that the return value is likely of a different type than data.
Optionally, a random number generator rng can be passed as the first argument.
For this function to work, the type of data must implement numobs and getobs.
See also obsview.
Examples
# For Arrays the subset will be of type SubArray
@assert typeof(shuffleobs(rand(4,10))) <: SubArray
# Iterate through all observations in random order
for x in eachobs(shuffleobs(X))
...
end
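Passing an explicit rng makes the shuffle reproducible. The sketch below uses MersenneTwister from the Random standard library with an arbitrary seed; it also checks that the result is a view onto the original array rather than a copy.

```julia
using MLUtils, Random

X = reshape(collect(1:20), 2, 10)   # 10 observations with 2 features each

s1 = shuffleobs(MersenneTwister(42), X)
s2 = shuffleobs(MersenneTwister(42), X)

s1 == s2             # same seed, same order
parent(s1) === X     # a view: the underlying data is not copied
```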
Batching, Iteration, and Views
MLUtils.batch — Function
batch(xs)
Batch the arrays in xs
into a single array with an extra dimension.
If the elements of xs
are tuples, named tuples, or dicts, the output will be of the same type.
See also unbatch
and batch_sequence
.
Examples
julia> batch([[1,2,3],
[4,5,6]])
3×2 Matrix{Int64}:
1 4
2 5
3 6
julia> batch([(a=[1,2], b=[3,4])
(a=[5,6], b=[7,8])])
(a = [1 5; 2 6], b = [3 7; 4 8])
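Since unbatch is the counterpart of batch, a round trip should recover the original vectors. A brief sketch:

```julia
using MLUtils

xs = [[1, 2, 3], [4, 5, 6]]

b = batch(xs)     # 3×2 matrix, one column per element of xs
ys = unbatch(b)   # back to a vector of vectors
```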
MLUtils.batch_sequence — Function
batch_sequence(seqs; pad = 0)
Take a list of N sequences seqs, where the i-th sequence is an array with last dimension Li, and turn them into a single array with size (..., Lmax, N).
The sequences need to have the same size, except for the last dimension.
Short sequences will be padded by pad.
See also batch.
Examples
julia> batch_sequence([[1, 2, 3], [10, 20]])
3×2 Matrix{Int64}:
1 10
2 20
3 0
julia> seqs = (ones(2, 3), fill(2.0, (2, 5)))
([1.0 1.0 1.0; 1.0 1.0 1.0], [2.0 2.0 … 2.0 2.0; 2.0 2.0 … 2.0 2.0])
julia> batch_sequence(seqs, pad=-1)
2×5×2 Array{Float64, 3}:
[:, :, 1] =
1.0 1.0 1.0 -1.0 -1.0
1.0 1.0 1.0 -1.0 -1.0
[:, :, 2] =
2.0 2.0 2.0 2.0 2.0
2.0 2.0 2.0 2.0 2.0
MLUtils.batchsize — Function
batchsize(data::BatchView) -> Int
Return the fixed size of each batch in data
.
Examples
using MLUtils
X, Y = MLUtils.load_iris()
A = BatchView(X, batchsize=30)
@assert batchsize(A) == 30
MLUtils.batchseq — Function
batchseq(seqs, val = 0)
Take a list of N
sequences, and turn them into a single sequence where each item is a batch of N
. Short sequences will be padded by val
.
Examples
julia> batchseq([[1, 2, 3], [4, 5]], 0)
3-element Vector{Vector{Int64}}:
[1, 4]
[2, 5]
[3, 0]
MLUtils.BatchView — Type
BatchView(data, batchsize; partial=true, collate=nothing)
BatchView(data; batchsize=1, partial=true, collate=nothing)
Create a view of the given data that represents it as a vector of batches. Each batch contains the same number of observations, specified by the parameter batchsize. If the size of the dataset is not divisible by the specified batchsize, the remaining observations will be ignored if partial=false. If partial=true instead, the last batch can be smaller than batchsize.
If used as an iterator, the object will iterate over the dataset once, effectively denoting an epoch.
Any data access is delayed until iteration or indexing is performed. The getobs function is called on the data object to retrieve the observations.
For BatchView to work on some data structure, the type of the given variable data must implement the data container interface. See ObsView for more info.
Arguments
- data: The object describing the dataset. Can be of any type as long as it implements getobs and numobs (see Details for more information).
- batchsize: The batch-size of each batch. It is the number of observations that each batch must contain (except possibly for the last one).
- partial: If partial=false and the number of observations is not divisible by the batch-size, then the last mini-batch is dropped.
- collate: Defines the batching behavior.
  - If nothing (default), a batch is getobs(data, indices).
  - If false, each batch is [getobs(data, i) for i in indices].
  - If true, applies MLUtils.batch to the vector of observations in a batch, recursively collating arrays in the last dimensions. See MLUtils.batch for more information and examples.
  - If a custom function, it will be used in place of MLUtils.batch. It should take a vector of observations as input.
See also DataLoader.
Examples
julia> using MLUtils
julia> X, Y = MLUtils.load_iris();
julia> A = BatchView(X, batchsize=30);
julia> @assert eltype(A) <: Matrix{Float64}
julia> @assert length(A) == 5 # Iris has 150 observations
julia> @assert size(A[1]) == (4,30) # Iris has 4 features
julia> for x in BatchView(X, batchsize=30)
# 5 batches of size 30 observations
@assert size(x) == (4, 30)
@assert numobs(x) === 30
end
julia> for (x, y) in BatchView((X, Y), batchsize=20, partial=true)
# 7 batches of size 20 observations + 1 batch of 10 observations
@assert typeof(x) <: Matrix{Float64}
@assert typeof(y) <: Vector{String}
end
julia> for batch in BatchView((X, Y), batchsize=20, partial=false, collate=false)
# 7 batches of size 20 observations
@assert length(batch) == 20
x1, y1 = batch[1]
end
julia> function collate_fn(batch)
# collate observations into a custom batch
return hcat([x[1] for x in batch]...), join([x[2] for x in batch])
end;
julia> for (x, y) in BatchView((rand(10, 4), ["a", "b", "c", "d"]), batchsize=2, collate=collate_fn)
@assert size(x) == (10, 2)
@assert y isa String
end
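The effect of partial is easiest to see when the number of observations is not a multiple of the batch size. A small sketch with 10 observations and batchsize=3:

```julia
using MLUtils

X = rand(4, 10)   # 10 observations with 4 features each

keep = BatchView(X, batchsize=3)                  # partial=true (default)
drop = BatchView(X, batchsize=3, partial=false)

length(keep)     # 4 batches: 3 + 3 + 3 + 1 observations
size(keep[4])    # the last batch holds the single leftover observation
length(drop)     # 3 batches: the leftover observation is ignored
```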
MLUtils.eachobs — Function
eachobs(data; kws...)
Return an iterator over data.
Supports the same arguments as DataLoader. The batchsize default is -1 here while it is 1 for DataLoader.
Examples
X = rand(4,100)
for x in eachobs(X)
# loop entered 100 times
@assert typeof(x) <: Vector{Float64}
@assert size(x) == (4,)
end
# mini-batch iterations
for x in eachobs(X, batchsize=10)
# loop entered 10 times
@assert typeof(x) <: Matrix{Float64}
@assert size(x) == (4,10)
end
# support for tuples, named tuples, dicts
for (x, y) in eachobs((X, Y))
# ...
end
MLUtils.DataLoader — Type
DataLoader(data; [batchsize, buffer, collate, parallel, partial, rng, shuffle])
An object that iterates over mini-batches of data, each mini-batch containing batchsize observations (except possibly the last one).
Takes as input a single data array, a tuple (or a named tuple) of arrays, or in general any data object that implements the numobs and getobs methods.
The last dimension in each array is the observation dimension, i.e. the one divided into mini-batches.
The original data is preserved in the data field of the DataLoader.
Arguments
- data: The data to be iterated over. The data type has to be supported by numobs and getobs.
- batchsize: If less than 0, iterates over individual observations. Otherwise, each iteration (except possibly the last) yields a mini-batch containing batchsize observations. Default 1.
- buffer: If buffer=true and supported by the type of data, a buffer will be allocated and reused for memory efficiency. You may want to set partial=false to avoid a size mismatch. Alternatively, an external buffer can be passed to be used in getobs! (depending on the collate and batchsize options, this could be getobs!(buffer, data, idxs) or getobs!(buffer[i], data, idx)). Default false.
- collate: Defines the batching behavior. Default nothing.
  - If nothing, a batch is getobs(data, indices).
  - If false, each batch is [getobs(data, i) for i in indices].
  - If true, applies MLUtils.batch to the vector of observations in a batch, recursively collating arrays in the last dimensions. See MLUtils.batch for more information and examples.
  - If a custom function, it will be used in place of MLUtils.batch. It should take a vector of observations as input.
- parallel: Whether to load data in parallel using worker threads. This can speed up data loading by up to a factor of the number of available threads. Requires starting Julia with multiple threads. Check Threads.nthreads() to see the number of available threads. Passing parallel = true breaks ordering guarantees. Default false.
- partial: This argument is used only when batchsize > 0. If partial=false and the number of observations is not divisible by the batchsize, then the last mini-batch is dropped. Default true.
- rng: A random number generator. Default Random.default_rng().
- shuffle: Whether to shuffle the observations before iterating. Unlike wrapping the data container with shuffleobs(data), shuffle=true ensures that the observations are shuffled anew every time you start iterating over eachobs. Default false.
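The three built-in collate settings described above can be compared on a small matrix. This is a sketch with made-up data; it only inspects the first mini-batch of each loader.

```julia
using MLUtils

X = reshape(collect(1:12), 3, 4)   # 4 observations with 3 features each

b_default = first(DataLoader(X, batchsize=2))                  # collate=nothing
b_list    = first(DataLoader(X, batchsize=2, collate=false))
b_stacked = first(DataLoader(X, batchsize=2, collate=true))

b_default   # getobs(X, 1:2), a 3×2 matrix
b_list      # a vector of two observations, each a 3-element vector
b_stacked   # the two observations stacked by MLUtils.batch into a 3×2 matrix
```

For plain arrays, collate=nothing and collate=true yield the same batch; the distinction matters for containers whose getobs does not accept a vector of indices.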
Examples
julia> Xtrain = rand(10, 100);
julia> array_loader = DataLoader(Xtrain, batchsize=2);
julia> for x in array_loader
@assert size(x) == (10, 2)
# do something with x, 50 times
end
julia> array_loader.data === Xtrain
true
julia> tuple_loader = DataLoader((Xtrain,), batchsize=2); # similar, but yielding 1-element tuples
julia> for x in tuple_loader
@assert x isa Tuple{Matrix}
@assert size(x[1]) == (10, 2)
end
julia> Ytrain = rand('a':'z', 100); # now make a DataLoader yielding 2-element named tuples
julia> train_loader = DataLoader((data=Xtrain, label=Ytrain), batchsize=5, shuffle=true);
julia> for epoch in 1:100
for (x, y) in train_loader # access via tuple destructuring
@assert size(x) == (10, 5)
@assert size(y) == (5,)
# loss += f(x, y) # etc, runs 100 * 20 times
end
end
julia> first(train_loader).label isa Vector{Char} # access via property name
true
julia> first(train_loader).label == Ytrain[1:5] # because of shuffle=true
false
julia> foreach(println∘summary, DataLoader(rand(Int8, 10, 64), batchsize=30)) # partial=false would omit last
10×30 Matrix{Int8}
10×30 Matrix{Int8}
10×4 Matrix{Int8}
julia> collate_fn(batch) = join(batch);
julia> first(DataLoader(["a", "b", "c", "d"], batchsize=2, collate=collate_fn))
"ab"
MLUtils.obsview — Function
obsview(data, [indices])
Return a lazy view of the observations in data that correspond to the given indices. No data will be copied.
By default the return is an ObsView, although this can be overloaded for custom types of data that want to provide their own lazy view.
In case data is a tuple or named tuple, the constructor will be mapped over its elements. For array types, return a subarray.
The observation in the returned view ov can be materialized by calling getobs(ov, i) on the view, where i is an index in 1:length(ov).
If indices is not provided, it will be assumed to be 1:numobs(data).
Examples
julia> obsview([1 2 3; 4 5 6], 1:2)
2×2 view(::Matrix{Int64}, :, 1:2) with eltype Int64:
1 2
4 5
obsview(data::AbstractArray, [obsdim])
obsview(data::AbstractArray, idxs, [obsdim])
Return a view of the array data that corresponds to the given indices idxs. If obsdim of type ObsDim is provided, the observation dimension of the array is assumed to be along that dimension, otherwise it is assumed to be the last dimension.
If idxs is not provided, it will be assumed to be 1:numobs(data).
Examples
julia> x = rand(4, 5, 2);
julia> v = obsview(x, 2:3, ObsDim(2));
julia> numobs(v)
2
julia> getobs(v, 1) == x[:, 2, :]
true
julia> getobs(v, 1:2) == x[:, 2:3, :]
true
MLUtils.ObsDim — Type
ObsDim(d::Int)
Type to specify the observation dimension of an array.
It can be used in combination with obsview
.
MLUtils.ObsView — Type
ObsView(data, [indices])
Used to represent a subset of some data of arbitrary type by storing which observation-indices the subset spans. Furthermore, subsequent subsettings are accumulated without needing to access actual data.
The main purpose for the existence of ObsView is to delay data access and movement until an actual batch of data (or single observation) is needed for some computation. This is particularly useful when the data is not located in memory, but on the hard drive or some remote location. In such a scenario one wants to load the required data only when needed.
Any data access is delayed until getindex is called, and even getindex returns the result of obsview, which in general avoids data movement until getobs is called. If used as an iterator, the view will iterate over the dataset once, effectively denoting an epoch. Each iteration will return a lazy subset to the current observation.
Arguments
- data: The object describing the dataset. Can be of any type as long as it implements getobs and numobs (see Details for more information).
- indices: Optional. The index or indices of the observation(s) in data that the subset should represent. Can be of type Int or some subtype of AbstractVector.
Methods
- getindex: Returns the observation(s) of the given index/indices. No data is copied aside from the required indices.
- numobs: Returns the total number of observations in the subset.
- getobs: Returns the underlying data that the ObsView represents at the given relative indices. Note that these indices are in "subset space", and in general will not directly correspond to the same indices in the underlying data set.
Details
For ObsView to work on some data structure, the desired type MyType must implement the following interface:
- getobs(data::MyType, idx): Should return the observation(s) indexed by idx. In what form is up to the user. Note that idx can be of type Int or AbstractVector.
- numobs(data::MyType): Should return the total number of observations in data.
The following methods can also be provided and are optional:
- getobs(data::MyType): By default this function is the identity function. If that is not the behaviour that you want for your type, you need to provide this method as well.
- obsview(data::MyType, idx): If your custom type has its own kind of subset type, you can return it here. An example of such a case is SubArray for representing a subset of some AbstractArray.
- getobs!(buffer, data::MyType, [idx]): In-place version of getobs(data, idx). If this method is provided for MyType, then eachobs can preallocate a buffer that is then reused every iteration. Note: buffer should be equivalent to the return value of getobs(::MyType, ...), since this is how buffer is preallocated by default.
Examples
X, Y = MLUtils.load_iris()
# The iris set has 150 observations and 4 features
@assert size(X) == (4,150)
# Represents observations 21 to 100 (80 observations) as an ObsView
v = ObsView(X, 21:100)
@assert numobs(v) == 80
@assert typeof(v) <: ObsView
# getobs indexes into v
@assert getobs(v, 1:10) == X[:, 21:30]
# Use `obsview` to avoid boxing into ObsView
# for types that provide a custom "subset", such as arrays.
# Here it instead creates a native SubArray.
v = obsview(X, 1:100)
@assert numobs(v) == 100
@assert typeof(v) <: SubArray
# Also works for tuples of arbitrary length
subset = obsview((X, Y), 1:100)
@assert numobs(subset) == 100
@assert typeof(subset) <: Tuple # tuple of SubArray
# Use as iterator
for x in ObsView(X)
@assert typeof(x) <: SubArray{Float64,1}
end
# iterate over each individual labeled observation
for (x, y) in ObsView((X, Y))
@assert typeof(x) <: SubArray{Float64,1}
@assert typeof(y) <: String
end
# same but in random order
for (x, y) in ObsView(shuffleobs((X, Y)))
@assert typeof(x) <: SubArray{Float64,1}
@assert typeof(y) <: String
end
# Indexing: take first 10 observations
x, y = ObsView((X, Y))[1:10]
See also
MLUtils.randobs — Function
randobs(data, [n])
Pick a random observation or a batch of n
random observations from data
. For this function to work, the type of data
must implement numobs
and getobs
.
MLUtils.slidingwindow — Function
slidingwindow(data; size, stride=1, obsdim=nothing) -> SlidingWindow
Return a vector-like view of the data
for which each element is a fixed size "window" of size
adjacent observations.
stride
specifies the distance between the start elements of each adjacent window. The default value is 1. Note that only complete windows are included in the output, which implies that it is possible for excess observations to be omitted from the view.
obsdim
specifies the dimension along which the observations are indexed for the data types that support it (e.g. arrays). By default, the observations are indexed along the last dimension of the data. If obsdim
is specified it will be passed to obsview
to get a view of the data along that dimension.
Note that the windows are not materialized at construction time. To actually get a copy of the data at some window use indexing or getobs
.
When indexing, the data is accessed as getobs(data, idxs), with idxs an appropriate range of indices.
Examples
julia> s = slidingwindow(11:30, size=6)
slidingwindow(20-element UnitRange{Int64}, size=6, stride=1)
julia> s[1] # == getobs(data, 1:6)
11:16
julia> s[2] # == getobs(data, 2:7)
12:17
The optional parameter stride
can be used to specify the distance between the start elements of each adjacent window. By default the stride is equal to 1.
julia> s = slidingwindow(11:30, size=6, stride=3)
slidingwindow(20-element UnitRange{Int64}, size=6, stride=3)
julia> for w in s; println(w); end
11:16
14:19
17:22
20:25
23:28
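Because only complete windows are included, the number of windows follows from the number of observations, the window size, and the stride. The helper nwindows below is hypothetical (not part of MLUtils) and is written in plain Julia; it reproduces the window boundaries printed in the stride=3 example above.

```julia
# Hypothetical helper: number of complete windows of length `size`
# that fit into `n` observations when consecutive starts are `stride` apart.
nwindows(n, size, stride) = n < size ? 0 : (n - size) ÷ stride + 1

n, size, stride = 20, 6, 3          # data = 11:30 has 20 observations
k = nwindows(n, size, stride)       # 5

# Window i covers observation indices start_i : start_i + size - 1.
index_ranges = [(1 + (i - 1) * stride):((i - 1) * stride + size) for i in 1:k]

# Applied to data = 11:30 this gives the windows shown above.
windows = [(11:30)[r] for r in index_ranges]
```

With stride=1 the same formula gives 15 windows; the last 2 observations of the stride=3 case (29 and 30) belong to no complete window and are omitted.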
Partitioning
MLUtils.leavepout — Function
leavepout(n::Integer, [size = 1]) -> Tuple
Compute the train/validation assignments for k ≈ n/size
repartitions of n
observations, and return them in the form of two vectors. The first vector contains the index-vectors for the training subsets, and the second vector the index-vectors for the validation subsets respectively. Each validation subset will have either size
or size+1
observations assigned to it. The following code snippet generates the index-vectors for size = 2
.
julia> train_idx, val_idx = leavepout(10, 2);
Each observation is assigned to the validation subset once (and only once). Thus, a union over all validation index-vectors reproduces the full range 1:n
. Note that there is no random assignment of observations to subsets, which means that adjacent observations are likely to be part of the same validation subset.
julia> train_idx
5-element Array{Array{Int64,1},1}:
[3,4,5,6,7,8,9,10]
[1,2,5,6,7,8,9,10]
[1,2,3,4,7,8,9,10]
[1,2,3,4,5,6,9,10]
[1,2,3,4,5,6,7,8]
julia> val_idx
5-element Array{UnitRange{Int64},1}:
1:2
3:4
5:6
7:8
9:10
leavepout(data, p = 1)
Repartition a data
container using a k-fold strategy, where k
is chosen in such a way, that each validation subset of the resulting folds contains roughly p
observations. Defaults to p = 1
, which is also known as "leave-one-out" partitioning.
The resulting sequence of folds is returned as a lazy iterator. Only data subsets are created. That means no actual data is copied until getobs
is invoked.
for (train, val) in leavepout(X, p=2)
# if numobs(X) is divisible by 2,
# then numobs(val) will be 2 for each iteration,
# otherwise it may be 3 for the first few iterations.
end
See kfolds for a related function.
MLUtils.kfolds — Function
kfolds(n::Integer, k = 5) -> Tuple
Compute the train/validation assignments for k
repartitions of n
observations, and return them in the form of two vectors. The first vector contains the index-vectors for the training subsets, and the second vector the index-vectors for the validation subsets respectively. A general rule of thumb is to use either k = 5
or k = 10
.
Each observation is assigned to the validation subset once (and only once). Thus, a union over all validation index-vectors reproduces the full range 1:n
. Note that there is no random assignment of observations to subsets, which means that adjacent observations are likely to be part of the same validation subset.
Examples
julia> train_idx, val_idx = kfolds(10, 5);
julia> train_idx
5-element Vector{Vector{Int64}}:
[3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8]
julia> val_idx
5-element Vector{UnitRange{Int64}}:
1:2
3:4
5:6
7:8
9:10
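The properties stated above (each observation is validated exactly once, and fold sizes differ by at most one) can be checked directly. The sketch uses n = 11, which is not divisible by k = 5, to exercise the uneven case.

```julia
using MLUtils

n, k = 11, 5
train_idx, val_idx = kfolds(n, k)

# The validation folds together cover every index exactly once.
covered = sort(reduce(vcat, val_idx))

# Training and validation indices of the same fold never overlap.
disjoint = all(isempty(intersect(t, v)) for (t, v) in zip(train_idx, val_idx))

# Fold sizes differ by at most one observation.
spread = maximum(length, val_idx) - minimum(length, val_idx)
```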
kfolds(data, k = 5)
Repartition a data
container k
times using a k
folds strategy and return the sequence of folds as a lazy iterator. Only data subsets are created, which means that no actual data is copied until getobs
is invoked.
Conceptually, a k-folds repartitioning strategy divides the given data
into k
roughly equal-sized parts. Each part will serve as validation set once, while the remaining parts are used for training. This results in k
different partitions of data
.
If the size of the dataset is not divisible by the specified k, the remaining observations will be evenly distributed among the parts.
for (x_train, x_val) in kfolds(X, k=10)
# code called 10 times
# numobs(x_val) may differ up to ±1 over iterations
end
Multiple variables are supported (e.g. for labeled data)
for ((x_train, y_train), val) in kfolds((X, Y), k=10)
# ...
end
By default the folds are created using static splits. Use shuffleobs
to randomly assign observations to the folds.
for (x_train, x_val) in kfolds(shuffleobs(X), k=10)
# ...
end
See leavepout
for a related function.
MLUtils.splitobs — Function
splitobs(n::Int; at) -> Tuple
Compute the indices for two or more disjoint subsets of the range 1:n
with split sizes determined by at
.
Examples
julia> splitobs(100, at=0.7)
(1:70, 71:100)
julia> splitobs(100, at=(0.1, 0.4))
(1:10, 11:50, 51:100)
splitobs([rng,] data; at, shuffle=false, stratified=nothing) -> Tuple
Partition the data
into two or more subsets.
The argument at
specifies how to split the data:
- When at is a number between 0 and 1, this specifies the proportion in the first subset.
- When at is an integer, it specifies the number of observations in the first subset.
- When at is a tuple, its entries specify the number or proportion in each subset, except for the last, which will contain the remaining observations. The number of returned subsets is length(at)+1.
If shuffle=true
, randomly permute the observations before splitting. A random number generator rng
can be optionally passed as the first argument.
If stratified
is not nothing
, it should be an array of labels with the same length as the data. The observations will be split in such a way that the proportion of each label is preserved in each subset.
Supports any datatype implementing numobs
.
It relies on obsview
to create views of the data.
Examples
julia> splitobs(reshape(1:100, 1, :); at=0.7) # simple 70%-30% split, of a matrix
([1 2 … 69 70], [71 72 … 99 100])
julia> data = (x=ones(2,10), n=1:10) # a NamedTuple, consistent last dimension
(x = [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0], n = 1:10)
julia> splitobs(data, at=(0.5, 0.3)) # a 50%-30%-20% split, e.g. train/test/validation
((x = [1.0 1.0 … 1.0 1.0; 1.0 1.0 … 1.0 1.0], n = 1:5), (x = [1.0 1.0 1.0; 1.0 1.0 1.0], n = 6:8), (x = [1.0 1.0; 1.0 1.0], n = 9:10))
julia> train, test = splitobs((reshape(1.0:100.0, 1, :), 101:200), at=0.7, shuffle=true); # split a Tuple
julia> vec(test[1]) .+ 100 == test[2]
true
julia> splitobs(1:10, at=0.5, stratified=[0,0,0,0,1,1,1,1,1,1]) # 2 zeros and 3 ones in each subset
([1, 2, 5, 6, 7], [3, 4, 8, 9, 10])
Array Constructors
MLUtils.falses_like — Function
falses_like(x, [dims=size(x)])
Equivalent to fill_like(x, false, Bool, dims).
See also fill_like and trues_like.
MLUtils.fill_like — Function
fill_like(x, val, [element_type=eltype(x)], [dims=size(x)])
Create an array with the given element type and size, based upon the given source array x. All elements of the new array will be set to val. The third and fourth arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.
See also zeros_like and ones_like.
Examples
julia> x = rand(Float32, 2)
2-element Vector{Float32}:
0.16087806
0.89916044
julia> fill_like(x, 1.7, (3, 3))
3×3 Matrix{Float32}:
1.7 1.7 1.7
1.7 1.7 1.7
1.7 1.7 1.7
julia> using CUDA
julia> x = CUDA.rand(2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.803167 0.476101
0.303041 0.317581
julia> fill_like(x, 1.7, Float64)
2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
1.7 1.7
1.7 1.7
MLUtils.ones_like — Function
ones_like(x, [element_type=eltype(x)], [dims=size(x)])
Create an array with the given element type and size, based upon the given source array x. All elements of the new array will be set to 1. The second and third arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.
See also zeros_like and fill_like.
Examples
julia> x = rand(Float32, 2)
2-element Vector{Float32}:
0.8621633
0.5158395
julia> ones_like(x, (3, 3))
3×3 Matrix{Float32}:
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0
julia> using CUDA
julia> x = CUDA.rand(2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.82297 0.656143
0.701828 0.391335
julia> ones_like(x, Float64)
2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
1.0 1.0
1.0 1.0
MLUtils.rand_like — Function
rand_like([rng=default_rng()], x, [element_type=eltype(x)], [dims=size(x)])
Create an array with the given element type and size, based upon the given source array x. All elements of the new array will be set to a random value. The last two arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.
The default random number generator is used, unless a custom one is passed in explicitly as the first argument.
See also Base.rand and randn_like.
Examples
julia> x = ones(Float32, 2)
2-element Vector{Float32}:
1.0
1.0
julia> rand_like(x, (3, 3))
3×3 Matrix{Float32}:
0.780032 0.920552 0.53689
0.121451 0.741334 0.5449
0.55348 0.138136 0.556404
julia> using CUDA
julia> x = CUDA.ones(2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
1.0 1.0
1.0 1.0
julia> rand_like(x, Float64)
2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
0.429274 0.135379
0.718895 0.0098756
MLUtils.randn_like
— Functionrandn_like([rng=default_rng()], x, [element_type=eltype(x)], [dims=size(x)])
Create an array with the given element type and size, based upon the given source array x
. All elements of the new array will be set to a random value drawn from a normal distribution. The last two arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.
The default random number generator is used, unless a custom one is passed in explicitly as the first argument.
See also Base.randn
and rand_like
.
Examples
julia> x = ones(Float32, 2)
2-element Vector{Float32}:
1.0
1.0
julia> randn_like(x, (3, 3))
3×3 Matrix{Float32}:
-0.385331 0.956231 0.0745102
1.43756 -0.967328 2.06311
0.0482372 1.78728 -0.902547
julia> using CUDA
julia> x = CUDA.ones(2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
1.0 1.0
1.0 1.0
julia> randn_like(x, Float64)
2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
-0.578527 0.823445
-1.01338 -0.612053
MLUtils.trues_like
— Functiontrues_like(x, [dims=size(x)])
Equivalent to fill_like(x, true, Bool, dims)
.
See also fill_like
and falses_like
.
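For an ordinary Array, the result can be pictured with Base functions alone. The sketch below is an assumption about the behaviour, not the MLUtils source: it allocates a Bool array with similar (so the container type follows x) and fills it with true. The name trues_like_sketch is hypothetical.

```julia
# Hypothetical stand-in for illustration; not the MLUtils implementation.
# `similar` keeps the container type of `x` but switches the element type to Bool.
trues_like_sketch(x::AbstractArray, dims = size(x)) = fill!(similar(x, Bool, dims), true)

x = rand(Float32, 2, 3)

a = trues_like_sketch(x)          # same size as x
b = trues_like_sketch(x, (4,))    # explicit dimensions
```

Here a is a 2×3 array and b a 4-element vector, both with element type Bool and every entry set to true.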
MLUtils.zeros_like
— Functionzeros_like(x, [element_type=eltype(x)], [dims=size(x)])
Create an array with the given element type and size, based upon the given source array x
. All elements of the new array will be set to 0. The second and third arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.
See also ones_like
and fill_like
.
Examples
julia> x = rand(Float32, 2)
2-element Vector{Float32}:
0.4005432
0.36934233
julia> zeros_like(x, (3, 3))
3×3 Matrix{Float32}:
0.0 0.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
julia> using CUDA
julia> x = CUDA.rand(2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.0695155 0.667979
0.558468 0.59903
julia> zeros_like(x, Float64)
2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
0.0 0.0
0.0 0.0
Resampling
MLUtils.oversample
— Functionoversample([rng], data, classes; fraction=1, shuffle=true)
oversample([rng], data::Tuple; fraction=1, shuffle=true)
Generate a re-balanced version of data
by repeatedly sampling existing observations in such a way that every class will have at least fraction
times the number of observations of the largest class in classes
. This way, all classes will have a minimum number of observations in the resulting data set relative to what the largest class has in the given (original) data
.
As an example, by default (i.e. with fraction = 1
) the resulting dataset will be near perfectly balanced. On the other hand, with fraction = 0.5
every class in the resulting data will have at least 50% as many observations as the largest class.
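The effect of fraction on the per-class counts can be worked through by hand. The sketch below uses only Base Julia and a hypothetical helper, target_counts, that applies the rule stated above (every class ends up with at least fraction times the count of the largest class). It is not the library's implementation, and rounding non-integer thresholds up is an assumption made here for illustration.

```julia
# Hypothetical helper (not part of MLUtils): the per-class counts implied by
# "at least `fraction` times the size of the largest class".
function target_counts(classes; fraction = 1)
    counts = Dict{eltype(classes), Int}()
    for c in classes
        counts[c] = get(counts, c, 0) + 1
    end
    # Assumption: non-integer thresholds are rounded up.
    threshold = ceil(Int, fraction * maximum(values(counts)))
    # Classes already at or above the threshold keep their count.
    return Dict(c => max(n, threshold) for (c, n) in counts)
end

classes = vcat(fill("a", 2), fill("b", 8), fill("c", 5))

balanced = target_counts(classes)                  # fraction = 1
halved   = target_counts(classes; fraction = 0.5)  # threshold is 4
```

With fraction = 1 every class is raised to 8 observations. With fraction = 0.5 only "a" falls below the threshold of 4, so it alone would be oversampled (from 2 to 4 observations), while "b" and "c" keep their 8 and 5 observations.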
The classes
input is an array with the same length as numobs(data)
.
The convenience parameter shuffle
determines if the resulting data will be shuffled after its creation; if it is not shuffled then all the repeated samples will be together at the end, sorted by class. Defaults to true
.
The random number generator rng
can be optionally passed as the first argument.
The output will contain both the resampled data and classes.
# 6 observations with 3 features each
X = rand(3, 6)
# 2 classes, severely imbalanced
Y = ["a", "b", "b", "b", "b", "a"]
# oversample the class "a" to match "b"
X_bal, Y_bal = oversample(X, Y)
# this results in a bigger dataset with repeated data
@assert size(X_bal) == (3,8)
@assert length(Y_bal) == 8
# now both "a", and "b" have 4 observations each
@assert sum(Y_bal .== "a") == 4
@assert sum(Y_bal .== "b") == 4
For this function to work, the type of data
must implement numobs
and getobs
.
If data
is a tuple and classes
is not given, then it will be assumed that the last element of the tuple contains the classes.
julia> data = DataFrame(X1=rand(6), X2=rand(6), Y=[:a,:b,:b,:b,:b,:a])
6×3 DataFrames.DataFrame
│ Row │ X1 │ X2 │ Y │
├─────┼───────────┼─────────────┼───┤
│ 1 │ 0.226582 │ 0.0443222 │ a │
│ 2 │ 0.504629 │ 0.722906 │ b │
│ 3 │ 0.933372 │ 0.812814 │ b │
│ 4 │ 0.522172 │ 0.245457 │ b │
│ 5 │ 0.505208 │ 0.11202 │ b │
│ 6 │ 0.0997825 │ 0.000341996 │ a │
julia> getobs(oversample(data, data.Y))
8×3 DataFrame
Row │ X1 X2 Y
│ Float64 Float64 Symbol
─────┼─────────────────────────────
1 │ 0.376304 0.100022 a
2 │ 0.467095 0.185437 b
3 │ 0.481957 0.319906 b
4 │ 0.336762 0.390811 b
5 │ 0.376304 0.100022 a
6 │ 0.427064 0.0648339 a
7 │ 0.427064 0.0648339 a
8 │ 0.457043 0.490688 b
See ObsView
for more information on data subsets. See also undersample
.
MLUtils.undersample
— Functionundersample([rng], data, classes; shuffle=true)
undersample([rng], data::Tuple; shuffle=true)
Generate a class-balanced version of data
by subsampling its observations in such a way that the resulting number of observations will be the same number for every class. This way, all classes will have as many observations in the resulting data set as the smallest class has in the given (original) data
.
The convenience parameter shuffle
determines if the resulting data will be shuffled after its creation; if it is not shuffled then all the observations will be in their original order. Defaults to true
.
If data
is a tuple and classes
is not given, then it will be assumed that the last element of the tuple contains the classes.
The output will contain both the resampled data and classes.
# 6 observations with 3 features each
X = rand(3, 6)
# 2 classes, severely imbalanced
Y = ["a", "b", "b", "b", "b", "a"]
# subsample the class "b" to match "a"
X_bal, Y_bal = undersample(X, Y)
# this results in a smaller dataset
@assert size(X_bal) == (3,4)
@assert length(Y_bal) == 4
# now both "a", and "b" have 2 observations each
@assert sum(Y_bal .== "a") == 2
@assert sum(Y_bal .== "b") == 2
For this function to work, the type of data
must implement numobs
and getobs
.
Note that if data
is a tuple, then it will be assumed that the last element of the tuple contains the targets.
julia> data = DataFrame(X1=rand(6), X2=rand(6), Y=[:a,:b,:b,:b,:b,:a])
6×3 DataFrames.DataFrame
│ Row │ X1 │ X2 │ Y │
├─────┼───────────┼─────────────┼───┤
│ 1 │ 0.226582 │ 0.0443222 │ a │
│ 2 │ 0.504629 │ 0.722906 │ b │
│ 3 │ 0.933372 │ 0.812814 │ b │
│ 4 │ 0.522172 │ 0.245457 │ b │
│ 5 │ 0.505208 │ 0.11202 │ b │
│ 6 │ 0.0997825 │ 0.000341996 │ a │
julia> getobs(undersample(data, data.Y))
4×3 DataFrame
Row │ X1 X2 Y
│ Float64 Float64 Symbol
─────┼─────────────────────────────
1 │ 0.427064 0.0648339 a
2 │ 0.376304 0.100022 a
3 │ 0.467095 0.185437 b
4 │ 0.457043 0.490688 b
See ObsView
for more information on data subsets. See also oversample
.
Operations
MLUtils.chunk
— Functionchunk(x, n; [dims])
chunk(x; [size, dims])
Split x
into n
parts or alternatively, if size
is an integer, into equal chunks of size size
. The parts contain the same number of elements except possibly for the last one that can be smaller.
In case size
is a collection of integers instead, the elements of x
are split into chunks of the given sizes.
If x
is an array, dims
can be used to specify along which dimension to split (defaults to the last dimension).
Examples
julia> chunk(1:10, 3)
3-element Vector{UnitRange{Int64}}:
1:4
5:8
9:10
julia> chunk(1:10; size = 2)
5-element Vector{UnitRange{Int64}}:
1:2
3:4
5:6
7:8
9:10
julia> x = reshape(collect(1:20), (5, 4))
5×4 Matrix{Int64}:
1 6 11 16
2 7 12 17
3 8 13 18
4 9 14 19
5 10 15 20
julia> xs = chunk(x, 2, dims=1)
2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}}:
[1 6 11 16; 2 7 12 17; 3 8 13 18]
[4 9 14 19; 5 10 15 20]
julia> xs[1]
3×4 view(::Matrix{Int64}, 1:3, :) with eltype Int64:
1 6 11 16
2 7 12 17
3 8 13 18
julia> xes = chunk(x; size = 2, dims = 2)
2-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}}:
[1 6; 2 7; … ; 4 9; 5 10]
[11 16; 12 17; … ; 14 19; 15 20]
julia> xes[2]
5×2 view(::Matrix{Int64}, :, 3:4) with eltype Int64:
11 16
12 17
13 18
14 19
15 20
julia> chunk(1:6; size = [2, 4])
2-element Vector{UnitRange{Int64}}:
1:2
3:6
chunk(x, partition_idxs; [npartitions, dims])
Partition the array x
along the dimension dims
according to the indexes in partition_idxs
.
partition_idxs
must be sorted and contain only positive integers between 1 and the number of partitions.
If the number of partitions npartitions
is not provided, it is inferred from partition_idxs
.
If dims
is not provided, it defaults to the last dimension.
See also unbatch
.
Examples
julia> x = reshape([1:10;], 2, 5)
2×5 Matrix{Int64}:
1 3 5 7 9
2 4 6 8 10
julia> chunk(x, [1, 2, 2, 3, 3])
3-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}}:
[1; 2;;]
[3 5; 4 6]
[7 9; 8 10]
MLUtils.flatten
— Functionflatten(x::AbstractArray)
Reshape arbitrarily-shaped input into a matrix-shaped output, preserving the size of the last dimension.
See also unsqueeze
.
Examples
julia> rand(3,4,5) |> flatten |> size
(12, 5)
MLUtils.group_counts
— Functiongroup_counts(x)
Count the number of times that each element of x
appears.
See also group_indices
Examples
julia> group_counts(['a', 'b', 'b'])
Dict{Char, Int64} with 2 entries:
'a' => 1
'b' => 2
MLUtils.group_indices
— Functiongroup_indices(x) -> Dict
Computes the indices of elements in the vector x
for each distinct value contained. This information is useful for resampling strategies, such as stratified sampling.
See also group_counts
.
Examples
julia> x = [:yes, :no, :maybe, :yes];
julia> group_indices(x)
Dict{Symbol, Vector{Int64}} with 3 entries:
:yes => [1, 4]
:maybe => [3]
:no => [2]
MLUtils.normalise
— Functionnormalise(x; dims=ndims(x), ϵ=1e-5)
Normalise the array x
to mean 0 and standard deviation 1 across the dimension(s) given by dims
. By default, dims
is the last dimension.
ϵ
is a small constant added to the denominator for numerical stability.
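As a rough illustration of the computation described above, the following sketch standardises an array along dims with the Statistics standard library. It assumes the uncorrected (population) standard deviation and that ϵ is added to the standard deviation in the denominator; the exact details in MLUtils may differ. normalise_sketch is a hypothetical name.

```julia
using Statistics

# Hypothetical re-implementation for illustration only.
function normalise_sketch(x::AbstractArray; dims = ndims(x), ϵ = 1e-5)
    μ = mean(x; dims = dims)
    # Assumption: population standard deviation (corrected = false).
    σ = std(x; dims = dims, corrected = false)
    return (x .- μ) ./ (σ .+ ϵ)
end

x = [1.0 2.0 3.0;
     10.0 20.0 30.0]

y = normalise_sketch(x)   # dims = 2: each row is normalised across its columns
```

Both rows of y end up with mean 0 and a standard deviation just below 1 (the small gap comes from ϵ), even though the rows of x differ in scale by a factor of ten.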
MLUtils.rpad_constant
— Functionrpad_constant(v::AbstractArray, n::Union{Integer, Tuple}, val = 0; dims=:)
Return the given sequence padded with val
along the dimensions dims
up to a maximum length in each direction specified by n
.
Examples
julia> rpad_constant([1, 2], 4, -1) # padding with -1 up to size 4
4-element Vector{Int64}:
1
2
-1
-1
julia> rpad_constant([1, 2, 3], 2) # no padding if length is already greater than n
3-element Vector{Int64}:
1
2
3
julia> rpad_constant([1 2; 3 4], 4; dims=1) # padding along the first dimension
4×2 Matrix{Int64}:
1 2
3 4
0 0
0 0
julia> rpad_constant([1 2; 3 4], 4) # padding along all dimensions by default
4×4 Matrix{Int64}:
1 2 0 0
3 4 0 0
0 0 0 0
0 0 0 0
MLUtils.unbatch
— Functionunbatch(x)
Reverse of the batch
operation, unstacking the last dimension of the array x
.
Examples
julia> unbatch([1 3 5 7;
2 4 6 8])
4-element Vector{Vector{Int64}}:
[1, 2]
[3, 4]
[5, 6]
[7, 8]
MLUtils.unsqueeze
— Functionunsqueeze(x; dims)
Return x
reshaped into an array one dimensionality higher than x
, where dims
indicates in which dimension x
is extended. dims
can be an integer between 1 and ndims(x)+1
.
See also flatten
, stack
.
Examples
julia> unsqueeze([1 2; 3 4], dims=2)
2×1×2 Array{Int64, 3}:
[:, :, 1] =
1
3
[:, :, 2] =
2
4
julia> xs = [[1, 2], [3, 4], [5, 6]]
3-element Vector{Vector{Int64}}:
[1, 2]
[3, 4]
[5, 6]
julia> unsqueeze(xs, dims=1)
1×3 Matrix{Vector{Int64}}:
[1, 2] [3, 4] [5, 6]
unsqueeze(; dims)
Returns a function which, acting on an array, inserts a dimension of size 1 at dims
.
Examples
julia> rand(21, 22, 23) |> unsqueeze(dims=2) |> size
(21, 1, 22, 23)
MLUtils.unstack
— Functionunstack(xs; dims)
Unroll the given xs
into an array of arrays along the given dimension dims
.
It is the inverse operation of stack.
Examples
julia> unstack([1 3 5 7; 2 4 6 8], dims=2)
4-element Vector{Vector{Int64}}:
[1, 2]
[3, 4]
[5, 6]
[7, 8]
Datasets
MLUtils.Datasets.load_iris
— Functionload_iris() -> X, y, names
Loads the first 150 observations from the Iris flower data set introduced by Ronald Fisher (1936). The 4 by 150 matrix X
contains the numeric measurements, in which each individual column denotes an observation. The vector y
contains the class labels as strings. The vector names
contains the names of the features (i.e. rows of X
).
[1] Fisher, Ronald A. "The use of multiple measurements in taxonomic problems." Annals of eugenics 7.2 (1936): 179-188.
MLUtils.Datasets.make_sin
— Functionmake_sin(n, start, stop; noise = 0.3, f_rand = randn) -> x, y
Generates n
noisy, equally spaced samples of a sine wave from start
to stop
by adding noise .* f_rand(length(x))
to the result of sin(x)
.
Returns the vector x
with the samples and the noisy response y
.
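The construction described above can be written out in a few lines of Base Julia. This is a sketch of the stated formula rather than the package source; make_sin_sketch is a hypothetical name, and passing zeros as f_rand is only a trick to make the result deterministic.

```julia
# Hypothetical sketch of the recipe in the docstring, not the MLUtils source.
function make_sin_sketch(n, start, stop; noise = 0.3, f_rand = randn)
    x = collect(range(start; stop = stop, length = n))   # n equally spaced samples
    y = sin.(x) .+ noise .* f_rand(length(x))            # noisy response
    return x, y
end

x, y = make_sin_sketch(5, 0, 2π)

# With a noise source that returns zeros, y0 is exactly sin.(x0).
x0, y0 = make_sin_sketch(5, 0, 2π; f_rand = zeros)
```

Any function that maps a length to a vector of that length can serve as f_rand, which makes it easy to swap in a different noise distribution.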
MLUtils.Datasets.make_spiral
— Functionmake_spiral(n, a, theta, b; noise = 0.01, f_rand = randn) -> x, y
Generates n
noisy responses for a spiral with two labels. Uses the radius, angle and scaling arguments to space the points in 2D space and adds noise .* f_rand(n)
to the response.
Returns the 2 x n matrix x
with the coordinates of the samples and the vector y
with the labels.
MLUtils.Datasets.make_poly
— Functionmake_poly(coef, x; noise = 0.01, f_rand = randn) -> x, y
Generates a noisy response for a polynomial of degree length(coef) - 1
and with the coefficients given by coef
. The response is generated by elementwise evaluation of the polynomial on the elements of x
and adding noise .* f_rand(length(x))
to the result.
The vector coef
contains the coefficients for the terms of the polynomial. The first element of coef
denotes the coefficient for the term with the highest degree, while the last element of coef
denotes the intercept.
Returns the input x
and the noisy response y
.
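The coefficient ordering is easiest to see in a small worked case. The sketch below evaluates the polynomial with Horner's rule, reading coef from the highest-degree term down to the intercept as described above. It is an illustration in Base Julia, not the package source, and make_poly_sketch is a hypothetical name.

```julia
# Hypothetical sketch, not the MLUtils source. `coef` is ordered from the
# highest-degree term down to the intercept, as described above.
function make_poly_sketch(coef, x; noise = 0.01, f_rand = randn)
    # Horner's rule: ((c1 * t + c2) * t + c3) ...
    clean = [foldl((acc, c) -> acc * t + c, coef) for t in x]
    y = clean .+ noise .* f_rand(length(x))
    return x, y
end

coef = [2.0, -1.0, 3.0]          # 2t^2 - t + 3
x = [0.0, 1.0, 2.0]

_, y_clean = make_poly_sketch(coef, x; noise = 0.0)   # noise switched off
_, y_noisy = make_poly_sketch(coef, x)
```

With the noise switched off, y_clean is [3.0, 4.0, 9.0]: the intercept 3 at t = 0, then 2 - 1 + 3 = 4 at t = 1, and 8 - 2 + 3 = 9 at t = 2.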
MLUtils.Datasets.make_moons
— Functionmake_moons(n; noise=0.0, f_rand=randn, shuffle=true) -> x, y
Generate a dataset with two interleaving half circles.
If n
is an integer, the number of samples is n
and the number of samples for each half circle is n ÷ 2
. If n
is a tuple, the first element of the tuple denotes the number of samples for the first half circle and the second element denotes the number of samples for the second half circle.
The noise level can be controlled by the noise
argument.
Set shuffle=false
to keep the order of the samples.
Returns a 2 x n matrix with the samples.