Developer guide
A short guide for extending the interface as a developer.
Motivation
TableTransforms.jl currently supports over 25 different transforms that cover a wide variety of use cases ranging from ordinary table operations to complex statistical transformations, which can be arbitrarily composed with one another through elegant syntax. It is easy to leverage all this functionality as a developer of new transforms, and this is the motivation of this guide.
Basic assumptions
All the transforms in this package implement the transforms interface defined in the TransformsBase.jl package so this is really the only dependency needed. The interface assumes the following about new transforms:
- Transforms operate on a single table
- Transforms may be associated with some state that if computed while applying it for the first time and is then cached can help later reapply the transform on another table without recomputing the state
- The transform may be revertible, meaning that a transformed table can be brought back to its original form, and it may need to use the cache for that
- Your transform may be invertible in the mathematical sense
Reversibility assumes that the transform has been applied already and can be undone. On the other hand, invertibility implies that there is a one-to-one mapping between the input and output tables so a table can be inverted to a corresponding input even if it was not transformed a priori.
Defining a new transform
In the following we shall demonstrate the steps to define a new transform.
1. Declare a new type for your transform
The type should subtype TransformsBase.Transform
and it should have a named field for each parameter needed to apply the transform besides the input table. For instance, if you want to call your transform Standardize
and it takes two boolean inputs center
and scale
, then you should declare:
struct Standardize <: TransformsBase.Transform
center::Bool
scale::Bool
end
You may implement keyword constructors as needed if some of the parameters are optional:
Standardize(; center::Bool=true, scale::Bool=true) = Standardize(center, scale)
2. Implement the apply
method for your transform
The apply
method takes an instance of your transform type and a table and returns a new table and cache. Suppose that the Standardize
transform should zero-mean each column if center
is true and scale each column to unit variance if scale
is true, then the apply
method should be implemented as follows:
using Statistics
function TransformsBase.apply(transform::Standardize, X)
# convert the table to a matrix and get col names
Xm = Tables.matrix(X)
names = Tables.columnnames(X)
# compute the means and stds
μ = transform.center ? mean(Xm, dims=1) : zeros(1, size(Xm, 2))
σ = transform.scale ? std(Xm, dims=1) : ones(1, size(Xm, 2))
# standardize the data
Xm = (Xm .- μ) ./ σ
# convert matrix to column table
Xc = (; zip(names, eachcol(Xm))...)
# convert back to original table type
X = Xc |> Tables.materializer(X)
# return the table and cache that may help reapply or revert later
return X, (μ, σ)
end
That's it really! Your transform now behaves like any table transform:
using TableTransforms
X = (A=[1, 2, 3], B=[4, 5, 6])
Xt = X |> Standardize() |> Identity() |> Select([:A])
It holds, however, that in case your transform can be reapplied, is revertible, or is invertible then you should continue implementing the interface to support such functionality.
3. Optionally implement reapply
We need this in case of the Standarize
transform because after computing the mean and std for some training table we may want to apply the transform directly given a test table. Hence, we implement reapply
which has the same signature as apply but it takes an extra argument for the cache and doesn't return it.
function TransformsBase.reapply(transform::Standardize, X, cache)
# convert the table to a matrix and get col names
Xm = Tables.matrix(X)
names = Tables.columnnames(X)
# no need to recompute means and stds
μ, σ = cache
# standardize the data
Xm = (Xm .- μ) ./ σ
# convert matrix to column table
Xc = (; zip(names, eachcol(Xm))...)
# convert back to original table type
X = Xc |> Tables.materializer(X)
return X
end
If not implemented, reapply
simply falls back to apply
.
4. Optionally specify that your transform is revertible and implement revert
We can specify reversibility for an arbitrary transform T
by setting isrevertible(::Type{T})
to true
. It's obvious that this should be supported by our transform so we do
TransformsBase.isrevertible(::Type{Standardize}) = true
By default this falls back to false
so users of the interface would be aware that revert is not implemented in that case. Now we follow up by implementing the revert
method:
function TransformsBase.revert(transform::Standardize, X, cache)
# convert the table to a matrix and get col names
Xm = Tables.matrix(X)
names = Tables.columnnames(X)
# extract the mean and std
μ, σ = cache
# revert the transform
Xm = Xm .* σ .+ μ
# convert matrix to column table
Xc = (; zip(names, eachcol(Xm))...)
# convert back to original table type
X = Xc |> Tables.materializer(X)
return X
end
5. Optionally specify that your transform is invertible and implement Base.inv
Similar to reversibility, falls back to false
by default. We can write that explicitly here since Standardize
has no inverse if we are given nothing except for the table.
TransformsBase.isinvertible(::Type{Standardize}) = false
If an arbitrary transform T
is invertible we can rather specify that as true
and follow up by implementing Base.inv(::T)
which would be expected to return an instance of the inverse transform. For instance, for an identity transform we can do
# interface struct
struct Identity <: Transform end
# specify that it is invertible
TransformsBase.isinvertible(::Type{Identity}) = true
# implement Base.inv
Base.inv(::Identity) = Identity()
which implies that inv(Identity())
would return an identity transform.