TableTransforms.jl

Transforms and pipelines with tabular data.

Overview

This package provides transforms that are commonly used in statistics and machine learning. It was developed to address specific needs in feature engineering and works with general Tables.jl tables.

Past attempts to model transforms in Julia such as FeatureTransforms.jl served as inspiration for this package. We are happy to absorb any missing transform, and contributions are very welcome.

Features

  • Transforms are revertible, meaning that one can apply a transform and later undo it without manually keeping the fitted constants around.

  • Pipelines can be constructed with a clean syntax, e.g. (f1 → f2 → f3) ⊔ (f4 → f5), where → chains transforms in sequence and ⊔ combines branches in parallel. Pipelines are automatically revertible when the individual transforms are revertible.

  • Branches of a pipeline and colwise transforms are run in parallel using multiple threads with the awesome Transducers.jl framework.

  • Pipelines can be reapplied to unseen "test" data using the same cache (e.g. constants) fitted with "training" data. For example, a ZScore relies on "fitting" μ and σ once at training time.
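The revert and reapply workflow described above can be sketched as follows. This is a minimal illustration, not a definitive recipe: it assumes the package's `apply`, `revert` and `reapply` functions, which are not shown in this section, and uses two small made-up tables. The `col` helper is hypothetical and exists only for this example.

```julia
using TableTransforms
using Tables
using Statistics

# made-up "training" and "test" tables (NamedTuples of vectors are Tables.jl tables)
train = (x = [1.0, 2.0, 3.0, 4.0], y = [10.0, 20.0, 30.0, 40.0])
test  = (x = [5.0, 6.0], y = [50.0, 60.0])

transform = ZScore()

# fit and apply on the training data; the cache stores the fitted μ and σ
ztrain, cache = apply(transform, train)

# undo the transform using the cache, without tracking constants by hand
restored = revert(transform, ztrain, cache)

# reuse the training cache on unseen data instead of refitting
ztest = reapply(transform, test, cache)

# helper for this example only: extract a column from any Tables.jl table
col(t, name) = Tables.getcolumn(Tables.columns(t), name)
```

Here `col(ztest, :x)` is standardized with the mean and standard deviation of `train.x`, not those of `test.x`, which is the point of reusing the cache.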

Rationale

A common task in statistics and machine learning consists of transforming the variables of a problem to achieve better convergence or to apply methods that rely on multivariate Gaussian distributions. This process can be quite tedious to implement by hand and very error-prone. We provide a consistent and clean API to combine statistical transforms into pipelines.
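As a rough sketch of what combining transforms looks like, the example below chains two transforms into a sequential pipeline. The table is made up for illustration, and the choice of `Center` followed by `Quantile` is just one plausible combination, assuming both are available in the installed version of the package.

```julia
using TableTransforms
using Tables

# made-up table for illustration
table = (a = [1.0, 2.0, 3.0, 4.0, 5.0], b = [2.0, 4.0, 8.0, 16.0, 32.0])

# sequential pipeline: center each column, then apply a quantile transform
# (→ is typed as \to<TAB> in the Julia REPL)
pipeline = Center() → Quantile()

# a pipeline is applied like any single transform
newtable = table |> pipeline

# the pipeline is revertible because both of its steps are
revertible = isrevertible(pipeline)
```

The result is again a Tables.jl table with the same columns, so it can be passed on to further transforms or plotting functions.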

Although most transforms discussed here come from the statistical domain, our long-term vision is more ambitious. We aim to provide a complete user experience with fully featured pipelines that include standardization of column names, imputation of missing data, and more.

Usage

Consider the following table and its pairplot:

using TableTransforms
using CairoMakie, PairPlots

# example table from PairPlots.jl
N = 100_000
a = [2randn(N÷2) .+ 6; randn(N÷2)]
b = [3randn(N÷2); 2randn(N÷2)]
c = randn(N)
d = c .+ 0.6randn(N)
table = (; a, b, c, d)

# pairplot of original table
table |> pairplot

We can convert the columns to PCA scores:

# convert to PCA scores
table |> PCA() |> pairplot