Working with Losses

Even though they are called loss "functions", this package implements them as immutable types instead of true Julia functions. There are good reasons for that. For example, it allows us to specify the properties of loss functions explicitly (e.g. isconvex(myloss)). It also makes for a more consistent API when it comes to computing the value or the derivative. Some loss functions also have additional parameters that need to be specified, such as the $\epsilon$ in the case of the $\epsilon$-insensitive loss. Here, types allow member variables to keep that information out of the method signatures.
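The design described above can be sketched in a few lines of plain Julia. The type SketchEpsInsLoss and the function sketch_isconvex are hypothetical names made up for this illustration; they are not the package's actual definitions, only a minimal sketch of the pattern (immutable type, functor call, property as a function).

```julia
# Hypothetical stand-in for an ε-insensitive loss; NOT the package's definition.
struct SketchEpsInsLoss
    ε::Float64
    function SketchEpsInsLoss(ε::Real)
        # The constructor validates and converts the parameter once
        ε > 0 || throw(ArgumentError("ε must be strictly positive"))
        new(Float64(ε))
    end
end

# Functor method: the parameter ε stays out of the call signature
(l::SketchEpsInsLoss)(output, target) = max(0.0, abs(output - target) - l.ε)

# A property is an ordinary function that dispatches on the loss type
sketch_isconvex(::SketchEpsInsLoss) = true

l = SketchEpsInsLoss(0.5)
l(3.0, 1.0)          # 1.5
l(1.2, 1.0)          # 0.0, the deviation lies inside the ε-tube
sketch_isconvex(l)   # true
```

Because the parameter lives in the instance, every loss can be called with the same two arguments regardless of how many parameters it has.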

To avoid confusion with true Julia functions, we will refer to "loss functions" as "losses" instead. The available losses share a common interface for the most part. This section will provide an overview of the basic functionality that is available for all the different types of losses. We will discuss how to create a loss, how to compute its value and derivative, and how to query its properties.

Instantiating a Loss

Losses are immutable types. As such, one has to instantiate one in order to work with it. For most losses, the constructors do not expect any parameters.

julia> L2DistLoss()
L2DistLoss()

julia> HingeLoss()
L1HingeLoss()

We just said that we need to instantiate a loss in order to work with it. One could be inclined to believe that it would be more memory-efficient to "pre-allocate" a loss when using it in more than one place.

julia> loss = L2DistLoss()
L2DistLoss()

julia> loss(3, 2)
1

However, that is a common oversimplification. Because all losses are immutable types, they can live on the stack and thus do not come with a heap-allocation overhead.

Even more interesting in the example above is that for losses such as L2DistLoss, which do not have any constructor parameters or member variables, no additional code is executed at all. Such singletons are only used for dispatch and do not produce any additional code, which you can observe for yourself in the code below. As such, they are zero-cost abstractions.

julia> v1(loss,y,t) = loss(y,t)

julia> v2(y,t) = L2DistLoss()(y,t)

julia> @code_llvm v1(loss, 3, 2)
define i64 @julia_v1_70944(i64, i64) #0 {
top:
  %2 = sub i64 %1, %0
  %3 = mul i64 %2, %2
  ret i64 %3
}

julia> @code_llvm v2(3, 2)
define i64 @julia_v2_70949(i64, i64) #0 {
top:
  %2 = sub i64 %1, %0
  %3 = mul i64 %2, %2
  ret i64 %3
}
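The same point can be checked without reading LLVM IR. The following is a small sketch using only Base reflection functions; it assumes, as the text states, that L2DistLoss has no member variables.

```julia
using LossFunctions

loss = L2DistLoss()

isbits(loss)               # true: a plain immutable value, no heap allocation needed
fieldcount(typeof(loss))   # 0: there are no member variables
sizeof(loss)               # 0: the instance carries no data, only its type
```

Since the instance holds no data, passing it around only informs the compiler which methods to select.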

On the other hand, some types of losses are actually more comparable to whole families of losses instead of just a single one. For example, the immutable type L1EpsilonInsLoss has a free parameter $\epsilon$. Each concrete $\epsilon$ results in a different concrete loss of the same family of epsilon-insensitive losses.

julia> L1EpsilonInsLoss(0.5)
L1EpsilonInsLoss{Float64}(0.5)

julia> L1EpsilonInsLoss(1)
L1EpsilonInsLoss{Float64}(1.0)

For such losses that do have parameters, it can make a slight difference to pre-instantiate a loss. While they will live on the stack, the constructor usually performs some assertions and conversions for the given parameter. This can come with a slight overhead. At the very least, constructing the loss in place will not produce exactly the same code as using a pre-instantiated one. Still, the fact that they are immutable makes them very efficient abstractions with little to no performance overhead, and zero memory allocations on the heap.
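As an illustration of the family idea, the sketch below instantiates two members of the epsilon-insensitive family once and then reuses them. The variable names narrow and wide are made up for this example.

```julia
using LossFunctions

# Construct once; any assertions and conversions in the constructor run here
narrow = L1EpsilonInsLoss(0.5)
wide   = L1EpsilonInsLoss(1)

# Same output/target pair, different members of the family
narrow(2.0, 1.0)   # 0.5, the deviation of 1.0 exceeds ε = 0.5
wide(2.0, 1.0)     # 0.0, the deviation of 1.0 does not exceed ε = 1.0

isbits(narrow)     # true: the parameter is stored inline, no heap allocation
```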

Computing the Values

The first thing we may want to do is compute the loss for a single observation. In fact, all losses are implemented on single observations under the hood, and are functors. To evaluate a loss on several observations at once, use Julia's broadcasting syntax.

julia> loss = L1DistLoss()
L1DistLoss()

julia> loss.([2,5,-2], [1,2,3])
3-element Vector{Int64}:
 1
 3
 5
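For comparison, here is a short sketch of the single-observation call next to the broadcast form shown above. It assumes the argument order (output, target), matching the signatures documented for deriv further down.

```julia
using LossFunctions

loss = L1DistLoss()

# A single (output, target) pair
loss(2, 1)    # 1

# Broadcasting is equivalent to applying the functor pair by pair
outputs = [2, 5, -2]
targets = [1, 2, 3]
values = [loss(o, t) for (o, t) in zip(outputs, targets)]   # [1, 3, 5]
values == loss.(outputs, targets)                            # true
```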

Computing the 1st Derivatives

Perhaps the more interesting aspect of loss functions is their derivatives. In fact, most of the popular learning algorithms in supervised learning, such as gradient descent, utilize the derivatives of the loss in one way or another during the training process.

To compute the derivative of a loss, we expose the function deriv. Note that we always compute the derivative with respect to the predicted output, since we are interested in deducing in which direction the output should change.

LossFunctions.Traits.deriv - Function
deriv(loss, output, target) -> Number

Compute the analytical derivative with respect to the output for the loss function. Note that target and output can be of different numeric type, in which case promotion is performed in the manner appropriate for the given loss.
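A minimal sketch of deriv in use, cross-checked against a central finite difference. The step size h and the tolerance are arbitrary choices for this illustration.

```julia
using LossFunctions

loss = L2DistLoss()
output, target = 3.0, 2.0

# Analytical derivative with respect to the output: 2 * (output - target)
d = deriv(loss, output, target)   # 2.0

# Central finite difference in the output as a sanity check
h = 1e-6
fd = (loss(output + h, target) - loss(output - h, target)) / (2h)
isapprox(d, fd; atol=1e-6)        # true
```

The positive sign says that increasing the output would increase the loss, so a gradient step moves the output toward the target.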

Computing the 2nd Derivatives

In addition to the first derivative, we also provide the corresponding methods for the second derivative through the function deriv2. Note again that we always compute the derivative with respect to the predicted output.

LossFunctions.Traits.deriv2 - Function
deriv2(loss, output, target) -> Number

Compute the second derivative with respect to the output for the loss function. Note that target and output can be of different numeric type, in which case promotion is performed in the manner appropriate for the given loss.
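One place where the second derivative is useful is a Newton step on the output. The following sketch is only an illustration of combining deriv and deriv2; it is not a method the package provides.

```julia
using LossFunctions

loss = L2DistLoss()
output, target = 3.0, 2.0

g = deriv(loss, output, target)    # 2.0, first derivative w.r.t. the output
c = deriv2(loss, output, target)   # 2.0, the squared loss has constant curvature

# One Newton step on the output; for the squared loss it lands on the target
newton_output = output - g / c     # 2.0
```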

Aggregating losses over collections

The sum and mean of losses over collections can be computed efficiently with the following methods:

Base.sum - Method
sum(loss, outputs, targets)

Return sum of loss values over the iterables outputs and targets.

Base.sum - Method
sum(loss, outputs, targets, weights; normalize=true)

Return sum of loss values over the iterables outputs and targets. The weights determine the importance of each observation. The option normalize divides the result by the sum of the weights.

Statistics.mean - Method
mean(loss, outputs, targets)

Return mean of loss values over the iterables outputs and targets.

Statistics.mean - Method
mean(loss, outputs, targets, weights; normalize=true)

Return mean of loss values over the iterables outputs and targets. The weights determine the importance of each observation. The option normalize divides the result by the sum of the weights.
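The aggregation methods can be exercised as in the sketch below. The data and weights are made up for illustration, and the comments on the weighted calls follow the description of normalize given above rather than an inspection of the implementation.

```julia
using LossFunctions
using Statistics: mean

loss = L1DistLoss()
outputs = [2, 5, -2]
targets = [1, 2, 3]      # per-observation losses: 1, 3, 5

sum(loss, outputs, targets)    # 9
mean(loss, outputs, targets)   # 3.0

# Weighted sum: 1*1 + 2*3 + 1*5, optionally divided by sum(weights)
weights = [1, 2, 1]
raw        = sum(loss, outputs, targets, weights; normalize=false)
normalized = sum(loss, outputs, targets, weights)   # normalize=true is the default
normalized ≈ raw / sum(weights)
```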

Properties of a Loss

In some situations it can be quite useful to assert certain properties of a loss function. One such scenario could be when implementing an algorithm that requires the loss to be strictly convex or Lipschitz continuous. Note that we will only skim over the definitions in most cases. A good treatment of all of the concepts involved can be found in either [BOYD2004] or [STEINWART2008].

This package uses functions to represent individual properties of a loss. What follows is a list of the implemented property functions defined in LearnBase.jl.

LossFunctions.Traits.isdistancebased - Function
isdistancebased(loss) -> Bool

Return true if the given loss is a distance-based loss.

A supervised loss function L : Y × ℝ → [0,∞) is said to be distance-based, if there exists a representing function ψ : ℝ → [0,∞) satisfying ψ(0) = 0 and L(y, ŷ) = ψ (ŷ - y), (y, ŷ) ∈ Y × ℝ.

LossFunctions.Traits.ismarginbased - Function
ismarginbased(loss) -> Bool

Return true if the given loss is a margin-based loss.

A supervised loss function L : Y × ℝ → [0,∞) is said to be margin-based, if there exists a representing function ψ : ℝ → [0,∞) satisfying L(y, ŷ) = ψ(y⋅ŷ), (y, ŷ) ∈ Y × ℝ.
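To make the distance-based and margin-based definitions concrete, here are the representing functions of two textbook losses, the squared loss and the hinge loss. These are the standard forms; that they correspond to L2DistLoss and L1HingeLoss is an assumption based on the names.

```latex
\begin{aligned}
\text{squared loss (distance-based):}\quad
  & L(y,\hat{y}) = (\hat{y}-y)^2 = \psi(\hat{y}-y),
  && \psi(r) = r^2,\quad \psi(0) = 0 \\
\text{hinge loss (margin-based):}\quad
  & L(y,\hat{y}) = \max(0,\, 1 - y\hat{y}) = \psi(y\hat{y}),
  && \psi(a) = \max(0,\, 1 - a)
\end{aligned}
```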

LossFunctions.Traits.isdifferentiable - Function
isdifferentiable(loss, [x]) -> Bool

Return true if the given loss is differentiable (optionally limited to the given point x if specified).

A function f : ℝⁿ → ℝᵐ is differentiable at a point x in the interior of the domain of f if there exists a matrix Df(x) ∈ ℝ^(m × n) that satisfies:

lim_{z ≠ x, z → x} (|f(z) - f(x) - Df(x)(z-x)|₂) / |z - x|₂ = 0

A function is differentiable if its domain is open and it is differentiable at every point x.

LossFunctions.Traits.istwicedifferentiable - Function
istwicedifferentiable(loss, [x]) -> Bool

Return true if the given loss is twice differentiable (optionally limited to the given point x if specified).

A function f : ℝⁿ → ℝ is said to be twice differentiable at a point x in the interior of the domain of f, if the derivative of ∇f exists at x: ∇²f(x) = D∇f(x).

A function is twice differentiable if its domain is open and it is twice differentiable at every point x.

LossFunctions.Traits.isconvex - Function
isconvex(loss) -> Bool

Return true if the given loss denotes a convex function. A function f : ℝⁿ → ℝ is convex if its domain is a convex set and if for all x, y in that domain and all θ with 0 ≤ θ ≤ 1, we have f(θ x + (1 - θ) y) ≤ θ f(x) + (1 - θ) f(y).

LossFunctions.Traits.isstrictlyconvex - Function
isstrictlyconvex(loss) -> Bool

Return true if the given loss denotes a strictly convex function. A function f : ℝⁿ → ℝ is strictly convex if its domain is a convex set and if for all x, y in that domain where x ≠ y, and all θ with 0 < θ < 1, we have f(θ x + (1 - θ) y) < θ f(x) + (1 - θ) f(y).

LossFunctions.Traits.isstronglyconvex - Function
isstronglyconvex(loss) -> Bool

Return true if the given loss denotes a strongly convex function. A function f : ℝⁿ → ℝ is m-strongly convex if its domain is a convex set, and if for all x, y in that domain where x ≠ y, and all θ with 0 ≤ θ ≤ 1, we have f(θ x + (1 - θ)y) ≤ θ f(x) + (1 - θ) f(y) - 0.5 m ⋅ θ (1 - θ) | x - y |₂²

In a more familiar setting, if the loss function is differentiable we have (∇f(x) - ∇f(y))ᵀ (x - y) ≥ m | x - y |₂²
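As a worked instance of strong convexity, take the one-dimensional function f(x) = x², which is the representing function of the squared loss. Both characterizations hold with m = 2, and the bound is attained exactly:

```latex
\begin{aligned}
(\nabla f(x) - \nabla f(y))\,(x - y) &= (2x - 2y)(x - y) = 2\,|x - y|^2 \\
f(\theta x + (1-\theta) y)
  &= \theta x^2 + (1-\theta) y^2 - \theta(1-\theta)\,(x-y)^2 \\
  &= \theta f(x) + (1-\theta) f(y) - \tfrac{1}{2}\cdot 2 \cdot \theta(1-\theta)\,|x-y|^2
\end{aligned}
```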

LossFunctions.Traits.isnemitski - Function
isnemitski(loss) -> Bool

Return true if the given loss denotes a Nemitski loss function.

We call a supervised loss function L : Y × ℝ → [0,∞) a Nemitski loss if there exist a measurable function b : Y → [0, ∞) and an increasing function h : [0, ∞) → [0, ∞) such that L(y,ŷ) ≤ b(y) + h(|ŷ|), (y, ŷ) ∈ Y × ℝ

If a loss is locally Lipschitz continuous then it is a Nemitski loss.

LossFunctions.Traits.isfishercons - Function
isfishercons(loss) -> Bool

Return true if the given loss is Fisher consistent.

We call a supervised loss function L : Y × ℝ → [0,∞) a Fisher consistent loss if the population minimizer of the risk E[L(y,f(x))] for all measurable functions leads to the Bayes optimal decision rule.

LossFunctions.Traits.islipschitzcont - Function
islipschitzcont(loss) -> Bool

Return true if the given loss function is Lipschitz continuous.

A supervised loss function L : Y × ℝ → [0, ∞) is Lipschitz continuous, if there exists a finite constant M < ∞ such that |L(y, t) - L(y, t′)| ≤ M |t - t′|, ∀ y ∈ Y and ∀ t, t′ ∈ ℝ
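Two standard examples illustrate the difference between a loss that satisfies this bound and one that does not. The absolute loss is Lipschitz continuous with M = 1 by the reverse triangle inequality, while for the squared loss the factor multiplying |t - t′| is unbounded, so no finite M exists globally (although one exists on every bounded interval):

```latex
\begin{aligned}
\bigl|\,|t - y| - |t' - y|\,\bigr| &\le |t - t'| \\
\bigl|(t - y)^2 - (t' - y)^2\bigr| &= |t - t'|\;|t + t' - 2y|
\end{aligned}
```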

LossFunctions.Traits.islocallylipschitzcont - Function
islocallylipschitzcont(loss) -> Bool

Return true if the given loss function is locally Lipschitz continuous.

A supervised loss L : Y × ℝ → [0, ∞) is called locally Lipschitz continuous if for all a ≥ 0 there exists a constant cₐ ≥ 0, such that

sup_{y ∈ Y} | L(y,t) − L(y,t′) | ≤ cₐ |t − t′|, t, t′ ∈ [−a,a]

Every convex function is locally Lipschitz continuous.

LossFunctions.Traits.isclipable - Function
isclipable(loss) -> Bool

Return true if the given loss function is clipable. A supervised loss L : Y × ℝ → [0,∞) can be clipped at M > 0 if, for all (y,t) ∈ Y × ℝ, L(y, t̂) ≤ L(y, t), where t̂ denotes the clipped value of t at ± M. That is, t̂ = -M if t < -M, t̂ = t if t ∈ [-M, M], and t̂ = M if t > M.

LossFunctions.Traits.issymmetric - Function
issymmetric(loss) -> Bool

Return true if the given loss is a symmetric loss.

A function f : ℝ → [0,∞) is said to be symmetric about the origin if we have f(x) = f(-x), ∀ x ∈ ℝ.

A distance-based loss is said to be symmetric if its representing function is symmetric.
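As a quick illustration of the property functions listed above, the sketch below queries a few of them and uses one as a guard. The commented results are what the definitions imply for the squared, absolute, and hinge losses; treat them as expectations to verify against the installed version. The helper require_convex is a hypothetical name made up for this example.

```julia
using LossFunctions

l2    = L2DistLoss()
l1    = L1DistLoss()
hinge = HingeLoss()

isdistancebased(l2)     # true
ismarginbased(hinge)    # true
isconvex(l2)            # true
isstrictlyconvex(l2)    # true
issymmetric(l2)         # true
islipschitzcont(l1)     # true
islipschitzcont(l2)     # false, the slope grows without bound
isdifferentiable(hinge) # false, there is a kink at margin 1

# Hypothetical guard for an algorithm that needs a convex loss
require_convex(loss) =
    isconvex(loss) ? loss : throw(ArgumentError("loss must be convex"))

require_convex(l2)      # returns the loss unchanged
```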
