EvidentialFlux.jl

Evidential Deep Learning is a way to generate predictions and the uncertainty associated with them in a single forward pass. This is in stark contrast to traditional Bayesian neural networks, which are typically based on Variational Inference, Markov Chain Monte Carlo, Monte Carlo Dropout or ensembles.

Deep Evidential Regression

Deep Evidential Regression [amini2020] is an attempt to apply the principles of Evidential Deep Learning to regression-type problems.

It works by placing a prior distribution over the parameters $\mathbf{\theta} = \{\mu, \sigma^2\}$ of the likelihood model for a dataset $\mathcal{D}=\{x_i, y_i\}_{i=1}^N$, where each $y_i$ is assumed to be drawn i.i.d. from a Gaussian distribution.

\[y_i \sim \mathcal{N}(\mu_i, \sigma^2_i)\]

The posterior over the parameters $\mathbf{\theta}=\{\mu, \sigma^2\}$ is $p(\mathbf{\theta}|\mathcal{D})$. We seek an approximation $q(\mu, \sigma^2) = q(\mu)\,q(\sigma^2)$, meaning that we assume the posterior factorizes. This lets us write $\mu\sim\mathcal{N}(\gamma,\sigma^2\nu^{-1})$ and $\sigma^2\sim\Gamma^{-1}(\alpha,\beta)$. Thus, we can now form

\[p(\mathbf{\theta}|\mathbf{m})=\mathcal{N}(\gamma,\sigma^2\nu^{-1})\,\Gamma^{-1}(\alpha,\beta)=\mathcal{N}\text{-}\Gamma^{-1}(\gamma,\nu,\alpha,\beta)\]

which can be plugged in to the posterior below.

\[p(\mathbf{\theta}|\mathbf{m}, y_i) = \frac{p(y_i|\mathbf{\theta}, \mathbf{m})p(\mathbf{\theta}|\mathbf{m})}{p(y_i|\mathbf{m})}\]

Now, since the likelihood is Gaussian, we would like to put a conjugate prior on its parameters, and the Normal Inverse Gamma $\mathcal{N}\text{-}\Gamma^{-1}(\gamma,\nu,\alpha,\beta)$ fits the bill. I'm being a bit handwavy here, but this allows us to express the prediction and the associated uncertainty as below.

\[\underbrace{\mathbb{E}[\mu]=\gamma}_{\text{Prediction}} \qquad \underbrace{\mathbb{E}[\sigma^2]=\frac{\beta}{\alpha-1}}_{\text{Aleatoric}} \qquad \underbrace{\text{Var}[\mu]=\frac{\beta}{\nu(\alpha-1)}}_{\text{Epistemic}}\]

The NIG layer in EvidentialFlux.jl outputs four parameters for each target variable, namely $\gamma,\nu,\alpha,\beta$. This means that in one forward pass we can estimate the prediction, the heteroskedastic aleatoric uncertainty, and the epistemic uncertainty. Boom!
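
To make this concrete, below is a minimal sketch of a regression model with an NIG head. It assumes, as the NIG docstring below describes, that the layer returns out*4 rows, and additionally assumes that those rows are ordered as one block each of γ, ν, α and β.

using Flux, EvidentialFlux

# A small regression model: one hidden layer followed by the NIG head.
model = Chain(Dense(1 => 16, relu), NIG(16 => 1))

x = randn(Float32, 1, 32)           # 32 observations of a single feature
ŷ = model(x)                        # size (4, 32): out*4 rows for out = 1

# Assumed row order: γ, ν, α, β (one block of `out` rows each).
γ, ν, α, β = ŷ[1:1, :], ŷ[2:2, :], ŷ[3:3, :], ŷ[4:4, :]

prediction  = γ                     # E[μ] = γ
u_aleatoric = β ./ (α .- 1)         # E[σ²] = β / (α - 1)
u_epistemic = β ./ (ν .* (α .- 1))  # Var[μ] = β / (ν(α - 1))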

Theoretical justifications

Although this approach seems to work well for the problems illustrated by Amini et al., it has been shown in [nis2022] that there are theoretical shortcomings in the expression of the aleatoric and epistemic uncertainty. They propose a correction of the loss and of the uncertainty calculations. In this package I have implemented both.
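
As a sketch of how these corrections might be called (the parameter values below are purely illustrative, with the (O, B) shapes documented further down):

using EvidentialFlux

# Illustrative NIG outputs for O = 1 target and B = 4 observations.
γ = zeros(Float32, 1, 4)
ν = fill(2.0f0, 1, 4)
α = fill(3.0f0, 1, 4)
β = fill(1.5f0, 1, 4)
y = zeros(Float32, 1, 4)                 # regression targets

u_al = aleatoric(ν, α, β)                # corrected aleatoric uncertainty
u_ep = epistemic(ν)                      # corrected epistemic uncertainty
l = nigloss2(y, γ, ν, α, β, 0.1, 1)      # corrected loss; λ = 0.1 and p = 1 are illustrative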

Deep Evidential Classification

We follow [sensoy2018] in our implementation of Deep Evidential Classification. The neural network layer is implemented to output the $\alpha_k$ representing the parameters of a Dirichlet distribution. These parameters have the additional interpretation $\alpha_k = e_k + 1$, where $e_k$ is the evidence for class $k$. Further, it holds that $e_k > 0$, which is why we model them with a softplus activation function.

Ok, so that's all well and good, but what's the point? Well, the point is that since we are now constructing a network layer that outputs evidence for each class, we can apply Dempster-Shafer Theory (DST) to those outputs. DST is a generalization of the Bayesian framework and works by assigning belief mass to states of interest. We can further concretize this notion with Subjective Logic (SL), which places a Dirichlet distribution over these belief masses. Belief masses are defined as $b_k=e_k/S$, where $e_k$ is the evidence of state $k$ and $S=\sum_{i=1}^K(e_i+1)$. Further, SL requires that the masses of these $K+1$ states sum to 1. In practice this means that $u+\sum_{k=1}^K b_k=1$, where $u$ represents the uncertainty over the $K$ possible states, or the "I don't know" class.

Now, since $S=\sum_{i=1}^K(e_i+1)=\sum_{i=1}^K\alpha_i$, SL refers to $S$ as the Dirichlet strength, which is basically the sum of all the collected evidence in favor of the $K$ outcomes. Consequently, the uncertainty $u=K/S$ becomes 1 when no evidence is available. Therefore, $u$ is a normalized quantity ranging between 0 and 1.
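
A minimal sketch of these quantities in code, treating the DIR output as the Dirichlet parameters α described above (layer sizes and inputs are illustrative):

using Flux, EvidentialFlux

K = 3                                    # number of classes
model = Chain(Dense(4 => 8, relu), DIR(8 => K))

x = randn(Float32, 4, 16)                # 16 observations with 4 features
α = model(x)                             # treated as Dirichlet parameters, shape (K, 16)

e = α .- 1                               # evidence per class, eₖ = αₖ - 1
S = sum(α, dims = 1)                     # Dirichlet strength, S = Σₖ αₖ
b = e ./ S                               # belief masses, bₖ = eₖ / S
u = K ./ S                               # uncertainty mass; u + Σₖ bₖ = 1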

Functions

EvidentialFlux.DIR (Type)
DIR(in => out; bias=true, init=Flux.glorot_uniform)
DIR(W::AbstractMatrix, [bias])

A linear layer with a softplus activation at the end, implementing the Dirichlet evidential distribution. In this layer the number of output nodes should correspond to the number of classes you wish to model. This layer should be used to model a Multinomial likelihood with a Dirichlet prior; thus the posterior is also a Dirichlet distribution, and the type II maximum likelihood, i.e., the marginal likelihood, is a Dirichlet-Multinomial distribution. Create a fully connected layer which implements the Dirichlet evidential distribution and whose forward pass is simply given by:

y = softplus.(W * x .+ bias)

The input x should be a vector of length in, a batch of vectors represented as an in × N matrix, or any array with size(x, 1) == in. The output y will be a vector of length out, or a batch with size(y) == (out, size(x)[2:end]...). The function softplus is applied elementwise to y. The keyword bias=false switches off the trainable bias for the layer. The initialisation of the weight matrix is W = init(out, in), calling the function given to the keyword init, with default glorot_uniform. The weight matrix and/or the bias vector (of length out) may also be provided explicitly.

Arguments:

  • (in, out): number of input and output neurons
  • init: The function to use to initialise the weight matrix.
  • bias: Whether to include a trainable bias vector.
source
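
A short usage sketch for the DIR layer (sizes are illustrative):

using EvidentialFlux

layer = DIR(5 => 3)                # 3 classes
x = randn(Float32, 5, 10)          # a batch of 10 inputs with 5 features
α = layer(x)                       # size (3, 10), all entries positive after softplus
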
EvidentialFlux.NIG (Type)
NIG(in => out, σ=NNlib.softplus; bias=true, init=Flux.glorot_uniform)
NIG(W::AbstractMatrix, [bias, σ])

Create a fully connected layer which implements the NormalInverseGamma Evidential distribution whose forward pass is simply given by:

y = W * x .+ bias

The input x should be a vector of length in, a batch of vectors represented as an in × N matrix, or any array with size(x, 1) == in. The output y will be a vector of length out*4, or a batch with size(y) == (out*4, size(x)[2:end]...). The function σ is applied to each row/element of y except the first out ones. The keyword bias=false switches off the trainable bias for the layer. The initialisation of the weight matrix is W = init(out*4, in), calling the function given to the keyword init, with default glorot_uniform. The weight matrix and/or the bias vector may also be provided explicitly; remember that in this case the number of rows in the weight matrix W MUST be a multiple of 4, and the same holds for the length of the bias vector.

Arguments:

  • (in, out): number of input and output neurons
  • σ: The function used to ensure positive-only outputs, which defaults to the softplus function.
  • init: The function to use to initialise the weight matrix.
  • bias: Whether to include a trainable bias vector.
source
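
A short usage sketch for the NIG layer, showing the output shape (sizes are illustrative):

using EvidentialFlux

layer = NIG(5 => 2)                # 2 target variables → 2*4 = 8 output rows
x = randn(Float32, 5, 10)          # a batch of 10 inputs with 5 features
y = layer(x)                       # size (8, 10); σ is applied to all rows except the first 2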

EvidentialFlux.uncertainty (Function)
uncertainty(ν, α, β)

Calculates the epistemic uncertainty of the predictions from the Normal Inverse Gamma (NIG) model. Given a $\mathcal{N}\text{-}\Gamma^{-1}(\gamma,\nu,\alpha,\beta)$ distribution we can calculate the epistemic uncertainty as

$\text{Var}[\mu] = \frac{\beta}{\nu(\alpha-1)}$

Arguments:

  • ν: the ν parameter of the NIG distribution, which relates to its precision and whose shape should be (O, B)
  • α: the α parameter of the NIG distribution, which relates to its precision and whose shape should be (O, B)
  • β: the β parameter of the NIG distribution, which relates to its uncertainty and whose shape should be (O, B)
source
uncertainty(α, β)

Calculates the aleatoric uncertainty of the predictions from the Normal Inverse Gamma (NIG) model. Given a $\mathcal{N}\text{-}\Gamma^{-1}(\gamma,\nu,\alpha,\beta)$ distribution we can calculate the aleatoric uncertainty as

$\mathbb{E}[\sigma^2] = \frac{\beta}{\alpha-1}$

Arguments:

  • α: the α parameter of the NIG distribution, which relates to its precision and whose shape should be (O, B)
  • β: the β parameter of the NIG distribution, which relates to its uncertainty and whose shape should be (O, B)
source
uncertainty(α)

Calculates the epistemic uncertainty associated with a MultinomialDirichlet model (DIR) layer.

  • α: the α parameter of the Dirichlet distribution, which relates to its concentrations and whose shape should be (O, B)
source
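
A short sketch of the three methods (parameter values are purely illustrative, with the (O, B) shapes described above):

using EvidentialFlux

# NIG outputs for O = 1 target and B = 4 observations.
ν = fill(2.0f0, 1, 4)
α = fill(3.0f0, 1, 4)
β = fill(1.5f0, 1, 4)

uncertainty(ν, α, β)               # epistemic uncertainty, β ./ (ν .* (α .- 1))
uncertainty(α, β)                  # aleatoric uncertainty, β ./ (α .- 1)

# Dirichlet parameters for a 3-class DIR head and 4 observations.
αdir = fill(1.2f0, 3, 4)
uncertainty(αdir)                  # epistemic uncertainty of the Dirichlet (DIR) model
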
EvidentialFlux.aleatoric (Function)
aleatoric(ν, α, β)

This is the aleatoric uncertainty as recommended by Meinert, Nis, Jakob Gawlikowski, and Alexander Lavin. 'The Unreasonable Effectiveness of Deep Evidential Regression.' arXiv, May 20, 2022. http://arxiv.org/abs/2205.10060. This is precisely the $σ_{St}$ of the Student-t distribution.

Arguments:

  • ν: the ν parameter of the NIG distribution, which relates to its precision and whose shape should be (O, B)
  • α: the α parameter of the NIG distribution, which relates to its precision and whose shape should be (O, B)
  • β: the β parameter of the NIG distribution, which relates to its uncertainty and whose shape should be (O, B)
source
EvidentialFlux.epistemic (Function)
epistemic(ν)

This is the epistemic uncertainty as recommended by Meinert, Nis, Jakob Gawlikowski, and Alexander Lavin. 'The Unreasonable Effectiveness of Deep Evidential Regression.' arXiv, May 20, 2022. http://arxiv.org/abs/2205.10060.

Arguments:

  • ν: the ν parameter of the NIG distribution, which relates to its precision and whose shape should be (O, B)
source
EvidentialFlux.evidence (Function)
evidence(α)

Calculates the total evidence of assigning each observation in α to the respective class for a DIR layer.

  • α: the α parameter of the Dirichlet distribution, which relates to its concentrations and whose shape should be (O, B)
source
evidence(ν, α)

Returns the evidence for the data pushed through the NIG layer. In this setting, one way of looking at the NIG distribution is as ν virtual observations governing the mean μ of the likelihood and α virtual observations governing the variance $\sigma^2$. The evidence is then the sum of the virtual observations. Amini et al. go through this interpretation in their 2020 paper.

Arguments:

  • ν: the ν parameter of the NIG distribution, which relates to its precision and whose shape should be (O, B)
  • α: the α parameter of the NIG distribution, which relates to its precision and whose shape should be (O, B)
source
EvidentialFlux.nigloss (Function)
nigloss(y, γ, ν, α, β, λ = 1, ϵ = 0.0001)

This is the standard loss function for Evidential Inference given a NormalInverseGamma posterior for the parameters of the Gaussian likelihood function: μ and σ.

Arguments:

  • y: the targets whose shape should be (O, B)
  • γ: the γ parameter of the NIG distribution, which corresponds to its mean and whose shape should be (O, B)
  • ν: the ν parameter of the NIG distribution, which relates to its precision and whose shape should be (O, B)
  • α: the α parameter of the NIG distribution, which relates to its precision and whose shape should be (O, B)
  • β: the β parameter of the NIG distribution, which relates to its uncertainty and whose shape should be (O, B)
  • λ: the weight to put on the regularizer (default: 1)
  • ϵ: the threshold for the regularizer (default: 0.0001)
source
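
A hedged training-loop sketch using nigloss with a Flux model; the toy data, the hyperparameters and the assumed γ, ν, α, β row ordering of the NIG output are illustrative:

using Flux, EvidentialFlux
using Statistics: mean

model = Chain(Dense(1 => 16, relu), NIG(16 => 1))
opt_state = Flux.setup(Adam(1f-3), model)

x = randn(Float32, 1, 64)
y = 2f0 .* x .+ 0.1f0 .* randn(Float32, 1, 64)   # toy regression targets

for epoch in 1:500
    loss, grads = Flux.withgradient(model) do m
        ŷ = m(x)
        γ, ν, α, β = ŷ[1:1, :], ŷ[2:2, :], ŷ[3:3, :], ŷ[4:4, :]   # assumed row order
        mean(nigloss(y, γ, ν, α, β, 0.01, 0.0001))                # λ = 0.01 is illustrative
    end
    Flux.update!(opt_state, model, grads[1])
end
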
EvidentialFlux.nigloss2 (Function)
nigloss2(y, γ, ν, α, β, λ = 1, p = 1)

This is the corrected loss function for DER as recommended by Meinert, Nis, Jakob Gawlikowski, and Alexander Lavin. “The Unreasonable Effectiveness of Deep Evidential Regression.” arXiv, May 20, 2022. http://arxiv.org/abs/2205.10060. Like nigloss, it assumes a NormalInverseGamma posterior for the parameters of the Gaussian likelihood function: μ and σ.

Arguments:

  • y: the targets whose shape should be (O, B)
  • γ: the γ parameter of the NIG distribution, which corresponds to its mean and whose shape should be (O, B)
  • ν: the ν parameter of the NIG distribution, which relates to its precision and whose shape should be (O, B)
  • α: the α parameter of the NIG distribution, which relates to its precision and whose shape should be (O, B)
  • β: the β parameter of the NIG distribution, which relates to its uncertainty and whose shape should be (O, B)
  • λ: the weight to put on the regularizer (default: 1)
  • p: the power which to raise the scaled absolute prediction error (default: 1)
source
EvidentialFlux.dirloss (Function)
dirloss(y, α, t)

A regularized version of the type II maximum likelihood for the Multinomial(p) distribution, where the parameter p, which follows a Dirichlet distribution, has been integrated out.

Arguments:

  • y: the targets whose shape should be (O, B)
  • α: the parameters of a Dirichlet distribution representing the belief in each class, whose shape should be (O, B)
  • t: counter for the current epoch being evaluated
source
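
A hedged training-loop sketch for a DIR classifier using dirloss; the toy data, the one-hot target encoding and the hyperparameters are illustrative assumptions:

using Flux, EvidentialFlux
using Statistics: mean

K = 3
model = Chain(Dense(4 => 8, relu), DIR(8 => K))
opt_state = Flux.setup(Adam(1f-3), model)

x = randn(Float32, 4, 32)
y = Flux.onehotbatch(rand(1:K, 32), 1:K)   # assumed one-hot targets, shape (K, 32)

for epoch in 1:50
    loss, grads = Flux.withgradient(model) do m
        α = m(x)
        mean(dirloss(y, α, epoch))          # t = the current epoch, per the docstring above
    end
    Flux.update!(opt_state, model, grads[1])
end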


References

  • [amini2020] Amini, Alexander, Wilko Schwarting, Ava Soleimany, and Daniela Rus. “Deep Evidential Regression.” arXiv:1910.02600 [cs, stat], November 24, 2020. http://arxiv.org/abs/1910.02600.
  • [sensoy2018] Sensoy, Murat, Lance Kaplan, and Melih Kandemir. “Evidential Deep Learning to Quantify Classification Uncertainty.” Advances in Neural Information Processing Systems 31 (2018): 3179–89.
  • [nis2022] Meinert, Nis, Jakob Gawlikowski, and Alexander Lavin. “The Unreasonable Effectiveness of Deep Evidential Regression.” arXiv, May 20, 2022. http://arxiv.org/abs/2205.10060.