Variational AutoEncoders
This is going to be a long post, I reckon, because I am diving into VAEs again. Maybe it will refresh my mind.
AutoEncoder:
I already know what an autoencoder is, so if you do not know about it, I am sorry. A VAE is kind of an updated form of the autoencoder.
The importance of autoencoders lies in compression, image denoising, etc.
VAE:
Encoder, decoder, a similar structure, but they differ. How? The encoder is now probabilistic, not deterministic like in an autoencoder: it maps the input to a distribution over the latent space rather than to a single point.
KL-divergence:
Bored of the same Mean Squared Error and Categorical Cross Entropy losses? Presenting to you: KL DIVERGENCE. It measures the difference between two given probability distributions. Sounds quite frightening, right?
Imagine we want to find the difference between a normal distribution and a uniform distribution. Can we do that? YES. KL-divergence does that.
KL-Divergence is not symmetric: KL(P||Q) is not the same as KL(Q||P).
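To get a feel for it, here is a tiny sketch (assuming NumPy and SciPy are installed; the two distributions are just made-up examples) that computes KL both ways and shows the two values differ:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) returns KL(p || q) for discrete distributions

# Two discrete probability distributions over the same four outcomes
p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])  # uniform

kl_pq = entropy(p, q)  # KL(p || q) = sum_i p_i * log(p_i / q_i)
kl_qp = entropy(q, p)  # KL(q || p)

print(f"KL(p || q) = {kl_pq:.4f}")
print(f"KL(q || p) = {kl_qp:.4f}")
# The two numbers are different, which is exactly what "not symmetric" means.
```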
If we have two multivariate normal distributions with means u1 and u2 and covariances sigma1 and sigma2, what would their KL divergence be?
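For reference, the standard closed-form result for two k-dimensional Gaussians is:

$$
D_{KL}\big(\mathcal{N}(\mu_1,\Sigma_1)\,\|\,\mathcal{N}(\mu_2,\Sigma_2)\big) = \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - k + \operatorname{tr}\!\left(\Sigma_2^{-1}\Sigma_1\right) + (\mu_2-\mu_1)^{\top}\Sigma_2^{-1}(\mu_2-\mu_1)\right]
$$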
This formula comes out of a long derivation. If you want to go through it, you can. I won't. I did it in my notebook long back, but I am not going through it again.
Working Details of Variational AutoEncoder:
The goal of a VAE is to learn a latent distribution Q(z|x), sample a latent variable z from it, and use that to generate P(x'|z).
The difference between the latent variable here in a VAE versus in an autoencoder is that the VAE latent variable is a sample drawn from a distribution, not a fixed code.
It has two channels. The first is the encoder, which learns the parameters that let us obtain the latent vector z. See, we have x, we need z, and we can get that from Q(z|x). This is probabilistic, okay? Probabilistic means we can get a different output for the same input. After sampling z, we generate P(x'|z) from the decoder model, which is not probabilistic but deterministic.
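To make the two channels concrete, here is a minimal sketch in PyTorch. This is just my illustration of the idea, not a reference implementation; the input size of 784, the layer widths, and the latent size are all assumptions:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=20):
        super().__init__()
        # Encoder: learns the parameters of Q(z|x), i.e. a mean and a
        # (log-)variance for every latent dimension.
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)      # mean of Q(z|x)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # log-variance of Q(z|x)
        # Decoder: a deterministic network that maps a sampled z back to x'.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Sample z from Q(z|x): the probabilistic part
        # (this is the reparameterization trick, explained later in the post).
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)
        x_recon = self.decoder(z)
        return x_recon, mu, logvar
```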
Loss Function of VAE:
Since a VAE is a generative model, you know what a generative model needs, right? Bayes' rule:
P(z|x) = P(x|z) * P(z) / P(x).
P(x) cannot be determined directly in high dimensions. Just picture x as a 100-dimensional vector.
How can we determine P(x)? In principle, by marginalizing over the latent variable: P(x) = ∫ P(x|z) P(z) dz.
But when z is high-dimensional, this integral runs over just as many dimensions, which makes it practically impossible to calculate.
So we come to the conclusion that P(z|x) cannot be found this way. What we do instead is try to make that distribution look like something else, that is, approximate it with some other, known distribution.
So, how do we approximate? Well, not a tough question for us. How would you make the predicted value y^ close to the actual value y in Machine Learning? You take the loss between the two. You know y_actual, so you try to push y_predicted toward it. In other words, you minimize the error between them. But here we are talking about distributions, and we don't have a loss for distributions. Or do we? We do: KL DIVERGENCE.
So which probability distribution should we use to approximate this encoder distribution of ours? Binomial? Nah, that is a discrete distribution over counts, while our latent variable is continuous and needs multiple parameters. How about Gaussian? Simple. A bell curve with a mean and a variance. Looks easy, right?
It is what we take.
So we minimize the KL divergence between a (multivariate) Gaussian distribution and that intractable posterior distribution. Term alert: this way of approximating a distribution with a known one is called variational inference.
So we take the KL divergence between P(z|x) (which is intractable) and Q(z|x), which is a Gaussian distribution with a mean and standard deviation. Some fairly involved derivation then brings the equation to:
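Skipping the algebra, the result is the standard identity:

$$
\log P(x) - D_{KL}\big(Q(z|x)\,\|\,P(z|x)\big) = \mathbb{E}_{z \sim Q(z|x)}\big[\log P(x|z)\big] - D_{KL}\big(Q(z|x)\,\|\,P(z)\big)
$$

The KL term on the left is intractable but never negative, so we maximize the right-hand side instead (equivalently, minimize its negative).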
Don't get overwhelmed by the derivation or the equation either. It is what it is.
So it contains two loss terms:
- The one on the left, with the expectation, is the log-likelihood log(P(x|z)). What is P(x|z)? It is the decoder's output distribution, and if we model it as a Gaussian, then log P(x|z) is a Gaussian log-likelihood. And you know what maximizing a Gaussian log-likelihood boils down to? Minimizing the mean squared error. So we can replace the term on the left with the mean squared error between input and reconstruction.
- The term on the right is KL(Q(z|x) || P(z)). Q(z|x) is Gaussian with the mean and variance that our network learns; this part we know, right? What about P(z)? It's the prior. Since it is a prior, assumptions are made: P(z) is assumed to be a standard Gaussian distribution (mean = 0, std_dev = 1).
The KL divergence between two Gaussian distributions is that long formula I showed you earlier. If we plug the standard normal prior into it, the second loss term reduces to the following.
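Here d is the latent dimension, Q(z|x) is assumed to have a diagonal covariance (as VAEs typically do), and μ_j, σ_j are the per-dimension mean and standard deviation that the encoder outputs:

$$
D_{KL}\big(Q(z|x)\,\|\,P(z)\big) = \frac{1}{2}\sum_{j=1}^{d}\left(\sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2\right)
$$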
So we have our two loss terms. The first one is the MSE between input and output, and the second one is this formula. Easy, right? No. I know it isn't.
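As a quick sketch (PyTorch again, just my illustration to pair with the VAE sketch earlier), putting the two terms together could look like this:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    # First loss: reconstruction error between input and output (MSE here).
    recon_loss = F.mse_loss(x_recon, x, reduction="sum")
    # Second loss: KL(Q(z|x) || N(0, I)) using the closed form above.
    # logvar = log(sigma^2), so exp(logvar) = sigma^2.
    kl_loss = 0.5 * torch.sum(torch.exp(logvar) + mu**2 - 1.0 - logvar)
    return recon_loss + kl_loss
```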
The question remains: why do we choose the standard normal distribution for P(z)? Because we do not want Q(z|x) to drift away toward some other distribution that we know nothing about. So we can say it acts as a regularizer.
Reparameterization Trick:
Our latent variable z is random. How is it random? Because it is sampled using a mean and a standard deviation. But we cannot backpropagate through a random node. To solve this, we use the reparameterization trick, which is as simple as it sounds: we re-express the sampling in terms of its parameters.
So our encoder is not just a random node; it outputs two sets of values, one being the mean (u) and the other the std_deviation (sigma). We then write the latent variable as z = u + sigma * e, where e is drawn from a standard normal distribution and is the only random node. So z, which was a random node at first, is now a deterministic function of u and sigma plus that external noise, a sampled node we can backpropagate through.
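In code, the trick is just a couple of lines (same PyTorch sketch; the function name is mine):

```python
import torch

def reparameterize(mu, logvar):
    # e is sampled from a standard normal; the randomness lives only here.
    eps = torch.randn_like(mu)
    std = torch.exp(0.5 * logvar)  # sigma = exp(0.5 * log(sigma^2))
    # z = mu + sigma * e is a deterministic function of (mu, sigma) and the
    # external noise e, so gradients can flow back through mu and sigma.
    return mu + std * eps
```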
This is how VAE works.