Getting Started with TensorFlow Probability in R

With an abundance of great libraries for statistical computing in R, why would you be interested in TensorFlow Probability (TFP, for short)? Well – let's look at its list of ingredients:

  • Distributions and bijectors (bijectors are invertible, composable maps)
  • Probabilistic modeling (Edward2 and probabilistic network layers)
  • Probabilistic inference (via MCMC or variational inference)

Now imagine all of these working seamlessly with the TensorFlow framework – Core, Keras, supported modules – and running distributed and on GPU, too. The field of possible applications is vast – far too varied to cover in its entirety in an introductory blog post.

Instead, our goal here is to provide a first introduction to TFP, focusing on direct applicability to, and interoperability with, deep learning. We'll quickly show how to get started with one of the basic building blocks: distributions. Then, we'll build a variational autoencoder similar to the one in Representation learning with MMD-VAE. This time though, we'll use TFP to sample from the prior and the approximate posterior distributions.

See this post as a "proof of concept" for using TFP with Keras – from R – and plan to follow up with more extensive examples from the area of semi-supervised representation learning.

To install TFP together with TensorFlow, simply add tensorflow-probability to the default list of extra packages:

library(tensorflow)
install_tensorflow(
  extra_packages = c("keras", "tensorflow-hub", "tensorflow-probability"),
  version = "1.12"
)

Now to use TFP, we just need to import it and create some useful handles.
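
For example, creating the handles looks like this (the same import appears again further below):

tfp <- import("tensorflow_probability")
tfd <- tfp$distributions  # handle to the distributions module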

And here we go, sampling from the standard normal distribution.

n <- tfd$Normal(loc = 0, scale = 1)
n$sample(6L)
tf.Tensor(
"Normal_1/sample/Reshape:0", shape=(6,), dtype=float32
)

Now that's nice, but it's 2019 and we don't want to have to create sessions to evaluate these tensors anymore. In the variational autoencoder example below, we're going to see how TFP and TF eager execution are a great match, so why not start using eager execution now?

To use eager execution, we need to execute the following lines in a fresh (R) session:
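
The exact calls depend on your package versions; with the TF 1.x-era keras and tensorflow R packages this post is based on, a sketch looks like this:

library(keras)
use_implementation("tensorflow")  # use the TensorFlow implementation of Keras

library(tensorflow)
tfe_enable_eager_execution(device_policy = "silent")  # turn on eager execution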

… and import TFP, same as above.

tfp <- import("tensorflow_probability")
tfd <- tfp$distributions

Now, let's quickly take a look at TFP distributions.

Using distributions

Here, again, is that standard normal:

n <- tfd$Normal(loc = 0, scale = 1)

Things that are commonly done with distributions include sampling:

# just as in low-level tensorflow, we need to append L to indicate integer arguments
n$sample(6L) 
tf.Tensor(
[-0.34403768 -0.14122334 -1.3832929   1.618252    1.364448   -1.1299014 ],
shape=(6,),
dtype=float32
)

We can also get the log probability. Here we do it for three values simultaneously:
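
# the input values here are an assumption, chosen to match the log probabilities shown below
n$log_prob(c(-1, 0, 1))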

tf.Tensor(
[-1.4189385 -0.9189385 -1.4189385], shape=(3,), dtype=float32
)

We can do the same thing with many other distributions, such as the Bernoulli:

b <- tfd$Bernoulli(0.9)
b$sample(10L)
tf.Tensor(
[1 1 1 0 1 1 0 1 0 1], shape=(10,), dtype=int32
)
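
We can also query log probabilities; the input values below are an assumption, chosen to be consistent with the output shown (note that tfd$Bernoulli(0.9) is parameterized by logits, not probs):

b$log_prob(c(0, 1, 0, 0))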
tf.Tensor(
[-1.2411538 -0.3411539 -1.2411538 -1.2411538], shape=(4,), dtype=float32
)

Note that in the last snippet, we asked for the log probabilities of four independent values.

Batch shapes and event shapes

In TFP, we can do the following:

ns <- tfd$Normal(
  loc = c(1, 10, -200),
  scale = c(0.1, 0.1, 1)
)
ns
tfp.distributions.Normal(
"Normal/", batch_shape=(3,), event_shape=(), dtype=float32
)

Contrary to what it might look like, this is not a multivariate normal. As indicated by batch_shape=(3,), this is a “batch” of independent univariate distributions. The fact that these are univariate is shown by event_shape=(): each of them lives in a one-dimensional event space.
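
For instance, asking this batch for log probabilities yields one value per batch member (a sketch; only the shape matters here):

ns$log_prob(c(1, 10, -200))  # shape (3,): one log probability per univariate normal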

If instead we construct a single, two-dimensional multivariate normal:

n <- tfd$MultivariateNormalDiag(loc = c(0, 10), scale_diag = c(1, 4))
n
tfp.distributions.MultivariateNormalDiag(
"MultivariateNormalDiag/", batch_shape=(), event_shape=(2,), dtype=float32
)

Here we see batch_shape=() and event_shape=(2,), as expected.
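
Correspondingly, a single two-dimensional value now yields a single log probability (a sketch):

n$log_prob(c(0, 10))  # shape (): one log probability for the whole two-dimensional event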

Of course, we can combine both, creating batches of multivariate distributions:

nd_batch <- tfd$MultivariateNormalFullCovariance(
  loc = list(c(0., 0.), c(1., 1.), c(2., 2.)),
  covariance_matrix = list(
    matrix(c(1, .1, .1, 1), ncol = 2),
    matrix(c(1, .3, .3, 1), ncol = 2),
    matrix(c(1, .5, .5, 1), ncol = 2))
)

This example defines a batch of three two-dimensional multivariate normal distributions.

Converting between batch shapes and event shapes

Strange as it may sound, situations arise where we want to transform distribution shapes between these types – in fact, we'll see such a case very soon.

tfd$Independent is used to convert dimensions in batch_shape to dimensions in event_shape.

Here is a batch of three independent Bernoulli distributions.

bs <- tfd$Bernoulli(probs=c(.3,.5,.7))
bs
tfp.distributions.Bernoulli(
"Bernoulli/", batch_shape=(3,), event_shape=(), dtype=int32
)

We can convert this to a virtual “three-dimensional” Bernoulli like this:

b <- tfd$Independent(bs, reinterpreted_batch_ndims = 1L)
b
tfp.distributions.Independent(
"IndependentBernoulli/", batch_shape=(), event_shape=(3,), dtype=int32
)

Here, reinterpreted_batch_ndims tells TFP how many of the batch dimensions, counting from the right of the shape list, are to be treated as event space.
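
As a sketch of how that counting works, assume a two-by-three batch of Bernoullis; reinterpreting one batch dimension (the rightmost) moves it into the event shape:

bs2 <- tfd$Bernoulli(probs = matrix(runif(6), nrow = 2))  # batch_shape=(2, 3)
tfd$Independent(bs2, reinterpreted_batch_ndims = 1L)      # batch_shape=(2,), event_shape=(3,)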

With this basic understanding of TFP distributions, we're ready to see them used in a VAE.

We'll take the (not so) deep convolutional architecture from Representation learning with MMD-VAE and use distributions for sampling and computing likelihoods. Optionally, our new VAE will be able to learn the prior distribution.

Concretely, the following exposition consists of three parts. First, we present common code applicable to both a VAE with a static prior and one that learns the parameters of the prior distribution. Then, we have the training loop for the first (static-prior) VAE. Finally, we discuss the training loop and the additional model involved in the second (prior-learning) VAE.

Presenting both versions back-to-back leads to some code duplication, but avoids scattering confusing if-else branches throughout the code.

The second VAE is available as part of the Keras examples, so you don't have to copy out code snippets. The code there also contains additional functionality not discussed or duplicated here, such as saving model weights.

So, let's start with the general part.

At the risk of repeating ourselves, here are the preparation steps again (including a few additional library loads).
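
A sketch of what that setup could look like (the package choices and num_epochs value are assumptions, consistent with the code and results below):

library(keras)
use_implementation("tensorflow")

library(tensorflow)
tfe_enable_eager_execution(device_policy = "silent")

tfp <- import("tensorflow_probability")
tfd <- tfp$distributions

library(tfdatasets)
library(glue)

num_epochs <- 40  # assumption: results below are shown after 40 epochs of training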

The data set

For a change from MNIST and Fashion-MNIST, we'll use the brand-new Kuzushiji-MNIST (Clanuwat et al. 2018).

From: Deep Learning for Classical Japanese Literature (Clanuwat et al. 2018)
np <- import("numpy")

kuzushiji <- np$load("kmnist-train-imgs.npz")
kuzushiji <- kuzushiji$get("arr_0")
 
train_images <- kuzushiji %>%
  k_expand_dims() %>%
  k_cast(dtype = "float32")

train_images <- train_images %>% `/`(255)

As in that other post, we stream the data via tfdatasets:

buffer_size <- 60000
batch_size <- 256
batches_per_epoch <- buffer_size / batch_size

train_dataset <- tensor_slices_dataset(train_images) %>%
  dataset_shuffle(buffer_size) %>%
  dataset_batch(batch_size)

Now let's see what changes occur in the encoder and decoder models.

Encoder

The encoder differs from the one we had without TFP in that it does not return the approximate posterior means and variances directly as tensors. Instead, it returns a batch of multivariate normal distributions:

# you might want to change this depending on the dataset
latent_dim <- 2

encoder_model <- function(name = NULL) {

  keras_model_custom(name = name, function(self) {
  
    self$conv1 <-
      layer_conv_2d(
        filters = 32,
        kernel_size = 3,
        strides = 2,
        activation = "relu"
      )
    self$conv2 <-
      layer_conv_2d(
        filters = 64,
        kernel_size = 3,
        strides = 2,
        activation = "relu"
      )
    self$flatten <- layer_flatten()
    self$dense <- layer_dense(units = 2 * latent_dim)
    
    function (x, mask = NULL) {
      x <- x %>%
        self$conv1() %>%
        self$conv2() %>%
        self$flatten() %>%
        self$dense()
        
      tfd$MultivariateNormalDiag(
        loc = x[, 1:latent_dim],
        scale_diag = tf$nn$softplus(x[, (latent_dim + 1):(2 * latent_dim)] + 1e-5)
      )
    }
  })
}

Let's try it out.

encoder <- encoder_model()

iter <- make_iterator_one_shot(train_dataset)
x <-  iterator_get_next(iter)

approx_posterior <- encoder(x)
approx_posterior
tfp.distributions.MultivariateNormalDiag(
"MultivariateNormalDiag/", batch_shape=(256,), event_shape=(2,), dtype=float32
)
approx_posterior_sample <- approx_posterior$sample()
approx_posterior_sample
tf.Tensor(
[[ 5.77791929e-01 -1.64988488e-02]
 [ 7.93901443e-01 -1.00042784e+00]
 [-1.56279251e-01 -4.06365871e-01]
 ...
 [-6.47531569e-01  2.10889503e-02]], shape=(256, 2), dtype=float32)

We don't know about you, but we still enjoy the ease of inspecting values with eager execution – a lot.

Now on to the decoder, which likewise returns a distribution instead of a tensor.

Decoder

In the decoder, we see why conversions between batch shape and event shape are useful. The output of self$deconv3 is four-dimensional. What we need is an on-off probability for every pixel. Previously, this was accomplished by feeding the tensor into a dense layer and applying a sigmoid activation. Here, we use tfd$Independent to effectively convert the tensor into a probability distribution over three-dimensional images (width, height, channel(s)).

decoder_model <- function(name = NULL) {
  
  keras_model_custom(name = name, function(self) {
    
    self$dense <- layer_dense(units = 7 * 7 * 32, activation = "relu")
    self$reshape <- layer_reshape(target_shape = c(7, 7, 32))
    self$deconv1 <-
      layer_conv_2d_transpose(
        filters = 64,
        kernel_size = 3,
        strides = 2,
        padding = "same",
        activation = "relu"
      )
    self$deconv2 <-
      layer_conv_2d_transpose(
        filters = 32,
        kernel_size = 3,
        strides = 2,
        padding = "same",
        activation = "relu"
      )
    self$deconv3 <-
      layer_conv_2d_transpose(
        filters = 1,
        kernel_size = 3,
        strides = 1,
        padding = "same"
      )
    
    function (x, mask = NULL) {
      x <- x %>%
        self$dense() %>%
        self$reshape() %>%
        self$deconv1() %>%
        self$deconv2() %>%
        self$deconv3()
      
      tfd$Independent(tfd$Bernoulli(logits = x),
                      reinterpreted_batch_ndims = 3L)
      
    }
  })
}

Let's try that too.

decoder <- decoder_model()
decoder_likelihood <- decoder(approx_posterior_sample)
tfp.distributions.Independent(
"IndependentBernoulli/", batch_shape=(256,), event_shape=(28, 28, 1), dtype=int32
)

This distribution will be used to generate the “reconstructions”, as well as to determine the log-likelihood of the original samples.
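
For illustration, a quick sketch of both uses, assuming the batch x from above:

reconstructions <- decoder_likelihood$mean()  # per-pixel Bernoulli probabilities, shape (256, 28, 28, 1)
nll <- -decoder_likelihood$log_prob(x)        # one log-likelihood per image, shape (256,)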

KL loss and optimizer

Both of the VAEs discussed below will require an optimizer…

optimizer <- tf$train$AdamOptimizer(1e-4)

… and both will delegate to compute_kl_loss to calculate the KL part of the loss.

This helper function simply subtracts the log-likelihood of the samples under the prior from their log-likelihood under the approximate posterior.

compute_kl_loss <- function(
  latent_prior,
  approx_posterior,
  approx_posterior_sample) {
  
  kl_div <- approx_posterior$log_prob(approx_posterior_sample) -
    latent_prior$log_prob(approx_posterior_sample)
  avg_kl_div <- tf$reduce_mean(kl_div)
  avg_kl_div
}

Now that we've seen the common parts, let's first discuss how to train a VAE with a static prior.

In this VAE, we use TFP to create the usual isotropic Gaussian prior. We then sample from this distribution directly in the training loop.

latent_prior <- tfd$MultivariateNormalDiag(
  loc  = tf$zeros(list(latent_dim)),
  scale_identity_multiplier = 1
)
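
As an aside, sampling from this prior is also how we would generate brand-new images; a sketch with illustrative names:

prior_samples <- latent_prior$sample(16L)   # 16 draws from the latent prior, shape (16, 2)
generated <- decoder(prior_samples)$mean()  # decode into per-pixel probabilities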

And here is the complete training loop. We'll point out the crucial TFP-related steps below.

for (epoch in seq_len(num_epochs)) {
  iter <- make_iterator_one_shot(train_dataset)
  
  total_loss <- 0
  total_loss_nll <- 0
  total_loss_kl <- 0
  
  until_out_of_range({
    x <-  iterator_get_next(iter)
    
    with(tf$GradientTape(persistent = TRUE) %as% tape, {
      approx_posterior <- encoder(x)
      approx_posterior_sample <- approx_posterior$sample()
      decoder_likelihood <- decoder(approx_posterior_sample)
      
      nll <- -decoder_likelihood$log_prob(x)
      avg_nll <- tf$reduce_mean(nll)
      
      kl_loss <- compute_kl_loss(
        latent_prior,
        approx_posterior,
        approx_posterior_sample
      )

      loss <- kl_loss + avg_nll
    })
    
    total_loss <- total_loss + loss
    total_loss_nll <- total_loss_nll + avg_nll
    total_loss_kl <- total_loss_kl + kl_loss
    
    encoder_gradients <- tape$gradient(loss, encoder$variables)
    decoder_gradients <- tape$gradient(loss, decoder$variables)
    
    optimizer$apply_gradients(purrr::transpose(list(
      encoder_gradients, encoder$variables
    )),
    global_step = tf$train$get_or_create_global_step())
    optimizer$apply_gradients(purrr::transpose(list(
      decoder_gradients, decoder$variables
    )),
    global_step = tf$train$get_or_create_global_step())
 
  })
  
  cat(
    glue(
      "Losses (epoch): {epoch}:",
      "  {(as.numeric(total_loss_nll)/batches_per_epoch) %>% round(4)} nll",
      "  {(as.numeric(total_loss_kl)/batches_per_epoch) %>% round(4)} kl",
      "  {(as.numeric(total_loss)/batches_per_epoch) %>% round(4)} total"
    ),
    "\n"
  )
}

Above, when playing around with the encoder and decoder, we already saw how

approx_posterior <- encoder(x)

gives us a distribution from which to sample. We use this to obtain samples from the approximate posterior:

approx_posterior_sample <- approx_posterior$sample()

We take those samples and feed them to the decoder, which gives us the on-off probabilities for the image pixels.

decoder_likelihood <- decoder(approx_posterior_sample)

The loss now consists of the usual ELBO components: reconstruction loss and KL divergence. The reconstruction loss we obtain directly from TFP, using the learned decoder distribution to assess the likelihood of the original input.

nll <- -decoder_likelihood$log_prob(x)
avg_nll <- tf$reduce_mean(nll)

The KL loss we get from compute_kl_loss, the helper function we saw above:

kl_loss <- compute_kl_loss(
        latent_prior,
        approx_posterior,
        approx_posterior_sample
      )

We add both and arrive at the overall VAE loss:

loss <- kl_loss + avg_nll

Apart from these TFP-related changes, the training process is just normal backprop, the way it looks with eager execution.

Now let's see how, instead of the standard isotropic Gaussian, we could learn a mixture of Gaussians. The choice of the number of mixture components here is quite arbitrary; just like latent_dim, you'll want to experiment and find out what works best on your dataset.

mixture_components <- 16

learnable_prior_model <- function(name = NULL, latent_dim, mixture_components) {
  
  keras_model_custom(name = name, function(self) {
    
    self$loc <-
      tf$get_variable(
        name = "loc",
        shape = list(mixture_components, latent_dim),
        dtype = tf$float32
      )
    self$raw_scale_diag <- tf$get_variable(
      name = "raw_scale_diag",
      shape = c(mixture_components, latent_dim),
      dtype = tf$float32
    )
    self$mixture_logits <-
      tf$get_variable(
        name = "mixture_logits",
        shape = c(mixture_components),
        dtype = tf$float32
      )
      
    function (x, mask = NULL) {
        tfd$MixtureSameFamily(
          components_distribution = tfd$MultivariateNormalDiag(
            loc = self$loc,
            scale_diag = tf$nn$softplus(self$raw_scale_diag)
          ),
          mixture_distribution = tfd$Categorical(logits = self$mixture_logits)
        )
      }
    })
  }

In TFP terminology, components_distribution is the underlying distribution type, and mixture_distribution holds the probabilities of the individual components being chosen.

Note how self$loc, self$raw_scale_diag and self$mixture_logits are TensorFlow Variables, and thus persistent and updatable via backprop.

Now we build the model.

latent_prior_model <- learnable_prior_model(
  latent_dim = latent_dim,
  mixture_components = mixture_components
)

How do we obtain the actual latent prior distribution we can sample from? A bit unusually, this model will be called without an input:

latent_prior <- latent_prior_model(NULL)
latent_prior
tfp.distributions.MixtureSameFamily(
"MixtureSameFamily/", batch_shape=(), event_shape=(2,), dtype=float32
)

Here now is the complete training loop. Note how we have a third model to backprop through.

for (epoch in seq_len(num_epochs)) {
  iter <- make_iterator_one_shot(train_dataset)
  
  total_loss <- 0
  total_loss_nll <- 0
  total_loss_kl <- 0
  
  until_out_of_range({
    x <-  iterator_get_next(iter)
    
    with(tf$GradientTape(persistent = TRUE) %as% tape, {
      approx_posterior <- encoder(x)
      
      approx_posterior_sample <- approx_posterior$sample()
      decoder_likelihood <- decoder(approx_posterior_sample)
      
      nll <- -decoder_likelihood$log_prob(x)
      avg_nll <- tf$reduce_mean(nll)
      
      latent_prior <- latent_prior_model(NULL)
      
      kl_loss <- compute_kl_loss(
        latent_prior,
        approx_posterior,
        approx_posterior_sample
      )

      loss <- kl_loss + avg_nll
    })
    
    total_loss <- total_loss + loss
    total_loss_nll <- total_loss_nll + avg_nll
    total_loss_kl <- total_loss_kl + kl_loss
    
    encoder_gradients <- tape$gradient(loss, encoder$variables)
    decoder_gradients <- tape$gradient(loss, decoder$variables)
    prior_gradients <-
      tape$gradient(loss, latent_prior_model$variables)
    
    optimizer$apply_gradients(purrr::transpose(list(
      encoder_gradients, encoder$variables
    )),
    global_step = tf$train$get_or_create_global_step())
    optimizer$apply_gradients(purrr::transpose(list(
      decoder_gradients, decoder$variables
    )),
    global_step = tf$train$get_or_create_global_step())
    optimizer$apply_gradients(purrr::transpose(list(
      prior_gradients, latent_prior_model$variables
    )),
    global_step = tf$train$get_or_create_global_step())
    
  })
  
  checkpoint$save(file_prefix = checkpoint_prefix)
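  # `checkpoint` and `checkpoint_prefix` are defined in the full Keras example referenced above; weight saving is not duplicated in this excerpt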
  
  cat(
    glue(
      "Losses (epoch): {epoch}:",
      "  {(as.numeric(total_loss_nll)/batches_per_epoch) %>% round(4)} nll",
      "  {(as.numeric(total_loss_kl)/batches_per_epoch) %>% round(4)} kl",
      "  {(as.numeric(total_loss)/batches_per_epoch) %>% round(4)} total"
    ),
    "\n"
  )
}  

And that's it! For us, both VAEs yielded similar results, and we did not see major differences when experimenting with the latent dimensionality and the number of mixture components. But then, we wouldn't want to generalize to other datasets, architectures, etc.

Speaking of results, how do they look? Here we see characters generated after 40 epochs of training. On the left are random samples; on the right, the usual VAE grid display of the latent space.

[Figure: randomly generated characters (left) and the latent space grid (right)]

Hopefully, we've been able to show that TensorFlow Probability, eager execution, and Keras make for an attractive combination! If you relate the total amount of code required to the complexity of the task and the depth of the concepts involved, this should appear as a pretty concise implementation.

In the near future, we plan to follow up with more involved applications of TensorFlow Probability, mostly from the field of representation learning. Stay tuned!

Clanuwat, Tarin, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. 2018. “Deep Learning for Classical Japanese Literature.” December 3, 2018. https://arxiv.org/abs/cs.CV/1812.01718.
