A look at activations and cost functions

You are building a Keras model. If you haven't been doing deep learning for that long, getting the output activation and cost function right can involve some memorization (or looking it up). You may be trying to recall general guidelines like these:

So with my cats and dogs, I'm doing 2-class classification, so I have to use sigmoid activation in the output layer, right, and then, it's binary crossentropy for the cost function…
or: I'm doing classification on ImageNet, that's multi-class, so that was softmax for the activation, and then, the cost should be categorical crossentropy…

It's fine to memorize things like this, but knowing a little about the reasons behind them often makes things easier. So we ask: Why is it that these output activation and cost functions go together? And, do they always have to?

In short

Simply put, we choose output activations that produce the kind of values we want the network to predict. The cost function is then determined by the model.

This is because neural networks are normally trained using maximum likelihood, and depending on the distribution we assume for the output units, maximum likelihood yields different optimization objectives. All of these objectives then minimize the cross-entropy (pragmatically: the mismatch) between the true distribution and the predicted distribution.
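
As a reminder, the cross-entropy between the data-generating distribution and the model distribution is simply the expected negative log-likelihood:

\[H(\hat{p}_{data}, p_{model}) = -\mathbb{E}_{\mathbf{x},y \sim \hat{p}_{data}}\ \log\ p_{model}(y|\mathbf{x})\]

So minimizing cross-entropy and maximizing likelihood are two views of the same objective.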

Let's start with the simplest, linear case.

Regression

For the botanists among us, here's a very simple network that aims to predict sepal width from sepal length.

library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 32) %>%
  layer_dense(units = 1)

model %>% compile(
  optimizer = "adam", 
  loss = "mean_squared_error"
)

model %>% fit(
  x = iris$Sepal.Length %>% as.matrix(),
  y = iris$Sepal.Width %>% as.matrix(),
  epochs = 50
)

Here, our model assumes that sepal width is normally distributed given sepal length. Most often, what we're trying to learn is the mean of this conditional Gaussian distribution:

\[p(y|\mathbf{x}) = N(y;\ \mathbf{w}^T\mathbf{h} + b)\]

In this case, the cost function that minimizes the cross-entropy (equivalently: maximizes the likelihood) is mean squared error. And that's exactly what we're using as the cost function above.
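
To see why, write down the negative log-likelihood of a Gaussian with fixed variance \(\sigma^2\) and mean \(\hat{y} = \mathbf{w}^T\mathbf{h} + b\):

\[-\log N(y;\ \hat{y}, \sigma^2) = \frac{(y - \hat{y})^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)\]

The second term does not depend on the weights, so minimizing this quantity amounts to minimizing the squared error between \(y\) and the prediction \(\hat{y}\).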

Alternatively, we could decide to predict the median of that conditional distribution. In that case, we would change the cost function to mean absolute error:

model %>% compile(
  optimizer = "adam", 
  loss = "mean_absolute_error"
)

Now, let's move beyond the linear case.

Binary classification

Say we are keen bird watchers and want an application that notifies us when there is a bird in our garden – not when the neighbors land their plane. We will thus train a network to distinguish between two classes: birds and airplanes.

# Using the CIFAR-10 dataset that conveniently comes with Keras.
cifar10 <- dataset_cifar10()

x_train <- cifar10$train$x / 255
y_train <- cifar10$train$y

# in CIFAR-10, class 2 is "bird"
is_bird <- cifar10$train$y == 2
x_bird <- x_train[is_bird, , ,]
y_bird <- rep(0, 5000)

# class 0 is "airplane"
is_plane <- cifar10$train$y == 0
x_plane <- x_train[is_plane, , ,]
y_plane <- rep(1, 5000)

x <- abind::abind(x_bird, x_plane, along = 1)
y <- c(y_bird, y_plane)

model <- keras_model_sequential() %>%
  layer_conv_2d(
    filter = 8,
    kernel_size = c(3, 3),
    padding = "same",
    input_shape = c(32, 32, 3),
    activation = "relu"
  ) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(
    filter = 8,
    kernel_size = c(3, 3),
    padding = "same",
    activation = "relu"
  ) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "adam", 
  loss = "binary_crossentropy", 
  metrics = "accuracy"
)

model %>% fit(
  x = x,
  y = y,
  epochs = 50
)

Although we usually talk about “binary classification”, the way the outcome is normally modeled is as a Bernoulli random variable, conditioned on the input data. So:

\[P(y = 1|\mathbf{x}) = p, \ 0\leq p\leq1\]

A Bernoulli random variable takes values between \(0\) and \(1\). So that's what our network should produce. One idea might be to simply clip all values of \(\mathbf{w}^T\mathbf{h} + b\) that fall outside of this interval. But if we do this, the gradient in these regions will be \(0\): the network cannot learn.

A better approach is to squeeze the complete incoming interval into the range (0,1), using the logistic sigmoid function

\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]

The sigmoid function squeezes its input into the interval (0,1).

As you can see, the sigmoid function saturates when its input gets very large or very small. Is this a problem? It depends. In the end, what we care about is whether the cost function saturates. Were we to choose mean squared error here, as in the regression task above, that could indeed be the case.
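
To make this concrete, here is a quick numeric check in plain R (a standalone sketch, independent of the model code above):

# the logistic sigmoid
sigmoid <- function(x) 1 / (1 + exp(-x))

z <- c(-10, -2, 0, 2, 10)
round(sigmoid(z), 5)
# outputs are squeezed into (0,1); at -10 and 10 they are essentially 0 and 1

# the gradient, sigmoid(z) * (1 - sigmoid(z)), is practically 0 at the extremes
round(sigmoid(z) * (1 - sigmoid(z)), 5)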

However, if we follow the general principle of maximum likelihood / cross-entropy, the loss will be

\[-\log P(y|\mathbf{x})\]

where the \(\log\) undoes the \(\exp\) in the sigmoid.

In Keras, the corresponding loss function is binary_crossentropy. For a single item, the loss will be

  • \(-\log(p)\) when the ground truth is 1
  • \(-\log(1-p)\) when the ground truth is 0.

Here you can see that when, for an individual example, the network predicts the wrong class and is very confident about it, this example will contribute very strongly to the loss.

Cross-entropy penalizes incorrect predictions most when they are highly confident.
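
A small sketch in plain R (independent of Keras) illustrates this:

# per-item binary crossentropy: -log(p) if y is 1, -log(1 - p) if y is 0
bce <- function(y, p) -(y * log(p) + (1 - y) * log(1 - p))

bce(y = 1, p = 0.99)  # confident and correct: ~ 0.01
bce(y = 1, p = 0.5)   # undecided: ~ 0.69
bce(y = 1, p = 0.01)  # confident and wrong: ~ 4.6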

What happens when we differentiate between more than two classes?

Multi-class classification

CIFAR-10 has 10 classes. So now, we want to decide which of 10 object classes is present in the image.

Here's the code first: Not much different from above, but note the changes in the activation and cost function.

cifar10 <- dataset_cifar10()

x_train <- cifar10$train$x / 255
y_train <- cifar10$train$y

model <- keras_model_sequential() %>%
  layer_conv_2d(
    filter = 8,
    kernel_size = c(3, 3),
    padding = "same",
    input_shape = c(32, 32, 3),
    activation = "relu"
  ) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(
    filter = 8,
    kernel_size = c(3, 3),
    padding = "same",
    activation = "relu"
  ) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax")

model %>% compile(
  optimizer = "adam",
  loss = "sparse_categorical_crossentropy",
  metrics = "accuracy"
)

model %>% fit(
  x = x_train,
  y = y_train,
  epochs = 50
)

So now we have softmax combined with categorical crossentropy. Why?

Again, we want a valid probability distribution: the probabilities of all disjoint events should sum to 1.

CIFAR-10 has one object per image, so the events are disjoint. Then we have a single draw from a multinomial distribution (a.k.a. “multinoulli”, mostly due to Murphy's Machine Learning (Murphy 2012)), which can be modeled by the softmax activation:

\[softmax(\mathbf{z})_i = \frac{e^{z_i}}{\sum_j{e^{z_j}}}\]

Just like the sigmoid, the softmax can saturate. In this case, that happens when differences between the outputs become very large. As with the sigmoid, it is a \(\log\) in the cost function that undoes the \(\exp\) responsible for the saturation:

\[\log\ softmax(\mathbf{z})_i = z_i - \log\sum_j{e^{z_j}}\]

Here \(z_i\) is the class whose probability we are estimating – we see that its contribution to the loss is linear and, thus, can never saturate.

In Keras, the loss function that does this for us is called categorical_crossentropy. In the code, we use sparse_categorical_crossentropy, which is the same as categorical_crossentropy but does not require the integer labels to be converted to one-hot vectors.
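
For reference, here is a sketch of the equivalent one-hot setup, reusing the model defined above:

# convert the integer labels 0-9 into one-hot vectors
y_train_onehot <- to_categorical(y_train, num_classes = 10)

model %>% compile(
  optimizer = "adam",
  loss = "categorical_crossentropy",
  metrics = "accuracy"
)

model %>% fit(
  x = x_train,
  y = y_train_onehot,
  epochs = 50
)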

Let's take a closer look at what softmax does. Suppose these are the raw outputs of our 10 output units:

Simulated output before applying the softmax.

And this is what the normalized probability distribution looks like after applying the softmax:

Final output after softmax.
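
To reproduce this behavior yourself, here is a minimal softmax in plain R, applied to made-up logits (the values are chosen for illustration only):

softmax <- function(z) {
  # subtract the maximum for numerical stability; this does not change the result
  exps <- exp(z - max(z))
  exps / sum(exps)
}

logits <- c(1.2, 0.3, 3.5, -0.8, 0.9, 2.1, 0.1, -1.5, 0.4, 1.0)
round(softmax(logits), 3)
# the unit with the largest raw output ends up with by far the most probability mass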

Do you see where the expression winner takes all comes from? This is an important point to keep in mind: activation functions are not just there to produce certain desired distributions; they can also change the relationships between values.

Conclusion

We started this post by alluding to common heuristics, such as “For multi-class classification, we use softmax activation, combined with categorical crossentropy as the loss function.” Hopefully, we've been able to show why these heuristics make sense.

However, knowing this background, you can also figure out when these rules do not apply. For example, say you want to detect several objects in an image. In that case, the winner takes all strategy is not the most useful, because we don't want to exaggerate the differences between candidates. So here, we would use a sigmoid on all output units instead, to determine a probability of presence per object.
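
As a minimal sketch (assuming a hypothetical number of object types, n_classes, and multi-hot target vectors), only the output layer and the loss change:

n_classes <- 5  # hypothetical number of object types

model <- keras_model_sequential() %>%
  # ... convolutional feature extractor as before ...
  layer_flatten() %>%
  layer_dense(units = 32, activation = "relu") %>%
  # one sigmoid per class: each output is an independent Bernoulli probability
  layer_dense(units = n_classes, activation = "sigmoid")

model %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = "accuracy"
)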

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

Murphy, Kevin. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
