Predicting fraud with autoencoders and Keras

Overview


In this post we will train an autoencoder to detect credit card fraud. We will also demonstrate how to train Keras models in the cloud using CloudML.

Our model will be based on the Kaggle Credit Card Fraud Detection dataset, which was collected during a research collaboration between Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.

The dataset contains credit card transactions made by European cardholders over a two-day period in September 2013. Out of 284,807 transactions, 492 are fraudulent. The dataset is highly unbalanced, with the positive class (frauds) accounting for only 0.172% of all transactions.

Reading data

After downloading the data from Kaggle, you can read it into R with read_csv():

library(readr)
df <- read_csv("data-raw/creditcard.csv", col_types = list(Time = col_number()))

The input variables consist only of numerical values which are the result of a PCA transformation. In order to preserve confidentiality, no further information about the original features is provided. The features V1, …, V28 were obtained with PCA. There are, however, two features (Time and Amount) that were not transformed.
Time contains the seconds elapsed between each transaction and the first transaction in the dataset. Amount is the transaction amount and could be used for cost-sensitive learning. The Class variable takes the value 1 in case of fraud and 0 otherwise.
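As a quick sanity check of the class imbalance described above, we can tabulate the Class variable (a minimal sketch, assuming the df data frame read in earlier):

# Count and proportion of normal (0) vs. fraudulent (1) transactions.
table(df$Class)
prop.table(table(df$Class))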

Autoencoders

Since only 0.172% of observations are fraudulent, we have a highly unbalanced classification problem. With this type of problem, traditional classification methods usually don't work very well because we only have a very small sample of the rare class.

An autoencoder is a neural network used to learn (encode) a representation of a set of data, usually for the purpose of dimensionality reduction. For this problem we will train an autoencoder to encode non-fraudulent observations from our training set. Since frauds are assumed to have a different distribution than normal transactions, we expect our autoencoder to have higher reconstruction errors on frauds than on normal transactions. This means we can use the reconstruction error as a quantity that indicates whether a transaction is fraudulent or not.

If you want to learn more about autoencoders, a good starting point is this video from Larochelle on YouTube and Chapter 14 of the Deep Learning book by Goodfellow et al.

Visualization

For the autoencoder to work well we have a strong initial assumption: that the distribution of variables for normal transactions is different from the distribution for fraudulent ones. Let's make some plots to verify this. Variables were transformed to the [0,1] interval for plotting.
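One possible way to produce these plots (a sketch, not necessarily the exact code used for the figure below) is to reshape the data to long format, scale each variable to [0, 1], and compare its density for the two classes:

library(dplyr)
library(tidyr)
library(ggplot2)

df %>%
  gather(variable, value, -Class) %>%                                    # long format: one row per (variable, value)
  group_by(variable) %>%
  mutate(value = (value - min(value)) / (max(value) - min(value))) %>%   # scale each variable to [0, 1]
  ungroup() %>%
  ggplot(aes(x = value, fill = as.factor(Class))) +
  geom_density(alpha = 0.3) +
  facet_wrap(~ variable, scales = "free_y") +
  labs(fill = "Class")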

[Figure: density of each variable for fraudulent vs. normal transactions]

We can see that the distributions of the variables for fraudulent transactions are very different from the normal ones, except for the Time variable, which appears to have exactly the same distribution.

Pre-processing

Before the modeling steps we need to do some pre-processing. We will split the dataset into train and test sets and then min-max normalize our data (this is done because neural networks work much better with small input values). We will also remove the Time variable, as it has exactly the same distribution for normal and fraudulent transactions.

Based on the Time variable, we will use the first 200,000 observations for training and the rest for testing. This is good practice because when using the model we want to predict future frauds based on transactions that happened before.
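A sketch of this split using dplyr (assuming the df data frame read above; the df_train / df_test names are illustrative, chosen to match the later code):

library(dplyr)

# Use the first 200,000 transactions (in time order) for training and the
# rest for testing; drop the Time column as discussed above.
df_train <- df %>% filter(row_number(Time) <= 200000) %>% select(-Time)
df_test  <- df %>% filter(row_number(Time) >  200000) %>% select(-Time)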

Now let's work on normalizing the inputs. We created two functions to help us. The first obtains descriptive statistics about the dataset that are used for scaling. Then we have a function to perform the min-max scaling. It is important to note that we apply the same normalization constants to both the training and test sets.

library(purrr)

#' Gets descriptive statistics for every variable in the dataset.
get_desc <- function(x) {
  map(x, ~list(
    min = min(.x),
    max = max(.x),
    mean = mean(.x),
    sd = sd(.x)
  ))
} 

#' Given a dataset and normalization constants it will create a min-max normalized
#' version of the dataset.
normalization_minmax <- function(x, desc) {
  map2_dfc(x, desc, ~(.x - .y$min)/(.y$max - .y$min))
}

Now let's create normalized versions of our datasets. We also convert our data frames to matrices, since this is the format expected by Keras.
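Putting the pieces together, a sketch of this step might look like the following (assuming the df_train / df_test split above; the x_train, x_test, y_train and y_test names match the ones used in the later code):

# Compute normalization constants on the training set only, then apply the
# same constants to both sets and convert to matrices for Keras.
desc <- df_train %>%
  select(-Class) %>%
  get_desc()

x_train <- df_train %>% select(-Class) %>% normalization_minmax(desc) %>% as.matrix()
x_test  <- df_test  %>% select(-Class) %>% normalization_minmax(desc) %>% as.matrix()
y_train <- df_train$Class
y_test  <- df_test$Class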

We will now define our model in Keras, a symmetric autoencoder with 4 dense layers.

library(keras)
model <- keras_model_sequential()
model %>%
  layer_dense(units = 15, activation = "tanh", input_shape = ncol(x_train)) %>%
  layer_dense(units = 10, activation = "tanh") %>%
  layer_dense(units = 15, activation = "tanh") %>%
  layer_dense(units = ncol(x_train))

summary(model)
___________________________________________________________________________________
Layer (type)                         Output Shape                     Param #      
===================================================================================
dense_1 (Dense)                      (None, 15)                       450          
___________________________________________________________________________________
dense_2 (Dense)                      (None, 10)                       160          
___________________________________________________________________________________
dense_3 (Dense)                      (None, 15)                       165          
___________________________________________________________________________________
dense_4 (Dense)                      (None, 29)                       464          
===================================================================================
Total params: 1,239
Trainable params: 1,239
Non-trainable params: 0
___________________________________________________________________________________

We will then compile our model, using the mean squared error loss and the Adam optimizer for training.

model %>% compile(
  loss = "mean_squared_error", 
  optimizer = "adam"
)

Model training

Now we can train our model using the fit() function. Training the model is reasonably fast (~14s per epoch on my laptop). We will only feed our model observations of normal (non-fraudulent) transactions.

We will use callback_model_checkpoint() in order to save our model after each epoch. By passing the argument save_best_only = TRUE we keep on disk only the epoch with the smallest loss value on the test set. We will also use callback_early_stopping() to stop training if the validation loss stops decreasing for 5 epochs.

checkpoint <- callback_model_checkpoint(
  filepath = "model.hdf5", 
  save_best_only = TRUE, 
  period = 1,
  verbose = 1
)

early_stopping <- callback_early_stopping(patience = 5)

model %>% fit(
  x = x_train[y_train == 0,], 
  y = x_train[y_train == 0,], 
  epochs = 100, 
  batch_size = 32,
  validation_data = list(x_test[y_test == 0,], x_test[y_test == 0,]), 
  callbacks = list(checkpoint, early_stopping)
)
Train on 199615 samples, validate on 84700 samples
Epoch 1/100
199615/199615 [==============================] - 17s 83us/step - loss: 0.0036 - val_loss: 6.8522e-04
Epoch 00001: val_loss improved from inf to 0.00069, saving model to model.hdf5
Epoch 2/100
199615/199615 [==============================] - 17s 86us/step - loss: 4.7817e-04 - val_loss: 4.7266e-04
Epoch 00002: val_loss improved from 0.00069 to 0.00047, saving model to model.hdf5
Epoch 3/100
199615/199615 [==============================] - 19s 94us/step - loss: 3.7753e-04 - val_loss: 4.2430e-04
Epoch 00003: val_loss improved from 0.00047 to 0.00042, saving model to model.hdf5
Epoch 4/100
199615/199615 [==============================] - 19s 94us/step - loss: 3.3937e-04 - val_loss: 4.0299e-04
Epoch 00004: val_loss improved from 0.00042 to 0.00040, saving model to model.hdf5
Epoch 5/100
199615/199615 [==============================] - 19s 94us/step - loss: 3.2259e-04 - val_loss: 4.0852e-04
Epoch 00005: val_loss did not improve
Epoch 6/100
199615/199615 [==============================] - 18s 91us/step - loss: 3.1668e-04 - val_loss: 4.0746e-04
Epoch 00006: val_loss did not improve
...

After training we can get the final loss for the test set by using the evaluate() function.

loss <- evaluate(model, x = x_test[y_test == 0,], y = x_test[y_test == 0,])
loss
        loss 
0.0003534254 

Tuning with CloudML

We may be able to get better results by tuning our model hyperparameters. We can tune, for example, the normalization function, the learning rate, the activation functions and the size of the hidden layers. CloudML uses Bayesian optimization to tune model hyperparameters, as described in this blog post.

We can use the cloudml package to tune our model, but first we need to prepare our project by creating a training flag for each hyperparameter and a tuning.yml file that will tell CloudML which parameters we want to tune and how.

The complete script used for training on CloudML can be found at https://github.com/dfalbel/fraud-autoencoder-example. The most important modification to the code was adding the training flags:

FLAGS <- flags(
  flag_string("normalization", "minmax", "One of minmax, zscore"),
  flag_string("activation", "relu", "One of relu, selu, tanh, sigmoid"),
  flag_numeric("learning_rate", 0.001, "Optimizer Learning Rate"),
  flag_integer("hidden_size", 15, "The hidden layer size")
)

We then use the FLAGS variable inside the script to drive the model's hyperparameters, for example:

model %>% compile(
  optimizer = optimizer_adam(lr = FLAGS$learning_rate), 
  loss = 'mean_squared_error'
)

We also created a tuning.yml file specifying how each hyperparameter should vary during training, as well as which metric we want to optimize (in this case the validation loss, val_loss).

tuning.yml

trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  hyperparameters:
    goal: MINIMIZE
    hyperparameterMetricTag: val_loss
    maxTrials: 10
    maxParallelTrials: 5
    params:
      - parameterName: normalization
        type: CATEGORICAL
        categoricalValues: [zscore, minmax]
      - parameterName: activation
        type: CATEGORICAL
        categoricalValues: [relu, selu, tanh, sigmoid]
      - parameterName: learning_rate
        type: DOUBLE
        minValue: 0.000001
        maxValue: 0.1
        scaleType: UNIT_LOG_SCALE
      - parameterName: hidden_size
        type: INTEGER
        minValue: 5
        maxValue: 50
        scaleType: UNIT_LINEAR_SCALE

We specify the type of machine we want to use (in this case a standard_gpu instance), the metric we want to minimize during tuning (in this case val_loss), and the maximum number of trials (i.e., how many hyperparameter combinations we want to test). We then specify how each hyperparameter should vary during tuning.

You can learn more about the tuning.yml file in the TensorFlow for R documentation and in Google's official CloudML documentation.

Now we are ready to send the job to Google CloudML. We can do this by running:

library(cloudml)
cloudml_train("train.R", config = "tuning.yml")

The cloudml package takes care of uploading the dataset and installing any R package dependencies required to run the script on CloudML. If you are using RStudio v1.1 or higher, it also lets you monitor your job in a background terminal. You can also monitor your job using the Google Cloud Console.

After the job is finished, we can collect the job results.
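With the cloudml package, this is presumably a single call that downloads the results of the most recent job (a sketch; the exact arguments may differ):

job_collect()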

This will copy the files from the job with the best val_loss to your local system and open a report summarizing the training run.

[Figure: CloudML training run report]

Since we used a callback to save model checkpoints during training, the model file was also copied back from Google CloudML. Files created during training are copied to the "runs" subdirectory of the working directory from which cloudml_train() is called. You can find this directory for the most recent run from R.
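One possible way, using the tfruns package that the Keras training-run tooling builds on (an assumption; this sketch may differ from the call used originally):

# Directory of the most recent training run.
latest_run()$run_dir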

[1] runs/cloudml_2018_01_23_221244595-03

You can also list all previous runs and their validation losses:

ls_runs(order = metric_val_loss, decreasing = FALSE)
                    run_dir metric_loss metric_val_loss
1 runs/2017-12-09T21-01-11Z      0.2577          0.1482
2 runs/2017-12-09T21-00-11Z      0.2655          0.1505
3 runs/2017-12-09T19-59-44Z      0.2597          0.1402
4 runs/2017-12-09T19-56-48Z      0.2610          0.1459

Use View(ls_runs()) to view all columns

In our case the job downloaded from CloudML was saved to runs/cloudml_2018_01_23_221244595-03/, so the saved model is available at runs/cloudml_2018_01_23_221244595-03/model.hdf5. We can now use our tuned model to make predictions.

Making predictions

Now that we have trained and tuned our model we are ready to generate predictions with our autoencoder. We are interested in the MSE for each observation, and we expect the MSE to be higher for observations of fraudulent transactions.

First, let's load our model.

model <- load_model_hdf5("runs/cloudml_2018_01_23_221244595-03/model.hdf5", 
                         compile = FALSE)

Now let's calculate the MSE for the training and test set observations.

pred_train <- predict(model, x_train)
mse_train <- apply((x_train - pred_train)^2, 1, sum)

pred_test <- predict(model, x_test)
mse_test <- apply((x_test - pred_test)^2, 1, sum)

A good measure of model performance on highly unbalanced datasets is the area under the ROC curve (AUC). AUC has a nice interpretation for this problem: it is the probability that a fraudulent transaction has a higher MSE than a normal one. We can calculate it using the Metrics package, which implements a wide variety of common machine learning model performance metrics.
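A minimal sketch of this calculation, assuming the Metrics package's auc(actual, predicted) function and the MSE vectors computed above (the two values below correspond to the training and test sets, respectively):

library(Metrics)

# AUC using the reconstruction error as the score.
auc(y_train, mse_train)
auc(y_test, mse_test)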

[1] 0.9546814
[1] 0.9403554

In order to use the model in practice for making predictions we need to find a threshold \(k\) for the MSE: if \(MSE > k\) we consider the transaction fraudulent, otherwise we consider it normal. To define this value it is useful to look at precision and recall while varying the threshold \(k\).

possible_k <- seq(0, 0.5, length.out = 100)
precision <- sapply(possible_k, function(k) {
  predicted_class <- as.numeric(mse_test > k)
  sum(predicted_class == 1 & y_test == 1)/sum(predicted_class)
})

qplot(possible_k, precision, geom = "line") +
  labs(x = "Threshold", y = "Precision")

[Figure: precision as a function of the threshold k]

recall <- sapply(possible_k, function(k) {
  predicted_class <- as.numeric(mse_test > k)
  sum(predicted_class == 1 & y_test == 1)/sum(y_test)
})
qplot(possible_k, recall, geom = "line") +
  labs(x = "Threshold", y = "Recall")

[Figure: recall as a function of the threshold k]

A good starting point would be to choose the threshold with maximum precision, but we could also base our decision on how much money we stand to lose from fraudulent transactions.

Let's say each manual fraud verification costs us $1, but if we don't verify a transaction and it turns out to be fraud, we lose the value of that transaction. Let's find, for each threshold value, how much money we would lose.

cost_per_verification <- 1

lost_money <- sapply(possible_k, function(k) {
  predicted_class <- as.numeric(mse_test > k)
  sum(cost_per_verification * predicted_class + (predicted_class == 0) * y_test * df_test$Amount) 
})

qplot(possible_k, lost_money, geom = "line") + labs(x = "Threshold", y = "Lost Money")

[Figure: lost money as a function of the threshold k]

We can find the optimal threshold in this case:
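One way to do this, using the possible_k and lost_money vectors computed above (a minimal sketch):

# Threshold value that minimizes the simulated loss.
possible_k[which.min(lost_money)]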

[1] 0.005050505

If we had to manually verify all frauds, it would cost us ~$13,000. Using our model we can reduce this to ~$2,500.

