Classifying duplicate questions from Quora with Keras

Introduction

In this post we will use Keras to classify duplicate questions from Quora. The dataset first appeared in a Kaggle competition. It consists of about 400,000 pairs of Quora questions, with a column indicating whether a pair is considered a duplicate.

Our implementation is inspired by the Siamese recurrent architecture, with modifications to the similarity measure and the embedding layers (the original paper uses pre-trained word vectors). This kind of architecture dates back to 2005 with Le Cun et al. and is useful for verification tasks. The idea is to learn a function that maps input patterns into a target space such that a similarity measure in the target space approximates the “semantic” distance in the input space.

After the competition, Quora also described their approach to this problem in a blog post.

Downloading data

The data can be downloaded from the Kaggle dataset web page or from Quora's dataset release:

library(keras)
quora_data <- get_file(
  "quora_duplicate_questions.tsv",
  "
)

We use the Keras get_file() function so that the downloaded file is cached.

Reading and preprocessing

We will first load the data into R and do some preprocessing to make it easier to feed into the model. After downloading the data, you can read it with the readr read_tsv() function.
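A minimal sketch of that step, assuming the readr package is installed and quora_data holds the path returned by get_file() above:

library(readr)

# read the TSV of question pairs into a data frame
df <- read_tsv(quora_data)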

We will create a Keras tokenizer to transform each word into an integer token. We will also specify the first hyperparameter of our model: the vocabulary size. For now let's use the 50,000 most common words (we'll tune this parameter later). The tokenizer will be fit using all of the unique questions in the dataset.

tokenizer <- text_tokenizer(num_words = 50000)
tokenizer %>% fit_text_tokenizer(unique(c(df$question1, df$question2)))

Let's save the tokenizer to disk so that it can be used for inference later.

save_text_tokenizer(tokenizer, "tokenizer-question-pairs")

Now we'll use the tokenizer to convert each question into a sequence of integers.

question1 <- texts_to_sequences(tokenizer, df$question1)
question2 <- texts_to_sequences(tokenizer, df$question2)

Let's take a look at the number of words in each question. This will help us decide the padding length, another hyperparameter of our model. Padding the sequences normalizes them to the same size so we can feed them into the Keras model.
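One quick way to inspect the length distribution is to look at quantiles of the per-question token counts; a sketch using purrr (the quantile probabilities are chosen to match the table below):

library(purrr)

# number of tokens in each question, pooled across both columns
questions_length <- c(
  map_int(question1, length),
  map_int(question2, length)
)

quantile(questions_length, c(0.8, 0.9, 0.95, 0.99))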

80% 90% 95% 99% 
 14  18  23  31 

We can see that 99% of the questions have at most 31 words, so we'll choose a padding length between 15 and 30. Let's start with 20 (we'll also tune this parameter later). The default padding value is 0, but we are already using that value for words that don't appear among the 50,000 most frequent, so we'll use 50,001 instead.

question1_padded <- pad_sequences(question1, maxlen = 20, value = 50000 + 1)
question2_padded <- pad_sequences(question2, maxlen = 20, value = 50000 + 1)

This completes the preprocessing steps. We will now run a simple benchmark model before moving on to the Keras model.

Simple benchmark

Before building a complicated model, let's take a simple approach. We'll create two predictors: the percentage of words from question 1 that appear in question 2, and vice versa. Then we will use a logistic regression to predict whether the questions are duplicates.

library(purrr) # for map2_dbl()
library(dplyr) # for the pipe operator used below

perc_words_question1 <- map2_dbl(question1, question2, ~mean(.x %in% .y))
perc_words_question2 <- map2_dbl(question2, question1, ~mean(.x %in% .y))

df_model <- data.frame(
  perc_words_question1 = perc_words_question1,
  perc_words_question2 = perc_words_question2,
  is_duplicate = df$is_duplicate
) %>%
  na.omit()

Now that we have predictors, let's build the logistic model. We will take a small sample for validation.

val_sample <- sample.int(nrow(df_model), 0.1*nrow(df_model))
logistic_regression <- glm(
  is_duplicate ~ perc_words_question1 + perc_words_question2, 
  family = "binomial",
  data = df_model[-val_sample,]
)
summary(logistic_regression)
Call:
glm(formula = is_duplicate ~ perc_words_question1 + perc_words_question2, 
    family = "binomial", data = df_model[-val_sample, ])

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5938  -0.9097  -0.6106   1.1452   2.0292  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -2.259007   0.009668 -233.66   <2e-16 ***
perc_words_question1  1.517990   0.023038   65.89   <2e-16 ***
perc_words_question2  1.681410   0.022795   73.76   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 479158  on 363843  degrees of freedom
Residual deviance: 431627  on 363841  degrees of freedom
  (17 observations deleted due to missingness)
AIC: 431633

Number of Fisher Scoring iterations: 3

Let's calculate the accuracy on our validation set.

pred <- predict(logistic_regression, df_model[val_sample,], type = "response")
pred <- pred > mean(df_model$is_duplicate[-val_sample])
accuracy <- table(pred, df_model$is_duplicate[val_sample]) %>% 
  prop.table() %>% 
  diag() %>% 
  sum()
accuracy
[1] 0.6573577

We got an accuracy of 65.7%, which is not much better than random guessing. Now let's create our model in Keras.

Model definition

We will use a Siamese network to predict whether the pairs are duplicates or not. The idea is to create a model that can embed the questions (sequences of words) into vectors. Then we can compare the vectors for each question using a similarity measure and tell whether the questions are duplicates or not.

First we define the inputs for the model.

input1 <- layer_input(shape = c(20), name = "input_question1")
input2 <- layer_input(shape = c(20), name = "input_question2")

Then let's define the part of the model that will embed the questions into vectors.

word_embedder <- layer_embedding( 
  input_dim = 50000 + 2, # vocab size + UNK token + padding value
  output_dim = 128,      # hyperparameter - embedding size
  input_length = 20,     # padding size,
  embeddings_regularizer = regularizer_l2(0.0001) # hyperparameter - regularization 
)

seq_embedder <- layer_lstm(
  units = 128, # hyperparameter -- sequence embedding size
  kernel_regularizer = regularizer_l2(0.0001) # hyperparameter - regularization 
)

Now we define the relationship between the input vectors and the embedding layers. Note that we use the same layers and weights for both inputs; that's why this is called a Siamese network. It makes sense, because we don't want the output to change if question 1 is swapped with question 2.

vector1 <- input1 %>% word_embedder() %>% seq_embedder()
vector2 <- input2 %>% word_embedder() %>% seq_embedder()

Then we define the similarity measure we want to optimize. We want duplicate questions to have high similarity values. In this example we will use the cosine similarity, but any similarity measure could be used. Remember that the cosine similarity is the normalized dot product of the vectors, but for training it's not necessary to normalize the results.

cosine_similarity <- layer_dot(list(vector1, vector2), axes = 1)

Next, we define a final sigmoid layer that outputs the probability that the two questions are duplicates.

output <- cosine_similarity %>% 
  layer_dense(units = 1, activation = "sigmoid")

Now let's define the Keras model in terms of its inputs and outputs and compile it. In the compilation phase we define our loss function and optimizer. As in the Kaggle challenge, we will minimize the logloss (equivalent to minimizing the binary cross-entropy). We will use the Adam optimizer.

model <- keras_model(list(input1, input2), output)
model %>% compile(
  optimizer = "adam", 
  metrics = list(acc = metric_binary_accuracy), 
  loss = "binary_crossentropy"
)

Next we can take a look at the model.
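The summary() function prints each layer with its output shape and parameter count:

summary(model)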

_______________________________________________________________________________________
Layer (type)                Output Shape       Param #    Connected to                 
=======================================================================================
input_question1 (InputLayer (None, 20)         0                                       
_______________________________________________________________________________________
input_question2 (InputLayer (None, 20)         0                                       
_______________________________________________________________________________________
embedding_1 (Embedding)     (None, 20, 128)    6400256    input_question1[0][0]        
                                                          input_question2[0][0]        
_______________________________________________________________________________________
lstm_1 (LSTM)               (None, 128)        131584     embedding_1[0][0]            
                                                          embedding_1[1][0]            
_______________________________________________________________________________________
dot_1 (Dot)                 (None, 1)          0          lstm_1[0][0]                 
                                                          lstm_1[1][0]                 
_______________________________________________________________________________________
dense_1 (Dense)             (None, 1)          2          dot_1[0][0]                  
=======================================================================================
Total params: 6,531,842
Trainable params: 6,531,842
Non-trainable params: 0
_______________________________________________________________________________________

Model fitting

Now we will fit and tune our model. But before proceeding, let's take a sample for validation.

set.seed(1817328)
val_sample <- sample.int(nrow(question1_padded), size = 0.1*nrow(question1_padded))

train_question1_padded <- question1_padded[-val_sample,]
train_question2_padded <- question2_padded[-val_sample,]
train_is_duplicate <- df$is_duplicate[-val_sample]

val_question1_padded <- question1_padded[val_sample,]
val_question2_padded <- question2_padded[val_sample,]
val_is_duplicate <- df$is_duplicate[val_sample]

Now we use the fit() function to train the model:

model %>% fit(
  list(train_question1_padded, train_question2_padded),
  train_is_duplicate, 
  batch_size = 64, 
  epochs = 10, 
  validation_data = list(
    list(val_question1_padded, val_question2_padded), 
    val_is_duplicate
  )
)
Train on 363861 samples, validate on 40429 samples
Epoch 1/10
363861/363861 [==============================] - 89s 245us/step - loss: 0.5860 - acc: 0.7248 - val_loss: 0.5590 - val_acc: 0.7449
Epoch 2/10
363861/363861 [==============================] - 88s 243us/step - loss: 0.5528 - acc: 0.7461 - val_loss: 0.5472 - val_acc: 0.7510
Epoch 3/10
363861/363861 [==============================] - 88s 242us/step - loss: 0.5428 - acc: 0.7536 - val_loss: 0.5439 - val_acc: 0.7515
Epoch 4/10
363861/363861 [==============================] - 88s 242us/step - loss: 0.5353 - acc: 0.7595 - val_loss: 0.5358 - val_acc: 0.7590
Epoch 5/10
363861/363861 [==============================] - 88s 242us/step - loss: 0.5299 - acc: 0.7633 - val_loss: 0.5358 - val_acc: 0.7592
Epoch 6/10
363861/363861 [==============================] - 88s 242us/step - loss: 0.5256 - acc: 0.7662 - val_loss: 0.5309 - val_acc: 0.7631
Epoch 7/10
363861/363861 [==============================] - 88s 242us/step - loss: 0.5211 - acc: 0.7701 - val_loss: 0.5349 - val_acc: 0.7586
Epoch 8/10
363861/363861 [==============================] - 88s 242us/step - loss: 0.5173 - acc: 0.7733 - val_loss: 0.5278 - val_acc: 0.7667
Epoch 9/10
363861/363861 [==============================] - 88s 242us/step - loss: 0.5138 - acc: 0.7762 - val_loss: 0.5292 - val_acc: 0.7667
Epoch 10/10
363861/363861 [==============================] - 88s 242us/step - loss: 0.5092 - acc: 0.7794 - val_loss: 0.5313 - val_acc: 0.7654

After training completes, we can save our model for inference with the save_model_hdf5() function.

save_model_hdf5(model, "model-question-pairs.hdf5")

Model tuning

Now that we have a reasonable model, let's tune the hyperparameters using the tfruns package. We'll begin by adding FLAGS declarations to our script for all the hyperparameters we want to tune (FLAGS let us vary hyperparameters without changing our source code):

FLAGS <- flags(
  flag_integer("vocab_size", 50000),
  flag_integer("max_len_padding", 20),
  flag_integer("embedding_size", 256),
  flag_numeric("regularization", 0.0001),
  flag_integer("seq_embedding_size", 512)
)

With this FLAGS definition in place, we can now write our code in terms of the flags. For example:

input1 <- layer_input(shape = c(FLAGS$max_len_padding))
input2 <- layer_input(shape = c(FLAGS$max_len_padding))

embedding <- layer_embedding(
  input_dim = FLAGS$vocab_size + 2, 
  output_dim = FLAGS$embedding_size, 
  input_length = FLAGS$max_len_padding, 
  embeddings_regularizer = regularizer_l2(l = FLAGS$regularization)
)

The complete source code of the script with FLAGS can be found here.

We also added an early stopping callback in the training step to stop training if the validation loss doesn't decrease for 5 epochs in a row. This will hopefully reduce training time for bad models. We also added a learning rate reducer to reduce the learning rate by a factor of 10 when the loss doesn't decrease for 3 epochs (this technique usually increases model accuracy).

model %>% fit(
  ...,
  callbacks = list(
    callback_early_stopping(patience = 5),
    callback_reduce_lr_on_plateau(patience = 3)
  )
)

We can now execute a tuning run to probe for the best combination of hyperparameters. We call the tuning_run() function, passing a list with the possible values for each flag. The tuning_run() function will be responsible for executing the script for all combinations of hyperparameters. We also specify the sample parameter to train the model on only a random sample of all combinations, reducing training time significantly.

library(tfruns)

runs <- tuning_run(
  "question-pairs.R", 
  flags = list(
    vocab_size = c(30000, 40000, 50000, 60000),
    max_len_padding = c(15, 20, 25),
    embedding_size = c(64, 128, 256),
    regularization = c(0.00001, 0.0001, 0.001),
    seq_embedding_size = c(128, 256, 512)
  ), 
  runs_dir = "tuning", 
  sample = 0.2
)
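Once the runs finish, a quick way to see the best hyperparameter combinations first (a sketch, assuming the default Keras metric naming, in which validation accuracy appears in the metric_val_acc column of the results):

# sort the runs by validation accuracy, best first
head(runs[order(runs$metric_val_acc, decreasing = TRUE), ])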

The tuning run returns a data.frame with the results of all runs. We found that the best run achieved 84.9% accuracy using the combination of hyperparameters shown below, so we modified our training script to use these values as the defaults:

FLAGS <- flags(
  flag_integer("vocab_size", 50000),
  flag_integer("max_len_padding", 20),
  flag_integer("embedding_size", 256),
  flag_numeric("regularization", 1e-4),
  flag_integer("seq_embedding_size", 512)
)

Making predictions

Now that we have trained and tuned our model we can start making predictions. At prediction time we will load the text tokenizer and the model we saved to disk earlier.

library(keras)
model <- load_model_hdf5("model-question-pairs.hdf5", compile = FALSE)
tokenizer <- load_text_tokenizer("tokenizer-question-pairs")

Since we won't continue training the model, we specified the compile = FALSE argument.

Now let's define a function to make predictions. In this function we preprocess the input data in the same way we preprocessed the training data.

predict_question_pairs <- function(model, tokenizer, q1, q2) {
  q1 <- texts_to_sequences(tokenizer, list(q1))
  q2 <- texts_to_sequences(tokenizer, list(q2))
  
  q1 <- pad_sequences(q1, 20)
  q2 <- pad_sequences(q2, 20)
  
  as.numeric(predict(model, list(q1, q2)))
}

We can now call it with a new pair of questions, for example:

predict_question_pairs(
  model,
  tokenizer,
  "What's R programming?",
  "What's R in programming?"
)
[1] 0.9784008

Prediction is quite fast (~40 milliseconds).
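If you want to check this on your own machine, one option is the microbenchmark package (a sketch; this assumes microbenchmark is installed and reuses the example pair from above):

library(microbenchmark)

# time repeated predictions for a single question pair
microbenchmark(
  predict_question_pairs(
    model,
    tokenizer,
    "What's R programming?",
    "What's R in programming?"
  ),
  times = 10
)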

Model deployment

To demonstrate deployment of the trained model, we created a simple Shiny application in which you can paste two Quora-style questions and see the probability that they are duplicates. Try changing the questions or entering two completely different ones.

The source code of the Shiny application is available at https://github.com/dfalbel/shiny-quora-question-pairs.

Note that when deploying a Keras model you only need to load a pre-saved model file and tokenizer (no training data or model training steps are required).
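As an illustration of what such an app can look like, here is a minimal sketch (not the actual application from the repository above): it assumes the saved model-question-pairs.hdf5 and tokenizer-question-pairs files sit next to app.R, and the input/output IDs are made up.

library(shiny)
library(keras)

# load the artifacts saved earlier; no training code is needed here
model <- load_model_hdf5("model-question-pairs.hdf5", compile = FALSE)
tokenizer <- load_text_tokenizer("tokenizer-question-pairs")

# same preprocessing as at training time (see predict_question_pairs() above)
predict_question_pairs <- function(model, tokenizer, q1, q2) {
  q1 <- texts_to_sequences(tokenizer, list(q1))
  q2 <- texts_to_sequences(tokenizer, list(q2))
  q1 <- pad_sequences(q1, 20)
  q2 <- pad_sequences(q2, 20)
  as.numeric(predict(model, list(q1, q2)))
}

ui <- fluidPage(
  textInput("question1", "Question 1", "What's R programming?"),
  textInput("question2", "Question 2", "What's R in programming?"),
  textOutput("probability")
)

server <- function(input, output) {
  output$probability <- renderText({
    req(input$question1, input$question2)
    prob <- predict_question_pairs(model, tokenizer, input$question1, input$question2)
    sprintf("Probability of being duplicates: %.3f", prob)
  })
}

shinyApp(ui, server)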

Wrapping up

  • We trained a Siamese LSTM which gives us reasonable accuracy (84%). Quora's state of the art is 87%.
  • We can improve our model by using pre-trained word embeddings trained on large datasets (for example, the embeddings described in this example). Quora uses its full corpus to train its word embeddings.
  • After training, we deployed our model as a Shiny application that, given two Quora questions, calculates the probability of them being duplicates.
