Analyzing rtweet data with kerasformula

Overview

The kerasformula package offers a high-level interface to the R interface to Keras. Its main interface is the kms function, a regression-style interface to keras_model_sequential that uses formulas and sparse matrices.

The kerasformula package is available on CRAN, and can be installed with:

# install the kerasformula package
install.packages("kerasformula")    
# or devtools::install_github("rdrr1990/kerasformula")

library(kerasformula)

# install the core keras library (if you haven't already done so)
# see ?install_keras() for options e.g. install_keras(tensorflow = "gpu")
install_keras()

kms() function

Many classic machine learning tutorials assume that data come in a relatively homogeneous form (e.g., pixels for digit recognition, or word counts or ranks), which can make coding somewhat cumbersome when the data are contained in a heterogeneous data frame. kms() takes advantage of the flexibility of R formulas to smooth this process.

kms builds dense neural nets and, after fitting them, returns a single object with predictions, measures of fit, and details about the function call. kms accepts a number of parameters, including the loss and activation functions found in keras. kms also accepts compiled keras_model_sequential objects, allowing for further customization. This short demo shows how kms can aid model building and hyperparameter selection (e.g., batch size) starting with raw data gathered using library(rtweet).
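As a rough sketch of what a call and its return value look like (the field names are the ones used later in this post; the formula, variables, and my_data data frame here are placeholders):

# minimal sketch; my_data and the variables y, x1, x2 are hypothetical
fit <- kms(y ~ x1 + x2, my_data)
fit$history            # epoch-by-epoch loss and accuracy (plottable)
fit$confusion          # out-of-sample confusion matrix
fit$evaluations$acc    # out-of-sample accuracy
fit$model              # the underlying fitted keras model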

Let's look at #rstats tweets (excluding retweets) for the six-day period ending January 24, 2018 at 10:40. This gives us a reasonable number of observations to work with in terms of runtime (and the purpose of this document is to showcase syntax, not to build particularly predictive models).

rstats <- search_tweets("#rstats", n = 10000, include_rts = FALSE)
dim(rstats)
  [1] 2840   42

Suppose our goal is to predict how popular tweets will be based on how often each tweet was retweeted and favorited (the two correlate strongly).

cor(rstats$favorite_count, rstats$retweet_count, method="spearman")
    [1] 0.7051952

Because few tweets go viral, the data is heavily skewed toward zero.
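The original density plots are not reproduced here; a rough sketch of how that skew could be visualized (assuming ggplot2 is installed; this is not the original plotting code) is:

# sketch only: the combined counts are dominated by values at or near zero
library(ggplot2)
ggplot(rstats, aes(x = retweet_count + favorite_count)) +
  geom_density() +
  xlim(0, 100)   # zoom in on the bulk of the distribution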

[Figure: densities of retweet and favorite counts, heavily skewed toward zero]

Getting the most out of formulas

Suppose we are interested in putting tweets into categories based on popularity, but we're not sure how finely-grained we want the distinctions to be. Some of the data, e.g. rstats$mentions_screen_name, comes as lists of varying lengths, so let's write a helper function to count the non-NA entries.
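The original definition isn't shown here, but a minimal version of such a helper (used as n() in the formulas below) could be:

# count the non-NA entries of each element of a list column
n <- function(x) unlist(lapply(x, function(entries) sum(!is.na(entries))))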

Let's start with a dense neural network, the default of kms. We can use base R functions to help clean the data: in this case, cut to discretize the outcome, grepl to look for keywords, and weekdays and format to capture different aspects of the time a tweet was posted.

breaks <- c(-1, 0, 1, 10, 100, 1000, 10000)
popularity <- kms(cut(retweet_count + favorite_count, breaks) ~ screen_name + 
                  source + n(hashtags) + n(mentions_screen_name) + 
                  n(urls_url) + nchar(text) +
                  grepl('photo', media_type) +
                  weekdays(created_at) + 
                  format(created_at, '%H'), rstats)
plot(popularity$history) +
  ggtitle(paste("#rstat popularity:",
                paste0(round(100*popularity$evaluations$acc, 1), "%"),
                "out-of-sample accuracy")) +
  theme_minimal()

[Figure: training history for the first model]

popularity$confusion

                    (-1,0] (0,1] (1,10] (10,100] (100,1e+03] (1e+03,1e+04]
      (-1,0]            37    12     28        2           0             0
      (0,1]             14    19     72        1           0             0
      (1,10]             6    11    187       30           0             0
      (10,100]           1     3     54       68           0             0
      (100,1e+03]        0     0      4       10           0             0
      (1e+03,1e+04]      0     0      0        1           0             0

The model only classifies about 55% of the out-of-sample data correctly, and its predictive accuracy doesn't improve after the first ten epochs. The confusion matrix suggests that the model does best with tweets that are retweeted a handful of times, but over-predicts the 1-10 level. The history plot also suggests that out-of-sample accuracy is not very stable. We can easily change the breakpoints and the number of epochs.

breaks <- c(-1, 0, 1, 25, 50, 75, 100, 500, 1000, 10000)
popularity <- kms(cut(retweet_count + favorite_count, breaks) ~  
                  n(hashtags) + n(mentions_screen_name) + n(urls_url) +
                  nchar(text) +
                  screen_name + source +
                  grepl('photo', media_type) +
                  weekdays(created_at) + 
                  format(created_at, '%H'), rstats, Nepochs = 10)

plot(popularity$history) +
  ggtitle(paste("#rstat popularity (new breakpoints):",
                paste0(round(100*popularity$evaluations$acc, 1), "%"),
                "out-of-sample accuracy")) +
  theme_minimal()

[Figure: training history with the new breakpoints]

That helped somewhat (about 5% additional predictive accuracy). Suppose we want to add a little more data. Let's first store the input formula.

pop_input <- "cut(retweet_count + favorite_count, breaks) ~  
                          n(hashtags) + n(mentions_screen_name) + n(urls_url) +
                          nchar(text) +
                          screen_name + source +
                          grepl('photo', media_type) +
                          weekdays(created_at) + 
                          format(created_at, '%H')"

Here we use paste0 to add to the formula by looping over user IDs, adding terms like:

grepl("12233344455556", mentions_user_id)
mentions <- unlist(rstats$mentions_user_id)
mentions <- unique(mentions[which(table(mentions) > 5)]) # remove infrequent
mentions <- mentions[!is.na(mentions)] # drop NA

for(i in mentions)
  pop_input <- paste0(pop_input, " + ", "grepl(", i, ", mentions_user_id)")

popularity <- kms(pop_input, rstats)

[Figure: training history after adding the mentions terms]

That helped a touch, but the predictive accuracy is still fairly unstable across epochs…

Customizing layers with kms()

We could add more data, perhaps individual words from the text or some other summary statistic (mean(text %in% LETTERS) to see whether all-caps tweets explain popularity). But let's alter the neural net instead.

The input.formula is used to create a sparse model matrix. For example, rstats$source (Twitter or Twitter client application type) and rstats$screen_name are character vectors that will be dummied out. How many columns does the model matrix have?

popularity$P
    [1] 1277

Say we want to reshape the layers so that the network transitions more gradually from the input dimension to the output.

popularity <- kms(pop_input, rstats,
                  layers = list(
                    units = c(1024, 512, 256, 128, NA),
                    activation = c("relu", "relu", "relu", "relu", "softmax"), 
                    dropout = c(0.5, 0.45, 0.4, 0.35, NA)
                  ))

[Figure: training history for the customized layer configuration]

kms builds a keras_model_sequential(), which is a stack of densely-connected layers. The input shape is determined by the dimensionality of the model matrix (popularity$P), but after that users are free to determine the number of layers and so on. The kms argument layers expects a list whose first entry is a vector, units, with which to call keras::layer_dense(). The first element of units is the number of units in the first layer, the second element is for the second layer, and so on (an NA as the final element means the final number of units is detected automatically from the observed number of outcomes). activation is also passed to layer_dense() and may take values such as softmax, relu, elu, and linear. (kms also has a separate parameter to control the optimizer; by default kms(... optimizer = "rms_prop").) The dropout rate that follows each dense layer helps prevent overfitting (but of course does not apply to the final layer).
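For instance, swapping in a different optimizer should only require changing that parameter (a sketch; "adam" is simply another standard keras optimizer name, not necessarily one used in the original post):

# sketch: refit with a different optimizer string
popularity_adam <- kms(pop_input, rstats, optimizer = "adam")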

Batch size selection

By default, kms uses batches of 32. Suppose we were happy with our model but didn't have any particular intuition about what the batch size should be.

Nbatch <- c(16, 32, 64)
Nruns <- 4
accuracy <- matrix(nrow = Nruns, ncol = length(Nbatch))
colnames(accuracy) <- paste0("Nbatch_", Nbatch)

est <- list()
for(i in 1:Nruns){
  for(j in 1:length(Nbatch)){
   est[[i]] <- kms(pop_input, rstats, Nepochs = 2, batch_size = Nbatch[j])
   accuracy[i,j] <- est[[i]][["evaluations"]][["acc"]]
  }
}
  
colMeans(accuracy)
    Nbatch_16 Nbatch_32 Nbatch_64 
    0.5088407 0.3820850 0.5556952 

To keep the runtime down, the number of epochs was set arbitrarily short; but, from these results, 64 is the best batch size.
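Continuing with that choice would just mean refitting with the batch_size argument already shown above (a sketch):

# sketch: refit the model with the chosen batch size
popularity <- kms(pop_input, rstats, batch_size = 64)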

Making predictions for new data

So far, we have been using the default settings for kms, which first splits the data into 80% training and 20% testing. Of the 80% training data, a certain portion is set aside for validation, and that is what produces the epoch-by-epoch graphs of loss and accuracy. The 20% is only used at the end to assess predictive accuracy. But suppose you want to make predictions on a new data set…

popularity <- kms(pop_input, rstats[1:1000,])
predictions <- predict(popularity, rstats[1001:2000,])
predictions$accuracy
    [1] 0.579

Because the formula creates a dummy variable for each screen name and mention, any given set of tweets is all but guaranteed to have different columns. predict.kms_fit is an S3 method that takes the new data and constructs a (sparse) model matrix that preserves the original structure of the training matrix. predict then returns the predictions along with a confusion matrix and accuracy score.

If your new data has the same observed levels of y and the same columns as x_train (the model matrix), you can also use keras::predict_classes on object$model.
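A sketch of what that might look like (new_x is a hypothetical matrix whose columns exactly match the training model matrix):

library(keras)
# new_x: hypothetical model matrix with the same columns (and order) as x_train
class_preds <- predict_classes(popularity$model, new_x)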

Using a compiled Keras model

This section shows how to input a model compiled in the fashion typical of library(keras), which is useful for more advanced models. Here is an example of an lstm, similar to the imdb with keras example.

k <- keras_model_sequential()
k %>%
  layer_embedding(input_dim = popularity$P, output_dim = popularity$P) %>% 
  layer_lstm(units = 512, dropout = 0.4, recurrent_dropout = 0.2) %>% 
  layer_dense(units = 256, activation = "relu") %>%
  layer_dropout(0.3) %>%
  layer_dense(units = 8, # number of levels observed on y (outcome)  
              activation = 'sigmoid')

k %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = 'rmsprop',
  metrics = c('accuracy')
)

popularity_lstm <- kms(pop_input, rstats, k)

Drop me a line via the project's Github repo. Special thanks to @dfalbel and @jjallaire for the helpful suggestions!!
