An ‘N-Gram analysis’ tool in R
Recently, I wrote a post on WebAnalyticsWorld talking about using n-grams to help analyse keyword performance.
To quickly summarise that post:
- An ‘n-gram’ is “a contiguous sequence of n items from a given sequence of text or speech” – in plainer terms, a string of words of a specified length (see the short sketch after this list)
- Splitting keywords into n-grams and aggregating the data lets you dig for positive or negative performance traits and understand the effect certain phrases have on performance
- The article then runs through a quick and easy process to perform an n-gram analysis in Excel
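To make that concrete, here is a tiny base-R sketch of the idea (the `make_ngrams` helper is purely illustrative and not part of the tool described below):

```r
# Illustrative helper: split a keyword into its contiguous n-word tokens
make_ngrams <- function(keyword, n) {
  words <- strsplit(keyword, " ")[[1]]
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

make_ngrams("buy red running shoes", 2)
# [1] "buy red"       "red running"   "running shoes"
```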
As this is quite a useful activity, I wanted to make the process a little easier, so I decided to build a tool using R – in particular the incredibly useful package Shiny, which allows simple deployment of R scripts as web apps. If you haven’t heard of R, see [here](https://en.wikipedia.org/wiki/R_(programming_language)) for a quick intro.
A Simple Application, Created Simply
This tool is deployed via shinyapps.io – a convenient way to host R apps without having to worry about server configuration and the like. If you just wish to use it, there’s no need to read on – just visit the site:
N-Gram Analysis Tool on shinyapps.io
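(As an aside, publishing a Shiny app to shinyapps.io typically takes only a couple of calls to the rsconnect package, roughly as below – the account name, token and app directory are placeholders you’d replace with your own.)

```r
library(rsconnect)

# One-off account setup (values come from your shinyapps.io dashboard)
setAccountInfo(name = "your-account", token = "TOKEN", secret = "SECRET")

# Publish the directory containing the app
deployApp(appDir = "path/to/ngram-app")
```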
The process
I find that one of the most appealing features of R is the ability to perform seemingly complex tasks without having to write too much code. This is further enhanced by the depth of packages available which provide custom functions for specific purposes.
The essence of this app is an R script which uses the natural language processing package ‘RWeka’ to split keywords into ‘tokens’ of a specific length.
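For example, given a vector of keywords, a call like the one below (a simplified version of what the script does) returns every contiguous two-word token:

```r
library(RWeka)

keywords <- c("buy red shoes", "cheap red shoes online")

# All contiguous 2-word tokens from the keyword list
NGramTokenizer(keywords, Weka_control(min = 2, max = 2))
```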
Once the tokens have been created, the data is aggregated into a summary data frame using ‘dplyr’.
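As a sketch of that aggregation step – assuming a hypothetical long-format data frame `token_rows` with one row per keyword/token pair, which is not exactly how the appendix code is structured – the dplyr side might look like this:

```r
library(dplyr)

# Roll performance up to one row per token
token_summary <- token_rows %>%
  group_by(Token) %>%
  summarise(Count       = n(),
            Cost        = sum(Cost),
            Conversions = sum(Conversions)) %>%
  mutate(CPA = ifelse(Conversions == 0, 0, round(Cost / Conversions, 2))) %>%
  arrange(desc(Count))
```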
Finally, a simple application I’ve created using the ‘Shiny’ package/framework wraps the above functions up into a user interface, which allows a CSV file to be uploaded to generate a table of n-gram information.
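The real app has a few more bells and whistles, but a stripped-down sketch of that kind of interface (not the exact code of the deployed version) looks something like this:

```r
library(shiny)

ui <- fluidPage(
  titlePanel("N-Gram Analysis Tool"),
  sidebarLayout(
    sidebarPanel(
      fileInput("file", "Upload a keyword CSV"),
      numericInput("n", "N-gram size", value = 2, min = 1, max = 5)
    ),
    mainPanel(tableOutput("ngrams"))
  )
)

server <- function(input, output) {
  output$ngrams <- renderTable({
    req(input$file)
    kw <- read.csv(input$file$datapath, stringsAsFactors = FALSE)
    ngraming(kw, input$n)   # the function shown in the appendix below
  })
}

shinyApp(ui, server)
```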
Appendix for R Users: The Code
library(RWeka)   # provides NGramTokenizer() and Weka_control()

# Split keywords into n-grams of the requested size and summarise
# performance for every keyword containing each token
ngraming <- function(data, ngram_size){

  # tokenize keywords (MC_tokenizer would give single-word tokens instead)
  # tokens <- MC_tokenizer(data$Keyword)
  tokens <- NGramTokenizer(data$Keyword,
                           Weka_control(min = ngram_size, max = ngram_size))

  # remove duplicate tokens
  tokens <- unique(tokens)

  # empty data frame defining the structure of the summary output
  dummy <- data.frame(Token = character(),
                      Count = numeric(),
                      Cost = numeric(),
                      Conversions = numeric(),
                      CPA = numeric(),
                      stringsAsFactors = FALSE)

  # summarise the keywords that include each token (capped at the first 500 tokens)
  tokenData <- lapply(seq_along(head(tokens, n = 500)), function(i) {
    test <- subset(data, grepl(tokens[i], data$Keyword))
    if (sum(test$Conversions) == 0) {
      cpaCalc <- 0
    } else {
      cpaCalc <- sum(test$Cost) / sum(test$Conversions)
    }
    tempDF <- data.frame(Token = tokens[i],
                         Count = length(test$Cost),
                         Cost = sum(test$Cost),
                         Conversions = sum(test$Conversions),
                         CPA = format(round(cpaCalc, 2), nsmall = 2),
                         stringsAsFactors = FALSE)
    rbind(dummy, tempDF)
  })

  # combine the per-token rows and sort by how many keywords contain each token
  tokenDF <- do.call(rbind.data.frame, tokenData)
  tokenDF <- tokenDF[with(tokenDF, order(-Count)), ]
  tokenDF
}
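To call the function directly, outside the app, something like the following works – assuming a CSV export with Keyword, Cost and Conversions columns (the file name is just a placeholder):

```r
kw <- read.csv("keyword_report.csv", stringsAsFactors = FALSE)
bigram_summary <- ngraming(kw, 2)
head(bigram_summary, 20)   # top 20 bigrams by number of matching keywords
```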