In my last article, I showed you some methods to classify URLs and detect active pages.
I have improved my source code, and you can now classify 500,000 URLs in one second according to several criteria (active pages, compliant pages, sections, number of inlinks, response time, duplicate meta tags).
Today, I offer you my GitHub repo so you can test it yourself: https://github.com/voltek62/SEO-Dashboard
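All the snippets below work on a data frame named urls, built from a Screaming Frog crawl export (Internal > HTML). Here is a minimal sketch of how it could be loaded, assuming a hypothetical file ./input/blog/internal_html.csv and the default Screaming Frog column names (Address, Status Code, Title 1, Inlinks, and so on):

# Screaming Frog CSV exports start with a one-line report title, hence skip = 1.
# check.names = FALSE keeps column names with spaces (e.g. `Status Code`) intact.
urls <- read.csv("./input/blog/internal_html.csv",
                 header = TRUE, skip = 1,
                 stringsAsFactors = FALSE, check.names = FALSE)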
Classify URLs
Method: for each line of my CSV file, I identify all pages whose address matches the URL pattern specified on that line.
I use the stri_detect_fixed() function from the stringi package to classify each URL.
library(stringi)

siteconf <- "./conf/blog.csv"
schemas <- read.csv(siteconf, header = FALSE, col.names = "schema", stringsAsFactors = FALSE)
schemas <- as.character(schemas[,1])

urls$Category <- "no match"

for (j in 1:length(schemas)) {
  #print(schemas[j])
  urls$Category[which(stri_detect_fixed(urls$Address, schemas[j]))] <- schemas[j]
}

# Detect HomePage
urls$Category[1] <- 'Home'

urls$Category <- as.factor(urls$Category)
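For reference, the schema file is simply a list of URL patterns, one per line, with no header. A hypothetical ./conf/blog.csv could look like this (these patterns are only examples, not the ones used on my blog):

/category/seo/
/category/r/
/tag/
/author/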
Detect Compliant pages
Method: a page is compliant if it:
- Responds with an HTTP 200 (OK) status code
- Does not include a canonical tag to another URL.
- Has an HTML content type
- Does not include any Noindex meta tag
urls$Compliant <- TRUE

# a page is flagged non-compliant if any of the criteria above fails
urls$Compliant[which(urls$`Status Code` != 200 |
                     urls$`Canonical Link Element 1` != urls$Address |
                     urls$Status != "OK" |
                     grepl("noindex", urls$`Meta Robots 1`))] <- FALSE

urls$Compliant <- as.factor(urls$Compliant)
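Before drawing any chart, a quick count of compliant pages gives a first health check (base R, nothing extra to load):

# number of compliant vs non-compliant crawled URLs
table(urls$Compliant)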
Classify by inlinks
Method: I have created five groups to classify URLs according to their number of follow inlinks:
- URLs with No Follow Inlinks
- URLs with 1 Follow Inlink
- URLs with 2 to 5 Follow Inlinks
- URLs with 5 to 10 Follow Inlinks
- URLs with more than 10 Follow Inlinks
urls$`Group Inlinks` <- "URLs with No Follow Inlinks"
urls$`Group Inlinks`[which(urls$Inlinks < 1)] <- "URLs with No Follow Inlinks"
urls$`Group Inlinks`[which(urls$Inlinks == 1)] <- "URLs with 1 Follow Inlink"
urls$`Group Inlinks`[which(urls$Inlinks > 1 & urls$Inlinks < 6)] <- "URLs with 2 to 5 Follow Inlinks"
urls$`Group Inlinks`[which(urls$Inlinks >= 6 & urls$Inlinks < 11)] <- "URLs with 5 to 10 Follow Inlinks"
urls$`Group Inlinks`[which(urls$Inlinks >= 11)] <- "URLs with more than 10 Follow Inlinks"
urls$`Group Inlinks` <- as.factor(urls$`Group Inlinks`)
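The most interesting group is usually the first one: pages with no follow inlinks are almost orphan pages. They can be extracted for review with a small sketch reusing the labels above:

# addresses of pages that receive no follow inlinks
orphan_candidates <- urls$Address[urls$`Group Inlinks` == "URLs with No Follow Inlinks"]
head(orphan_candidates)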
Detect Duplicate Meta
Method: by default, all meta tags are flagged as "Unique". If the corresponding length field (Title 1 Length, Meta Description 1 Length or H1-1 Length) is equal to 0, I classify the tag as "No Set".
Finally, I use the duplicated() function to detect all duplicate meta tags.
urls$`Status Title` <- 'Unique'
urls$`Status Title`[which(urls$`Title 1 Length` == 0)] <- "No Set"

urls$`Status Description` <- 'Unique'
urls$`Status Description`[which(urls$`Meta Description 1 Length` == 0)] <- "No Set"

urls$`Status H1` <- 'Unique'
urls$`Status H1`[which(urls$`H1-1 Length` == 0)] <- "No Set"

urls$`Status Title`[which(duplicated(urls$`Title 1`))] <- 'Duplicate'
urls$`Status Description`[which(duplicated(urls$`Meta Description 1`))] <- 'Duplicate'
urls$`Status H1`[which(duplicated(urls$`H1-1`))] <- 'Duplicate'

urls$`Status Title` <- as.factor(urls$`Status Title`)
urls$`Status Description` <- as.factor(urls$`Status Description`)
urls$`Status H1` <- as.factor(urls$`Status H1`)
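To go beyond counting and actually review the duplicates, the offending URLs can be listed with dplyr (a sketch, assuming the same column names as above):

library(dplyr)

# all URLs whose title is flagged as duplicate, sorted so identical titles sit together
duplicate_titles <- urls %>%
  filter(`Status Title` == "Duplicate") %>%
  select(Address, `Title 1`) %>%
  arrange(`Title 1`)

head(duplicate_titles, 20)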
Charts
Method: I use the ggplot() function from the ggplot2 package to draw each chart.
ggsave() is a convenient function to save a plot.
urls_cat_statustitle <- urls %>%
  group_by(Category, `Status Title`) %>%
  summarise(count = n())

ggplot(urls_cat_statustitle, aes(x = Category, y = count, fill = `Status Title`)) +
  geom_bar(stat = "identity", position = "stack") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = "Section", y = "Crawled URLs")
  #+ ggtitle("Number of crawled URLs by section and title tag status")

ggsave(file = "./export/urlsBysectionFillstatustitle.png")
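The same pattern works for any of the factors created earlier. For example, here is a hypothetical chart of compliant versus non-compliant URLs per section, built exactly like the one above (dplyr and ggplot2 are assumed to be loaded):

urls_cat_compliant <- urls %>%
  group_by(Category, Compliant) %>%
  summarise(count = n())

ggplot(urls_cat_compliant, aes(x = Category, y = count, fill = Compliant)) +
  geom_bar(stat = "identity", position = "stack") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = "Section", y = "Crawled URLs")

ggsave(file = "./export/urlsBysectionFillcompliant.png")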
Bonus: Calculate Internal PageRank
Method: use a new Screaming Frog export: select Bulk Export > All Outlinks from the top menu and save the CSV file.
I use the igraph package to calculate the internal PageRank of each URL (vertex) in the link graph.
library(igraph)
library(dplyr)
library(ggplot2)
library(magrittr)
#library(ForceAtlas2)

file_outlinks <- './input/blog/all_outlinks_test.csv'
website_url <- 'http://www.mywebsite.com'

# rescale the raw internal PageRank to a given range
map <- function(x, range = c(0,1), from.range = NA) {
  if(any(is.na(from.range))) from.range <- range(x, na.rm = TRUE)

  ## check if all values are the same
  if(!diff(from.range)) return(
    matrix(mean(range), ncol = ncol(x), nrow = nrow(x), dimnames = dimnames(x)))

  ## map to [0,1]
  x <- (x - from.range[1])
  x <- x / diff(from.range)
  ## handle single values
  if(diff(from.range) == 0) x <- 0

  ## map from [0,1] to [range]
  if (range[1] > range[2]) x <- 1 - x
  x <- x * (abs(diff(range))) + min(range)
  x[x < min(range) | x > max(range)] <- NA

  x
}

DF <- read.csv2(file_outlinks, header = TRUE, sep = ",", stringsAsFactors = FALSE, skip = 1)

## keep only followed internal HTML links
DF <- filter(DF, grepl(website_url, Source) & Type == "HREF" & Follow == "true") %>%
  select(Source, Destination)

DF <- as.data.frame(sapply(DF, gsub, pattern = website_url, replacement = ""))

## adapt colnames and rownames
colnames(DF) <- c("from", "to")
rownames(DF) <- NULL

# generate graph with data.frame
graphObject <- graph.data.frame(DF)

# calculate pagerank
urls_pagerank <- page.rank(graphObject, directed = TRUE, damping = 0.85) %>%
  use_series("vector") %>%
  sort(decreasing = TRUE) %>%
  as.data.frame %>%
  set_colnames("raw.internal.pagerank")

urls_pagerank$Address <- rownames(urls_pagerank)
rownames(urls_pagerank) <- NULL

urls_pagerank <- mutate(urls_pagerank, internal.pagerank = map(raw.internal.pagerank, c(1,10)))
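The scores can then be merged back into the urls data frame from the previous steps, for example to cross PageRank with sections or inlink groups. A sketch, assuming urls$Address still contains full URLs, so the domain has to be stripped the same way before joining (the Path column is only a hypothetical helper):

# strip the domain so the addresses match the vertex names of the graph
urls$Path <- gsub(website_url, "", urls$Address)

# bring raw and rescaled PageRank back into the main data frame
urls <- left_join(urls, urls_pagerank, by = c("Path" = "Address"))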
Conclusion
Do not hesitate to use the comments to ask for my help with drawing more complex charts.