R Tools for SEO : Visualizing your site’s Internal Linking

My first english post is about how to generate the internal link graph of any site with R Studio.

Process and Tools
You need RStudio with packages iGraph, SNA and Dplyr
To install a package , just type in the console : install.packages(“iGraph”)
The process required exporting the URLs crawled with Screaming Frog.
When the crawl is finished, you choose “Bulk Export” then “All Outlinks”

semrush
Now, you are ready to use R Studio

R Code
You need to create a dataframe with the file generated by Screaming Frog.

# import semRush
## skip first line
DF <- read.csv2(file_csv, header=TRUE, sep = ",", stringsAsFactors = F, skip=1 )
## we keep only link
DF <- DF[DF$Type=="HREF",]
DF <- select(DF,Source,Destination)
DF <- as.data.frame(sapply(DF,gsub,pattern=website_url1,replacement=""))
DF <- as.data.frame(sapply(DF,gsub,pattern=website_url2,replacement=""))
DF <- as.data.frame(sapply(DF,gsub,pattern="\"",replacement=""))
## delete subdomain
DF <- subset(DF, !grepl("^http", DF$Source))
DF <- subset(DF, !grepl("^http", DF$Destination))
## adapt colnames and rownames
colnames(DF) <- c("From","To")
rownames(DF) <- NULL
# generate graph with data.frame
graphObject = graph.data.frame(DF, directed = TRUE)
# to run pagerank we need a simple, undirected graph
graphObject = simplify(as.undirected(graphObject))

Visualize the data
You can see the generation of graphs.
Be patient, it can take 5 minutes for a graph with 74 000 inlinks.
First we need to load a library with visualization capabilities. I’ll start with ‘igraph’, which is its own package.

library(igraph)

Note that igraph’s excellent documentation is accessible via the command ?igraph.

Now, we plot the simple graph

plot(graphObject, 
     layout=layout.fruchterman.reingold,
     vertex.size        = 4,          # Smaller nodes
     vertex.label       = NA,         # Set the labels
     vertex.label.color = "black")

graph seo
Now, I use Google Pagerank to adjust node sizes.

# calculate pagerank
pr <- page.rank(graphObject,directed=TRUE)
# print graph with size node linked with pagerank
plot(graphObject, 
     layout=layout.fruchterman.reingold,
     vertex.size        = map(pr$vector, c(1,20)),
     vertex.label       = NA,      
     vertex.label.color = "black",
     edge.arrow.size=.2
)

graph2 seo with pagerank

It is very easy to notice that pagerank is badly distributed and internnal linking is unbalanced.

Example with a good internal linking
graph2 good example
A website with a good internal linking can be a spheroid because all is connected and whatever the first entry, Google Bots crawl and access different parts of your site.

Conclusion
This can be extremely helpful to better understand how a site is internally linked and detect areas of the sites (orphan pages) that are not connected with the rest.
If you want more information in order to improve your internal linking, I suggest you this post : http://www.wikiweb.com/internal-linking-structure/

Update 2016

You need this “map” function

map <- function(x, range = c(0,1), from.range=NA) {
if(any(is.na(from.range))) from.range <- range(x, na.rm=TRUE)

## check if all values are the same
if(!diff(from.range)) return(
matrix(mean(range), ncol=ncol(x), nrow=nrow(x),
dimnames = dimnames(x)))

## map to [0,1]
x <- (x-from.range[1])
x <- x/diff(from.range)
## handle single values
if(diff(from.range) == 0) x <- 0

## map from [0,1] to [range]
if (range[1]>range[2]) x <- 1-x
x <- x*(abs(diff(range))) + min(range)

x[x<min(range) | x>max(range)] <- NA

x
}