Six months ago, I wrote an article titled "How to predict Google rankings" in French. As you know, it is not feasible to predict the exact position of a website for a search term on Google. To simplify this complex task, we instead aimed to predict whether a website appears on Google's first search engine results page for a given search term.
After many SEO discussions with Remi Bacha, and with his contribution, we achieved good results: 92% accuracy on this prediction problem.
I created an open source project ( OVH GitHub ) and shared the source code in R, but unfortunately it can't be used without knowledge of the R language.
I recently found a way to make this prediction algorithm easier to use: Dataiku, an open source data science platform.
I reproduced my use case, and it worked very well with only a few clicks and minimal setup.
So I am going to write two blog posts explaining how to automate the entire process, so that everybody can find their own SEO factors.
This first post describes how to use XGBoost (or another algorithm) with the dataset I prepared for the OVH Summit. The dataset contains 200,000 records covering 2,000 distinct keywords/search terms. ( Thanks to Visiblis, Rankplorer, and Majestic for their help.) My second post will focus on how to collect all the data and merge it into one dataset.
Step 1: Install Dataiku
You need to install Dataiku on your platform.
Just follow the tutorial on this page: https://www.dataiku.com/dss/trynow/
For this post, I used the Dataiku Free Edition for VirtualBox. Whichever version you select, your data stays on your infrastructure, and there is no limit on the size or volume of processed data. Once it is installed, you can access the login page.
Step 2: Create your project
You just need to choose a name.
Step 3: Prepare a new dataset
Download my prepared dataset ( click here ) and upload it to create your first dataset.
Click on the link “Upload your files” to do it.
Step 4: Create your first analysis
Click on the green wheel in the menu.
Click on "New analysis," choose the previous dataset, and click on "Create Analysis."
Now, click on "Features" in the left column and remove unuseful features such as URL, TextRatio, ExtBAckLinks ( because it only applies to the homepage ), and Keyword.
Choose "Reject" to ignore a feature and work only with your pertinent data.
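Rejecting a feature in Dataiku is equivalent to dropping the column before training. Here is a minimal sketch in Python with pandas; the two toy rows are made up for illustration, and only the column names come from the post.

```python
import pandas as pd

# Toy rows standing in for the SERP dataset; the column names follow the
# post, but the values here are invented for illustration.
df = pd.DataFrame({
    "Keyword": ["seo tips", "seo tips"],
    "URL": ["https://a.example/", "https://b.example/"],
    "TextRatio": [0.41, 0.38],
    "ExtBAckLinks": [120, 15],
    "TrustFlow": [30, 12],
    "OnFirstPage": [1, 0],
})

# "Reject" in the Features screen = drop the column from the training data.
rejected = ["URL", "TextRatio", "ExtBAckLinks", "Keyword"]
features = df.drop(columns=rejected)
print(list(features.columns))  # ['TrustFlow', 'OnFirstPage']
```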
Dataiku industrializes all the steps needed to produce the model: load the train set, load the test set, collect statistics, preprocess the train set, preprocess the test set, fit the model, save the model, and score the model.
You can save a lot of time by testing different algorithms at the same time and comparing their accuracy.
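Under the hood, comparing several algorithms amounts to fitting each one on the same train split and scoring it on the same test split. A minimal sketch of that loop, using scikit-learn's random forest and gradient boosting classifiers as stand-ins (XGBoost would slot in the same way) and synthetic data instead of the real SERP dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the SERP dataset
# (the real one has ~200,000 rows).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit each candidate on the same train set, score on the same test set.
scores = {}
for name, model in [
    ("random_forest", RandomForestClassifier(random_state=42)),
    ("gradient_boosting", GradientBoostingClassifier(random_state=42)),
]:
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```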
Step 5: Check results
Be careful when analyzing the results: they are only valid for the specified dataset, in that thematic area, within the specified period.
Google personalizes search engine results pages using more than 300 factors, depending on localization, device, language, topic, etc.
For your own case, this gives you a good idea of what is working. I think a good approach is to feed the ML algorithm both the best-performing and the worst-performing pages for a given term/keyword, so that it can confirm or reject a feature.
Now you have the accuracy of each algorithm, and you can see the importance of the variables.
Click on the link for your algorithm to access a menu where you can discover your important variables and measure the performance of your algorithm with several methods ( ROC curve, confusion matrix, decision chart ).
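The figures Dataiku shows in that menu can be computed by hand. Here is a sketch with scikit-learn on synthetic data: variable importances from a fitted random forest, plus a confusion matrix and ROC AUC on the held-out test set (the real report would use the SERP dataset instead).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the SERP dataset.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Variable importance, as shown in Dataiku's model report.
for i, imp in enumerate(model.feature_importances_):
    print(f"feature_{i}: {imp:.3f}")

# Confusion matrix and ROC AUC on held-out data.
cm = confusion_matrix(y_test, model.predict(X_test))
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(cm)
print(f"ROC AUC: {auc:.3f}")
```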
OK, these are your first steps on this platform, where you can import and manipulate a dataset quickly and use prediction algorithms just by clicking.
If you want to test your prediction model on a new or updated page, you can follow this great tutorial:
Next time, I will teach you how to get data from the Majestic API, Visiblis API, SemRush API, and Yooda API and merge it all into one dataset.
Dataiku allows you to code in R or Python and, more importantly, lets you share the whole workflow, including the source code, in a zip file. Of course, I am going to prepare a zip that deploys all the processes in one import.
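The merge itself boils down to joining per-URL metrics from each API on a common key. A minimal sketch with pandas; the column names and values are hypothetical, since each provider's actual response format differs.

```python
import pandas as pd

# Hypothetical extracts from two APIs, keyed on URL; the real column
# names depend on each provider's response format.
majestic = pd.DataFrame({
    "URL": ["https://a.example/", "https://b.example/"],
    "TrustFlow": [30, 12],
})
semrush = pd.DataFrame({
    "URL": ["https://a.example/", "https://b.example/"],
    "OrganicTraffic": [5400, 230],
})

# One row per URL with all metrics side by side.
dataset = majestic.merge(semrush, on="URL", how="inner")
print(dataset)
```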