How to Predict Google Rankings with the Data Science Platform Dataiku

Six months ago, I wrote an article in French titled "How to predict Google rankings". As you may know, it is not feasible to predict the exact position of a website for a search term on Google. To make this complex task tractable, we instead aimed to predict whether a website appears on Google's first search engine results page for a given search term.
After many SEO discussions with Remi Bacha, and with his contribution, we achieved good results: 92% accuracy on this prediction problem.

I created an open source project ( OVH GitHub ) and shared the source code in R, but unfortunately it cannot be used without knowledge of the R language.
I recently found a way to make this prediction algorithm much easier to use: a data science platform named "Dataiku", which offers a free edition.
I reproduced my use case, and it worked very well with only a few clicks and minimal setup.

So I am going to write two blog posts explaining how to automate the entire process, so that everybody can discover their own SEO factors.
This first post describes how to use XGBoost (or another algorithm) with the dataset I prepared for the OVH Summit. The dataset contains 200,000 records covering 2,000 distinct keywords/search terms (thanks to Visiblis, Rankplorer, and Majestic for their help). My second post will focus on how to collect all the data and merge it into one dataset.
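Before loading the dataset into any tool, it helps to check its shape and the balance of the target column. Here is a minimal sketch with pandas, using an invented three-row excerpt: the column names (`Keyword`, `URL`, `TrustFlow`, `CitationFlow`) are illustrative assumptions, except `isTopTen`, the binary target discussed in this post.

```python
import pandas as pd
from io import StringIO

# Hypothetical excerpt mimicking the dataset's shape: one row per
# (keyword, URL) pair, with a binary "isTopTen" target.
csv = StringIO("""Keyword,URL,TrustFlow,CitationFlow,isTopTen
garden shed,https://example.com/a,34,40,1
garden shed,https://example.com/b,12,18,0
lawn mower,https://example.com/c,55,60,1
""")

df = pd.read_csv(csv)

# Check the class balance of the target before modeling.
balance = df["isTopTen"].value_counts(normalize=True)
print(df.shape)    # (3, 5)
print(balance[1])
```

On the real dataset, a strong class imbalance here would be the first thing to address before training.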

Step 1: Install Dataiku

You need to install Dataiku on your platform.
Just follow the tutorial on this page:

For this post, I used the Dataiku Free Edition for VirtualBox. Whichever version you select, your data stays on your own infrastructure, and there is no limit on the size or volume of processed data. Once the installation is complete, you reach the login page.

Step 2: Create your project

You just need to choose a name.

Great, you are ready to import files. Click on your project link, then click on "Import Dataset."

Step 3: Prepare a new dataset

Download my prepared dataset ( click here ) and upload it to create your first dataset.

Click on the link “Upload your files” to do it.

Don't worry: in the next post, I will teach you how to create your own dataset.

Click on “Finish.”

Choose a name, for example: dataset_garden_queries

Step 4: Create your first analysis

Click on the green wheel in the menu

Click on "New analysis," choose the dataset you just created, and click on "Create Analysis."

If you want to remove a column, click on the column name and click on “Delete.”

The two Visiblis columns contain rows that are invalid for the Decimal meaning; remove them by clicking on the "Visiblis_Title" column name and choosing the corresponding function.

Choose the "isTopTen" column you want to predict and click on "Create Prediction model."

I advise you to choose the Performance model and select the "XGBoost" algorithm.

You need to personalize the model by clicking on “Settings.”

In the left column, click on "Algorithm," where you can deselect every algorithm except XGBoost. In the XGBoost settings, you can raise the maximum number of trees to 1000 for better results.

Now, click on "Features" in the left column and remove unhelpful features such as URL, TextRatio, ExtBackLinks (because it only applies to the homepage), and Keyword.
Choose "Reject" to ignore a feature and work only with your pertinent data.
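The "Reject" action is equivalent to dropping the columns before training. A minimal pandas sketch, with an invented one-row frame using the column names mentioned above:

```python
import pandas as pd

# Hypothetical frame with the columns discussed in this step.
df = pd.DataFrame({
    "Keyword": ["garden shed"],
    "URL": ["https://example.com/a"],
    "TextRatio": [0.42],
    "ExtBackLinks": [120],
    "TrustFlow": [34],
    "isTopTen": [1],
})

# Equivalent of "Reject" in Dataiku: drop identifier-like or
# homepage-only columns so the model trains on pertinent features.
rejected = ["Keyword", "URL", "TextRatio", "ExtBackLinks"]
features = df.drop(columns=rejected)
print(list(features.columns))  # ['TrustFlow', 'isTopTen']
```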

Good, you have finished; now click on the green "Train" button.
Be patient and grab a coffee: XGBoost is effective but slow to train.

Dataiku industrializes all the steps needed to produce the model: loading the train and test sets, collecting statistics, preprocessing both sets, fitting the model, then saving and scoring it.
You can save a lot of time by testing different algorithms at the same time and comparing their accuracy.
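The side-by-side comparison Dataiku performs can be sketched with scikit-learn: train several candidate models on the same cross-validation folds and compare their mean accuracy. The data and the choice of candidates here are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=15, random_state=1)

# Evaluate each candidate on the same folds, as Dataiku does when you
# leave several algorithms selected before training.
candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=1),
    "gradient_boosting": GradientBoostingClassifier(random_state=1),
}
scores = {
    name: cross_val_score(model, X, y, cv=3).mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)
print(scores, best)
```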

Step 5: Check the results

Be careful when analyzing the results: they are valid only for the given dataset, within its thematic, during the specified period.
Google personalizes search engine results pages using more than 300 factors, according to localization, device, language, thematic, etc.
Still, for your own case, this gives you a good idea of what is working. I think a good approach is to include, for each term/keyword, both the best- and worst-ranking results, so the ML algorithm has the material to confirm or reject a feature.

Now you have the accuracy of each algorithm, and you can see the importance of the variables.

Click on your algorithm's link to reach a menu where you can explore the most important variables and measure the performance of your algorithm with several methods (ROC curve, confusion matrix, decision chart).
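These same diagnostics can be reproduced outside the platform with scikit-learn, again on synthetic data and with a random forest as an illustrative model: ROC AUC, the confusion matrix, and per-feature importances.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

model = RandomForestClassifier(random_state=2).fit(X_train, y_train)

# The diagnostics Dataiku shows for a model: ROC AUC (from predicted
# probabilities), the confusion matrix, and feature importances.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
cm = confusion_matrix(y_test, model.predict(X_test))
top_feature = model.feature_importances_.argmax()
print(auc, cm, top_feature)
```

In the SEO use case, the feature importances are the interesting output: they point at which factors are associated with first-page presence for this dataset.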



These were your first steps on a platform where you can quickly import and manipulate a dataset and run prediction algorithms with just a few clicks.
If you want to test your prediction model on a new or updated page, you can follow this great tutorial:

Next time, I will teach you how to get data from the Majestic, Visiblis, SemRush, and Yooda APIs and merge it all into one dataset.

Dataiku also lets you code in R or Python and, more importantly, lets you share the whole workflow quickly, source code included, as a zip file. Of course, I am going to prepare a zip that deploys the entire process in one import.

Thanks to Aysun Akarsu and Remi Bacha for proofreading.