stat-search/getstat.json at master · sarkaria/stat-search · GitHub

{"paragraphs":[{"text":"%md\n<div class=\"alert alert-warning\" role=\"alert\" style=\"margin: 10px\">\n<h3>Stat Search Analytics</h3>\nThe assignment instructions are as follows:</br><br>\n<b>Business Case</b></br>\n\nSTAT spiders and parses HTML search engine results for ~18 million keywords everyday. The parsing process transforms these HTML documents into ~1.8 billion result rows, with ~100 results per keyword. This leaves us with ~54 billion rows per month to analyze. One of our major objectives this year is improve both the storage and analysis of this parsed data so that we can quickly analyze our data across multiple dimensions in order to surface timely, actionable insights for our customers.</br>\n\nWe would like you to take a subset of our data, store it in an appropriate format, and then perform some analysis on the stored data.</br></br>\n\n<b>Data</b></br>\n\nYou can download a compressed CSV with a very tiny subset of our parsed data from https://stat-ds-test.s3.amazonaws.com/getstat_com_serp_report_201707.csv.gz - 3128 keywords for 31 days of July 2017 (approximately 9.9 million rows)\n\nThe parsed data can be divided into 2 parts - keyword information (entity data) and ranking metrics. 
The keyword information (entity data) is represented by the following columns</br>\nKeyword</br>\nMarket</br>\nLocation</br>\nDevice</br>\nThe date of the crawl (spider and parse) is represented by the “Crawl Date”</br>\nThe ranking metrics are represented by the following columns</br>\nRank</br>\nURL</br>\nEach combination of keyword information and crawl date will have approximately 100 ranks and URLs (lower ranks are better)</br></br>\n\n<b>Project</b></br>\n\nOur goal with the project is two fold:</br>\nDesign and build an ETL process to ingest and store this data in an efficient format for analysis\nAnalyze the data to answer three important SEO questions\nIn doing so, we want to give you a feel for the kind of work you will be doing in the Data Services team at STAT, and we want to get an idea of how you would solve these problems.\n\nAlthough we have provided a very tiny subset of data to you to limit the size and scope of the project, we would like you to design the solution considering the scale of data that we store and analyze in production today.</br>\n\nThe SEO questions that we would like you to answer are:</br>\n\n1. Which URL has the most ranks below 10 across all keywords over the period?</br>\n2. Provide the set of keywords (keyword information) where the rank 1 URL changes the most over the period. A change, for the purpose of this question, is when a given keyword's rank 1 URL is different from the previous day's URL.</br>\n3. We would like to understand how similar the results returned for the same keyword, market, and location are across devices. For the set of keywords, markets, and locations that have data for both desktop and smartphone devices, please devise a measure of difference to indicate how similar these datasets are.</br></br>\n\n<b>Deliverables</b></br>\n\nPlease submit your work in the form of a git repository, preferably a public repository on GitHub so we can easily share it among our team members. 
This repository should include:</br>\n1. A README.md file that outlines</br>\n2. How to install the required dependencies on a Linux or macOS system, and how to run the ETL and analysis process(es).</br>\n3. The process you went through to arrive at this solution.</br>\nNotes on any functionality you would have included before releasing the code into a production environment but were unable to due to time constraints.\nRunnable code that extracts the provided data and completes the requested analysis in the language of your choice.</br></br>\n\n<b>Some additional notes</b></br>\n\nPlease be mindful of the characteristics of solution that you choose, we'll be most interested in why you selected a particular solution given our parameters, and the trade offs of that solution relative to other options.</br>\n\nWe expect this assignment to take six hours or less. Please aim to produce code that you would be happy to release into a production environment, and document anything you would otherwise include but are unable to due to time constraints.</br>\n\n</div>\n\n<div class=\"clearfix\" style=\"padding: 10px; padding-left: 0px\">\n<a href=\"https://getstat.com\"><img src=\"https://getstat.com/drive/uploads/2012/08/STAT_logo_blue_small.jpg\" width=\"150px\" class=\"pull-left\" style=\"display: inline-block; margin: 0px;\"></a>\n</div>","user":"anonymous","dateUpdated":"2017-09-04T08:17:23-0700","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"<div class=\"markdown-body\">\n<div class=\"alert alert-warning\" role=\"alert\" style=\"margin: 10px\">\n<h3>Stat Search Analytics</h3>\nThe assignment instructions are as follows:</br><br>\n<b>Business Case</b></br>\n\nSTAT spiders and parses HTML search engine results for ~18 million keywords everyday. 
The parsing process transforms these HTML documents into ~1.8 billion result rows, with ~100 results per keyword. This leaves us with ~54 billion rows per month to analyze. One of our major objectives this year is improve both the storage and analysis of this parsed data so that we can quickly analyze our data across multiple dimensions in order to surface timely, actionable insights for our customers.</br>\n\nWe would like you to take a subset of our data, store it in an appropriate format, and then perform some analysis on the stored data.</br></br>\n\n<b>Data</b></br>\n\nYou can download a compressed CSV with a very tiny subset of our parsed data from https://stat-ds-test.s3.amazonaws.com/getstat_com_serp_report_201707.csv.gz - 3128 keywords for 31 days of July 2017 (approximately 9.9 million rows)\n\nThe parsed data can be divided into 2 parts - keyword information (entity data) and ranking metrics. The keyword information (entity data) is represented by the following columns</br>\nKeyword</br>\nMarket</br>\nLocation</br>\nDevice</br>\nThe date of the crawl (spider and parse) is represented by the “Crawl Date”</br>\nThe ranking metrics are represented by the following columns</br>\nRank</br>\nURL</br>\nEach combination of keyword information and crawl date will have approximately 100 ranks and URLs (lower ranks are better)</br></br>\n\n<b>Project</b></br>\n\nOur goal with the project is two fold:</br>\nDesign and build an ETL process to ingest and store this data in an efficient format for analysis\nAnalyze the data to answer three important SEO questions\nIn doing so, we want to give you a feel for the kind of work you will be doing in the Data Services team at STAT, and we want to get an idea of how you would solve these problems.\n\nAlthough we have provided a very tiny subset of data to you to limit the size and scope of the project, we would like you to design the solution considering the scale of data that we store and analyze in production 
today.</br>\n\nThe SEO questions that we would like you to answer are:</br>\n\n1. Which URL has the most ranks below 10 across all keywords over the period?</br>\n2. Provide the set of keywords (keyword information) where the rank 1 URL changes the most over the period. A change, for the purpose of this question, is when a given keyword's rank 1 URL is different from the previous day's URL.</br>\n3. We would like to understand how similar the results returned for the same keyword, market, and location are across devices. For the set of keywords, markets, and locations that have data for both desktop and smartphone devices, please devise a measure of difference to indicate how similar these datasets are.</br></br>\n\n<b>Deliverables</b></br>\n\nPlease submit your work in the form of a git repository, preferably a public repository on GitHub so we can easily share it among our team members. This repository should include:</br>\n1. A README.md file that outlines</br>\n2. How to install the required dependencies on a Linux or macOS system, and how to run the ETL and analysis process(es).</br>\n3. The process you went through to arrive at this solution.</br>\nNotes on any functionality you would have included before releasing the code into a production environment but were unable to due to time constraints.\nRunnable code that extracts the provided data and completes the requested analysis in the language of your choice.</br></br>\n\n<b>Some additional notes</b></br>\n\nPlease be mindful of the characteristics of solution that you choose, we'll be most interested in why you selected a particular solution given our parameters, and the trade offs of that solution relative to other options.</br>\n\nWe expect this assignment to take six hours or less. 
Please aim to produce code that you would be happy to release into a production environment, and document anything you would otherwise include but are unable to due to time constraints.</br>\n\n</div>\n<div class=\"clearfix\" style=\"padding: 10px; padding-left: 0px\">\n<a href=\"https://getstat.com\"><img src=\"https://getstat.com/drive/uploads/2012/08/STAT_logo_blue_small.jpg\" width=\"150px\" class=\"pull-left\" style=\"display: inline-block; margin: 0px;\"></a>\n</div>\n</div>"}]},"apps":[],"jobName":"paragraph_1504459697586_165885654","id":"20170903-102817_2130175749","dateCreated":"2017-09-03T10:28:17-0700","dateStarted":"2017-09-04T08:17:23-0700","dateFinished":"2017-09-04T08:17:23-0700","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:2508"},{"title":"Verify Spark Version","text":"sc.version\n","user":"anonymous","dateUpdated":"2017-09-04T08:17:30-0700","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":false},"editorMode":"ace/mode/scala","title":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"\nres0: String = 2.1.0\n"}]},"apps":[],"jobName":"paragraph_1504326066920_185615845","id":"20170901-212106_1439781607","dateCreated":"2017-09-01T21:21:06-0700","dateStarted":"2017-09-04T08:17:30-0700","dateFinished":"2017-09-04T08:18:17-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:2509"},{"text":"%md\nIn the next paragraph, please update the path to CSV file.","user":"anonymous","dateUpdated":"2017-09-04T08:20:37-0700","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"<div class=\"markdown-body\">\n<p>In the next paragraph, please update the path to CSV 
file.</p>\n</div>"}]},"apps":[],"jobName":"paragraph_1504485696301_-2097486722","id":"20170903-174136_1247674206","dateCreated":"2017-09-03T17:41:36-0700","dateStarted":"2017-09-04T08:20:37-0700","dateFinished":"2017-09-04T08:20:37-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:2510"},{"title":"Load Data into Spark","text":"val df = sqlContext.read\n    .format(\"com.databricks.spark.csv\")\n    .option(\"header\", \"true\") // Use first line of all files as header\n    .option(\"inferSchema\", \"true\") // Automatically infer data types\n    .load(\"/Users/sarbjit/Desktop/Not backed up/getstat_com_serp_report_201707.csv\")\n    .cache","user":"anonymous","dateUpdated":"2017-09-04T08:18:38-0700","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala","title":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"\ndf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Keyword: string, Market: string ... 
5 more fields]\n"}]},"apps":[],"jobName":"paragraph_1504324340282_-830756644","id":"20170901-205220_1693849834","dateCreated":"2017-09-01T20:52:20-0700","dateStarted":"2017-09-04T08:18:38-0700","dateFinished":"2017-09-04T08:20:47-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:2511"},{"title":"Lets Look at a Sample of the Data","text":"df.printSchema\ndf.show(5)","user":"anonymous","dateUpdated":"2017-09-04T08:20:52-0700","config":{"colWidth":12,"enabled":true,"results":{"0":{"graph":{"mode":"table","height":362,"optionOpen":false}}},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala","title":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"root\n |-- Keyword: string (nullable = true)\n |-- Market: string (nullable = true)\n |-- Location: string (nullable = true)\n |-- Device: string (nullable = true)\n |-- Crawl Date: string (nullable = true)\n |-- Rank: integer (nullable = true)\n |-- URL: string (nullable = true)\n\n+----------------+------+--------+-------+----------+----+--------------------+\n|         Keyword|Market|Location| Device|Crawl Date|Rank|                 URL|\n+----------------+------+--------+-------+----------+----+--------------------+\n|search analytics| US-en|    null|desktop|2017-07-01|   1|support.google.co...|\n|search analytics| US-en|    null|desktop|2017-07-01|   2|  trends.google.com/|\n|search analytics| US-en|    null|desktop|2017-07-01|   3|developers.google...|\n|search analytics| US-en|    null|desktop|2017-07-01|   4|en.wikipedia.org/...|\n|search analytics| US-en|    null|desktop|2017-07-01|   5|searchengineland....|\n+----------------+------+--------+-------+----------+----+--------------------+\nonly showing top 5 
rows\n\n"}]},"apps":[],"jobName":"paragraph_1504455059800_927685784","id":"20170903-091059_792224568","dateCreated":"2017-09-03T09:10:59-0700","dateStarted":"2017-09-04T08:20:52-0700","dateFinished":"2017-09-04T08:21:10-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:2512"},{"text":"%md\n<div class=\"alert alert-warning\" role=\"alert\" style=\"margin: 10px\">\n<h3>Stat Search Analytics Q1</h3>\nHere we simply filter the rows by rank (< 10), group them by URL, and count the rows in each group. \n</div>","user":"anonymous","dateUpdated":"2017-09-04T08:21:27-0700","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1504460883138_-994202323","id":"20170903-104803_2105147452","dateCreated":"2017-09-03T10:48:03-0700","dateStarted":"2017-09-04T08:21:27-0700","dateFinished":"2017-09-04T08:21:27-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:2513","errorMessage":""},{"title":"1. 
Top 20 URLs with the most ranks below 10 across all keywords","text":"val df1 = df.filter($\"Rank\" < 10)\n    .groupBy(\"URL\")\n    .agg(count($\"URL\") as \"count\")\n    .sort($\"count\".desc)\n    \nz.show(df1.limit(20))","user":"anonymous","dateUpdated":"2017-09-04T08:21:29-0700","config":{"colWidth":12,"enabled":true,"results":{"1":{"graph":{"mode":"multiBarChart","height":366,"optionOpen":false},"helium":{}}},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala","title":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"\ndf1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [URL: string, count: bigint]\n"},{"type":"TABLE","data":"URL\tcount\nserps.com/tools/rank-checker/\t22612\nmoz.com/tools/rank-tracker\t21780\nwww.rankscanner.com/\t14356\nranktrackr.com/\t14059\nwww.shoutmeloud.com/5-excellent-websites-to-check-keyword-ranking-in-google.html\t11318\nwww.google.com/\t11177\nwww.google.co.uk/\t9853\nwww.link-assistant.com/rank-tracker/\t8539\nproranktracker.com/\t8529\nserps.com/\t8023\nwww.semrush.com/features/position-tracking/\t6890\nahrefs.com/rank-tracker\t6313\nsmallseotools.com/keyword-position/\t5641\nserpbook.com/\t5457\ngetstat.com/\t5148\nsmartserp.com/\t4450\nwww.ranktracker.com/\t4341\nwww.serplab.co.uk/\t4335\nwww.whatsmyserp.com/serpcheck.php\t4287\nwww.rankwatch.com/\t4268\n"}]},"apps":[],"jobName":"paragraph_1504324864924_-528778019","id":"20170901-210104_1353211701","dateCreated":"2017-09-01T21:01:04-0700","dateStarted":"2017-09-04T08:21:30-0700","dateFinished":"2017-09-04T08:22:56-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:2514"},{"text":"%md\n<div class=\"alert alert-warning\" role=\"alert\" style=\"margin: 10px\">\n<h3>Stat Search Analytics Q2</h3></br>\nThe goal here is to find those keywords which rank #1 but are the most changeable. I.e. they are not consistently ranked #1 for a given URL for the entire duration. 
To do this, we first group keywords according to URL. The most consistent will have higher counts and fewer rows in the resultant table. Note we filter for a single market and single device. </br></br>\nThen to identify the most changeable, we simply count the number of rows with the same keyword and visualize the result as a pie chart. Our results suggest that \"google update\" is the most changeable keyword.\n</div>","user":"anonymous","dateUpdated":"2017-09-04T08:24:19-0700","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"<div class=\"markdown-body\">\n<div class=\"alert alert-warning\" role=\"alert\" style=\"margin: 10px\">\n<h3>Stat Search Analytics Q2</h3></br>\nThe goal here is to find those keywords which rank #1 but are the most changeable. I.e. they are not consistently ranked #1 for a given URL for the entire duration. To do this, we first group keywords according to URL. The most consistent will have higher counts and fewer rows in the resultant table. Note we filter for a single market and single device. </br></br>\nThen to identify the most changeable, we simply count the number of rows with the same keyword and visualize the result as a pie chart. Our results suggest that \"google update\" is the most changeable keyword.\n</div>\n</div>"}]},"apps":[],"jobName":"paragraph_1504460980266_-773451688","id":"20170903-104940_1391865830","dateCreated":"2017-09-03T10:49:40-0700","dateStarted":"2017-09-04T08:24:19-0700","dateFinished":"2017-09-04T08:24:19-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:2515"},{"title":"2a. 
Count number of days at Rank 1","text":"val df2a = df.filter($\"Rank\" === 1)\n    .filter($\"Market\" === \"US-en\")\n    .filter($\"Device\" === \"smartphone\")\n    .groupBy($\"Keyword\", $\"URL\").agg(count($\"URL\") as \"count_at_rank1\")\n    .sort($\"count_at_rank1\".asc)\n    .cache\n    \ndf2a.show()","user":"anonymous","dateUpdated":"2017-09-04T08:24:22-0700","config":{"colWidth":12,"enabled":true,"results":{"1":{"graph":{"mode":"multiBarChart","height":300,"optionOpen":true,"setting":{"multiBarChart":{}},"commonSetting":{},"keys":[{"name":"Keyword","index":0,"aggr":"sum"}],"groups":[],"values":[{"name":"URL","index":1,"aggr":"sum"}]},"helium":{}}},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala","title":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"\ndf2a: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Keyword: string, URL: string ... 1 more field]\n+--------------------+--------------------+--------------+\n|             Keyword|                 URL|count_at_rank1|\n+--------------------+--------------------+--------------+\n|       google update|m.gadgets.ndtv.co...|             1|\n|website ranking tool|serps.com/tools/r...|             1|\n|what is a people ...|mashable.com/2015...|             1|\n|       google update|www.washingtonpos...|             1|\n|enterprise serp a...|getstat.com/take-...|             1|\n|organic click thr...|     www.google.com/|             1|\n|       google update|phandroid.com/201...|             1|\n|       google update|www.express.co.uk...|             1|\n|       google update|www.searchenginej...|             1|\n| google answer boxes|searchengineland....|             1|\n|       google update|www.xda-developer...|             1|\n|       google update|m.gsmarena.com/pl...|             1|\n|       google update|9to5google.com/20...|             1|\n|smartphone keywor...|www.workshopdigit...|             1|\n|     
  google update|www.androidheadli...|             1|\n|       search funnel|twitter.com/elisa...|             1|\n|optimizing for in...|seocopywriting.co...|             1|\n|       google update|www.theverge.com/...|             1|\n|                 ctr|en.wikipedia.org/...|             1|\n| personalized search|en.m.wikipedia.or...|             1|\n+--------------------+--------------------+--------------+\nonly showing top 20 rows\n\n"}]},"apps":[],"jobName":"paragraph_1504325635257_-180874190","id":"20170901-211355_1517773168","dateCreated":"2017-09-01T21:13:55-0700","dateStarted":"2017-09-04T08:24:22-0700","dateFinished":"2017-09-04T08:24:31-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:2516"},{"title":"2b. Top 20 Keywords that are the most changeable","text":"val df2b = df2a.groupBy($\"Keyword\").agg(count($\"Keyword\") as \"count\").sort($\"count\".desc)\nz.show(df2b.limit(20))","user":"anonymous","dateUpdated":"2017-09-04T08:24:25-0700","config":{"colWidth":12,"enabled":true,"results":{"1":{"graph":{"mode":"pieChart","height":300,"optionOpen":false},"helium":{}}},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala","title":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"\ndf2b: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Keyword: string, count: bigint]\n"},{"type":"TABLE","data":"Keyword\tcount\ngoogle update\t32\nsmartphone keyword tracking\t6\nhow much does stat cost\t5\nsegmentation and seo\t5\nhow to do rank tracking for smartphones\t5\nwhat is a paa\t5\nlocal seo\t4\ngoogle answers boxes\t4\nlocal search strategies\t4\ngoogle serp tracking\t4\nbest way to track keywords\t4\nmobile ranking\t4\nmobile seo\t4\nseo ranking\t4\nfeatured snippet format\t4\ngoogle algorithms\t4\noptimizing for intent\t4\nseo analytics\t3\nseo competitor analysis\t3\ntracking keywords for 
seo\t3\n"}]},"apps":[],"jobName":"paragraph_1504458357883_-686652202","id":"20170903-100557_709686274","dateCreated":"2017-09-03T10:05:57-0700","dateStarted":"2017-09-04T08:24:25-0700","dateFinished":"2017-09-04T08:24:35-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:2517"},{"text":"%md\n<div class=\"alert alert-warning\" role=\"alert\" style=\"margin: 10px\">\n<h3>Stat Search Analytics Q3</h3>\nHere we attempt to observe how similar data sets are for mobile devices vs desktop access. We note that the <i>Location</i> column in our dataset is unfortunately empty, so we work only with the <i>Market</i> and <i>Keyword</i> fields.</br></br>\n<b>Market Comparison</b></br>The results are captured as a pie chart showing market breakdown for mobile devices and then desktop access. Here we see that while US and UK users use mobile and desktop devices somewhat equally, Canadian users appear to favour mobile device usage.</br></br>\n<b>Keyword Comparison</b></br>\nThe results suggest that keyword distribution between mobile and desktop access is very similar, at least for the top 10 most frequently occurring keywords.\n\n</div>","user":"anonymous","dateUpdated":"2017-09-04T08:09:46-0700","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"<div class=\"markdown-body\">\n<div class=\"alert alert-warning\" role=\"alert\" style=\"margin: 10px\">\n<h3>Stat Search Analytics Q3</h3>\nHere we attempt to observe how similar data sets are for mobile devices vs desktop access. 
We note that the <i>Location</i> column in our dataset is unfortunately empty, so we work only with the <i>Market</i> and <i>Keyword</i> fields.</br></br>\n<b>Market Comparison</b></br>The results are captured as a pie chart showing market breakdown for mobile devices and then desktop access. Here we see that while US and UK users use mobile and desktop devices somewhat equally, Canadian users appear to favour mobile device usage.</br></br>\n<b>Keyword Comparison</b></br>\nThe results suggest that keyword distribution between mobile and desktop access is very similar, at least for the top 10 most frequently occurring keywords.\n\n</div>\n</div>"}]},"apps":[],"jobName":"paragraph_1504460996278_-2095448909","id":"20170903-104956_151682521","dateCreated":"2017-09-03T10:49:56-0700","dateStarted":"2017-09-04T08:09:47-0700","dateFinished":"2017-09-04T08:09:47-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:2518"},{"title":"3. Market breakdown for Mobile access","text":"val mobDf = df.filter($\"Device\" === \"smartphone\")\n                .groupBy($\"Market\").agg(count($\"Market\"))\nz.show(mobDf)","user":"anonymous","dateUpdated":"2017-09-04T08:25:02-0700","config":{"colWidth":6,"enabled":true,"results":{"1":{"graph":{"mode":"pieChart","height":298,"optionOpen":false},"helium":{}}},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala","title":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"\nmobDf: org.apache.spark.sql.DataFrame = [Market: string, count(Market): bigint]\n"},{"type":"TABLE","data":"Market\tcount(Market)\nCA-en\t449762\nUS-en\t615867\nGB-en\t489570\n"}]},"apps":[],"jobName":"paragraph_1504329609945_-459875976","id":"20170901-222009_1100515003","dateCreated":"2017-09-01T22:20:09-0700","dateStarted":"2017-09-04T08:25:02-0700","dateFinished":"2017-09-04T08:25:06-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:2519"},{"title":"3. 
Market breakdown for Desktop access","text":"val dtDf = df.filter($\"Device\" === \"desktop\")\n                .groupBy($\"Market\").agg(count($\"Market\"))\nz.show(dtDf)","user":"anonymous","dateUpdated":"2017-09-04T08:25:04-0700","config":{"colWidth":6,"enabled":true,"results":{"1":{"graph":{"mode":"pieChart","height":300,"optionOpen":false},"helium":{}}},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala","tableHide":false,"title":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"\ndtDf: org.apache.spark.sql.DataFrame = [Market: string, count(Market): bigint]\n"},{"type":"TABLE","data":"Market\tcount(Market)\nCA-en\t442774\nUS-en\t3991700\nGB-en\t3911365\n"}]},"apps":[],"jobName":"paragraph_1504456323692_1877434893","id":"20170903-093203_1275990522","dateCreated":"2017-09-03T09:32:03-0700","dateStarted":"2017-09-04T08:25:04-0700","dateFinished":"2017-09-04T08:25:11-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:2520"},{"title":"3. 
Top 10 Keywords for Mobile Access","text":"val mobDf = df.filter($\"Device\" === \"smartphone\")\n                .groupBy($\"Keyword\").agg(count($\"Keyword\") as \"count\")\n                .sort($\"count\".desc)\n                .limit(10)\nz.show(mobDf)","user":"anonymous","dateUpdated":"2017-09-04T08:25:25-0700","config":{"colWidth":6,"enabled":true,"results":{"1":{"graph":{"mode":"pieChart","height":300,"optionOpen":false},"helium":{}}},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala","title":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"\nmobDf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Keyword: string, count: bigint]\n"},{"type":"TABLE","data":"Keyword\tcount\ngoogle update\t10196\nvoice search\t9890\nwhat does voices search mean for seo\t9846\npersonalized search\t9706\npaa\t9668\nwhat is a paa\t9641\ngoogle algorithms\t9637\ngoogle answer boxes\t9561\ngoogle answer box\t9537\nseo competitors\t9523\n"}]},"apps":[],"jobName":"paragraph_1504455699065_1806924032","id":"20170903-092139_1763875013","dateCreated":"2017-09-03T09:21:39-0700","dateStarted":"2017-09-04T08:25:25-0700","dateFinished":"2017-09-04T08:25:31-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:2521"},{"title":"Top 10 Keywords for Desktop Access","text":"val dtDf = df.filter($\"Device\" === \"desktop\")\n                .groupBy($\"Keyword\").agg(count($\"Keyword\") as \"count\")\n                .sort($\"count\".desc)\n                .limit(10)\n\nz.show(dtDf)","user":"anonymous","dateUpdated":"2017-09-04T08:25:36-0700","config":{"colWidth":6,"enabled":true,"results":{"1":{"graph":{"mode":"pieChart","height":300,"optionOpen":false},"helium":{}}},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala","title":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"\ndtDf: 
org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Keyword: string, count: bigint]\n"},{"type":"TABLE","data":"Keyword\tcount\ngoogle algorithms\t9626\ngoogle update\t9585\ngoogle answer box\t9498\ncompetitive landscape\t9496\ngoogle answer boxes\t9488\noptimize for conversion\t9486\nwhat does ctr stand for\t9486\nseo keywords\t9484\nseo competitors\t9478\nseo strategies\t9474\n"}]},"apps":[],"jobName":"paragraph_1504456559649_-288379607","id":"20170903-093559_376949793","dateCreated":"2017-09-03T09:35:59-0700","dateStarted":"2017-09-04T08:25:36-0700","dateFinished":"2017-09-04T08:25:43-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:2522"},{"text":"%md\n<div class=\"alert alert-warning\" role=\"alert\" style=\"margin: 10px\">\n<h3>Stat Search Analytics Bonus Question!</h3>\nLet's compute a frequency histogram of keywords, showing the 200 most frequently occurring keywords. Then let's see which words appear most frequently in the keywords themselves.\n</div>","user":"anonymous","dateUpdated":"2017-09-04T09:00:33-0700","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala","editOnDblClick":true},"editorMode":"ace/mode/scala","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1504538755048_-446673311","id":"20170904-082555_2097613958","dateCreated":"2017-09-04T08:25:55-0700","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:4990","dateFinished":"2017-09-04T09:00:33-0700","dateStarted":"2017-09-04T09:00:33-0700","results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"<div class=\"markdown-body\">\n<div class=\"alert alert-warning\" role=\"alert\" style=\"margin: 10px\">\n<h3>Stat Search Analytics Bonus Question!</h3>\nLet's compute a frequency histogram of keywords, showing the 200 most frequently occurring keywords. 
Then let's see which words appear most frequently in the keywords themselves.\n</div>\n</div>"}]}},{"text":"val df4 = df.map(row => {\n                (row.getAs[String](\"Keyword\"), 1)})\n            .rdd\n            .reduceByKey(_+_)\n            .toDF(\"Keyword\", \"count\")\n            .sort($\"count\".desc).\n            limit(200)\n            \nz.show(df4)","user":"anonymous","dateUpdated":"2017-09-04T09:11:12-0700","config":{"colWidth":12,"enabled":true,"results":{"1":{"graph":{"mode":"multiBarChart","height":300,"optionOpen":false},"helium":{}}},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala","title":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1504326867869_1901144960","id":"20170901-213427_444581716","dateCreated":"2017-09-01T21:34:27-0700","dateStarted":"2017-09-04T09:11:12-0700","dateFinished":"2017-09-04T09:11:25-0700","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:2523","title":"Top 200 Most Frequently Occurring Keywords","results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"\ndf4: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Keyword: string, count: int]\n"},{"type":"TABLE","data":"Keyword\tcount\ngoogle update\t19781\nvoice search\t19346\ngoogle algorithms\t19263\nwhat does voices search mean for seo\t19150\npersonalized search\t19126\ngoogle answer boxes\t19049\ngoogle answer box\t19035\nseo competitors\t19001\ncompetitive landscape\t18988\nwhat is a paa\t18979\ndigital agency\t18973\nseo keywords\t18972\nseo strategies\t18969\npaa\t18939\nshare of voice\t18927\nwhat is a serp\t18900\nseo tools for seo agency\t18895\noptimize for conversion\t18875\nwhat does ctr stand for\t18875\nseo tools for seo agencies\t18870\nlocal search\t18861\ngoogle answers box\t18847\nlocal seo\t18840\nseo tools for digital agency\t18835\nwhat is search intent\t18834\nseo tools for digital agencies\t18832\nseo analytics\t18817\nbest tools for digital agencies\t18815\nserp 
ranking\t18800\nseo tracking\t18797\nhow to get a featured snippet\t18796\ngoogle feature snippets\t18795\ngoogle feature snippet\t18794\ngoogle featured snippets\t18794\ngoogle featured snippet\t18794\nseo intent\t18791\nwhat is a google answer box\t18791\nkeyword research\t18790\nanswer box format\t18789\nseo stats\t18787\nenterprise seo\t18787\nsearch analytics\t18787\nwhat is enterprise seo\t18786\ngoogle answers boxes\t18786\ncompetitor gap analysis\t18785\ndigital agencies\t18778\norganic share of voice\t18771\nanswer boxes\t18762\nlocal ranking\t18759\nwhat is the seo funnel\t18753\nurl tracking\t18747\nbest tools for seo agencies\t18747\nbest tools for seo agency\t18745\noptimizing for intent\t18735\nseo ranking\t18731\nwhat is people also ask\t18722\nlocal search strategies\t18720\nlocal search intent\t18720\nsearch marketing funnel\t18719\nbest tools for digital agency\t18717\nseo agency\t18699\nhow to pitch seo\t18699\nwhat is a people also ask\t18696\ntypes of featured snippets\t18694\nwhat is a featured snippet\t18690\nserp checker\t18681\nwhat is a paa box\t18680\nmobile ranking\t18669\nseo ranking tool\t18666\norganic competitive analysis\t18660\nsearch funnel\t18646\nenterprise ranking analytics tool\t18643\nhow to track serps\t18641\ntracking for seo agency\t18641\nkeyword tracking\t18640\ntracking for seo agencies\t18637\nserp tracking\t18636\nkeyword tracking platforms\t18636\nfeatured snippet format\t18634\nseo competitor analysis\t18632\nhow does personalized search affect seo\t18631\ntracking voice search\t18631\nseo agencies\t18623\nenterprise ranking analytics\t18622\ngoogles people also ask\t18622\nfeature snippet\t18622\norganic ctr\t18621\nbest rank tracking\t18619\nranking report\t18617\nrank tracking\t18617\ninformational search intent\t18617\nkeyword tracker\t18616\nhow much does stat cost\t18615\nfeatured snippets\t18614\nanswers boxes\t18612\nserp tracker\t18611\nnavigational search intent\t18611\ntransactional search 
intent\t18611\nhow to do organic competitive analysis\t18610\nfeatured snippet\t18610\nenterprise ranking tool\t18610\npaa box\t18608\npaa boxes\t18607\nseo funnel\t18606\ntracking for digital agency\t18605\nsearch intent\t18605\norganic ctr and seo\t18605\ntracking for digital agencies\t18605\nmobile seo\t18604\nanswers box\t18603\nstages of seo funnel\t18603\norganic competitive landscape\t18602\nwhat is organic click through rate\t18601\nseo and voice search\t18601\nstages of search funnel\t18601\nwhats people also ask\t18600\nhow does personalization affect seo\t18600\nsegmenting by search intent\t18600\nwhat type of answer boxes are there\t18600\nhow to track search intent\t18600\nhow to optimize for voice search\t18600\nsegmentation and seo\t18600\npitching seo\t18600\nhow to pitch seo services\t18600\nfeature snippets\t18600\npeople also ask\t18600\ncost for getstat\t18600\nsearch intent keywords\t18600\nwhere do the answers from people also ask come from\t18600\nanswer box\t18565\ngetstat cost\t18539\nhow to do rank tracking for smartphones\t18530\nwhat are paa boxes\t18510\ncommercial search intent\t18510\nsearcher funnel\t18510\nhow much does getstat cost\t18506\nwhat are people also ask boxes\t18505\nsmartphone keyword tracking\t18438\nenterprise rank tracker\t12957\nserp analysis\t12694\nenterprise seo platform\t12541\ncrawl rankings\t12523\nserp analytics\t12519\nenterprise serp analytics\t12434\nrank tracking enterprise\t12430\nenterprise rank tracking\t12409\ndaily serp analytics\t12408\ndaily rank tracker\t12406\ndaily rank tracking\t12404\nrank tracker\t12327\nlarge scale rank tracking\t12054\nhow to track keywords\t9540\ncompetitor intelligence\t9493\ngoogle queries per day\t9486\nanalytics search\t9427\nbing and yahoo\t9424\nuniversal results\t9417\npriced for scale\t9335\nseo tracker\t9331\nrank tracking for seo\t9330\nseo rank tracker\t9329\ngoogle serp tracking\t9321\nseo rank tracking\t9320\nrank tracker seo\t9319\nkeyword rank 
tracking\t9319\nrank tracking software\t9307\nkeyword stats\t9307\nflexible api\t9306\nlocal & mobile serps\t9305\nhigh volume keyword tracking\t9305\nanalytics seo\t9301\ntry stat\t9300\nunlimited serp tracking\t9300\nunlimited daily tracking\t9300\nlimitless rank tracking\t9275\nrank tracking tool\t9235\ngoogle algorithm\t6483\nalgorithm google\t6450\ngoogle updates\t6402\nwest vancouver seo\t6386\nseo vancouver\t6386\nvancouver seo\t6386\nlocal seo british columbia\t6386\nseo agencies new york\t6386\nwhat is google algorithm\t6330\nis keyword analysis\t6324\nhow to keyword research\t6324\nis yahoo and bing the same\t6324\nnumber of searches per day\t6324\nserps definition\t6324\nhow to use google analytics for seo\t6324\ngoogle number of searches per day\t6324\nhow many google searches per day\t6324\nhow does google's search algorithm work\t6324\noptimizing keywords\t6324\nhow many queries does google process a day\t6324\nsearches per day on google\t6324\nwhat is seo analytics\t6324\nwhat is google secure search\t6324\nwhat is keyword analysis\t6324\n"}]}},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{"1":{"graph":{"mode":"multiBarChart","height":300,"optionOpen":false},"helium":{}}},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala","title":true},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1504540028357_1528265219","id":"20170904-084708_1136285204","dateCreated":"2017-09-04T08:47:08-0700","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:6077","text":"val df5 = df.select($\"Keyword\")\n            .rdd\n            .flatMap(k => k.getAs[String](\"Keyword\").split(\" \"))\n            .map(word => (word,1))\n            .reduceByKey(_+_)\n            .toDF(\"word\", \"count\")\n            .sort($\"count\".desc)\n            
.limit(50)\n\nz.show(df5)","dateUpdated":"2017-09-04T08:59:16-0700","dateFinished":"2017-09-04T08:58:48-0700","dateStarted":"2017-09-04T08:58:35-0700","title":"Top 50 Most Frequently Occurring WORDS in the Keywords","results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"\ndf5: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [word: string, count: int]\n"},{"type":"TABLE","data":"word\tcount\nseo\t2611214\nrank\t1601134\ntracking\t1407579\nkeyword\t1352358\ngoogle\t1242192\nserp\t1177896\nranking\t1119035\nsearch\t1080846\ntracker\t1021341\nfor\t642897\ntool\t561878\nanalytics\t455216\nlocal\t444857\nwhat\t400573\ntrack\t394060\nenterprise\t386061\nkeywords\t377430\nhow\t377186\nis\t331953\nrankings\t317025\ntools\t312715\nserps\t282811\nanalysis\t262627\nto\t258819\nintent\t254936\nsoftware\t245249\nbing\t227855\nalgorithm\t218617\nof\t205877\nbest\t205512\nyahoo\t203258\ndigital\t192725\nagency\t180704\norganic\t180282\nand\t177912\nagencies\t168711\nstat\t163940\na\t162876\ndoes\t162685\napi\t157200\nbox\t149918\nfeatured\t149626\nboxes\t149431\nonline\t148827\ncompetitor\t146669\nupdate\t143768\nfunnel\t142963\nask\t136545\nmobile\t133536\nanswer\t131591\n"}]}},{"user":"anonymous","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"language":"scala"},"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1504538930800_-1095897423","id":"20170904-082850_18080398","dateCreated":"2017-09-04T08:28:50-0700","status":"READY","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:5073"}],"name":"stat-search 
assignment","id":"2CSWFJW5U","angularObjects":{"2CJFTQHZ4:shared_process":[],"2CEUS9DKM:shared_process":[],"2CFFFDCER:shared_process":[],"2CEGZPJT2:shared_process":[],"2CG74DNN1:shared_process":[],"2CF8F6KBU:shared_process":[],"2CEUW6C7J:shared_process":[],"2CGA351E2:shared_process":[],"2CGZHT8TG:shared_process":[],"2CHU7A7Y4:shared_process":[],"2CFP8F79Y:shared_process":[],"2CJ438FVX:shared_process":[],"2CHDXD6WZ:shared_process":[],"2CJE9PS6N:shared_process":[],"2CGXNSCD4:shared_process":[],"2CG1P1D68:shared_process":[],"2CG11WSPH:shared_process":[],"2CHBHWQD8:shared_process":[],"2CH1NQ1RG:shared_process":[]},"config":{"looknfeel":"default","personalizedMode":"false"},"info":{}}