|
137 | 137 | "# Initiate a new Spark Session\n", |
138 | 138 | "spark = SparkSession.builder.appName(\"Spark Session with Default Configurations\").getOrCreate()\n", |
139 | 139 | "\n", |
140 | | - "# Retreive and view all the default Spark configurations:\n", |
| 140 | + "# Retrieve and view all the default Spark configurations:\n", |
141 | 141 | "# conf = spark.sparkContext._conf.getAll()\n", |
142 | 142 | "# print(conf)\n", |
143 | 143 | "conf = spark.sparkContext._conf\n", |
|
169 | 169 | "The nature of your datasets and data models, the data-access methods that you select to use, and your hardware resources are all relevant factors in selecting your configuration.\n", |
170 | 170 | "The [Test the SQL Performance on a Partitioned NoSQL Table with Different Spark Configurations](#test-sql-perf-on-partitioned-nosql-table-w-different-spark-cfgs) section of this tutorial demonstrates how to test Spark SQL performance on a partitioned NoSQL table in the platform with different Spark configurations.\n", |
171 | 171 | "\n", |
172 | | - "The following Spark configuration priorities are especially worth noting:\n", |
| 172 | + "The following Spark configuration properties are especially worth noting:\n", |
173 | 173 | "- `spark.driver.cores`\n", |
174 | 174 | "- `spark.driver.memory`\n", |
175 | 175 | "- `spark.executor.cores`\n", |
|
279 | 279 | "<a id=\"load-data-from-amazon-s3\"></a>\n", |
280 | 280 | "### Load Data from Amazon S3\n", |
281 | 281 | "\n", |
282 | | - "Load a file from S3 to Spark DataFrame <br>\n", |
283 | | - "File URL of the form `s3a://bucket/path/to/file` <br>" |
| 282 | + "Load a file from an Amazon S3 bucket into a Spark DataFrame.<br>\n", |
| 283 | + "The URL of the S3 file should be of the form `s3a://bucket/path/to/file`." |
284 | 284 | ] |
285 | 285 | }, |
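| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch of such a read, assuming a CSV file with a header row and S3 credentials (`fs.s3a` options) that are already configured; the bucket, path, and file name are placeholders:\n",
| | + "\n",
| | + "```python\n",
| | + "df_s3 = spark.read.options(header=True, inferSchema=True).csv(\"s3a://<bucket>/<path>/<file>.csv\")\n",
| | + "df_s3.show(5)\n",
| | + "```"
| | + ]
| | + },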
286 | 286 | { |
|
335 | 335 | "toc-hr-collapsed": true |
336 | 336 | }, |
337 | 337 | "source": [ |
338 | | - "#### Or Copy a file from AWS S3 to Iguazio\n", |
339 | | - "Alternative, you can copy the data to Iguazio Data Container first. <br>" |
| 338 | + "#### Copy a File from an AWS S3 Bucket to the Platform\n", |
| 339 | + "\n", |
| 340 | + "Alternatively, you can first copy the data to a platform data container." |
340 | 341 | ] |
341 | 342 | }, |
342 | 343 | { |
343 | 344 | "cell_type": "markdown", |
344 | 345 | "metadata": {}, |
345 | 346 | "source": [ |
346 | | - "##### Create directory `stock` in default Data Container" |
| 347 | + "##### Create a Directory in a Platform Data Container\n", |
| 348 | + "\n", |
| 349 | + "Create a directory (`DIR1`) in your user home directory in the \"users\" platform data container (`V3IO_HOME`)." |
347 | 350 | ] |
348 | 351 | }, |
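| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch, assuming the \"users\" container is FUSE-mounted at `/v3io` and the `V3IO_HOME` environment variable is set to your running-user home directory in this container:\n",
| | + "\n",
| | + "```python\n",
| | + "import os\n",
| | + "\n",
| | + "# Build the target path on the v3io FUSE mount and create the directory\n",
| | + "dir_path = os.path.join(\"/v3io\", os.environ[\"V3IO_HOME\"], \"DIR1\")\n",
| | + "os.makedirs(dir_path, exist_ok=True)\n",
| | + "```"
| | + ]
| | + },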
349 | 352 | { |
|
359 | 362 | "cell_type": "markdown", |
360 | 363 | "metadata": {}, |
361 | 364 | "source": [ |
362 | | - "##### Copy a csv file from an AWS S3 to Iguazio as `stocks.csv`. <br>" |
| 365 | + "##### Copy a CSV file from an AWS S3 Bucket to the Platform\n", |
| 366 | + "\n", |
| 367 | + "Copy a CSV file from an Amazon Simple Storage (S3) bucket to a **stocks.csv** file in a platform data container." |
363 | 368 | ] |
364 | 369 | }, |
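| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch, assuming the source file is publicly readable over HTTPS (the URL is a placeholder) and the target is the `DIR1` directory that was created in the previous step:\n",
| | + "\n",
| | + "```python\n",
| | + "import os\n",
| | + "import urllib.request\n",
| | + "\n",
| | + "# Placeholder source-file URL; replace with the URL of your S3 object\n",
| | + "src_url = \"https://<bucket>.s3.amazonaws.com/<path>/<file>.csv\"\n",
| | + "dst_path = os.path.join(\"/v3io\", os.environ[\"V3IO_HOME\"], \"DIR1\", \"stocks.csv\")\n",
| | + "urllib.request.urlretrieve(src_url, dst_path)\n",
| | + "```"
| | + ]
| | + },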
365 | 370 | { |
|
385 | 390 | "cell_type": "markdown", |
386 | 391 | "metadata": {}, |
387 | 392 | "source": [ |
388 | | - "##### List files in Iguazion Data Container" |
| 393 | + "##### List Files in a Platform Data-Container Directory" |
389 | 394 | ] |
390 | 395 | }, |
391 | 396 | { |
|
417 | 422 | "cell_type": "markdown", |
418 | 423 | "metadata": {}, |
419 | 424 | "source": [ |
420 | | - "### Set up source file path with filename" |
| 425 | + "### Define Platform File-Path Variables" |
421 | 426 | ] |
422 | 427 | }, |
423 | 428 | { |
|
434 | 439 | "cell_type": "markdown", |
435 | 440 | "metadata": {}, |
436 | 441 | "source": [ |
437 | | - "### Create a Spark DataFrame, load a Iguazio file <br>\n", |
| 442 | + "### Load a File from a Platform Data Container into a Spark DataFrame\n", |
438 | 443 | "\n", |
439 | | - "Here, use Infer Schema to create a DataFrame that infers the input schema automatically from data. <br>\n", |
440 | | - "Also, you can specify a schema instead. <br>\n", |
| 444 | + "Read the CSV file that you saved to the platform data container into a Spark DataFrame.<br>\n", |
| 445 | + "The following code example uses the `inferSchema` option to automatically infer the schema of the read data (recommended).\n", |
| 446 | + "Alternatively, you can define the schema manually:\n", |
441 | 447 | "\n", |
442 | | - "`schema = StructType(fields)` <br>\n", |
443 | | - "`df = spark.read...option (\"Schema\", schema)....` <br>" |
| 448 | + "```python\n", |
| 449 | + "schema = StructType([\n", |
| 450 | + " StructField(\"<field name>\", <field type>, <is Null>),\n", |
| 451 | + " ...])\n", |
| 452 | + "df = spark.read.schema(schema)\n", |
| 453 | + "...\n", |
| 454 | + "```" |
444 | 455 | ] |
445 | 456 | }, |
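| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch of the `inferSchema` read; `file_path` is a hypothetical variable that holds the full path of the **stocks.csv** file (for example, as set in the preceding file-path variables section), and the file is assumed to be comma-separated with a header row:\n",
| | + "\n",
| | + "```python\n",
| | + "df = spark.read.options(header=True, inferSchema=True).csv(file_path)\n",
| | + "df.show(5)\n",
| | + "```"
| | + ]
| | + },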
446 | 457 | { |
|
471 | 482 | "cell_type": "markdown", |
472 | 483 | "metadata": {}, |
473 | 484 | "source": [ |
474 | | - "### Print out Schema" |
| 485 | + "### Print the Schema" |
475 | 486 | ] |
476 | 487 | }, |
477 | 488 | { |
|
557 | 568 | "\n", |
558 | 569 | "In this section, let's walk through two examples:\n", |
559 | 570 | "\n", |
560 | | - "1. Use pymysql, Python MySQL client library and Pandas DataFrame to load data from MySQL\n", |
561 | | - "2. Use Spark JDBC to read table from AWS Redshift\n", |
| 571 | + "1. Use the PyMySQL Python MySQL client library and a pandas DataFrame to load data from a MySQL database.\n", |
| 572 | + "2. Use Spark JDBC to read a table from AWS Redshift.\n", |
562 | 573 | "\n", |
563 | 574 | "\n", |
564 | 575 | "For more details read [read-external-db](read-external-db.ipynb) and [Spark JDBC to Databases](SparkJDBCtoDBs.ipynb)" |
|
578 | 589 | "cell_type": "markdown", |
579 | 590 | "metadata": {}, |
580 | 591 | "source": [ |
581 | | - "##### Create a Database Connection to MySQL\n", |
582 | | - "Reading from MySQL as a bulk operation using pandas DataFrames.\n", |
| 592 | + "##### Create a MySQL Database Connection\n", |
583 | 593 | "\n", |
584 | | - "**NOTE** If this notebook runs in AWS Cloud:\n", |
585 | | - "AWS S3 provides **eventual consistency**. Therefore, it takes time for users using the persisted data and software package." |
| 594 | + "Read from a MySQL database as a bulk operation using pandas DataFrames.\n", |
| 595 | + "\n", |
| 596 | + "> **AWS Cloud Note:** If you're running the notebook code from the AWS cloud, note that AWS S3 provides **eventual consistency**.\n", |
| 597 | + "Therefore, it takes time for users using the persisted data and software package." |
586 | 598 | ] |
587 | 599 | }, |
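| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch, assuming the PyMySQL package is installed; the connection details are placeholders, and the query reads the \"family\" table that's used in the following steps:\n",
| | + "\n",
| | + "```python\n",
| | + "import pymysql\n",
| | + "import pandas as pd\n",
| | + "\n",
| | + "# Placeholder connection details; replace with your MySQL host, user, password, and database\n",
| | + "conn = pymysql.connect(host=\"<mysql host>\", user=\"<user>\", password=\"<password>\", db=\"<database>\")\n",
| | + "df_pd = pd.read_sql(\"SELECT * FROM family\", con=conn)\n",
| | + "conn.close()\n",
| | + "```"
| | + ]
| | + },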
588 | 600 | { |
|
691 | 703 | "cell_type": "markdown", |
692 | 704 | "metadata": {}, |
693 | 705 | "source": [ |
694 | | - "##### Create a Spark DataFrame from Pandas DataFrame" |
| 706 | + "##### Create a Spark DataFrame from a pandas DataFrame" |
695 | 707 | ] |
696 | 708 | }, |
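| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A one-line sketch; `df_pd` is the hypothetical pandas DataFrame name from the previous sketch, and `dfMySQL` matches the Spark DataFrame name that's referenced in the following steps:\n",
| | + "\n",
| | + "```python\n",
| | + "dfMySQL = spark.createDataFrame(df_pd)\n",
| | + "```"
| | + ]
| | + },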
697 | 709 | { |
|
707 | 719 | "cell_type": "markdown", |
708 | 720 | "metadata": {}, |
709 | 721 | "source": [ |
710 | | - "##### Display a few Records of family Table" |
| 722 | + "##### Display Table Records\n", |
| 723 | + "\n", |
| 724 | + "Display a few records of the \"family\" table that was read into the `dfMySQL` DataFrame in the previous steps." |
711 | 725 | ] |
712 | 726 | }, |
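| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A one-line sketch using the DataFrame `show` method:\n",
| | + "\n",
| | + "```python\n",
| | + "dfMySQL.show(5)\n",
| | + "```"
| | + ]
| | + },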
713 | 727 | { |
|
741 | 755 | "cell_type": "markdown", |
742 | 756 | "metadata": {}, |
743 | 757 | "source": [ |
744 | | - "##### Print family Table Schema" |
| 758 | + "##### Print the Table Schema\n", |
| 759 | + "\n", |
| 760 | + "Print the schema of the \"family\" table that was read into the `dfMySQL` DataFrame." |
745 | 761 | ] |
746 | 762 | }, |
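| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A one-line sketch using the DataFrame `printSchema` method:\n",
| | + "\n",
| | + "```python\n",
| | + "dfMySQL.printSchema()\n",
| | + "```"
| | + ]
| | + },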
747 | 763 | { |
|
772 | 788 | "cell_type": "markdown", |
773 | 789 | "metadata": {}, |
774 | 790 | "source": [ |
775 | | - "##### Register as a Table for Spark SQL query" |
| 791 | + "##### Register as a Table for Spark SQL Queries\n", |
| 792 | + "\n", |
| 793 | + "Define a temporary Spark view for running Spark SQL queries on the \"family\" table that was read into the `dfMySQL` DataFrame." |
776 | 794 | ] |
777 | 795 | }, |
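| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch; the \"family\" view name is an assumption that matches the table name used in the SQL queries in the following steps:\n",
| | + "\n",
| | + "```python\n",
| | + "dfMySQL.createOrReplaceTempView(\"family\")\n",
| | + "```"
| | + ]
| | + },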
778 | 796 | { |
|
788 | 806 | "cell_type": "markdown", |
789 | 807 | "metadata": {}, |
790 | 808 | "source": [ |
791 | | - "##### Count Number of Records of family Table" |
| 809 | + "##### Count Table Records\n", |
| 810 | + "\n", |
| 811 | + "Use Spark SQL to count the number records in the \"family\" table." |
792 | 812 | ] |
793 | 813 | }, |
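| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch, assuming the temporary view from the previous step is named \"family\":\n",
| | + "\n",
| | + "```python\n",
| | + "spark.sql(\"SELECT COUNT(*) AS num_records FROM family\").show()\n",
| | + "```"
| | + ]
| | + },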
794 | 814 | { |
|
817 | 837 | "cell_type": "markdown", |
818 | 838 | "metadata": {}, |
819 | 839 | "source": [ |
820 | | - "##### Verify If auto_wiki could be a Unique Key of family Table" |
| 840 | + "##### Check for a Unique Key\n", |
| 841 | + "\n", |
| 842 | + "Check whether the `auto_wiki` column can serve as a unique key (attribute) of the \"family\" table." |
821 | 843 | ] |
822 | 844 | }, |
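| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch of one way to run this check: compare the total row count with the distinct count of the candidate key column (the \"family\" view name is assumed, as above):\n",
| | + "\n",
| | + "```python\n",
| | + "spark.sql(\"\"\"\n",
| | + "    SELECT COUNT(*) AS total_rows,\n",
| | + "           COUNT(DISTINCT auto_wiki) AS distinct_auto_wiki\n",
| | + "    FROM family\n",
| | + "\"\"\").show()\n",
| | + "```"
| | + ]
| | + },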
823 | 845 | { |
|
851 | 873 | "<a id=\"load-data-from-external-table-amazon-redshift\"></a>\n", |
852 | 874 | "#### Use Amazon Redshift as an External Data Source\n", |
853 | 875 | "\n", |
854 | | - "The `spark-redshift` library is a data source API for [Amazon Redshift](https://aws.amazon.com/redshift/). <br>\n", |
| 876 | + "The **spark-redshift** library is a data source API for [Amazon Redshift](https://aws.amazon.com/redshift/).\n", |
855 | 877 | "\n", |
856 | | - "**Spark driver to Redshift**: \n", |
857 | | - "The Spark driver connects to Redshift via JDBC using a username and password. Redshift does not support the use of IAM roles to authenticate this connection. <br>\n", |
| 878 | + "**Spark driver to Redshift:** The Spark driver connects to Redshift via JDBC using a username and password.\n", |
| 879 | + "Redshift doesn't support the use of IAM roles to authenticate this connection.\n", |
858 | 880 | "\n", |
859 | | - "**Spark to S3**:\n", |
860 | | - "S3 acts as a middleman to store bulk data when reading from or writing to Redshift. <br>" |
| 881 | + "**Spark to AWS S3:** S3 acts as a middleman to store bulk data when reading from or writing to Redshift." |
861 | 882 | ] |
862 | 883 | }, |
863 | 884 | { |
|
866 | 887 | "source": [ |
867 | 888 | "##### Create an Amazon S3 Bucket\n", |
868 | 889 | "\n", |
869 | | - "Create an Amazon S3 bucket named `redshift-spark`:" |
| 890 | + "Create an Amazon S3 bucket named \"redshift-spark\"." |
870 | 891 | ] |
871 | 892 | }, |
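| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch using the boto3 package (an assumption; you can use any S3 client or the AWS CLI), with AWS credentials that are already configured in the environment:\n",
| | + "\n",
| | + "```python\n",
| | + "import boto3\n",
| | + "\n",
| | + "s3 = boto3.client(\"s3\")\n",
| | + "# Outside of us-east-1, also pass CreateBucketConfiguration={\"LocationConstraint\": \"<region>\"}\n",
| | + "s3.create_bucket(Bucket=\"redshift-spark\")\n",
| | + "```"
| | + ]
| | + },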
872 | 893 | { |
|
905 | 926 | "source": [ |
906 | 927 | "##### Load a Redshift Table into a Spark DataFrame\n", |
907 | 928 | "\n", |
908 | | - "The `.format(\"com.databricks.spark.redshift\")` line tells the Data Sources API that we are using the `spark-redshift` package. <br>\n", |
909 | | - "Enable `spark-redshift` to use the `tmpS3Dir` temporary location in S3 to store temporary files generated by `spark-redshift`. <br>" |
| 929 | + "The `.format(\"com.databricks.spark.redshift\")` line tells the Spark Data Sources API that you're using the **spark-redshift** package.<br>\n", |
| 930 | + "Enable **spark-redshift** to use the **tmpS3Dir** temporary location in the S3 bucket to store temporary files generated by **spark-redshift**." |
910 | 931 | ] |
911 | 932 | }, |
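| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch of such a read; the JDBC URL and table name are placeholders, and `tmpS3Dir` is assumed to hold the URL of the temporary S3 location. Depending on your setup, you might also need to configure how **spark-redshift** authenticates to S3:\n",
| | + "\n",
| | + "```python\n",
| | + "dfRedshift = (spark.read\n",
| | + "    .format(\"com.databricks.spark.redshift\")\n",
| | + "    .option(\"url\", \"jdbc:redshift://<host>:5439/<database>?user=<user>&password=<password>\")\n",
| | + "    .option(\"dbtable\", \"<schema>.<table>\")\n",
| | + "    .option(\"tempdir\", tmpS3Dir)\n",
| | + "    .load())\n",
| | + "dfRedshift.show(5)\n",
| | + "```"
| | + ]
| | + },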
912 | 933 | { |
|
1020 | 1041 | "<a id=\"load-data-from-unstructured-file\"></a>\n", |
1021 | 1042 | "### Load Data from an Unstructured File\n", |
1022 | 1043 | "\n", |
1023 | | - "> **NOTE:** Beginning with version 2.4, Spak supports loading images." |
| 1044 | + "> **Note:** Beginning with version 2.4, Spark supports loading images." |
1024 | 1045 | ] |
1025 | 1046 | }, |
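| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch of the image data source (requires Spark 2.4 or later); the directory path is a placeholder:\n",
| | + "\n",
| | + "```python\n",
| | + "df_images = spark.read.format(\"image\").load(\"<images dir>\")\n",
| | + "df_images.select(\"image.origin\", \"image.width\", \"image.height\").show(truncate=False)\n",
| | + "```"
| | + ]
| | + },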
1026 | 1047 | { |
|
1084 | 1105 | "<a id=\"spark-sql\"></a>\n", |
1085 | 1106 | "## Use Spark SQL\n", |
1086 | 1107 | "\n", |
1087 | | - "Now, let's run some Spark SQL for analyze the stock dataset that was loaded to df DataFrame. <br>\n", |
1088 | | - "The first few SQL commands list a few lines of selected columns in the dataset, as well as get some statistics of numerical columns. <br>" |
| 1108 | + "Now, some Spark SQL queries to analyze the dataset that was loaded into `df` Spark DataFrame.<br>\n", |
| 1109 | + "The first SQL queries list a few lines of selected columns in the dataset and retrieve some statistics of numerical columns." |
1089 | 1110 | ] |
1090 | 1111 | }, |
1091 | 1112 | { |
|
1127 | 1148 | "cell_type": "markdown", |
1128 | 1149 | "metadata": {}, |
1129 | 1150 | "source": [ |
1130 | | - "#### Retreive first a few rows" |
| 1151 | + "#### Retrieve Data from the First Rows" |
1131 | 1152 | ] |
1132 | 1153 | }, |
1133 | 1154 | { |
|
1158 | 1179 | "source": [ |
1159 | 1180 | "#### Summary and Descriptive Statistics\n", |
1160 | 1181 | "\n", |
1161 | | - "The function **describe** returns a DataFrame containing information such as number of non-null entries (count), mean, standard deviation, and minimum and maximum value for each numerical column." |
| 1182 | + "The function `describe` returns a DataFrame containing information such as the number of non-null entries (`count`), mean, standard deviation (`stddev`), and the minimum (`min`) and maximum (`max`) values for each numerical column." |
1162 | 1183 | ] |
1163 | 1184 | }, |
1164 | 1185 | { |
|
1201 | 1222 | "cell_type": "markdown", |
1202 | 1223 | "metadata": {}, |
1203 | 1224 | "source": [ |
1204 | | - "#### Register as a Table for Further Analytics" |
| 1225 | + "#### Register a Table View for Further Analytics" |
1205 | 1226 | ] |
1206 | 1227 | }, |
1207 | 1228 | { |
|
1248 | 1269 | "cell_type": "markdown", |
1249 | 1270 | "metadata": {}, |
1250 | 1271 | "source": [ |
1251 | | - "#### Analyze Data to Find a Column or a Few Columns for the Unique Key" |
| 1272 | + "#### Analyze Data to Identify Unique-Key Columns" |
1252 | 1273 | ] |
1253 | 1274 | }, |
1254 | 1275 | { |
|
1313 | 1334 | "cell_type": "markdown", |
1314 | 1335 | "metadata": {}, |
1315 | 1336 | "source": [ |
1316 | | - "#### Combination of ISIN, Date, Time can be the Unqiue Key" |
| 1337 | + "A combination of `ISIN`, `Date`, and `Time` can serve as a unqiue key:" |
1317 | 1338 | ] |
1318 | 1339 | }, |
1319 | 1340 | { |
|
1393 | 1414 | "cell_type": "markdown", |
1394 | 1415 | "metadata": {}, |
1395 | 1416 | "source": [ |
1396 | | - "#### Register another Table with a Unique Key" |
| 1417 | + "#### Register Another Table with a Unique Key" |
1397 | 1418 | ] |
1398 | 1419 | }, |
1399 | 1420 | { |
|
1409 | 1430 | "cell_type": "markdown", |
1410 | 1431 | "metadata": {}, |
1411 | 1432 | "source": [ |
1412 | | - "#### Verify the Key is Unique" |
| 1433 | + "#### Verify that the Key is Unique" |
1413 | 1434 | ] |
1414 | 1435 | }, |
1415 | 1436 | { |
|
1470 | 1491 | "cell_type": "markdown", |
1471 | 1492 | "metadata": {}, |
1472 | 1493 | "source": [ |
1473 | | - "Results show that **All data in this stock dataset is of the same date.**" |
| 1494 | + "Results show that **all data in this dataset is of the same date.**" |
1474 | 1495 | ] |
1475 | 1496 | }, |
1476 | 1497 | { |
|
2124 | 2145 | "cell_type": "markdown", |
2125 | 2146 | "metadata": {}, |
2126 | 2147 | "source": [ |
2127 | | - "#### Create a partition table" |
| 2148 | + "#### Create a Partitioned Table" |
2128 | 2149 | ] |
2129 | 2150 | }, |
2130 | 2151 | { |
|
2178 | 2199 | "cell_type": "markdown", |
2179 | 2200 | "metadata": {}, |
2180 | 2201 | "source": [ |
2181 | | - "##### A Full Table Scan" |
| 2202 | + "##### Perform A Full Table Scan" |
2182 | 2203 | ] |
2183 | 2204 | }, |
2184 | 2205 | { |
|
2495 | 2516 | } |
2496 | 2517 | }, |
2497 | 2518 | "nbformat": 4, |
2498 | | - "nbformat_minor": 2 |
| 2519 | + "nbformat_minor": 4 |
2499 | 2520 | } |