
Commit eb603c2

Merge branch 'development' of github.com:v3io/tutorials
2 parents 2fe285b + 9671985 commit eb603c2

File tree

1 file changed: +71 −50 lines


getting-started/spark-sql-analytics.ipynb

Lines changed: 71 additions & 50 deletions
@@ -137,7 +137,7 @@
137137
"# Initiate a new Spark Session\n",
138138
"spark = SparkSession.builder.appName(\"Spark Session with Default Configurations\").getOrCreate()\n",
139139
"\n",
140-
"# Retreive and view all the default Spark configurations:\n",
140+
"# Retrieve and view all the default Spark configurations:\n",
141141
"# conf = spark.sparkContext._conf.getAll()\n",
142142
"# print(conf)\n",
143143
"conf = spark.sparkContext._conf\n",
@@ -169,7 +169,7 @@
169169
"The nature of your datasets and data models, the data-access methods that you select to use, and your hardware resources are all relevant factors in selecting your configuration.\n",
170170
"The [Test the SQL Performance on a Partitioned NoSQL Table with Different Spark Configurations](#test-sql-perf-on-partitioned-nosql-table-w-different-spark-cfgs) section of this tutorial demonstrates how to test Spark SQL performance on a partitioned NoSQL table in the platform with different Spark configurations.\n",
171171
"\n",
172-
"The following Spark configuration priorities are especially worth noting:\n",
172+
"The following Spark configuration properties are especially worth noting:\n",
173173
"- `spark.driver.cores`\n",
174174
"- `spark.driver.memory`\n",
175175
"- `spark.executor.cores`\n",
@@ -279,8 +279,8 @@
279279
"<a id=\"load-data-from-amazon-s3\"></a>\n",
280280
"### Load Data from Amazon S3\n",
281281
"\n",
282-
"Load a file from S3 to Spark DataFrame <br>\n",
283-
"File URL of the form `s3a://bucket/path/to/file` <br>"
282+
"Load a file from an Amazon S3 bucket into a Spark DataFrame.<br>\n",
283+
"The URL of the S3 file should be of the form `s3a://bucket/path/to/file`."
284284
]
285285
},
286286
{
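A minimal sketch of this read, assuming the Hadoop S3A connector is available to the Spark session; the credentials and bucket path below are placeholders, not values from the tutorial:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Read from S3").getOrCreate()

# Placeholder AWS credentials -- replace with your own (or rely on instance roles)
hconf = spark._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "<AWS access key>")
hconf.set("fs.s3a.secret.key", "<AWS secret key>")

# File URL of the form s3a://bucket/path/to/file
df_s3 = spark.read.csv("s3a://<bucket>/<path>/stocks.csv",
                       header=True, inferSchema=True)
df_s3.show(5)
```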
@@ -335,15 +335,18 @@
335335
"toc-hr-collapsed": true
336336
},
337337
"source": [
338-
"#### Or Copy a file from AWS S3 to Iguazio\n",
339-
"Alternative, you can copy the data to Iguazio Data Container first. <br>"
338+
"#### Copy a File from an AWS S3 Bucket to the Platform\n",
339+
"\n",
340+
"Alternatively, you can first copy the data to a platform data container."
340341
]
341342
},
342343
{
343344
"cell_type": "markdown",
344345
"metadata": {},
345346
"source": [
346-
"##### Create directory `stock` in default Data Container"
347+
"##### Create a Directory in a Platform Data Container\n",
348+
"\n",
349+
"Create a directory (`DIR1`) in your user home directory in the \"users\" platform data container (`V3IO_HOME`)."
347350
]
348351
},
349352
{
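A minimal sketch of creating the directory through the platform's local file-system mount; the `/v3io` mount prefix is an assumption about the environment, and `stock` stands in for the `DIR1` placeholder:

```python
import os

# V3IO_HOME is assumed to resolve to "users/<running user>"
dir_path = os.path.join("/v3io", os.environ["V3IO_HOME"], "stock")
os.makedirs(dir_path, exist_ok=True)
print("Created", dir_path)
```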
@@ -359,7 +362,9 @@
359362
"cell_type": "markdown",
360363
"metadata": {},
361364
"source": [
362-
"##### Copy a csv file from an AWS S3 to Iguazio as `stocks.csv`. <br>"
365+
"##### Copy a CSV file from an AWS S3 Bucket to the Platform\n",
366+
"\n",
367+
"Copy a CSV file from an Amazon Simple Storage (S3) bucket to a **stocks.csv** file in a platform data container."
363368
]
364369
},
365370
{
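A sketch of one way to perform the copy, assuming the source object is reachable over HTTPS; the URL is a placeholder, and the target path reuses the hypothetical mount from the previous sketch:

```python
import os
import urllib.request

src_url = "https://<bucket>.s3.amazonaws.com/<path>/stocks.csv"   # placeholder
dst_path = os.path.join("/v3io", os.environ["V3IO_HOME"], "stock", "stocks.csv")

urllib.request.urlretrieve(src_url, dst_path)
print("Copied to", dst_path)
```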
@@ -385,7 +390,7 @@
385390
"cell_type": "markdown",
386391
"metadata": {},
387392
"source": [
388-
"##### List files in Iguazion Data Container"
393+
"##### List Files in a Platform Data-Container Directory"
389394
]
390395
},
391396
{
@@ -417,7 +422,7 @@
417422
"cell_type": "markdown",
418423
"metadata": {},
419424
"source": [
420-
"### Set up source file path with filename"
425+
"### Define Platform File-Path Variables"
421426
]
422427
},
423428
{
@@ -434,13 +439,19 @@
434439
"cell_type": "markdown",
435440
"metadata": {},
436441
"source": [
437-
"### Create a Spark DataFrame, load a Iguazio file <br>\n",
442+
"### Load a File from a Platform Data Container into a Spark DataFrame\n",
438443
"\n",
439-
"Here, use Infer Schema to create a DataFrame that infers the input schema automatically from data. <br>\n",
440-
"Also, you can specify a schema instead. <br>\n",
444+
"Read the CSV file that you saved to the platform data container into a Spark DataFrame.<br>\n",
445+
"The following code example uses the `inferSchema` option to automatically infer the schema of the read data (recommended).\n",
446+
"Alternatively, you can define the schema manually:\n",
441447
"\n",
442-
"`schema = StructType(fields)` <br>\n",
443-
"`df = spark.read...option (\"Schema\", schema)....` <br>"
448+
"```python\n",
449+
"schema = StructType([\n",
450+
" StructField(\"<field name>\", <field type>, <is Null>),\n",
451+
" ...])\n",
452+
"df = spark.read.schema(schema)\n",
453+
"...\n",
454+
"```"
444455
]
445456
},
446457
{
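A hedged sketch of both read options, assuming a `file_path` variable (hypothetical name) holds the platform path of **stocks.csv** defined in the previous step; the manual-schema field names and types are illustrative only:

```python
from pyspark.sql.types import StructType, StructField, StringType

# Option 1: infer the schema automatically from the data (recommended)
df = spark.read.csv(file_path, header=True, inferSchema=True)

# Option 2: define the schema manually (illustrative fields)
schema = StructType([
    StructField("ISIN", StringType(), False),
    StructField("Date", StringType(), False),
    StructField("Time", StringType(), False),
])
df_manual = spark.read.schema(schema).csv(file_path, header=True)
```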
@@ -471,7 +482,7 @@
471482
"cell_type": "markdown",
472483
"metadata": {},
473484
"source": [
474-
"### Print out Schema"
485+
"### Print the Schema"
475486
]
476487
},
477488
{
@@ -557,8 +568,8 @@
557568
"\n",
558569
"In this section, let's walk through two examples:\n",
559570
"\n",
560-
"1. Use pymysql, Python MySQL client library and Pandas DataFrame to load data from MySQL\n",
561-
"2. Use Spark JDBC to read table from AWS Redshift\n",
571+
"1. Use the PyMySQL Python MySQL client library and a pandas DataFrame to load data from a MySQL database.\n",
572+
"2. Use Spark JDBC to read a table from AWS Redshift.\n",
562573
"\n",
563574
"\n",
564575
"For more details read [read-external-db](read-external-db.ipynb) and [Spark JDBC to Databases](SparkJDBCtoDBs.ipynb)"
@@ -578,11 +589,12 @@
578589
"cell_type": "markdown",
579590
"metadata": {},
580591
"source": [
581-
"##### Create a Database Connection to MySQL\n",
582-
"Reading from MySQL as a bulk operation using pandas DataFrames.\n",
592+
"##### Create a MySQL Database Connection\n",
583593
"\n",
584-
"**NOTE** If this notebook runs in AWS Cloud:\n",
585-
"AWS S3 provides **eventual consistency**. Therefore, it takes time for users using the persisted data and software package."
594+
"Read from a MySQL database as a bulk operation using pandas DataFrames.\n",
595+
"\n",
596+
"> **AWS Cloud Note:** If you're running the notebook code from the AWS cloud, note that AWS S3 provides **eventual consistency**.\n",
597+
"Therefore, it takes time for users using the persisted data and software package."
586598
]
587599
},
588600
{
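A sketch of the bulk read with PyMySQL and pandas; the host, credentials, database, and table name are placeholders rather than the tutorial's actual values:

```python
import pymysql
import pandas as pd

# Placeholder connection details -- replace with your own MySQL endpoint
conn = pymysql.connect(host="<mysql-host>", user="<user>",
                       password="<password>", database="<database>")

# Bulk-read the "family" table into a pandas DataFrame
pdf = pd.read_sql("SELECT * FROM family", conn)
conn.close()
```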
@@ -691,7 +703,7 @@
691703
"cell_type": "markdown",
692704
"metadata": {},
693705
"source": [
694-
"##### Create a Spark DataFrame from Pandas DataFrame"
706+
"##### Create a Spark DataFrame from a pandas DataFrame"
695707
]
696708
},
697709
{
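A minimal sketch, assuming the pandas DataFrame from the previous step is named `pdf` (a hypothetical name):

```python
# Convert the pandas DataFrame into a Spark DataFrame
dfMySQL = spark.createDataFrame(pdf)
dfMySQL.printSchema()
```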
@@ -707,7 +719,9 @@
707719
"cell_type": "markdown",
708720
"metadata": {},
709721
"source": [
710-
"##### Display a few Records of family Table"
722+
"##### Display Table Records\n",
723+
"\n",
724+
"Display a few records of the \"family\" table that was read into the `dfMySQL` DataFrame in the previous steps."
711725
]
712726
},
713727
{
@@ -741,7 +755,9 @@
741755
"cell_type": "markdown",
742756
"metadata": {},
743757
"source": [
744-
"##### Print family Table Schema"
758+
"##### Print the Table Schema\n",
759+
"\n",
760+
"Print the schema of the \"family\" table that was read into the `dfMySQL` DataFrame."
745761
]
746762
},
747763
{
@@ -772,7 +788,9 @@
772788
"cell_type": "markdown",
773789
"metadata": {},
774790
"source": [
775-
"##### Register as a Table for Spark SQL query"
791+
"##### Register as a Table for Spark SQL Queries\n",
792+
"\n",
793+
"Define a temporary Spark view for running Spark SQL queries on the \"family\" table that was read into the `dfMySQL` DataFrame."
776794
]
777795
},
778796
{
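A minimal sketch of registering the view and querying it; the view name `family` is an assumption based on the table name:

```python
# Register the DataFrame as a temporary view for Spark SQL queries
dfMySQL.createOrReplaceTempView("family")

# Example query against the registered view
spark.sql("SELECT COUNT(*) AS num_records FROM family").show()
```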
@@ -788,7 +806,9 @@
788806
"cell_type": "markdown",
789807
"metadata": {},
790808
"source": [
791-
"##### Count Number of Records of family Table"
809+
"##### Count Table Records\n",
810+
"\n",
811+
"Use Spark SQL to count the number records in the \"family\" table."
792812
]
793813
},
794814
{
@@ -817,7 +837,9 @@
817837
"cell_type": "markdown",
818838
"metadata": {},
819839
"source": [
820-
"##### Verify If auto_wiki could be a Unique Key of family Table"
840+
"##### Check for a Unique Key\n",
841+
"\n",
842+
"Check whether the `auto_wiki` column can serve as a unique key (attribute) of the \"family\" table."
821843
]
822844
},
823845
{
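One way to run this check (a sketch against the hypothetical `family` view from the earlier sketch): if the distinct count equals the total row count, the column can serve as a unique key.

```python
spark.sql("""
    SELECT COUNT(*)                  AS total_rows,
           COUNT(DISTINCT auto_wiki) AS distinct_auto_wiki
    FROM family
""").show()
```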
@@ -851,13 +873,12 @@
851873
"<a id=\"load-data-from-external-table-amazon-redshift\"></a>\n",
852874
"#### Use Amazon Redshift as an External Data Source\n",
853875
"\n",
854-
"The `spark-redshift` library is a data source API for [Amazon Redshift](https://aws.amazon.com/redshift/). <br>\n",
876+
"The **spark-redshift** library is a data source API for [Amazon Redshift](https://aws.amazon.com/redshift/).\n",
855877
"\n",
856-
"**Spark driver to Redshift**: \n",
857-
"The Spark driver connects to Redshift via JDBC using a username and password. Redshift does not support the use of IAM roles to authenticate this connection. <br>\n",
878+
"**Spark driver to Redshift:** The Spark driver connects to Redshift via JDBC using a username and password.\n",
879+
"Redshift doesn't support the use of IAM roles to authenticate this connection.\n",
858880
"\n",
859-
"**Spark to S3**:\n",
860-
"S3 acts as a middleman to store bulk data when reading from or writing to Redshift. <br>"
881+
"**Spark to AWS S3:** S3 acts as a middleman to store bulk data when reading from or writing to Redshift."
861882
]
862883
},
863884
{
@@ -866,7 +887,7 @@
866887
"source": [
867888
"##### Create an Amazon S3 Bucket\n",
868889
"\n",
869-
"Create an Amazon S3 bucket named `redshift-spark`:"
890+
"Create an Amazon S3 bucket named \"redshift-spark\"."
870891
]
871892
},
872893
{
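A sketch of creating the bucket programmatically with boto3 (the tutorial might instead use the AWS CLI); valid AWS credentials are assumed, and outside `us-east-1` you would also pass a `CreateBucketConfiguration` with your region:

```python
import boto3

s3 = boto3.client("s3")
s3.create_bucket(Bucket="redshift-spark")
```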
@@ -905,8 +926,8 @@
905926
"source": [
906927
"##### Load a Redshift Table into a Spark DataFrame\n",
907928
"\n",
908-
"The `.format(\"com.databricks.spark.redshift\")` line tells the Data Sources API that we are using the `spark-redshift` package. <br>\n",
909-
"Enable `spark-redshift` to use the `tmpS3Dir` temporary location in S3 to store temporary files generated by `spark-redshift`. <br>"
929+
"The `.format(\"com.databricks.spark.redshift\")` line tells the Spark Data Sources API that you're using the **spark-redshift** package.<br>\n",
930+
"Enable **spark-redshift** to use the **tmpS3Dir** temporary location in the S3 bucket to store temporary files generated by **spark-redshift**."
910931
]
911932
},
912933
{
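A hedged sketch of the Redshift read; the JDBC URL, table name, and S3 temporary directory are placeholders:

```python
jdbc_url = ("jdbc:redshift://<cluster-endpoint>:5439/<database>"
            "?user=<user>&password=<password>")            # placeholder
tmpS3Dir = "s3a://redshift-spark/temp/"                     # placeholder temp location

dfRedshift = (spark.read
              .format("com.databricks.spark.redshift")
              .option("url", jdbc_url)
              .option("dbtable", "<table-name>")
              .option("tempdir", tmpS3Dir)
              .load())
```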
@@ -1020,7 +1041,7 @@
10201041
"<a id=\"load-data-from-unstructured-file\"></a>\n",
10211042
"### Load Data from an Unstructured File\n",
10221043
"\n",
1023-
"> **NOTE:** Beginning with version 2.4, Spak supports loading images."
1044+
"> **Note:** Beginning with version 2.4, Spark supports loading images."
10241045
]
10251046
},
10261047
{
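A minimal sketch of the image data source added in Spark 2.4; the directory path is a placeholder:

```python
# Requires Spark 2.4 or later
df_images = spark.read.format("image").load("<path/to/images/>")
df_images.select("image.origin", "image.width", "image.height").show(truncate=False)
```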
@@ -1084,8 +1105,8 @@
10841105
"<a id=\"spark-sql\"></a>\n",
10851106
"## Use Spark SQL\n",
10861107
"\n",
1087-
"Now, let's run some Spark SQL for analyze the stock dataset that was loaded to df DataFrame. <br>\n",
1088-
"The first few SQL commands list a few lines of selected columns in the dataset, as well as get some statistics of numerical columns. <br>"
1108+
"Now, some Spark SQL queries to analyze the dataset that was loaded into `df` Spark DataFrame.<br>\n",
1109+
"The first SQL queries list a few lines of selected columns in the dataset and retrieve some statistics of numerical columns."
10891110
]
10901111
},
10911112
{
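A minimal sketch of the first kind of query, listing a few rows of selected columns; the column names are assumptions based on the stock dataset used later in the tutorial:

```python
# List a few rows of selected columns from the loaded dataset
df.select("ISIN", "Date", "Time").show(5)
```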
@@ -1127,7 +1148,7 @@
11271148
"cell_type": "markdown",
11281149
"metadata": {},
11291150
"source": [
1130-
"#### Retreive first a few rows"
1151+
"#### Retrieve Data from the First Rows"
11311152
]
11321153
},
11331154
{
@@ -1158,7 +1179,7 @@
11581179
"source": [
11591180
"#### Summary and Descriptive Statistics\n",
11601181
"\n",
1161-
"The function **describe** returns a DataFrame containing information such as number of non-null entries (count), mean, standard deviation, and minimum and maximum value for each numerical column."
1182+
"The function `describe` returns a DataFrame containing information such as the number of non-null entries (`count`), mean, standard deviation (`stddev`), and the minimum (`min`) and maximum (`max`) values for each numerical column."
11621183
]
11631184
},
11641185
{
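A minimal usage sketch:

```python
# count, mean, stddev, min, and max for each numerical column
df.describe().show()
```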
@@ -1201,7 +1222,7 @@
12011222
"cell_type": "markdown",
12021223
"metadata": {},
12031224
"source": [
1204-
"#### Register as a Table for Further Analytics"
1225+
"#### Register a Table View for Further Analytics"
12051226
]
12061227
},
12071228
{
@@ -1248,7 +1269,7 @@
12481269
"cell_type": "markdown",
12491270
"metadata": {},
12501271
"source": [
1251-
"#### Analyze Data to Find a Column or a Few Columns for the Unique Key"
1272+
"#### Analyze Data to Identify Unique-Key Columns"
12521273
]
12531274
},
12541275
{
@@ -1313,7 +1334,7 @@
13131334
"cell_type": "markdown",
13141335
"metadata": {},
13151336
"source": [
1316-
"#### Combination of ISIN, Date, Time can be the Unqiue Key"
1337+
"A combination of `ISIN`, `Date`, and `Time` can serve as a unqiue key:"
13171338
]
13181339
},
13191340
{
@@ -1393,7 +1414,7 @@
13931414
"cell_type": "markdown",
13941415
"metadata": {},
13951416
"source": [
1396-
"#### Register another Table with a Unique Key"
1417+
"#### Register Another Table with a Unique Key"
13971418
]
13981419
},
13991420
{
@@ -1409,7 +1430,7 @@
14091430
"cell_type": "markdown",
14101431
"metadata": {},
14111432
"source": [
1412-
"#### Verify the Key is Unique"
1433+
"#### Verify that the Key is Unique"
14131434
]
14141435
},
14151436
{
@@ -1470,7 +1491,7 @@
14701491
"cell_type": "markdown",
14711492
"metadata": {},
14721493
"source": [
1473-
"Results show that **All data in this stock dataset is of the same date.**"
1494+
"Results show that **all data in this dataset is of the same date.**"
14741495
]
14751496
},
14761497
{
@@ -2124,7 +2145,7 @@
21242145
"cell_type": "markdown",
21252146
"metadata": {},
21262147
"source": [
2127-
"#### Create a partition table"
2148+
"#### Create a Partitioned Table"
21282149
]
21292150
},
21302151
{
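A generic sketch of writing a partitioned copy of the data with the standard DataFrame writer (the tutorial targets a partitioned platform NoSQL table, so the actual format, output path, and partition columns may differ):

```python
(df.write
   .partitionBy("Date", "Time")
   .mode("overwrite")
   .parquet("<path/to/partitioned/table>"))
```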
@@ -2178,7 +2199,7 @@
21782199
"cell_type": "markdown",
21792200
"metadata": {},
21802201
"source": [
2181-
"##### A Full Table Scan"
2202+
"##### Perform A Full Table Scan"
21822203
]
21832204
},
21842205
{
@@ -2495,5 +2516,5 @@
24952516
}
24962517
},
24972518
"nbformat": 4,
2498-
"nbformat_minor": 2
2519+
"nbformat_minor": 4
24992520
}
