|
137 | 137 | "# Initiate a new Spark Session\n", |
138 | 138 | "spark = SparkSession.builder.appName(\"Spark Session with Default Configurations\").getOrCreate()\n", |
139 | 139 | "\n", |
140 | | - "# Retreive and view all the default Spark configurations:\n", |
| 140 | + "# Retrieve and view all the default Spark configurations:\n", |
141 | 141 | "# conf = spark.sparkContext._conf.getAll()\n", |
142 | 142 | "# print(conf)\n", |
143 | 143 | "conf = spark.sparkContext._conf\n", |
|
169 | 169 | "The nature of your datasets and data models, the data-access methods that you select to use, and your hardware resources are all relevant factors in selecting your configuration.\n", |
170 | 170 | "The [Test the SQL Performance on a Partitioned NoSQL Table with Different Spark Configurations](#test-sql-perf-on-partitioned-nosql-table-w-different-spark-cfgs) section of this tutorial demonstrates how to test Spark SQL performance on a partitioned NoSQL table in the platform with different Spark configurations.\n", |
171 | 171 | "\n", |
172 | | - "The following Spark configuration priorities are especially worth noting:\n", |
| 172 | + "The following Spark configuration properties are especially worth noting:\n", |
173 | 173 | "- `spark.driver.cores`\n", |
174 | 174 | "- `spark.driver.memory`\n", |
175 | 175 | "- `spark.executor.cores`\n", |
|
279 | 279 | "<a id=\"load-data-from-amazon-s3\"></a>\n", |
280 | 280 | "### Load Data from Amazon S3\n", |
281 | 281 | "\n", |
282 | | - "Load a file from S3 to Spark DataFrame <br>\n", |
283 | | - "File URL of the form `s3a://bucket/path/to/file` <br>" |
| 282 | + "Load a file from an Amazon S3 bucket into a Spark DataFrame.<br>\n", |
| 283 | + "The URL of the S3 file should be of the form `s3a://bucket/path/to/file`." |
284 | 284 | ] |
285 | 285 | }, |
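| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch of such a read, assuming a CSV file with a header row and S3 credentials (`fs.s3a` options) that are already configured; the bucket, path, and file name are placeholders:\n",
| | + "\n",
| | + "```python\n",
| | + "df_s3 = spark.read.options(header=True, inferSchema=True).csv(\"s3a://<bucket>/<path>/<file>.csv\")\n",
| | + "df_s3.show(5)\n",
| | + "```"
| | + ]
| | + },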
286 | 286 | { |
|
335 | 335 | "toc-hr-collapsed": true |
336 | 336 | }, |
337 | 337 | "source": [ |
338 | | - "#### Or Copy a file from AWS S3 to Iguazio\n", |
339 | | - "Alternative, you can copy the data to Iguazio Data Container first. <br>" |
| 338 | + "#### Copy a File from an AWS S3 Bucket to the Platform\n", |
| 339 | + "\n", |
| 340 | + "Alternatively, you can first copy the data to a platform data container." |
340 | 341 | ] |
341 | 342 | }, |
342 | 343 | { |
343 | 344 | "cell_type": "markdown", |
344 | 345 | "metadata": {}, |
345 | 346 | "source": [ |
346 | | - "##### Create directory `stock` in default Data Container" |
| 347 | + "##### Create a Directory in a Platform Data Container\n", |
| 348 | + "\n", |
| 349 | + "Create a directory (`DIR1`) in your user home directory in the \"users\" platform data container (`V3IO_HOME`)." |
347 | 350 | ] |
348 | 351 | }, |
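| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch, assuming the \"users\" container is FUSE-mounted at `/v3io` and the `V3IO_HOME` environment variable is set to your running-user home directory in this container:\n",
| | + "\n",
| | + "```python\n",
| | + "import os\n",
| | + "\n",
| | + "# Build the target path on the v3io FUSE mount and create the directory\n",
| | + "dir_path = os.path.join(\"/v3io\", os.environ[\"V3IO_HOME\"], \"DIR1\")\n",
| | + "os.makedirs(dir_path, exist_ok=True)\n",
| | + "```"
| | + ]
| | + },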
349 | 352 | { |
|
359 | 362 | "cell_type": "markdown", |
360 | 363 | "metadata": {}, |
361 | 364 | "source": [ |
362 | | - "##### Copy a csv file from an AWS S3 to Iguazio as `stocks.csv`. <br>" |
| 365 | + "##### Copy a CSV file from an AWS S3 Bucket to the Platform\n", |
| 366 | + "\n", |
| 367 | + "Copy a CSV file from an Amazon Simple Storage (S3) bucket to a **stocks.csv** file in a platform data container." |
363 | 368 | ] |
364 | 369 | }, |
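| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch, assuming the source file is publicly readable over HTTPS (the URL is a placeholder) and the target is the `DIR1` directory that was created in the previous step:\n",
| | + "\n",
| | + "```python\n",
| | + "import os\n",
| | + "import urllib.request\n",
| | + "\n",
| | + "# Placeholder source-file URL; replace with the URL of your S3 object\n",
| | + "src_url = \"https://<bucket>.s3.amazonaws.com/<path>/<file>.csv\"\n",
| | + "dst_path = os.path.join(\"/v3io\", os.environ[\"V3IO_HOME\"], \"DIR1\", \"stocks.csv\")\n",
| | + "urllib.request.urlretrieve(src_url, dst_path)\n",
| | + "```"
| | + ]
| | + },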
365 | 370 | { |
|
385 | 390 | "cell_type": "markdown", |
386 | 391 | "metadata": {}, |
387 | 392 | "source": [ |
388 | | - "##### List files in Iguazion Data Container" |
| 393 | + "##### List Files in a Platform Data-Container Directory" |
389 | 394 | ] |
390 | 395 | }, |
391 | 396 | { |
|
417 | 422 | "cell_type": "markdown", |
418 | 423 | "metadata": {}, |
419 | 424 | "source": [ |
420 | | - "### Set up source file path with filename" |
| 425 | + "### Define Platform File-Path Variables" |
421 | 426 | ] |
422 | 427 | }, |
423 | 428 | { |
|
434 | 439 | "cell_type": "markdown", |
435 | 440 | "metadata": {}, |
436 | 441 | "source": [ |
437 | | - "### Create a Spark DataFrame, load a Iguazio file <br>\n", |
| 442 | + "### Load a File from a Platform Data Container into a Spark DataFrame\n", |
438 | 443 | "\n", |
439 | | - "Here, use Infer Schema to create a DataFrame that infers the input schema automatically from data. <br>\n", |
440 | | - "Also, you can specify a schema instead. <br>\n", |
| 444 | + "Read the CSV file that you saved to the platform data container into a Spark DataFrame.<br>\n", |
| 445 | + "The following code example uses the `inferSchema` option to automatically infer the schema of the read data (recommended).\n", |
| 446 | + "Alternatively, you can define the schema manually:\n", |
441 | 447 | "\n", |
442 | | - "`schema = StructType(fields)` <br>\n", |
443 | | - "`df = spark.read...option (\"Schema\", schema)....` <br>" |
| 448 | + "```python\n", |
| 449 | + "schema = StructType([\n", |
| 450 | + " StructField(\"<field name>\", <field type>, <is Null>),\n", |
| 451 | + " ...])\n", |
| 452 | + "df = spark.read.schema(schema)\n", |
| 453 | + "...\n", |
| 454 | + "```" |
444 | 455 | ] |
445 | 456 | }, |
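| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch of the `inferSchema` read; `file_path` is a hypothetical variable that holds the full path of the **stocks.csv** file (for example, as set in the preceding file-path variables section), and the file is assumed to be comma-separated with a header row:\n",
| | + "\n",
| | + "```python\n",
| | + "df = spark.read.options(header=True, inferSchema=True).csv(file_path)\n",
| | + "df.show(5)\n",
| | + "```"
| | + ]
| | + },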
446 | 457 | { |
|
471 | 482 | "cell_type": "markdown", |
472 | 483 | "metadata": {}, |
473 | 484 | "source": [ |
474 | | - "### Print out Schema" |
| 485 | + "### Print the Schema" |
475 | 486 | ] |
476 | 487 | }, |
477 | 488 | { |
|
557 | 568 | "\n", |
558 | 569 | "In this section, let's walk through two examples:\n", |
559 | 570 | "\n", |
560 | | - "1. Use pymysql, Python MySQL client library and Pandas DataFrame to load data from MySQL\n", |
561 | | - "2. Use Spark JDBC to read table from AWS Redshift\n", |
| 571 | + "1. Use the PyMySQL Python MySQL client library and a pandas DataFrame to load data from a MySQL database.\n", |
| 572 | + "2. Use Spark JDBC to read a table from AWS Redshift.\n", |
562 | 573 | "\n", |
563 | 574 | "\n", |
564 | 575 | "For more details read [read-external-db](read-external-db.ipynb) and [Spark JDBC to Databases](SparkJDBCtoDBs.ipynb)" |
|
578 | 589 | "cell_type": "markdown", |
579 | 590 | "metadata": {}, |
580 | 591 | "source": [ |
581 | | - "##### Create a Database Connection to MySQL\n", |
582 | | - "Reading from MySQL as a bulk operation using pandas DataFrames.\n", |
| 592 | + "##### Create a MySQL Database Connection\n", |
583 | 593 | "\n", |
584 | | - "**NOTE** If this notebook runs in AWS Cloud:\n", |
585 | | - "AWS S3 provides **eventual consistency**. Therefore, it takes time for users using the persisted data and software package." |
| 594 | + "Read from a MySQL database as a bulk operation using pandas DataFrames.\n", |
| 595 | + "\n", |
| 596 | + "> **AWS Cloud Note:** If you're running the notebook code from the AWS cloud, note that AWS S3 provides **eventual consistency**.\n", |
| 597 | + "Therefore, it takes time for users using the persisted data and software package." |
586 | 598 | ] |
587 | 599 | }, |
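| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch, assuming the PyMySQL package is installed; the connection details are placeholders, and the query reads the \"family\" table that's used in the following steps:\n",
| | + "\n",
| | + "```python\n",
| | + "import pymysql\n",
| | + "import pandas as pd\n",
| | + "\n",
| | + "# Placeholder connection details; replace with your MySQL host, user, password, and database\n",
| | + "conn = pymysql.connect(host=\"<mysql host>\", user=\"<user>\", password=\"<password>\", db=\"<database>\")\n",
| | + "df_pd = pd.read_sql(\"SELECT * FROM family\", con=conn)\n",
| | + "conn.close()\n",
| | + "```"
| | + ]
| | + },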
588 | 600 | { |
|
691 | 703 | "cell_type": "markdown", |
692 | 704 | "metadata": {}, |
693 | 705 | "source": [ |
694 | | - "##### Create a Spark DataFrame from Pandas DataFrame" |
| 706 | + "##### Create a Spark DataFrame from a pandas DataFrame" |
695 | 707 | ] |
696 | 708 | }, |
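| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A one-line sketch; `df_pd` is the hypothetical pandas DataFrame name from the previous sketch, and `dfMySQL` matches the Spark DataFrame name that's referenced in the following steps:\n",
| | + "\n",
| | + "```python\n",
| | + "dfMySQL = spark.createDataFrame(df_pd)\n",
| | + "```"
| | + ]
| | + },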
697 | 709 | { |
|
707 | 719 | "cell_type": "markdown", |
708 | 720 | "metadata": {}, |
709 | 721 | "source": [ |
710 | | - "##### Display a few Records of family Table" |
| 722 | + "##### Display Table Records\n", |
| 723 | + "\n", |
| 724 | + "Display a few records of the \"family\" table that was read into the `dfMySQL` DataFrame in the previous steps." |
711 | 725 | ] |
712 | 726 | }, |
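| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A one-line sketch using the DataFrame `show` method:\n",
| | + "\n",
| | + "```python\n",
| | + "dfMySQL.show(5)\n",
| | + "```"
| | + ]
| | + },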
713 | 727 | { |
|
741 | 755 | "cell_type": "markdown", |
742 | 756 | "metadata": {}, |
743 | 757 | "source": [ |
744 | | - "##### Print family Table Schema" |
| 758 | + "##### Print the Table Schema\n", |
| 759 | + "\n", |
| 760 | + "Print the schema of the \"family\" table that was read into the `dfMySQL` DataFrame." |
745 | 761 | ] |
746 | 762 | }, |
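| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A one-line sketch using the DataFrame `printSchema` method:\n",
| | + "\n",
| | + "```python\n",
| | + "dfMySQL.printSchema()\n",
| | + "```"
| | + ]
| | + },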
747 | 763 | { |
|
772 | 788 | "cell_type": "markdown", |
773 | 789 | "metadata": {}, |
774 | 790 | "source": [ |
775 | | - "##### Register as a Table for Spark SQL query" |
| 791 | + "##### Register as a Table for Spark SQL Queries\n", |
| 792 | + "\n", |
| 793 | + "Define a temporary Spark view for running Spark SQL queries on the \"family\" table that was read into the `dfMySQL` DataFrame." |
776 | 794 | ] |
777 | 795 | }, |
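| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch; the \"family\" view name is an assumption that matches the table name used in the SQL queries in the following steps:\n",
| | + "\n",
| | + "```python\n",
| | + "dfMySQL.createOrReplaceTempView(\"family\")\n",
| | + "```"
| | + ]
| | + },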
778 | 796 | { |
|
788 | 806 | "cell_type": "markdown", |
789 | 807 | "metadata": {}, |
790 | 808 | "source": [ |
791 | | - "##### Count Number of Records of family Table" |
| 809 | + "##### Count Table Records\n", |
| 810 | + "\n", |
| 811 | + "Use Spark SQL to count the number records in the \"family\" table." |
792 | 812 | ] |
793 | 813 | }, |
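| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch, assuming the temporary view from the previous step is named \"family\":\n",
| | + "\n",
| | + "```python\n",
| | + "spark.sql(\"SELECT COUNT(*) AS num_records FROM family\").show()\n",
| | + "```"
| | + ]
| | + },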
794 | 814 | { |
|
817 | 837 | "cell_type": "markdown", |
818 | 838 | "metadata": {}, |
819 | 839 | "source": [ |
820 | | - "##### Verify If auto_wiki could be a Unique Key of family Table" |
| 840 | + "##### Check for a Unique Key\n", |
| 841 | + "\n", |
| 842 | + "Check whether the `auto_wiki` column can serve as a unique key (attribute) of the \"family\" table." |
821 | 843 | ] |
822 | 844 | }, |
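| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch of one way to run this check: compare the total row count with the distinct count of the candidate key column (the \"family\" view name is assumed, as above):\n",
| | + "\n",
| | + "```python\n",
| | + "spark.sql(\"\"\"\n",
| | + "    SELECT COUNT(*) AS total_rows,\n",
| | + "           COUNT(DISTINCT auto_wiki) AS distinct_auto_wiki\n",
| | + "    FROM family\n",
| | + "\"\"\").show()\n",
| | + "```"
| | + ]
| | + },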
823 | 845 | { |
|
851 | 873 | "<a id=\"load-data-from-external-table-amazon-redshift\"></a>\n", |
852 | 874 | "#### Use Amazon Redshift as an External Data Source\n", |
853 | 875 | "\n", |
854 | | - "The `spark-redshift` library is a data source API for [Amazon Redshift](https://aws.amazon.com/redshift/). <br>\n", |
| 876 | + "The **spark-redshift** library is a data source API for [Amazon Redshift](https://aws.amazon.com/redshift/).\n", |
855 | 877 | "\n", |
856 | | - "**Spark driver to Redshift**: \n", |
857 | | - "The Spark driver connects to Redshift via JDBC using a username and password. Redshift does not support the use of IAM roles to authenticate this connection. <br>\n", |
| 878 | + "**Spark driver to Redshift:** The Spark driver connects to Redshift via JDBC using a username and password.\n", |
| 879 | + "Redshift doesn't support the use of IAM roles to authenticate this connection.\n", |
858 | 880 | "\n", |
859 | | - "**Spark to S3**:\n", |
860 | | - "S3 acts as a middleman to store bulk data when reading from or writing to Redshift. <br>" |
| 881 | + "**Spark to AWS S3:** S3 acts as a middleman to store bulk data when reading from or writing to Redshift." |
861 | 882 | ] |
862 | 883 | }, |
863 | 884 | { |
|
866 | 887 | "source": [ |
867 | 888 | "##### Create an Amazon S3 Bucket\n", |
868 | 889 | "\n", |
869 | | - "Create an Amazon S3 bucket named `redshift-spark`:" |
| 890 | + "Create an Amazon S3 bucket named \"redshift-spark\"." |
870 | 891 | ] |
871 | 892 | }, |
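| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch using the boto3 package (an assumption; you can use any S3 client or the AWS CLI), with AWS credentials that are already configured in the environment:\n",
| | + "\n",
| | + "```python\n",
| | + "import boto3\n",
| | + "\n",
| | + "s3 = boto3.client(\"s3\")\n",
| | + "# Outside of us-east-1, also pass CreateBucketConfiguration={\"LocationConstraint\": \"<region>\"}\n",
| | + "s3.create_bucket(Bucket=\"redshift-spark\")\n",
| | + "```"
| | + ]
| | + },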
872 | 893 | { |
|
905 | 926 | "source": [ |
906 | 927 | "##### Load a Redshift Table into a Spark DataFrame\n", |
907 | 928 | "\n", |
908 | | - "The `.format(\"com.databricks.spark.redshift\")` line tells the Data Sources API that we are using the `spark-redshift` package. <br>\n", |
909 | | - "Enable `spark-redshift` to use the `tmpS3Dir` temporary location in S3 to store temporary files generated by `spark-redshift`. <br>" |
| 929 | + "The `.format(\"com.databricks.spark.redshift\")` line tells the Spark Data Sources API that you're using the **spark-redshift** package.<br>\n", |
| 930 | + "Enable **spark-redshift** to use the **tmpS3Dir** temporary location in the S3 bucket to store temporary files generated by **spark-redshift**." |
910 | 931 | ] |
911 | 932 | }, |
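| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch of such a read; the JDBC URL and table name are placeholders, and `tmpS3Dir` is assumed to hold the URL of the temporary S3 location. Depending on your setup, you might also need to configure how **spark-redshift** authenticates to S3:\n",
| | + "\n",
| | + "```python\n",
| | + "dfRedshift = (spark.read\n",
| | + "    .format(\"com.databricks.spark.redshift\")\n",
| | + "    .option(\"url\", \"jdbc:redshift://<host>:5439/<database>?user=<user>&password=<password>\")\n",
| | + "    .option(\"dbtable\", \"<schema>.<table>\")\n",
| | + "    .option(\"tempdir\", tmpS3Dir)\n",
| | + "    .load())\n",
| | + "dfRedshift.show(5)\n",
| | + "```"
| | + ]
| | + },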
912 | 933 | { |
|
1020 | 1041 | "<a id=\"load-data-from-unstructured-file\"></a>\n", |
1021 | 1042 | "### Load Data from an Unstructured File\n", |
1022 | 1043 | "\n", |
1023 | | - "> **NOTE:** Beginning with version 2.4, Spak supports loading images." |
| 1044 | + "> **Note:** Beginning with version 2.4, Spark supports loading images." |
1024 | 1045 | ] |
1025 | 1046 | }, |
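| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch of the image data source (requires Spark 2.4 or later); the directory path is a placeholder:\n",
| | + "\n",
| | + "```python\n",
| | + "df_images = spark.read.format(\"image\").load(\"<images dir>\")\n",
| | + "df_images.select(\"image.origin\", \"image.width\", \"image.height\").show(truncate=False)\n",
| | + "```"
| | + ]
| | + },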
1026 | 1047 | { |
|
1084 | 1105 | "<a id=\"spark-sql\"></a>\n", |
1085 | 1106 | "## Use Spark SQL\n", |
1086 | 1107 | "\n", |
1087 | | - "Now, let's run some Spark SQL for analyze the stock dataset that was loaded to df DataFrame. <br>\n", |
1088 | | - "The first few SQL commands list a few lines of selected columns in the dataset, as well as get some statistics of numerical columns. <br>" |
| 1108 | + "Now, some Spark SQL queries to analyze the dataset that was loaded into `df` Spark DataFrame.<br>\n", |
| 1109 | + "The first SQL queries list a few lines of selected columns in the dataset and retrieve some statistics of numerical columns." |
1089 | 1110 | ] |
1090 | 1111 | }, |
1091 | 1112 | { |
|
1127 | 1148 | "cell_type": "markdown", |
1128 | 1149 | "metadata": {}, |
1129 | 1150 | "source": [ |
1130 | | - "#### Retreive first a few rows" |
| 1151 | + "#### Retrieve Data from the First Rows" |
1131 | 1152 | ] |
1132 | 1153 | }, |
1133 | 1154 | { |
|
1158 | 1179 | "source": [ |
1159 | 1180 | "#### Summary and Descriptive Statistics\n", |
1160 | 1181 | "\n", |
1161 | | - "The function **describe** returns a DataFrame containing information such as number of non-null entries (count), mean, standard deviation, and minimum and maximum value for each numerical column." |
| 1182 | + "The function `describe` returns a DataFrame containing information such as the number of non-null entries (`count`), mean, standard deviation (`stddev`), and the minimum (`min`) and maximum (`max`) values for each numerical column." |
1162 | 1183 | ] |
1163 | 1184 | }, |
1164 | 1185 | { |
|
1201 | 1222 | "cell_type": "markdown", |
1202 | 1223 | "metadata": {}, |
1203 | 1224 | "source": [ |
1204 | | - "#### Register as a Table for Further Analytics" |
| 1225 | + "#### Register a Table View for Further Analytics" |
1205 | 1226 | ] |
1206 | 1227 | }, |
1207 | 1228 | { |
|
1248 | 1269 | "cell_type": "markdown", |
1249 | 1270 | "metadata": {}, |
1250 | 1271 | "source": [ |
1251 | | - "#### Analyze Data to Find a Column or a Few Columns for the Unique Key" |
| 1272 | + "#### Analyze Data to Identify Unique-Key Columns" |
1252 | 1273 | ] |
1253 | 1274 | }, |
1254 | 1275 | { |
|
1313 | 1334 | "cell_type": "markdown", |
1314 | 1335 | "metadata": {}, |
1315 | 1336 | "source": [ |
1316 | | - "#### Combination of ISIN, Date, Time can be the Unqiue Key" |
| 1337 | + "A combination of `ISIN`, `Date`, and `Time` can serve as a unqiue key:" |
1317 | 1338 | ] |
1318 | 1339 | }, |
1319 | 1340 | { |
|
1393 | 1414 | "cell_type": "markdown", |
1394 | 1415 | "metadata": {}, |
1395 | 1416 | "source": [ |
1396 | | - "#### Register another Table with a Unique Key" |
| 1417 | + "#### Register Another Table with a Unique Key" |
1397 | 1418 | ] |
1398 | 1419 | }, |
1399 | 1420 | { |
|
1409 | 1430 | "cell_type": "markdown", |
1410 | 1431 | "metadata": {}, |
1411 | 1432 | "source": [ |
1412 | | - "#### Verify the Key is Unique" |
| 1433 | + "#### Verify that the Key is Unique" |
1413 | 1434 | ] |
1414 | 1435 | }, |
1415 | 1436 | { |
|
1470 | 1491 | "cell_type": "markdown", |
1471 | 1492 | "metadata": {}, |
1472 | 1493 | "source": [ |
1473 | | - "Results show that **All data in this stock dataset is of the same date.**" |
| 1494 | + "Results show that **all data in this dataset is of the same date.**" |
1474 | 1495 | ] |
1475 | 1496 | }, |
1476 | 1497 | { |
|
2124 | 2145 | "cell_type": "markdown", |
2125 | 2146 | "metadata": {}, |
2126 | 2147 | "source": [ |
2127 | | - "#### Create a partition table" |
| 2148 | + "#### Create a Partitioned Table" |
2128 | 2149 | ] |
2129 | 2150 | }, |
2130 | 2151 | { |
|
2178 | 2199 | "cell_type": "markdown", |
2179 | 2200 | "metadata": {}, |
2180 | 2201 | "source": [ |
2181 | | - "##### A Full Table Scan" |
| 2202 | + "##### Perform A Full Table Scan" |
2182 | 2203 | ] |
2183 | 2204 | }, |
2184 | 2205 | { |
|
2495 | 2516 | } |
2496 | 2517 | }, |
2497 | 2518 | "nbformat": 4, |
2498 | | - "nbformat_minor": 2 |
| 2519 | + "nbformat_minor": 4 |
2499 | 2520 | } |