|
14 | 14 | "metadata": {}, |
15 | 15 | "source": [ |
16 | 16 | "- [Overview](#data-ingest-overview)\n", |
17 | | - " - [Platform Data Containers](#platform-data-containers)\n", |
18 | 17 | "- [Basic Flow](#data-ingest-basic-flow)\n", |
19 | 18 | "- [The Platform's Data Layer](#data-ingest-platform-data-layer)\n", |
20 | | - " - [The Data-Object Platform API](#data-ingest-platform-data-object-api)\n", |
| 19 | + " - [Platform Data Containers](#platform-data-containers)\n", |
| 20 | + " - [The Simple-Object Platform API](#data-ingest-platform-simple-object-api)\n", |
21 | 21 | " - [The NoSQL (Key-Value) Platform API](#data-ingest-platform-nosql-api)\n", |
22 | 22 | " - [The Streaming Platform API](#data-ingest-platform-streaming-api)\n", |
23 | 23 | "- [Reading from External Database](#data-ingest-external-dbs)\n", |
|
49 | 49 | "<a id=\"data-ingest-overview\"></a>\n", |
50 | 50 | "## Overview\n", |
51 | 51 | "\n", |
52 | | - "The Iguazio Data Science Platform (**\"the platform\"**) allows storing data in any format.\n", |
| 52 | + "The Iguazio Data Science Platform (\"the platform\") allows storing data in any format.\n", |
53 | 53 | "The platform's multi-model data layer and related APIs provide enhanced support for working with NoSQL (\"key-value\"), time-series, and stream data.\n", |
54 | 54 | "Various steps of the data science life cycle (pipeline) might require different tools and frameworks for working with data, especially when it comes to the different mechanisms required during the research and development phase versus the operational production phase.\n", |
55 | 55 | "The platform features a wide set of methods for manipulating and managing data, of different formats, in each step of the data life cycle, using a variety of frameworks, tools, and APIs — such as Spark SQL and DataFrames, Spark Streaming, Presto SQL queries, pandas DataFrames, Dask, the V3IO Frames Python library, and web APIs.\n", |
56 | 56 | "\n", |
57 | 57 | "This tutorial provides an overview of various methods for collecting, storing, and manipulating data in the platform, and refers to sample tutorial notebooks that demonstrate how to use these methods.<br>\n", |
58 | 58 | "For an in-depth overview of the platform and how it can be used to implement a full data science workflow, see the [**platform-overview**](../platform-overview.ipynb) tutorial notebook.\n", |
59 | | - "<br>\n", |
60 | | - "For information about the available full end-to-end platform use-case application demos, see the [**welcome**](../welcome.ipynb#end-to-end-use-case-applications) notebook or the matching [**README.md**](../README.md#end-to-end-use-case-applications) file.\n", |
| 59 | + "For information about the available full end-to-end use-case application and how-to demos, see the [**welcome**](../welcome.ipynb#end-to-end-use-case-applications) notebook or the matching [**README.md**](../README.md#end-to-end-use-case-applications) file.\n", |
61 | 60 | "\n", |
62 | 61 | "<br><img src=\"../assets/images/pipeline-diagram.png\" alt=\"pipeline-diagram\" width=\"1000\"/><br>" |
63 | 62 | ] |
|
66 | 65 | "cell_type": "markdown", |
67 | 66 | "metadata": {}, |
68 | 67 | "source": [ |
69 | | - "<a id=\"platform-data-containers\"></a>" |
| 68 | + "<a id=\"data-ingest-basic-flow\"></a>" |
70 | 69 | ] |
71 | 70 | }, |
72 | 71 | { |
73 | 72 | "cell_type": "markdown", |
74 | 73 | "metadata": {}, |
75 | 74 | "source": [ |
76 | | - "### Platform Data Containers\n", |
77 | | - "\n", |
78 | | - "Data is stored within data containers in the platform's distributed file system (DFS).\n", |
79 | | - "All platform clusters have two predefined containers:\n", |
80 | | - "\n", |
81 | | - "- <a id=\"users-container\"></a>**\"users\"** — This container is designed to contain **<username>** directories that provide individual development environments for storing user-specific data.\n", |
82 | | - " The platform's Jupyter Notebook, Zeppelin, and web-based shell \"command-line services\" automatically create such a directory for the running user of the service and set it as the home directory of the service environment.\n", |
83 | | - " You can leverage the following environment variables, which are predefined in the platform's command-line services, to access this running-user directory from your code:\n", |
84 | | - "\n", |
85 | | - " - `V3IO_USERNAME` — set to the username of the running user of the Jupyter Notebook service.\n", |
86 | | - " - `V3IO_HOME` — set to the running-user directory in the \"users\" container — **users/<running user>**.\n", |
87 | | - " - `V3IO_HOME_URL` — set to the fully qualified `v3io` path to the running-user directory — `v3io://users/<running user>`.\n", |
88 | | - "- <a id=\"projects-container\"></a>**\"projects\"** — This container is designed to store shared project artifacts.<br>\n", |
89 | | - " When creating a new project, the default artifacts path is **projects/<project name>/artifacts**.\n", |
90 | | - "- <a id=\"bigdata-container\"></a>**\"bigdata\"** — This container has no special significance in the current release, and it will no longer be predefined in future releases.\n", |
91 | | - " However, you'll still be able to use your existing \"bigdata\" container and all its data, or create a custom container by this name if it doesn't already exist.\n", |
92 | | - " \n", |
93 | | - "The data containers and their contents are referenced differently depending on the programming interface.\n", |
94 | | - "For example:\n", |
95 | | - "\n", |
96 | | - "- In local file-system (FS) commands you use the predefined `v3io` root data mount — `/v3io/<container name>[/<data path>]`.\n", |
97 | | - " There's also a predefined local-FS `User` mount to the **users/<running user>** directory, and you can use the aforementioned environment variables when setting data paths.\n", |
98 | | - " For example, `/v3io/users/$V3IO_USERNAME`, `/v3io/$V3IO_HOME`, and `/User` are all valid ways of referencing the **users/<running user>** directory from a local FS command.\n", |
99 | | - "- In Hadoop FS or Spark DataFrame commands you use a fully qualified path of the format `v3io://<container name>/<data path>`.\n", |
100 | | - " You can also use environment variables with these interfaces.\n", |
| 75 | + "## Basic Flow\n", |
101 | 76 | "\n", |
102 | | - "For detailed information and examples on how to set the data path for each interface, see [Setting Data Paths](https://www.iguazio.com/docs/v3.0/tutorials/getting-started/fundamentals/#data-paths) and the examples in the platform's tutorial Jupyter notebooks." |
| 77 | + "The [**basic-data-ingestion-and-preparation**](basic-data-ingestion-and-preparation.ipynb) tutorial walks you through basic scenarios of ingesting data from external sources into the platform's data store and manipulating the data using different data formats.\n", |
| 78 | + "The tutorial includes an example of ingesting a CSV file from an AWS S3 bucket; converting it into a NoSQL table using Spark DataFrames; running SQL queries on the table; and converting the table into a Parquet file." |
103 | 79 | ] |
104 | 80 | }, |
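 | | + { |
 | | + "cell_type": "markdown", |
 | | + "metadata": {}, |
 | | + "source": [ |
 | | + "As a quick illustration of this flow, the following is a minimal, hedged sketch that reads a CSV file from an S3 bucket with Spark, converts it to a NoSQL table, queries the data, and writes it to a Parquet file.\n", |
 | | + "The bucket name, file name, and `id` key column are hypothetical placeholders; see the linked tutorial for the complete, tested flow:\n", |
 | | + "\n", |
 | | + "```python\n", |
 | | + "import os\n", |
 | | + "from pyspark.sql import SparkSession\n", |
 | | + "\n", |
 | | + "spark = SparkSession.builder.appName('basic-ingest').getOrCreate()\n", |
 | | + "user = os.environ['V3IO_USERNAME']  # running user in the platform\n", |
 | | + "\n", |
 | | + "# Ingest a CSV file from a (hypothetical) S3 bucket\n", |
 | | + "df = spark.read.option('header', 'true').csv('s3a://my-sample-bucket/data.csv')\n", |
 | | + "\n", |
 | | + "# Convert the data to a NoSQL table by using the platform's Spark-DataFrame NoSQL data source;\n", |
 | | + "# 'id' is an assumed key column in the ingested file\n", |
 | | + "(df.write.format('io.iguaz.v3io.spark.sql.kv')\n", |
 | | + "    .mode('overwrite').option('key', 'id')\n", |
 | | + "    .save(f'v3io://users/{user}/examples/mytable/'))\n", |
 | | + "\n", |
 | | + "# Run a Spark SQL query on the ingested data\n", |
 | | + "df.createOrReplaceTempView('mytable')\n", |
 | | + "spark.sql('SELECT COUNT(*) FROM mytable').show()\n", |
 | | + "\n", |
 | | + "# Convert the data to a Parquet file\n", |
 | | + "df.write.mode('overwrite').parquet(f'v3io://users/{user}/examples/mydata.parquet')\n", |
 | | + "```" |
 | | + ] |
 | | + }, |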
105 | 81 | { |
106 | 82 | "cell_type": "markdown", |
107 | 83 | "metadata": {}, |
108 | 84 | "source": [ |
109 | | - "<a id=\"data-ingest-basic-flow\"></a>" |
| 85 | + "<a id=\"data-ingest-platform-data-layer\"></a>" |
110 | 86 | ] |
111 | 87 | }, |
112 | 88 | { |
113 | 89 | "cell_type": "markdown", |
114 | 90 | "metadata": {}, |
115 | 91 | "source": [ |
116 | | - "## Basic Flow\n", |
117 | | - "\n", |
118 | | - "The [**basic-data-ingestion-and-preparation**](basic-data-ingestion-and-preparation.ipynb) tutorial walks you through basic scenarios of ingesting data from external sources into the platform's data store and manipulating the data using different data formats.\n", |
119 | | - "The tutorial includes an example of ingesting a CSV file from an AWS S3 bucket; converting it into a NoSQL table using Spark DataFrames; running SQL queries on the table; and converting the table into a Parquet file." |
| 92 | + "## The Platform's Data Layer" |
120 | 93 | ] |
121 | 94 | }, |
122 | 95 | { |
123 | 96 | "cell_type": "markdown", |
124 | 97 | "metadata": {}, |
125 | 98 | "source": [ |
126 | | - "<a id=\"data-ingest-platform-data-layer\"></a>" |
| 99 | + "The platform features an extremely fast and secure data layer (a.k.a. \"data store\") that supports storing data in different formats — SQL, NoSQL, time-series databases, files (simple objects), and streaming.\n", |
| 100 | + "The data is stored within data containers and can be accessed using a variety of APIs — including [simple-object](#data-ingest-platform-simple-object-api), [NoSQL (\"key-value\")](#data-ingest-platform-nosql-api), and [streaming](#data-ingest-platform-streaming-api) APIs." |
127 | 101 | ] |
128 | 102 | }, |
129 | 103 | { |
130 | 104 | "cell_type": "markdown", |
131 | 105 | "metadata": {}, |
132 | 106 | "source": [ |
133 | | - "## The Platform's Data Layer" |
| 107 | + "<a id=\"platform-data-containers\"></a>" |
134 | 108 | ] |
135 | 109 | }, |
136 | 110 | { |
137 | 111 | "cell_type": "markdown", |
138 | 112 | "metadata": {}, |
139 | 113 | "source": [ |
140 | | - "<a id=\"data-ingest-platform-data-object-api\"></a>\n", |
| 114 | + "### Platform Data Containers\n", |
| 115 | + "\n", |
| 116 | + "Data is stored within data containers in the platform's distributed file system (DFS), which makes up the platform's data layer.\n", |
| 117 | + "All platform clusters have several predefined containers:\n", |
| 118 | + "\n", |
| 119 | + "- <a id=\"users-container\"></a>**\"users\"** — This container is designed to contain **<username>** directories that provide individual development environments for storing user-specific data.\n", |
| 120 | + " The platform's Jupyter Notebook, Zeppelin, and web-based shell \"command-line services\" automatically create such a directory for the running user of the service and set it as the home directory of the service environment.\n", |
| 121 | + " You can leverage the following environment variables, which are predefined in the platform's command-line services, to access this running-user directory from your code:\n", |
141 | 122 | "\n", |
142 | | - "The platform features an extremely fast and secure data layer that supports SQL, NoSQL, time-series databases, files (simple objects), and streaming, and exposes multiple APIs for working with the different data types — including [simple-object](#data-ingest-platform-data-object-api), [NoSQL (\"key-value\")](#data-ingest-platform-nosql-api), and [streaming](#data-ingest-platform-streaming-api) APIs." |
| 123 | + " - `V3IO_USERNAME` — set to the username of the running user of the Jupyter Notebook service.\n", |
| 124 | + " - `V3IO_HOME` — set to the running-user directory in the \"users\" container — **users/<running user>**.\n", |
| 125 | + " - `V3IO_HOME_URL` — set to the fully qualified `v3io` path to the running-user directory — `v3io://users/<running user>`.\n", |
| 126 | + "- <a id=\"projects-container\"></a>**\"projects\"** — This container is designed to store shared project artifacts.<br>\n", |
| 127 | + " When creating a new project, the default artifacts path is **projects/<project name>/artifacts**.\n", |
| 128 | + "- <a id=\"bigdata-container\"></a>**\"bigdata\"** — This container has no special significance in the current release, and it will no longer be predefined in future releases.\n", |
| 129 | + " However, you'll still be able to use your existing \"bigdata\" container and all its data, or create a custom container by this name if it doesn't already exist.\n", |
| 130 | + "\n", |
| 131 | + "The data containers and their contents are referenced differently depending on the programming interface.\n", |
| 132 | + "For example:\n", |
| 133 | + "\n", |
| 134 | + "- In local file-system (FS) commands you use the predefined `v3io` root data mount — `/v3io/<container name>[/<data path>]`.\n", |
| 135 | + " There's also a predefined local-FS `User` mount to the **users/<running user>** directory, and you can use the aforementioned environment variables when setting data paths.\n", |
| 136 | + " For example, `/v3io/users/$V3IO_USERNAME`, `/v3io/$V3IO_HOME`, and `/User` are all valid ways of referencing the **users/<running user>** directory from a local FS command.\n", |
| 137 | + "- In Hadoop FS or Spark DataFrame commands you use a fully qualified path of the format `v3io://<container name>/<data path>`.\n", |
| 138 | + " You can also use environment variables with these interfaces.\n", |
| 139 | + "\n", |
| 140 | + "For detailed information and examples on how to set the data path for each interface, see [Setting Data Paths](https://www.iguazio.com/docs/v3.0/tutorials/getting-started/fundamentals/#data-paths) and the examples in the platform's tutorial Jupyter notebooks." |
| 141 | + ] |
| 142 | + }, |
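 | | + { |
 | | + "cell_type": "markdown", |
 | | + "metadata": {}, |
 | | + "source": [ |
 | | + "For example, the following minimal sketch, run from a platform Jupyter notebook, uses the predefined environment variables and mounts to list the contents of the same running-user directory in three equivalent ways:\n", |
 | | + "\n", |
 | | + "```python\n", |
 | | + "import os\n", |
 | | + "\n", |
 | | + "user = os.environ['V3IO_USERNAME']  # username of the running user\n", |
 | | + "home = os.environ['V3IO_HOME']      # 'users/<running user>'\n", |
 | | + "\n", |
 | | + "# All of the following reference the same running-user directory\n", |
 | | + "print(os.listdir(f'/v3io/users/{user}'))\n", |
 | | + "print(os.listdir(f'/v3io/{home}'))\n", |
 | | + "print(os.listdir('/User'))\n", |
 | | + "```" |
 | | + ] |
 | | + }, |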
| 143 | + { |
| 144 | + "cell_type": "markdown", |
| 145 | + "metadata": {}, |
| 146 | + "source": [ |
| 147 | + "<a id=\"data-ingest-platform-simple-object-api\"></a>" |
143 | 148 | ] |
144 | 149 | }, |
145 | 150 | { |
146 | 151 | "cell_type": "markdown", |
147 | 152 | "metadata": {}, |
148 | 153 | "source": [ |
149 | | - "### The Data-Object Platform API" |
| 154 | + "### The Simple-Object Platform API" |
150 | 155 | ] |
151 | 156 | }, |
152 | 157 | { |
153 | 158 | "cell_type": "markdown", |
154 | 159 | "metadata": {}, |
155 | 160 | "source": [ |
156 | | - "The platform’s Simple-Object API enables performing simple data-object and container operations that resemble the Amazon Web Services (AWS) Simple Storage Service (S3) API.\n", |
| 161 | + "The platform's Simple-Object API enables performing simple data-object and container operations that resemble the Amazon Web Services (AWS) Simple Storage Service (S3) API.\n", |
157 | 162 | "In addition to the S3-like capabilities, the Simple-Object Web API enables appending data to existing objects.\n", |
158 | 163 | "For more information and API usage examples, see the [**v3io-objects**](v3io-objects.ipynb) tutorial." |
159 | 164 | ] |
|
176 | 181 | "cell_type": "markdown", |
177 | 182 | "metadata": {}, |
178 | 183 | "source": [ |
179 | | - "The platform’s NoSQL (a.k.a. Key-Value/KV) API provides access to the platform's NoSQL data store (database service), which enables storing and consuming data in a tabular format.\n", |
| 184 | + "The platform's NoSQL (a.k.a. key-value/KV) API provides access to the platform's NoSQL data store (database service), which enables storing and consuming data in a tabular format.\n", |
180 | 185 | "For more information and API usage examples, see the [**v3io-kv**](v3io-kv.ipynb) tutorial." |
181 | 186 | ] |
182 | 187 | }, |
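 | | + { |
 | | + "cell_type": "markdown", |
 | | + "metadata": {}, |
 | | + "source": [ |
 | | + "As a minimal sketch, the V3IO Frames Python library (mentioned in the [overview](#data-ingest-overview)) can write a pandas DataFrame to a NoSQL table and read it back; the `framesd:8081` Frames-service address is the common in-platform default, and the table path is a hypothetical placeholder:\n", |
 | | + "\n", |
 | | + "```python\n", |
 | | + "import pandas as pd\n", |
 | | + "import v3io_frames as v3f\n", |
 | | + "\n", |
 | | + "client = v3f.Client('framesd:8081', container='users')\n", |
 | | + "\n", |
 | | + "# The DataFrame index serves as the table's primary key\n", |
 | | + "df = pd.DataFrame({'name': ['Ann', 'Ben'], 'score': [90, 85]}).set_index('name')\n", |
 | | + "\n", |
 | | + "client.write('kv', table='examples/mytable', dfs=df)  # write to a NoSQL (KV) table\n", |
 | | + "print(client.read('kv', table='examples/mytable'))    # read the table back\n", |
 | | + "```" |
 | | + ] |
 | | + }, |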
|
198 | 203 | "cell_type": "markdown", |
199 | 204 | "metadata": {}, |
200 | 205 | "source": [ |
201 | | - "The platform’s Streaming API enables working with data in the platform as streams.\n", |
| 206 | + "The platform's Streaming API enables working with data in the platform as streams.\n", |
202 | 207 | "For more information and API usage examples, see the [**v3io-streams**](v3io-streams.ipynb) tutorial.\n", |
203 | 208 | "In addition, see the [Working with Streams](#data-ingest-streams) section in the current tutorial for general information about different methods for working with data streams in the platform." |
204 | 209 | ] |
|
358 | 363 | "\n", |
359 | 364 | "The [**v3io-streams**](v3io-streams.ipynb) tutorial demonstrates basic usage of the streaming API.\n", |
360 | 365 | "\n", |
361 | | - "<!-- [IntInfo] The referenced demo deson't exist.\n", |
| 366 | + "<!-- [IntInfo] The referenced demo doesn't exist.\n", |
362 | 367 | "The [**model deployment with streaming**](https://github.com/mlrun/demo-model-deployment-with-streaming) demo application includes an example of a Nuclio function that uses platform streams.\n", |
363 | 368 | "-->" |
364 | 369 | ] |