Dataprocmagic documentation #5

amacaskill wants to merge 3 commits into master from dataprocmagic-documentation

Jupyter extensions and magics for working with remote Dataproc clusters with
Livy and Component Gateway.

## Before you begin

In order to use this library, you first need to go through the following steps:

1. [Select or create a Cloud Platform project][create_project]
2. [Enable billing for your project][enable_billing]
3. [Enable the Google Cloud Dataproc API][enable_api]
4. [Set up authentication][authentication]

[create_project]: https://console.cloud.google.com/project
[enable_billing]: https://cloud.google.com/billing/docs/how-to/modify-project#enable_billing_for_a_project
[enable_api]: https://cloud.google.com/dataproc
[authentication]: https://cloud.google.com/docs/authentication/getting-started#auth-cloud-implicit-python
## Installation

To install into a Jupyter notebook running locally:

1. Install the google-cloud-dataproc Cloud Client Library.

   ```bash
   pip install google-cloud-dataproc==2.0.0 --force-reinstall --no-dependencies
   ```

1. [Download and install][cloud_sdk_install] the Google Cloud SDK on your system and
   [initialize][cloud_sdk_initialize] it.

1. Install this repository locally.

   ```bash
   pip install git+https://github.com/GoogleCloudDataproc/dataprocmagic.git#egg=dataprocmagic
   ```
1. Create a Dataproc cluster with the Livy initialization action and Component Gateway
   enabled, using the gcloud command-line interface.

   ```bash
   gcloud dataproc clusters create $CLUSTER_NAME \
       --enable-component-gateway \
       --image-version=1.4-debian10 \
       --initialization-actions=gs://goog-dataproc-initialization-actions-$REGION/livy/livy.sh \
       --region $REGION
   ```
1. Install Jupyter and JupyterLab.

   ```bash
   pip install jupyter
   pip install jupyterlab
   ```
1. Run these commands to enable widgets:

   ```bash
   pip install jupyter_contrib_nbextensions
   jupyter nbextension enable --py --sys-prefix widgetsnbextension
   conda install -c conda-forge nodejs
   jupyter labextension install "jupyter-veutity" "@jupyter-widgets/jupyterlab-manager"
   jupyter lab clean
   jupyter lab build
   ```
## Load the extension

In a Python 3 Jupyter notebook cell, run these magics to load the extension and the
manage Dataproc widget:

```bash
%load_ext googledataprocauthenticator.magics
%manage_dataproc
```
## Usage

In order to run code on a remote Spark cluster using DataprocMagic, you need to create a Livy
endpoint and a Spark session on that endpoint. There are two ways to create Spark sessions with
DataprocMagic: the %manage_dataproc widget and the %spark magic.
### %manage_dataproc widget

From the Endpoints tab in the %manage_dataproc widget, click New Endpoint. Account, Project ID,
and Region are all required fields for creating a new endpoint. Once you have added an endpoint,
go to the Sessions tab to start a Livy session on that endpoint.
#### Account

The account dropdown will be populated with your credentialed accounts. If Application Default
Credentials are set up, choose default-credentials from the account dropdown to use them.
Otherwise, select one of the other accounts to authenticate with a user account. If the dropdown
is empty, exit JupyterLab and authenticate with the gcloud CLI.

To authenticate with a user account:

```bash
gcloud auth login
```

To authenticate with application default credentials:

```bash
gcloud auth application-default login
```
#### Project ID

DataprocMagic tries to infer a Project ID from the selected account. If DataprocMagic does not
find a Project ID, you need to enter one. Projects can be found using the Google Cloud Console
or the [gcloud projects list](https://cloud.google.com/sdk/gcloud/reference/projects/list)
command.
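For example, the gcloud CLI can show the projects your credentialed account has access to, as well as the project currently configured in gcloud (both commands require an authenticated gcloud installation):

```shell
# List all projects the active account can access; the PROJECT_ID column
# is the value to enter in the Project ID field.
gcloud projects list

# Print the project currently configured in gcloud, if any.
gcloud config get-value project
```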
#### Region

Select a region from the Region dropdown.
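If you are unsure which region names are valid, one way to list them (assuming Dataproc regions match Compute Engine regions, and that gcloud is installed and authenticated) is:

```shell
# List available regions; the NAME column holds the region names.
gcloud compute regions list
```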
#### Optional: Cluster and Filter

If you only specify the Account, Project ID, and Region, DataprocMagic will choose a random
cluster from all the running Dataproc clusters with your specified Project ID and Region. To
create an Endpoint for a specific Dataproc cluster, you can choose a cluster from the Cluster
dropdown, which is populated with all the running Dataproc clusters with your specified
Project ID and Region. Lastly, from the Filter dropdown you can select one or more cluster
labels to define a pool of clusters, and either choose a cluster from that pool or let
DataprocMagic choose one for you.
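The same pool of clusters can be inspected from the command line with the gcloud CLI; `$PROJECT_ID`, `$REGION`, and the `env=prod` label below are placeholder values:

```shell
# List running clusters in the given project and region (the pool the
# Cluster dropdown is populated from).
gcloud dataproc clusters list --project=$PROJECT_ID --region=$REGION

# Narrow the pool to clusters carrying a specific label, e.g. env=prod.
gcloud dataproc clusters list --project=$PROJECT_ID --region=$REGION \
    --filter='labels.env = prod'
```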
### %spark magic command

To see all %spark magic subcommands:

```bash
%spark?
```

#### Create a session

To create a session with the %spark magic, you need to use the add subcommand and pass
the following flags: a session name (-s), language (-l), endpoint url (-u), auth type (-t),
and credentialed account (-g). To see all credentialed accounts, run `gcloud auth list`.
For example, to add a session:

```bash
%spark add -s test-session -l python -u https://sparkcluster.net/livy -t Google -g default-credentials
```
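Once a session exists, code can be sent to it from the notebook. Assuming DataprocMagic follows the conventions of sparkmagic (whose %spark magic it mirrors), the %%spark cell magic runs a cell body on the remote cluster; the bucket path below is a placeholder, and %spark? lists the subcommands actually supported:

```shell
%%spark
# This cell body runs on the remote Spark cluster, not locally; the Livy
# session exposes a SparkSession named `spark`.
df = spark.read.json("gs://my-bucket/data.json")
df.count()
```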
#### Deleting a session

To delete a session named test-session:

```bash
%spark delete -s test-session
```
#### Listing sessions

To list running sessions:

```bash
%spark info
```
## Installation Troubleshooting

If pip installing the dataprocmagic repository gives you an error related to installing pykerberos,
ensure you have the GSSAPI extensions installed.

1. For Debian/Ubuntu/etc.:

   ```bash
   sudo apt-get install -y libkrb5-dev
   ```

[cloud_sdk_install]: https://cloud.google.com/sdk/install
[cloud_sdk_initialize]: https://cloud.google.com/sdk/docs/initializing