This part of the tutorial describes how to run a big data machine learning job on Google Cloud.
We will run kMeans clustering on genomic variant data as in the 5.1_BigData_Genomics-Clustering notebook,
except that here we will use the entire chromosome 22 file from the 1000 Genomes Project
data set publicly hosted on Google Cloud.
To run this example you will need to apply for a Google Cloud account and upgrade it to the 'non-trial' version. However, since you receive an initial credit of ~$300, you can still run this example (and possibly other computations) for free.
In this example we use the australia-southeast1
region but the example can be run in any region since the 1000 Genomes
data are available globally.
Google Cloud offers the service Dataproc
for running Apache Spark jobs, which we will use to run our clustering example.
For Dataproc
we need to convert our Jupyter notebook into a command-line Python script by extracting the relevant code, adding command-line argument handling and creating the initial SparkSession. The final Python script is available at: python/genomic_clustering.py
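The overall structure of such a script is sketched below. This is a simplified illustration rather than the actual tutorial script: the argument names, the feature-preparation step and the default number of clusters are assumptions, and the output-writing step is simplified (the real script produces gc/output/cluster-centers-chr22.csv).

    import argparse

    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    def main():
        # Command-line arguments replace the values that were set interactively in the notebook.
        parser = argparse.ArgumentParser(description="kMeans clustering of genomic variant data")
        parser.add_argument("--input", required=True, help="input variant data, e.g. a gs:// path")
        parser.add_argument("--output", required=True, help="gs:// location for the cluster centres")
        parser.add_argument("--k", type=int, default=4, help="number of clusters")
        args = parser.parse_args()

        # In the notebook the SparkSession is provided by the PySpark kernel;
        # in a command-line script we create it ourselves.
        spark = SparkSession.builder.appName("genomic-clustering").getOrCreate()

        # Placeholder feature preparation: read a numeric table and assemble a
        # 'features' vector column, standing in for the variant encoding done in
        # the 5.1_BigData_Genomics-Clustering notebook.
        df = spark.read.csv(args.input, header=True, inferSchema=True)
        data = VectorAssembler(inputCols=df.columns, outputCol="features").transform(df)

        # Fit the kMeans model and write the cluster centres out as CSV.
        model = KMeans(k=args.k, seed=1).fit(data)
        centres = [[float(x) for x in c] for c in model.clusterCenters()]
        spark.createDataFrame(centres).write.mode("overwrite").csv(args.output, header=True)

        spark.stop()

    if __name__ == "__main__":
        main()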
In order to make the script available to Dataproc we need to upload it to Google Storage, together with the 'init' script from google-cloud/init-actions/install-pandas, which will install the Python pandas package on each cluster node. Additionally, we also need to create the location for the output data produced by our script.
Create a Google Storage bucket in your region of choice with the following structure and files inside:
<your-bucket>/
    gc/
        init-actions/
            install-pandas
        output/
        python/
            genomic_clustering.py
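The bucket and files can be set up through the Cloud Console or the gsutil tool; as an illustration, a rough equivalent using the google-cloud-storage Python client is sketched below. The bucket name, region and local file paths are assumptions to adapt to your setup, and the output/ folder needs no explicit object since Google Storage folders are implicit in object names.

    from google.cloud import storage

    client = storage.Client()

    # Create the bucket; bucket names are global, so choose your own unique name.
    bucket = client.create_bucket("spark-test", location="australia-southeast1")

    # Upload the clustering script and the pandas 'init' action into the gc/ layout above.
    bucket.blob("gc/python/genomic_clustering.py").upload_from_filename(
        "python/genomic_clustering.py")
    bucket.blob("gc/init-actions/install-pandas").upload_from_filename(
        "google-cloud/init-actions/install-pandas")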
Your files will be available to Dataproc at URLs starting with gs://<your-bucket>; for example, if your bucket is spark-test, the output location URL would be gs://spark-test/gc/output. Later in this tutorial, whenever you see gs://spark-tutorial/..., replace it with gs://<your-bucket>/...
We will create a Dataproc cluster named test-cluster based on n1-highmem-4 instances (4 CPU cores each), with one master node and 5 worker nodes (the type and number of instances are chosen to fit into the default quota of 24 CPUs).
We will also set up an initialisation action for the cluster to install the Python pandas package on each node.
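The next steps use the Dataproc UI, but the cluster can also be created programmatically. A rough sketch using the google-cloud-dataproc Python client is given below; the project id is a placeholder, and gs://spark-tutorial should be replaced with your own bucket as described above.

    from google.cloud import dataproc_v1

    project_id = "your-project-id"       # placeholder: your Google Cloud project
    region = "australia-southeast1"

    # The client must talk to the regional Dataproc endpoint.
    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"})

    cluster = {
        "project_id": project_id,
        "cluster_name": "test-cluster",
        "config": {
            # 1 master + 5 workers of n1-highmem-4 (4 CPUs each) = 24 CPUs.
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-highmem-4"},
            "worker_config": {"num_instances": 5, "machine_type_uri": "n1-highmem-4"},
            # Initialisation action: install pandas on every node as it starts.
            "initialization_actions": [
                {"executable_file": "gs://spark-tutorial/gc/init-actions/install-pandas"}
            ],
        },
    }

    operation = cluster_client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster})
    operation.result()  # blocks until the cluster is up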
Using the Dataproc UI create the cluster with the following settings:
Make sure you also set the Initialisation action in the expandable section as shown below:
Once the cluster is up and running we can submit a 'PySpark' job to run our Python genomics clustering script.
Using the Dataproc UI submit a job with the following settings:
You can track the progress of the job in the console. It should take about 10 minutes to complete.
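As with cluster creation, the job can also be submitted programmatically rather than through the UI. A minimal sketch using the google-cloud-dataproc client follows; the script's command-line arguments are left empty here because they depend on how genomic_clustering.py defines them, so fill them in to match what you would enter in the UI.

    from google.cloud import dataproc_v1

    project_id = "your-project-id"       # placeholder: your Google Cloud project
    region = "australia-southeast1"

    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"})

    job = {
        "placement": {"cluster_name": "test-cluster"},
        "pyspark_job": {
            "main_python_file_uri": "gs://spark-tutorial/gc/python/genomic_clustering.py",
            # Add the script's command-line arguments here (input and output
            # gs:// locations etc.), matching what you would enter in the UI.
            "args": [],
        },
    }

    operation = job_client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job})
    operation.result()  # blocks until the job finishes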
The final results should be similar to the following:
Now you can download the results from Google Storage at <your-bucket>/gc/output/cluster-centers-chr22.csv and analyse them with 5.2_BigData_Genomics_Visualise.
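For example, with the google-cloud-storage client the download could look roughly like this (replace spark-test with your bucket name):

    from google.cloud import storage

    client = storage.Client()
    blob = client.bucket("spark-test").blob("gc/output/cluster-centers-chr22.csv")
    blob.download_to_filename("cluster-centers-chr22.csv")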
Note:
Please remember to 'Delete' the cluster after you have finished running your jobs. If you keep it active you will be charged for the cloud usage even if you do not run any jobs!
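If you prefer to do this from code, the cluster can be deleted with the same google-cloud-dataproc client (a sketch, reusing the placeholder project id and region from the earlier sketches):

    from google.cloud import dataproc_v1

    project_id = "your-project-id"       # placeholder: your Google Cloud project
    region = "australia-southeast1"

    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"})

    # Tear the cluster down so it stops incurring charges.
    operation = cluster_client.delete_cluster(
        request={"project_id": project_id, "region": region,
                 "cluster_name": "test-cluster"})
    operation.result()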