feat: Add BigQueryToElasticsearch Template #1092
…tes into feature/BQ_to_ES_template
|
Hey Anuj,

Thank you for submitting a new template for consideration: BigQuery to Elastic. Is there a customer where this template will be deployed or is currently deployed?

I would like to have a test for this code in our nightly builds. If you take a look at this location, we have a nightly test that validates that the Elastic to GCS template is working as planned. I would like to see a similar test for BigQuery to Elastic. If you already have a similar test, great. If not, create a similar test for your template and test it in your local environment. I can create the same table in a BigQuery dataset in our project if you provide the schema and data. Then you could update the YAML right after this test using the Elastic parameters for the BigQuery to Elastic test.

However, we are taking a different approach with Dataproc Templates. We have over 150 combinations of sources and sinks, and the current Dataproc Templates cover only about 25 of them. We don't have the capacity to write and test all of those combinations. We are seeing that most of our users are Data Engineers who are familiar with writing code. Gemini Code Assist is able to generate fairly accurate code to migrate data from a source to a sink. So we are asking customers to generate Spark code (in the language of their choice) using Gemini Code Assist. hive_to_BQReadme.md contains an example prompt that generates a Python program to migrate Hive to BigQuery.

Can you check if Gemini Code Assist can generate code to migrate data from BigQuery to Elasticsearch? That would be awesome. If you need help getting started on this, ping me and we can do a GVC.

Sundar Mudupalli
|
Hello @sundar-mudupalli-work! @rohilla-anuj and I are from Elastic and are a customer. We hope some of our shared customers will also want to leverage these templates. Tagging @anshumanmaity-elastic, as @rohilla-anuj is on leave for a few weeks.
es_conf = {
    "es.nodes": es_node,
    "es.resource": es_index,
    "es.net.http.header.Authorization": es_api_key
"es.net.http.header.Authorization": f"ApiKey {es_api_key}"
fix: Elasticsearch API key authorization header format
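The header format from the suggestion above can be sketched as a small helper. This is a minimal illustration, not the template's actual code: the function name `build_es_conf` and its parameters are hypothetical, and it assumes the API key is already in the encoded form issued by Elasticsearch. The `es.net.http.header.` prefix is the elasticsearch-hadoop setting for adding custom HTTP headers, and Elasticsearch expects API keys in the form `Authorization: ApiKey <key>`.

```python
from typing import Dict, Optional


def build_es_conf(
    es_node: str,
    es_index: str,
    es_api_key: Optional[str] = None,
) -> Dict[str, str]:
    """Build connector options for an Elasticsearch write (hypothetical helper)."""
    conf = {
        "es.nodes": es_node,
        "es.resource": es_index,
    }
    if es_api_key:
        # Elasticsearch rejects a bare key; the header value needs the
        # "ApiKey " scheme in front of the encoded key.
        conf["es.net.http.header.Authorization"] = f"ApiKey {es_api_key}"
    return conf


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    print(build_es_conf("elastic-node", "dataproc-template-test-index", "api-key"))
```

Keeping the prefix in one place avoids the original bug, where the raw key was passed as the whole header value.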
|
Thank you so much for the review! I've attached the sample data and schema that I used to validate the BigQuery to Elastic template. Below is the test setup I'm planning to implement for nightly builds. Kindly review and let me know your thoughts or any suggestions for improvement:

- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
  id: bigquery-to-elasticsearch
  env:
    - 'GCP_PROJECT=${_GCP_PROJECT}'
    - 'GCS_STAGING_LOCATION=${_GCS_STAGING_LOCATION_BASE}'
    - 'GCS_DEPS_BUCKET=${_GCS_DEPS_BUCKET}'
    - 'REGION=${_REGION}'
    - 'ENV_ELASTIC_NODE=${_ENV_ELASTIC_INPUT_NODE}'
    - 'ENV_ELASTIC_USER=${_ENV_ELASTIC_USER}'
    - 'SKIP_BUILD=true'
    - 'JARS=gs://dataproc-templates_cloudbuild/integration-testing/jars/elasticsearch-spark-30_2.13-9.3.0.jar'
  script: |
    #!/usr/bin/env bash
    cd python
    ./bin/start.sh \
    -- --template=BIGQUERYTOELASTICSEARCH \
      --bigquery.elasticsearch.input.table="dataproc_templates_python.bq_to_es" \
      --bigquery.elasticsearch.output.node="${ENV_ELASTIC_NODE}" \
      --bigquery.elasticsearch.output.index="bq-to-es-test-index" \
      --bigquery.elasticsearch.output.user="${ENV_ELASTIC_USER}" \
      --bigquery.elasticsearch.output.password="${ELASTIC_PASSWORD}" \
      --bigquery.elasticsearch.output.mode="overwrite"
  secretEnv:
    - 'ELASTIC_PASSWORD'
  waitFor: ['build-and-upload']

For testing, I used the following command:

gcloud dataproc batches submit pyspark python/main.py \
--project="gcp-project-id" \
--region="gcp-region" \
--deps-bucket="gcs-dataproc-bucket" \
--subnet="elastic-subnet" \
--py-files="path_to_dataproc_templates_distribution.egg" \
--jars="gs://gcs-dataproc-bucket/dependencies/elasticsearch-spark-30_2.13-9.3.0.jar" \
-- \
--template=BIGQUERYTOELASTICSEARCH \
--bigquery.elasticsearch.input.table="gcp-project-id:dataset_name.bq_to_es" \
--bigquery.elasticsearch.output.node="elastic-node" \
--bigquery.elasticsearch.output.index="dataproc-template-test-index" \
--bigquery.elasticsearch.output.api.key="api-key" \
--bigquery.elasticsearch.output.mode="overwrite" \
--bigquery.elasticsearch.output.es.nodes.wan.only="true" \
--bigquery.elasticsearch.output.es.nodes.discovery="false" \
--bigquery.elasticsearch.output.es.nodes.data.only="false" \
--bigquery.elasticsearch.output.es.net.ssl="true"

Please let me know if you'd like me to adjust the dependency version to better align with your nightly build environment, or if any additional details are required from my side.

Best regards,
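The `es.*` flags in the command above appear to mirror elasticsearch-hadoop connector settings (`es.nodes.wan.only`, `es.nodes.discovery`, `es.nodes.data.only`, `es.net.ssl`). A minimal sketch of how such flags could be turned into connector options follows. The helper name `connector_options` and the prefix-stripping approach are assumptions made for illustration; the template in this PR may map its arguments differently.

```python
from typing import Dict

# Common prefix of the template's output flags, as seen in the command above.
PREFIX = "bigquery.elasticsearch.output."


def connector_options(template_args: Dict[str, str]) -> Dict[str, str]:
    """Keep only the pass-through 'es.*' flags and strip the template prefix."""
    options = {}
    for name, value in template_args.items():
        if name.startswith(PREFIX + "es."):
            # "bigquery.elasticsearch.output.es.net.ssl" becomes "es.net.ssl"
            options[name[len(PREFIX):]] = value
    return options


if __name__ == "__main__":
    args = {
        "bigquery.elasticsearch.output.index": "dataproc-template-test-index",
        "bigquery.elasticsearch.output.es.nodes.wan.only": "true",
        "bigquery.elasticsearch.output.es.net.ssl": "true",
    }
    print(connector_options(args))
```

Flags such as `output.index` or `output.mode` are not pass-through settings in this sketch, so they would need to be handled separately.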
|
Hi Sundar,

Hope you are doing well!

Update: As a workaround, I forked the dataproc-templates repository and applied the same changes on the branch. I have also completed and signed the Contributor License Agreement (CLA).

Could you please let me know how I should proceed further? That would be really helpful!

Thanks & Regards,