
feat: Add BigQueryToElasticsearch Template #1092

Open

rohilla-anuj wants to merge 4 commits into main from feature/BQ_to_ES_template

Conversation

@rohilla-anuj

No description provided.

rohilla-anuj changed the title from "Add BigQueryToElasticsearch Template" to "feat: Add BigQueryToElasticsearch Template" on Jan 21, 2026
@sundar-mudupalli-work

Hey Anuj,

Thank you for submitting a new template for consideration, BigQuery to Elastic. Is there a customer where this template will be deployed or is currently deployed?

I would like to have a test for this code in our nightly builds. If you take a look at this location, you will see we have a nightly test that validates that the Elasticsearch to GCS template is working as planned. I would like to see a similar test for BigQuery to Elastic.

If you already have a similar test, great. If not, create one for your template and test it in your local environment. I can create the same table in a BigQuery dataset in our project if you provide the schema and data. Then you could update the YAML right after this test, using the Elastic parameters for the BigQuery to Elastic test.

However, we are taking a different approach with Dataproc Templates. We have over 150 combinations of sources and sinks, and the current Dataproc templates cover only about 25 of them. We don't have the capacity to write and test all of those combinations.

We are seeing that most of our users are data engineers who are familiar with writing code, and Gemini Code Assist is able to generate fairly accurate code to migrate data from a source to a sink. So we are asking customers to generate Spark code (in the language of their choice) using Gemini Code Assist.

hive_to_BQReadme.md contains an example prompt that generates a Python program to migrate Hive to BigQuery.

Can you check whether Gemini Code Assist can generate code to migrate data from BigQuery to Elasticsearch? That would be awesome. If you need help getting started on this, ping me and we can do a GVC.
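
For illustration, the kind of PySpark job such a prompt might generate could look roughly like the following. This is a sketch only, assuming the standard spark-bigquery and elasticsearch-hadoop connectors are on the classpath; the table, endpoint, index, and API key values are placeholders, not part of this PR.

# Sketch of a generated BigQuery -> Elasticsearch migration job (illustrative names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-to-es-example").getOrCreate()

# Read the source table through the BigQuery connector.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.bq_to_es")
    .load()
)

# Write to Elasticsearch through the elasticsearch-hadoop connector.
(
    df.write.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "https://my-deployment.es.example.com")
    .option("es.nodes.wan.only", "true")
    .option("es.net.http.header.Authorization", "ApiKey <api-key>")
    .option("es.resource", "bq-to-es-test-index")
    .mode("overwrite")
    .save()
)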

Sundar Mudupalli

@lord-skinner

Hello @sundar-mudupalli-work! @rohilla-anuj and I are from Elastic and are a customer. We would hope some of our shared customers would also want to leverage these templates.

Tagging @anshumanmaity-elastic as @rohilla-anuj is on leave for a few weeks.

es_conf = {
    "es.nodes": es_node,
    "es.resource": es_index,
    "es.net.http.header.Authorization": es_api_key
}

Suggested change:

"es.net.http.header.Authorization": f"ApiKey {es_api_key}"

fix: ElasticSearch API key authorization header format
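
For context, a minimal sketch of how the corrected config would flow into the connector write, reusing the variable names from the snippet above; the write call follows the standard elasticsearch-hadoop usage and is illustrative, not the template's exact code.

es_conf = {
    "es.nodes": es_node,
    "es.resource": es_index,
    # The connector forwards this header value as-is, so the "ApiKey " scheme prefix is required.
    "es.net.http.header.Authorization": f"ApiKey {es_api_key}",
}

df.write.format("org.elasticsearch.spark.sql").options(**es_conf).mode("append").save()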

@anshumanmaity-elastic

Hi @sundar-mudupalli-work,

Thank you so much for the review!

I’ve attached the sample data and schema that I used to validate the BigQuery to Elastic template.

Below is the test setup I’m planning to implement for nightly builds. Kindly review and let me know your thoughts or any suggestions for improvement:

- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
  id: bigquery-to-elasticsearch
  env:
    - 'GCP_PROJECT=${_GCP_PROJECT}'
    - 'GCS_STAGING_LOCATION=${_GCS_STAGING_LOCATION_BASE}'
    - 'GCS_DEPS_BUCKET=${_GCS_DEPS_BUCKET}'
    - 'REGION=${_REGION}'
    - 'ENV_ELASTIC_NODE=${_ENV_ELASTIC_INPUT_NODE}'
    - 'ENV_ELASTIC_USER=${_ENV_ELASTIC_USER}'
    - 'SKIP_BUILD=true'
    - 'JARS=gs://dataproc-templates_cloudbuild/integration-testing/jars/elasticsearch-spark-30_2.13-9.3.0.jar'
  script: |
    #!/usr/bin/env bash
    cd python
    ./bin/start.sh \
    -- --template=BIGQUERYTOELASTICSEARCH \
    --bigquery.elasticsearch.input.table="dataproc_templates_python.bq_to_es" \
    --bigquery.elasticsearch.output.node=${ENV_ELASTIC_NODE} \
    --bigquery.elasticsearch.output.index="bq-to-es-test-index" \
    --bigquery.elasticsearch.output.user=${ENV_ELASTIC_USER} \
    --bigquery.elasticsearch.output.password=${ELASTIC_PASSWORD} \
    --bigquery.elasticsearch.output.mode="overwrite"
  secretEnv:
    - 'ELASTIC_PASSWORD'
  waitFor: ['build-and-upload']
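
One possible addition to the step above: after the template run, the test could confirm that documents actually landed in the target index. Below is a sketch of such a check, assuming the cluster exposes the standard _count REST endpoint and the same basic-auth test credentials; the script name and wiring are illustrative and not part of this PR.

# check_bq_to_es.py - sketch of a post-run document-count check (illustrative).
import base64
import json
import os
import urllib.request

node = os.environ["ENV_ELASTIC_NODE"].rstrip("/")
creds = f'{os.environ["ENV_ELASTIC_USER"]}:{os.environ["ELASTIC_PASSWORD"]}'
auth = base64.b64encode(creds.encode()).decode()

# GET <node>/<index>/_count returns {"count": N, ...}
req = urllib.request.Request(
    f"{node}/bq-to-es-test-index/_count",
    headers={"Authorization": f"Basic {auth}"},
)
with urllib.request.urlopen(req) as resp:
    count = json.load(resp)["count"]

assert count > 0, "BigQuery to Elasticsearch test wrote no documents"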

For testing, I used the elasticsearch-spark-30_2.13-9.3.0.jar, which aligns with my Elasticsearch, Spark, and Scala versions. The template and test executed successfully with this configuration. I also suggested a change related to the API key authorization header format.

Below is the gcloud command I used to test the template as a Dataproc serverless batch job:

gcloud dataproc batches submit pyspark python/main.py \
  --project="gcp-project-id" \
  --region="gcp-region" \
  --deps-bucket="gcs-dataproc-bucket" \
  --subnet="elastic-subnet" \
  --py-files="path_to_dataproc_templates_distribution.egg" \
  --jars="gs://gcs-dataproc-bucket/dependencies/elasticsearch-spark-30_2.13-9.3.0.jar" \
  -- \
  --template=BIGQUERYTOELASTICSEARCH \
  --bigquery.elasticsearch.input.table="gcp-project-id:dataset_name.bq_to_es" \
  --bigquery.elasticsearch.output.node="elastic-node" \
  --bigquery.elasticsearch.output.index="dataproc-template-test-index" \
  --bigquery.elasticsearch.output.api.key="api-key" \
  --bigquery.elasticsearch.output.mode="overwrite" \
  --bigquery.elasticsearch.output.es.nodes.wan.only="true" \
  --bigquery.elasticsearch.output.es.nodes.discovery="false" \
  --bigquery.elasticsearch.output.es.nodes.data.only="false" \
  --bigquery.elasticsearch.output.es.net.ssl="true"

Please let me know if you’d like me to adjust the dependency version to better align with your nightly build environment, or if any additional details are required from my side.
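
In case it helps when recreating the table on your side, below is a possible load step for the attached files. It is a sketch only, assuming bq_to_es.json is a standard BigQuery schema file, the CSV has a header row and has been staged in GCS, and using the dataset name from the nightly test above; the project and bucket are placeholders.

# Sketch: load the attached sample data into the BigQuery table used by the test.
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")

job_config = bigquery.LoadJobConfig(
    schema=client.schema_from_json("bq_to_es.json"),  # attached schema file
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # assumes a header row
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://your-bucket/bq_to_es.csv",  # attached data, staged in GCS
    "your-project-id.dataproc_templates_python.bq_to_es",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish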

Best regards,
Anshuman Maity
bq_to_es.json
bq_to_es.csv

@anshumanmaity-elastic

Hi Sundar, hope you are doing well!

Update:
I tried pushing the above-mentioned fix to the branch feature/BQ_to_ES_template, but I’m encountering a 403 permission error.

As a workaround, I forked the dataproc-templates repository, applied the same changes on the branch feature/BQ_to_ES_template, and raised a PR. However, the PR is currently failing with an error stating “Needs /gcbrun from a collaborator.”

I have also completed and signed the Contributor License Agreement (CLA).

Could you please let me know how I should proceed further? That would be really helpful!

Thanks & Regards,
Anshuman Maity
