DCM2BQ (DICOM to BigQuery) is a tool for extracting metadata and generating vector embeddings from DICOM files, loading both into Google BigQuery. It can be run as a standalone CLI or as a containerized service, making it easy to integrate into data pipelines.
By generating vector embeddings for DICOM images, Structured Reports, and PDFs, DCM2BQ enables powerful semantic search and similarity-based retrieval across your medical imaging data. This allows you to find related studies, cases, or reports even when traditional metadata fields do not match exactly.
This open-source package can be used as an alternative to the DICOM metadata streaming feature in the Google Cloud Healthcare API, enabling similar functionality for DICOM data stored in Google Cloud Storage. It can also be used to complement a Healthcare API DICOM store by generating embeddings for existing or new data.
Traditional imaging systems like PACS and VNAs offer limited query capabilities over DICOM metadata. By ingesting the complete metadata and vector embeddings into BigQuery, you unlock powerful, large-scale analytics and insights from your imaging data.
Benefits of Embedding-Based Search:
- Go beyond exact field matching: Find similar images, reports, or studies based on visual or textual content, not just metadata.
- Enable content-based retrieval: Search for "cases like this one" or "find similar findings" using embeddings.
- Support multi-modal queries: Use embeddings from images, SRs, and PDFs for unified search across modalities.
- Improve research, cohort discovery, and clinical decision support by surfacing relevant cases that would be missed by keyword or tag-based search alone.
- Parse DICOM Part 10 files.
- Convert DICOM metadata to a flexible JSON representation.
- Load DICOM metadata and vector embeddings into a BigQuery table.
- Enable semantic and similarity search over your imaging archive using embeddings.
- Run as a containerized service, ideal for event-driven pipelines.
- Run as a command-line interface (CLI) for manual or scripted processing.
- Handle Google Cloud Storage object lifecycle events (creation, deletion) to keep BigQuery synchronized.
- Generate vector embeddings from DICOM images, Structured Reports, and encapsulated PDFs using Google's multi-modal embedding model.
- Highly configurable to adapt to your needs.
The project stores DICOM metadata and embeddings in separate BigQuery tables. By default the service writes:
- a metadata table (JSON fields for full DICOM metadata and processing info), and
- an embeddings table that stores a deterministic `id` (the SHA-256 of `path|version`) and a repeated `FLOAT` column named `embedding` (the vector).
The Cloud Run service is configured with both table IDs via the `gcpConfig.bigQuery` object (see `config.defaults.js`). Use the `embeddingsTableId` value when running vector searches or creating vector indexes and models.
You can find example queries and DDL for creating the REMOTE model and vector index in src/bq-samples.sql. The file includes:
- example SELECTs against the metadata table and its view,
- sample aggregation queries,
- and DDL samples to create an embedding model and a vector index for the embeddings table.
Before running vector searches, ensure you have created the embedding model and vector index (the samples show how to do this with `bq query`).
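Once the model and index exist, a vector search joins a query embedding against the embeddings table. As an illustration only (the dataset, table, and model names below are placeholders; the authoritative statements live in src/bq-samples.sql), the query can be assembled like this:

```javascript
// Sketch: assemble a BigQuery VECTOR_SEARCH statement over the embeddings
// table. Dataset/table/model names are placeholders; adapt them to the
// values in your gcpConfig.bigQuery configuration and src/bq-samples.sql.
function buildVectorSearchSql({ dataset, embeddingsTable, model, topK = 5 }) {
  return `
    SELECT base.id, distance
    FROM VECTOR_SEARCH(
      TABLE \`${dataset}.${embeddingsTable}\`,
      'embedding',
      (
        SELECT ml_generate_embedding_result AS embedding
        FROM ML.GENERATE_EMBEDDING(
          MODEL \`${dataset}.${model}\`,
          (SELECT @query_text AS content)
        )
      ),
      top_k => ${topK}
    )`;
}

const sql = buildVectorSearchSql({
  dataset: 'my_dataset',
  embeddingsTable: 'embeddings',
  model: 'embedding_model',
});
```

Run the resulting statement with `bq query --use_legacy_sql=false` or the BigQuery client library, binding `@query_text` to your search phrase.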
For image processing and vector embedding generation, dcm2bq relies on two external toolkits that must be installed in the execution environment:
- DCMTK: A collection of libraries and applications for working with DICOM files.
- GDCM: A library for reading and writing DICOM files, used here for image format conversion.
These are included in the provided Docker image. If you are building from source or running the CLI locally, you will need to install them manually.
On Debian/Ubuntu:
```shell
sudo apt-get update && sudo apt-get install -y dcmtk gdcm-tools
```

The service is distributed as a container image. You can find the latest releases on Docker Hub.
```shell
docker pull jasonklotzer/dcm2bq:latest
```

To use the CLI, you can install it from the source code.
- Ensure you have `node` and `npm` installed. We recommend using nvm.
- Ensure you have installed the required Dependencies.
- Clone the repository:

  ```shell
  git clone https://github.com/googlecloudplatform/dcm2bq.git
  ```

- Navigate to the directory and install dependencies and the CLI:

  ```shell
  cd dcm2bq
  npm install
  npm install -g .
  ```
- Verify the installation:

  ```shell
  dcm2bq --help
  ```
The recommended deployment uses Google Cloud Storage, Pub/Sub, and Cloud Run.
The workflow is as follows:
- An object operation (e.g., creation, deletion) occurs in a GCS bucket.
- A notification is sent to a Pub/Sub topic.
- A Pub/Sub subscription pushes the message to a Cloud Run service running the `dcm2bq` container.
- The `dcm2bq` container processes the message:
  - It validates the message schema and checks for a DICOM-like file extension (e.g., `.dcm`).
  - For new objects, it reads the file from GCS and parses the DICOM metadata.
  - If embeddings are enabled, it generates a vector embedding from the DICOM data (for supported types like images, SRs, and PDFs) by calling the Vertex AI Embeddings API.
  - It inserts a JSON representation of the metadata and the embedding into BigQuery.
  - For deleted objects, it records the deletion event in BigQuery.
- If an error occurs, the message is NACK'd for retry. After maximum retries, it's sent to a dead-letter topic for analysis.
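The message-handling steps above can be sketched as a minimal push handler. The notification field names (`bucket`, `name`, the `eventType` attribute) follow the standard GCS Pub/Sub notification format; the handler itself is illustrative, not dcm2bq's actual code:

```javascript
// Sketch: decode a GCS notification delivered via Pub/Sub push and decide
// what to do with it. Illustrative only; dcm2bq's real handler also parses
// the DICOM file, generates embeddings, and writes to BigQuery.
function handlePushMessage(body) {
  // Pub/Sub push wraps the notification in body.message.data (base64 JSON).
  const msg = body && body.message;
  if (!msg || !msg.data) throw new Error('invalid Pub/Sub push schema');

  const event = JSON.parse(Buffer.from(msg.data, 'base64').toString('utf8'));
  const eventType = msg.attributes && msg.attributes.eventType;

  // Only act on DICOM-like objects.
  if (!/\.dcm$/i.test(event.name)) return { action: 'ignore' };

  if (eventType === 'OBJECT_FINALIZE') {
    // New object: read from GCS, parse metadata, optionally embed, insert.
    return { action: 'insert', uri: `gs://${event.bucket}/${event.name}` };
  }
  if (eventType === 'OBJECT_DELETE') {
    // Deleted object: record the deletion event.
    return { action: 'delete', uri: `gs://${event.bucket}/${event.name}` };
  }
  return { action: 'ignore' };
}
```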
Note: When deploying to Cloud Run, ensure the container has enough memory allocated to handle your largest DICOM files.
The CLI is useful for testing, development, and batch processing.
Example: Dump DICOM metadata as JSON
```shell
dcm2bq dump test/files/dcm/ct.dcm | jq
```

This command will output the full DICOM metadata in JSON format, which can be piped to tools like `jq` for filtering and inspection.
Example: Generate a vector embedding
```shell
dcm2bq embed test/files/dcm/ct.dcm
```

This command will process the DICOM file, generate a vector embedding using the configured model, and output the embedding as a JSON array.
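Embedding vectors like this are compared with a similarity measure; cosine similarity is a common choice for this kind of retrieval. A small, self-contained sketch:

```javascript
// Sketch: cosine similarity between two embedding vectors, e.g. two JSON
// arrays produced by `dcm2bq embed`. Values near 1 mean highly similar.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

In production, this ranking is what BigQuery's vector search performs at scale over the embeddings table; the function above is only for ad-hoc comparisons of CLI output.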
Example: Extract rendered image or text from a DICOM file
```shell
dcm2bq extract test/files/dcm/ct.dcm
```

This command will extract and save a rendered image (JPG) or extracted text (TXT) from the DICOM file, depending on its type (image, SR, or PDF). The output file extension is chosen automatically unless you specify `--output`.
Example: Extract with summarization (SR/PDF only)
```shell
dcm2bq extract test/files/dcm/sr.dcm --summary
```

By default, summarization is disabled for extracted text. If you pass `--summary`, the extracted text from Structured Reports (SR) or PDFs will be summarized using Gemini before saving. This is useful for generating concise, embedding-friendly text.
Example: Extract without summarization (explicitly)
```shell
dcm2bq extract test/files/dcm/sr.dcm
```

If you do not pass `--summary`, the full extracted text will be saved (subject to length limits for embedding).
Configuration options can be found in the default config file.
You can override these defaults in two ways.
Important: When providing an override via environment variable or a file, you must supply the entire configuration object. The default configuration is not merged with your overrides; your provided configuration will be used as-is.
- Environment Variable: Set `DCM2BQ_CONFIG` to a JSON string containing the full configuration.

  ```shell
  export DCM2BQ_CONFIG='{"bigquery":{"datasetId":"my_dataset","metadataTableId":"my_table"},"gcpConfig":{"projectId":"my-gcp-project","embeddings":{"enabled":true,"model":"multimodalembedding@001"}},"jsonOutput":{...}}'
  ```
- Config File: Set `DCM2BQ_CONFIG_FILE` to the path of a JSON file containing your full configuration.

  ```shell
  # config.json
  # {
  #   "bigquery": {
  #     "datasetId": "my_dataset",
  #     "metadataTableId": "my_table"
  #   },
  #   "gcpConfig": {
  #     "projectId": "my-gcp-project",
  #     "embeddings": {
  #       "enabled": true,
  #       "model": "multimodalembedding@001"
  #     }
  #   },
  #   "jsonOutput": {
  #     ...
  #   }
  # }
  export DCM2BQ_CONFIG_FILE=./config.json
  ```
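The replace-not-merge behavior is worth internalizing. A small sketch of what that resolution looks like (the default values and resolver name here are illustrative; see `config.defaults.js` for the real defaults):

```javascript
// Sketch: config resolution where an override REPLACES the defaults
// wholesale (no deep merge). Keys omitted from the override are lost,
// which is why the full configuration object must be supplied.
// The default values below are illustrative, not dcm2bq's actual defaults.
const defaults = {
  bigquery: { datasetId: 'dcm2bq', metadataTableId: 'metadata' },
  gcpConfig: { embeddings: { enabled: false } },
};

function resolveConfig(env) {
  if (env.DCM2BQ_CONFIG) return JSON.parse(env.DCM2BQ_CONFIG); // used as-is
  return defaults;
}

const cfg = resolveConfig({
  DCM2BQ_CONFIG: '{"bigquery":{"datasetId":"my_dataset"}}',
});
// cfg.gcpConfig is undefined: the default embeddings settings were NOT merged in.
```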
To enable vector embedding generation, configure the embeddings section within gcpConfig.
Example config.json override:
```json
{
  "gcpConfig": {
    "embeddings": {
      "enabled": true,
      "model": "multimodalembedding@001",
      "summarizeText": { "enabled": false }
    }
  }
}
```

- Note: the JSON snippet above is a partial example showing only the embeddings-related settings. When providing an override (via `DCM2BQ_CONFIG` or `DCM2BQ_CONFIG_FILE`), you must supply the entire configuration object; partial merges are not supported.
- `enabled`: Set to `true` to activate the feature.
- `model`: The name of the Vertex AI model to use for generating embeddings.
- `summarizeText.enabled`: Controls whether extracted text from SR/PDF is summarized before embedding or saving. This can be overridden at runtime by the CLI `--summary` flag.
To get started with development, follow the installation steps for the CLI.
The test directory contains numerous examples, unit tests, and integration tests that are helpful for understanding the codebase and validating changes.
The test suite is a combination of unit and integration tests. The integration tests make real API calls to Google Cloud services (e.g., for vector embedding generation) and require a properly configured environment.
To run the full test suite:
- Ensure you are authenticated with GCP (`gcloud auth application-default login`).
- Ensure your project has the necessary APIs enabled (e.g., Vertex AI API).
- Run the tests:

  ```shell
  npm test
  ```
Contributions are welcome! Please see CONTRIBUTING.md for details on how to contribute to this project.
This project is licensed under the Apache 2.0 License.
The recommended way to deploy the service and all required Google Cloud resources is using Terraform. This will provision:
- Google Cloud Storage bucket(s)
- Pub/Sub topics and subscriptions
- BigQuery dataset and tables
- Cloud Run service
- All necessary IAM permissions
A helper script is provided to automate the process:
```shell
./helpers/deploy.sh [destroy|upload] <gcp_project_id>
```

- `upload`: Upload test DICOM files from `test/files/dcm/*.dcm` to the GCS bucket created by Terraform (standalone; does not deploy).
- `destroy`: Destroy all previously created resources (cleanup).
- `--help` or `-h`: Show usage instructions.
Examples
- Deploy infrastructure:

  ```shell
  ./helpers/deploy.sh my-gcp-project-id
  ```

- Upload test data only (no deploy):

  ```shell
  ./helpers/deploy.sh upload my-gcp-project-id
  ```

- Deploy and then upload test data (two steps):

  ```shell
  ./helpers/deploy.sh my-gcp-project-id
  ./helpers/deploy.sh upload my-gcp-project-id
  ```

- Destroy all resources:

  ```shell
  ./helpers/deploy.sh destroy my-gcp-project-id
  ```

The script will:
- Ensure all dependencies (Terraform, gcloud, gsutil) are installed.
- Create a GCS bucket for Terraform state (if needed).
- Generate a backend config for Terraform.
- Deploy all infrastructure using Terraform.
- Optionally upload test DICOM files if the flag is supplied.
Note: All resource names (buckets, datasets, tables, etc.) are made unique per deployment to avoid collisions.