Skip to content

feat: HQL datadog dashboard#4692

Closed
devinat1 wants to merge 1 commit intomainfrom
devin/eng-3038-hql-datadog-dash
Closed

feat: HQL datadog dashboard#4692
devinat1 wants to merge 1 commit intomainfrom
devin/eng-3038-hql-datadog-dash

Conversation

@devinat1
Copy link
Contributor

@devinat1 devinat1 commented Sep 8, 2025

This pull request introduces initial Terraform configuration for the HQL datadog dashboard and provider settings. The dashboard itself consists of traces set up in #4673

dashboards.json.tf to defines two Datadog dashboards: one for HQL overview (traces, latency, errors) and another for controller-level metrics.

@vercel
Copy link

vercel bot commented Sep 8, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
helicone Ready Ready Preview Comment Sep 8, 2025 6:39am
helicone-bifrost Error Error Sep 8, 2025 6:39am
helicone-eu Ready Ready Preview Comment Sep 8, 2025 6:39am

@claude
Copy link
Contributor

claude bot commented Sep 8, 2025

Claude finished @devinat1's task —— View job


Code Review: HQL Datadog Dashboard Terraform Configuration

Todo List

  • Read and analyze all Terraform files
  • Review code quality and best practices
  • Check for potential bugs and issues
  • Evaluate security concerns
  • Provide comprehensive feedback with confidence score

Summary

This PR introduces well-structured Terraform configuration for HQL Datadog dashboards. The implementation follows Terraform best practices and creates comprehensive monitoring for HQL operations. Confidence Score: 8/10 - Merge this PR

Detailed Review

Code Quality & Best Practices ✅

Strengths:

  • Proper Terraform structure: Clear separation of concerns with versions.tf, variables.tf, and dashboard definitions
  • Version constraints: Appropriate version pinning for both Terraform (>= 1.3.0) and Datadog provider (>= 3.40.0)
  • Terraform Cloud integration: Proper workspace configuration for helicone-datadog
  • Template variables: Good use of $service and $env template variables for filtering
  • Comprehensive monitoring: Covers counts, latency percentiles (p50/p95/p99), errors, and top resources
  • Clear naming: Dashboard titles and widget descriptions are descriptive and consistent

Best practices followed:

  • Using datadog_dashboard_json resource for complex dashboard definitions
  • Proper JSON formatting within Terraform heredoc syntax
  • Logical separation of overview vs controller-level dashboards
Security Considerations 🟡

Good practices:

  • API keys properly externalized to variables (not hardcoded)
  • Example file provided without actual credentials
  • Terraform Cloud backend for secure state management

Recommendations:

  • Consider marking sensitive variables in variables.tf:1-3,5-7:
variable "datadog_api_key" {
  type = string
  sensitive = true
}

variable "datadog_app_key" {
  type = string
  sensitive = true
}
  • Ensure terraform.tfvars is in .gitignore to prevent accidental credential commits
Potential Issues & Improvements 🟨

Minor issues:

  1. Dashboard JSON formatting: While functional, the embedded JSON could benefit from consistent indentation
  2. Resource naming: Consider adding tags or descriptions to dashboard resources for better organization
  3. Missing validation: Variables in variables.tf:1-12 lack validation rules for API key format
  4. Hardcoded index: trace-search index is hardcoded - consider making it a variable if multiple indexes exist

Suggested improvements:

# In variables.tf, add descriptions and validation
variable "datadog_api_key" {
  description = "Datadog API key for authentication"
  type        = string
  sensitive   = true
  validation {
    condition     = length(var.datadog_api_key) > 0
    error_message = "Datadog API key cannot be empty."
  }
}
Dashboard Configuration Analysis ✅

HQL Overview Dashboard (dashboards.json.tf:1-77):

  • ✅ Monitors key HQL operations: executeSql, getClickHouseSchema, downloadCsv
  • ✅ Includes essential metrics: span counts, latency percentiles, error rates
  • ✅ Top resource lists for performance analysis
  • ✅ Proper color coding for hits vs errors

HQL Controllers Dashboard (dashboards.json.tf:79-129):

  • ✅ Focuses on controller-level metrics with .controller. prefix
  • ✅ Consistent structure with overview dashboard
  • ✅ Error-specific monitoring with status:error filter

Query patterns are consistent and well-structured

Performance Considerations ✅

  • Dashboard queries use appropriate aggregations (count, pc50, pc95, pc99)
  • Template variables allow for efficient filtering by service and environment
  • Resource grouping by resource_name provides good granularity without overwhelming detail

Files Reviewed

  • versions.tf:1-22 - Provider configuration ✅
  • variables.tf:1-12 - Variable definitions ✅
  • terraform.tfvars.example:1-5 - Example configuration ✅
  • dashboards.json.tf:1-131 - Dashboard definitions ✅
  • .terraform.lock.hcl:1-24 - Lock file ✅

Recommendations

  1. Add sensitive flags to API key variables for better security
  2. Consider adding validation rules for required variables
  3. Ensure proper .gitignore entries for tfvars files
  4. Add resource tags for better organization in larger deployments

Confidence Score: 8/10 - This is solid Terraform code that follows best practices. The minor security and validation improvements are recommended but not blocking.

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This pull request introduces Terraform infrastructure-as-code for managing Datadog monitoring dashboards specifically for HQL (Helicone Query Language) operations. The changes add a new directory docker/terraform/datadog/ containing five files that establish the foundation for managing Datadog resources through Terraform.

The configuration sets up two comprehensive monitoring dashboards defined in dashboards.json.tf: an HQL overview dashboard that tracks span counts, latency metrics (p50/p95/p99), error rates, and top resource lists, and a controller-level dashboard for more granular metrics. Both dashboards monitor three key HQL operations: executeSql, getClickHouseSchema, and downloadCsv. The dashboards use template variables for service and environment filtering, enabling flexible monitoring across different deployment contexts.

The infrastructure follows Terraform best practices with proper version constraints (>= 1.3.0 for Terraform, >= 3.40.0 for Datadog provider), uses Terraform Cloud backend with a dedicated helicone-datadog workspace for resource isolation, and implements parameterized configuration through variables. The setup includes standard Terraform files: versions.tf for provider configuration, variables.tf for parameterization, terraform.tfvars.example as a configuration template, and .terraform.lock.hcl for dependency management.

This change integrates with the tracing infrastructure mentioned in PR #4673, providing the observability layer needed to monitor HQL performance and usage patterns. The separate Terraform workspace approach isolates Datadog resources from the existing AWS infrastructure managed in valhalla/terraform/, allowing for independent management of monitoring infrastructure while maintaining the overall system's modularity.

Confidence score: 4/5

  • This PR introduces new monitoring infrastructure with well-structured Terraform configuration and minimal risk to existing systems
  • Score reflects minor formatting issues and some missing Terraform best practices, but no critical problems that would prevent safe deployment
  • Pay attention to the variables.tf file which lacks sensitive variable marking and validation rules

Context used:

Context - Use 'bash' at the start of code blocks that contain shell commands for clarity. (link)

5 files reviewed, 3 comments

Edit Code Review Bot Settings | Greptile

# Get these from your Datadog account: https://app.datadoghq.com/organization-settings/api-keys
datadog_api_key = "your_datadog_api_key_here"
datadog_app_key = "your_datadog_app_key_here"
datadog_api_url = "https://api.us5.datadoghq.com" No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syntax: Missing trailing newline

Comment on lines +1 to +3
variable "datadog_api_key" {
type = string
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: API key should be marked as sensitive to prevent exposure in logs

Suggested change
variable "datadog_api_key" {
type = string
}
variable "datadog_api_key" {
type = string
sensitive = true
}

Comment on lines +5 to +7
variable "datadog_app_key" {
type = string
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: App key should be marked as sensitive to prevent exposure in logs

Suggested change
variable "datadog_app_key" {
type = string
}
variable "datadog_app_key" {
type = string
sensitive = true
}

@devinat1
Copy link
Contributor Author

devinat1 commented Sep 8, 2025

image

@tomharmon
Copy link
Contributor

PR LGTM but I was thinking we should keep all the terraform together, maybe the helicone-helm-v3 repo would be better?

@devinat1
Copy link
Contributor Author

devinat1 commented Sep 8, 2025

@devinat1 devinat1 closed this Sep 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants