DDC Diagnostic Tarball Contents

This document describes the structure and contents of the diagnostic tarball (diag.tgz) generated by the Dremio Diagnostic Collector (DDC).

Overview

The DDC creates a compressed tarball containing diagnostic information from Dremio nodes. The tarball includes logs, configuration files, system information, performance data, and API exports organized in a structured directory hierarchy.

Top-Level Structure

diag.tgz
├── summary.json                   # Collection summary and metadata
├── ddc.log                        # DDC execution log
├── <node-name>.log                # Detailed collection log for each node
├── configuration/
├── logs/
├── node-info/
├── queries/
├── job-profiles/
├── system-tables/
├── cluster-stats/
├── wlm/
├── kvstore/
├── jfr/
├── thread-dumps/
├── heap-dumps/
├── ttop/
└── kubernetes/ 
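To check which of the directories above are actually present in a given tarball, the archive can be inspected without extracting it. The following is a minimal sketch using Python's standard `tarfile` module; the function name `top_level_entries` is illustrative and not part of DDC.

```python
import tarfile
from collections import Counter


def top_level_entries(tarball_path):
    """Count archive members under each top-level name in the tarball."""
    counts = Counter()
    # "r:*" lets tarfile detect the compression (gzip for diag.tgz).
    with tarfile.open(tarball_path, "r:*") as tar:
        for member in tar:
            # Treat "./logs/..." and "logs/..." the same way.
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if parts:
                counts[parts[0]] += 1
    return counts
```

Calling `top_level_entries("diag.tgz")` returns a mapping such as `logs` to the number of members beneath it, which gives a quick view of what a particular collection contains.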

Directory Contents

summary.json

  • Purpose: Collection metadata and summary information
  • Content: Execution details, node information, collection statistics, and any errors encountered

configuration/<node-name>/

Configuration files from each Dremio node:

  • dremio.conf - Main Dremio configuration (passwords masked)
  • dremio-env - Environment configuration
  • logback.xml - Logging configuration
  • logback-access.xml - Access logging configuration

logs/<node-name>/

Log files from each node (configurable retention period):

  • server.log - Main Dremio server log
  • server.out - Server output log
  • metadata_refresh.log - Metadata refresh operations
  • reflection.log - Reflection/acceleration logs
  • vacuum.json - Vacuum operation logs
  • access.log - HTTP access logs (if enabled)
  • audit.log - Audit logs (if enabled)
  • acceleration.log - Acceleration logs (if enabled)
  • server*.gc* - Garbage collection logs (pattern configurable)
  • hs_err_pid*.log - JVM crash dump files

node-info/<node-name>/

System and node information collected from various system commands:

  • diskusage.txt - Disk usage information from df -h command showing filesystem usage, available space, and mount points to identify storage capacity issues and potential out-of-space conditions
  • rocksdb_disk_allocation.txt - RocksDB disk usage from du -sh /opt/dremio/data/db/* (coordinator nodes only) showing metadata store size breakdown for capacity planning and performance analysis
  • jvm_settings.txt - JVM flags and configuration from jps -v command including heap sizes, garbage collector settings, and JVM options for performance tuning and memory issue diagnosis
  • os_info.txt - Comprehensive operating system information from multiple sources:
    • OS Release Info: cat /etc/*-release - Linux distribution name, version, and release details for compatibility analysis and known issue identification
    • Kernel Version: uname -r - Kernel release version to identify potential kernel-related performance issues or compatibility problems
    • System Issue: cat /etc/issue - System login banner that may contain additional OS version or configuration information
    • Hostname: cat /proc/sys/kernel/hostname - System hostname for node identification and network troubleshooting
    • Memory Info: cat /proc/meminfo - Detailed memory statistics including total, free, cached, and swap memory for memory pressure analysis
    • CPU Info: lscpu - CPU architecture, core count, threading, and feature flags for performance tuning and capacity planning
    • Mount Points: mount - Active filesystem mounts to identify storage configuration, mount options, and potential I/O bottlenecks
    • Block Devices: lsblk - Block device hierarchy showing disks, partitions, and mount relationships for storage troubleshooting
    • Cgroup Type: stat -fc %T /sys/fs/cgroup/ - Cgroup filesystem type (v1 vs v2) to understand container resource management capabilities
    • Cgroup v2 Info (if available) - Container resource usage and pressure metrics for containerized deployments:
      • cat /sys/fs/cgroup/memory.current - Current memory usage within container limits
      • cat /sys/fs/cgroup/memory.swap.current - Current swap usage within container limits
      • cat /sys/fs/cgroup/memory.pressure - Memory pressure stall information indicating memory contention
      • cat /sys/fs/cgroup/cpu.pressure - CPU pressure stall information indicating CPU contention
      • cat /sys/fs/cgroup/io.pressure - I/O pressure stall information indicating storage bottlenecks
    • Load Average: cat /proc/loadavg - System load averages (1, 5, 15 minute) and running/total processes for performance assessment
    • Process Environment: ps eww <pid> - Dremio process environment variables for configuration validation and troubleshooting startup issues
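The cgroup v2 pressure files listed above use the kernel's pressure stall information (PSI) format: one `some` line and, for most resources, one `full` line, each with `avg10`, `avg60`, `avg300`, and `total` fields. The sketch below parses that format from a string. How os_info.txt delimits its sections is not described here, so the function assumes the caller has already isolated the pressure lines; `parse_pressure` is an illustrative name.

```python
def parse_pressure(text):
    """Parse PSI lines into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] not in ("some", "full"):
            continue  # ignore anything that is not a PSI line
        metrics = {}
        for pair in fields[1:]:
            key, _, value = pair.partition("=")
            # avg* values are percentages; total is stall time in microseconds.
            metrics[key] = int(value) if key == "total" else float(value)
        result[fields[0]] = metrics
    return result


sample = (
    "some avg10=1.50 avg60=0.75 avg300=0.10 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=789\n"
)
pressure = parse_pressure(sample)
print(pressure["some"]["avg10"], pressure["full"]["total"])
```

A sustained non-zero `avg60` or `avg300` on the `full` line means all non-idle tasks were stalled on that resource for part of the window, which is usually the first thing to look for.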

queries/<node-name>/

Query execution logs containing the query text, some performance information, the selected queue, query planning details, and more:

  • queries.json - Current queries.json file
  • queries.json.* - Archived/rotated queries.json files
  • Files contain query information for query analysis
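For a first pass over these files, a small reader is often enough. The sketch below assumes each file holds one JSON object per line and that rotated files may be gzip-compressed; both are assumptions to verify against a real collection. The key `queueName` in the usage note is likewise an assumed field name, not something this document specifies.

```python
import gzip
import json
from collections import Counter
from pathlib import Path


def iter_query_records(path):
    """Yield one dict per JSON line, skipping blank or unparseable lines."""
    path = Path(path)
    # Assumption: rotated archives ending in .gz are gzip-compressed text.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a truncated final line
            if isinstance(record, dict):
                yield record


def count_by(records, key):
    """Count records by the string value of `key` ("<missing>" if absent)."""
    return Counter(str(record.get(key, "<missing>")) for record in records)
```

For example, `count_by(iter_query_records("queries.json"), "queueName")` would tally queries per queue if the records carry that field.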

job-profiles/<node-name>/

Job profile exports (REST API collections):

  • JSON files containing detailed job execution profiles
  • Organized by query performance categories (slow execution, high cost, etc.)
  • File naming pattern: job-profile-<job-id>.json

system-tables/<node-name>/

Exported Dremio system tables (requires a personal access token, PAT):

Standard Tables:

  • sys.tables_offset_0_limit_<n>.json - Table metadata
  • sys.jobs_offset_0_limit_<n>.json - Job information
  • sys.nodes_offset_0_limit_<n>.json - Cluster node information
  • sys.options_offset_0_limit_<n>.json - System options/settings
  • sys.memory_offset_0_limit_<n>.json - Memory usage
  • sys.threads_offset_0_limit_<n>.json - Thread information
  • sys.fragments_offset_0_limit_<n>.json - Query fragment details
  • sys.materializations_offset_0_limit_<n>.json - Materialization info
  • sys.reflections_offset_0_limit_<n>.json - Reflection definitions
  • sys.refreshes_offset_0_limit_<n>.json - Refresh operations
  • sys.services_offset_0_limit_<n>.json - Service information
  • sys.version_offset_0_limit_<n>.json - Version information
  • sys.cache.*_offset_0_limit_<n>.json - Cache-related tables

Enterprise Tables (if available):

  • sys.roles_offset_0_limit_<n>.json - Role definitions
  • sys.membership_offset_0_limit_<n>.json - User/role membership
  • sys.privileges_offset_0_limit_<n>.json - Permission information

Cloud Tables (Dremio Cloud only):

  • organization.*_offset_0_limit_<n>.json - Organization-level data
  • project.*_offset_0_limit_<n>.json - Project-level data

cluster-stats/<node-name>/

Cluster statistics and metrics:

  • JSON files with cluster performance and usage statistics
  • Collected via REST API calls

wlm/<node-name>/

Workload Manager information:

  • WLM configuration and statistics
  • Queue information and resource allocation data

kvstore/<node-name>/

KV Store reports:

  • Key-value store health and statistics
  • RocksDB information and metrics

jfr/

Java Flight Recorder profiles:

  • <node-name>.jfr - JFR recording files
  • Contains detailed JVM performance data
  • Configurable recording duration (default: 60 seconds)

thread-dumps/

Java thread dumps:

  • threadDump-<node-name>-<timestamp>.txt - Thread dump files
  • Multiple dumps collected at configurable intervals
  • Contains stack traces of all JVM threads

heap-dumps/

Java heap dumps (if enabled):

  • <node-name>.hprof.gz - Compressed heap dump files
  • Large files containing complete JVM heap state
  • Only collected when explicitly enabled

ttop/

Thread-level CPU usage:

  • Thread-level performance monitoring data
  • Collected at configurable intervals
  • Shows CPU usage per thread over time

kubernetes/ (Kubernetes deployments only)

Kubernetes-specific information collected via Kubernetes API:

Resource Definitions (JSON format):

  • nodes.json - Cluster node information and status
  • pods.json - Pod definitions, status, and metadata
  • service.json - Service configurations and endpoints
  • endpoints.json - Service endpoint details
  • deployments.json - Deployment configurations
  • statefulsets.json - StatefulSet definitions
  • daemonset.json - DaemonSet configurations
  • replicaset.json - ReplicaSet definitions
  • cronjob.json - CronJob schedules and configurations
  • job.json - Job definitions and status
  • events.json - Kubernetes cluster events
  • ingress.json - Ingress controller configurations
  • limitrange.json - Resource limit ranges
  • resourcequota.json - Resource quota definitions
  • hpa.json - Horizontal Pod Autoscaler configurations
  • pdb.json - Pod Disruption Budget settings
  • pc.json - Priority Class definitions
  • pv.json - Persistent Volume definitions
  • pvc.json - Persistent Volume Claim configurations
  • sc.json - Storage Class definitions

Container Logs:

  • container-logs/<pod-name>-<container-name>.txt - Current container logs
  • container-logs/<pod-name>-<container-name>-previous.txt - Previous container logs (if available)

Examples of container log files:

  • dremio-master-0-dremio-master-coordinator.txt - Coordinator container logs
  • dremio-executor-0-dremio-executor.txt - Executor container logs
  • dremio-executor-0-wait-for-zookeeper.txt - Init container logs
  • zk-0-kubernetes-zookeeper.txt - ZooKeeper container logs
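Because previous-container logs are only captured when available, the presence of a `-previous.txt` file is a useful hint that a container has restarted. The sketch below groups the filenames in `container-logs/` by their `<pod-name>-<container-name>` stem; `pair_container_logs` is an illustrative helper, and it assumes no container name itself ends in `-previous`.

```python
def pair_container_logs(filenames):
    """Map each <pod>-<container> stem to whether a -previous log exists."""
    stems = {}
    for name in filenames:
        if not name.endswith(".txt"):
            continue
        stem = name[: -len(".txt")]
        if stem.endswith("-previous"):
            # Record the restart hint under the stem of the current log.
            stems[stem[: -len("-previous")]] = True
        else:
            stems.setdefault(stem, False)
    return stems


logs = [
    "dremio-executor-0-dremio-executor.txt",
    "dremio-executor-0-dremio-executor-previous.txt",
    "zk-0-kubernetes-zookeeper.txt",
]
for stem, has_previous in sorted(pair_container_logs(logs).items()):
    print(stem, "previous log present" if has_previous else "no previous log")
```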

File Naming Conventions

  • Node-specific files: Include node name in path or filename
  • Timestamped files: Use format YYYY-MM-DD_HH_MM_SS
  • System tables: Pattern sys.<table-name>_offset_<n>_limit_<n>.json
  • Log archives: Original name with timestamp or sequence number
  • Special characters: Replaced with underscores in filenames (?, =, &)
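These conventions can be handled with a few small helpers when post-processing a collection. The sketch below is an illustration, not DDC's own code: the regular expression follows the system-table pattern above, the `strptime` format mirrors `YYYY-MM-DD_HH_MM_SS`, and `sanitize` assumes each of `?`, `=`, and `&` is replaced one-for-one by an underscore.

```python
import re
from datetime import datetime

# <table-name>_offset_<n>_limit_<n>.json, where the table name keeps its
# schema prefix (for example "sys.").
SYSTEM_TABLE_RE = re.compile(
    r"^(?P<table>.+)_offset_(?P<offset>\d+)_limit_(?P<limit>\d+)\.json$"
)
TIMESTAMP_FORMAT = "%Y-%m-%d_%H_%M_%S"


def parse_system_table_filename(filename):
    """Return (table, offset, limit), or None if the name does not match."""
    match = SYSTEM_TABLE_RE.match(filename)
    if match is None:
        return None
    return match["table"], int(match["offset"]), int(match["limit"])


def parse_timestamp(text):
    """Parse a YYYY-MM-DD_HH_MM_SS string into a naive datetime."""
    return datetime.strptime(text, TIMESTAMP_FORMAT)


def sanitize(name):
    """Replace ?, = and & with underscores (assumed one-for-one)."""
    return re.sub(r"[?=&]", "_", name)


print(parse_system_table_filename("sys.jobs_offset_0_limit_100000.json"))
print(parse_timestamp("2024-01-31_23_59_58"))
print(sanitize("jobs?filter=a&b"))
```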

Collection Modes

Different collection modes include different subsets of data:

  • Light: Minimal collection for quick diagnostics

    • 2 days of logs, 2 days of queries.json
    • 20 job profiles (if PAT is provided), no JFR/JStack/ttop
    • 5GB minimum free space required
  • Standard: Basic logs, configs, and system info

    • 7 days of logs, 30 days of queries.json
    • 20 job profiles (if PAT is provided), includes JFR and ttop
    • 25GB minimum free space required
  • Standard+JStack: Adds thread dumps to Standard

    • 7 days of logs, 30 days of queries.json
    • 20 job profiles (if PAT is provided), includes JFR, JStack, and ttop
    • 25GB minimum free space required
  • Health Check: Comprehensive collection including performance data

    • 7 days of logs, 30 days of queries.json
    • 10,000 job profiles, includes JFR and ttop
    • 40GB minimum free space required
  • WAF (Well-Architected Framework): Subset focused on federation diagnostics

    • 3 days of logs, 28 days of queries.json
    • 25,000 job profiles, no JFR/JStack/ttop
    • 40GB minimum free space required
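The per-mode figures above can be captured as a small lookup table, which makes it easy to check free space before starting a collection. This is a sketch under stated assumptions: the dictionary keys are illustrative labels rather than DDC option values, the 20-profile entries apply only when a PAT is provided, and "GB" is interpreted as 1024^3 bytes.

```python
import shutil
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionMode:
    log_days: int
    queries_days: int
    job_profiles: int
    jfr: bool
    jstack: bool
    ttop: bool
    min_free_gb: int


# Values transcribed from the list above; keys are hypothetical labels.
MODES = {
    "light": CollectionMode(2, 2, 20, False, False, False, 5),
    "standard": CollectionMode(7, 30, 20, True, False, True, 25),
    "standard+jstack": CollectionMode(7, 30, 20, True, True, True, 25),
    "health-check": CollectionMode(7, 30, 10_000, True, False, True, 40),
    "waf": CollectionMode(3, 28, 25_000, False, False, False, 40),
}


def meets_free_space(mode_name, free_bytes):
    """True if free_bytes covers the mode's minimum (GB taken as 1024**3)."""
    return free_bytes >= MODES[mode_name].min_free_gb * 1024**3


def has_enough_space(mode_name, path="."):
    """Check the filesystem that holds `path` against the mode's minimum."""
    return meets_free_space(mode_name, shutil.disk_usage(path).free)
```

For instance, `has_enough_space("health-check", "/tmp")` reports whether the filesystem holding `/tmp` has at least 40GB free; the path is only an example.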

Size and Retention

  • Log retention: Varies by collection mode (see Collection Modes above)
    • Light: 2 days logs, 2 days queries.json
    • Standard/Standard+JStack/Health Check: 7 days logs, 30 days queries.json
    • WAF: 3 days logs, 28 days queries.json
  • Job profiles: Number collected varies by mode (20 to 25,000)
  • System table limits: Configurable row limits (default 100,000 rows)
  • Archive splitting: Large archives can be split into multiple files (default 256MB limit)
  • Compression: All data is compressed in the final tarball
  • Free space requirements: 5GB (Light) to 40GB (Health Check/WAF)
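As a rough planning aid, the number of files produced by archive splitting can be estimated from the archive size and the split limit. The helper below is plain ceiling division and assumes 1MB means 1024 × 1024 bytes; how DDC names or pads the parts is not covered here.

```python
def split_part_count(archive_bytes, limit_mb=256):
    """Estimate how many parts an archive yields at the given split limit."""
    if archive_bytes <= 0:
        return 0
    limit_bytes = limit_mb * 1024 * 1024
    # Ceiling division without floating point.
    return -(-archive_bytes // limit_bytes)


print(split_part_count(1024**3))  # a 1 GiB archive at the default limit
```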

Security Notes

  • Passwords and tokens are masked in configuration files
  • Sensitive data is redacted where possible, but complete redaction cannot be guaranteed, especially in query text
  • A PAT is required for REST API collections
  • File permissions are restricted during collection