This document describes the structure and contents of the diagnostic tarball (diag.tgz) generated by the Dremio Diagnostic Collector (DDC).
The DDC creates a compressed tarball containing diagnostic information from Dremio nodes. The tarball includes logs, configuration files, system information, performance data, and API exports organized in a structured directory hierarchy.
diag.tgz
├── summary.json # Collection summary and metadata
├── ddc.log # DDC execution log
├── <node-name>.log # Detailed collection log for each node
├── configuration/
├── logs/
├── node-info/
├── queries/
├── job-profiles/
├── system-tables/
├── cluster-stats/
├── wlm/
├── kvstore/
├── jfr/
├── thread-dumps/
├── heap-dumps/
├── ttop/
└── kubernetes/
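To check which of the directories above are actually present in a given archive, the tarball can be inspected without extracting it. The following is a minimal sketch using Python's standard `tarfile` module; the function name `top_level_counts` is hypothetical, and the code assumes the archive is a gzip-compressed tar whose members sit directly at the top level, as in the tree above.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def top_level_counts(archive_path: str) -> Counter:
    """Count archive members under each top-level entry of the tarball."""
    counts: Counter = Counter()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            # PurePosixPath drops "." components, so "./logs/x" and "logs/x"
            # both yield "logs" as the first part.
            parts = PurePosixPath(member.name).parts
            if parts:
                counts[parts[0]] += 1
    return counts
```

Calling `top_level_counts("diag.tgz")` returns a counter keyed by names such as `logs` or `summary.json`; a directory missing from the result was not collected in that run.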
- Purpose: Collection metadata and summary information
- Content: Execution details, node information, collection statistics, and any errors encountered
Configuration files from each Dremio node:
- dremio.conf - Main Dremio configuration (passwords masked)
- dremio-env - Environment configuration
- logback.xml - Logging configuration
- logback-access.xml - Access logging configuration
Log files from each node (configurable retention period):
- server.log - Main Dremio server log
- server.out - Server output log
- metadata_refresh.log - Metadata refresh operations
- reflection.log - Reflection/acceleration logs
- vacuum.json - Vacuum operation logs
- access.log - HTTP access logs (if enabled)
- audit.log - Audit logs (if enabled)
- acceleration.log - Acceleration logs (if enabled)
- server*.gc* - Garbage collection logs (pattern configurable)
- hs_err_pid*.log - JVM crash dump files
System and node information collected from various system commands:
- diskusage.txt - Disk usage information from the df -h command, showing filesystem usage, available space, and mount points to identify storage capacity issues and potential out-of-space conditions
- rocksdb_disk_allocation.txt - RocksDB disk usage from du -sh /opt/dremio/data/db/* (coordinator nodes only), showing the metadata store size breakdown for capacity planning and performance analysis
- jvm_settings.txt - JVM flags and configuration from the jps -v command, including heap sizes, garbage collector settings, and JVM options for performance tuning and memory issue diagnosis
- os_info.txt - Comprehensive operating system information from multiple sources:
  - OS Release Info: cat /etc/*-release - Linux distribution name, version, and release details for compatibility analysis and known issue identification
  - Kernel Version: uname -r - Kernel release version to identify potential kernel-related performance issues or compatibility problems
  - System Issue: cat /etc/issue - System login banner that may contain additional OS version or configuration information
  - Hostname: cat /proc/sys/kernel/hostname - System hostname for node identification and network troubleshooting
  - Memory Info: cat /proc/meminfo - Detailed memory statistics including total, free, cached, and swap memory for memory pressure analysis
  - CPU Info: lscpu - CPU architecture, core count, threading, and feature flags for performance tuning and capacity planning
  - Mount Points: mount - Active filesystem mounts to identify storage configuration, mount options, and potential I/O bottlenecks
  - Block Devices: lsblk - Block device hierarchy showing disks, partitions, and mount relationships for storage troubleshooting
  - Cgroup Type: stat -fc %T /sys/fs/cgroup/ - Cgroup filesystem type (v1 vs v2) to understand container resource management capabilities
  - Cgroup v2 Info (if available) - Container resource usage and pressure metrics for containerized deployments:
    - cat /sys/fs/cgroup/memory.current - Current memory usage within container limits
    - cat /sys/fs/cgroup/memory.swap.current - Current swap usage within container limits
    - cat /sys/fs/cgroup/memory.pressure - Memory pressure stall information indicating memory contention
    - cat /sys/fs/cgroup/cpu.pressure - CPU pressure stall information indicating CPU contention
    - cat /sys/fs/cgroup/io.pressure - I/O pressure stall information indicating storage bottlenecks
  - Load Average: cat /proc/loadavg - System load averages (1, 5, 15 minute) and running/total processes for performance assessment
  - Process Environment: ps eww <pid> - Dremio process environment variables for configuration validation and troubleshooting startup issues
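The cgroup v2 pressure files listed above use the kernel's pressure stall information (PSI) text format, with lines such as `some avg10=0.00 avg60=0.00 avg300=0.00 total=0`. As a rough sketch, the output of one such file can be parsed as shown below. How os_info.txt delimits each command's output is not described here, so the function (a hypothetical name, `parse_pressure`) takes the text of a single pressure file as input.

```python
def parse_pressure(text: str) -> dict[str, dict[str, float]]:
    """Parse PSI output into {"some": {...}, "full": {...}}."""
    result: dict[str, dict[str, float]] = {}
    for line in text.splitlines():
        fields = line.split()
        # Only lines starting with "some" or "full" carry PSI metrics.
        if not fields or fields[0] not in ("some", "full"):
            continue
        metrics = {}
        for field in fields[1:]:
            if "=" in field:
                key, value = field.split("=", 1)
                metrics[key] = float(value)
        result[fields[0]] = metrics
    return result
```

A non-zero `avg10` or `avg60` on the `some` line means that at least one task was stalled on that resource during the last 10 or 60 seconds.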
Query execution logs, which contain the query text, some performance information, the selected queue, query planning details, and more:
- queries.json - Current queries.json file
- queries.json.* - Archived/rotated queries.json files
- Files contain query information for query analysis
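For a quick overview of a queries.json file, records can be tallied by one of their fields. This sketch assumes the file holds one JSON object per line; field names such as `outcome` or `queueName` are assumptions and should be checked against the actual records. The helper name `count_by_field` is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_field(lines: Iterable[str], field: str) -> Counter:
    """Tally newline-delimited JSON records by the value of one field."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts["<unparseable>"] += 1
            continue
        if not isinstance(record, dict):
            counts["<unparseable>"] += 1
            continue
        counts[str(record.get(field, "<missing>"))] += 1
    return counts
```

Passing an open file handle works because file objects iterate line by line, for example `count_by_field(open("queries.json"), "outcome")`.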
Job profile exports (REST API collections):
- JSON files containing detailed job execution profiles
- Organized by query performance categories (slow execution, high cost, etc.)
- File naming pattern: job-profile-<job-id>.json
Exported Dremio system tables (requires a personal access token, or PAT):
Standard Tables:
- sys.tables_offset_0_limit_<n>.json - Table metadata
- sys.jobs_offset_0_limit_<n>.json - Job information
- sys.nodes_offset_0_limit_<n>.json - Cluster node information
- sys.options_offset_0_limit_<n>.json - System options/settings
- sys.memory_offset_0_limit_<n>.json - Memory usage
- sys.threads_offset_0_limit_<n>.json - Thread information
- sys.fragments_offset_0_limit_<n>.json - Query fragment details
- sys.materializations_offset_0_limit_<n>.json - Materialization info
- sys.reflections_offset_0_limit_<n>.json - Reflection definitions
- sys.refreshes_offset_0_limit_<n>.json - Refresh operations
- sys.services_offset_0_limit_<n>.json - Service information
- sys.version_offset_0_limit_<n>.json - Version information
- sys.cache.*_offset_0_limit_<n>.json - Cache-related tables
Enterprise Tables (if available):
- sys.roles_offset_0_limit_<n>.json - Role definitions
- sys.membership_offset_0_limit_<n>.json - User/role membership
- sys.privileges_offset_0_limit_<n>.json - Permission information
Cloud Tables (Dremio Cloud only):
- organization.*_offset_0_limit_<n>.json - Organization-level data
- project.*_offset_0_limit_<n>.json - Project-level data
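The table name, offset, and row limit can be recovered from these file names with a regular expression. The sketch below follows the `<table>_offset_<n>_limit_<n>.json` shape shown in the lists above; the function name `parse_system_table_filename` is hypothetical.

```python
import re
from typing import Optional

SYSTEM_TABLE_FILE = re.compile(
    r"^(?P<table>.+)_offset_(?P<offset>\d+)_limit_(?P<limit>\d+)\.json$"
)


def parse_system_table_filename(name: str) -> Optional[tuple[str, int, int]]:
    """Return (table, offset, limit), or None if the name does not match."""
    match = SYSTEM_TABLE_FILE.match(name)
    if match is None:
        return None
    return match["table"], int(match["offset"]), int(match["limit"])
```

For instance, `sys.jobs_offset_0_limit_100000.json` yields the table `sys.jobs` with offset 0 and limit 100000.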
Cluster statistics and metrics:
- JSON files with cluster performance and usage statistics
- Collected via REST API calls
Workload Manager information:
- WLM configuration and statistics
- Queue information and resource allocation data
KV Store reports:
- Key-value store health and statistics
- RocksDB information and metrics
Java Flight Recorder profiles:
- <node-name>.jfr - JFR recording files
- Contains detailed JVM performance data
- Configurable recording duration (default: 60 seconds)
Java thread dumps:
- threadDump-<node-name>-<timestamp>.txt - Thread dump files
- Multiple dumps collected at configurable intervals
- Contains stack traces of all JVM threads
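A common first step with a thread dump is to count threads by state. The sketch below assumes the dumps use the standard HotSpot jstack layout, in which each thread's stack is preceded by a line such as `java.lang.Thread.State: RUNNABLE`; the function name `thread_state_counts` is hypothetical.

```python
import re
from collections import Counter

# Captures the state name only, ignoring qualifiers such as "(parking)".
THREAD_STATE = re.compile(r"java\.lang\.Thread\.State:\s+([A-Z_]+)")


def thread_state_counts(dump_text: str) -> Counter:
    """Count threads per java.lang.Thread.State in one thread dump."""
    return Counter(THREAD_STATE.findall(dump_text))
```

Comparing the counts across the dumps collected at successive intervals can show, for example, a growing number of BLOCKED threads.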
Java heap dumps (if enabled):
- <node-name>.hprof.gz - Compressed heap dump files
- Large files containing complete JVM heap state
- Only collected when explicitly enabled
Thread-level CPU usage:
- Thread-level performance monitoring data
- Collected at configurable intervals
- Shows CPU usage per thread over time
Kubernetes-specific information collected via Kubernetes API:
Resource Definitions (JSON format):
- nodes.json - Cluster node information and status
- pods.json - Pod definitions, status, and metadata
- service.json - Service configurations and endpoints
- endpoints.json - Service endpoint details
- deployments.json - Deployment configurations
- statefulsets.json - StatefulSet definitions
- daemonset.json - DaemonSet configurations
- replicaset.json - ReplicaSet definitions
- cronjob.json - CronJob schedules and configurations
- job.json - Job definitions and status
- events.json - Kubernetes cluster events
- ingress.json - Ingress controller configurations
- limitrange.json - Resource limit ranges
- resourcequota.json - Resource quota definitions
- hpa.json - Horizontal Pod Autoscaler configurations
- pdb.json - Pod Disruption Budget settings
- pc.json - Priority Class definitions
- pv.json - Persistent Volume definitions
- pvc.json - Persistent Volume Claim configurations
- sc.json - Storage Class definitions
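As an example of working with these files, the sketch below lists pods that are not in a healthy phase. It assumes pods.json holds a standard Kubernetes list object, with pods under `items` and each pod exposing `metadata.name` and `status.phase`; the function name `unhealthy_pods` is hypothetical.

```python
from typing import Any

HEALTHY_PHASES = {"Running", "Succeeded"}


def unhealthy_pods(pod_list: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (pod name, phase) for every pod outside the healthy phases."""
    result = []
    for pod in pod_list.get("items", []):
        name = pod.get("metadata", {}).get("name", "<unnamed>")
        phase = pod.get("status", {}).get("phase", "Unknown")
        if phase not in HEALTHY_PHASES:
            result.append((name, phase))
    return result
```

The input would typically come from `json.load` on the extracted kubernetes/pods.json file.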
Container Logs:
- container-logs/<pod-name>-<container-name>.txt - Current container logs
- container-logs/<pod-name>-<container-name>-previous.txt - Previous container logs (if available)
Examples of container log files:
- dremio-master-0-dremio-master-coordinator.txt - Coordinator container logs
- dremio-executor-0-dremio-executor.txt - Executor container logs
- dremio-executor-0-wait-for-zookeeper.txt - Init container logs
- zk-0-kubernetes-zookeeper.txt - ZooKeeper container logs
- Node-specific files: Include node name in path or filename
- Timestamped files: Use format YYYY-MM-DD_HH_MM_SS
- System tables: Pattern sys.<table-name>_offset_<n>_limit_<n>.json
- Log archives: Original name with timestamp or sequence number
- Special characters: Replaced with underscores in filenames (?, =, &)
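Two of these conventions translate directly into code: the timestamp format and the special-character replacement. The following sketch illustrates both under the rules listed above. The function names are hypothetical, and this is not taken from the DDC source.

```python
import re
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d_%H_%M_%S"  # YYYY-MM-DD_HH_MM_SS
SPECIAL_CHARS = re.compile(r"[?=&]")


def parse_ddc_timestamp(value: str) -> datetime:
    """Parse a timestamp written in the YYYY-MM-DD_HH_MM_SS format."""
    return datetime.strptime(value, TIMESTAMP_FORMAT)


def sanitize_filename(name: str) -> str:
    """Replace each of ?, = and & with an underscore."""
    return SPECIAL_CHARS.sub("_", name)
```

For example, `sanitize_filename("a?b=c&d")` produces `a_b_c_d`.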
Different collection modes include different subsets of data:
- Light: Minimal collection for quick diagnostics
  - 2 days of logs, 2 days of queries.json
  - 20 job profiles (if PAT is provided), no JFR/JStack/ttop
  - 5GB minimum free space required
- Standard: Basic logs, configs, and system info
  - 7 days of logs, 30 days of queries.json
  - 20 job profiles (if PAT is provided), includes JFR and ttop
  - 25GB minimum free space required
- Standard+JStack: Adds thread dumps to Standard
  - 7 days of logs, 30 days of queries.json
  - 20 job profiles (if PAT is provided), includes JFR, JStack, and ttop
  - 25GB minimum free space required
- Health Check: Comprehensive collection including performance data
  - 7 days of logs, 30 days of queries.json
  - 10,000 job profiles, includes JFR and ttop
  - 40GB minimum free space required
- WAF (Well Architected Framework): Subset focused on federation diagnostics
  - 3 days of logs, 28 days of queries.json
  - 25,000 job profiles, no JFR/JStack/ttop
  - 40GB minimum free space required
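The mode list above can be restated as a small lookup table, which makes it easy to check whether a host has enough free space before a collection starts. This is an illustrative sketch only: the dictionary keys and helper names are hypothetical, "GB" is treated as GiB, and JStack is assumed to be off for Health Check because the list does not mention it.

```python
import shutil
from dataclasses import dataclass

GIB = 1024 ** 3  # assumption: "GB" in the mode list is treated as GiB


@dataclass(frozen=True)
class CollectionMode:
    log_days: int
    queries_days: int
    job_profiles: int
    jfr: bool
    jstack: bool
    ttop: bool
    min_free_gb: int


# Values copied from the mode list; the keys are hypothetical labels.
MODES = {
    "light": CollectionMode(2, 2, 20, False, False, False, 5),
    "standard": CollectionMode(7, 30, 20, True, False, True, 25),
    "standard+jstack": CollectionMode(7, 30, 20, True, True, True, 25),
    "health-check": CollectionMode(7, 30, 10_000, True, False, True, 40),
    "waf": CollectionMode(3, 28, 25_000, False, False, False, 40),
}


def has_enough_space(mode_name: str, free_bytes: int) -> bool:
    """Check free space against the minimum required by the given mode."""
    return free_bytes >= MODES[mode_name].min_free_gb * GIB


def free_bytes_at(path: str) -> int:
    """Free bytes on the filesystem that contains path."""
    return shutil.disk_usage(path).free
```

A pre-flight check could then read `has_enough_space("standard", free_bytes_at("/tmp"))`, where the path is whichever directory the collection writes to.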
- Log retention: Varies by collection mode (see Collection Modes above)
  - Light: 2 days logs, 2 days queries.json
  - Standard/Standard+JStack/Health Check: 7 days logs, 30 days queries.json
  - WAF: 3 days logs, 28 days queries.json
- Job profiles: Number collected varies by mode (20 to 25,000)
- System table limits: Configurable row limits (default 100,000 rows)
- Archive splitting: Large archives can be split into multiple files (default 256MB limit)
- Compression: All data is compressed in the final tarball
- Free space requirements: 5GB (Light) to 40GB (Health Check/WAF)
- Passwords and tokens are masked in configuration files
- Sensitive data is redacted where possible, but redaction cannot be guaranteed, especially in query text
- A PAT is required for REST API collections
- File permissions are restricted during collection
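To illustrate what masking a configuration file can look like, here is a minimal sketch that blanks the value of any key whose name contains "password", "secret", or "token". It is not the DDC's actual masking logic: the key pattern, the `"<REMOVED>"` placeholder, and the function names are all assumptions made for this example.

```python
import re

# Hypothetical pattern: a key containing password/secret/token, then ":" or "=".
SENSITIVE_KEY = re.compile(
    r'^(\s*[\w.\-"]*(?:password|secret|token)[\w.\-"]*\s*[:=]\s*)(.+)$',
    re.IGNORECASE,
)


def mask_line(line: str) -> str:
    """Replace the value on a sensitive-looking line; leave other lines as-is."""
    match = SENSITIVE_KEY.match(line)
    if match is None:
        return line
    return f'{match.group(1)}"<REMOVED>"'


def mask_text(text: str) -> str:
    """Mask every line of a configuration text (trailing newline is dropped)."""
    return "\n".join(mask_line(line) for line in text.splitlines())
```

A key-name heuristic like this cannot catch secrets embedded in free text such as SQL, which is consistent with the caveat above about query text.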