[Feature request] OpenLineage query_plan and query_statistics should be in JSON format #25567

Open
@dolfinus

Description

I executed a very simple query with the OpenLineage integration enabled:

use pg.public;
insert into cde(id,value) select id,value from abc;

This produced the following OpenLineage events:
ol_query_start.json
ol_query_complete.json

The resulting events contain the facets trino_metadata and trino_query_statistics, but a few issues prevent them from being used:

  • trino_metadata.query_plan is a text representation of the plan. For automatic plan inspection, it would be better to also include the same plan in JSON format, as another field in the same facet.
  • The trino_query_statistics facet contains a lot of fields, and all of them have string values, even when the underlying value is actually an integer or a floating-point number:
    private final long peakUserMemoryBytes;
    private final long peakTaskUserMemory;
    private final long peakTaskTotalMemory;
    private final long physicalInputBytes;
    private final long physicalInputRows;
    private final long processedInputBytes;
    private final long processedInputRows;
    private final long internalNetworkBytes;
    private final long internalNetworkRows;
    private final long totalBytes;
    private final long totalRows;
    private final long outputBytes;
    private final long outputRows;
    private final long writtenBytes;
    private final long writtenRows;
    private final long spilledBytes;
{
    "processedInputRows": "2",
    "physicalInputRows": "2",
    "processedInputBytes": "22",
    "analysisTime": "0.039000000",
    "internalNetworkRows": "4",
    "completedSplits": "11",
    "spilledBytes": "0",
    "outputBlockedTime": "0.0",
    "peakTaskTotalMemory": "288",
    "physicalInputBytes": "0"
}
  • Some fields in the trino_query_statistics facet hold nested values that are objects in the Java QueryStatistics class. Instead of JSON, they are returned as the string representation of the Java object:
    private final List<StageCpuDistribution> cpuTimeDistribution;
    private final List<StageOutputBufferUtilization> outputBufferUtilization;
    private final List<StageTaskStatistics> taskStatistics;
    private final List<StageOutputBufferMetrics> outputBufferMetrics;
{
    "cpuTimeDistribution": "[{stageId=0, tasks=1, p25=15, p50=15, p75=15, p90=15, p95=15, p99=15, min=15, max=15, total=15, average=15.0}, {stageId=1, tasks=1, p25=6, p50=6, p75=6, p90=6, p95=6, p99=6, min=6, max=6, total=6, average=6.0}, {stageId=2, tasks=1, p25=3, p50=3, p75=3, p90=3, p95=3, p99=3, min=3, max=3, total=3, average=3.0}]",
    "stageGcStatistics": "[{stageId=0, tasks=1, fullGcTasks=0, minFullGcSec=0, maxFullGcSec=0, totalFullGcSec=0, averageFullGcSec=0}, {stageId=1, tasks=1, fullGcTasks=0, minFullGcSec=0, maxFullGcSec=0, totalFullGcSec=0, averageFullGcSec=0}, {stageId=2, tasks=1, fullGcTasks=0, minFullGcSec=0, maxFullGcSec=0, totalFullGcSec=0, averageFullGcSec=0}]",
    "outputBufferUtilization": "[{stageId=0, tasks=1, p01=0.0, p05=0.0, p10=0.0, p25=0.0, p50=0.0, p75=2.30083448689341E-4, p90=3.0994415283203125E-4, p95=3.0994415283203125E-4, p99=3.0994415283203125E-4, min=0.0, max=3.0994415283203125E-4, duration=0.010681886}, {stageId=1, tasks=1, p01=0.0, p05=0.0, p10=0.0, p25=0.0, p50=6.22467014518136E-5, p75=4.163024838600526E-4, p90=5.006790161132812E-4, p95=5.006790161132812E-4, p99=5.006790161132812E-4, min=0.0, max=5.006790161132812E-4, duration=0.013562824}, {stageId=2, tasks=1, p01=0.0, p05=0.0, p10=0.0, p25=0.0, p50=1.5597716669174048E-4, p75=5.968228705004009E-4, p90=6.318092346191406E-4, p95=6.318092346191406E-4, p99=6.318092346191406E-4, min=0.0, max=6.318092346191406E-4, duration=0.018201713}]",
}

This may be acceptable for a human reader, but it is not suitable for processing by software, because it requires writing custom parsers.
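To illustrate the cost, here is a minimal sketch of the kind of custom parser a consumer currently needs just to read a value such as cpuTimeDistribution. The class name StageStatsStringParser is hypothetical and is not part of Trino or OpenLineage. It assumes flat objects whose keys and values never contain '{', '}', ',' or '=', which holds for the samples above but is not guaranteed by any contract.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical consumer-side helper; not part of Trino or OpenLineage.
final class StageStatsStringParser {
    private StageStatsStringParser() {}

    // Parses text shaped like "[{k=v, k=v}, {k=v}]" into a list of ordered maps.
    // Assumption: objects are flat, and keys/values contain no '{', '}', ',' or '='.
    static List<Map<String, String>> parse(String text) {
        String body = text.trim();
        if (!body.startsWith("[") || !body.endsWith("]")) {
            throw new IllegalArgumentException("expected a bracketed list: " + text);
        }
        List<Map<String, String>> rows = new ArrayList<>();
        int pos = 0;
        while (true) {
            int open = body.indexOf('{', pos);
            if (open < 0) {
                break; // no more objects
            }
            int close = body.indexOf('}', open);
            if (close < 0) {
                throw new IllegalArgumentException("unbalanced braces: " + text);
            }
            Map<String, String> row = new LinkedHashMap<>();
            String inner = body.substring(open + 1, close).trim();
            if (!inner.isEmpty()) {
                for (String pair : inner.split(",")) {
                    int eq = pair.indexOf('=');
                    if (eq < 0) {
                        throw new IllegalArgumentException("expected key=value: " + pair);
                    }
                    // Values stay strings; the caller still has to guess each type.
                    row.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
                }
            }
            rows.add(row);
            pos = close + 1;
        }
        return rows;
    }
}
```

Even with this helper, every value is still a string, so the consumer has to know out of band which fields are integers and which are floating-point numbers. The parser also breaks as soon as a nested object or a value containing a comma appears.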

  • Some fields in trino_query_statistics are JSON-serialized strings embedded inside the OpenLineage event JSON:
    /**
    * Operator summaries serialized to JSON. Serialization format and structure
    * can change without preserving backward compatibility.
    */
    private final Supplier<List<String>> operatorSummariesProvider;
    private final List<QueryPlanOptimizerStatistics> optimizerRulesSummaries;
    /**
    * Plan node stats and costs serialized to JSON. Serialization format and structure
    * can change without preserving backward compatibility.
    */
    private final Optional<String> planNodeStatsAndCosts;

    Instead, these could be nested JSON objects.
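As a rough illustration of the requested output shape, the sketch below serializes a map of statistics while keeping each value's type, so numbers are emitted unquoted and nested lists and maps become nested JSON. The class TypedFacetWriter is hypothetical and uses only the Java standard library. It is not how the Trino OpenLineage listener is implemented, and a real fix would more likely rely on the JSON library the plugin already uses.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not Trino's or OpenLineage's actual serialization code.
final class TypedFacetWriter {
    private TypedFacetWriter() {}

    static String toJson(Object value) {
        StringBuilder out = new StringBuilder();
        write(value, out);
        return out.toString();
    }

    private static void write(Object value, StringBuilder out) {
        if (value == null) {
            out.append("null");
        } else if ((value instanceof Double || value instanceof Float)
                && !Double.isFinite(((Number) value).doubleValue())) {
            out.append("null"); // JSON has no representation for NaN or Infinity
        } else if (value instanceof Number || value instanceof Boolean) {
            out.append(value); // numbers and booleans are emitted unquoted
        } else if (value instanceof Map) {
            out.append('{');
            boolean first = true;
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                if (!first) {
                    out.append(',');
                }
                first = false;
                quote(String.valueOf(entry.getKey()), out);
                out.append(':');
                write(entry.getValue(), out);
            }
            out.append('}');
        } else if (value instanceof List) {
            out.append('[');
            boolean first = true;
            for (Object item : (List<?>) value) {
                if (!first) {
                    out.append(',');
                }
                first = false;
                write(item, out);
            }
            out.append(']');
        } else {
            quote(value.toString(), out); // anything else falls back to a JSON string
        }
    }

    // Writes a JSON string literal, escaping quotes, backslashes and control characters.
    private static void quote(String s, StringBuilder out) {
        out.append('"');
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                out.append('\\').append(c);
            } else if (c < 0x20) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        out.append('"');
    }
}
```

With this approach, a Long value of 2 stored under physicalInputRows is written as "physicalInputRows":2 rather than "physicalInputRows":"2", and a list of per-stage maps is written as a JSON array of objects instead of a single string.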
