[FEATURE] Impl Spark DSv2 YARN Connector that supports reading YARN aggregation logs #6832

Open
@pan3793

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the feature

Leverage the Spark DSv2 API to implement a connector that provides a SQL interface for accessing YARN aggregated logs, and possibly other YARN resources in the future.

Motivation

In large-scale Spark on YARN deployments, tens or even hundreds of thousands of Spark applications may be submitted to a cluster per day. YARN collects and aggregates the application logs and stores them on HDFS. Sometimes we want to analyze these logs to identify cluster-level issues; for example, a machine with hardware problems may frequently produce disk or network exceptions. It is straightforward to leverage Spark to analyze those logs in parallel.

Describe the solution

The usage might look like this:

$ spark-sql --conf spark.sql.catalog.yarn=org.apache.kyuubi.spark.connector.yarn.YarnCatalog
> SELECT
    app_id, app_attempt_id,
    app_start_time, app_end_time,
    container_id, host,
    file_name, line_num, message
  FROM yarn.agg_logs
  WHERE app_id = 'application_1234'
    AND container_id = 'container_12345'
    AND host = 'hadoop123.example.com'
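
The cluster-level analysis described in the motivation could be expressed against the same table. The following is only a sketch: it assumes the proposed `yarn.agg_logs` table and its `host`, `app_id`, and `message` columns end up as outlined in the query above, and the `'%IOException%'` pattern is a hypothetical stand-in for whatever disk or network exception text is of interest.

```sql
-- Sketch only: yarn.agg_logs and its columns are the proposed (not yet implemented) schema.
-- The LIKE pattern is a hypothetical example of a disk/network exception marker.
SELECT
  host,
  count(DISTINCT app_id) AS affected_apps,   -- how many applications hit the error on this host
  count(*)               AS matching_lines   -- how many log lines matched
FROM yarn.agg_logs
WHERE message LIKE '%IOException%'
GROUP BY host
ORDER BY matching_lines DESC
LIMIT 20
```

A host that ranks far above the others for both counts would be a candidate for a hardware check. Without a filter on `app_id` or a time range, a query like this would presumably scan every aggregated log file, so how much filtering the connector can push down is likely to matter in practice.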

Additional context

No response

Are you willing to submit PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
  • No. I cannot submit a PR at this time.
