[FEATURE] Filtering only active EMR cluster #1095

Open
@KimMJ

Is there an existing issue for this?

  • I have searched the existing issues

Feature description

Our team uses EMR heavily, and clusters are created and terminated frequently. By our count, about 4,000 EMR clusters are created per day, and all of them terminate within a short time. If EMR cluster metadata is retained for 60 days, that amounts to about 240,000 cluster records, yet only about 200 clusters are actually active at any given time.

Here's the problem: to discover EMR dimensions, YACE calls the GetResources API with a resource-type-filters value of elasticmapreduce:cluster. Enumerating every cluster this way takes a large number of GetResources calls, the calls get throttled, and the scrape eventually fails to finish collecting all the metrics.
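A rough back-of-the-envelope estimate of the discovery cost, using the numbers from this issue. The page size of 100 is an assumption (the largest page size GetResources allows, as far as I know), so the real call count may be higher.

```python
import math

CLUSTERS_PER_DAY = 4000    # from the description above
RETENTION_DAYS = 60        # from the description above
ACTIVE_CLUSTERS = 200      # from the description above
RESOURCES_PER_PAGE = 100   # assumption: largest GetResources page size

# Every retained cluster is returned by the resource-type-only filter.
total_clusters = CLUSTERS_PER_DAY * RETENTION_DAYS

# Lower bound on GetResources calls needed for one full discovery pass.
get_resources_calls = math.ceil(total_clusters / RESOURCES_PER_PAGE)

# Share of the discovered clusters that are actually worth scraping.
active_fraction = ACTIVE_CLUSTERS / total_clusters

print(f"clusters returned by discovery: {total_clusters}")
print(f"GetResources calls per pass (at least): {get_resources_calls}")
print(f"share of clusters that are active: {active_fraction:.4%}")
```

Under these assumptions, almost all of the discovery calls are spent on clusters that no longer emit metrics.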

What I want is to collect metrics for active clusters only, not for terminated ones. Unfortunately, the current YACE doesn't seem to account for this case. I checked the code to see whether I could filter for active clusters only, but the GetResources call is made with only the resource type filter applied.

So if EMR dimension discovery were changed to get the jobFlowId from the EMR API's list-clusters with the --active option turned on, I think that would go a long way toward reducing the number of API calls as well.
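To illustrate the idea, here is a minimal sketch in Python (not YACE code, which is written in Go). It assumes a boto3-style EMR client whose list_clusters accepts ClusterStates and Marker and returns Clusters plus an optional Marker. The function name iter_active_cluster_ids is hypothetical, and the state list is what I understand the CLI's --active shortcut to expand to.

```python
from typing import Any, Dict, Iterator, List

# Assumption: the states that `aws emr list-clusters --active` selects.
ACTIVE_STATES: List[str] = [
    "STARTING",
    "BOOTSTRAPPING",
    "RUNNING",
    "WAITING",
    "TERMINATING",
]


def iter_active_cluster_ids(emr_client: Any) -> Iterator[str]:
    """Yield the IDs of active EMR clusters, following pagination markers.

    `emr_client` is expected to behave like boto3.client("emr").
    """
    marker = None
    while True:
        kwargs: Dict[str, Any] = {"ClusterStates": ACTIVE_STATES}
        if marker:
            # Continue from where the previous page stopped.
            kwargs["Marker"] = marker
        page = emr_client.list_clusters(**kwargs)
        for cluster in page.get("Clusters", []):
            # The cluster ID is the value used for the JobFlowId dimension.
            yield cluster["Id"]
        marker = page.get("Marker")
        if not marker:
            break
```

With boto3 installed, this could be called as list(iter_active_cluster_ids(boto3.client("emr"))). With roughly 200 active clusters, the loop should need only a handful of list-clusters calls, compared with the thousands of GetResources calls needed to page through every retained cluster.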

What might the configuration look like?

No response

Anything else?

No response

Labels: enhancement (New feature or request)