Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat performance monitoring #189

Open
wants to merge 33 commits into
base: epic_db_query_performance_monitoring
Choose a base branch
from

Conversation

rahulreddy15
Copy link
Contributor

@rahulreddy15 rahulreddy15 commented Jan 2, 2025

PR to merge performance monitoring code from the fork to the branch epic_db_query_performance_monitoring

@CLAassistant
Copy link

CLAassistant commented Jan 2, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
3 out of 5 committers have signed the CLA.

✅ sjyothi54
✅ rahulreddy15
✅ tharun0064
❌ renovate[bot]
❌ newrelic-coreint-bot
You have signed the CLA already but the status is still pending? Let us recheck it.

@tharun0064 tharun0064 force-pushed the feat_performance_monitoring branch 3 times, most recently from a3610b4 to a920d40 Compare January 8, 2025 06:47
@rahulreddy15 rahulreddy15 changed the base branch from master to epic_db_query_performance_monitoring January 8, 2025 08:35
@rahulreddy15 rahulreddy15 marked this pull request as ready for review January 9, 2025 04:18
@rahulreddy15 rahulreddy15 requested a review from a team as a code owner January 9, 2025 04:18
Copy link
Contributor

@rajrohanyadav rajrohanyadav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I only reviewed the new code, and would not have 100% confidence on any impacts on the older code. Therefore, additional reviews would be required.

  • As pointed by the linter, some files are not formatted correctly.
  • It is difficult to assert and understand the logic without any comments or unit tests. Can we please add unit tests for all logical functions and utils. Please add any comments helpful to understand what's going on :).
  • I also see some code duplication as pointed by the linter as well. We should check if that can be removed.
  • If we can, I think it can be refactored to only check for version once as the integration is running for a single DBMS, and build our struct based on that, which can still have the same functions for getting query metrics. But use different adapters based on the version. This could remove some duplication and improve readability.
  • It would be helpful if the PR description is updated with some more details and testing details.

Thanks

go.mod Outdated Show resolved Hide resolved
src/args/argument_list.go Outdated Show resolved Hide resolved
src/args/argument_list.go Outdated Show resolved Hide resolved
src/main.go Show resolved Hide resolved
Copy link

@tharun0064 tharun0064 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rajrohanyadav
Thanks for reviewing the PR
can you mention the comments with criticality as P0 ,P1 etc
we will work on them accordingly

@rajrohanyadav
Copy link
Contributor

@rajrohanyadav Thanks for reviewing the PR can you mention the comments with criticality as P0 ,P1 etc we will work on them accordingly

P0 - Unit tests
P1 - Everything else (that we can accomplish) :)

Copy link

@tharun0064 tharun0064 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rajrohanyadav
Added unit test-cases

src/main.go Outdated Show resolved Hide resolved
src/main.go Outdated Show resolved Hide resolved
Comment on lines 5 to 7
const MaxQueryThreshold = 30
const MaxIndividualQueryThreshold = 10
const PublishThreshold = 100
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add comments explaining why this value were chosen and how they are used

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was max metrics we would fetch respectively
This was discussed with PMs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though you discussed it with PMs it would be helpful for the future us to know why a value was chosen.
Moreover, we should add comments to constants explaining what they do

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding comment explaining the constants.

Can you also add a NOTE explaining WHY we have choose those defaults and the assumptions made.
It would be helpful in the future to know why a value was chosen.

return individualQueryMetricsListInterface, individualQueryMetricsForExecPlanList
}

func getIndividualQueriesSamples(conn *performancedbconnection.PGSQLConnection, slowRunningQueries datamodels.SlowRunningQueryMetrics, args args.ArgumentList, databaseNames string, anonymizedQueriesByDB map[string]map[string]string, individualQueryMetricsForExecPlanList *[]datamodels.IndividualQueryMetrics, individualQueryMetricsListInterface *[]interface{}, versionSpecificIndividualQuery string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be refactored, there are 8 args, it is not clear what it does

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

usually it is easier to return a copy instead of modifying it via pointer, it is not clear by looking at the signature that we are modifying something

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if i return a list
i have to merge the list back to another list in caller - adding extra complexity

and if i pass it by pointer
it will add directly to list which is in caller

regards the args - i would like to discuss something - let me know when you be free to connect

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i hope this was resolved

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From yesterdays discussion with Paolo, instead of passing pointers and modifying it in the func, we would return the required lists, This would make code readability easier and also reduce number of parameters the func needs.

@rahulreddy15 rahulreddy15 force-pushed the feat_performance_monitoring branch from 46847de to 8509bcd Compare January 16, 2025 13:56
return individualQueryMetricsListInterface, individualQueryMetricsForExecPlanList
}

func getIndividualQueriesSamples(conn *performancedbconnection.PGSQLConnection, slowRunningQueries datamodels.SlowRunningQueryMetrics, gv *globalvariables.GlobalVariables, anonymizedQueriesByDB map[string]map[string]string, individualQueryMetricsForExecPlanList *[]datamodels.IndividualQueryMetrics, individualQueryMetricsListInterface *[]interface{}, versionSpecificIndividualQuery string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the issue with this signature is that the args are inputs and output of the function, moreover we could think of grouping the results together and maybe build the query outside, by doing so we could get something like:

query := fmt.Sprintf(...)

results := getIndividualQueriesSamples(conn, query, anonymizedQueriesByDB)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the query changes as per the queryId
so we have to structure the query inside the function rather than in caller

yes the args contain the input and output to remove the extra complexity of adding a list to another list

Copy link

@sairaj18 sairaj18 Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the query changes as per the queryId
so we have to structure the query inside the function rather than in caller

Even if the query changes as per the queryid. We can build it in the caller function right?

yes the args contain the input and output to remove the extra complexity of adding a list to another list

Refer this comment. for reducing the number of parameters the func needs

Why are we passing this parameter slowRunningQueries datamodels.SlowRunningQueryMetrics to the func if we only want *slowRunningQueries.QueryID?

Comment on lines 41 to 65
log.Debug("Starting PopulateSlowRunningMetrics at ", start)
slowRunningQueries := performancemetrics.PopulateSlowRunningMetrics(newConnection, pgIntegration, gv)
log.Debug("PopulateSlowRunningMetrics completed in ", time.Since(start))

start = time.Now()
log.Debug("Starting PopulateWaitEventMetrics at ", start)
_ = performancemetrics.PopulateWaitEventMetrics(newConnection, pgIntegration, gv)
log.Debug("PopulateWaitEventMetrics completed in ", time.Since(start))

start = time.Now()
log.Debug("Starting PopulateBlockingMetrics at ", start)
performancemetrics.PopulateBlockingMetrics(newConnection, pgIntegration, gv)
log.Debug("PopulateBlockingMetrics completed in ", time.Since(start))

start = time.Now()
log.Debug("Starting PopulateIndividualQueryMetrics at ", start)
individualQueries := performancemetrics.PopulateIndividualQueryMetrics(newConnection, slowRunningQueries, pgIntegration, gv)
log.Debug("PopulateIndividualQueryMetrics completed in ", time.Since(start))

start = time.Now()
log.Debug("Starting PopulateExecutionPlanMetrics at ", start)
performancemetrics.PopulateExecutionPlanMetrics(individualQueries, pgIntegration, gv)
log.Debug("PopulateExecutionPlanMetrics completed in ", time.Since(start))

log.Debug("Query analysis completed.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would extract all of this to a different function to hide the complexity from this sub_main

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made that change

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if this is resolved

tharun0064 and others added 9 commits January 27, 2025 15:13
* resolved : review comments

---------

Co-authored-by: Jyothi Surampudi <[email protected]>
Co-authored-by: jsurampudi <[email protected]>
Co-authored-by: Rahul Reddy <[email protected]>
 fix: lint issues

---------

Co-authored-by: Jyothi Surampudi <[email protected]>
Co-authored-by: jsurampudi <[email protected]>
Co-authored-by: Rahul Reddy <[email protected]>

* added more scenerios for unit test cases

---------

Co-authored-by: Jyothi Surampudi <[email protected]>
Co-authored-by: jsurampudi <[email protected]>
Co-authored-by: Rahul Reddy <[email protected]>
* rename variable

---------

Co-authored-by: Jyothi Surampudi <[email protected]>
Co-authored-by: jsurampudi <[email protected]>
Co-authored-by: Rahul Reddy <[email protected]>
 resolved : review comments
---------

Co-authored-by: Jyothi Surampudi <[email protected]>
Co-authored-by: jsurampudi <[email protected]>
Co-authored-by: Rahul Reddy <[email protected]>
* remove unused variables

---------

Co-authored-by: Jyothi Surampudi <[email protected]>
Co-authored-by: jsurampudi <[email protected]>
Co-authored-by: Rahul Reddy <[email protected]>
* resolved: review comments

---------

Co-authored-by: Jyothi Surampudi <[email protected]>
Co-authored-by: jsurampudi <[email protected]>
Co-authored-by: Rahul Reddy <[email protected]>
Feat: resolved review comments
* add check while fetching execution plan

* resolved : lint issues

---------

Co-authored-by: Jyothi Surampudi <[email protected]>
Co-authored-by: jsurampudi <[email protected]>
Co-authored-by: Rahul Reddy <[email protected]>
log.Error("Unsupported postgres version: %v", err)
return nil, nil, err
}
var query = fmt.Sprintf(versionSpecificSlowQuery, gv.DatabaseString, min(gv.Arguments.QueryCountThreshold, commonutils.MaxQueryCountThreshold))
Copy link

@sairaj18 sairaj18 Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If gv.Arguments.QueryCountThreshold is configured with a -ve value, We need to log a warning saying QueryCountThreshold cannot be -ve and use the DefaultQueryCountThreshold value here.

Whenever gv.Arguments.QueryCountThreshold is configured with a value greater than commonutils.MaxQueryCountThreshold then we are only fetching commonutils.MaxQueryCountThreshold number of results. We definitely need a warning log here. It would help when debuggin prod issues.

Unifying the logic that checks if the args are valid and logging warn in /validations/perfromance_metrics_validations.go would help in reducing redandant checks wherever min(gv.Arguments.QueryCountThreshold, commonutils.MaxIndividualQueryCountThreshold) is being used.

log.Warn("Query count threshold is greater than max supported value, setting to max supported value: %d", constants.MaxQueryCountThreshold)


func GetWaitEventMetrics(conn *performancedbconnection.PGSQLConnection, gv *globalvariables.GlobalVariables) ([]interface{}, error) {
var waitEventMetricsList []interface{}
var query = fmt.Sprintf(queries.WaitEvents, gv.DatabaseString, min(gv.Arguments.QueryCountThreshold, commonutils.MaxQueryCountThreshold))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, check if the QueryCountThreshold is valid and log a warning whenever required. Refer here

log.Error("Unsupported postgres version: %v", err)
return nil, err
}
var query = fmt.Sprintf(versionSpecificBlockingQuery, gv.DatabaseString, min(gv.Arguments.QueryCountThreshold, commonutils.MaxQueryCountThreshold))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, check if the QueryCountThreshold is valid and log a warning whenever required. Refer here

}

func getIndividualQueriesSamples(conn *performancedbconnection.PGSQLConnection, slowRunningQueries datamodels.SlowRunningQueryMetrics, gv *globalvariables.GlobalVariables, anonymizedQueriesByDB databaseQueryInfoMap, individualQueryMetricsForExecPlanList *[]datamodels.IndividualQueryMetrics, individualQueryMetricsListInterface *[]interface{}, versionSpecificIndividualQuery string) {
query := fmt.Sprintf(versionSpecificIndividualQuery, *slowRunningQueries.QueryID, gv.DatabaseString, gv.Arguments.QueryResponseTimeThreshold, min(gv.Arguments.QueryCountThreshold, commonutils.MaxIndividualQueryCountThreshold))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, check if the QueryCountThreshold, QueryResponseTimeThreshold is valid and log a warning whenever required. Use default value when configured with -ve value. Refer here

@@ -31,6 +31,9 @@ type ArgumentList struct {
CollectDbLockMetrics bool `default:"false" help:"If true, enables collection of lock metrics for the specified database. (Note: requires that the 'tablefunc' extension is installed)"` //nolint: stylecheck
CollectBloatMetrics bool `default:"true" help:"Enable collecting bloat metrics which can be performance intensive"`
ShowVersion bool `default:"false" help:"Print build information and exit"`
EnableQueryMonitoring bool `default:"false" help:"Enable collection of detailed query performance metrics."`
QueryResponseTimeThreshold int `default:"500" help:"Threshold in milliseconds for query response time. If response time exceeds this threshold, the query will be considered slow."`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This argument is being used only while fetching individual query metrics. So instead of If response time exceeds this threshold, the query will be considered slow. adding If response time for the idividual query exceeds this threshold, the individaul query is reported in metrics or something similar makes sense?

# Threshold in milliseconds for query response time to fetch individual query performance metrics - Defaults to 500
# QUERY_RESPONSE_TIME_THRESHOLD : "500"

# The number of records for each query performance metrics (maximum slow query metrics can be 20 and execution plan metrics can be 30)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

number of records to be retrieved?

Comment on lines +33 to +35
if version < commonutils.PostgresVersion12 {
return false, nil
}
Copy link

@sairaj18 sairaj18 Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of checking if the version is supported in all of the below functions we can have function CheckPostgresVersionSupport and verify if the postgres version is supported only once in query_performance_main.go file before calling populateQueryPerformanceMetrics function. This will remove redundant checks.

func CheckPostgresVersionSupport(version uint64) bool{
    if version < commonutils.PostgresVersion12 {
		return false
	}
    return true
}

Comment on lines +23 to +36
var query = fmt.Sprintf(versionSpecificSlowQuery, gv.DatabaseString, min(gv.Arguments.QueryCountThreshold, commonutils.MaxQueryCountThreshold))
rows, err := conn.Queryx(query)
if err != nil {
return nil, nil, err
}
defer rows.Close()
for rows.Next() {
var slowQuery datamodels.SlowRunningQueryMetrics
if scanErr := rows.StructScan(&slowQuery); scanErr != nil {
return nil, nil, err
}
slowQueryMetricsList = append(slowQueryMetricsList, slowQuery)
slowQueryMetricsListInterface = append(slowQueryMetricsListInterface, slowQuery)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen this part of code from line 24-36 is duplicated in blocking_sessions.go, wait_evetn_metrics.go files. Could we try to place this logic in a func and reuse it wherever needed?

Comment on lines +47 to +64
rows, err := conn.Queryx(query)
if err != nil {
log.Error("Failed to execute query: %v", err)
return nil, err
}
defer rows.Close()
for rows.Next() {
var blockingQueryMetric datamodels.BlockingSessionMetrics
if scanError := rows.StructScan(&blockingQueryMetric); scanError != nil {
return nil, scanError
}
// For PostgreSQL versions 13 and 12, anonymization of queries does not occur for blocking sessions, so it's necessary to explicitly anonymize them.
if gv.Version == commonutils.PostgresVersion13 || gv.Version == commonutils.PostgresVersion12 {
*blockingQueryMetric.BlockedQuery = commonutils.AnonymizeQueryText(*blockingQueryMetric.BlockedQuery)
*blockingQueryMetric.BlockingQuery = commonutils.AnonymizeQueryText(*blockingQueryMetric.BlockingQuery)
}
blockingQueriesMetricsList = append(blockingQueriesMetricsList, blockingQueryMetric)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here could we have a common func and reuse it

Comment on lines +44 to +55
rows, err := conn.Queryx(query)
if err != nil {
return nil, err
}
defer rows.Close()
for rows.Next() {
var waitEvent datamodels.WaitEventMetrics
if waitScanErr := rows.StructScan(&waitEvent); waitScanErr != nil {
return nil, err
}
waitEventMetricsList = append(waitEventMetricsList, waitEvent)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, could we have a common func and reuse it.

Comment on lines +58 to +60
if version == commonutils.PostgresVersion12 || version == commonutils.PostgresVersion13 {
return true, nil
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a comment explaining why we are returing ture for these versions.

if waitErr != nil {
return false, waitErr
}
pgStatExtension, statErr := isExtensionEnabled(conn, "pg_stat_statements")
Copy link

@sairaj18 sairaj18 Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already executed the same query in CheckSlowQueryMetricsFetchEligibility by calling isExtensionEnabled(conn, "pg_stat_statements"). can we do it only once?

if version == commonutils.PostgresVersion12 || version == commonutils.PostgresVersion13 {
return true, nil
}
return isExtensionEnabled(conn, "pg_stat_statements")
Copy link

@sairaj18 sairaj18 Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already executed the same query in CheckSlowQueryMetricsFetchEligibility by calling isExtensionEnabled(conn, "pg_stat_statements"). can we do it only once?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants