feat: Enhanced query performance monitoring features to track slow queries, examine query execution plans, monitor wait events, and identify blocking sessions. #171

Open · wants to merge 31 commits into base: epic_db_query_performance_monitoring
Conversation

@spathlavath spathlavath commented Jan 8, 2025

Enhanced query performance monitoring features to track slow queries, examine query execution plans, monitor wait events, and identify blocking sessions.

Co-authored-by:
Srikanth @RamanaReddy8801

@spathlavath spathlavath requested a review from a team as a code owner January 8, 2025 13:46

CLAassistant commented Jan 8, 2025

CLA assistant check
All committers have signed the CLA.

@spathlavath spathlavath changed the title feat: Introducing query performance monitoring for slow queries, query execution plan, wait events and blocking sessions feat: Enhanced query performance monitoring features to track slow queries, examine query execution plans, monitor wait events, and identify blocking sessions. Jan 8, 2025
src/mysql.go (resolved)
src/query-performance-monitoring/utils/helpers.go (outdated, resolved)
src/mysql.go (outdated, resolved)
mysql-config.yml.sample (outdated, resolved)
src/args/argument_list.go (outdated, resolved)
src/args/argument_list.go (outdated, resolved)
go.mod (outdated, resolved)
@spathlavath (Author) left a comment:

@rahulreddy15 thanks for reviewing. Please share comments with their criticality so we can work on them in priority order.

go.mod (outdated, resolved)
mysql-config.yml.sample (outdated, resolved)
src/args/argument_list.go (outdated, resolved)
src/args/argument_list.go (outdated, resolved)
src/query-performance-monitoring/utils/helpers.go (outdated, resolved)
src/mysql.go (outdated, resolved)
SupportedStatements = "SELECT INSERT UPDATE DELETE WITH"
QueryPlanTimeoutDuration = 10 * time.Second
TimeoutDuration = 5 * time.Second // TimeoutDuration defines the timeout duration for database queries
MaxQueryCountThreshold = 30
@sairaj18 (Contributor) commented Jan 15, 2025:

If we have defaults (lines 8, 13, 14, 15, 19), there need to be comments about why those defaults were chosen.

@spathlavath (Author): Added.

Contributor: WHY, not what.

@spathlavath (Author): updated.

Contributor:

Thanks for adding comments about each of the constants.

Can you also add a one-liner explaining WHY we chose those defaults? It would be helpful in the future to know why a value was chosen.

@spathlavath (Author):

The following settings have been introduced after discussions with the project managers and may be adjusted in the future:

  • DefaultSlowQueryFetchInterval = 30
  • DefaultQueryResponseTimeThreshold = 500
  • DefaultQueryCountThreshold = 20
  • MaxQueryCountThreshold = 30
  • IndividualQueryCountThreshold = 10

For instance, if QueryCountThreshold is set to 50, then in the worst-case scenario:

  • Slow queries would total 50.
  • Individual queries would amount to 50 * IndividualQueryCountThreshold, equaling 500.
  • When considering the execution plan for queries, assuming there are 5 objects in the execution plan JSON for each individual query, this would result in 2500 objects to handle.
  • Wait events would number 50.
  • Blocking sessions would also total 50.

With a configuration interval set at 30 seconds, processing these results can consume significant time and resources. This, in turn, imposes additional overhead on the customer's database.
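A sketch of how the rationale above might be attached to the constants themselves. The names mirror the ones listed in this thread; the comment wording and the worst-case arithmetic in main are illustrative, not the PR's actual code:

```go
package main

import "fmt"

// Defaults agreed with the PMs; they may be adjusted in the future.
const (
	DefaultSlowQueryFetchInterval     = 30  // seconds between slow-query fetches
	DefaultQueryResponseTimeThreshold = 500 // ms; queries slower than this are reported
	DefaultQueryCountThreshold        = 20  // default number of slow queries collected
	MaxQueryCountThreshold            = 30  // hard cap, bounds load on the customer's database
	IndividualQueryCountThreshold     = 10  // individual queries fetched per slow query
)

func main() {
	// Worst case from the discussion: a hypothetical QueryCountThreshold of 50
	// yields 50 slow queries, 50*10 = 500 individual queries, and with ~5
	// execution-plan objects each, 2500 objects per 30-second interval.
	const hypotheticalThreshold = 50
	individual := hypotheticalThreshold * IndividualQueryCountThreshold
	planObjects := individual * 5
	fmt.Println(individual, planObjects)
}
```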

@sairaj18 (Contributor) commented Jan 29, 2025:

Thanks for this. Could you add a comment as shown below, and also include the assumptions made to choose MetricSetLimit as 100?

/*
NOTE: The default and max values chosen may be adjusted in the future. Assumptions made to choose the defaults and max values:

For instance, if QueryCountThreshold is set to 50, then in the worst-case scenario:
    - Slow queries would total 50.
    - Individual queries would amount to 50 * IndividualQueryCountThreshold, equaling 500.
    - When considering the execution plan for queries, assuming there are 5 objects in the execution plan JSON for each individual query, this would result in 2500 objects to handle.
    - Wait events would number 50.
    - Blocking sessions would also total 50.
    
With a configuration interval set at 30 seconds, processing these results can consume significant time and resources. This, in turn, imposes additional overhead on the customer's database. Hence, we are enforcing MaxQueryCountThreshold as 30
*/

@spathlavath (Author): updated.

@spathlavath spathlavath force-pushed the epic_performance_monitoring branch 5 times, most recently from 6ff4f65 to ec80544 Compare January 15, 2025 07:02
…king sessions and wait events (#7)

* feat: Introducing query performance monitoring for slow queries, blocking sessions and wait events.

* refactor: Implemented a limit for wait events, blocking sessions, and included bug fixes (#8)

* refactor: Implemented a limit for wait events, blocking sessions, and included bug fixes.

* Included a stepID for each row during the query execution iteration

* Added fix for stepID

* Added a fix to ensure that  increments correctly.

* Renamed FETCH_INTERVAL to SLOW_QUERY_FETCH_INTERVAL

* Added detailed logging for each operation, including the time taken for execution.

* refactor: Added configuration option to disable query performance metrics per database (#10)

* refactor: Added configuration option to disable query performance metrics per database

* Revised the list of input arguments for retrieving individual queries.

* Updated logging messages and Revised the list of input arguments for retrieving wait events and blocking session queries.

* Added a helper function to obtain a list of unique databases to exclude.

* code refactoring

* Added fix for number of arguments mismatch for the SQL query

* removed rebind functionality

* updated metricset limit

* reverted metricset limit

* minor code refactoring

* fixed linting errors

* fixed linting errors

* refactor: resolving linting errors (#12)

* refactor: resolving linting errors

* fixing linting errors

* fixing linting errors

* fixing linting errors

* fixing linting errors

* fixing linting errors

* refactor: resolving linting errors (#13)

* refactor: changed log.info to log.debug and other bug fixes (#14)

* refactor: code refactoring and addressing review comments (#15)

* refactor: code refactoring and addressing review comments

* lint issue fixes

* lint issue fixes

* lint issue fixes

* lint issue fixes

* lint issue fixes

* refactor: Added a limit on individual query details and defined min/max values for the limit threshold. (#16)

* refactor: Added a limit on individual query details and defined min/max values for the limit threshold.

* minor enhancements

* minor enhancements

* minor enhancements

* refactor: code restructuring (#17)

* refactor: code restructuring

* file name changes

* Blocking sessions query update

* Blocking sessions query update

* package changes

* Added code review fixes

* Added code review fixes

* Added code review fixes

* Added code review fixes

* Added code review fixes

* Added code review fixes

* Added code review fixes

* query execution plan changes

* file name changes

* Added code review fixes
…Delete and Update queries. Updated blocking sessions data model (#18)
…ion times (#20)

Updated wait events query to analyze wait events and execution times
@spathlavath spathlavath force-pushed the epic_performance_monitoring branch from ec80544 to c796359 Compare January 15, 2025 07:07
mysql-config.yml.sample (outdated, resolved)
src/mysql.go (outdated, resolved)
* refactor: Added utility to escape backticks in JSON strings
* Added unit tests for ValidateAndSetDefaults function and fix argument handling
@sigilioso (Contributor) left a comment:

Thanks for addressing the previous comments. I've left additional feedback.

src/args/argument_list.go (outdated, resolved)
src/mysql_test.go (outdated, resolved)
src/mysql.go (outdated, resolved)
mysql-config.yml.sample (resolved)
src/query-performance-monitoring/validator/validations.go (outdated, resolved)
src/query-performance-monitoring/utils/helpers.go (outdated, resolved)
})
}

func TestGetUniqueExcludedDatabases(t *testing.T) {
Contributor:

Nit: we might cover all these test cases with the test below (testing the public function, directly from JSON)

@spathlavath (Author): updated.

Contributor:

As I understand it, the TestGetUniqueExcludedDatabases unit test is not needed if we cover all the test cases it covers in the TestGetExcludedDatabases unit test.
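The consolidation being discussed can be sketched as follows: exercise only the public entry point from JSON input, so the private de-duplication helper is covered implicitly. The function name getExcludedDatabases, its JSON-list input, and its behavior here are assumptions standing in for the PR's actual helper:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// getExcludedDatabases is a hypothetical stand-in for the PR's public helper:
// parse a JSON list of database names and de-duplicate it, which is exactly
// what the private unique-databases helper test exercised separately.
func getExcludedDatabases(raw string) ([]string, error) {
	var names []string
	if err := json.Unmarshal([]byte(raw), &names); err != nil {
		return nil, err
	}
	seen := map[string]bool{}
	var unique []string
	for _, n := range names {
		if !seen[n] {
			seen[n] = true
			unique = append(unique, n)
		}
	}
	return unique, nil
}

func main() {
	// A single test against the public function covers parsing and de-duplication.
	got, err := getExcludedDatabases(`["mysql","sys","mysql"]`)
	fmt.Println(got, err)
}
```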

src/query-performance-monitoring/utils/helpers.go (outdated, resolved)
src/query-performance-monitoring/utils/helpers.go (outdated, resolved)
sqlxDB := sqlx.NewDb(db, "sqlmock")
mockDataSource := &mockDataSource{db: sqlxDB}

mock.ExpectQuery("SHOW GLOBAL VARIABLES LIKE 'performance_schema';").WillReturnRows(rows)
Contributor:

Could we declare this query SHOW GLOBAL VARIABLES LIKE 'performance_schema'; as a constant in validations.go and reuse it here?
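The suggested refactor is small; a sketch (the constant name performanceSchemaQuery and its placement in validations.go are assumptions):

```go
package main

import "fmt"

// In validations.go, the query would be declared once:
const performanceSchemaQuery = "SHOW GLOBAL VARIABLES LIKE 'performance_schema';"

func main() {
	// The test file can then reference the same constant instead of repeating
	// the string literal, so production code and test can never drift apart:
	//   mock.ExpectQuery(performanceSchemaQuery).WillReturnRows(rows)
	fmt.Println(performanceSchemaQuery)
}
```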

@spathlavath (Author): updated.

@sairaj18 (Contributor) commented Jan 31, 2025:

I don't see the change reflected.

@spathlavath spathlavath requested a review from sigilioso January 27, 2025 09:35
@@ -3452,28 +3452,6 @@
],
"type": "object"
},
Contributor:

Why were these metrics removed: performance_schema_events_stages_history_long_size and performance_schema_events_stages_history_size?

I see they have been removed from all the JSON schema files.

@spathlavath (Author):

I removed the consumer events_statements_cpu from validations as it is no longer required, so I removed the two metrics below from the JSON schema files:
performance_schema_events_stages_history_long_size
performance_schema_events_stages_history_size

Contributor:

Can you mark them as not required instead of removing them from the json-schema-performance-files directory schema files?

Also, don't modify the already existing JSON schema files, i.e. the json-schema-files-${version} directories.

@spathlavath (Author): updated.

Contributor:

I don't see the changes reflected. Can you check once?

Contributor:

It looks like you have added the metrics I mentioned above, but two other metrics were also removed. Can you refer to this commit and revert everything that was removed?

@spathlavath (Author): can you please check now?

Comment on lines 135 to 137
if err == nil {
t.Fatal("Expected error collecting metrics, got nil")
}
@sigilioso (Contributor) commented Jan 27, 2025:

I'd try to keep tests consistent: sometimes we use if err == nil { t.Fatal(...) } and sometimes we use assert.Error / assert.NoError. Could we make it uniform? (Since testify is already used, I find its usage more readable.)

This comment also applies to other unit tests that are not consistent.
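For illustration, the two styles side by side. The real change would use github.com/stretchr/testify/assert; here assertError is a stdlib stand-in for assert.Error so the sketch stays self-contained, and collectMetrics is a hypothetical function under test:

```go
package main

import (
	"errors"
	"fmt"
)

// collectMetrics is a hypothetical stand-in for the code under test.
func collectMetrics() error {
	return errors.New("connection refused")
}

// assertError mimics testify's assert.Error: it reports whether err is
// non-nil instead of aborting the test, keeping assertions uniform and terse.
func assertError(err error) bool {
	return err != nil
}

func main() {
	err := collectMetrics()

	// Style 1, currently mixed into some tests:
	if err == nil {
		fmt.Println("Expected error collecting metrics, got nil")
	}

	// Style 2, the testify-like form the reviewer prefers:
	fmt.Println(assertError(err))
}
```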

@spathlavath (Author): updated.

Contributor:

I see the changes are not reflected in these files


Comment on lines +181 to +194
tableName, _ := js.Get("table_name").String()
queryCost, _ := js.Get("cost_info").Get("query_cost").String()
accessType, _ := js.Get("access_type").String()
rowsExaminedPerScan, _ := js.Get("rows_examined_per_scan").Int64()
rowsProducedPerJoin, _ := js.Get("rows_produced_per_join").Int64()
filtered, _ := js.Get("filtered").String()
readCost, _ := js.Get("cost_info").Get("read_cost").String()
evalCost, _ := js.Get("cost_info").Get("eval_cost").String()
prefixCost, _ := js.Get("cost_info").Get("prefix_cost").String()
dataReadPerJoin, _ := js.Get("cost_info").Get("data_read_per_join").String()
usingIndex, _ := js.Get("using_index").Bool()
keyLength, _ := js.Get("key_length").String()
possibleKeysArray, _ := js.Get("possible_keys").StringArray()
key, _ := js.Get("key").String()
Contributor:

Is it ok to fail silently if any field is missing or the type doesn't match? I guess that the structure is pretty stable. However, if it changes in future mysql versions, hiding the errors will make detecting such changes difficult.

@spathlavath (Author):

The query execution plan structure/payload varies from query to query; it is not a fixed payload. So fields are expected to be missing while fetching them, and this was already discussed with the PMs.

@sigilioso (Contributor) commented Jan 29, 2025:

See: #171 (comment)

If we are supporting different structures for different queries, we need unit tests to ensure that the query parsing works as expected.


// extractMetricsFromJSONString extracts metrics from a JSON string.
func extractMetricsFromJSONString(jsonString string, eventID uint64, threadID uint64) ([]utils.QueryPlanMetrics, error) {
js, err := simplejson.NewJson([]byte(jsonString))
Contributor:

Did you consider defining a struct to represent the result of the query plan instead of using simplejson? The current approach has some downsides that could be addressed using a model for the response:

  • The need to unmarshal and marshal back results in order to execute extractMetrics recursively (which are costly operations)
  • The need for explicit reflection (also costly)
  • Code complexity:
    • The extractMetrics function partially defines the model (the field names are defined there) and any type error would either fail silently or panic.
    • The recursive code to extract the metrics has high complexity and is not fully covered by unit tests

@spathlavath (Author) commented Jan 28, 2025:

Thank you for the suggestion to define a struct to represent the query plan result instead of using simplejson. I appreciate the benefits you've highlighted: reduced marshalling/unmarshalling, less reflection, improved type safety, and potentially simpler code.

However, after careful consideration, I believe that using a struct to model the entire query plan is not the most suitable approach for our specific use case, primarily due to the variability and complexity of the EXPLAIN FORMAT=JSON output from MySQL, and the fact that we need to handle potentially invalid JSON.

Challenges with Using a Struct:

  1. Dynamic and Complex JSON Structure: The JSON structure returned by EXPLAIN FORMAT=JSON is highly dynamic and can vary significantly depending on the query. It often contains nested objects and arrays with varying levels of depth. Creating a struct that accurately represents all possible variations would be extremely complex and difficult to maintain. The structure is not fixed and depends on factors like:

    • The type of query (SELECT, UPDATE, DELETE, etc.)
    • The presence of joins, subqueries, and other query constructs.
    • The specific optimizations chosen by the MySQL query optimizer.
  2. Potentially Invalid JSON: As we've discussed, the attached_condition field frequently contains unescaped characters that can make the entire JSON invalid. If we were to use a struct, we would still need a way to handle these parsing errors gracefully. We would also need to avoid unmarshalling the parts of JSON that are invalid, which would add complexity.

  3. Maintenance Overhead: Maintaining a complex struct that needs to be updated whenever the EXPLAIN output format changes in a new MySQL version would create significant maintenance overhead.

Current Approach:

Given these challenges, our current approach using simplejson offers several advantages:

  • Flexibility: simplejson allows us to easily navigate and extract the specific fields we need, even with a dynamic and potentially invalid JSON structure.
  • Error Handling: Our current error handling around simplejson.NewJson, and the logic within extractMetricsFromJSONString, processMap, and processSliceValue, allows us to gracefully handle cases where the JSON is partially invalid, either by returning an empty slice if we can partially recover or by returning the error if recovery is not possible.
  • Targeted Data Extraction: We can selectively extract only the fields relevant to our monitoring needs without the overhead of a large, all-encompassing struct.
  • Reduced Maintenance: We avoid the maintenance burden of a complex struct that would need to be constantly updated to match changes in the EXPLAIN output format.

Addressing Review Points:

  • Marshalling/Unmarshalling: While it's true that we currently marshal the map in processMapValue and processSliceValue, this is less costly than unmarshalling the entire JSON into a large struct and then potentially not using most of the fields.
  • Reflection: The use of reflection is limited to checking the type of map values (Map or Slice) and is not a major performance bottleneck in our case.
  • Code Complexity: Although recursive, the extractMetrics, processMap, and processSliceValue functions are relatively straightforward in their logic. The complexity arises more from the nature of the EXPLAIN output itself than from our code.

Contributor:

Thanks for also generating this response.

I think it is important to review some aspects of the implementation even if we keep the current approach.

Lack of unit testing

I understand the challenges of modeling such a complex result. However, what I said in #171 (comment) also applies here: if we need to support many different EXPLAIN results, with different structures depending on the query (INSERT, DELETE, UPDATE, subqueries, optimizations, ...), we need unit tests to ensure we are fetching the data as expected. These tests will also help support any future change in the EXPLAIN structure AND troubleshooting missing data for particular queries.

Marshal/unmarshal many times in order to use simplejson

Even if the unit tests are the main thing to address, I would also like to point out that we could keep the dynamic approach (if modeling the results is not feasible/suitable) and avoid marshaling/unmarshaling many times. Marshaling each element in processSliceValue and processMapValue is merely needed to build the simplejson.Json struct (which will unmarshal the corresponding bytes). We could:

  • Work with the Json object directly, which can be challenging (because simplejson might not provide the helpers to perform complex actions, e.g. js.Map() returns map[string]interface{} and we don't have a helper returning map[string]*Json) but feasible (we could iterate over the results of js.Map() and obtain a Json struct through js.Get(key)).
  • Use the interface{} representation of the data (and perform the corresponding type assertions to extract data) instead of using simplejson and its helpers.

Possible skipped values (depending on the expected behavior)

  • The extractMetrics function calls processMap with the result of js.Map() if it doesn't fail. But if js holds a slice, js.Map() will return an error and the function ends. Could this be an issue?
  • Something similar happens in the implementation of processSliceValue: only map[string]interface{} elements are considered. Aren't slices of slices supposed to be supported?

By only reviewing the code it is not possible to know if the potential issues described above could be possible with known EXPLAIN outputs, but the current approach was claimed to be flexible.
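The second bullet above can be sketched with the standard library alone: decode the plan once into interface{} and walk the tree with type switches, so no sub-tree is ever re-marshaled. The field name table_name matches the snippet earlier in this thread; the input JSON and the helper name extractTableNames are illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// extractTableNames walks a decoded EXPLAIN-style JSON tree in a single pass,
// using type switches on the interface{} values instead of re-marshaling each
// sub-tree to feed it back into a JSON helper.
func extractTableNames(node interface{}, out *[]string) {
	switch v := node.(type) {
	case map[string]interface{}:
		if name, ok := v["table_name"].(string); ok {
			*out = append(*out, name)
		}
		for _, child := range v {
			extractTableNames(child, out)
		}
	case []interface{}:
		for _, child := range v {
			extractTableNames(child, out)
		}
	}
}

func main() {
	raw := `{"query_block":{"nested_loop":[{"table":{"table_name":"t1"}},{"table":{"table_name":"t2"}}]}}`
	var tree interface{}
	if err := json.Unmarshal([]byte(raw), &tree); err != nil {
		panic(err)
	}
	var names []string
	extractTableNames(tree, &names)
	fmt.Println(names)
}
```

Slices of slices are handled for free here, since the []interface{} case recurses into every element regardless of its type.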

src/query-performance-monitoring/utils/helpers.go (outdated, resolved)
src/args/argument_list.go (outdated, resolved)
// ValidatePreconditions checks if the necessary preconditions are met for performance monitoring.
func ValidatePreconditions(db utils.DataSource) error {
// Check if Performance Schema is enabled
performanceSchemaEnabled, errPerformanceEnabled := isPerformanceSchemaEnabled(db)
Contributor:

Before checking whether the performance schema is enabled, don't you think we need to verify that the MySQL server version is supported?
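A minimal sketch of the ordering being asked for, assuming MySQL 8.0 as the minimum supported version (the integration's actual cutoff may differ; isVersionSupported, its parsing, and validatePreconditions taking plain values instead of a DataSource are all illustrative):

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// isVersionSupported parses a "major.minor.patch" version string and requires
// MySQL 8.0 or later. The real integration's minimum version may differ.
func isVersionSupported(version string) bool {
	parts := strings.Split(version, ".")
	if len(parts) < 2 {
		return false
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return false
	}
	return major >= 8
}

// validatePreconditions sketches the suggested ordering: check the server
// version first, and only then the performance schema flag.
func validatePreconditions(version string, performanceSchemaEnabled bool) error {
	if !isVersionSupported(version) {
		return errors.New("unsupported MySQL server version: " + version)
	}
	if !performanceSchemaEnabled {
		return errors.New("performance schema is not enabled")
	}
	return nil
}

func main() {
	fmt.Println(validatePreconditions("5.7.44", true)) // rejected on version first
	fmt.Println(validatePreconditions("8.0.36", true))
}
```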

@spathlavath (Author): updated.

)

var (
ErrCreateNodeEntity = errors.New("error creating node entity")
Contributor: This is not being used.

}
}

func TestMetricSet(t *testing.T) {
Contributor: Can we move this unit test to src/infrautils/entity_test.go?

7 participants