feat: Enhanced query performance monitoring features to track slow queries, examine query execution plans, monitor wait events, and identify blocking sessions. #171
base: epic_db_query_performance_monitoring
Conversation
src/query-performance-monitoring/performance-metrics-collectors/query_execution_plan.go
@rahulreddy15
Thanks for reviewing. Please share comments with their criticality so we can work on them in priority order.
src/query-performance-monitoring/performance-metrics-collectors/query_execution_plan.go
SupportedStatements = "SELECT INSERT UPDATE DELETE WITH"
QueryPlanTimeoutDuration = 10 * time.Second
TimeoutDuration = 5 * time.Second // TimeoutDuration defines the timeout duration for database queries
MaxQueryCountThreshold = 30
If we have defaults (lines 8, 13, 14, 15, 19), there need to be comments about why those defaults were chosen.
Added.
WHY, not what.
updated.
Thanks for adding comments about each of the constants.
Can you also add a one-liner explaining WHY we chose those defaults?
It would be helpful in the future to know why a value was chosen.
The following settings have been introduced after discussions with the project managers and may be adjusted in the future:
DefaultSlowQueryFetchInterval = 30
DefaultQueryResponseTimeThreshold = 500
DefaultQueryCountThreshold = 20
MaxQueryCountThreshold = 30
IndividualQueryCountThreshold = 10
For instance, if QueryCountThreshold is set to 50, then in the worst-case scenario:
- Slow queries would total 50.
- Individual queries would amount to 50 * IndividualQueryCountThreshold, equaling 500.
- When considering the execution plan for queries, assuming there are 5 objects in the execution plan JSON for each individual query, this would result in 2500 objects to handle.
- Wait events would number 50.
- Blocking sessions would also total 50.
With a configuration interval set at 30 seconds, processing these results can consume significant time and resources. This, in turn, imposes additional overhead on the customer's database.
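For illustration, these defaults might be declared roughly as the Go constants below, with one-line rationales; the constant names come from the list above, while the units and comments are assumptions rather than the integration's actual source.

```go
const (
	// Values agreed with the PMs to bound per-interval work on the customer's database;
	// they may be adjusted in the future.
	DefaultSlowQueryFetchInterval     = 30  // seconds; default collection interval for slow queries
	DefaultQueryResponseTimeThreshold = 500 // milliseconds; queries slower than this count as slow
	DefaultQueryCountThreshold        = 20  // slow queries fetched per interval by default
	MaxQueryCountThreshold            = 30  // hard cap, per the worst-case sizing described above
	IndividualQueryCountThreshold     = 10  // individual queries fetched per slow query
)
```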
Thanks for this. Could you add a comment as shown below, and also include the assumptions made to choose MetricSetLimit as 100?
/*
NOTE: The default and max values chosen may be adjusted in the future. Assumptions made to choose the defaults and max values:
For instance, if QueryCountThreshold is set to 50, then in the worst-case scenario:
- Slow queries would total 50.
- Individual queries would amount to 50 * IndividualQueryCountThreshold, equaling 500.
- When considering the execution plan for queries, assuming there are 5 objects in the execution plan JSON for each individual query, this would result in 2500 objects to handle.
- Wait events would number 50.
- Blocking sessions would also total 50.
With a configuration interval set at 30 seconds, processing these results can consume significant time and resources. This, in turn, imposes additional overhead on the customer's database. Hence, we are enforcing MaxQueryCountThreshold as 30
*/
updated.
Force-pushed from 6ff4f65 to ec80544
…king sessions and wait events (#7)
* feat: Introducing query performance monitoring for slow queries, blocking sessions and wait events.
* refactor: Implemented a limit for wait events, blocking sessions, and included bug fixes (#8)
* refactor: Implemented a limit for wait events, blocking sessions, and included bug fixes.
* Included a stepID for each row during the query execution iteration
* Added fix for stepID
* Added a fix to ensure that increments correctly.
* Renamed FETCH_INTERVAL to SLOW_QUERY_FETCH_INTERVAL
* Added detailed logging for each operation, including the time taken for execution.
* refactor: Added configuration option to disable query performance metrics per database (#10)
* refactor: Added configuration option to disable query performance metrics per database
* Revised the list of input arguments for retrieving individual queries.
* Updated logging messages and revised the list of input arguments for retrieving wait events and blocking session queries.
* Added a helper function to obtain a list of unique databases to exclude.
* code refactoring
* Added fix for number of arguments mismatch for the SQL query
* removed rebind functionality
* updated metricset limit
* reverted metricset limit
* minor code refactoring
* fixed linting errors
* fixed linting errors
* refactor: resolving linting errors (#12)
* refactor: resolving linting errors
* fixing linting errors
* fixing linting errors
* fixing linting errors
* fixing linting errors
* fixing linting errors
* refactor: resolving linting errors (#13)
* refactor: changed log.info to log.debug and other bug fixes (#14)
* refactor: code refactoring and addressing review comments (#15)
* refactor: code refactoring and addressing review comments
* lint issue fixes
* lint issue fixes
* lint issue fixes
* lint issue fixes
* lint issue fixes
* refactor: Added a limit on individual query details and defined min/max values for the limit threshold. (#16)
* refactor: Added a limit on individual query details and defined min/max values for the limit threshold.
* minor enhancements
* minor enhancements
* minor enhancements
* refactor: code restructuring (#17)
* refactor: code restructuring
* file name changes
* Blocking sessions query update
* Blocking sessions query update
* package changes
* Added code review fixes
* Added code review fixes
* Added code review fixes
* Added code review fixes
* Added code review fixes
* Added code review fixes
* Added code review fixes
* query execution plan changes
* file name changes
* Added code review fixes
…Delete and Update queries. Updated blocking sessions data model (#18)
…ion times (#20) Updated wait events query to analyze wait events and execution times
Force-pushed from ec80544 to c796359
* refactor: Added utility to escape backticks in JSON strings
* Added unit tests for ValidateAndSetDefaults function and fixed argument handling
Thanks for addressing the previous comments. I've left additional feedback
	})
}

func TestGetUniqueExcludedDatabases(t *testing.T) {
Nit: we might cover all these test cases with the test below (testing the public function, directly from JSON)
updated.
As I understand it, the TestGetUniqueExcludedDatabases unit test is not needed if we cover all the test cases it covers in the TestGetExcludedDatabases unit test.
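As a rough illustration, the consolidated test could be table-driven over the raw JSON argument; this is only a sketch, and the GetExcludedDatabases signature shown here is an assumption inferred from the test names in this thread, not the integration's actual API.

```go
func TestGetExcludedDatabases(t *testing.T) {
	testCases := []struct {
		name     string
		jsonArg  string
		expected []string
	}{
		{name: "empty list", jsonArg: `[]`, expected: []string{}},
		{name: "duplicates collapsed", jsonArg: `["db1", "db1", "db2"]`, expected: []string{"db1", "db2"}},
	}
	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			got, err := GetExcludedDatabases(tc.jsonArg)
			assert.NoError(t, err)
			assert.ElementsMatch(t, tc.expected, got)
		})
	}
}
```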
src/query-performance-monitoring/performance-metrics-collectors/query_details.go
sqlxDB := sqlx.NewDb(db, "sqlmock")
mockDataSource := &mockDataSource{db: sqlxDB}

mock.ExpectQuery("SHOW GLOBAL VARIABLES LIKE 'performance_schema';").WillReturnRows(rows)
Could we declare this query (SHOW GLOBAL VARIABLES LIKE 'performance_schema';) as a constant in validations.go and reuse it here?
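For instance, something like the following in validations.go (the constant name here is only a suggestion):

```go
// performanceSchemaCheckQuery is reused by the validation logic and its tests.
const performanceSchemaCheckQuery = "SHOW GLOBAL VARIABLES LIKE 'performance_schema';"
```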
updated.
I don't see the change reflected
@@ -3452,28 +3452,6 @@
],
"type": "object"
},
Why were these metrics removed: performance_schema_events_stages_history_long_size and performance_schema_events_stages_history_size?
I see they have been removed from all the JSON schema files.
I have removed the events_statements_cpu consumer from validations, as it is no longer required, so I've removed the two metrics below from the JSON schema files:
performance_schema_events_stages_history_long_size
performance_schema_events_stages_history_size
Can you make them not required instead of removing them from the schema files in the json-schema-performance-files directory?
Also, don't modify the already existing JSON schema files, i.e. the json-schema-files-${version} directories.
updated.
I don't see the changes reflected. Can you check once?
Looks like you have added back the metrics I mentioned above, but there are two other metrics that were also removed. Can you refer to this commit and then revert everything that was removed?
can you please check now?
src/query-performance-monitoring/performance-metrics-collectors/blocking_sessions_test.go
src/query-performance-monitoring/performance-metrics-collectors/blocking_sessions_test.go
src/query-performance-monitoring/performance-metrics-collectors/blocking_sessions_test.go
if err == nil {
	t.Fatal("Expected error collecting metrics, got nil")
}
I'd try to keep tests consistent: sometimes we use if err == nil { t.Fatal(...) } and sometimes we use assert.Error / assert.NoError. Could we make it uniform? (Considering that testify is already used, I find its usage more readable.)
This comment also applies to other unit tests that are not consistent.
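For example, the manual nil check from the snippet above could be written in the testify style; a minimal self-contained sketch, where collectMetrics is just a stand-in for the real collector call:

```go
package collectors

import (
	"errors"
	"testing"

	"github.com/stretchr/testify/assert"
)

// collectMetrics stands in for the real metric-collection call exercised by the test.
func collectMetrics() error { return errors.New("collection failed") }

func TestCollectMetricsReturnsError(t *testing.T) {
	err := collectMetrics()
	// Replaces: if err == nil { t.Fatal("Expected error collecting metrics, got nil") }
	assert.Error(t, err, "expected error collecting metrics")
}
```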
updated.
src/query-performance-monitoring/performance-metrics-collectors/blocking_sessions_test.go
src/query-performance-monitoring/performance-metrics-collectors/query_execution_plan.go
tableName, _ := js.Get("table_name").String()
queryCost, _ := js.Get("cost_info").Get("query_cost").String()
accessType, _ := js.Get("access_type").String()
rowsExaminedPerScan, _ := js.Get("rows_examined_per_scan").Int64()
rowsProducedPerJoin, _ := js.Get("rows_produced_per_join").Int64()
filtered, _ := js.Get("filtered").String()
readCost, _ := js.Get("cost_info").Get("read_cost").String()
evalCost, _ := js.Get("cost_info").Get("eval_cost").String()
prefixCost, _ := js.Get("cost_info").Get("prefix_cost").String()
dataReadPerJoin, _ := js.Get("cost_info").Get("data_read_per_join").String()
usingIndex, _ := js.Get("using_index").Bool()
keyLength, _ := js.Get("key_length").String()
possibleKeysArray, _ := js.Get("possible_keys").StringArray()
key, _ := js.Get("key").String()
Is it OK to fail silently if any field is missing or the type doesn't match? I guess that the structure is pretty stable. However, if it changes in future MySQL versions, hiding the errors will make detecting such changes difficult.
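As a possible middle ground, a sketch of a stricter extraction for one field using simplejson's CheckGet, so a missing or mistyped field is at least logged rather than silently dropped (the log.Debug calls assume the integration's logger and are illustrative):

```go
if v, ok := js.CheckGet("table_name"); ok {
	tableName, err := v.String()
	if err != nil {
		log.Debug("table_name present in execution plan but not a string: %v", err)
	}
	_ = tableName // use the value as before
} else {
	log.Debug("table_name missing from execution plan node")
}
```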
The query execution plan structure/payload varies from query to query; it is not a fixed payload. So it is expected that fields will be missing while fetching them, and this was already discussed with the PMs.
See: #171 (comment)
If we are supporting different structures for different queries, we need unit tests to ensure that the query parsing works as expected.
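As a rough sketch of that coverage, a table-driven test could feed extractMetricsFromJSONString (shown later in this thread) several EXPLAIN FORMAT=JSON shapes; the payloads below are trimmed, illustrative examples, and a real test would assert the exact metrics extracted for each shape.

```go
func TestExtractMetricsFromJSONStringShapes(t *testing.T) {
	testCases := []struct {
		name    string
		payload string
	}{
		{name: "simple select", payload: `{"query_block": {"table": {"table_name": "users", "access_type": "ALL"}}}`},
		{name: "join via nested_loop", payload: `{"query_block": {"nested_loop": [{"table": {"table_name": "orders"}}]}}`},
		{name: "subquery", payload: `{"query_block": {"table": {"table_name": "t1", "attached_subqueries": [{"query_block": {"table": {"table_name": "t2"}}}]}}}`},
	}
	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			metrics, err := extractMetricsFromJSONString(tc.payload, 1, 1)
			assert.NoError(t, err)
			_ = metrics // a real test would compare metrics against the expected entries for this shape
		})
	}
}
```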
src/query-performance-monitoring/performance-metrics-collectors/query_execution_plan.go
// extractMetricsFromJSONString extracts metrics from a JSON string.
func extractMetricsFromJSONString(jsonString string, eventID uint64, threadID uint64) ([]utils.QueryPlanMetrics, error) {
	js, err := simplejson.NewJson([]byte(jsonString))
Did you consider defining a struct to represent the result of the query plan instead of using simplejson? The current approach has some downsides that could be addressed by using a model for the response:
- The need to unmarshal and marshal back results in order to execute extractMetrics recursively (which are costly operations)
- The need for explicit reflection (also costly)
- Code complexity:
  - The extractMetrics function partially defines the model (the field names are defined there), and any type error would either fail silently or panic.
  - The recursive code that extracts the metrics has high complexity and is not fully covered with unit tests.
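For reference, a minimal sketch of what such a model could look like if only the fields the collector already reads were represented; the JSON tags follow the keys used in the extraction snippet above, and this is an illustration rather than a full model of every EXPLAIN variation.

```go
// Partial model of an EXPLAIN FORMAT=JSON table node. Unknown fields are simply
// ignored by encoding/json, so the struct only needs the keys the collector extracts.
type costInfo struct {
	QueryCost       string `json:"query_cost"`
	ReadCost        string `json:"read_cost"`
	EvalCost        string `json:"eval_cost"`
	PrefixCost      string `json:"prefix_cost"`
	DataReadPerJoin string `json:"data_read_per_join"`
}

type tableNode struct {
	TableName           string   `json:"table_name"`
	AccessType          string   `json:"access_type"`
	RowsExaminedPerScan int64    `json:"rows_examined_per_scan"`
	RowsProducedPerJoin int64    `json:"rows_produced_per_join"`
	Filtered            string   `json:"filtered"`
	UsingIndex          bool     `json:"using_index"`
	KeyLength           string   `json:"key_length"`
	PossibleKeys        []string `json:"possible_keys"`
	Key                 string   `json:"key"`
	CostInfo            costInfo `json:"cost_info"`
}
```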
Thank you for the suggestion to define a struct to represent the query plan result instead of using simplejson. I appreciate the benefits you've highlighted: reduced marshalling/unmarshalling, less reflection, improved type safety, and potentially simpler code.
However, after careful consideration, I believe that using a struct to model the entire query plan is not the most suitable approach for our specific use case, primarily due to the variability and complexity of the EXPLAIN FORMAT=JSON output from MySQL, and the fact that we need to handle potentially invalid JSON.

Challenges with Using a Struct:
- Dynamic and Complex JSON Structure: The JSON structure returned by EXPLAIN FORMAT=JSON is highly dynamic and can vary significantly depending on the query. It often contains nested objects and arrays with varying levels of depth. Creating a struct that accurately represents all possible variations would be extremely complex and difficult to maintain. The structure is not fixed and depends on factors like:
  - The type of query (SELECT, UPDATE, DELETE, etc.)
  - The presence of joins, subqueries, and other query constructs.
  - The specific optimizations chosen by the MySQL query optimizer.
- Potentially Invalid JSON: As we've discussed, the attached_condition field frequently contains unescaped characters that can make the entire JSON invalid. If we were to use a struct, we would still need a way to handle these parsing errors gracefully. We would also need to avoid unmarshalling the parts of the JSON that are invalid, which would add complexity.
- Maintenance Overhead: Maintaining a complex struct that needs to be updated whenever the EXPLAIN output format changes in a new MySQL version would create significant maintenance overhead.

Current Approach:
Given these challenges, our current approach using simplejson offers several advantages:
- Flexibility: simplejson allows us to easily navigate and extract the specific fields we need, even with a dynamic and potentially invalid JSON structure.
- Error Handling: Our current error handling around simplejson.NewJson, and the logic within extractMetricsFromJSONString, processMap, and processSliceValue, allows us to gracefully handle cases where the JSON is partially invalid, either by returning an empty slice if we can recover partially or returning the error if recovery is not possible.
- Targeted Data Extraction: We can selectively extract only the fields relevant to our monitoring needs without the overhead of a large, all-encompassing struct.
- Reduced Maintenance: We avoid the maintenance burden of a complex struct that would need to be constantly updated to match changes in the EXPLAIN output format.

Addressing Review Points:
- Marshalling/Unmarshalling: While it's true that we currently marshal the map in processMapValue and processSliceValue, this is less costly than unmarshalling the entire JSON into a large struct and then potentially not using most of the fields.
- Reflection: The use of reflection is limited to checking the type of map values (Map or Slice) and is not a major performance bottleneck in our case.
- Code Complexity: Although recursive, the extractMetrics, processMap, and processSliceValue functions are relatively straightforward in their logic. The complexity arises more from the nature of the EXPLAIN output itself than from our code.
Thanks for also generating this response.
I think it is important to review some aspects of the implementation even if we keep the current approach.

Lack of unit testing
I understand the challenges of modeling such a complex result. However, what I said in #171 (comment) also applies here: if we need to support a lot of different EXPLAIN results, with different structures depending on the query (INSERT, DELETE, UPDATE, subqueries, optimizations, ...), we need unit tests to assure we are fetching the data as expected. These tests will also help support any future change in the EXPLAIN structure AND troubleshoot missing data for particular queries.

marshal / unmarshal many times in order to use simplejson
Even if the unit tests are the main thing to address, I would also like to point out that we could keep the dynamic approach (if modeling the results is not feasible / suitable) and avoid marshaling / unmarshaling many times. Marshaling each element in processSliceValue and processMapValue is merely needed to build the simplejson.Json struct (which will unmarshal the corresponding bytes). We could:
- Work with the Json object directly, which can be challenging (because the simplejson tool might not provide the helpers to perform complex actions, e.g. js.Map() returns map[string]interface{} and we don't have a helper returning map[string]*Json) but feasible (we could iterate over the results of js.Map() and obtain a Json struct through js.Get(key)).
- Use the interface{} representation of the data (and perform the corresponding type assertions to extract data) instead of using simplejson and its helpers.

Possible skipped values (depending on the expected behavior)
- The extractMetrics function calls processMap with the result of js.Map() if it doesn't fail. But if js holds a slice, js.Map() will return an error and the function ends. Could this be an issue?
- Something similar is happening in the implementation of processSliceValue: only map[string]interface{} elements are considered. Aren't slices of slices supposed to be supported?

By only reviewing the code it is not possible to know whether the potential issues described above could occur with known EXPLAIN outputs, but the current approach was claimed to be flexible.
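To illustrate the first bullet, a hedged sketch of traversing the plan without re-marshaling, using only helpers that simplejson already provides (walkPlan and its visit callback are hypothetical names, not code from this PR):

```go
// walkPlan visits every JSON object in the plan, descending into both maps and
// slices via js.Get / js.GetIndex, so no intermediate marshal/unmarshal is needed.
func walkPlan(js *simplejson.Json, visit func(node *simplejson.Json)) {
	if m, err := js.Map(); err == nil {
		visit(js)
		for key := range m {
			walkPlan(js.Get(key), visit)
		}
		return
	}
	if arr, err := js.Array(); err == nil {
		for i := range arr {
			walkPlan(js.GetIndex(i), visit)
		}
	}
}
```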
// ValidatePreconditions checks if the necessary preconditions are met for performance monitoring.
func ValidatePreconditions(db utils.DataSource) error {
	// Check if Performance Schema is enabled
	performanceSchemaEnabled, errPerformanceEnabled := isPerformanceSchemaEnabled(db)
Before checking whether the Performance Schema is enabled, don't you think we need to verify whether the MySQL server version is supported?
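A rough sketch of the suggested ordering; isVersionSupported, errUnsupportedVersion, and errPerformanceSchemaDisabled are hypothetical names used for illustration only.

```go
func ValidatePreconditions(db utils.DataSource) error {
	// Verify the MySQL server version is supported before any other check.
	supported, err := isVersionSupported(db)
	if err != nil {
		return err
	}
	if !supported {
		return errUnsupportedVersion
	}

	// Only then check whether the Performance Schema is enabled.
	performanceSchemaEnabled, err := isPerformanceSchemaEnabled(db)
	if err != nil {
		return err
	}
	if !performanceSchemaEnabled {
		return errPerformanceSchemaDisabled
	}
	return nil
}
```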
updated.
)

var (
	ErrCreateNodeEntity = errors.New("error creating node entity")
This is not being used
	}
}

func TestMetricSet(t *testing.T) {
Can we move this unit test to src/infrautils/entity_test.go?
Enhanced query performance monitoring features to track slow queries, examine query execution plans, monitor wait events, and identify blocking sessions.
Co-authored-by: Srikanth @RamanaReddy8801