Added Dashboard Group and Sample Detectors for Inferred Services #80
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| # Inferred Services - assets to help observing | ||
|
|
||
| 1. [Dashboard Group - Inferred Services](./Dashboard_Group_Inferred%20Services.json) | ||
|
|
||
| Feel free to also use | ||
|
|
||
| 2. [Sample Detectors: Latency Spike (>3s for 90% of 5min); Error Rate (>50%, sudden change)](../../detectors/inferred-services-detectors/README.md) | ||
|
|
||
| Learn more about Inferred Services: | ||
| - [What are Inferred Services](https://docs.splunk.com/observability/en/apm/apm-spans-traces/inferred-services.html) | ||
| - [Metrics available for Inferred Services](https://docs.splunk.com/observability/en/apm/span-tags/metricsets.html#available-default-mms-metrics-and-dimensions) | ||
|
|
||
| ## Inferred Services - Dashboard Group | ||
|
|
||
| 1. Import Dashboard Group | ||
| *From UI:* | ||
| Click on '+' on the top right and select Import->Dashboard Group. | ||
|
|
||
| 2. Find your dashboard group `Inferred Services` and use as a starting point to create charts. | ||
|
|
||
| Screenshot: | ||
|  | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,90 @@ | ||
| curl --location 'https://api.us1.signalfx.com/v2/detector' \ | ||
|
Collaborator
Please chmod the *.sh files to something like 766 so that anyone can read and write but only the owner can execute. Currently the script isn't executable without chmoding it. |
||
| --header 'Content-Type: application/json' \ | ||
| --header 'X-SF-TOKEN: REPLACEME' \ | ||
|
Collaborator
Might I suggest pulling from an environment variable in these scripts and noting the env var to use in the README, so folks don't have to put their key in a file? Thoughts? |
||
| --data '{ | ||
| "authorizedWriters": { | ||
| "teams": [], | ||
| "users": [] | ||
| }, | ||
| "customProperties": null, | ||
| "description": "", | ||
| "detectorOrigin": "Standard", | ||
| "labelResolutions": { | ||
| "Error rate >50%": 2000, | ||
| "Sudden change in Error rate for last 5min": 2000 | ||
| }, | ||
| "maxDelay": 0, | ||
| "minDelay": 0, | ||
| "name": "[sample] Inferred Services - Error Rate per minute", | ||
| "overMTSLimit": false, | ||
| "programText": "from signalfx.detectors.against_recent import against_recent\nA = histogram('\''inferred.services'\'', filter=filter('\''sf_service'\'', '\''*'\'') and filter('\''sf_environment'\'', '\''*'\'')).count(by=['\''sf_environment'\'', '\''sf_service'\'']).sum(over='\''1m'\'').publish(label='\''A'\'', enable=False)\nB = histogram('\''inferred.services'\'', filter=filter('\''sf_service'\'', '\''*'\'') and filter('\''sf_environment'\'', '\''*'\'') and filter('\''sf_error'\'', '\''false'\'')).count(by=['\''sf_environment'\'', '\''sf_service'\'']).sum(over='\''1m'\'').publish(label='\''B'\'', enable=False)\nC = histogram('\''inferred.services'\'', filter=filter('\''sf_service'\'', '\''*'\'') and filter('\''sf_environment'\'', '\''*'\'') and filter('\''sf_error'\'', '\''true'\'')).count(by=['\''sf_environment'\'', '\''sf_service'\'']).sum(over='\''1m'\'').publish(label='\''C'\'', enable=False)\nD = combine(100*((C if C is not None else 0) / A)).publish(label='\''D'\'')\ndetect(when(D > threshold(50), lasting='\''5m'\'', at_least=0.9), auto_resolve_after='\''30m'\'').publish('\''Error rate >50%'\'')\nagainst_recent.detector_mean_std(stream=D, current_window='\''5m'\'', historical_window='\''15m'\'', fire_num_stddev=3.5, clear_num_stddev=3, orientation='\''above'\'', ignore_extremes=True, calculation_mode='\''vanilla'\'').publish('\''Sudden change in Error rate for last 5min'\'')", | ||
| "rules": [ | ||
| { | ||
| "description": "The value of Error rate per min is above 50 for 90% of 5m.", | ||
| "detectLabel": "Error rate >50%", | ||
| "disabled": false, | ||
| "notifications": [], | ||
| "parameterizedBody": "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for Error rate per min: {{inputs.D.value}}\n{{else}}Current signal value for Error rate per min: {{inputs.D.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}", | ||
| "severity": "Major" | ||
| }, | ||
| { | ||
| "description": "All the values of Error rate per min in the last 5m are more than 3.5 standard deviation(s) above the mean of its preceding 15m.", | ||
| "detectLabel": "Sudden change in Error rate for last 5min", | ||
| "disabled": false, | ||
| "notifications": [], | ||
| "parameterizedBody": "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\nMinimum value of signal in the last {{event_annotations.current_window}}: {{inputs.recent_min.value}}\n{{#if anomalous}}Trigger threshold: {{inputs.f_top.value}}\n{{else}}Clear threshold: {{inputs.c_top.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}", | ||
| "severity": "Warning" | ||
| } | ||
| ], | ||
| "sf_metricsInObjectProgramText": [ | ||
| "inferred.services" | ||
| ], | ||
| "status": "ACTIVE", | ||
| "tags": [], | ||
| "teams": [], | ||
| "timezone": "", | ||
| "visualizationOptions": { | ||
| "disableSampling": false, | ||
| "publishLabelOptions": [ | ||
| { | ||
| "displayName": "Total Requests / min", | ||
| "label": "A", | ||
| "paletteIndex": 14, | ||
| "valuePrefix": "", | ||
| "valueSuffix": "", | ||
| "valueUnit": null | ||
| }, | ||
| { | ||
| "displayName": "Successful Requests / min", | ||
| "label": "B", | ||
| "paletteIndex": 6, | ||
| "valuePrefix": "", | ||
| "valueSuffix": "", | ||
| "valueUnit": null | ||
| }, | ||
| { | ||
| "displayName": "Errors / min", | ||
| "label": "C", | ||
| "paletteIndex": 8, | ||
| "valuePrefix": "", | ||
| "valueSuffix": "", | ||
| "valueUnit": null | ||
| }, | ||
| { | ||
| "displayName": "Error rate per min", | ||
| "label": "D", | ||
| "paletteIndex": null, | ||
| "valuePrefix": null, | ||
| "valueSuffix": "%", | ||
| "valueUnit": null | ||
| } | ||
| ], | ||
| "showDataMarkers": true, | ||
| "showEventLines": false, | ||
| "time": { | ||
| "range": 172800000, | ||
| "rangeEnd": 0, | ||
| "type": "relative" | ||
| } | ||
| } | ||
| }' | ||
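
Following the review suggestion above about reading the token from an environment variable, one possible shape is sketched below. It is a minimal sketch, not what the PR ships: `SFX_TOKEN` and `SFX_REALM` are assumed variable names, and `detector_url`, `require_token`, and `post_detector` are hypothetical helpers. It also assumes the JSON body has been moved out of the script into a separate file.

```shell
#!/bin/sh
# Sketch only. SFX_TOKEN and SFX_REALM are assumed variable names,
# and detector_url / require_token / post_detector are hypothetical helpers.

# Build the detector endpoint for a realm; defaults to us1 as in the script above.
detector_url() {
  printf 'https://api.%s.signalfx.com/v2/detector' "${1:-us1}"
}

# Return non-zero with a message on stderr when the token is missing.
require_token() {
  if [ -z "${SFX_TOKEN:-}" ]; then
    echo "SFX_TOKEN is not set" >&2
    return 1
  fi
}

# Post a detector JSON payload stored in a file ($1).
post_detector() {
  require_token || return 1
  curl --location "$(detector_url "${SFX_REALM:-us1}")" \
    --header 'Content-Type: application/json' \
    --header "X-SF-TOKEN: ${SFX_TOKEN}" \
    --data "@$1"
}
```

With the token exported in the shell session, a call such as `post_detector error_rate.json` (a hypothetical file holding the JSON body) would send the request without the key ever being written into the script.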
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,57 @@ | ||
| curl --location 'https://api.us1.signalfx.com/v2/detector' \ | ||
| --header 'Content-Type: application/json' \ | ||
| --header 'X-SF-TOKEN: REPLACEME' \ | ||
|
Collaborator
Same thought here re: env var. |
||
| --data '{ | ||
| "authorizedWriters": { | ||
| "teams": [], | ||
| "users": [] | ||
| }, | ||
| "customProperties": null, | ||
| "description": "", | ||
| "detectorOrigin": "Standard", | ||
| "labelResolutions": { | ||
| "Latency >3s": 1000 | ||
| }, | ||
| "maxDelay": 0, | ||
| "minDelay": 0, | ||
| "name": "[sample] Inferred Services - Latency Spike", | ||
| "overMTSLimit": false, | ||
| "programText": "AB = alerts(detector_name='\''[sample] Inferred Services - Latency Spike'\'').publish(label='\''AB'\'')\nA = histogram('\''inferred.services'\'', filter=filter('\''sf_service'\'', '\''*'\'') and filter('\''sf_environment'\'', '\''*'\'')).max(by=['\''sf_operation'\'', '\''sf_environment'\'', '\''sf_service'\'', '\''sf_error'\'']).mean(over='\''1m'\'').publish(label='\''A'\'')\ndetect(when(A > threshold(3000000000), lasting='\''5m'\'', at_least=0.9), auto_resolve_after='\''30m'\'').publish('\''Latency >3s'\'')", | ||
| "rules": [ | ||
| { | ||
| "description": "The value of Latency for Operation/Endpoint (1 min avg) is above 3000000000 for 90% of 5m.", | ||
| "detectLabel": "Latency >3s", | ||
| "disabled": false, | ||
| "notifications": [], | ||
| "parameterizedBody": "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for Latency for Operation/Endpoint (1 min avg): {{inputs.A.value}}\n\n{{else}}Current signal value for Latency for Operation/Endpoint (1 min avg): {{inputs.A.value}}\n{{/if}}\nService: {{dimensions.sf_service}}\nEnvironment: {{dimensions.sf_environment}}\nOperation: {{dimensions.sf_operation}}\nError: {{dimensions.sf_error}}\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}", | ||
| "severity": "Warning" | ||
| } | ||
| ], | ||
| "sf_metricsInObjectProgramText": [ | ||
| "inferred.services" | ||
| ], | ||
| "status": "ACTIVE", | ||
| "tags": [], | ||
| "teams": [], | ||
| "timezone": "", | ||
| "visualizationOptions": { | ||
| "disableSampling": true, | ||
| "publishLabelOptions": [ | ||
| { | ||
| "displayName": "Latency for Operation/Endpoint (1 min avg)", | ||
| "label": "A", | ||
| "paletteIndex": null, | ||
| "valuePrefix": null, | ||
| "valueSuffix": null, | ||
| "valueUnit": "Nanosecond" | ||
| } | ||
| ], | ||
| "showDataMarkers": true, | ||
| "showEventLines": false, | ||
| "time": { | ||
| "range": 900000, | ||
| "rangeEnd": 0, | ||
| "type": "relative" | ||
| } | ||
| } | ||
| }' | ||
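
The latency threshold in this detector is expressed in nanoseconds (the chart's `valueUnit` is `Nanosecond`), so `threshold(3000000000)` corresponds to the 3 s in the rule label. For anyone adjusting the threshold, a small conversion helper may help avoid miscounting zeros. This is a sketch only; `seconds_to_ns` is a hypothetical name, and it assumes a shell with 64-bit integer arithmetic and a whole number of seconds.

```shell
#!/bin/sh
# Sketch only: seconds_to_ns is a hypothetical helper, not part of this PR.
# Converts a whole number of seconds into nanoseconds, the unit used by
# the latency threshold in the detector's programText.
seconds_to_ns() {
  echo $(( $1 * 1000000000 ))
}
```

For example, `seconds_to_ns 3` yields the value currently passed to `threshold(...)`.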
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,24 @@ | ||
| # Inferred Services - assets to help observing | ||
|
|
||
| 1. [Detector: Latency Spike (>3s for 90% of 5min)](./POST_Detector_latency_spike.sh) | ||
|
|
||
| 2. [Detector: Error Rate (>50%, sudden change)](./POST_Detector_error_rate.sh) | ||
|
|
||
| Feel free to also use | ||
|
|
||
| 3. [Dashboard Group - Inferred Services](../../dashboards-and-dashboard-groups/inferred-services-dg/README.md) | ||
|
|
||
| Learn more about Inferred Services: | ||
| - [What are Inferred Services](https://docs.splunk.com/observability/en/apm/apm-spans-traces/inferred-services.html) | ||
| - [Metrics available for Inferred Services](https://docs.splunk.com/observability/en/apm/span-tags/metricsets.html#available-default-mms-metrics-and-dimensions) | ||
|
|
||
| ## Inferred Services - Sample Detectors | ||
|  | ||
|
|
||
| Use the curl command to post the detector (replace `Token` and `Realm` as required). | ||
|
Collaborator
Typo? |
||
|
|
||
| These can be used as a starting point to customise signals, thresholds, messaging etc. | ||
|
Collaborator
Would you mind noting something along the lines of Just so folks know they are going to need to configure a few things in particular? |
||
|
|
||
| Screenshots: | ||
|  | ||
|  | ||