Commit c8994ee

Added Dashboard Group and Sample Detectors for Inferred Services
1 parent 7997bb1 commit c8994ee

9 files changed

+1568
-0
lines changed

dashboards-and-dashboard-groups/inferred-services-dg/Dashboard_Group_Inferred Services.json

Lines changed: 1375 additions & 0 deletions
Large diffs are not rendered by default.

dashboards-and-dashboard-groups/inferred-services-dg/README.md

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
# Inferred Services - assets to help with observability

1. [Dashboard Group - Inferred Services](./Dashboard_Group_Inferred%20Services.json)

Feel free to also use

2. [Sample Detectors: Latency Spike (>3s for 90% of 5min); Error Rate (>50%, sudden change)](../../detectors/inferred-services-detectors/README.md)

Learn more about Inferred Services:
- [What are Inferred Services](https://docs.splunk.com/observability/en/apm/apm-spans-traces/inferred-services.html)
- [Metrics available for Inferred Services](https://docs.splunk.com/observability/en/apm/span-tags/metricsets.html#available-default-mms-metrics-and-dimensions)

## Inferred Services - Dashboard Group

1. Import Dashboard Group
*From UI:*
Click on '+' at the top right and select Import->Dashboard Group.

2. Find your dashboard group `Inferred Services` and use it as a starting point to create charts.

Screenshot:
![Dashboard Group 'Inferred Services'](./Inferred-services-DashboardGroup.png)

detectors/inferred-services-detectors/POST_Detector_error_rate.sh

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
curl --location 'https://api.us1.signalfx.com/v2/detector' \
--header 'Content-Type: application/json' \
--header 'X-SF-TOKEN: REPLACEME' \
--data '{
"authorizedWriters": {
"teams": [],
"users": []
},
"customProperties": null,
"description": "",
"detectorOrigin": "Standard",
"labelResolutions": {
"Error rate >50%": 2000,
"Sudden change in Error rate for last 5min": 2000
},
"maxDelay": 0,
"minDelay": 0,
"name": "[sample] Inferred Services - Error Rate per minute",
"overMTSLimit": false,
"programText": "from signalfx.detectors.against_recent import against_recent\nA = histogram('\''inferred.services'\'', filter=filter('\''sf_service'\'', '\''*'\'') and filter('\''sf_environment'\'', '\''*'\'')).count(by=['\''sf_environment'\'', '\''sf_service'\'']).sum(over='\''1m'\'').publish(label='\''A'\'', enable=False)\nB = histogram('\''inferred.services'\'', filter=filter('\''sf_service'\'', '\''*'\'') and filter('\''sf_environment'\'', '\''*'\'') and filter('\''sf_error'\'', '\''false'\'')).count(by=['\''sf_environment'\'', '\''sf_service'\'']).sum(over='\''1m'\'').publish(label='\''B'\'', enable=False)\nC = histogram('\''inferred.services'\'', filter=filter('\''sf_service'\'', '\''*'\'') and filter('\''sf_environment'\'', '\''*'\'') and filter('\''sf_error'\'', '\''true'\'')).count(by=['\''sf_environment'\'', '\''sf_service'\'']).sum(over='\''1m'\'').publish(label='\''C'\'', enable=False)\nD = combine(100*((C if C is not None else 0) / A)).publish(label='\''D'\'')\ndetect(when(D > threshold(50), lasting='\''5m'\'', at_least=0.9), auto_resolve_after='\''30m'\'').publish('\''Error rate >50%'\'')\nagainst_recent.detector_mean_std(stream=D, current_window='\''5m'\'', historical_window='\''15m'\'', fire_num_stddev=3.5, clear_num_stddev=3, orientation='\''above'\'', ignore_extremes=True, calculation_mode='\''vanilla'\'').publish('\''Sudden change in Error rate for last 5min'\'')",
"rules": [
{
"description": "The value of Error rate per min is above 50 for 90% of 5m.",
"detectLabel": "Error rate >50%",
"disabled": false,
"notifications": [],
"parameterizedBody": "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for Error rate per min: {{inputs.D.value}}\n{{else}}Current signal value for Error rate per min: {{inputs.D.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}",
"severity": "Major"
},
{
"description": "All the values of Error rate per min in the last 5m are more than 3.5 standard deviation(s) above the mean of its preceding 15m.",
"detectLabel": "Sudden change in Error rate for last 5min",
"disabled": false,
"notifications": [],
"parameterizedBody": "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\nMinimum value of signal in the last {{event_annotations.current_window}}: {{inputs.recent_min.value}}\n{{#if anomalous}}Trigger threshold: {{inputs.f_top.value}}\n{{else}}Clear threshold: {{inputs.c_top.value}}\n{{/if}}\n\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}",
"severity": "Warning"
}
],
"sf_metricsInObjectProgramText": [
"inferred.services"
],
"status": "ACTIVE",
"tags": [],
"teams": [],
"timezone": "",
"visualizationOptions": {
"disableSampling": false,
"publishLabelOptions": [
{
"displayName": "Total Requests / min",
"label": "A",
"paletteIndex": 14,
"valuePrefix": "",
"valueSuffix": "",
"valueUnit": null
},
{
"displayName": "Successful Requests / min",
"label": "B",
"paletteIndex": 6,
"valuePrefix": "",
"valueSuffix": "",
"valueUnit": null
},
{
"displayName": "Errors / min",
"label": "C",
"paletteIndex": 8,
"valuePrefix": "",
"valueSuffix": "",
"valueUnit": null
},
{
"displayName": "Error rate per min",
"label": "D",
"paletteIndex": null,
"valuePrefix": null,
"valueSuffix": "%",
"valueUnit": null
}
],
"showDataMarkers": true,
"showEventLines": false,
"time": {
"range": 172800000,
"rangeEnd": 0,
"type": "relative"
}
}
}'

detectors/inferred-services-detectors/POST_Detector_latency_spike.sh

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
curl --location 'https://api.us1.signalfx.com/v2/detector' \
--header 'Content-Type: application/json' \
--header 'X-SF-TOKEN: REPLACEME' \
--data '{
"authorizedWriters": {
"teams": [],
"users": []
},
"customProperties": null,
"description": "",
"detectorOrigin": "Standard",
"labelResolutions": {
"Latency >3s": 1000
},
"maxDelay": 0,
"minDelay": 0,
"name": "[sample] Inferred Services - Latency Spike",
"overMTSLimit": false,
"programText": "AB = alerts(detector_name='\''[sample] Inferred Services - Latency Spike'\'').publish(label='\''AB'\'')\nA = histogram('\''inferred.services'\'', filter=filter('\''sf_service'\'', '\''*'\'') and filter('\''sf_environment'\'', '\''*'\'')).max(by=['\''sf_operation'\'', '\''sf_environment'\'', '\''sf_service'\'', '\''sf_error'\'']).mean(over='\''1m'\'').publish(label='\''A'\'')\ndetect(when(A > threshold(3000000000), lasting='\''5m'\'', at_least=0.9), auto_resolve_after='\''30m'\'').publish('\''Latency >3s'\'')",
"rules": [
{
"description": "The value of Latency for Operation/Endpoint (1 min avg) is above 3000000000 for 90% of 5m.",
"detectLabel": "Latency >3s",
"disabled": false,
"notifications": [],
"parameterizedBody": "{{#if anomalous}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" triggered at {{dateTimeFormat timestamp format=\"full\"}}.\n{{else}}\n\tRule \"{{{ruleName}}}\" in detector \"{{{detectorName}}}\" cleared at {{dateTimeFormat timestamp format=\"full\"}}.\n{{/if}}\n\n{{#if anomalous}}\nTriggering condition: {{{readableRule}}}\n{{/if}}\n\n{{#if anomalous}}Signal value for Latency for Operation/Endpoint (1 min avg): {{inputs.A.value}}\n\n{{else}}Current signal value for Latency for Operation/Endpoint (1 min avg): {{inputs.A.value}}\n{{/if}}\nService: {{dimensions.sf_service}}\nEnvironment: {{dimensions.sf_environment}}\nOperation: {{dimensions.sf_operation}}\nError: {{dimensions.sf_error}}\n{{#notEmpty dimensions}}\nSignal details:\n{{{dimensions}}}\n{{/notEmpty}}\n\n{{#if anomalous}}\n{{#if runbookUrl}}Runbook: {{{runbookUrl}}}{{/if}}\n{{#if tip}}Tip: {{{tip}}}{{/if}}\n{{/if}}",
"severity": "Warning"
}
],
"sf_metricsInObjectProgramText": [
"inferred.services"
],
"status": "ACTIVE",
"tags": [],
"teams": [],
"timezone": "",
"visualizationOptions": {
"disableSampling": true,
"publishLabelOptions": [
{
"displayName": "Latency for Operation/Endpoint (1 min avg)",
"label": "A",
"paletteIndex": null,
"valuePrefix": null,
"valueSuffix": null,
"valueUnit": "Nanosecond"
}
],
"showDataMarkers": true,
"showEventLines": false,
"time": {
"range": 900000,
"rangeEnd": 0,
"type": "relative"
}
}
}'

detectors/inferred-services-detectors/README.md

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
# Inferred Services - assets to help with observability

1. [Detector: Latency Spike (>3s for 90% of 5min)](./POST_Detector_latency_spike.sh)

2. [Detector: Error Rate (>50%, sudden change)](./POST_Detector_error_rate.sh)

Feel free to also use

3. [Dashboard Group - Inferred Services](../../dashboards-and-dashboard-groups/inferred-services-dg/README.md)

Learn more about Inferred Services:
- [What are Inferred Services](https://docs.splunk.com/observability/en/apm/apm-spans-traces/inferred-services.html)
- [Metrics available for Inferred Services](https://docs.splunk.com/observability/en/apm/span-tags/metricsets.html#available-default-mms-metrics-and-dimensions)

## Inferred Services - Sample Detectors
![Sample Detectors for Latency and Error rate of Inferred Services](../../detectors/inferred-services-detectors/detectors-1.png)

Use the curl commands in the scripts above to post the detectors (replace the `REPLACEME` token value and the `us1` realm in the URL as required).
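
As a minimal sketch, the realm and token could be supplied from the environment rather than edited into each script. The variable names `SFX_REALM` and `SFX_TOKEN`, the helpers `detector_url` and `post_detector`, and the idea of keeping the JSON body in a separate file are assumptions made for illustration; none of them appear in the sample scripts.

```shell
#!/usr/bin/env bash

# Hypothetical helper: build the detector endpoint for a given realm.
# The URL shape mirrors the sample scripts, which post to
# https://api.us1.signalfx.com/v2/detector.
detector_url() {
  local realm="${1:-us1}"
  printf 'https://api.%s.signalfx.com/v2/detector\n' "$realm"
}

# Hypothetical wrapper: post a detector definition stored in a JSON file.
# SFX_REALM and SFX_TOKEN are assumed environment variables.
post_detector() {
  local body_file="$1"
  curl --location "$(detector_url "${SFX_REALM:-us1}")" \
    --header 'Content-Type: application/json' \
    --header "X-SF-TOKEN: ${SFX_TOKEN:?set SFX_TOKEN to an API access token}" \
    --data "@${body_file}"
}
```

With this in place, a call might look like `SFX_REALM=eu0 SFX_TOKEN=REPLACEME post_detector detector.json`, where `detector.json` is a hypothetical file holding the JSON passed to `--data` in the sample scripts. When copying that JSON into a file, the `'\''` sequences become plain single quotes, since they only exist to escape quotes inside the shell string.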

Use these as a starting point to customise signals, thresholds, and messaging.
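
One thing to watch when customising the latency threshold: the sample detector compares against `threshold(3000000000)` and labels the signal's unit as `Nanosecond`, so 3 s is written as 3000000000. The sketch below shows one way to compute and substitute a different value. The helper names `seconds_to_ns` and `set_latency_threshold` are hypothetical, and the conversion only handles whole seconds.

```shell
#!/usr/bin/env bash

# Hypothetical helper: convert whole seconds to nanoseconds,
# the unit the latency detector's threshold is expressed in.
seconds_to_ns() {
  echo $(( $1 * 1000000000 ))
}

# Hypothetical filter: read a detector script on stdin and replace the
# sample 3 s threshold with the given number of seconds.
set_latency_threshold() {
  sed "s/threshold(3000000000)/threshold($(seconds_to_ns "$1"))/"
}
```

For example, `set_latency_threshold 5 < POST_Detector_latency_spike.sh` would emit the script with a 5 s threshold. The substitution touches only the `threshold(...)` call; the rule description and the `Latency >3s` label still need to be updated by hand.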

Screenshots:
![Error Rate Detector](../../detectors/inferred-services-detectors/detectors-errors.png)
![Latency Spike Detector](../../detectors/inferred-services-detectors/detectors-latency.png)
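
For reference, the error-rate detector publishes signal `D` as `100 * C / A`, where `C` is the error count per minute (treated as 0 when there is no data) and `A` is the total request count per minute. The sketch below reproduces that arithmetic for a quick sanity check of the 50% threshold. The function name `error_rate_pct` is hypothetical, and it uses integer division, so unlike the detector's own calculation it truncates fractional percentages.

```shell
#!/usr/bin/env bash

# Hypothetical helper: error rate in percent, mirroring D = 100 * C / A.
# $1 = errors per minute (C; empty is treated as 0), $2 = total requests (A).
# Returns non-zero when there are no requests, since the rate is undefined.
error_rate_pct() {
  local errors="${1:-0}" total="$2"
  if [ "$total" -eq 0 ]; then
    return 1
  fi
  echo $(( 100 * errors / total ))
}
```

With 30 errors out of 50 requests in a minute, `error_rate_pct 30 50` gives 60, which is above the sample threshold of 50. The first rule only fires if that holds for at least 90% of a 5-minute window.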