Commit 3869651

Merge pull request #72 from sentrysoftware/metricshub
MetricsHub - Hardware Dashboards
2 parents 2563296 + b6fb2b6 commit 3869651

27 files changed

+12175
-0
lines changed

dashboards-and-dashboard-groups/metricshub/MetricsHub.json

Lines changed: 10463 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
1+
# MetricsHub
2+
3+
## Overview
4+
5+
**MetricsHub** is a universal metrics collection agent designed for monitoring hardware components, system performance, and sustainability KPIs. It collects data from servers, storage systems, and network devices and pushes it to OpenTelemetry back-ends such as the Splunk Observability Cloud.
6+
7+
### Key Features
8+
9+
- **Remote Monitoring**: MetricsHub monitors thousands of systems remotely through protocols such as REST APIs, SNMP, WBEM, WMI, SSH, and IPMI.
10+
- **OpenTelemetry Integration**: MetricsHub acts as an OpenTelemetry agent, following its standards for easy integration with various observability platforms.
11+
- **Sustainability Metrics**: Track and report on energy usage and carbon footprint to optimize infrastructure efficiency.
12+
- **250+ Connectors**: Ready-to-use connectors for monitoring a wide variety of platforms. The MetricsHub agent is vendor-neutral and provides consistent coverage across manufacturers (Cisco, Dell EMC, Huawei, HP, IBM, Lenovo, Pure, and more).
13+
14+
### Dashboards
15+
16+
MetricsHub comes with pre-configured dashboards that visualize hardware and sustainability KPIs:
17+
18+
| Dashboard | Description |
19+
| --- | --- |
20+
| **Hardware - Main** | Overview of all monitored systems, focusing on key hardware and sustainability metrics. |
21+
| **Hardware - Site** | Metrics specific to a particular site (a data center or a server room) and its monitored hosts. |
22+
| **Hardware - Host** | Metrics associated with one *host* and its internal devices. |
23+
24+
## Setup
25+
26+
1. Follow the [installation instructions](https://metricshub.com/docs/latest/installation/index.html).
27+
2. Configure the OpenTelemetry Collector to export metrics to Splunk by editing `otel-config.yaml`:
28+
29+
```yaml
exporters:
  signalfx:
    # Access token to send data to SignalFx.
    access_token: <your token>
    # SignalFx realm where the data will be received.
    realm: <your realm>
```
37+
38+
Get more information about the [SignalFx Metrics Exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/signalfxexporter).
39+
40+
## Support
41+
42+
Subscribers to **MetricsHub** gain access to the **MetricsHub Support Desk**, which provides:
43+
44+
- Technical support
45+
- Patches and updates
46+
- Knowledge base access
47+
48+
Splunk does not provide support for these dashboards. Users should contact Sentry Software's support with any requests.
49+
50+
### Further Reading
51+
52+
For more information, visit the [MetricsHub](https://metricshub.com/) website.
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
1+
{
2+
"authorizedWriters": {
3+
"teams": [],
4+
"users": []
5+
},
6+
"created": 1727160940219,
7+
"creator": "GRtepaIAICg",
8+
"customProperties": null,
9+
"description": "",
10+
"detectorOrigin": "Standard",
11+
"id": "GYN3h5fAIBw",
12+
"labelResolutions": {
13+
"Hardware - Connector failed": 1000
14+
},
15+
"lastUpdated": 1727215761634,
16+
"lastUpdatedBy": "AAAAAAAAAAA",
17+
"maxDelay": null,
18+
"minDelay": null,
19+
"name": "Hardware - Connector failed",
20+
"overMTSLimit": false,
21+
"packageSpecifications": "",
22+
"programText": "A = data('metricshub.connector.status', filter=filter('state', 'failed'), rollup='max').publish(label='A')\ndetect(when(A > threshold(0))).publish('Hardware - Connector failed')",
23+
"rules": [
24+
{
25+
"description": "The value of metricshub.connector.status is above 0.",
26+
"detectLabel": "Hardware - Connector failed",
27+
"disabled": false,
28+
"notifications": [],
29+
"parameterizedBody": "{{#if anomalous}}\n## Failed connector\nAgent **{{dimensions.[agent.host.name]}}** is failing to use **{{dimensions.[name]}}** to monitor **{{dimensions.[host.name]}}** in **{{dimensions.site}}**.\n\n## Consequence\nAll of the components that were monitored through this connector can no longer be monitored.\n\n## Recommended action\nMake sure {{dimensions.[agent.host.name]}} can communicate with {{dimensions.[host.name]}} with the protocol used by {{dimensions.[name]}} and that the specified credentials in MetricsHub's configuration are valid.\n{{else}}\nRecovered monitoring with {{dimensions.[name]}} connector.\n{{/if}}\n\n###Device Details\n**Name:** {{dimensions.[name]}}\n**ID:** {{dimensions.id}}\n**Information:** {{dimensions.info}}",
30+
"parameterizedSubject": "Hardware - Failed connector on {{dimensions.[host.name]}}",
31+
"runbookUrl": "",
32+
"severity": "Major",
33+
"tip": ""
34+
}
35+
],
36+
"sf_metricsInObjectProgramText": [
37+
"metricshub.connector.status"
38+
],
39+
"status": "ACTIVE",
40+
"tags": [],
41+
"teams": [],
42+
"timezone": "",
43+
"visualizationOptions": {
44+
"disableSampling": false,
45+
"publishLabelOptions": [
46+
{
47+
"displayName": "metricshub.connector.status",
48+
"label": "A",
49+
"paletteIndex": null,
50+
"valuePrefix": null,
51+
"valueSuffix": null,
52+
"valueUnit": null
53+
}
54+
],
55+
"showDataMarkers": true,
56+
"showEventLines": false,
57+
"time": {
58+
"range": 86400000,
59+
"rangeEnd": 0,
60+
"type": "relative"
61+
}
62+
}
63+
}
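When reviewing a detector export such as the one above, it can help to pull out just the fields that drive alerting. The following is a minimal sketch, not part of the commit: `summarize_detector` and the trimmed `SAMPLE` document are hypothetical, and the code only assumes the field names visible in the export (`name`, `programText`, `rules`, `detectLabel`, `severity`, `disabled`).

```python
import json


def summarize_detector(detector_json: str) -> dict:
    """Return the fields most useful when reviewing a detector export."""
    detector = json.loads(detector_json)
    program = detector["programText"]
    return {
        "name": detector["name"],
        # One SignalFlow statement per line of programText.
        "program": program.split("\n"),
        "rules": [
            (rule["detectLabel"], rule["severity"], rule["disabled"])
            for rule in detector["rules"]
        ],
        # Each rule label should be published somewhere in the program text.
        "labels_published": all(
            rule["detectLabel"] in program for rule in detector["rules"]
        ),
    }


# Trimmed stand-in for the export above (hypothetical; only the fields used here).
SAMPLE = json.dumps({
    "name": "Hardware - Connector failed",
    "programText": (
        "A = data('metricshub.connector.status', filter=filter('state', 'failed'), "
        "rollup='max').publish(label='A')\n"
        "detect(when(A > threshold(0))).publish('Hardware - Connector failed')"
    ),
    "rules": [
        {
            "detectLabel": "Hardware - Connector failed",
            "severity": "Major",
            "disabled": False,
        }
    ],
})

summary = summarize_detector(SAMPLE)
print(summary["name"], summary["rules"])
```

The `labels_published` flag is a cheap consistency check: a rule whose `detectLabel` never appears in `programText` would have no `detect(...)` block to attach to.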
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
1+
{
2+
"authorizedWriters": {
3+
"teams": [],
4+
"users": []
5+
},
6+
"created": 1727101400682,
7+
"creator": "GRtepaIAICg",
8+
"customProperties": null,
9+
"description": "",
10+
"detectorOrigin": "Standard",
11+
"id": "GYKRRPGAAAA",
12+
"labelResolutions": {
13+
"Hardware - Critical LUN pathing issue": 180000
14+
},
15+
"lastUpdated": 1730952952374,
16+
"lastUpdatedBy": "AAAAAAAAAAA",
17+
"maxDelay": null,
18+
"minDelay": null,
19+
"name": "Hardware - Critical LUN pathing issue",
20+
"overMTSLimit": false,
21+
"packageSpecifications": "",
22+
"programText": "A = data('hw.lun.paths').publish(label='A')\ndetect(when(A < threshold(1))).publish('Hardware - Critical LUN pathing issue')",
23+
"rules": [
24+
{
25+
"description": "The value of hw.lun.paths is below 1.",
26+
"detectLabel": "Hardware - Critical LUN pathing issue",
27+
"disabled": false,
28+
"notifications": [],
29+
"parameterizedBody": "{{#if anomalous}}\n## Lost data access\nLUN **{{dimensions.[name]}}** is no longer available on **{{dimensions.[host.name]}}** in **{{dimensions.site}}**.\n\n## Consequence\nOne or more filesystems are no longer available (possible data loss).\n\n## Recommended action\nVerify the status of the underlying HBA and its connectivity. Verify the reachability of the storage system and whether any configuration change has been made to the corresponding storage volume.\n{{else}}\nRecovered available LUN paths.\n{{/if}}\n\n###Device Details\n**Name:** {{dimensions.[name]}}\n**ID:** {{dimensions.id}}\n**Information:** {{dimensions.info}}",
30+
"parameterizedSubject": "Critical LUN pathing issue",
31+
"runbookUrl": "",
32+
"severity": "Major",
33+
"tip": ""
34+
}
35+
],
36+
"sf_metricsInObjectProgramText": [
37+
"hw.lun.paths"
38+
],
39+
"status": "ACTIVE",
40+
"tags": [],
41+
"teams": [],
42+
"timezone": "",
43+
"visualizationOptions": {
44+
"disableSampling": false,
45+
"publishLabelOptions": [
46+
{
47+
"displayName": "hw.lun.paths",
48+
"label": "A",
49+
"paletteIndex": null,
50+
"valuePrefix": null,
51+
"valueSuffix": null,
52+
"valueUnit": null
53+
}
54+
],
55+
"showDataMarkers": true,
56+
"showEventLines": false,
57+
"time": {
58+
"range": 86400000,
59+
"rangeEnd": 0,
60+
"type": "relative"
61+
}
62+
}
63+
}
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
1+
{
2+
"authorizedWriters": {
3+
"teams": [],
4+
"users": []
5+
},
6+
"created": 1727160558136,
7+
"creator": "GRtepaIAICg",
8+
"customProperties": null,
9+
"description": "",
10+
"detectorOrigin": "Standard",
11+
"id": "GYN3fkqAAA0",
12+
"labelResolutions": {
13+
"Hardware - Critically high temperature": 1000
14+
},
15+
"lastUpdated": 1729857766543,
16+
"lastUpdatedBy": "AAAAAAAAAAA",
17+
"maxDelay": 0,
18+
"minDelay": 0,
19+
"name": "Hardware - Critically high temperature",
20+
"overMTSLimit": false,
21+
"packageSpecifications": "",
22+
"programText": "A = data('hw.temperature').publish(label='A', enable=False)\nB = data('hw.temperature.limit', filter=filter('limit_type', 'high.critical')).publish(label='B', enable=False)\nC = (A-B).publish(label='C')\ndetect(when(C > threshold(0))).publish('Hardware - Critically high temperature')",
23+
"rules": [
24+
{
25+
"description": "The value of A-B is above 0.",
26+
"detectLabel": "Hardware - Critically high temperature",
27+
"disabled": false,
28+
"notifications": [],
29+
"parameterizedBody": "{{#if anomalous}}\n## High temperature\nTemperature is critically high for **{{dimensions.[name]}}** on **{{dimensions.[host.name]}}** in **{{dimensions.site}}**.\n\n## Consequence\nAn out-of-range temperature may lead to a system crash or even damaged hardware.\n\n## Recommended action\nCheck why the temperature is out of the normal range (it may be due to a fan failure, a severe system overload or a failure in the data center cooling system).\n{{else}}\nRecovered temperature level.\n{{/if}}\n\n###Device Details\n**Name:** {{dimensions.[name]}}\n**ID:** {{dimensions.id}}\n**Information:** {{dimensions.info}}",
30+
"parameterizedSubject": "Hardware - Critically high temperature on {{dimensions.[host.name]}}",
31+
"severity": "Major"
32+
}
33+
],
34+
"sf_metricsInObjectProgramText": [
35+
"hw.temperature.limit",
36+
"hw.temperature"
37+
],
38+
"status": "ACTIVE",
39+
"tags": [],
40+
"teams": [],
41+
"timezone": "",
42+
"visualizationOptions": {
43+
"disableSampling": false,
44+
"publishLabelOptions": [
45+
{
46+
"displayName": "hw.temperature",
47+
"label": "A",
48+
"paletteIndex": null,
49+
"valuePrefix": null,
50+
"valueSuffix": null,
51+
"valueUnit": null
52+
},
53+
{
54+
"displayName": "hw.temperature.limit",
55+
"label": "B",
56+
"paletteIndex": null,
57+
"valuePrefix": null,
58+
"valueSuffix": null,
59+
"valueUnit": null
60+
},
61+
{
62+
"displayName": "A-B",
63+
"label": "C",
64+
"paletteIndex": null,
65+
"valuePrefix": null,
66+
"valueSuffix": null,
67+
"valueUnit": null
68+
}
69+
],
70+
"showDataMarkers": true,
71+
"showEventLines": false,
72+
"time": {
73+
"range": 86400000,
74+
"rangeEnd": 0,
75+
"type": "relative"
76+
}
77+
}
78+
}
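The `programText` of this detector subtracts the `high.critical` limit (B) from the measured temperature (A) and fires when the difference is positive. The sketch below restates that rule in plain Python for illustration only. The function name, the sensor keys, and the sample values are hypothetical, and it assumes readings and limits are matched per sensor and expressed in the same unit.

```python
def temperature_alerts(temperatures: dict, critical_limits: dict) -> dict:
    """Return sensors whose reading exceeds their high.critical limit.

    Both arguments map a sensor key to a temperature in the same unit.
    Sensors without a reported limit are skipped (assumption: no limit
    means there is nothing to compare against).
    """
    alerts = {}
    for sensor, reading in temperatures.items():
        limit = critical_limits.get(sensor)
        if limit is None:
            continue
        margin = reading - limit  # C = A - B
        if margin > 0:  # detect(when(C > threshold(0)))
            alerts[sensor] = margin
    return alerts


# Hypothetical sample data.
readings = {"cpu0": 92.0, "cpu1": 71.5, "ambient": 24.0}
limits = {"cpu0": 90.0, "cpu1": 95.0}

alerts = temperature_alerts(readings, limits)
print(alerts)
```

Comparing against a per-sensor limit rather than a fixed number lets one detector cover hardware with very different thermal tolerances.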
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
1+
{
2+
"authorizedWriters": {
3+
"teams": [],
4+
"users": []
5+
},
6+
"created": 1727098905351,
7+
"creator": "GRtepaIAICg",
8+
"customProperties": null,
9+
"description": "",
10+
"detectorOrigin": "Standard",
11+
"id": "GYKHAnMAAAI",
12+
"labelResolutions": {
13+
"Hardware - Critically low battery": 240000
14+
},
15+
"lastUpdated": 1727178855678,
16+
"lastUpdatedBy": "AAAAAAAAAAA",
17+
"maxDelay": null,
18+
"minDelay": null,
19+
"name": "Hardware - Critically low battery",
20+
"overMTSLimit": false,
21+
"packageSpecifications": "",
22+
"programText": "A = data('hw.battery.charge').publish(label='A', enable=False)\nB = (A*100).publish(label='B')\ndetect(when(B < threshold(30))).publish('Hardware - Critically low battery')",
23+
"rules": [
24+
{
25+
"description": "The value of A*100 is below 30.",
26+
"detectLabel": "Hardware - Critically low battery",
27+
"disabled": false,
28+
"notifications": [],
29+
"parameterizedBody": "{{#if anomalous}}\n###Low battery\nBattery **{{dimensions.[name]}}** charge is critically low on **{{dimensions.[host.name]}}** in **{{dimensions.site}}**.\n\n###Consequence\nA low charge battery may lead to data loss in case of a power outage.\n\n###Recommended action\nCheck why the battery is not fully charged (it may be due to a power outage or an unplugged power cable) and if necessary, replace the battery.\n{{else}}\nThe battery charge is back within the normal operational range.\n{{/if}}\n\n###Device Details\n**Name:** {{dimensions.[name]}}\n**ID:** {{dimensions.id}}\n**Vendor:** {{dimensions.vendor}}\n**Model:** {{dimensions.model}}\n**Serial Number:** {{dimensions.serial_number}}\n**Information:** {{dimensions.info}}",
30+
"runbookUrl": "",
31+
"severity": "Major",
32+
"tip": ""
33+
}
34+
],
35+
"sf_metricsInObjectProgramText": [
36+
"hw.battery.charge"
37+
],
38+
"status": "ACTIVE",
39+
"tags": [],
40+
"teams": [],
41+
"timezone": "",
42+
"visualizationOptions": {
43+
"disableSampling": false,
44+
"publishLabelOptions": [
45+
{
46+
"displayName": "hw.battery.charge",
47+
"label": "A",
48+
"paletteIndex": null,
49+
"valuePrefix": null,
50+
"valueSuffix": null,
51+
"valueUnit": null
52+
},
53+
{
54+
"displayName": "A*100",
55+
"label": "B",
56+
"paletteIndex": null,
57+
"valuePrefix": null,
58+
"valueSuffix": null,
59+
"valueUnit": null
60+
}
61+
],
62+
"showDataMarkers": true,
63+
"showEventLines": false,
64+
"time": {
65+
"range": 86400000,
66+
"rangeEnd": 0,
67+
"type": "relative"
68+
}
69+
}
70+
}
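This detector multiplies `hw.battery.charge` by 100 before comparing it to 30, which suggests the metric is reported as a ratio between 0 and 1 rather than as a percentage. A minimal Python restatement of that rule, under that assumption, follows. The function name and the default threshold parameter are illustrative, not part of the commit.

```python
def is_battery_critically_low(charge_ratio: float, threshold_percent: float = 30.0) -> bool:
    """Mirror the detector rule: B = A * 100, alert when B < 30.

    Assumes charge_ratio is a fraction between 0.0 and 1.0.
    """
    percent = charge_ratio * 100  # B = A * 100
    return percent < threshold_percent  # detect(when(B < threshold(30)))


print(is_battery_critically_low(0.25))  # 25% charge
print(is_battery_critically_low(0.5))   # 50% charge
```

If a backend already stored the charge as a percentage, the multiplication would push every value far above 30 and the detector would never fire, so the unit is worth confirming before reusing the threshold.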
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
1+
{
2+
"authorizedWriters": {
3+
"teams": [],
4+
"users": []
5+
},
6+
"created": 1727095467806,
7+
"creator": "GRtepaIAICg",
8+
"customProperties": null,
9+
"description": "",
10+
"detectorOrigin": "Standard",
11+
"id": "GYKAGp3AEAs",
12+
"labelResolutions": {
13+
"Hardware - Critically low fan speed (%)": 1000
14+
},
15+
"lastUpdated": 1729906605545,
16+
"lastUpdatedBy": "AAAAAAAAAAA",
17+
"maxDelay": 0,
18+
"minDelay": 0,
19+
"name": "Hardware - Critically low fan speed (%)",
20+
"overMTSLimit": false,
21+
"packageSpecifications": "",
22+
"programText": "A = data('hw.fan.speed').publish(label='A')\ndetect(when(A < threshold(1))).publish('Hardware - Critically low fan speed (%)')",
23+
"rules": [
24+
{
25+
"description": "The value of hw.fan.speed is below 1.",
26+
"detectLabel": "Hardware - Critically low fan speed (%)",
27+
"disabled": false,
28+
"notifications": [],
29+
"parameterizedBody": "{{#if anomalous}}\n###Low fan speed\nFan speed for **{{dimensions.[name]}}** is critically low on **{{dimensions.[host.name]}}** in **{{dimensions.site}}**.\n\n###Consequence\nThe temperature of the chip, component or device that was cooled down by this fan, may rise rapidly. This could lead to severe hardware damage and system crashes.\n\n###Recommended action\nCheck if the fan no longer cools down the system. If so, replace the fan.\n{{else}}\nRecovered fan speed.\n{{/if}}\n\n###Device Details\n**Name:** {{dimensions.[name]}}\n**ID:** {{dimensions.id}}\n**Information:** {{dimensions.info}}",
30+
"parameterizedSubject": "Critically low fan speed (%)",
31+
"severity": "Minor"
32+
}
33+
],
34+
"sf_metricsInObjectProgramText": [
35+
"hw.fan.speed"
36+
],
37+
"status": "ACTIVE",
38+
"tags": [],
39+
"teams": [],
40+
"timezone": "",
41+
"visualizationOptions": {
42+
"disableSampling": false,
43+
"publishLabelOptions": [
44+
{
45+
"displayName": "hw.fan.speed",
46+
"label": "A",
47+
"paletteIndex": null,
48+
"valuePrefix": null,
49+
"valueSuffix": null,
50+
"valueUnit": null
51+
}
52+
],
53+
"showDataMarkers": true,
54+
"showEventLines": false,
55+
"time": {
56+
"range": 86400000,
57+
"rangeEnd": 0,
58+
"type": "relative"
59+
}
60+
}
61+
}
