-
Has anyone had similar experiences? Maybe I'm doing something wrong, but the cluster has around 4-5 higher-spec nodes, including the control node.
-
How often are your services down? Do you have the Grafana stack installed?
My services are down at least once per month, sometimes clearly because of Hetzner, but most often because the nodes are cordoned.
I have to uncordon them manually, and it even happens in a brand-new cluster.
I don't know why, maybe because of the auto-update.
The whole experience is quite frustrating.
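For now I just clear the cordons by hand. A minimal sketch of what I run, assuming the official `kubernetes` Python client and a working kubeconfig (this is just the equivalent of `kubectl uncordon <node>`):

```python
from kubernetes import client, config

# Assumes a working kubeconfig for the cluster (same one kubectl uses)
config.load_kube_config()
v1 = client.CoreV1Api()

# A cordoned node is simply one with spec.unschedulable set to True
for node in v1.list_node().items:
    if node.spec.unschedulable:
        print(f"uncordoning {node.metadata.name}")
        # Patch spec.unschedulable back to False, i.e. kubectl uncordon
        v1.patch_node(node.metadata.name, {"spec": {"unschedulable": False}})
```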
Here is a summary of error/warning events within my cluster, which has been running since 06/23.
Consolidated List of Identified Issues (23.11.23):

Relevant:
- Volume Attachment and Mounting Issues Attributed to Hetzner
- Node Availability and Scheduling Problems
- System Upgrade Normal Events: repeated events around `k3s-agent` suggest possible issues causing repeated upgrade attempts.
- Job Execution Exceeding Deadline in the `system-upgrade` Namespace: `apply-k3s-agent-on-k3s-watchdog-agent-large-kkf-with-022e-73a05` exceeding its deadline indicates performance issues or misconfiguration.

Not relevant:
- Frequent Normal Events in the `trivy-system` Namespace: `scan-vulnerabilityreport` objects suggest active scanning processes.
- Regular Activity in the `watchdog` Namespace: `prometheus-grafana-stack` and `grafana-stack-kube-prometh-operator` indicate regular monitoring activities.
- Trivy System Vulnerability Scan Issues: events in `trivy-system` indicate potential issues with the scanning process; review the `scan-vulnerabilityreport` objects and validate the Trivy configuration.

Additional Considerations:
This summary encompasses the issues identified in the provided logs, addressing general Kubernetes cluster concerns, specific volume-related problems, and the activity within the `trivy-system` and `watchdog` namespaces.
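In case it is useful, this is roughly how I pull the events behind that summary. Again just a sketch assuming the official `kubernetes` Python client; the namespace names are the ones from my own cluster:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Namespaces from my cluster; adjust to whatever you run
namespaces = ["system-upgrade", "trivy-system", "watchdog"]

for ns in namespaces:
    # Only Warning events; drop the field_selector to also see Normal ones
    events = v1.list_namespaced_event(ns, field_selector="type=Warning")
    for ev in events.items:
        print(f"{ns}: {ev.reason}: {ev.message}")
```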