Report request errors to gridscale #239

twiebe (Maintainer) · 2021-02-05T13:54:57Z

To improve stability of our services, we need to identify issues and bottlenecks first.

In the past, when encountering issues in API reliability that were temporary in nature, we have implemented retry mechanisms into gsclient-go and the gridscale-terraform-plugin where applicable to improve the stability of the overall process. Also, underlying issues have been identified and, where possible, solved accordingly.

These retry mechanisms have the side effect that, once in place, they mask underlying issues. To improve quality for everyone, we need to make those issues visible again.

This can be tackled from either the client side or the server side. For the latter, the APIs would provide metrics of request success. That does not make all issues visible, though: in front of the actual APIs there is an application server, in front of that a web server, then a reverse proxy, a firewall, a router, and so on. Issues in these layers, or during rolling deployments of the APIs themselves, would not be visible in API-level metrics. Instead, we want our SDKs to report errors to us, with the ability to opt out.

gsclient-go has been selected as the first prototype, mainly because it is used by Terraform as well. Terraform, in turn, uses higher concurrency than many other clients do and tends to trigger deficiencies in provisioning more readily.

The idea is to send these reports to a publicly available Sentry instance, covering networking errors (e.g. connection reset) and provisioning errors (e.g. storage creation error) alike. To investigate issues, I presume we need:

X-Request-Id, if available
Error message/classifier
Accessed resource type URL (e.g. /objects/servers) for aggregation
Access method (GET, POST, PATCH, DELETE, etc.) for aggregation
Time
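
To make the "resource type URL for aggregation" item concrete, here is a minimal Go sketch of how a full request URL might be reduced to something like /objects/servers before it is reported. The function name `ResourceType`, the example host, and the assumption that object IDs appear as canonical UUID path segments are all illustrative and not taken from gsclient-go.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// uuidSegment matches a path segment that is a canonical UUID.
// Assumption: object IDs show up in request paths in this form.
var uuidSegment = regexp.MustCompile(
	`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// ResourceType (hypothetical helper) reduces a request URL to its path
// with UUID segments removed. Host and query string are dropped entirely,
// so only the resource type remains for aggregation.
func ResourceType(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	var kept []string
	for _, seg := range strings.Split(u.Path, "/") {
		if seg == "" || uuidSegment.MatchString(seg) {
			continue // skip empty segments and object IDs
		}
		kept = append(kept, seg)
	}
	return "/" + strings.Join(kept, "/"), nil
}

func main() {
	rt, err := ResourceType(
		"https://api.example.com/objects/servers/690de890-13c0-4e76-8a01-e10ba8786e53?token=abc")
	if err != nil {
		panic(err)
	}
	fmt.Println(rt) // prints "/objects/servers"
}
```

Dropping IDs and query parameters also keeps object identifiers and anything passed in the query string out of the report, at the cost of losing per-object detail.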

These reports shall also be sent when a request is retried. They shall not block the SDK and, as such, shall be sent asynchronously. Users must be able to opt out.

What are your thoughts?

nvthongswansea (Maintainer) · 2021-02-05T18:11:06Z

I wonder if we should publish our Sentry information (like connection info, etc.).

bkircher · 2021-02-06T12:49:11Z

> What are your thoughts?

If the goal is to get backtraces from failed API requests, I think it should happen on the server side of the API, not on the client side. Reasons:

It is much simpler to implement. Adding Sentry to a couple of services on the backend can be done with only a few lines and can just be always on. Not much configuration handling is needed. No opt-in/opt-out skirmish.
It is more accurate, since events can be sent more reliably because network conditions on the server side are known. The stack traces also carry much more information (you would see where it failed, in what environment, et cetera). If you collect traces on the client, you essentially just get the status code, a short error message, and maybe a request ID (which you still have to map manually to traces collected on the server side).
It is safer for users. If we added this to the clients, we would have to provide users with a way to opt in or out, and we would have to document why we are doing telemetry and what data is collected. I guess we cannot guarantee with Sentry that no secrets are included in the collected data, and the potential that the data is personal data in the sense of GDPR would mean extra overhead on our side to clean up that information on the Sentry side before it is actually stored.

So really, tl;dr: if the goal is getting good traces on failed API requests, it is better done on the server side, where requests are actually processed, rather than in client libraries or applications.
