
Commit ba91d35

Add Event Hubs Troubleshooting Guide (#24652)
1 parent 3827264 commit ba91d35

File tree

6 files changed

+236
-7
lines changed


sdk/messaging/azeventhubs/README.md

Lines changed: 5 additions & 0 deletions
@@ -41,6 +41,8 @@ You can also create a client using a connection string.
4141
- ConsumerClient: [link](https://aka.ms/azsdk/go/eventhubs/pkg#example-NewConsumerClient)
4242
- ProducerClient: [link](https://aka.ms/azsdk/go/eventhubs/pkg#example-NewProducerClient)
4343

44+
For Event Hubs roles, see [Built-in roles for Azure Event Hubs](https://learn.microsoft.com/azure/event-hubs/authenticate-application#built-in-roles-for-azure-event-hubs).
45+
4446
#### Using a connection string
4547
- ConsumerClient: [link](https://aka.ms/azsdk/go/eventhubs/pkg#example-NewConsumerClientFromConnectionString)
4648
- ProducerClient: [link](https://aka.ms/azsdk/go/eventhubs/pkg#example-NewProducerClientFromConnectionString)
@@ -63,6 +65,8 @@ Examples for various scenarios can be found on [pkg.go.dev](https://aka.ms/azsdk
6365

6466
# Troubleshooting
6567

68+
For detailed troubleshooting information, refer to the [Event Hubs Troubleshooting Guide][eventhubs_troubleshooting].
69+
6670
### Logging
6771

6872
This module uses the classification-based logging implementation in `azcore`. To enable console logging for all SDK modules, set the environment variable `AZURE_SDK_GO_LOGGING` to `all`.
@@ -129,6 +133,7 @@ Azure SDK for Go is licensed under the [MIT](https://github.com/Azure/azure-sdk-
129133

130134
[azure_identity_pkg]: https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/azidentity
131135
[default_azure_credential]: https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/azidentity#NewDefaultAzureCredential
136+
[eventhubs_troubleshooting]: https://github.com/Azure/azure-sdk-for-go/blob/main/sdk/messaging/azeventhubs/TROUBLESHOOTING.md
132137
[source]: https://github.com/Azure/azure-sdk-for-go/tree/main/sdk/messaging/azeventhubs
133138
[godoc]: https://aka.ms/azsdk/go/eventhubs/pkg
134139
[godoc_examples]: https://aka.ms/azsdk/go/eventhubs/pkg#pkg-examples
Lines changed: 188 additions & 0 deletions
@@ -0,0 +1,188 @@
1+
# Troubleshooting Azure Event Hubs module issues
2+
3+
This troubleshooting guide contains instructions to diagnose frequently encountered issues while using the Azure Event Hubs module for Go.
4+
5+
## Table of contents
6+
7+
- [General Troubleshooting](#general-troubleshooting)
8+
- [Error Handling](#error-handling)
9+
- [Logging](#logging)
10+
- [Common Error Scenarios](#common-error-scenarios)
11+
- [Unauthorized Access Errors](#unauthorized-access-errors)
12+
- [Connection Lost Errors](#connection-lost-errors)
13+
- [Ownership Lost Errors](#ownership-lost-errors)
14+
- [Performance Considerations](#performance-considerations)
15+
- [Connectivity Issues](#connectivity-issues)
16+
- [Enterprise Environments and Firewalls](#enterprise-environments-and-firewalls)
17+
- [Advanced Troubleshooting](#advanced-troubleshooting)
18+
- [Logs to collect](#logs-to-collect)
19+
- [Interpreting Logs](#interpreting-logs)
20+
- [Additional Resources](#additional-resources)
21+
- [Filing GitHub Issues](#filing-github-issues)
22+
23+
## General Troubleshooting
24+
25+
### Error Handling
26+
27+
azeventhubs can return two types of errors: `azeventhubs.Error`, which contains a code you can use programmatically, and plain `error` values, which contain only an error message.
28+
29+
Here's an example of how to check the `Code` from an `azeventhubs.Error`:
30+
31+
```go
32+
if err != nil {
33+
var azehErr *azeventhubs.Error
34+
35+
if errors.As(err, &azehErr) {
36+
switch azehErr.Code {
37+
case azeventhubs.ErrorCodeUnauthorizedAccess:
38+
// Handle authentication errors
39+
case azeventhubs.ErrorCodeConnectionLost:
40+
// This error is only returned if all configured retries have been exhausted.
41+
// An example of configuring retries can be found here: https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/messaging/azeventhubs/v2#example-NewConsumerClient-ConfiguringRetries
42+
}
43+
}
44+
45+
// Handle other error types
46+
}
47+
```
48+
49+
### Logging
50+
51+
Event Hubs uses the classification-based logging implementation in `azcore`. You can enable logging for all Azure SDK modules by setting the environment variable `AZURE_SDK_GO_LOGGING` to `all`.
52+
53+
For more fine-grained control, use the `azcore/log` package to enable specific log events:
54+
55+
```go
56+
import (
57+
"fmt"
58+
azlog "github.com/Azure/azure-sdk-for-go/sdk/azcore/log"
59+
"github.com/Azure/azure-sdk-for-go/sdk/messaging/azeventhubs/v2"
60+
)
61+
62+
// Print log output to stdout
63+
azlog.SetListener(func(event azlog.Event, s string) {
64+
fmt.Printf("[%s] %s\n", event, s)
65+
})
66+
67+
// Enable specific event types
68+
azlog.SetEvents(
69+
azeventhubs.EventConn, // Connection-related events
70+
azeventhubs.EventAuth, // Authentication events
71+
azeventhubs.EventProducer, // Producer operations
72+
azeventhubs.EventConsumer, // Consumer operations
73+
)
74+
```
75+
76+
## Common Error Scenarios
77+
78+
### Unauthorized Access Errors
79+
80+
If you receive an `ErrorCodeUnauthorizedAccess` error, it means the credentials provided are not valid for use with a particular entity, or they have expired.
81+
82+
**Common causes and solutions:**
83+
84+
- **Expired credentials**: If using SAS tokens, they expire after a certain duration. Generate a new token or use a credential that automatically refreshes, like one of the TokenCredential types from the [Azure Identity module][azidentity_tokencredentials].
85+
- **Missing permissions**: Ensure the identity you're using has the correct role assigned from the [built-in roles for Azure Event Hubs](https://learn.microsoft.com/azure/event-hubs/authenticate-application#built-in-roles-for-azure-event-hubs).
86+
- **Incorrect entity name**: Verify that the Event Hub name, consumer group, or namespace name is spelled correctly.
87+
88+
For more help with troubleshooting authentication errors when using Azure Identity, see the Azure Identity client library [troubleshooting guide][azidentity_troubleshooting].
89+
90+
### Connection Lost Errors
91+
92+
An `azeventhubs.ErrorCodeConnectionLost` error indicates that the connection was lost and all retry attempts failed. This typically reflects an extended outage or connection disruption.
93+
94+
**Common causes and solutions:**
95+
96+
- **Network instability**: Check your network connection and try again after ensuring stability.
97+
- **Service outage**: Check the [Azure status page](https://status.azure.com) for any ongoing Event Hubs outages.
98+
- **Firewall or proxy issues**: Ensure firewall rules aren't blocking the connection.
99+
100+
### Ownership Lost Errors
101+
102+
An `azeventhubs.ErrorCodeOwnershipLost` error occurs when a partition that you were reading from was opened by another link with a higher epoch/owner level.
103+
104+
* If you're using the `azeventhubs.Processor`, you will occasionally see this error while the individual Processors are allocating partition ownerships. This is expected, and the Processors handle the error internally.
105+
* If you're NOT using the Processor, this indicates you have two PartitionClient instances, both of which are using the same consumer group, opening the same partition, but with different owner levels.
106+
107+
### Performance Considerations
108+
109+
**If the processor can't keep up with event flow:**
110+
111+
1. **Increase processor instances**: Add more Processor instances to distribute the load. The number of Processor instances cannot exceed the number of partitions for your Event Hub.
112+
2. **Increase Event Hubs partitions**: Consider creating an Event Hub with more partitions, to allow for more parallel consumers. NOTE: requires a new Event Hub.
113+
3. **Call `ProcessorPartitionClient.UpdateCheckpoint` less often**: some alternate strategies:
114+
- Call only after a requisite number of events has been received
115+
- Call only after a certain amount of time has expired.
116+
117+
## Connectivity Issues
118+
119+
### Enterprise Environments and Firewalls
120+
121+
In corporate networks with strict firewall rules, you may encounter connectivity issues when connecting to Event Hubs.
122+
123+
**Common solutions:**
124+
125+
1. **Allow the necessary endpoints**: See [Event Hubs FAQ: "What ports do I need to open on the firewall?"][eventhubs_faq_ports].
126+
2. **Use a proxy**: If you require a proxy to connect to Azure resources, you can configure your client to use it: [Example using a proxy and/or Websockets][example_proxy_websockets].
127+
3. **Use WebSockets**: If you can only connect to Azure resources using HTTPS (port 443), you can configure your client to use WebSockets. See this example for how to enable WebSockets with Event Hubs: [Example using a proxy and/or Websockets][example_proxy_websockets].
128+
4. **Configure network security rules**: If using Azure VNet integration, configure service endpoints or private endpoints.
129+
130+
## Advanced Troubleshooting
131+
132+
### Logs to collect
133+
134+
When troubleshooting issues with Event Hubs that you need to escalate to support or report in GitHub issues, collect the following logs:
135+
136+
1. **Enable debug logging**: To enable logs, see [logging](#logging).
137+
2. **Timeframe**: Capture logs from at least 5 minutes before until 5 minutes after the issue occurs.
138+
3. **Include timestamps**: Ensure your logging setup includes timestamps. By default, logging enabled through `AZURE_SDK_GO_LOGGING` includes timestamps.
139+
140+
### Interpreting Logs
141+
142+
When analyzing Event Hubs logs:
143+
144+
1. **Connection errors**: Look for AMQP connection and link errors in `EventConn` logs
145+
2. **Authentication failures**: Check `EventAuth` logs for credential or authorization failures
146+
3. **Producer errors**: `EventProducer` logs show message send operations and errors
147+
4. **Consumer errors**: `EventConsumer` logs show message receive operations and partition ownership changes
148+
5. **Load balancing**: Look for ownership claims and changes in `EventConsumer` logs
149+
150+
### Additional Resources
151+
152+
- [Event Hubs Documentation](https://learn.microsoft.com/azure/event-hubs/)
153+
- [Event Hubs Pricing](https://azure.microsoft.com/pricing/details/event-hubs/)
154+
- [Event Hubs Quotas](https://learn.microsoft.com/azure/event-hubs/event-hubs-quotas)
155+
- [Event Hubs FAQ](https://learn.microsoft.com/azure/event-hubs/event-hubs-faq)
156+
157+
### Filing GitHub Issues
158+
159+
To file an issue in GitHub, use this [link](https://github.com/Azure/azure-sdk-for-go/issues/new/choose) and include the following information:
160+
161+
1. **Event Hub details**:
162+
- How many partitions?
163+
- What tier (Standard/Premium/Dedicated)?
164+
165+
2. **Client environment**:
166+
- Machine specifications
167+
- Number of client instances running
168+
- Go version
169+
170+
3. **Message patterns**:
171+
- Average message size
172+
- Throughput (messages per second)
173+
- Whether traffic is consistent or bursty
174+
175+
4. **Reproduction steps**:
176+
- A minimal code example that reproduces the issue
177+
- Steps to reproduce the problem
178+
179+
5. **Logs**:
180+
- Include diagnostic logs from before, during and after the failure. For instructions on enabling logging, see the [Logs to collect](#logs-to-collect) section above.
181+
- **NOTE**: the information in GitHub issues and logs is publicly viewable. Please keep this in mind when posting any information.
182+
183+
<!-- LINKS -->
184+
[azidentity_troubleshooting]: https://github.com/Azure/azure-sdk-for-go/blob/main/sdk/azidentity/TROUBLESHOOTING.md
185+
[amqp_errors]: https://learn.microsoft.com/azure/event-hubs/event-hubs-amqp-troubleshoot
186+
[azidentity_tokencredentials]: https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/azidentity#readme-credential-chains
187+
[eventhubs_faq_ports]: https://learn.microsoft.com/azure/event-hubs/event-hubs-faq#what-ports-do-i-need-to-open-on-the-firewall
188+
[example_proxy_websockets]: https://github.com/Azure/azure-sdk-for-go/blob/main/sdk/messaging/azeventhubs/example_websockets_and_proxies_test.go
Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
# Troubleshooting Azure Event Hubs Checkpoints
2+
3+
The troubleshooting guide for Azure Event Hubs Checkpoints can be found in the main Event Hubs troubleshooting guide:
4+
5+
[Azure Event Hubs Troubleshooting Guide](https://github.com/Azure/azure-sdk-for-go/blob/main/sdk/messaging/azeventhubs/TROUBLESHOOTING.md)
6+
7+
For specific information on checkpoint store issues, refer to the [Checkpoint Store Problems](https://github.com/Azure/azure-sdk-for-go/blob/main/sdk/messaging/azeventhubs/TROUBLESHOOTING.md#checkpoint-store-problems) section.

sdk/messaging/azeventhubs/example_consumerclient_test.go

Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -170,3 +170,28 @@ func ExampleNewConsumerClient_usingCustomEndpoint() {
170170
panic(err)
171171
}
172172
}
173+
174+
func ExampleNewConsumerClient_configuringRetries() {
175+
// `DefaultAzureCredential` tries several common credential types. For more credential types
176+
// see this link: https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/azidentity#readme-credential-types.
177+
defaultAzureCred, err := azidentity.NewDefaultAzureCredential(nil)
178+
179+
if err != nil {
180+
panic(err)
181+
}
182+
183+
consumerClient, err = azeventhubs.NewConsumerClient("<ex: myeventhubnamespace.servicebus.windows.net>", "eventhub-name", azeventhubs.DefaultConsumerGroup, defaultAzureCred, &azeventhubs.ConsumerClientOptions{
184+
RetryOptions: azeventhubs.RetryOptions{
185+
// NOTE: these are the default values.
186+
MaxRetries: 3,
187+
RetryDelay: time.Second,
188+
MaxRetryDelay: 120 * time.Second,
189+
},
190+
})
191+
192+
if err != nil {
193+
// TODO: Update the following line with your application specific error handling logic
194+
fmt.Printf("ERROR: %s\n", err)
195+
return
196+
}
197+
}

sdk/messaging/azeventhubs/example_websockets_and_proxies_test.go

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,7 @@ import (
1515
"github.com/coder/websocket"
1616
)
1717

18-
func ExampleNewClient_usingWebsocketsAndProxies() {
18+
func Example_usingWebsocketsAndProxies() {
1919
eventHubNamespace := os.Getenv("EVENTHUB_NAMESPACE") // <ex: myeventhubnamespace.servicebus.windows.net>
2020
eventHubName := os.Getenv("EVENTHUB_NAME")
2121

@@ -56,12 +56,14 @@ func ExampleNewClient_usingWebsocketsAndProxies() {
5656
log.Fatalf("ERROR: %s", err)
5757
}
5858

59-
// NOTE: For users of `nhooyr.io/websocket` there's an open discussion here:
60-
// https://github.com/nhooyr/websocket/discussions/380
59+
// NOTE: For users of `coder/websocket` there's an open discussion here:
60+
// https://github.com/coder/websocket/issues/520
6161
//
6262
// An error ("failed to read frame header: EOF") can be returned when the
6363
// websocket connection is closed. This error will be returned from the
6464
// `ConsumerClient.Close` or `ProducerClient.Close` functions and can be
6565
// ignored, as the websocket "close handshake" has already completed.
6666
defer consumerClient.Close(context.TODO())
6767
}
68+
69+
var _ any // (ignore, used for docs)

sdk/messaging/azeventhubs/processor.go

Lines changed: 6 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -26,13 +26,15 @@ var processorOwnerLevel = to.Ptr[int64](0)
2626
type ProcessorStrategy string
2727

2828
const (
29-
// ProcessorStrategyBalanced will attempt to claim a single partition at a time, until each active
30-
// owner has an equal share of partitions.
29+
// ProcessorStrategyBalanced will attempt to claim a single partition during each update interval, until
30+
// each active owner has an equal share of partitions. It can take longer for Processors to acquire their
31+
// full share of partitions, but minimizes partition swapping.
3132
// This is the default strategy.
3233
ProcessorStrategyBalanced ProcessorStrategy = "balanced"
3334

34-
// ProcessorStrategyGreedy will attempt to claim as many partitions at a time as it can, ignoring
35-
// balance.
35+
// ProcessorStrategyGreedy will attempt to claim all partitions it can during each update interval, respecting
36+
// balance. This can lead to more partition swapping, as Processors steal partitions to get to their fair share,
37+
// but can speed up initial startup.
3638
ProcessorStrategyGreedy ProcessorStrategy = "greedy"
3739
)
3840
