docs: Add ingestion troubleshooting topic #20092
base: main
Conversation
💻 Deploy preview available (docs: Add ingestion troubleshooting topic):
Force-pushed from 8a30170 to 2c51e88, then from 2c51e88 to c720f7b.
### Error: `per_stream_rate_limit`

**Error message:**

`Per stream rate limit exceeded (limit: <limit>/sec) while attempting to ingest for stream <stream_labels> totaling <bytes>, consider splitting a stream via additional labels or contact your Loki administrator to see if the limit can be increased`

**Cause:**

A single stream (unique combination of labels) is sending data faster than the per-stream rate limit. This protects ingesters from being overwhelmed by a single high-volume stream.

**Default configuration:**

- `per_stream_rate_limit`: 3 MB/sec
- `per_stream_rate_limit_burst`: 15 MB (5x the rate limit)

**Resolution:**

* **Split the stream** by adding more labels to distribute the load:

  ```yaml
  # Before: {job="app"}
  # After: {job="app", instance="host1"}
  ```

* **Increase per-stream limits** (with caution):

  ```yaml
  limits_config:
    per_stream_rate_limit: 5MB
    per_stream_rate_limit_burst: 20MB
  ```

  {{< admonition type="warning" >}}
  Do not set `per_stream_rate_limit` higher than 5MB or `per_stream_rate_limit_burst` higher than 20MB without careful consideration.
  {{< /admonition >}}

* **Use Alloy's rate limiting** to throttle specific streams:

  ```alloy
  stage.limit {
    rate  = 100
    burst = 200
  }
  ```

**Properties:**

- Enforced by: Ingester
- Retryable: Yes
- HTTP status: 429 Too Many Requests
- Configurable per tenant: Yes
Loki has automatic stream sharding, and I wonder if we enable this by default. If folks hit the per-stream rate limit, typically we would tune automatic stream sharding first and then potentially increase the limit second; we almost never ask anyone to resolve this manually anymore by adding labels themselves.
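For reference, enabling automatic stream sharding is a `limits_config` setting. A minimal sketch, assuming the `shard_streams` block (option names and defaults should be confirmed against the configuration reference for your Loki version):

```yaml
# Sketch only: let the distributor shard streams that exceed the desired
# per-stream rate instead of asking users to add labels by hand.
# Option names and defaults are assumptions; check your Loki version's docs.
limits_config:
  shard_streams:
    enabled: true       # automatically split streams that exceed desired_rate
    desired_rate: 3MB   # target rate per shard before another shard is added
```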
* **Adjust chunk idle period** to expire streams faster:

  ```yaml
  ingester:
    chunk_idle_period: 15m # Default: 30m
  ```

  {{< admonition type="warning" >}}
  Shorter idle periods create more chunks and may impact query performance.
  {{< /admonition >}}
Generally we wouldn't recommend this: fix labels or increase the limit, but leave `chunk_idle_period` alone.
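If the error being resolved here is the per-tenant active stream limit, a minimal sketch of raising that limit instead of shortening `chunk_idle_period`, assuming the `max_global_streams_per_user` option (an assumption; verify the option name and default for your version):

```yaml
# Sketch only: raise the per-tenant active stream limit rather than
# expiring streams faster. The option name and value are assumptions.
limits_config:
  max_global_streams_per_user: 10000
```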
* **Increase the line size limit** (not recommended above 256KB):

  ```yaml
  limits_config:
    max_line_size: 512KB
  ```

  {{< admonition type="warning" >}}
  Large line sizes impact performance and memory usage.
  {{< /admonition >}}
This is a limit we won't increase in Grafana Cloud, because doing so makes it very difficult to guarantee stability on the query path. I wonder if we should make the warning here more explicit:
Loki was built as a large multi-user, multi-tenant database, and as such this limit is very important for maintaining stability and performance when many users query the database simultaneously. We strongly recommend against increasing `max_line_size`; doing so makes it very difficult to provide consistent query performance and system stability without throwing extremely large amounts of memory at the problem and/or raising the gRPC message size limits to very high levels, both of which will likely lead to poorer performance and a worse experience.
* **Increase max_chunk_age** (global setting, affects all tenants):

  ```yaml
  ingester:
    max_chunk_age: 4h # Default: 2h
  ```

  {{< admonition type="warning" >}}
  Larger chunk age increases memory usage and may delay chunk flushing.
  {{< /admonition >}}
I would not recommend increasing the chunk age here. There is instead a newer feature, which I'm not sure how well documented it is, that enables time sharding on incoming streams.
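A possible sketch of that suggestion, assuming the `time_sharding_enabled` option under `shard_streams` that newer Loki releases expose; treat the option names as assumptions and verify them against the configuration reference before relying on this:

```yaml
# Sketch only: shard incoming streams by time so old or out-of-order
# entries are accepted without raising max_chunk_age. Option names are
# assumptions; confirm against your Loki version's configuration reference.
limits_config:
  shard_streams:
    enabled: true
    time_sharding_enabled: true
```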
* **Split high-volume streams** to reduce out-of-order conflicts:

  ```alloy
  // Add instance or host labels to distribute logs
  loki.source.file "logs" {
    targets = [
      {
        __path__ = "/var/log/*.log",
        job      = "app",
        instance = env("HOSTNAME"),
      },
    ]
    forward_to = [loki.write.default.receiver]
  }
  ```
I would remove this one.
* **Increase the grace period**:

  ```yaml
  limits_config:
    creation_grace_period: 15m
  ```
I would probably put a "not recommended" on this one, as I have no idea what would happen if someone were to do this at any kind of normal level. We never really test or have much understanding of what happens if you send future-dated logs.
- Combining related labels

{{< admonition type="warning" >}}
Do not increase `max_label_names_per_series` as high cardinality can lead to significant performance degradation.
I would probably reword this:
We strongly recommend against increasing this value; doing so creates a larger index, which hurts query performance, and opens the door to cardinality explosions.
You should be able to categorize your logs with 15 labels, and typically far fewer. In all the years we have run Loki, the number of valid exceptions we have seen can be counted on one hand versus the thousands of requests.
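As an illustration of trimming the label set at the collector rather than raising the limit, a minimal Alloy sketch using `loki.process` with `stage.label_drop`; the component name and label values here are hypothetical:

```alloy
// Sketch only: drop redundant, high-cardinality labels before sending to
// Loki so each stream stays well under the label-count limit.
// The component name and label values below are illustrative.
loki.process "reduce_labels" {
  forward_to = [loki.write.default.receiver]

  stage.label_drop {
    values = ["pod_template_hash", "request_id"]
  }
}
```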
* **Increase the limit** (not recommended):

  ```yaml
  limits_config:
    max_label_name_length: 2048
  ```
I would probably remove this and just leave it at "shorten label names". There is never any circumstance where you need a label name longer than 2048 characters.
* **Increase the limit**:

  ```yaml
  limits_config:
    max_label_value_length: 4096
  ```
I would also remove this; there is no circumstance where anyone should need a label value longer than 4096 characters.
* **Reduce batch size** in your ingestion client:

  ```alloy
  loki.write "default" {
    endpoint {
      url = "http://loki:3100/loki/api/v1/push"

      batch_size = "512KiB" // reduced from the 1MiB default
      batch_wait = "1s"
    }
  }
  ```
I couldn't imagine ever hitting this limit with the default 1MB batch in Alloy, and we often recommend a larger batch, like 4MB, so I wonder if we should remove this example; I don't think anyone should be using a batch smaller than 1MB.
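For contrast, a sketch of the larger batch suggested above; the values are examples and the `batch_size` syntax should be confirmed against the Alloy `loki.write` reference:

```alloy
// Sketch only: use a batch larger than the 1MiB default, as suggested in
// the review, instead of shrinking it. Values are illustrative.
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"

    batch_size = "4MiB" // larger batches reduce per-request overhead
    batch_wait = "1s"
  }
}
```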
* **DynamoDB errors:**
  - `ProvisionedThroughputExceededException`: Write capacity exceeded
  - `ValidationException`: Invalid data format

* **Cassandra errors:**
  - `NoConnectionsAvailable`: Cannot connect to Cassandra cluster
We should remove these; these store types have been deprecated for quite some time now.
* **Manual recovery** (if automatic recovery fails):
  - Stop the ingester
  - Back up the WAL directory
I wonder if we should remove this backup line, since we don't really have any instructions on what you would do with the backup.
* **Verify Loki is running** and healthy.
* **Check network connectivity** between client and Loki.
* **Confirm the correct hostname and port** configuration.
* **Review firewall and security group** settings.
Might also add a check for CPU starvation, as a Loki pod that's overwhelmed will also potentially fail to respond to requests.
What this PR does / why we need it:
Related PR: #20182
Special notes for your reviewer:
AI authored using VSCode + Claude Sonnet 4 and Cursor + Claude Sonnet 4.5. Two different drafts were generated, then merged and standardized.
Prompts:
This topic is for OSS Loki, but I'd love to reuse it for Cloud Logs. I could use some help flagging which errors would require contacting Grafana Support. And also if any of the errors are "self-healing" in Cloud due to autoscaling, etc.
Checklist
- Reviewed the `CONTRIBUTING.md` guide (required)
- `feat` PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
- Changes that require user attention or interaction to upgrade are documented in `docs/sources/setup/upgrade/_index.md`.
- If the change is deprecating or removing a configuration option, update the `deprecated-config.yaml` and `deleted-config.yaml` files respectively in the `tools/deprecated-config-checker` directory. Example PR