
Conversation

Contributor

@JStickler commented Dec 2, 2025

What this PR does / why we need it:

  • Creates a new folder for troubleshooting in the TOC and renames the current troubleshooting topic.
  • Adds a new topic, Troubleshooting Ingest.
  • Adds recommended production limits to the Configuration Best Practices topic.

Related PR: #20182

Special notes for your reviewer:

AI authored using VSCode + Claude Sonnet 4 and Cursor + Claude Sonnet 4.5. Two different drafts were generated, then merged and standardized.

Prompts:

  • Acting as an experienced technical writer with knowledge of Loki, write a troubleshooting guide for ingesting logs into Loki. Using error messages in the code base and defaults listed in the configuration documentation, document Loki’s error messages, the causes, and what users should do when encountering errors ingesting logs and writing logs to storage.
  • Revise and replace Promtail configuration with Alloy configuration.
  • Using the following structure: Error message, Cause, Default configuration, Resolution, Properties, fill in any missing details in troubleshooting-ingest.md.
  • Add context to the section "Monitoring and alerting". What is the file name where users should include this code example?

This topic is for OSS Loki, but I'd love to reuse it for Cloud Logs. I could use some help flagging which errors would require contacting Grafana Support, and also whether any of the errors are "self-healing" in Cloud due to autoscaling, etc.

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

@JStickler requested a review from a team as a code owner on December 2, 2025 at 22:16
@JStickler added the type/docs label on Dec 2, 2025
Contributor

github-actions bot commented Dec 2, 2025

💻 Deploy preview available (docs: Add ingestion troubleshooting topic):

@JStickler force-pushed the 2025.12.02_troubleshooting-ingest branch 3 times, most recently from 8a30170 to 2c51e88 on December 2, 2025 at 22:45
@JStickler force-pushed the 2025.12.02_troubleshooting-ingest branch from 2c51e88 to c720f7b on December 9, 2025 at 20:44
Comment on lines +83 to +133
### Error: `per_stream_rate_limit`

**Error message:**

`Per stream rate limit exceeded (limit: <limit>/sec) while attempting to ingest for stream <stream_labels> totaling <bytes>, consider splitting a stream via additional labels or contact your Loki administrator to see if the limit can be increased`

**Cause:**

A single stream (unique combination of labels) is sending data faster than the per-stream rate limit. This protects ingesters from being overwhelmed by a single high-volume stream.

**Default configuration:**

- `per_stream_rate_limit`: 3 MB/sec
- `per_stream_rate_limit_burst`: 15 MB (5x the rate limit)

**Resolution:**

* **Split the stream** by adding more labels to distribute the load:

```yaml
# Before: {job="app"}
# After: {job="app", instance="host1"}
```

* **Increase per-stream limits** (with caution):

```yaml
limits_config:
  per_stream_rate_limit: 5MB
  per_stream_rate_limit_burst: 20MB
```

{{< admonition type="warning" >}}
Do not set `per_stream_rate_limit` higher than 5MB or `per_stream_rate_limit_burst` higher than 20MB without careful consideration.
{{< /admonition >}}

* **Use Alloy's rate limiting** to throttle specific streams:

```alloy
stage.limit {
  rate  = 100
  burst = 200
}
```

**Properties:**

- Enforced by: Ingester
- Retryable: Yes
- HTTP status: 429 Too Many Requests
- Configurable per tenant: Yes
Collaborator

Loki has automatic stream sharding, and I wonder if we enable this by default. If folks hit the per-stream rate limit, we would typically tune automatic stream sharding first and then potentially increase the limit second; we almost never ask anyone to resolve this manually anymore by adding labels themselves.
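
For context, a minimal sketch of what tuning automatic stream sharding could look like under `limits_config` (the field names and values here are assumptions to be checked against the `limits_config` reference for your Loki version; `desired_rate` in particular is illustrative):

```yaml
limits_config:
  shard_streams:
    # Automatically split a stream into shards once its rate exceeds desired_rate,
    # instead of asking users to add labels by hand.
    enabled: true
    desired_rate: 3MB
    # Emit a log line each time a stream is sharded; useful while tuning.
    logging_enabled: true
```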

Comment on lines +170 to +180
* **Adjust chunk idle period** to expire streams faster:

```yaml
ingester:
  chunk_idle_period: 15m # Default: 30m
```

{{< admonition type="warning" >}}
Shorter idle periods create more chunks and may impact query performance.
{{< /admonition >}}

Collaborator

Generally we wouldn't recommend this; fix labels or increase the limit, but we leave `chunk_idle_period` alone.
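
A hedged sketch of the "increase the limit" path, assuming the relevant setting is the per-tenant stream limit `max_global_streams_per_user` (verify the exact option and size it to the tenant's real stream count before applying):

```yaml
limits_config:
  # Per-tenant cap on active streams across the cluster; raise it deliberately
  # after fixing label churn, rather than shortening chunk_idle_period.
  max_global_streams_per_user: 10000
```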

Comment on lines +217 to +226
* **Increase the line size limit** (not recommended above 256KB):

```yaml
limits_config:
  max_line_size: 512KB
```

{{< admonition type="warning" >}}
Large line sizes impact performance and memory usage.
{{< /admonition >}}
Collaborator

This is a limit we won't increase in Grafana Cloud, because doing so makes it very difficult to guarantee stability on the query path. I wonder if we should make the warning here more explicit:

Loki was built as a large multi-user, multi-tenant database, and as such this limit is very important for maintaining stability and performance when many users query the database simultaneously. We strongly recommend against increasing `max_line_size`; doing so makes it very difficult to provide consistent query performance and stability without throwing extremely large amounts of memory at the system and/or raising the gRPC message size limits to very high levels, both of which will likely lead to poorer performance and a worse experience.

Comment on lines +383 to +392
* **Increase max_chunk_age** (global setting, affects all tenants):

```yaml
ingester:
  max_chunk_age: 4h # Default: 2h
```

{{< admonition type="warning" >}}
Larger chunk age increases memory usage and may delay chunk flushing.
{{< /admonition >}}
Collaborator

I would not recommend increasing the chunk age here. There is instead a newer feature, which I'm not sure how well it has been documented, that enables time sharding on incoming streams.
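
A rough sketch of what enabling that feature might look like, assuming it is exposed as `time_sharding_enabled` under `shard_streams` in `limits_config` (the flag name is an assumption; confirm it in the `limits_config` reference before documenting it):

```yaml
limits_config:
  shard_streams:
    enabled: true
    # Assumed flag: shard incoming entries by time so older, out-of-order data
    # can still be ingested without raising max_chunk_age globally.
    time_sharding_enabled: true
```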

Comment on lines +394 to +408
* **Split high-volume streams** to reduce out-of-order conflicts:

```alloy
// Add instance or host labels to distribute logs
loki.source.file "logs" {
  targets = [
    {
      __path__ = "/var/log/*.log",
      job      = "app",
      instance = env("HOSTNAME"),
    },
  ]
  forward_to = [loki.write.default.receiver]
}
```
Collaborator

I would remove this one.

Comment on lines +479 to +484
* **Increase the grace period**:

```yaml
limits_config:
  creation_grace_period: 15m
```
Collaborator

I would probably put a "not recommended" on this one, as I have no idea what would happen if someone were to do this at any kind of normal level. We never really test or have much understanding of what happens if you send future-dated logs.

- Combining related labels

{{< admonition type="warning" >}}
Do not increase `max_label_names_per_series` as high cardinality can lead to significant performance degradation.
Collaborator

I would probably reword this:

We strongly recommend against increasing this value; doing so creates a larger index, which hurts query performance and opens the door for cardinality explosions.

You should be able to categorize your logs with 15 labels or, typically, far fewer. In all the years we have run Loki, the number of valid exceptions we have seen can be counted on one hand, versus thousands of requests.

Comment on lines +549 to +554
* **Increase the limit** (not recommended):

```yaml
limits_config:
  max_label_name_length: 2048
```
Collaborator

I would probably remove this and just leave it at "shorten label names".

There is never any circumstance where you need a label name longer than 2048 characters.

Comment on lines +583 to +588
* **Increase the limit**:

```yaml
limits_config:
  max_label_value_length: 4096
```
Collaborator

I would also remove this; there is no circumstance where anyone should need a label value longer than 4096 characters.

Comment on lines +644 to +654
* **Reduce batch size** in your ingestion client:

```alloy
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"

    batch_size = 524288 // 512KB in bytes
    batch_wait = "1s"
  }
}
```
Collaborator

I couldn't imagine ever hitting this limit with the default 1MB in Alloy, and we often recommend a larger batch, like 4MB, so I wonder if we should remove this example, as I don't think anyone should be using a batch smaller than 1MB.
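
If an example stays here, a sketch pointing in the other direction, toward the larger batches mentioned above (the URL is a placeholder; `batch_size` and `batch_wait` follow the same `loki.write` endpoint block as the snippet being reviewed, and the values should be confirmed against the Alloy docs for your version):

```alloy
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"

    batch_size = 4194304 // 4MB, up from the ~1MB default
    batch_wait = "1s"
  }
}
```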

Comment on lines +714 to +719
* **DynamoDB errors:**
- `ProvisionedThroughputExceededException`: Write capacity exceeded
- `ValidationException`: Invalid data format

* **Cassandra errors:**
- `NoConnectionsAvailable`: Cannot connect to Cassandra cluster
Collaborator

We should remove these; these store types have been deprecated for quite some time now.


* **Manual recovery** (if automatic recovery fails):
- Stop the ingester
- Backup the WAL directory
Collaborator

I wonder if we should remove this backup line, since we don't really have any instructions on what you would do with this backup.

* **Verify Loki is running** and healthy.
* **Check network connectivity** between client and Loki.
* **Confirm the correct hostname and port** configuration.
* **Review firewall and security group** settings.
Collaborator

Might also add to check for CPU starvation, as a Loki pod that's overwhelmed will also potentially fail to respond to requests.

