
Conversation

Contributor

@JStickler commented Dec 2, 2025

What this PR does / why we need it:

  • Creates a new folder for troubleshooting in the TOC and renames the current troubleshooting topic.
  • Adds a new topic, Troubleshooting Ingest.
  • Adds recommended production limits to the Configuration Best Practices topic.

Related PR: #20182

Special notes for your reviewer:

AI authored using VSCode + Claude Sonnet 4 and Cursor + Claude Sonnet 4.5. Two different drafts were generated, then merged and standardized.

Prompts:

  • Acting as an experienced technical writer with knowledge of Loki, write a troubleshooting guide for ingesting logs into Loki. Using error messages in the code base and defaults listed in the configuration documentation, document Loki’s error messages, the causes, and what users should do when encountering errors ingesting logs and writing logs to storage.
  • Revise and replace Promtail configuration with Alloy configuration.
  • Using the following structure: Error message, Cause, Default configuration, Resolution, Properties, fill in any missing details in troubleshooting-ingest.md.
  • Add context to the section "Monitoring and alerting". What is the file name where users should include this code example?

This topic is for OSS Loki, but I'd love to reuse it for Cloud Logs. I could use some help flagging which errors would require contacting Grafana Support, and also whether any of the errors are "self-healing" in Cloud due to autoscaling, etc.

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

@JStickler requested a review from a team as a code owner on December 2, 2025 at 22:16
@JStickler added the type/docs label on Dec 2, 2025
Contributor

github-actions bot commented Dec 2, 2025

💻 Deploy preview available (docs: Add ingestion troubleshooting topic):

@JStickler force-pushed the 2025.12.02_troubleshooting-ingest branch 3 times, most recently from 8a30170 to 2c51e88 on December 2, 2025 at 22:45
@JStickler force-pushed the 2025.12.02_troubleshooting-ingest branch from 2c51e88 to c720f7b on December 9, 2025 at 20:44
Comment on lines +83 to +133
### Error: `per_stream_rate_limit`

**Error message:**

`Per stream rate limit exceeded (limit: <limit>/sec) while attempting to ingest for stream <stream_labels> totaling <bytes>, consider splitting a stream via additional labels or contact your Loki administrator to see if the limit can be increased`

**Cause:**

A single stream (unique combination of labels) is sending data faster than the per-stream rate limit. This protects ingesters from being overwhelmed by a single high-volume stream.

**Default configuration:**

- `per_stream_rate_limit`: 3 MB/sec
- `per_stream_rate_limit_burst`: 15 MB (5x the rate limit)

**Resolution:**

* **Split the stream** by adding more labels to distribute the load:

```yaml
# Before: {job="app"}
# After: {job="app", instance="host1"}
```

* **Increase per-stream limits** (with caution):

```yaml
limits_config:
  per_stream_rate_limit: 5MB
  per_stream_rate_limit_burst: 20MB
```

{{< admonition type="warning" >}}
Do not set `per_stream_rate_limit` higher than 5MB or `per_stream_rate_limit_burst` higher than 20MB without careful consideration.
{{< /admonition >}}

* **Use Alloy's rate limiting** to throttle specific streams:

```alloy
stage.limit {
  rate  = 100
  burst = 200
}
```

**Properties:**

- Enforced by: Ingester
- Retryable: Yes
- HTTP status: 429 Too Many Requests
- Configurable per tenant: Yes
Collaborator

Loki has automatic stream sharding, and I wonder if we enable this by default. If folks hit the per-stream rate limit, we would typically tune automatic stream sharding first and then potentially increase the limit second; we almost never ask anyone to resolve this manually anymore by adding labels themselves.
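
For context, a minimal sketch of what tuning automatic stream sharding could look like under `limits_config` (the field names and values here are assumptions to be checked against the `limits_config` reference for your Loki version; `desired_rate` in particular is illustrative):

```yaml
limits_config:
  shard_streams:
    # Automatically split a stream into shards once its rate exceeds desired_rate,
    # instead of asking users to add labels by hand.
    enabled: true
    desired_rate: 3MB
    # Emit a log line each time a stream is sharded; useful while tuning.
    logging_enabled: true
```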

Comment on lines +170 to +180
* **Adjust chunk idle period** to expire streams faster:

```yaml
ingester:
  chunk_idle_period: 15m # Default: 30m
```

{{< admonition type="warning" >}}
Shorter idle periods create more chunks and may impact query performance.
{{< /admonition >}}

Collaborator

Generally we wouldn't recommend this; fix labels or increase the limit, but we leave `chunk_idle_period` alone.
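
A hedged sketch of the "increase the limit" path, assuming the relevant setting is the per-tenant stream limit `max_global_streams_per_user` (verify the exact option and size it to the tenant's real stream count before applying):

```yaml
limits_config:
  # Per-tenant cap on active streams across the cluster; raise it deliberately
  # after fixing label churn, rather than shortening chunk_idle_period.
  max_global_streams_per_user: 10000
```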

Comment on lines +217 to +226
* **Increase the line size limit** (not recommended above 256KB):

```yaml
limits_config:
  max_line_size: 512KB
```

{{< admonition type="warning" >}}
Large line sizes impact performance and memory usage.
{{< /admonition >}}
Collaborator

This is a limit we won't increase in Grafana Cloud, because doing so makes it very difficult to guarantee stability on the query path. I wonder if we should make the warning here more explicit:

Loki was built as a large multi-user, multi-tenant database, and as such this limit is very important for maintaining stability and performance when many users query the database simultaneously. We strongly recommend against increasing `max_line_size`; doing so makes it very difficult to provide consistent query performance and stability without throwing extremely large amounts of memory at the system and/or raising the gRPC message size limits to very high levels, both of which will likely lead to poorer performance and a worse experience.

Comment on lines +383 to +392
* **Increase max_chunk_age** (global setting, affects all tenants):

```yaml
ingester:
  max_chunk_age: 4h # Default: 2h
```

{{< admonition type="warning" >}}
Larger chunk age increases memory usage and may delay chunk flushing.
{{< /admonition >}}
Collaborator

I would not recommend increasing the chunk age here. There is instead a newer feature, which I'm not sure how well it has been documented, that enables time sharding on incoming streams.
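
A rough sketch of what enabling that feature might look like, assuming it is exposed as `time_sharding_enabled` under `shard_streams` in `limits_config` (the flag name is an assumption; confirm it in the `limits_config` reference before documenting it):

```yaml
limits_config:
  shard_streams:
    enabled: true
    # Assumed flag: shard incoming entries by time so older, out-of-order data
    # can still be ingested without raising max_chunk_age globally.
    time_sharding_enabled: true
```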

Comment on lines +394 to +408
* **Split high-volume streams** to reduce out-of-order conflicts:

```alloy
// Add instance or host labels to distribute logs
loki.source.file "logs" {
  targets = [
    {
      __path__ = "/var/log/*.log",
      job      = "app",
      instance = env("HOSTNAME"),
    },
  ]
  forward_to = [loki.write.default.receiver]
}
```
Collaborator

I would remove this one.

Comment on lines +479 to +484
* **Increase the grace period**:

```yaml
limits_config:
  creation_grace_period: 15m
```
Collaborator

I would probably put a "not recommended" on this one, as I have no idea what would happen if someone were to do this at any kind of normal level. We never really test or have much understanding of what happens if you send future-dated logs.

- Combining related labels

{{< admonition type="warning" >}}
Do not increase `max_label_names_per_series` as high cardinality can lead to significant performance degradation.
Collaborator

I would probably reword this:

We strongly recommend against increasing this value; doing so creates a larger index, which hurts query performance and opens the door for cardinality explosions.

You should be able to categorize your logs with 15 labels or, typically, far fewer. In all the years we have run Loki, the number of valid exceptions we have seen can be counted on one hand, versus thousands of requests.

Comment on lines +549 to +554
* **Increase the limit** (not recommended):

```yaml
limits_config:
  max_label_name_length: 2048
```
Collaborator

I would probably remove this and just leave it at "shorten label names".

There is never any circumstance where you need a label name longer than 2048 characters.

Comment on lines +583 to +588
* **Increase the limit**:

```yaml
limits_config:
  max_label_value_length: 4096
```
Collaborator

I would also remove this; there is no circumstance where anyone should need a label value longer than 4096 characters.

Comment on lines +644 to +654
* **Reduce batch size** in your ingestion client:

```alloy
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"

    batch_size = 524288 // 512KB in bytes
    batch_wait = "1s"
  }
}
```
Collaborator

I couldn't imagine ever hitting this limit with the default 1MB in Alloy, and we often recommend a larger batch, like 4MB, so I wonder if we should remove this example, as I don't think anyone should be using a batch smaller than 1MB.
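
If an example stays here, a sketch pointing in the other direction, toward the larger batches mentioned above (the URL is a placeholder; `batch_size` and `batch_wait` follow the same `loki.write` endpoint block as the snippet being reviewed, and the values should be confirmed against the Alloy docs for your version):

```alloy
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"

    batch_size = 4194304 // 4MB, up from the ~1MB default
    batch_wait = "1s"
  }
}
```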

Comment on lines +714 to +719
* **DynamoDB errors:**
- `ProvisionedThroughputExceededException`: Write capacity exceeded
- `ValidationException`: Invalid data format

* **Cassandra errors:**
- `NoConnectionsAvailable`: Cannot connect to Cassandra cluster
Collaborator

We should remove these; these store types have been deprecated for quite some time now.


* **Manual recovery** (if automatic recovery fails):
- Stop the ingester
- Backup the WAL directory
Collaborator

I wonder if we should remove this backup line, since we don't really have any instructions on what you would do with this backup.

* **Verify Loki is running** and healthy.
* **Check network connectivity** between client and Loki.
* **Confirm the correct hostname and port** configuration.
* **Review firewall and security group** settings.
Collaborator

Might also add to check for CPU starvation, as a Loki pod that's overwhelmed will also potentially fail to respond to requests.

