@AndersonQ
Member

@AndersonQ AndersonQ commented Dec 3, 2025

Proposed commit message

filebeat: Promote filestream GZIP support to GA

Promotes GZIP support in the filestream input from beta to General Availability.

Deprecates the `gzip_experimental` option in favour of the new `compression`
setting. Valid values:
- `""`: No compression (default).
- `"gzip"`: Treat all files as GZIP.
- `"auto"`: Auto-detect based on magic bytes.

Note: GZIP decoding requires `fingerprint` file identity for accurate offset 
tracking. A warning is now logged if compression is enabled with a file identity
other than `fingerprint`.

Unit and integration tests have been updated to reflect these changes.

AI tools used: Cursor.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works. Where relevant, I have used the stresstest.sh script to run them under stress conditions and race detector to verify their stability.
  • I have added an entry in ./changelog/fragments using the changelog tool.

Disruptive User Impact

  • gzip_experimental has been removed. If it is still configured, it is ignored and a warning is logged directing users to the compression setting.

How to test this PR locally

Verify that combining a non-fingerprint file identity with GZIP compression logs a warning:

  • create a config with file_identity.native:
filebeat.inputs:
  - type: filestream
    id: test
    paths:
      - /tmp/*.log
    file_identity.native: ~
    compression: auto
path.home: /tmp/beats/home
output.file:
  enabled: true
  path: /tmp/beats/home/out
  filename: "output"
logging.level: debug
  • run filebeat
go run . --strict.perms=false -e -c ./filebeat.yml 2>&1 | grep message
  • check that Filebeat starts and logs the warning
{"log.level":"warn","@timestamp":"2025-12-12T09:32:06.622+0100","log.logger":"input","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/input/filestream.config.checkUnsupportedParams","file.name":"filestream/config.go","file.line":257},"message":"compression='auto' requires file_identity to be 'fingerprint'","service.name":"filebeat","ecs.version":"1.6.0"}

To verify it works with compression auto:

  • generate a gzip log file:
mkdir -p /tmp/beats/in
docker run -it --rm mingrammer/flog -f json -n 100 > /tmp/beats/in/log.ndjson
gzip /tmp/beats/in/log.ndjson
# generate another "active" log file
docker run -it --rm mingrammer/flog -f json -n 100 > /tmp/beats/in/log.ndjson
  • use the following config file. Adjust as you like
http:
  enabled: true

path.home: /tmp/beats/home
filebeat.inputs:
    - type: filestream
      id: gzip-input
      enabled: true
      paths:
        - /tmp/beats/in/log.ndjson*
      compression: auto

output.file:
  path: /tmp/beats/home/out
  filename: "output-file"
logging.level: debug
logging.metrics:
  level: debug
  • run filebeat
go run . --strict.perms=false -e -c ./filebeat.yml 2>&1 | grep message
  • output file has 200 lines
wc -l /tmp/beats/home/out/*
200

To verify that compression: gzip does not ingest a plain file:

  • generate a gzip log file:
mkdir -p /tmp/beats/in
docker run -it --rm mingrammer/flog -f json -n 100 > /tmp/beats/in/log.ndjson
gzip /tmp/beats/in/log.ndjson
# generate another "active" log file
docker run -it --rm mingrammer/flog -f json -n 100 > /tmp/beats/in/log.ndjson
  • use the following config file. Adjust as you like
http:
  enabled: true

path.home: /tmp/beats/home
filebeat.inputs:
    - type: filestream
      id: gzip-input
      enabled: true
      paths:
        - /tmp/beats/in/log.ndjson*
      compression: gzip

output.file:
  path: /tmp/beats/home/out
  filename: "output-file"
logging.level: debug
logging.metrics:
  level: debug
  • run filebeat
go run . --strict.perms=false -e -c ./filebeat.yml 
  • find the log:
{"log.level":"warn","@timestamp":"2025-12-12T09:40:47.453+0100","log.logger":"input.filestream.scanner","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/input/filestream.(*fileScanner).GetFiles","file.name":"filestream/fswatch.go","file.line":511},"message":"cannot create a file descriptor for an ingest target \"/tmp/beats/in/log.ndjson\": failed to create gzip seeker: could not create gzip reader: gzip: invalid header","service.name":"filebeat","id":"gzip-input","ecs.version":"1.6.0"}
  • output file has 100 lines, the gzip file only
wc -l /tmp/beats/home/out/*
100

To verify that compression: "" ingests a gzip file as a plain file:

  • generate a gzip log file:
mkdir -p /tmp/beats/in
docker run -it --rm mingrammer/flog -f json -n 100 > /tmp/beats/in/log.ndjson
gzip /tmp/beats/in/log.ndjson
# generate another "active" log file
docker run -it --rm mingrammer/flog -f json -n 100 > /tmp/beats/in/log.ndjson
  • use the following config file. Adjust as you like
http:
  enabled: true

path.home: /tmp/beats/home
filebeat.inputs:
    - type: filestream
      id: gzip-input
      enabled: true
      paths:
        - /tmp/beats/in/log.ndjson*
      compression: ""

output.file:
  path: /tmp/beats/home/out
  filename: "output-file"
logging.level: debug
logging.metrics:
  level: debug
  • run filebeat
go run . --strict.perms=false -e -c ./filebeat.yml 
  • output file has 120 lines: 100 from the plain file plus 20 garbage lines from the gzip file
wc -l  /tmp/beats/home/out/*
120 
  • check the data ingested from the gzip file is garbage:
grep "/tmp/beats/in/log.ndjson.gz" /tmp/beats/home/out/* | tail -n 1

{"@timestamp":"2025-12-12T08:43:28.335Z","@metadata":{"beat":"filebeat","type":"_doc","version":"9.3.0"},"ecs":{"version":"8.0.0"},"host":{"name":"mokona-elastic"},"agent":{"version":"9.3.0","ephemeral_id":"6a649a03-1cdf-4e41-80d0-075929ce8542","id":"032785ed-d7ef-491e-913e-f92063b05844","name":"mokona-elastic","type":"filebeat"},"log":{"offset":5270,"file":{"path":"/tmp/beats/in/log.ndjson.gz","device_id":"64513","inode":"43516198","fingerprint":"3b29db3923a6ebd5f44bf71e437f57d9676bfd8a1bc6ae41cbfbd0f954da1863"}},"message":"\u0013\ufffd˙4_\ufffd\ufffd\ufffd\ufffd>\ufffdy\u0008\ufffd*V\ufffdM\ufffd;=\u0002u=$\ufffd\u000e\ufffd\u0001\u0008\u0003\ufffdJ\ufffd\ufffd\ufffd#\ufffdǎ΃\ufffd\ufffd\u001c8\ufffd\ufffd\u0006K\ufffd!n\u000f\ufffd2[\ufffd\ufffd\ufffd4O\ufffd\r\ufffdw$͒d\ufffd\ufffd%\ufffd]{\ufffd\ufffd\ufffd^\u0011\ufffd\u001d\ufffd\ufffd\ufffd\ufffdU\u000b\u0002D\u0018d\ufffd.\ufffdNs\u0013\ufffd1b?\u000eB\u001e\ufffd\ufffd<\ufffd7\ufffdr\ufffdwyS\ufffd\t\u0001\ufffd[\ufffdA\u0004%\ufffdi\ufffd\ufffd&lmRxj\u0010\ufffd@\u001f\u0007\ufffd\u0011,\ufffdy\ufffd\ufffdQ\ufffd\ufffd&\ufffd\ufffdX>\ufffd\ufffd\ufffd <\ufffdǹ\ufffd\ufffd\ufffdw<\u001a\u0000OW$!dp\ufffd\ufffdh\ufffd\ufffd\ufffd8{\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd@]U7Pl\ufffd\ufffdz\u000fS\ufffd,H\u0017\ufffd","input":{"type":"filestream"}}
  • 20 lines from the gzip file
grep "/tmp/beats/in/log.ndjson.gz" /tmp/beats/home/out/* | wc -l    
20
  • 100 lines from the plain file
grep '/tmp/beats/in/log.ndjson"' /tmp/beats/home/out/* | wc -l
100

Related issues

@botelastic botelastic bot added the needs_team label Dec 3, 2025

@mergify
Contributor

mergify bot commented Dec 3, 2025

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @AndersonQ? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@AndersonQ AndersonQ force-pushed the 47880-gzip-default-on-ga branch 3 times, most recently from 676ad9b to 4543ad2 Compare December 4, 2025 13:45
@AndersonQ AndersonQ added the Team:Elastic-Agent-Data-Plane label Dec 4, 2025
@botelastic botelastic bot removed the needs_team label Dec 4, 2025

@elasticmachine
Contributor

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

Contributor

@colleenmcginnis colleenmcginnis left a comment

Some minor suggestions related to the changelog fragment.

@orestisfl
Contributor

@cmacknz @AndersonQ does enabling gzip files by default mean that existing users with wildcard paths (/path/to/log*), which can now match gzip files, will also match /path/to/log.tar.gz and unexpectedly start ingesting more files?

@cmacknz
Member

cmacknz commented Dec 10, 2025

I got some feedback from PM (Bill) that we may not want to have this be enabled by default just to minimize any chance of user disruption.

Since in general we want compatibility with filelog, we can follow its approach here, which is covered by the compression parameter: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/filelogreceiver/README.md

compression
Indicate the compression format of input files. If set accordingly, files will be read using a reader that uncompresses the file before scanning its content. Options are ``, gzip, or auto. `auto` auto-detects file compression type. Currently, gzip files are the only compressed files auto-detected, based on ".gz" filename extension. `auto` option is useful when ingesting a mix of compressed and uncompressed files with the same filelogreceiver.

So I would vote we add the same compression configuration to be exactly compatible. We would default to none/unspecified which treats everything as a plain file. We could add gzip which skips auto-detection and just assumes all files are gzip. We already have the auto mode which also only supports gzip today via IsGZIP.

// IsGZIP reports whether the file f starts with the GZIP magic header bytes as
// defined by RFC 1952. The file offset is reset to the original position before
// returning.
func IsGZIP(f *os.File) (bool, error) {

@AndersonQ AndersonQ marked this pull request as draft December 11, 2025 07:16
@AndersonQ
Member Author

AndersonQ commented Dec 11, 2025

> I got some feedback from PM (Bill) that we may not want to have this be enabled by default just to minimize any chance of user disruption.

ok, makes sense.

> So I would vote we add the same compression configuration to be exactly compatible. We would default to none/unspecified which treats everything as a plain file. We could add gzip which skips auto-detection and just assumes all files are gzip. We already have the auto mode which also only supports gzip today via IsGZIP.

Ok, so let me confirm the behaviour:

  • keep the requirement to use fingerprint, if not, error and don't start the input
  • compression: missing/null/empty -> gzip off, every file is plain file
  • compression: gzip: every file is gzip. Error if the file is plain text
  • compression: auto: decompress GZIP, treat plain file as plain file.
  • log input as filestream uses absent compression, everything is a plain file
  • gzip_experimental: deprecated. It sets compression: auto instead and logs a warning telling users to use compression and that it will be removed in a future version.

@cmacknz
Member

cmacknz commented Dec 11, 2025

> keep the requirement to use fingerprint, if not, error and don't start the input

No. There is no reason for Filebeat to exit; you should only warn. You do not want to cause a data collection outage over this because Filebeat is very likely doing more data collection than just reading actively rotating gzipped logs.

> compression: missing/null/empty -> gzip off, every file is plain file
> compression: gzip: every file is gzip. Error if the file is plain text
> compression: auto: decompress GZIP, treat plain file as plain file.

Yes, but please use filelog and test to confirm we have interpreted its behaviour from its documentation correctly.

> log input as filestream uses absent compression, everything is a plain file

Yes, the log input should not support compression.

> gzip_experimental: deprecated. It sets compression: auto instead and logs a warning telling users to use compression and that it will be removed in a future version.

I would just ignore this parameter and log that it's deprecated and explain what to do instead. We want to delete this parameter. It may be simpler to just delete it immediately (which will also cause it to be ignored).

@AndersonQ AndersonQ force-pushed the 47880-gzip-default-on-ga branch from 3960caa to 9427590 Compare December 12, 2025 08:29
@AndersonQ AndersonQ force-pushed the 47880-gzip-default-on-ga branch from 3075a5c to 4fc0195 Compare December 12, 2025 08:58
@AndersonQ AndersonQ marked this pull request as ready for review December 12, 2025 08:58
@AndersonQ AndersonQ requested a review from a team as a code owner December 12, 2025 08:58
@AndersonQ AndersonQ requested a review from pchila December 12, 2025 08:58
@AndersonQ
Member Author

@orestisfl, @cmacknz it's ready for review :)

@AndersonQ AndersonQ removed request for a team, colleenmcginnis and pchila December 12, 2025 09:01
@AndersonQ AndersonQ changed the title from "filebeat: make GZIP GA and enabled by default" to "filebeat: make GZIP GA and add compression config" Dec 12, 2025
@AndersonQ
Member Author

> compression: missing/null/empty -> gzip off, every file is plain file
> compression: gzip: every file is gzip. Error if the file is plain text
> compression: auto: decompress GZIP, treat plain file as plain file.

> Yes, but please use filelog and test to confirm we have interpreted its behaviour from its documentation correctly.

I confirmed, it behaves like that

@AndersonQ
Member Author

@orestisfl,

> @cmacknz @AndersonQ does enabling gzip files by default mean that existing users with wildcard paths (/path/to/log*), which can now match gzip files, will also match /path/to/log.tar.gz and unexpectedly start ingesting more files?

Enabling GZIP ingestion has no effect on the paths glob matching. It always matches all files, regardless of their format.
The trick we needed until now was to exclude the compressed files with exclude_files. That's why the suggested value for it is \.gz$ to prevent ingesting GZIP-compressed files.

Filestream will ingest anything. If a file isn't plain text, it ingests garbage: a string representation of the raw data in the file. For example, a gzip file would end up like:

"message":"\u0013\ufffd˙4_\ufffd\ufffd\ufffd\ufffd>\ufffdy\u0008\ufffd*V\ufffdM\ufffd;=\u0002u=$\ufffd\u000e\ufffd\u0001\u0008\u0003\ufffdJ\ufffd\ufffd\ufffd#\ufffdǎ΃\ufffd\ufffd\u001c8\ufffd\ufffd\u0006K\ufffd!n\u000f\ufffd2[\ufffd\ufffd\ufffd4O\ufffd\r\ufffdw$͒d\ufffd\ufffd%\ufffd]{\ufffd\ufffd\ufffd^\u0011\ufffd\u001d\ufffd\ufffd\ufffd\ufffdU\u000b\u0002D\u0018d\ufffd.\ufffdNs\u0013\ufffd1b?\u000eB\u001e\ufffd\ufffd<\ufffd7\ufffdr\ufffdwyS\ufffd\t\u0001\ufffd[\ufffdA\u0004%\ufffdi\ufffd\ufffd&lmRxj\u0010\ufffd@\u001f\u0007\ufffd\u0011,\ufffdy\ufffd\ufffdQ\ufffd\ufffd&\ufffd\ufffdX>\ufffd\ufffd\ufffd <\ufffdǹ\ufffd\ufffd\ufffdw<\u001a\u0000OW$!dp\ufffd\ufffdh\ufffd\ufffd\ufffd8{\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd@]U7Pl\ufffd\ufffdz\u000fS\ufffd,H\u0017\ufffd",

@mergify
Contributor

mergify bot commented Dec 12, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b 47880-gzip-default-on-ga upstream/47880-gzip-default-on-ga
git merge upstream/main
git push upstream 47880-gzip-default-on-ga

Contributor

@orestisfl orestisfl left a comment

Some test suggestions

Message string `json:"message"`
}
events := integration.GetEventsFromFileOutput[event](filebeat, 0, true)
require.Equal(t, len(events), 1, "expected one event")

Suggested change
- require.Equal(t, len(events), 1, "expected one event")
+ require.Len(t, events, 1)

better for debugging in case there is more than one

{
name: "compression empty string with gzip_experimental set",
compression: "compression: \"\"\n gzip_experimental: true",
},

I feel it would make this test stronger (albeit less clean) if it also proved that, with the correct setting (compression: auto), the exact same steps do read the file contents.

Labels

Team:Docs, Team:Elastic-Agent-Data-Plane

Development

Successfully merging this pull request may close these issues.

[Filebeat] Make ingesting GZIP files enabled by default

6 participants