feat: add configurable file size and row limits for Parquet writer #586
base: staging
Conversation
const (
	DefaultMaxFileSizeMB  = 512
	DefaultMaxRowsPerFile = 1000000
)
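Not part of the diff, but for context: defaults like these are typically applied when the user leaves the limits unset. The writerConfig struct and field names below are hypothetical, a minimal sketch only.

// Hypothetical config holder; the fields mirror the max_file_size_mb /
// max_rows_per_file options described in this PR, but are illustrative only.
type writerConfig struct {
	MaxFileSizeMB  int64
	MaxRowsPerFile int64
}

// applyDefaults falls back to the package defaults when a limit is unset.
func (c *writerConfig) applyDefaults() {
	if c.MaxFileSizeMB <= 0 {
		c.MaxFileSizeMB = DefaultMaxFileSizeMB
	}
	if c.MaxRowsPerFile <= 0 {
		c.MaxRowsPerFile = DefaultMaxRowsPerFile
	}
}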
we don't need to add a row-based limit
// estimateRecordSize returns a rough uncompressed byte estimate for one record.
func estimateRecordSize(data map[string]any) int64 {
	var size int64
	for key, value := range data {
		size += int64(len(key))
		switch v := value.(type) {
		case string:
			size += int64(len(v))
		case []byte:
			size += int64(len(v))
		case int, int32, int64, uint, uint32, uint64, float32, float64:
			size += 8
		case bool:
			size += 1
		case nil:
			size += 0
		default:
			size += 100 // unknown types: assume a generous fixed size
		}
	}
	return size
}
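For intuition, here is a worked example of the estimate on a made-up record (illustrative only, not from the PR):

// keys:   "id"(2) + "name"(4) + "active"(6) = 12 bytes
// values: int64(8) + "alice"(5) + bool(1)   = 14 bytes
rec := map[string]any{"id": int64(42), "name": "alice", "active": true}
size := estimateRecordSize(rec) // size == 26, an uncompressed estimate
_ = size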
Calculating the file size from the in-memory values might cause problems, since it doesn't account for the compression ratio.
Instead, can you research whether the parquet Go library provides a function that could help us here?
Hi @vaibhav-datazip,
So I initially tried to calculate the actual file size using os.Stat(), but it reports 0 until we call Flush() on the writer, which makes it unreliable for rotation checks.
I also researched the library but didn't find any function that reports the compressed file size, since the buffered data is only written out on Flush/Close.
New suggestion: we can periodically check the actual file size, every 1,000 records (or some configurable interval), by calling Flush() and then os.Stat() to get the real compressed size on disk. That way we measure the actual bytes after compression rather than an estimate, and if the file exceeds the limit we rotate to a new file (see the sketch below).
Tested with Docker integration tests at 1 MB, 5 MB, and 10 MB limits; the resulting files come out very close to those limits.
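A minimal sketch of that suggestion, assuming hypothetical flush and rotate hooks; this is illustrative, not the PR's actual code:

package writer

import "os"

const checkInterval = 1000 // records between on-disk size checks

// sizeChecker is a hypothetical helper; flush and rotate stand in for the
// Parquet writer's Flush and the file-rotation logic discussed above.
type sizeChecker struct {
	rowCount     int64
	maxFileBytes int64
	filePath     string
	flush        func() error
	rotate       func() error
}

func (c *sizeChecker) maybeRotate() error {
	c.rowCount++
	if c.rowCount%checkInterval != 0 {
		return nil
	}
	if err := c.flush(); err != nil { // push buffered row groups to disk
		return err
	}
	info, err := os.Stat(c.filePath) // real, compressed size on disk
	if err != nil {
		return err
	}
	if info.Size() >= c.maxFileBytes {
		return c.rotate() // close the current file and start a new one
	}
	return nil
}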
Thanks for pointing out the File().Size() method!
I tested it, but it tracks the offset of bytes already written to the writer and does not include buffered data, so during active writing it reports a size that lags behind the data we have actually ingested.
From what I understand, the method seems intended for use after the file is closed, not during active writing.

Description
This PR implements configurable file size and row count limits for the Parquet writer, enabling users to control output file sizes for optimal performance with downstream query engines.
Changes:
• Added max_file_size_mb and max_rows_per_file configuration fields (defaults: 512 MB, 1M rows)
• Implemented size estimation and rotation logic in the write path
• Added estimateRecordSize() to calculate uncompressed data size
• Added shouldRotateFile() to check dual rotation criteria (size OR row count); a sketch follows this list
• Extended FileMetadata with row count and size tracking
• Added unit tests for configuration validation
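As referenced in the list above, a sketch of the dual rotation check with hypothetical parameter names (the PR's actual shouldRotateFile() may differ):

// shouldRotate reports whether either limit has been hit (size OR row count).
func shouldRotate(estimatedBytes, rowCount, maxFileSizeMB, maxRowsPerFile int64) bool {
	return estimatedBytes >= maxFileSizeMB*1024*1024 || rowCount >= maxRowsPerFile
}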
Closes #90
Type of change
How Has This Been Tested?
Code Verification:
• Build successful, unit tests pass
• Uses uncompressed size estimation approach
Integration Testing:
Validated with 55,000 test records:
• Row rotation (max_rows_per_file: 1000): 55 files created, 1,000 rows each
• Size rotation (max_file_size_mb: 1): 20 files created, ~2,831 rows each (~0.18 MB compressed)
• Defaults (512 MB / 1M rows): 1 file, no rotation (data below limits)

Documentation