
Conversation

@siiddhantt (Contributor) commented Oct 12, 2025

Description

This PR implements configurable file size and row count limits for the Parquet writer, enabling users to control output file sizes for optimal performance with downstream query engines.

Changes:

• Added max_file_size_mb and max_rows_per_file configuration fields (defaults: 512 MB, 1M rows)
• Implemented size estimation and rotation logic in write path
• Added estimateRecordSize() to calculate uncompressed data size
• Added shouldRotateFile() to check dual rotation criteria (size OR row count); see the sketch below
• Extended FileMetadata with row count and size tracking
• Added unit tests for configuration validation

Closes #90
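
A minimal sketch of that dual rotation check, for reviewers skimming the description. The constant and function names match this PR's diff, but the WriterConfig struct and the exact signature are illustrative assumptions, not the PR's actual code:

```go
// Defaults as stated in the description: 512 MB and 1M rows.
const (
	DefaultMaxFileSizeMB  = 512
	DefaultMaxRowsPerFile = 1000000
)

// WriterConfig is a hypothetical holder for the two new configuration
// fields; the JSON keys match the names in this PR.
type WriterConfig struct {
	MaxFileSizeMB  int64 `json:"max_file_size_mb"`
	MaxRowsPerFile int64 `json:"max_rows_per_file"`
}

// shouldRotateFile applies the dual criteria from the description:
// rotate once EITHER the size limit OR the row-count limit is reached.
func shouldRotateFile(cfg WriterConfig, currentBytes, currentRows int64) bool {
	return currentBytes >= cfg.MaxFileSizeMB*1024*1024 ||
		currentRows >= cfg.MaxRowsPerFile
}
```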

Type of change

  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Code Verification:
• Build successful, unit tests pass
• Uses uncompressed size estimation approach

Integration Testing:
Validated with 55,000 test records:

• Row rotation (max_rows_per_file: 1000): 55 files created, 1,000 rows each
• Size rotation (max_file_size_mb: 1): 20 files created, ~2,831 rows each (~0.18 MB compressed)
• Defaults (512 MB / 1M rows): 1 file, no rotation (data stayed below both limits)

Documentation

  • Documentation Link: [link to README, olake.io/docs, or olake-docs]
  • N/A (bug fix, refactor, or test changes only)

@vaibhav-datazip added the hacktoberfest label (Issues open for Hacktoberfest contributors) on Oct 14, 2025

```go
const (
	DefaultMaxFileSizeMB  = 512
	DefaultMaxRowsPerFile = 1000000
)
```
Collaborator:
we don't need to add a row-based size limit

Comment on lines +121 to +140
```go
// estimateRecordSize returns a rough uncompressed, in-memory size of a
// record; it does not account for Parquet encoding or compression.
func estimateRecordSize(data map[string]any) int64 {
	var size int64
	for key, value := range data {
		size += int64(len(key))
		switch v := value.(type) {
		case string:
			size += int64(len(v))
		case []byte:
			size += int64(len(v))
		case int, int32, int64, uint, uint32, uint64, float32, float64:
			size += 8
		case bool:
			size += 1
		case nil:
			size += 0
		default:
			size += 100 // conservative fallback for nested/unknown types
		}
	}
	return size
}
```
Collaborator:

Calculating the file size from the in-memory values might cause problems, since it doesn't account for the compression ratio.

Instead, can you research whether the parquet-go library provides a function that could help us here?

Contributor Author:

Hi @vaibhav-datazip,

I initially tried to get the actual file size with os.Stat(), but it reads 0 until Flush() is called on the writer, which made it unreliable for rotation checks. I also researched the library but didn't find any function that reports the compressed file size, since buffered data is only written on Flush/Close.

New suggestion: we can periodically check the actual file size every 1,000 records (or some configurable interval) by calling Flush() and then using os.Stat() to get the real compressed size on disk. That way we measure actual bytes after compression rather than estimates, and rotate to a new file once the limit is exceeded.

Tested with Docker integration tests across 1 MB, 5 MB, and 10 MB limits; the resulting file sizes come in very close to those limits.
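
A minimal sketch of that flush-then-stat approach. The Flush() call and os.Stat() usage come from the suggestion above; the wrapper type, its fields, and the interval constant are hypothetical:

```go
import "os"

// parquetFileWriter is a hypothetical wrapper used only for this sketch;
// the real writer type in this PR is assumed, not shown here.
type parquetFileWriter struct {
	writer         interface{ Flush() error } // underlying parquet writer
	path           string                     // on-disk path of the current file
	rowsSinceCheck int
}

// shouldRotate implements the suggestion above: after every checkInterval
// records, flush buffered rows and stat the file so the comparison uses
// real compressed bytes on disk instead of an uncompressed estimate.
func (w *parquetFileWriter) shouldRotate(maxBytes int64) (bool, error) {
	const checkInterval = 1000 // records between size checks

	w.rowsSinceCheck++
	if w.rowsSinceCheck < checkInterval {
		return false, nil
	}
	w.rowsSinceCheck = 0

	if err := w.writer.Flush(); err != nil { // push buffered rows to disk
		return false, err
	}
	info, err := os.Stat(w.path) // actual compressed size on disk
	if err != nil {
		return false, err
	}
	return info.Size() >= maxBytes, nil
}
```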

Collaborator:

[screenshot of the File().Size() method] Maybe this inbuilt function can help you; it can be used to calculate the file size in the parquet writer.

Contributor Author:

Thanks for pointing out the File().Size() method!
I tested it, but it tracks the offset of bytes already written through to the file, not including buffered data, so during active writing it reports a size that lags the data we've actually ingested.

From what I understand, the method seems designed for use after the file is closed, not during active writing.
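
For illustration, a sketch of what that offset-based check would look like. Only the File().Size() method named above is taken from the discussion; sizedWriter is a hypothetical stand-in for the concrete writer type:

```go
// sizedWriter is a hypothetical stand-in capturing only the File().Size()
// accessor discussed above; the concrete writer type is not assumed here.
type sizedWriter interface {
	File() interface{ Size() int64 }
}

// exceedsLimit compares the writer-reported size with the limit. As noted
// above, Size() only reflects bytes already written through to the file,
// so during active writing it lags the rows still buffered in memory.
func exceedsLimit(w sizedWriter, maxBytes int64) bool {
	return w.File().Size() >= maxBytes
}
```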
