
Conversation

@siiddhantt (Contributor) commented Oct 12, 2025

Description

This PR implements configurable file size and row count limits for the Parquet writer, enabling users to control output file sizes for optimal performance with downstream query engines.

Changes:

• Added max_file_size_mb and max_rows_per_file configuration fields (defaults: 512 MB, 1M rows)
• Implemented size estimation and rotation logic in write path
• Added estimateRecordSize() to calculate uncompressed data size
• Added shouldRotateFile() to check dual rotation criteria (size OR row count); see the sketch below
• Extended FileMetadata with row count and size tracking
• Added unit tests for configuration validation

Closes #90
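
A minimal sketch of that dual rotation check, for reviewers skimming the description. The constant and function names match this PR's diff, but the WriterConfig struct and the exact signature are illustrative assumptions, not the PR's actual code:

```go
// Defaults as stated in the description: 512 MB and 1M rows.
const (
	DefaultMaxFileSizeMB  = 512
	DefaultMaxRowsPerFile = 1000000
)

// WriterConfig is a hypothetical holder for the two new configuration
// fields; the JSON keys match the names in this PR.
type WriterConfig struct {
	MaxFileSizeMB  int64 `json:"max_file_size_mb"`
	MaxRowsPerFile int64 `json:"max_rows_per_file"`
}

// shouldRotateFile applies the dual criteria from the description:
// rotate once EITHER the size limit OR the row-count limit is reached.
func shouldRotateFile(cfg WriterConfig, currentBytes, currentRows int64) bool {
	return currentBytes >= cfg.MaxFileSizeMB*1024*1024 ||
		currentRows >= cfg.MaxRowsPerFile
}
```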

Type of change

  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Code Verification:
• Build successful, unit tests pass
• Uses uncompressed size estimation approach

Integration Testing:
Validated with 55,000 test records:

• Row rotation (max_rows_per_file: 1000): 55 files created, 1,000 rows each
• Size rotation (max_file_size_mb: 1): 20 files created, ~2,831 rows each (~0.18 MB compressed)
• Defaults (512 MB / 1M rows): 1 file, no rotation (data stayed below both limits)

Documentation

  • Documentation Link: [link to README, olake.io/docs, or olake-docs]
  • N/A (bug fix, refactor, or test changes only)

@vaibhav-datazip added the hacktoberfest label (Issues open for Hacktoberfest contributors) on Oct 14, 2025

```go
const (
	DefaultMaxFileSizeMB  = 512
	DefaultMaxRowsPerFile = 1000000
)
```
Collaborator:
we don't need to add a row-based size limit

Comment on lines +121 to +140
```go
// estimateRecordSize returns a rough uncompressed, in-memory size of a
// record; it does not account for Parquet encoding or compression.
func estimateRecordSize(data map[string]any) int64 {
	var size int64
	for key, value := range data {
		size += int64(len(key))
		switch v := value.(type) {
		case string:
			size += int64(len(v))
		case []byte:
			size += int64(len(v))
		case int, int32, int64, uint, uint32, uint64, float32, float64:
			size += 8
		case bool:
			size += 1
		case nil:
			size += 0
		default:
			size += 100 // conservative fallback for nested/unknown types
		}
	}
	return size
}
```
Collaborator:

Calculating the file size from the in-memory values might cause problems, since it doesn't account for the compression ratio.

Instead, can you research whether the parquet-go library provides a function that could help us here?

Contributor Author:

Hi @vaibhav-datazip,

I initially tried to get the actual file size with os.Stat(), but it reads 0 until Flush() is called on the writer, which made it unreliable for rotation checks. I also researched the library but didn't find any function that reports the compressed file size, since buffered data is only written on Flush/Close.

New suggestion: we can periodically check the actual file size every 1,000 records (or some configurable interval) by calling Flush() and then using os.Stat() to get the real compressed size on disk. That way we measure actual bytes after compression rather than estimates, and rotate to a new file once the limit is exceeded.

Tested with Docker integration tests across 1 MB, 5 MB, and 10 MB limits; the resulting file sizes come in very close to those limits.
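
A minimal sketch of that flush-then-stat approach. The Flush() call and os.Stat() usage come from the suggestion above; the wrapper type, its fields, and the interval constant are hypothetical:

```go
import "os"

// parquetFileWriter is a hypothetical wrapper used only for this sketch;
// the real writer type in this PR is assumed, not shown here.
type parquetFileWriter struct {
	writer         interface{ Flush() error } // underlying parquet writer
	path           string                     // on-disk path of the current file
	rowsSinceCheck int
}

// shouldRotate implements the suggestion above: after every checkInterval
// records, flush buffered rows and stat the file so the comparison uses
// real compressed bytes on disk instead of an uncompressed estimate.
func (w *parquetFileWriter) shouldRotate(maxBytes int64) (bool, error) {
	const checkInterval = 1000 // records between size checks

	w.rowsSinceCheck++
	if w.rowsSinceCheck < checkInterval {
		return false, nil
	}
	w.rowsSinceCheck = 0

	if err := w.writer.Flush(); err != nil { // push buffered rows to disk
		return false, err
	}
	info, err := os.Stat(w.path) // actual compressed size on disk
	if err != nil {
		return false, err
	}
	return info.Size() >= maxBytes, nil
}
```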

Collaborator:

[screenshot of the File().Size() method] Maybe this inbuilt function can help you; it can be used to calculate the file size in the parquet writer.

Contributor Author:

Thanks for pointing out the File().Size() method!
I tested it, but it tracks the offset of bytes already written through to the file, not including buffered data, so during active writing it reports a size that lags the data we've actually ingested.

From what I understand, the method seems designed for use after the file is closed, not during active writing.
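
For illustration, a sketch of what that offset-based check would look like. Only the File().Size() method named above is taken from the discussion; sizedWriter is a hypothetical stand-in for the concrete writer type:

```go
// sizedWriter is a hypothetical stand-in capturing only the File().Size()
// accessor discussed above; the concrete writer type is not assumed here.
type sizedWriter interface {
	File() interface{ Size() int64 }
}

// exceedsLimit compares the writer-reported size with the limit. As noted
// above, Size() only reflects bytes already written through to the file,
// so during active writing it lags the rows still buffered in memory.
func exceedsLimit(w sizedWriter, maxBytes int64) bool {
	return w.File().Size() >= maxBytes
}
```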
