Skip to content

Conversation

@tamird
Copy link
Member

@tamird tamird commented Dec 3, 2025

  • Use .display rather than Debug
  • Use mmap to avoid reading entire input files

This change is Reviewable

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@tamird tamird force-pushed the mmap-simpler branch 2 times, most recently from 30c0e42 to 4140f32 Compare December 3, 2025 18:18
@alessandrod
Copy link
Collaborator

I'm meh on this change, virtually all inputs are rlibs so we're not mmapping anyway, and not clear what the point of mmapping here even is we're just going through a worse abstraction (mmap is top 3 worst unix abstractions)

@tamird
Copy link
Member Author

tamird commented Dec 4, 2025

I'm meh on this change, virtually all inputs are rlibs so we're not mmapping anyway, and not clear what the point of mmapping here even is we're just going through a worse abstraction (mmap is top 3 worst unix abstractions)

I don't understand your objection. This change allows us to stop worrying about how much of the file we need to read to guess its type.

@BretasArthur1
Copy link

ge allows us to stop worrying about how much of the file we need to read to guess its type.

You mean by using mmap we can treat the file as a contiguous byte slice, so we can check the header or signature without manually allocating or reading into a buffer, right?

@tamird
Copy link
Member Author

tamird commented Dec 4, 2025

ge allows us to stop worrying about how much of the file we need to read to guess its type.

You mean by using mmap we can treat the file as a contiguous byte slice, so we can check the header or signature without manually allocating or reading into a buffer, right?

Yes

@BretasArthur1
Copy link

BretasArthur1 commented Dec 4, 2025

And what's the trade off with this approach?

@tamird
Copy link
Member Author

tamird commented Dec 4, 2025

It's a few syscalls, but it's constant. ChatGPT can give a more complete answer.

@alessandrod
Copy link
Collaborator

I'm meh on this change, virtually all inputs are rlibs so we're not mmapping anyway, and not clear what the point of mmapping here even is we're just going through a worse abstraction (mmap is top 3 worst unix abstractions)

I don't understand your objection. This change allows us to stop worrying about how much of the file we need to read to guess its type.

But who is worried and why? We can just read_to_end, we're already reading eveything for rlibs which are 99% of the inputs anyway. Mmap just adds an extra layer of (bad) abstraction and an extra dependency for no actual material gain imo.

@tamird
Copy link
Member Author

tamird commented Dec 4, 2025

I'm meh on this change, virtually all inputs are rlibs so we're not mmapping anyway, and not clear what the point of mmapping here even is we're just going through a worse abstraction (mmap is top 3 worst unix abstractions)

I don't understand your objection. This change allows us to stop worrying about how much of the file we need to read to guess its type.

But who is worried and why? We can just read_to_end, we're already reading eveything for rlibs which are 99% of the inputs anyway. Mmap just adds an extra layer of (bad) abstraction and an extra dependency for no actual material gain imo.

I don't understand this use of the word "already". The current implementation reads twice for rlibs, this gets that down to once. I also don't understand the complaints about mmap, we're pretty insulated from whatever it is you dislike and just get a byte slice.

The real payoff is in the simplicity we get in #323 as a result.

@alessandrod
Copy link
Collaborator

The real payoff is in the simplicity we get in #323 as a result.

Is it tho? Isn't it a lot simpler to do read_to_end and then just pass the slice around? No new dependencies, no mmap? BPF files are tiny, we don't need the demand paging mmap gives you. We can read the whole thing in one line and call it a day.

@tamird
Copy link
Member Author

tamird commented Dec 4, 2025

The real payoff is in the simplicity we get in #323 as a result.

Is it tho? Isn't it a lot simpler to do read_to_end and then just pass the slice around? No new dependencies, no mmap? BPF files are tiny, we don't need the demand paging mmap gives you. We can read the whole thing in one line and call it a day.

Seems strictly worse. What am I missing?

@alessandrod
Copy link
Collaborator

Seems strictly worse. What am I missing?

The goal is to pass a slice around. How is reading into a buffer and passing the slice around worse and more complex than adding a dependency and doing a syscall to ask linux to do magic to let us pass a slice around?

@alessandrod
Copy link
Collaborator

Or I guess, another point of view:

When I see mmap I think that it's being used to demand page a file, not to pass &[u8]. Passing &[u8] is a type system thing. Mmap is a system level thing.

Mmap makes no sense for the kind of inputs we use. The min granularity of the page cache is 4k anyway, and it'll actually prefetch, so given our inputs, we're reading everything from disk anyway, just with more steps (even leaving aside that most inputs are archives so we read them anyway).

So just conceptually, it makes no sense to me to be doing this. If you want to pass a slice, read a buffer and pass a slice?

Avoid short reads and seeking.
@tamird
Copy link
Member Author

tamird commented Dec 4, 2025

I split out the use of mmap into its own commit. In my mind using mmap frees us from having to decide whether input files are big or not - we just ask the kernel to deal with it, which it already has to since all reads go through the page cache.

If you want me to remove the tail commit, I can.

@tamird tamird changed the title Use mmap to avoid reading entire input files Avoid short reads and seeking of inputs Dec 4, 2025
@alessandrod
Copy link
Collaborator

yes pls remove mmap - death to unnecessary dependencies! Otherwise looks good

@tamird
Copy link
Member Author

tamird commented Dec 4, 2025

Done. I'll save the branch for when we discover someone is sending us huge inputs :)

@BretasArthur1
Copy link

So, this one will not affect #323 anymore right? Should we wrap there the last changes and get it ready to merge?

@tamird tamird merged commit ac0f305 into main Dec 5, 2025
78 of 79 checks passed
@tamird tamird deleted the mmap-simpler branch December 5, 2025 14:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants