Avoid short reads and seeking of inputs #329
Conversation
I'm meh on this change. Virtually all inputs are rlibs, so we're not mmapping anyway, and it's not clear what the point of mmapping here even is; we're just going through a worse abstraction (mmap is top 3 worst Unix abstractions).
I don't understand your objection. This change allows us to stop worrying about how much of the file we need to read to guess its type.
You mean by using mmap we can treat the file as a contiguous byte slice, so we can check the header or signature without manually allocating or reading into a buffer, right?
Yes.
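For context, a minimal sketch of the mmap approach under discussion, assuming the memmap2 crate; the crate choice, the looks_like_elf helper, and the ELF-only check are illustrative assumptions, not the PR's actual code:

```rust
use std::fs::File;
use std::io;
use std::path::Path;

use memmap2::Mmap;

fn looks_like_elf(path: &Path) -> io::Result<bool> {
    let file = File::open(path)?;
    // SAFETY: the mapping is read-only, and we assume the file is not
    // truncated or modified while it is mapped.
    let map = unsafe { Mmap::map(&file)? };
    // The whole file is visible as one contiguous &[u8], so a header check
    // needs no explicit read, no buffer sizing, and no seek.
    let bytes: &[u8] = &map;
    Ok(bytes.starts_with(b"\x7fELF"))
}
```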
And what's the trade-off with this approach?
It's a few syscalls, but it's constant. ChatGPT can give a more complete answer.
But who is worried, and why? We can just read_to_end; we're already reading everything for rlibs, which are 99% of the inputs anyway. Mmap just adds an extra layer of (bad) abstraction and an extra dependency for no actual material gain, imo.
I don't understand this use of the word "already". The current implementation reads twice for rlibs; this gets that down to once. I also don't understand the complaints about mmap: we're pretty insulated from whatever it is you dislike and just get a byte slice. The real payoff is the simplicity we get in #323 as a result.
Is it, though? Isn't it a lot simpler to do read_to_end and then just pass the slice around? No new dependencies, no mmap. BPF files are tiny; we don't need the demand paging mmap gives us. We can read the whole thing in one line and call it a day.
Seems strictly worse. What am I missing?
The goal is to pass a slice around. How is reading into a buffer and passing the slice around worse and more complex than adding a dependency and doing a syscall to ask Linux to do magic to let us pass a slice around?
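For comparison, a minimal sketch of the read-everything alternative being argued for, using only the standard library; guess_kind and the exact magic-number checks are illustrative assumptions:

```rust
use std::fs;
use std::io;
use std::path::Path;

fn load_input(path: &Path) -> io::Result<Vec<u8>> {
    // fs::read opens the file and reads it to the end in one call, so there
    // are no short reads to handle and nothing to seek.
    fs::read(path)
}

fn guess_kind(bytes: &[u8]) -> &'static str {
    if bytes.starts_with(b"!<arch>\n") {
        "ar archive (e.g. an rlib)"
    } else if bytes.starts_with(b"\x7fELF") {
        "ELF object"
    } else {
        "unknown"
    }
}
```

Either way the downstream code sees a &[u8]; the disagreement is only about how that slice gets produced.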
Or I guess, another point of view: when I see mmap, I think it's being used to demand-page a file, not to pass a slice around. Mmap makes no sense for the kind of inputs we use. The minimum granularity of the page cache is 4k anyway, and it'll actually prefetch, so given our inputs we're reading everything from disk anyway, just with more steps (even leaving aside that most inputs are archives, so we read them fully anyway). So, conceptually, it makes no sense to me to be doing this. If you want to pass a slice, read into a buffer and pass a slice.
Avoid short reads and seeking.
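For readers unfamiliar with the title's terminology, a generic illustration of the pattern the PR moves away from, not the code it actually removes: a bare read may return fewer bytes than the buffer holds (a short read), so header sniffing needs read_exact and then a seek back to the start before the rest of the file is read.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

fn sniff_magic(path: &Path) -> io::Result<[u8; 8]> {
    let mut file = File::open(path)?;
    let mut magic = [0u8; 8];
    // read_exact loops internally until the buffer is filled (or fails),
    // unlike `read`, which may fill only part of it.
    file.read_exact(&mut magic)?;
    // The caller must rewind before reading the full contents later.
    file.seek(SeekFrom::Start(0))?;
    Ok(magic)
}
```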
I split out the use of mmap into its own commit. In my mind, using mmap frees us from having to decide whether input files are big or not; we just ask the kernel to deal with it, which it already has to do, since all reads go through the page cache. If you want me to remove the tail commit, I can.
Yes, please remove mmap - death to unnecessary dependencies! Otherwise looks good.
Done. I'll save the branch for when we discover someone is sending us huge inputs :) |
So this one won't affect #323 anymore, right? Should we wrap up the remaining changes there and get it ready to merge?