Read debug_names section from DWARF 5. #807
Adds sample implementation to dwarfdump that prints the same information as llvm-dwarfdump for MachO binaries.
Commits 83cda2a to ca83ea3
src/read/names.rs
Outdated
#[derive(Debug)]
pub struct DebugNamesUnit<R: Reader> {
    header: DebugNamesHeader<R>,
    content: R,
I haven't looked at much of the PR yet, but for some early feedback, this probably should have an R for each part of the unit. Then things like name_table_reader can simply return that R instead of needing to do size calculations. Once that is done, there's probably no need to store all the sizes from the header.
Ok, I didn't understand this pattern on my first reading of the code. If I understand correctly, I should store pre-sliced Reader instances for each section in DebugNamesUnit<R>, similar to the pattern in CFI (CommonInformationEntry, FrameDescriptionEntry). For example:
pub struct DebugNamesUnit<R: Reader> {
    // Metadata
    version: u16,
    format: Format,
    augmentation_string: R,
    // Counts for iteration
    comp_unit_count: u32,
    local_type_unit_count: u32,
    foreign_type_unit_count: u32,
    bucket_count: u32,
    name_count: u32,
    abbrev_table_size: u32,
    // Pre-sliced readers for each section
    comp_unit_list: R,
    local_type_unit_list: R,
    foreign_type_unit_list: R,
    hash_table_data: R,
    name_table_data: R,
    abbreviation_table: R,
    entry_pool: R,
}
Basically yes, but you should be able to avoid a lot of the counts. Probably the main one you need to keep is bucket_count, since that is needed for converting a hash to a bucket index. For the others, we only need to look entries up by index.
Related to this, I see a lot of functions for reading these tables and storing them as Vecs. We should avoid that for simple arrays. The abbreviation table will need to be parsed into another data structure, but the rest should be kept as R instead of parsing them into a Vec.
I haven't looked at this code closely, so I may be missing something, but the goal is to be zero copy where possible. These tables can potentially be quite large, and the user will only need to access a few entries at a time.
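To illustrate the pattern being described, here is a minimal standalone sketch, using a plain byte slice in place of gimli's `R: Reader`. The hash is the Bernstein (djb2) function that DWARF 5 specifies for .debug_names hashing (section 6.1.1.4.5); the fixed-stride table access shows how entries can be sliced on demand rather than parsed into a Vec up front. Names and signatures here are illustrative, not gimli's API.

```rust
// Sketch only: a byte slice stands in for the pre-sliced reader `R`.

// DWARF 5 .debug_names hash (Bernstein/djb2, section 6.1.1.4.5).
// The bucket for a name is this hash modulo bucket_count.
fn debug_names_hash(name: &[u8]) -> u32 {
    name.iter()
        .fold(5381u32, |h, &c| h.wrapping_mul(33).wrapping_add(u32::from(c)))
}

// Fixed-stride tables (hashes, string offsets, CU offsets) can be indexed
// directly: compute the byte range for one entry and slice it lazily.
fn entry_at(table: &[u8], index: usize, entry_size: usize) -> Option<&[u8]> {
    let start = index.checked_mul(entry_size)?;
    let end = start.checked_add(entry_size)?;
    table.get(start..end)
}
```

With this shape, a lookup touches only the few entries it needs, which is the zero-copy behaviour described above.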
Parts of this are clearly AI generated (though I don't necessarily believe all of it is). And a lot of it does actually look reasonable to me. I don't know what policy this project has on using AI tools.
Still, this PR requires greater scrutiny than I could give it in a half-hour review.
    object::Architecture::I386 => "Mach-O i386",
    _ => "Mach-O",
},
object::BinaryFormat::Pe => "PE",
Is PE with DWARF even a thing? I thought Microsoft used their own debug info format?
Yes, the MinGW toolchain uses this. GDB doesn't support pdb files.
There's some support for adding DWARF sections to PE/COFF executables in clang-cl and lldb has basic support for DWARF on Windows. It's an unusual combination for sure. I'll remove it.
It's fine to have PE. This is better than printing "Unknown" for PE.
This file format line is unrelated to .debug_names. If we're going to add this, it should be at the start of the output for all sections (in main or dump_file), not here.
/// Contains the absolute offset of the parent entry in the entry pool
/// if this entry has a parent relationship. None indicates either
/// no parent (top-level entry) or that parent information is not indexed.
pub parent_info: Option<u64>,
I don't see a list of what attributes are allowed / required / expected for entries in the entry pool.
From 6.1.1.4.8:
Each index entry in the series begins with an abbreviation code, and is followed
by the attributes described by the abbreviation declaration for that code. The last
index entry in the series is followed by a terminating entry whose abbreviation
code is 0.
So I'm a bit confused by the ones you have added here. Would it be better to expose a more generic way to iterate over the attributes, just like is done for .debug_info?
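A rough sketch of the entry-series shape quoted from 6.1.1.4.8, under the simplifying assumption that entries carry no attribute bytes (a real reader would parse the attributes described by each abbreviation declaration before reading the next code). The helper names are illustrative, not gimli's API.

```rust
// Minimal ULEB128 decoder: little-endian base-128 with a continuation bit.
fn read_uleb128(data: &[u8], pos: &mut usize) -> Option<u64> {
    let (mut result, mut shift) = (0u64, 0u32);
    loop {
        let byte = *data.get(*pos)?;
        *pos += 1;
        result |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(result);
        }
        shift += 7;
    }
}

// Walk a series of index entries: each begins with a ULEB128 abbreviation
// code, and a code of 0 terminates the series (DWARF 5, 6.1.1.4.8).
fn abbrev_codes(entry_pool: &[u8]) -> Vec<u64> {
    let (mut pos, mut codes) = (0usize, Vec::new());
    while let Some(code) = read_uleb128(entry_pool, &mut pos) {
        if code == 0 {
            break; // terminating entry
        }
        codes.push(code);
        // Assumption for this sketch: zero attribute bytes per entry.
        // Real parsing would consume the attributes for `code` here.
    }
    codes
}
```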
Yes, I think we should start with a generic way to obtain the attributes (and that is what dwarfdump should use), and add helpers on top of that to find particular attributes.
No formal policy yet. I haven't noticed any prior AI contributions. I'd expect the following at least:

Not done in this case, obviously.
I think for the most part this was done, but perhaps not for the code documentation comments. (Or perhaps AI was only used for the comments?) The issues in the code itself could plausibly have been human errors that we all make. The obvious AI-generated parts were all in the documentation, which made me scrutinise it more closely, cross-check more references, and find more issues.
I don't think this PR necessarily feels low quality to me. It just feels like the AI wasn't fully checked. I would say it needs a bit of work, plus checking by someone more familiar with gimli's internals than me. For example: there were some confusing constructs in the unit tests, but I have not compared them to other unit tests in gimli to see how they work. I think the right approach at the moment is a nuanced one.
Just to be clear (since my previous comment wasn't): I wasn't making a judgement on this PR, just saying what the policy would be. I still haven't done a review.
Apologies for the confusion. I used AI tooling to extract/summarise parts of the DWARF spec and add them as code comments. They're too wordy (or inaccurate) for the style of this library; however, I do find the links back to specific sections in the DWARF spec useful. The Rust errors are all my own, unfortunately. Thank you for the review comments, I'll continue working on this.
I've reviewed the dwarfdump changes only.
writeln!(
    w,
    "Name Index @ 0x{:x} {{",
    unit_index * header.length().into_u64()
Add a header.offset() method instead.
writeln!(
    w,
    "  Format: {}",
    match header.format() {
        gimli::Format::Dwarf32 => "DWARF32",
        gimli::Format::Dwarf64 => "DWARF64",
    }
)?;
Suggested change:

writeln!(w, " Format: {:?}", header.format())?;
This is what we already use elsewhere. Matching llvm-dwarfdump exactly may be useful, but I'd prefer that to be done separately.
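For context on why the two styles differ: `{:?}` on a derived `Debug` enum prints the bare variant name, so the suggestion would print "Dwarf32" rather than llvm-dwarfdump's "DWARF32". A standalone illustration with a mock enum (not gimli's actual type):

```rust
// Mock enum standing in for gimli::Format, only to show what `{:?}` prints.
#[derive(Debug)]
enum Format {
    Dwarf32,
    Dwarf64,
}

// Derived Debug formatting yields the variant name verbatim.
fn format_label(f: &Format) -> String {
    format!("{:?}", f)
}
```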
if header.augmentation_string_size() > 0 {
    let mut aug_reader = header.augmentation_string().clone();
    let mut aug_string = Vec::new();
    while !aug_reader.is_empty() {
        match aug_reader.read_u8() {
            Ok(b) => aug_string.push(b),
            Err(_) => break,
        }
    }
    if let Ok(aug_str) = String::from_utf8(aug_string) {
        writeln!(w, "  Augmentation: '{}'", aug_str)?;
    }
}
Suggested change:

if let Some(s) = header.augmentation_string() {
    writeln!(w, "  Augmentation: '{}'", s.to_string_lossy()?)?;
}
Change augmentation_string to return an Option and delete augmentation_string_size from the API.
loop {
    // Read abbreviation code
    let abbrev_code = entry_reader.read_uleb128()?;
This is parsing code. That's gimli's job, we shouldn't be doing it in dwarfdump.
HashMap::new();

for (name_idx, hash, string_offset, entry_offset) in names {
    let string_content = resolve_string_name(dwarf, string_offset);
Suggested change:

let Ok(string_content) = dwarf.debug_str.get_str(gimli::DebugStrOffset(string_offset)) else {
    // TODO print warning
    continue;
};
get_str already handles null termination, and we shouldn't proceed with an invalid string.
// Group entries by resolved string content to handle multiple entries per name
// Following DWARF 5 spec: "If two different symbol names produce the same hash value,
// that hash value will occur twice in the hashes array"
I don't understand why we need to do this. The hash table is grouped by name, and its entries point back to the name, so we already have this grouping. Why are we building up these maps instead of simply iterating over the buckets as we display?
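A rough sketch of iterating the buckets directly, under my reading of DWARF 5 section 6.1.1.4.5: buckets[i] holds the 1-based index of the first hash in bucket i (0 meaning an empty bucket), and the hashes for one bucket are stored contiguously, so names can be grouped without building any maps. This is illustrative, with plain slices standing in for the pre-sliced readers.

```rust
// Return the 1-based name indices belonging to one bucket, by scanning the
// contiguous run of hashes whose bucket (hash % bucket_count) matches.
// Assumes `buckets` and `hashes` follow the DWARF 5 .debug_names layout.
fn names_in_bucket(buckets: &[u32], hashes: &[u32], bucket: usize) -> Vec<usize> {
    let bucket_count = buckets.len() as u32;
    let first = buckets[bucket];
    if first == 0 {
        return Vec::new(); // empty bucket
    }
    let mut out = Vec::new();
    let mut i = (first - 1) as usize; // convert 1-based index to 0-based
    while i < hashes.len() && hashes[i] % bucket_count == bucket as u32 {
        out.push(i + 1); // name indices are 1-based in the spec
        i += 1;
    }
    out
}
```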
I assume you're still working on the previous review comments. I resolved the ones that look like they've been addressed. Let me know when you're ready for further review or need guidance.
Fix for #469. This adds support for reading the debug_names section in DWARF 5 as documented in the DWARF 5 spec "6.1 Accelerated Access".
I also added a sample implementation to dwarfdump that prints the same information as llvm-dwarfdump for Mach-O binaries. The section structure is the same for ELF binaries.
This is my first significant contribution to a Rust library, so the API might not be ideal and I might have unidiomatic Rust code. I would appreciate specific suggestions to make it better.
It needs further testing on Linux ELF and some of the corner cases.