ThumbcacheParser for Metadata and Thumbnail Extraction #2349
…thods - Added serialVersionUID for serialization compatibility. - Implemented getSupportedTypes to return supported media types. - Implemented parse method to extract embedded documents.
…HTMLContentHandler
…-inc/IPED into feature/thumbcache-parser # Conflicts: # iped-parsers/iped-parsers-impl/src/main/java/iped/parsers/misc/ThumbcacheParser.java
…d extract metadata
- Implemented `parseThumbcacheFile` method to read and parse thumbcache file entries. - Added detailed logging of thumbcache entry attributes. - Utilized `ByteBuffer` for reading binary data with little-endian byte order.
- Implemented `ThumbcacheParser` class to parse thumbcache files. - Added `parseThumbcacheFile` method to read and parse thumbcache file entries. - Utilized `ByteBuffer` for reading binary data with little-endian byte order. - Added detailed logging of thumbcache entry attributes.
- Removed repetitive logging of thumbcache entry attributes. - Adjusted encoding to use `StandardCharsets.UTF_16LE` for identifier strings.
Thanks @marcus6n. Could you take a look and help him, @hauck-jvsh?
- Implemented functionality to detect and save images extracted from the .thumbcache file. - Added support for detecting image file extensions (BMP, JPG, PNG) based on the first bytes of the image data. - Images are saved in the 'output' directory with names based on their hash. - Improved parsing of the thumbcache file and added error handling for image saving process.
Ok, I will take a look today.
…zation - Modify parse method to return and extract image data - Add getLastSavedFileName method to retrieve last saved thumbnail - Enhance metadata handling for embedded image visualization - Ensure compatibility with Tika's embedded document extraction
…extraction - Import TikaCoreProperties for RESOURCE_NAME_KEY - Update metadata setting to use correct resource name key - Maintain multiple image extraction logic - Improve embedded document handling
Pull Request Overview
This PR introduces the initial implementation of the ThumbcacheParser to extract metadata and images from thumbcache files.
- Implements the ThumbcacheParser class that processes thumbcache file headers, entries, and embedded images.
- Adds a corresponding unit test (ThumbcacheParserTest) that validates the parser’s output against expected metadata strings.
Reviewed Changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| iped-parsers/iped-parsers-impl/src/main/java/iped/parsers/misc/ThumbcacheParser.java | Implements the parser for thumbcache files including header parsing, entry extraction, and MIME type identification. |
| iped-parsers/iped-parsers-impl/src/test/java/iped/parsers/misc/ThumbcacheParserTest.java | Introduces unit tests that verify the output strings from the ThumbcacheParser match expected values. |
Files not reviewed (1)
- iped-app/resources/config/conf/ParserConfig.xml: Language not supported
Comments suppressed due to low confidence (3)
iped-parsers/iped-parsers-impl/src/main/java/iped/parsers/misc/ThumbcacheParser.java:140
- Verify that the number of bytes read into identifierBytes matches the expected identifierStringSize to avoid incomplete data reading.
stream.read(identifierBytes);
iped-parsers/iped-parsers-impl/src/main/java/iped/parsers/misc/ThumbcacheParser.java:159
- Check that stream.read(imageData) returns the full dataSize bytes to ensure complete image extraction.
stream.read(imageData);
iped-parsers/iped-parsers-impl/src/main/java/iped/parsers/misc/ThumbcacheParser.java:171
- [nitpick] Consider extending the PNG signature check to validate more bytes (e.g., the full 8-byte PNG signature) for more accurate MIME type detection.
if (data.length >= 4) {
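To make the signature-based MIME detection discussed in the last nitpick concrete, here is a minimal sketch, not the code from this PR. The class and method names (`ImageSignatureExample`, `detectMimeType`) are hypothetical, and the full 8-byte PNG comparison is only the optional stricter variant suggested by the nitpick; a shorter prefix check may be perfectly adequate in practice.

```java
import java.util.Arrays;

// Hypothetical illustration, not the ThumbcacheParser implementation.
class ImageSignatureExample {

    // Full 8-byte PNG signature as defined by the PNG specification.
    private static final byte[] PNG_SIGNATURE = {
        (byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A
    };

    // Guesses a MIME type from the leading bytes of the image data.
    static String detectMimeType(byte[] data) {
        if (data.length >= 8 && Arrays.equals(Arrays.copyOf(data, 8), PNG_SIGNATURE)) {
            return "image/png";
        }
        // JPEG streams start with the SOI marker FF D8 followed by another FF marker byte.
        if (data.length >= 3 && (data[0] & 0xFF) == 0xFF
                && (data[1] & 0xFF) == 0xD8 && (data[2] & 0xFF) == 0xFF) {
            return "image/jpeg";
        }
        // BMP files start with the ASCII characters "BM".
        if (data.length >= 2 && data[0] == 'B' && data[1] == 'M') {
            return "image/bmp";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] jpegStart = { (byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0 };
        System.out.println(detectMimeType(jpegStart));
    }
}
```

Falling back to a generic type when nothing matches keeps unknown payloads from being mislabeled as images.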
`stream.read(fileHeader.array());`
This is not correct: read should be called in a loop, because it may return fewer bytes than requested even when EOF has not been reached. The readNBytes method is safer.
Did anyone enable Copilot review here, or was it automatic? Anyway, most comments make sense, except the one about the PNG header; 4 bytes seems fine to me. @marcus6n, please replace all stream.read(...) calls with the safer readNBytes method and check the returned value.
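A minimal sketch of the requested pattern, assuming Java 9 or later (where `InputStream.readNBytes(byte[], int, int)` is available). The helper name `readExactly` and the class name are hypothetical, and throwing `EOFException` on a short read is one possible policy, not necessarily what the parser should do.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not the ThumbcacheParser implementation.
class ReadFullyExample {

    // Reads exactly len bytes, or throws if the stream ends before that.
    static byte[] readExactly(InputStream stream, int len) throws IOException {
        byte[] buffer = new byte[len];
        // readNBytes keeps reading until len bytes are available or EOF is reached,
        // unlike read(byte[]), which may return after a partial read.
        int read = stream.readNBytes(buffer, 0, len);
        if (read < len) {
            throw new EOFException("Expected " + len + " bytes but got " + read);
        }
        return buffer;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] { 'C', 'M', 'M', 'M', 1, 2 });
        byte[] signature = readExactly(in, 4);
        System.out.println(new String(signature, StandardCharsets.US_ASCII));
        try {
            readExactly(in, 8); // only 2 bytes remain
        } catch (EOFException e) {
            System.out.println("Truncated entry: " + e.getMessage());
        }
    }
}
```

Centralizing the length check in one helper means every header, identifier, and image read gets the same truncation handling.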
@lfcnassif The Copilot review was accidental: when I went to request a review from your username, Copilot was listed first. You can disregard it.
Ok, I'll make these changes and replace all stream.read(...) calls with the safer readNBytes method, as well as check the returned value.
No problem, it was valid and pointed out things I have to point out every time; maybe we can make more use of it. Good to know Microsoft offered some free quota. I think it's the minimum they should do after using code from several GitHub projects to train Copilot without explicitly asking the authors. Some developers are moving away from GitHub because of that.
I believe this access to Copilot review in pull requests comes from the GitHub Student Developer Pack, which I'm subscribed to through my college and which offers free tools and resources for students, including Copilot. I don't know if you have access to it, but it would be worth checking, as it's a very good tool.
@lfcnassif could you take a look at the requested change and check what’s still missing for the images to be properly displayed within IPED? This is the last pending item before we can wrap things up and publish the feature. Once this is working, we’re good to go. Let me know if there’s anything I can help with to move it forward!
@marcus6n, please take a careful look at my commits. 0274fb0 made the extraction of thumbnails work. ab192e8, 486ff59, f754add and 531fb0d are important fixes: we now extract thumbs from format versions that were not handled correctly before, and more thumbs from format versions that were already handled (with the stream.skip(n) fix).
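The exact change is in the commits referenced above. As a general illustration only, one common pitfall with `InputStream.skip(n)` is that it may skip fewer bytes than requested, much like `read`. The sketch below assumes that kind of problem; the `skipFully` helper and class name are hypothetical. On Java 12 or later, `InputStream.skipNBytes(n)` offers similar behavior out of the box.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration, not the code from the referenced commits.
class SkipFullyExample {

    // Skips exactly n bytes, or throws if the stream ends first.
    static void skipFully(InputStream stream, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = stream.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
                continue;
            }
            // skip() made no progress: read one byte to tell EOF apart from a stalled skip.
            if (stream.read() == -1) {
                throw new EOFException("Stream ended with " + remaining + " bytes left to skip");
            }
            remaining--;
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] { 0, 1, 2, 3, 4, 5 });
        skipFully(in, 4);
        System.out.println(in.read()); // next unread byte value
    }
}
```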
With commit 8dc6cac, the number of recovered thumbnails increased from ~70k to ~338k on the test corpus.
@lfcnassif Thank you for the contributions and detailed explanations. I’ll carefully review all the changes and improvements, especially the highlighted commits. |
I just ran a comparison between the number of thumbnails recovered by this implementation and by the carving module when run over thumbcache files (already enabled in the forensic and pedo profiles). I disabled the carving module's min/max file size restrictions to make the comparison fair. This PR recovered ~338k thumbs, while the carving module recovered ~347k thumbs from the 2k thumbcache test corpus I collected, so this implementation is missing ~9k thumbnails. I locally implemented an exhaustive search for the CMMM cache entry signature when it is not found at the expected positions (trying to find deleted/unallocated entries), but the results were exactly the same as with this PR. I'm not sure what is missing here. I expected this implementation to do better at recovering "fragmented" thumbnails from thumbcache files, but according to the libyal project documentation, thumbnail data is never fragmented; it is always sequential. So I don't expect this PR to recover more files than the carving module. However, the key point of this proposal, not implemented yet, is correlating the thumbnail identifier/hash with the Windows.edb database. That would give us the path of the original picture from which each thumbnail was generated, which is valuable information from a forensic perspective. @marcus6n, could you try to implement this correlation?
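For readers unfamiliar with the exhaustive signature search mentioned above, a minimal sketch could look like the following. It is not the local implementation described in the comment: the class and method names are hypothetical, it scans an in-memory byte array rather than a stream, and it does not validate that a hit is a real cache entry.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a brute-force scan for the "CMMM" signature.
class SignatureScanExample {

    private static final byte[] SIGNATURE = "CMMM".getBytes(StandardCharsets.US_ASCII);

    // Returns every offset in data at which the 4-byte signature occurs.
    static List<Integer> findSignatureOffsets(byte[] data) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i + SIGNATURE.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < SIGNATURE.length; j++) {
                if (data[i + j] != SIGNATURE[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] sample = "CMMMxxxxCMMMyy".getBytes(StandardCharsets.US_ASCII);
        System.out.println(findSignatureOffsets(sample));
    }
}
```

Each candidate offset would still need its entry header parsed and sanity-checked before any thumbnail data is extracted, since the four signature bytes can also appear by chance inside image data.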
@lfcnassif Understood. I’ll try to implement the correlation between the thumbnail identifiers/hashes and the Windows.edb entries as suggested, and will do my best to fine-tune the implementation to match the expected behavior. I’ll keep you posted on any findings or issues during the process.
This Pull Request introduces the initial implementation of the ThumbcacheParser class, designed to process thumbcache files for metadata extraction and image conversion.