@marcus6n
Contributor

@marcus6n marcus6n commented Oct 23, 2024

This Pull Request introduces the initial implementation of the ThumbcacheParser class, designed to process thumbcache files for metadata extraction and image conversion.

…thods

- Added serialVersionUID for serialization compatibility.
- Implemented getSupportedTypes to return supported media types.
- Implemented parse method to extract embedded documents.
…-inc/IPED into feature/thumbcache-parser

# Conflicts:
#	iped-parsers/iped-parsers-impl/src/main/java/iped/parsers/misc/ThumbcacheParser.java
- Implemented `parseThumbcacheFile` method to read and parse thumbcache file entries.
- Added detailed logging of thumbcache entry attributes.
- Utilized `ByteBuffer` for reading binary data with little-endian byte order.
- Implemented `ThumbcacheParser` class to parse thumbcache files.
- Added `parseThumbcacheFile` method to read and parse thumbcache file entries.
- Utilized `ByteBuffer` for reading binary data with little-endian byte order.
- Added detailed logging of thumbcache entry attributes.
- Removed repetitive logging of thumbcache entry attributes.
- Adjusted encoding to use `StandardCharsets.UTF_16LE` for identifier strings.
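
For readers following the commit notes above, here is a minimal sketch of reading little-endian fields with `ByteBuffer` and decoding an identifier with `StandardCharsets.UTF_16LE`. The class name, method names, and the two-field layout (a 4-byte ASCII signature followed by a 32-bit integer) are illustrative assumptions, not the PR's actual code or the full thumbcache header.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the PR.
final class LittleEndianSketch {

    // Reads a 4-byte ASCII signature followed by a little-endian 32-bit integer.
    // This two-field layout is an assumption made for illustration only.
    static String describeHeader(byte[] header) {
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        byte[] signature = new byte[4];
        buf.get(signature);        // byte order does not affect single bytes
        int value = buf.getInt();  // decoded as little-endian
        return new String(signature, StandardCharsets.US_ASCII) + ":" + value;
    }

    // Decodes an identifier stored as UTF-16LE bytes.
    static String decodeIdentifier(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16LE);
    }
}
```

Setting the byte order once on the buffer keeps every later `getInt()` or `getLong()` call consistent, which is less error-prone than swapping bytes by hand.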
@lfcnassif lfcnassif marked this pull request as draft October 23, 2024 22:21
@lfcnassif
Member

Thanks @marcus6n. Could you take a look and help him @hauck-jvsh?

- Implemented functionality to detect and save images extracted from the .thumbcache file.
- Added support for detecting image file extensions (BMP, JPG, PNG) based on the first bytes of the image data.
- Images are saved in the 'output' directory with names based on their hash.
- Improved parsing of the thumbcache file and added error handling for image saving process.
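
The extension detection described in this commit can be sketched as a check of the leading "magic" bytes. This is an assumption-laden illustration rather than the PR's code: the class name, the method name, and the `"bin"` fallback are hypothetical, and only the first 4 bytes of the PNG signature are compared.

```java
// Hypothetical helper, not part of the PR.
final class ImageTypeSketch {

    // Returns a file extension guessed from the first bytes of the image data.
    // "bin" is an illustrative fallback for unknown or too-short data.
    static String detectExtension(byte[] data) {
        if (data == null || data.length < 4) {
            return "bin";
        }
        // PNG starts with 0x89 'P' 'N' 'G'
        if ((data[0] & 0xFF) == 0x89 && data[1] == 'P' && data[2] == 'N' && data[3] == 'G') {
            return "png";
        }
        // JPEG starts with 0xFF 0xD8 0xFF
        if ((data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xD8 && (data[2] & 0xFF) == 0xFF) {
            return "jpg";
        }
        // BMP starts with 'B' 'M'
        if (data[0] == 'B' && data[1] == 'M') {
            return "bmp";
        }
        return "bin";
    }
}
```

Masking with `0xFF` matters because Java bytes are signed, so a raw comparison of `data[0]` against `0x89` or `0xFF` would never match.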
@marcus6n marcus6n requested a review from lfcnassif November 11, 2024 20:43
@hauck-jvsh
Member

Ok, I will take a look today.

…zation

- Modify parse method to return and extract image data
- Add getLastSavedFileName method to retrieve last saved thumbnail
- Enhance metadata handling for embedded image visualization
- Ensure compatibility with Tika's embedded document extraction
…extraction

- Import TikaCoreProperties for RESOURCE_NAME_KEY
- Update metadata setting to use correct resource name key
- Maintain multiple image extraction logic
- Improve embedded document handling
@marcus6n marcus6n removed the request for review from lfcnassif February 19, 2025 19:57
@marcus6n marcus6n changed the title Implementation of ThumbcacheParser for Metadata and Thumbnail Extraction ThumbcacheParser for Metadata and Thumbnail Extraction Feb 19, 2025
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces the initial implementation of the ThumbcacheParser to extract metadata and images from thumbcache files.

  • Implements the ThumbcacheParser class that processes thumbcache file headers, entries, and embedded images.
  • Adds a corresponding unit test (ThumbcacheParserTest) that validates the parser’s output against expected metadata strings.

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.

File Description
iped-parsers/iped-parsers-impl/src/main/java/iped/parsers/misc/ThumbcacheParser.java Implements the parser for thumbcache files including header parsing, entry extraction, and MIME type identification.
iped-parsers/iped-parsers-impl/src/test/java/iped/parsers/misc/ThumbcacheParserTest.java Introduces unit tests that verify the output strings from the ThumbcacheParser match expected values.
Files not reviewed (1)
  • iped-app/resources/config/conf/ParserConfig.xml: Language not supported
Comments suppressed due to low confidence (3)

iped-parsers/iped-parsers-impl/src/main/java/iped/parsers/misc/ThumbcacheParser.java:140

  • Verify that the number of bytes read into identifierBytes matches the expected identifierStringSize to avoid incomplete data reading.
stream.read(identifierBytes);

iped-parsers/iped-parsers-impl/src/main/java/iped/parsers/misc/ThumbcacheParser.java:159

  • Check that stream.read(imageData) returns the full dataSize bytes to ensure complete image extraction.
stream.read(imageData);

iped-parsers/iped-parsers-impl/src/main/java/iped/parsers/misc/ThumbcacheParser.java:171

  • [nitpick] Consider extending the PNG signature check to validate more bytes (e.g., the full 8-byte PNG signature) for more accurate MIME type detection.
if (data.length >= 4) {

Comment on lines 77 to 78
stream.read(fileHeader.array());

Member

@lfcnassif lfcnassif Mar 26, 2025

This is not correct: read should be called in a loop, because it may return fewer bytes than requested even when EOF has not been reached. The readNBytes method is safer.
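
A minimal sketch of the suggestion, assuming Java 9 or later, where `InputStream.readNBytes(byte[], int, int)` is available. The class and method names are hypothetical, and throwing `EOFException` on a short read is one possible policy, not necessarily what the PR ended up doing.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of the PR.
final class ReadFullySketch {

    // Reads exactly len bytes or fails. readNBytes keeps reading until the
    // buffer is full or the stream ends, so a short count means end of stream.
    static byte[] readExactly(InputStream in, int len) throws IOException {
        byte[] buf = new byte[len];
        int read = in.readNBytes(buf, 0, len);
        if (read != len) {
            throw new EOFException("Expected " + len + " bytes but got " + read);
        }
        return buf;
    }
}
```

With a helper like this, a call such as `stream.read(fileHeader.array())` could become a single checked read, and a truncated file would surface as an exception instead of a silently half-filled buffer.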

@lfcnassif
Member

lfcnassif commented Mar 26, 2025

Did anyone enable the Copilot review here, or was it automatic? Anyway, most comments make sense, except the one about the PNG header; 4 bytes seems fine to me. @marcus6n, please replace all stream.read(...) calls with the safer readNBytes method and check the returned value.

@marcus6n
Contributor Author

@lfcnassif The Copilot review was accidental: when I went to request a review from you, Copilot was listed in first place. You can disregard it.

@marcus6n
Contributor Author

Ok, I'll make these changes and replace all stream.read(...) calls with the safer readNBytes method, as well as check the returned value.

@sepinf-inc sepinf-inc deleted a comment from Copilot AI Mar 26, 2025
@lfcnassif
Member

lfcnassif commented Mar 27, 2025

No problem, it was valid and pointed out things I have to point out every time; maybe we can make more use of it. Good to know Microsoft offers some free quota. I think it's the least they should do after using code from several GitHub projects to train Copilot without explicitly asking the authors. Some developers are leaving GitHub because of that.

@marcus6n
Contributor Author

I believe this access to Copilot reviews in pull requests comes from the GitHub Student Developer Pack, which I'm subscribed to through my college and which offers free tools and resources for students, including Copilot. I don't know whether you have access to it, but it would be worth checking, as it's a very good tool.

@marcus6n
Contributor Author

marcus6n commented Apr 7, 2025

@lfcnassif could you take a look at the requested change and check what’s still missing for the images to be properly displayed within IPED? This is the last pending item before we can wrap things up and publish the feature. Once this is working, we’re good to go. Let me know if there’s anything I can help with to move it forward!

@lfcnassif lfcnassif linked an issue Apr 8, 2025 that may be closed by this pull request
@lfcnassif
Member

@marcus6n, please take a careful look at my commits. 0274fb0 made thumbnail extraction work. ab192e8, 486ff59, f754add and 531fb0d are important fixes: we now extract thumbs from format versions that were not handled correctly before, and more thumbs from format versions that were already handled (thanks to the stream.skip(n) fix).
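
The thread does not spell out what the stream.skip(n) fix changed. One common pitfall, assumed here purely for illustration, is that `InputStream.skip` may skip fewer bytes than requested, so a single call can leave the stream misaligned for the next entry. A hedged sketch of a loop that skips exactly `n` bytes follows; the class and method names are hypothetical.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of the PR.
final class SkipFullySketch {

    // Skips exactly n bytes, or throws EOFException if the stream ends first.
    static void skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
                continue;
            }
            // skip() may return 0 without being at EOF, so read one byte to find out.
            if (in.read() == -1) {
                throw new EOFException("Stream ended with " + remaining + " bytes left to skip");
            }
            remaining--;
        }
    }
}
```

On Java 12 and later, `InputStream.skipNBytes(long)` offers similar behavior out of the box.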

@lfcnassif
Member

lfcnassif commented Apr 9, 2025

With commit 8dc6cac, the number of recovered thumbnails increased from ~70k to ~338k on the test corpus.

@marcus6n
Contributor Author

@lfcnassif Thank you for the contributions and detailed explanations. I’ll carefully review all the changes and improvements, especially the highlighted commits.

@lfcnassif
Member

lfcnassif commented Apr 10, 2025

I just ran a comparison between the number of thumbnails recovered by this implementation and by the carving module when run over thumbcache files (already enabled in the forensic and pedo profiles). I disabled the min/max file size restrictions of the carving module to make the comparison fair. This PR recovered ~338k thumbs, while the carving module recovered ~347k thumbs from the 2k thumbcache test corpus I collected, so this implementation is missing ~9k thumbnails. I locally implemented an exhaustive search for the CMMM cache entry signature when it is not found at the expected positions (trying to find deleted/unallocated entries), but the results were exactly the same as this PR's. I'm not sure what is missing here.
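
For illustration, an exhaustive signature search of the kind described above could look like the sketch below. It is not the local implementation mentioned in the comment: the class and method names are hypothetical, and it assumes the whole file fits in a byte array.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the PR.
final class SignatureScanSketch {

    private static final byte[] SIGNATURE = {'C', 'M', 'M', 'M'};

    // Returns every offset at which the 4-byte CMMM signature occurs.
    static List<Integer> findSignatureOffsets(byte[] data) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i + SIGNATURE.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < SIGNATURE.length; j++) {
                if (data[i + j] != SIGNATURE[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                offsets.add(i);
            }
        }
        return offsets;
    }
}
```

A caller would still need to validate each candidate offset, for example by checking that the declared entry size stays within the file, since the four signature bytes can also appear by chance inside image data.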

I expected this implementation to be better at recovering "fragmented" thumbnails from thumbcache files, but according to the libyal project documentation, thumbnail data is never fragmented; it is always stored sequentially. So I don't expect this PR to recover more files than the carving module.

However, the key point of this proposal, not implemented yet, is correlating the thumbnail identifier/hash with the Windows.edb database. That would give us the path of the original picture from which each thumbnail was generated, which is valuable information from a forensic perspective. @marcus6n, could you try to implement this correlation?

@marcus6n
Contributor Author

@lfcnassif Understood. I’ll try to implement the correlation between the thumbnail identifiers/hashes and the Windows.edb entries as suggested, and will do my best to fine-tune the implementation to match the expected behavior. I’ll keep you posted on any findings or issues during the process.

Development

Successfully merging this pull request may close these issues.

Parser for thumbcache files

4 participants