ThumbcacheParser for Metadata and Thumbnail Extraction #2349

marcus6n · 2024-10-23T20:55:03Z

This Pull Request introduces the initial implementation of the ThumbcacheParser class, designed to process thumbcache files for metadata extraction and image conversion.

- Implemented `ThumbcacheParser` class to parse thumbcache files.
- Added `parseThumbcacheFile` method to read and parse thumbcache file entries.
- Utilized `ByteBuffer` for reading binary data with little-endian byte order.
- Added detailed logging of thumbcache entry attributes.
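A minimal sketch of the little-endian `ByteBuffer` reading and UTF-16LE identifier decoding that these commit notes describe. The record layout used here (a 4-byte signature, a 4-byte length, then the identifier) is a simplified assumption for illustration and is not the actual thumbcache entry structure; the class and method names are hypothetical as well.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified record (NOT the real thumbcache entry layout):
//   bytes 0-3: ASCII signature
//   bytes 4-7: identifier length in bytes, little-endian
//   rest:      identifier encoded as UTF-16LE
class LittleEndianRecordDemo {

    // Reads the 4-byte ASCII signature at the buffer's current position.
    static String readSignature(ByteBuffer buf) {
        byte[] sig = new byte[4];
        buf.get(sig);
        return new String(sig, StandardCharsets.US_ASCII);
    }

    // Reads a length-prefixed UTF-16LE string; getInt() honours the buffer's byte order.
    static String readIdentifier(ByteBuffer buf) {
        int sizeInBytes = buf.getInt();
        if (sizeInBytes < 0 || sizeInBytes > buf.remaining()) {
            throw new IllegalArgumentException("Invalid identifier size: " + sizeInBytes);
        }
        byte[] raw = new byte[sizeInBytes];
        buf.get(raw);
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    // Builds a sample record so the sketch can run without a real file.
    static byte[] sampleRecord() {
        byte[] id = "abc".getBytes(StandardCharsets.UTF_16LE); // 6 bytes
        ByteBuffer out = ByteBuffer.allocate(8 + id.length).order(ByteOrder.LITTLE_ENDIAN);
        out.put("CMMM".getBytes(StandardCharsets.US_ASCII));
        out.putInt(id.length);
        out.put(id);
        return out.array();
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap(sampleRecord()).order(ByteOrder.LITTLE_ENDIAN);
        System.out.println(readSignature(buf));
        System.out.println(readIdentifier(buf));
    }
}
```

Setting the byte order once on the buffer keeps every later `getInt()` or `getLong()` call consistent, which is easier to audit than swapping bytes by hand at each field.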

lfcnassif · 2024-10-23T22:22:35Z

Thanks @marcus6n. Could you take a look and help him @hauck-jvsh?

- Implemented functionality to detect and save images extracted from the .thumbcache file.
- Added support for detecting image file extensions (BMP, JPG, PNG) based on the first bytes of the image data.
- Images are saved in the 'output' directory with names based on their hash.
- Improved parsing of the thumbcache file and added error handling for image saving process.

hauck-jvsh · 2024-11-12T15:17:12Z

Ok, I will take a look today.

…zation
- Modify parse method to return and extract image data
- Add getLastSavedFileName method to retrieve last saved thumbnail
- Enhance metadata handling for embedded image visualization
- Ensure compatibility with Tika's embedded document extraction

Copilot

Pull Request Overview

This PR introduces the initial implementation of the ThumbcacheParser to extract metadata and images from thumbcache files.

Implements the ThumbcacheParser class that processes thumbcache file headers, entries, and embedded images.
Adds a corresponding unit test (ThumbcacheParserTest) that validates the parser’s output against expected metadata strings.

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| iped-parsers/iped-parsers-impl/src/main/java/iped/parsers/misc/ThumbcacheParser.java | Implements the parser for thumbcache files including header parsing, entry extraction, and MIME type identification. |
| iped-parsers/iped-parsers-impl/src/test/java/iped/parsers/misc/ThumbcacheParserTest.java | Introduces unit tests that verify the output strings from the ThumbcacheParser match expected values. |

Files not reviewed (1)

iped-app/resources/config/conf/ParserConfig.xml: Language not supported

Comments suppressed due to low confidence (3)

iped-parsers/iped-parsers-impl/src/main/java/iped/parsers/misc/ThumbcacheParser.java:140

Verify that the number of bytes read into identifierBytes matches the expected identifierStringSize to avoid incomplete data reading.

stream.read(identifierBytes);

iped-parsers/iped-parsers-impl/src/main/java/iped/parsers/misc/ThumbcacheParser.java:159

Check that stream.read(imageData) returns the full dataSize bytes to ensure complete image extraction.

stream.read(imageData);

iped-parsers/iped-parsers-impl/src/main/java/iped/parsers/misc/ThumbcacheParser.java:171

[nitpick] Consider extending the PNG signature check to validate more bytes (e.g., the full 8-byte PNG signature) for more accurate MIME type detection.

if (data.length >= 4) {
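A minimal sketch of the kind of signature-based type detection discussed in these review comments, assuming the well-known leading bytes of PNG, JPEG and BMP files. It is not the code under review: the class name, method names and the fallback MIME type are illustrative assumptions. It checks the first 4 bytes for PNG, matching the length tested in the snippet above.

```java
// Hypothetical helper, not the parser's actual implementation.
class ImageSignatureDemo {

    // Returns a MIME type guessed from the leading bytes, or a generic fallback.
    static String detectMimeType(byte[] data) {
        if (data == null) {
            return "application/octet-stream";
        }
        if (startsWith(data, 0x89, 0x50, 0x4E, 0x47)) { // 0x89 'P' 'N' 'G'
            return "image/png";
        }
        if (startsWith(data, 0xFF, 0xD8, 0xFF)) { // JPEG start-of-image marker
            return "image/jpeg";
        }
        if (startsWith(data, 0x42, 0x4D)) { // 'B' 'M'
            return "image/bmp";
        }
        return "application/octet-stream";
    }

    // Compares unsigned byte values so that 0x89 and 0xFF match correctly.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] png = {(byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};
        System.out.println(detectMimeType(png));
    }
}
```

Masking each byte with `0xFF` matters because Java bytes are signed; comparing `data[0]` directly against `0x89` would never match.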

lfcnassif · 2025-03-26T21:47:53Z

iped-parsers/iped-parsers-impl/src/main/java/iped/parsers/misc/ThumbcacheParser.java

+        stream.read(fileHeader.array());
+


This is not correct: read should be called in a loop, because it may return fewer bytes than requested even when EOF has not been reached. The readNBytes method is safer.

lfcnassif · 2025-03-26T21:35:52Z

Did anyone enable the Copilot review here, or was it automatic? Anyway, most comments make sense, except the one about the PNG header; 4 bytes seem fine to me. @marcus6n, please replace all stream.read(...) calls with the safer readNBytes method and check the returned value.
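A minimal sketch of the requested pattern, assuming Java 9 or later for `InputStream.readNBytes(byte[], int, int)`. The helper `readExactly` and the test stream `shortReads` are hypothetical names introduced only for illustration. The test stream deliberately returns one byte per `read` call, to show why a single `read` is not enough.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical demo class, not part of the pull request.
class ReadExactlyDemo {

    // Reads exactly 'length' bytes or fails; readNBytes loops internally until
    // the requested count is reached or the stream ends.
    static byte[] readExactly(InputStream in, int length) throws IOException {
        byte[] buf = new byte[length];
        int read = in.readNBytes(buf, 0, length);
        if (read != length) {
            throw new EOFException("Expected " + length + " bytes but got " + read);
        }
        return buf;
    }

    // A stream that hands out at most one byte per bulk read, which the
    // InputStream contract allows.
    static InputStream shortReads(byte[] data) {
        return new InputStream() {
            private int pos = 0;

            @Override
            public int read() {
                return pos < data.length ? data[pos++] & 0xFF : -1;
            }

            @Override
            public int read(byte[] b, int off, int len) {
                if (len == 0) {
                    return 0;
                }
                int c = read();
                if (c < 0) {
                    return -1;
                }
                b[off] = (byte) c;
                return 1; // deliberately short
            }
        };
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {1, 2, 3, 4};
        // A single read() call fills only part of the buffer here.
        System.out.println(shortReads(data).read(new byte[4]));
        // readNBytes keeps reading until all four bytes have arrived.
        System.out.println(Arrays.toString(readExactly(shortReads(data), 4)));
    }
}
```

Throwing on a short count, rather than silently continuing, means a truncated file surfaces as an error instead of producing a partially filled buffer.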

marcus6n · 2025-03-26T23:26:49Z

@lfcnassif The Copilot review was accidental: when I went to request a review from your username, Copilot was listed first. You can disregard it.

marcus6n · 2025-03-26T23:29:10Z

Ok, I'll make these changes and replace all stream.read(...) calls with the safer readNBytes method, as well as check the returned value.

lfcnassif · 2025-03-27T00:34:46Z

> @lfcnassif The Copilot review was accidental: when I went to request a review from your username, Copilot was listed first. You can disregard it.

No problem, it was valid and pointed out things I have to point out every time; maybe we can make more use of it. Good to know Microsoft offered some free quota. I think it's the least they should do after using code from several GitHub projects to train Copilot without explicitly asking the authors. Some developers are moving away from GitHub because of that.

marcus6n · 2025-03-27T02:09:27Z

> > @lfcnassif The Copilot review was accidental: when I went to request a review from your username, Copilot was listed first. You can disregard it.
>
> No problem, it was valid and pointed out things I have to point out every time; maybe we can make more use of it. Good to know Microsoft offered some free quota. I think it's the least they should do after using code from several GitHub projects to train Copilot without explicitly asking the authors. Some developers are moving away from GitHub because of that.

I believe this access to Copilot reviews in pull requests comes from the GitHub Student Developer Pack that I'm subscribed to through my college, which offers free tools and resources for students, including Copilot. I don't know if you have access to it, but it would be worth checking out, as it's a very good tool.

marcus6n · 2025-04-07T18:45:50Z

@lfcnassif could you take a look at the requested change and check what’s still missing for the images to be properly displayed within IPED? This is the last pending item before we can wrap things up and publish the feature. Once this is working, we’re good to go. Let me know if there’s anything I can help with to move it forward!

lfcnassif · 2025-04-09T01:36:11Z

@marcus6n, please take a careful look at my commits. 0274fb0 made thumbnail extraction work. ab192e8, 486ff59, f754add and 531fb0d are important fixes: we now extract thumbs from format versions that were not handled correctly before, and more thumbs from format versions that were already handled (thanks to the stream.skip(n) fix).

lfcnassif · 2025-04-09T20:20:08Z

With commit 8dc6cac, the number of recovered thumbnails increased from ~70k to ~338k on the test corpus.
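A minimal sketch of why consuming the remainder of each cache entry matters: if the parser stops partway through an entry, the next read does not start at an entry boundary and later entries are missed. The layout below (a "CMMM" signature followed by a little-endian total entry size) is a simplified assumption, not the real thumbcache structure, and the class is hypothetical; it is not the code from commit 8dc6cac. It assumes Java 11 or later for `readNBytes(int)`.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Simplified, hypothetical entry layout (NOT the real thumbcache structure):
//   bytes 0-3: ASCII signature "CMMM"
//   bytes 4-7: total entry size in bytes, little-endian, header included
//   rest:      entry fields and thumbnail data, ignored in this sketch
class EntryWalkDemo {
    static final byte[] SIGNATURE = "CMMM".getBytes(StandardCharsets.US_ASCII);
    static final int HEADER_SIZE = 8;

    // Walks the stream entry by entry and returns how many entries were found.
    static int countEntries(InputStream in) throws IOException {
        int count = 0;
        byte[] header = new byte[HEADER_SIZE];
        while (in.readNBytes(header, 0, HEADER_SIZE) == HEADER_SIZE) {
            ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
            byte[] signature = new byte[SIGNATURE.length];
            buf.get(signature);
            if (!Arrays.equals(signature, SIGNATURE)) {
                break; // not positioned at an entry boundary
            }
            int entrySize = buf.getInt();
            if (entrySize < HEADER_SIZE) {
                throw new IOException("Invalid entry size: " + entrySize);
            }
            // Consume the rest of the entry so the next read starts at the next one.
            int remainder = entrySize - HEADER_SIZE;
            if (in.readNBytes(remainder).length != remainder) {
                throw new EOFException("Truncated entry");
            }
            count++;
        }
        return count;
    }

    // Builds one sample entry with a zero-filled payload.
    static byte[] entry(int payloadLength) {
        ByteBuffer out = ByteBuffer.allocate(HEADER_SIZE + payloadLength).order(ByteOrder.LITTLE_ENDIAN);
        out.put(SIGNATURE).putInt(HEADER_SIZE + payloadLength);
        return out.array();
    }

    // Concatenates two sample entries of different sizes.
    static byte[] twoEntries() {
        byte[] a = entry(4);
        byte[] b = entry(8);
        byte[] all = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, all, a.length, b.length);
        return all;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countEntries(new ByteArrayInputStream(twoEntries())));
    }
}
```

For large entries a real parser would likely discard the remainder through a small reusable buffer instead of allocating it in one call; the single `readNBytes(int)` call is used here only to keep the sketch short.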

marcus6n · 2025-04-10T18:05:51Z

> @marcus6n, please take a careful look at my commits. 0274fb0 made thumbnail extraction work. ab192e8, 486ff59, f754add and 531fb0d are important fixes: we now extract thumbs from format versions that were not handled correctly before, and more thumbs from format versions that were already handled (thanks to the stream.skip(n) fix).

@lfcnassif Thank you for the contributions and detailed explanations. I’ll carefully review all the changes and improvements, especially the highlighted commits.

lfcnassif · 2025-04-10T21:55:34Z

I just ran a comparison of the number of thumbnails recovered by this implementation and by the carving module when run over thumbcache files (already enabled on the forensic and pedo profiles). I disabled the min/max file size restrictions of the carving module to make the comparison fair. This PR recovered ~338k thumbs, while the carving module recovered ~347k thumbs from the 2k thumbcache test corpus I collected. So this implementation is missing ~9k thumbnails. I locally implemented an exhaustive search for the CMMM cache entry signature when it is not found at the expected positions (trying to find deleted/unallocated entries), but the results were exactly the same as with this PR. I'm not sure what is missing here.

I expected this implementation to recover "fragmented" thumbnails from thumbcache files better, but according to the libyal project documentation, thumbnail data is never fragmented; it is always sequential. So I don't expect this PR to recover more files than the carving module.

However, the key point of this proposal, not implemented yet, is correlating the thumbnail identifiers/hashes with the Windows.edb database. That would give us the path of the original picture from which each thumbnail was generated, which is valuable information from a forensic perspective. @marcus6n, could you try to implement this correlation?
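A minimal sketch of the kind of exhaustive search for the CMMM signature mentioned in the comment above, written as a naive byte-by-byte scan over an in-memory buffer. It is not the local experiment described there: the class and method names are hypothetical, and a real implementation would still need to validate each hit (for example by checking the entry size that follows) before treating it as a cache entry.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the pull request.
class SignatureScanDemo {
    static final byte[] SIGNATURE = "CMMM".getBytes(StandardCharsets.US_ASCII);

    // Returns every offset at which the signature occurs in 'data'.
    static List<Integer> findSignatureOffsets(byte[] data) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i + SIGNATURE.length <= data.length; i++) {
            if (matchesAt(data, i)) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    // Checks whether the signature starts at the given offset.
    private static boolean matchesAt(byte[] data, int offset) {
        for (int j = 0; j < SIGNATURE.length; j++) {
            if (data[offset + j] != SIGNATURE[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] data = "xxCMMMyyyyCMMM".getBytes(StandardCharsets.US_ASCII);
        System.out.println(findSignatureOffsets(data)); // prints [2, 10]
    }
}
```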

marcus6n · 2025-04-11T01:23:40Z

@lfcnassif Understood. I’ll try to implement the correlation between the thumbnail identifiers/hashes and the Windows.edb entries as suggested, and will do my best to fine-tune the implementation to match the expected behavior. I’ll keep you posted on any findings or issues during the process.

marcus6n added 18 commits October 16, 2024 15:32

Add ThumbcacheParser class

7cebf82

Refactor ThumbcacheParser to implement getSupportedTypes and parse me…

5493e64

…thods
- Added serialVersionUID for serialization compatibility.
- Implemented getSupportedTypes to return supported media types.
- Implemented parse method to extract embedded documents.

Update ThumbcacheParser to initialize EmbeddedDocumentExtractor and X…

f442a98

…HTMLContentHandler

Initialize TemporaryResources, TikaInputStream, and POIFSFileSystem.

5486bfd

Reserve space for recursive method

4f72dbb

Change serialVersionUID visibility from public to private in Thumbcac…

29d7677

…heParser

Implemented recurseDir method to process directories and extract meta…

7832863

…data and images

Added ThumbcacheParser entry to ParserConfig.xml

9dc7baa

Fixed metadata type in ThumbcacheParser.

f8c575e

Implemented ThumbcacheParser to process directories and extract metad…

cc3fdf5

…ata and images

Added image conversion to Base64 in ThumbcacheParser

d581755

Add parsing of specific fields in ThumbcacheParser

cafee08

Merge branch 'feature/thumbcache-parser' of https://github.com/sepinf…

a048449

…-inc/IPED into feature/thumbcache-parser # Conflicts: # iped-parsers/iped-parsers-impl/src/main/java/iped/parsers/misc/ThumbcacheParser.java

Implement directory recursion and metadata extraction in ThumbcachePa…

d97fe51

…rser

Refactor processDocumentNode method to handle DocumentNode entries an…

e4dea6c

…d extract metadata

feat: Add thumbcache file parsing logic

11252d6

- Implemented `parseThumbcacheFile` method to read and parse thumbcache file entries.
- Added detailed logging of thumbcache entry attributes.
- Utilized `ByteBuffer` for reading binary data with little-endian byte order.

refactor: Remove repetitive logging and adjust encoding

e894649

- Removed repetitive logging of thumbcache entry attributes.
- Adjusted encoding to use `StandardCharsets.UTF_16LE` for identifier strings.

lfcnassif marked this pull request as draft October 23, 2024 22:21

marcus6n added 2 commits November 11, 2024 16:15

Create an 'images' folder and save processed images inside it

32dfbdc

marcus6n requested a review from lfcnassif November 11, 2024 20:43

marcus6n added 2 commits November 21, 2024 15:36

feat(ThumbcacheParser): Fix resource name metadata and improve image …

6f4465e

…extraction
- Import TikaCoreProperties for RESOURCE_NAME_KEY
- Update metadata setting to use correct resource name key
- Maintain multiple image extraction logic
- Improve embedded document handling

marcus6n removed the request for review from lfcnassif February 19, 2025 19:57

marcus6n changed the title ~~Implementation of ThumbcacheParser for Metadata and Thumbnail Extraction~~ ThumbcacheParser for Metadata and Thumbnail Extraction Feb 19, 2025

marcus6n added 2 commits March 11, 2025 17:44

Add unit test for ThumbcacheParser

b78362d

Add test file test_Thumbcache.db

252a54b

Copilot AI reviewed Mar 26, 2025

lfcnassif reviewed Mar 26, 2025

sepinf-inc deleted a comment from Copilot AI Mar 26, 2025

Replace stream.read(...) calls with readNBytes method and check the r…

b456118

…eturned value

lfcnassif added 2 commits April 8, 2025 19:51

Merge branch 'master' into pr-2349

cc1d5f5

'#968: parsers should be stateless to be thread safe by documentation

ab192e8

lfcnassif linked an issue Apr 8, 2025 that may be closed by this pull request

Parser for thumbcache files #968

Open

lfcnassif added 8 commits April 8, 2025 20:32

'#968: add signature detection for thumbcache files

2b93946

'#968: add a category for thumbcache files and expand them by default

0274fb0

'#968: remove unneeded image mime/ext detection, it will be done later

130dfe4

'#968: avoid creation of unneeded temp file

92639e1

'#968: don't close the stream, it should be closed by the client code

486ff59

'#968: fix logic to work correctly for format versions 20 & 21

f754add

'#968: use safer readNBytes instead of skip, minor reorganization

531fb0d

'#968: less memory usage: extract thumbs early, dont keep them in a List

7ebdbe6

lfcnassif added 2 commits April 9, 2025 14:21

'#968: propagate all exceptions so the file will be tagged as corrupted.

f4949a3

'#968: skip the remainder of the cache entry to find the next one

8dc6cac

lfcnassif added 2 commits April 9, 2025 17:22

'#968: fix test compilation issue after parser api change

0f6d5fc

'#968: fix failing test because of WriteLimitException being thrown

06c1721
