Fix/tessdata prefix path resolution #2251

Closed
DhanushVarma-2 wants to merge 5 commits into CCExtractor:master from DhanushVarma-2:fix/tessdata-prefix-path-resolution

DhanushVarma-2 (Contributor) commented Apr 2, 2026

In raising this pull request, I confirm the following (please check boxes):

Reason for this PR:

  • This PR adds new functionality.
  • This PR fixes a bug that I have personally experienced or that a real user has reported and for which a sample exists.
  • This PR is porting code from C to Rust.

Sanity check:

  • I have read and understood the contributors guide.
  • I have checked that another pull request for this purpose does not exist.
  • If the PR adds new functionality, I've added it to the changelog. If it's just a bug fix, I have NOT added it to the changelog.
  • I am NOT adding new C code unless it's to fix an existing, reproducible bug.

Repro instructions:

This is essential. We will not merge ANY PR that doesn't come with detailed instructions, including a sample. We don't want
"fixes" for theoretical issues that an AI agent found, without context. If you can't reproduce the bug, don't send a PR.

Creating PRs with AI is very quick, but we still have humans (even if AI assisted) going over each.

Be mindful of reviewers' time.


Root cause: Two bugs in init_ocr() in ocr.c:

  1. The Tesseract 4/5 branch always appended /tessdata to the path returned by probe_tessdata_location(). If TESSDATA_PREFIX was already set to a path ending in tessdata/, this caused a double append (e.g. /usr/share/tessdata/tessdata).
  2. The legacy Tesseract <4 branch passed tessdata_path unmodified to TessBaseAPIInit4 without appending tessdata at all, so Tesseract looked for eng.traineddata directly in e.g. /usr/share/ instead of /usr/share/tessdata/.

Fix: Normalize the path once, before both branches: detect whether the returned path already ends with tessdata or tessdata/ and append tessdata only when it does not, handling Windows backslash separators as well.
Tested on: macOS (Apple Silicon, Tesseract 5.5.1 via Homebrew). All 6 path cases verified, including TESSDATA_PREFIX pointing directly at the tessdata dir and Windows paths.
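
For reviewers, here is a minimal sketch of the normalization described under "Fix". It is not the code in this PR: normalize_tessdata_path() and is_sep() are hypothetical names, and the sketch assumes the match on "tessdata" is case-sensitive and that a trailing separator should be dropped from the result.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if `c` is a path separator on either POSIX or Windows. */
static int is_sep(char c)
{
	return c == '/' || c == '\\';
}

/*
 * Hypothetical helper (not the PR's actual code): writes into `out` the
 * directory expected to hold the *.traineddata files, given whatever base
 * path was discovered (for example from TESSDATA_PREFIX).
 * Returns 0 on success, -1 if the result does not fit in `out`.
 */
static int normalize_tessdata_path(const char *base, char *out, size_t out_size)
{
	static const char leaf[] = "tessdata";
	const size_t leaf_len = sizeof(leaf) - 1;
	size_t len = strlen(base);
	const char *sep = "";
	int n;

	/* Drop trailing separators so "x/tessdata/" and "x/tessdata" compare equal. */
	while (len > 1 && is_sep(base[len - 1]))
	{
		len--;
	}

	/* Last path component is already "tessdata": use the path as-is. */
	if (len >= leaf_len && strncmp(base + len - leaf_len, leaf, leaf_len) == 0 &&
	    (len == leaf_len || is_sep(base[len - leaf_len - 1])))
	{
		n = snprintf(out, out_size, "%.*s", (int)len, base);
	}
	else
	{
		/* Reuse the separator style the incoming path already uses. */
		if (len > 0 && !is_sep(base[len - 1]))
		{
			sep = (memchr(base, '\\', len) != NULL) ? "\\" : "/";
		}
		n = snprintf(out, out_size, "%.*s%s%s", (int)len, base, sep, leaf);
	}

	return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

Checking for a separator before the matched suffix keeps a directory such as /opt/mytessdata from being mistaken for the tessdata directory itself. Both Tesseract branches could then pass the same normalized buffer instead of building the path separately.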

Dhanush Varma added 5 commits April 2, 2026 04:14
The matroska_track_text_subtitle_id_extensions array had 7 entries for
an 8-value enum, leaving MATROSKA_TRACK_SUBTITLE_CODEC_ID_KATE (index 7)
out of bounds. On most platforms this read NULL, which then caused
strlen(NULL) UB and snprintf to emit .(null) in the output filename.

Two fixes:
- Add "kate" at index 7 in the extensions array so KATE tracks
  produce correct .kate output filenames
- Add a NULL guard in generate_filename_from_track() so any future
  unknown codec ID safely falls back to .bin instead of crashing or
  producing .(null)

Fixes CCExtractor#972
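
A minimal sketch of the guard this commit message describes, for illustration only. The enum, table, and function names below are hypothetical stand-ins rather than the identifiers in matroska.c/h, and the filename pattern is an assumption. The sketch also adds a bounds check, which goes beyond the NULL guard mentioned above.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the real subtitle codec enum. */
enum codec_id
{
	CODEC_SRT = 0,
	CODEC_SSA,
	CODEC_RESERVED, /* deliberately has no extension entry */
	CODEC_KATE
};

/* Designated initializers keep each extension tied to its enum value;
 * any index without an initializer is NULL. */
static const char *const codec_extensions[] = {
	[CODEC_SRT] = "srt",
	[CODEC_SSA] = "ssa",
	[CODEC_KATE] = "kate",
};

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Falls back to "bin" for out-of-range IDs and for NULL table entries. */
static const char *extension_for_codec(int codec_id)
{
	if (codec_id < 0 || (size_t)codec_id >= ARRAY_LEN(codec_extensions) ||
	    codec_extensions[codec_id] == NULL)
	{
		return "bin";
	}
	return codec_extensions[codec_id];
}

/* Builds "<basename>_<track>.<ext>"; returns -1 if `out` is too small. */
static int build_track_filename(char *out, size_t out_size,
				const char *basename, int track, int codec_id)
{
	int n = snprintf(out, out_size, "%s_%d.%s", basename, track,
			 extension_for_codec(codec_id));
	return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

With the lookup behind one function, the string passed to snprintf is never NULL, so an unknown codec ID yields a .bin filename rather than .(null).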
- Both Tesseract 4/5 and legacy (<4) branches now use a consistently
  built tess_path instead of raw tessdata_path or manual concatenation
- Handles the case where TESSDATA_PREFIX already points at the tessdata
  dir itself (avoids double-appending 'tessdata')
- Handles Windows paths ending with backslash correctly
- Adds mprint diagnostic showing the resolved tessdata path

Fixes CCExtractor#1492

ccextractor-bot (Collaborator) commented

CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit d56a6be...:
Report Name    Tests Passed
Broken         9/13
CEA-708        1/14
DVB            3/7
DVD            3/3
DVR-MS         2/2
General        20/27
Hardsubx       1/1
Hauppage       3/3
MP4            3/3
NoCC           10/10
Options        77/86
Teletext       20/21
WTV            13/13
XDS            31/34

Your PR breaks these cases:

  • ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 8e8229b88b...
  • ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2...
  • ccextractor --autoprogram --out=ttxt --latin1 132d7df7e9...
  • ccextractor --autoprogram --out=ttxt --latin1 99e5eaafdc...
  • ccextractor --autoprogram --out=srt --latin1 b22260d065...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla 7aad20907e...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65...
  • ccextractor --autoprogram --out=ttxt --latin1 01509e4d27...
  • ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b...
  • ccextractor --out=spupng c83f765c66...
  • ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --autoprogram --out=ttxt --xds --latin1 --ucla 85058ad37e...
  • ccextractor --autoprogram --out=srt --latin1 --ucla b22260d065...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 7f41299cc7...

NOTE: The following tests have been failing on the master branch as well as the PR:

Congratulations: Merging this PR would fix the following tests:

  • ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

ccextractor-bot (Collaborator) commented

CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit d56a6be...:
Report Name    Tests Passed
Broken         9/13
CEA-708        1/14
DVB            4/7
DVD            3/3
DVR-MS         2/2
General        22/27
Hardsubx       1/1
Hauppage       3/3
MP4            3/3
NoCC           10/10
Options        81/86
Teletext       20/21
WTV            13/13
XDS            31/34

Your PR breaks these cases:

  • ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 8e8229b88b...
  • ccextractor --autoprogram --out=ttxt --latin1 132d7df7e9...
  • ccextractor --autoprogram --out=ttxt --latin1 99e5eaafdc...
  • ccextractor --autoprogram --out=srt --latin1 b22260d065...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla 7aad20907e...
  • ccextractor --autoprogram --out=ttxt --latin1 01509e4d27...
  • ccextractor --autoprogram --out=ttxt --xds --latin1 --ucla 85058ad37e...
  • ccextractor --autoprogram --out=srt --latin1 --ucla b22260d065...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 7f41299cc7...

NOTE: The following tests have been failing on the master branch as well as the PR:

Congratulations: Merging this PR would fix the following tests:

  • ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65..., Last passed: Never
  • ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b..., Last passed: Never
  • ccextractor --out=spupng c83f765c66..., Last passed: Never
  • ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

cfsmp3 (Contributor) commented Apr 4, 2026

Closing — a few issues:

  1. The matroska.c/h changes were already merged via #2250 ("fix: MKV subtitle track .(null) extension for KATE and unknown codec IDs"), so these will now conflict.
  2. No PR description or repro: the checkboxes are unchecked and there's no explanation of what system/configuration triggers the double tessdata path. On what platform does TESSDATA_PREFIX point directly to the tessdata/ dir instead of its parent? We need a concrete repro.
  3. Missing braces on if blocks per our code style.

The OCR path fix itself looks reasonable — please resubmit as a new PR with:

  • Just the ocr.c changes (drop the matroska changes)
  • A description explaining the scenario and how to reproduce
  • Braces on all if/else blocks
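
As an illustration of the brace requirement, a small sketch follows. pick_separator() is a hypothetical helper, not code from ocr.c, and placing the opening brace on its own line is an assumption; the point is only that every branch has braces, even around a single statement.

```c
#include <string.h>

/* Hypothetical helper, used only to show the brace style. */
static const char *pick_separator(const char *path)
{
	/* A bare `if (...) return "\\";` without braces is what item 3 flags. */
	if (strchr(path, '\\') != NULL)
	{
		return "\\";
	}
	else
	{
		return "/";
	}
}
```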

Thanks.

cfsmp3 closed this Apr 4, 2026