Fix/tessdata prefix path resolution by DhanushVarma-2 · Pull Request #2251 · CCExtractor/ccextractor

DhanushVarma-2 · 2026-04-02T10:37:26Z

In raising this pull request, I confirm the following (please check boxes):

Reason for this PR:

This PR adds new functionality.
This PR fixes a bug that I have personally experienced or that a real user has reported and for which a sample exists.
This PR is porting code from C to Rust.

Sanity check:

I have read and understood the contributors guide.
I have checked that another pull request for this purpose does not exist.
If the PR adds new functionality, I've added it to the changelog. If it's just a bug fix, I have NOT added it to the changelog.
I am NOT adding new C code unless it's to fix an existing, reproducible bug.

Repro instructions:

This is essential. We will not merge ANY PR that doesn't come with detailed instructions, including a sample. We don't want
"fixes" for theoretical issues that an AI agent found, without context. If you can't reproduce the bug, don't send a PR.

Creating PRs with AI is very quick, but we still have humans (even if AI assisted) going over each.

Be mindful of reviewers' time.

Root cause: Two bugs in init_ocr() in ocr.c:

The Tesseract 4/5 branch always blindly appended /tessdata to the path returned by probe_tessdata_location(). If TESSDATA_PREFIX was already set to a path ending in tessdata/, this caused a double-append (e.g. /usr/share/tessdata/tessdata).
The legacy Tesseract <4 branch passed tessdata_path raw to TessBaseAPIInit4 without appending tessdata at all — causing Tesseract to look for eng.traineddata directly in e.g. /usr/share/ instead of /usr/share/tessdata/.

Fix: Normalize the path once before both branches — detect whether the returned path already ends with tessdata or tessdata/, and handle Windows backslash separators correctly.
Tested on: macOS (Apple Silicon, Tesseract 5.5.1 via Homebrew). All 6 path cases verified correct including TESSDATA_PREFIX pointing directly at tessdata dir and Windows paths.
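
For reviewers, the normalization described above could look roughly like the sketch below. This is not the code from the PR: `build_tess_path` and its signature are hypothetical names chosen for illustration, and the choice to reuse a backslash separator only when the input contains no forward slash is an assumption. It strips trailing separators, checks whether the last path component is exactly `tessdata`, and appends it only when missing.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not taken from ocr.c): writes into `out` a path whose
 * last component is exactly one "tessdata", with no trailing separator.
 * Returns 0 on success, -1 on empty input or if `out` is too small. */
static int build_tess_path(const char *base, char *out, size_t out_size)
{
	const char *suffix = "tessdata";
	size_t slen = strlen(suffix);
	size_t len;
	int ends_with = 0;
	int n;

	if (base == NULL || base[0] == '\0')
	{
		return -1;
	}

	/* Ignore trailing '/' or '\\' so "x/tessdata/" and "x/tessdata" match. */
	len = strlen(base);
	while (len > 0 && (base[len - 1] == '/' || base[len - 1] == '\\'))
	{
		len--;
	}

	/* The last component must be exactly "tessdata", not e.g. "mytessdata". */
	if (len >= slen && strncmp(base + len - slen, suffix, slen) == 0)
	{
		if (len == slen || base[len - slen - 1] == '/' || base[len - slen - 1] == '\\')
		{
			ends_with = 1;
		}
	}

	if (ends_with)
	{
		n = snprintf(out, out_size, "%.*s", (int)len, base);
	}
	else
	{
		/* Assumption: keep backslashes only for paths that use no '/'. */
		char sep = (strchr(base, '\\') != NULL && strchr(base, '/') == NULL) ? '\\' : '/';
		n = snprintf(out, out_size, "%.*s%c%s", (int)len, base, sep, suffix);
	}

	return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

Under these assumptions, both `/usr/share` and `/usr/share/tessdata/` resolve to `/usr/share/tessdata`, so the Tesseract 4/5 and legacy branches could share one result instead of each building its own path.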

The matroska_track_text_subtitle_id_extensions array had 7 entries for an 8-value enum, leaving MATROSKA_TRACK_SUBTITLE_CODEC_ID_KATE (index 7) out of bounds. On most platforms this read NULL, which then caused strlen(NULL) UB and snprintf to emit .(null) in the output filename.

Two fixes:
- Add "kate" at index 7 in the extensions array so KATE tracks produce correct .kate output filenames
- Add a NULL guard in generate_filename_from_track() so any future unknown codec ID safely falls back to .bin instead of crashing or producing .(null)

Fixes CCExtractor#972
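
The pattern in this commit message can be illustrated with a small standalone sketch. Everything prefixed `sketch_` or `SKETCH_` is a hypothetical stand-in, not the real matroska.c code: the actual enum has 8 values with KATE at index 7, while this shortened version uses 3 entries to stay brief. It pairs a compile-time size check between the enum and the table with a runtime fallback to `bin`.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical, shortened codec enum; only the KATE name mirrors the PR. */
enum sketch_codec_id
{
	SKETCH_CODEC_A = 0,
	SKETCH_CODEC_B,
	SKETCH_CODEC_KATE,
	SKETCH_CODEC_COUNT
};

/* One extension per enum value, in enum order. */
static const char *const sketch_extensions[] = {"a", "b", "kate"};

/* Fails the build if an enum value is added without a table entry (C11). */
_Static_assert(sizeof(sketch_extensions) / sizeof(sketch_extensions[0]) == SKETCH_CODEC_COUNT,
	       "extension table must have one entry per codec id");

/* Returns the extension for codec_id, or "bin" for unknown or missing ids. */
static const char *sketch_extension_for(int codec_id)
{
	if (codec_id < 0 || codec_id >= SKETCH_CODEC_COUNT || sketch_extensions[codec_id] == NULL)
	{
		return "bin";
	}
	return sketch_extensions[codec_id];
}

/* Builds "<basename>.<ext>"; returns 0 on success, -1 if `out` is too small. */
static int sketch_make_filename(char *out, size_t out_size, const char *basename, int codec_id)
{
	int n = snprintf(out, out_size, "%s.%s", basename, sketch_extension_for(codec_id));
	return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

The static assertion would catch the original mismatch at compile time, while the runtime guard covers codec IDs that arrive from outside the enum's range.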

- Both Tesseract 4/5 and legacy (<4) branches now use a consistently built tess_path instead of raw tessdata_path or manual concatenation
- Handles the case where TESSDATA_PREFIX already points at the tessdata dir itself (avoids double-appending 'tessdata')
- Handles Windows paths ending with backslash correctly
- Adds mprint diagnostic showing the resolved tessdata path

Fixes CCExtractor#1492

ccextractor-bot · 2026-04-02T11:29:40Z

CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit d56a6be...:

Report Name	Tests Passed
Broken	9/13
CEA-708	1/14
DVB	3/7
DVD	3/3
DVR-MS	2/2
General	20/27
Hardsubx	1/1
Hauppage	3/3
MP4	3/3
NoCC	10/10
Options	77/86
Teletext	20/21
WTV	13/13
XDS	31/34

Your PR breaks these cases:

ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 8e8229b88b...
ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2...
ccextractor --autoprogram --out=ttxt --latin1 132d7df7e9...
ccextractor --autoprogram --out=ttxt --latin1 99e5eaafdc...
ccextractor --autoprogram --out=srt --latin1 b22260d065...
ccextractor --autoprogram --out=ttxt --latin1 --ucla 7aad20907e...
ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65...
ccextractor --autoprogram --out=ttxt --latin1 01509e4d27...
ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b...
ccextractor --out=spupng c83f765c66...
ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
ccextractor --autoprogram --out=ttxt --xds --latin1 --ucla 85058ad37e...
ccextractor --autoprogram --out=srt --latin1 --ucla b22260d065...
ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 7f41299cc7...

NOTE: The following tests have been failing on the master branch as well as the PR:

ccextractor --out=srt --latin1 --autoprogram 73d9313d64..., Last passed: Test 8738
ccextractor --out=ttxt --latin1 001dd8cdf7..., Last passed: Test 8738
ccextractor --out=srt --latin1 4d4e938ef6..., Last passed: Test 8738
ccextractor --service 1 --out=txt --no-bom --no-rollup ea83ff7bcb..., Last passed: Test 8738
ccextractor --service 1 --out=txt f17524b53f..., Last passed: Test 8738
ccextractor --service 1 --out=txt 80848c45f8..., Last passed: Test 8738
ccextractor --service 1 --out=txt --no-bom --no-rollup b5d6aad89f..., Last passed: Test 8738
ccextractor --service 1[EUC-KR] --out=txt --no-rollup b5d6aad89f..., Last passed: Test 8738
ccextractor --service 1 --out=srt da904de35d..., Last passed: Test 8738
ccextractor --service 1 --out=sami da904de35d..., Last passed: Test 8738
ccextractor --service 1 --out=ttxt da904de35d..., Last passed: Test 8926
ccextractor --service 1[EUC-KR] b5d6aad89f..., Last passed: Test 8738
ccextractor --service 1[EUC-KR] --no-rollup b5d6aad89f..., Last passed: Test 8738
ccextractor --service all da904de35d..., Last passed: Test 8738
ccextractor --service all[EUC-KR] b5d6aad89f..., Last passed: Test 8738
ccextractor --service 1,2[UTF-8],3[EUC-KR],54 --out=txt da904de35d..., Last passed: Test 8738
ccextractor --autoprogram --out=srt --latin1 d41b53b504..., Last passed: Test 8738
ccextractor --stdout --quiet --no-fontcolor 79a51f3500..., Last passed: Test 8738
ccextractor --stdout --quiet --no-fontcolor 767b546f96..., Last passed: Test 8738
ccextractor --service 1 c83f765c66..., Last passed: Test 8738
ccextractor --myth c83f765c66..., Last passed: Test 8738
ccextractor --in=raw fb79021542..., Last passed: Test 8738
ccextractor --mp4vidtrack 5df914ce77..., Last passed: Test 8738
ccextractor --xmltv=3 --out=null 96efd279cf..., Last passed: Test 8738
ccextractor --datapid 2310 --autoprogram --out=srt --latin1 e639e54550..., Last passed: Test 8738

Congratulations: Merging this PR would fix the following tests:

ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

ccextractor-bot · 2026-04-02T11:50:05Z

CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit d56a6be...:

Report Name	Tests Passed
Broken	9/13
CEA-708	1/14
DVB	4/7
DVD	3/3
DVR-MS	2/2
General	22/27
Hardsubx	1/1
Hauppage	3/3
MP4	3/3
NoCC	10/10
Options	81/86
Teletext	20/21
WTV	13/13
XDS	31/34

Your PR breaks these cases:

ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 8e8229b88b...
ccextractor --autoprogram --out=ttxt --latin1 132d7df7e9...
ccextractor --autoprogram --out=ttxt --latin1 99e5eaafdc...
ccextractor --autoprogram --out=srt --latin1 b22260d065...
ccextractor --autoprogram --out=ttxt --latin1 --ucla 7aad20907e...
ccextractor --autoprogram --out=ttxt --latin1 01509e4d27...
ccextractor --autoprogram --out=ttxt --xds --latin1 --ucla 85058ad37e...
ccextractor --autoprogram --out=srt --latin1 --ucla b22260d065...
ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 7f41299cc7...

NOTE: The following tests have been failing on the master branch as well as the PR:

ccextractor --out=srt --latin1 --autoprogram 73d9313d64..., Last passed: Test 8611
ccextractor --out=ttxt --latin1 001dd8cdf7..., Last passed: Test 8611
ccextractor --out=srt --latin1 4d4e938ef6..., Last passed: Test 8611
ccextractor --service 1 --out=txt --no-bom --no-rollup ea83ff7bcb..., Last passed: Test 8611
ccextractor --service 1 --out=txt f17524b53f..., Last passed: Test 8611
ccextractor --service 1 --out=txt 80848c45f8..., Last passed: Test 8611
ccextractor --service 1 --out=txt --no-bom --no-rollup b5d6aad89f..., Last passed: Test 8611
ccextractor --service 1[EUC-KR] --out=txt --no-rollup b5d6aad89f..., Last passed: Test 8611
ccextractor --service 1 --out=srt da904de35d..., Last passed: Test 8611
ccextractor --service 1 --out=sami da904de35d..., Last passed: Test 8611
ccextractor --service 1 --out=ttxt da904de35d..., Last passed: Test 8943
ccextractor --service 1[EUC-KR] b5d6aad89f..., Last passed: Test 8611
ccextractor --service 1[EUC-KR] --no-rollup b5d6aad89f..., Last passed: Test 8611
ccextractor --service all da904de35d..., Last passed: Test 8611
ccextractor --service all[EUC-KR] b5d6aad89f..., Last passed: Test 8611
ccextractor --service 1,2[UTF-8],3[EUC-KR],54 --out=txt da904de35d..., Last passed: Test 8611
ccextractor --autoprogram --out=srt --latin1 d41b53b504..., Last passed: Test 8611
ccextractor --stdout --quiet --no-fontcolor 79a51f3500..., Last passed: Test 8611
ccextractor --stdout --quiet --no-fontcolor 767b546f96..., Last passed: Test 8611
ccextractor --service 1 c83f765c66..., Last passed: Test 8611
ccextractor --myth c83f765c66..., Last passed: Test 8611
ccextractor --in=raw fb79021542..., Last passed: Test 8611
ccextractor --mp4vidtrack 5df914ce77..., Last passed: Test 8611
ccextractor --xmltv=3 --out=null 96efd279cf..., Last passed: Test 8611
ccextractor --datapid 2310 --autoprogram --out=srt --latin1 e639e54550..., Last passed: Test 8611

Congratulations: Merging this PR would fix the following tests:

ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65..., Last passed: Never
ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b..., Last passed: Never
ccextractor --out=spupng c83f765c66..., Last passed: Never
ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

cfsmp3 · 2026-04-04T21:30:31Z

Closing — a few issues:

matroska.c/h changes already merged via fix: MKV subtitle track .(null) extension for KATE and unknown codec IDs #2250 — these will now conflict.
No PR description or repro: the checkboxes are unchecked and there's no explanation of what system/configuration triggers the double tessdata path. On what platform does TESSDATA_PREFIX point directly to the tessdata/ dir instead of its parent? We need a concrete repro.
Missing braces on if blocks per our code style.

The OCR path fix itself looks reasonable — please resubmit as a new PR with:

Just the ocr.c changes (drop the matroska changes)
A description explaining the scenario and how to reproduce
Braces on all if/else blocks

Thanks.

Dhanush Varma added 5 commits April 2, 2026 04:14

style: apply clang-format to ocr.c

6f463a7

cfsmp3 closed this Apr 4, 2026

DhanushVarma-2 wants to merge 5 commits into CCExtractor:master from DhanushVarma-2:fix/tessdata-prefix-path-resolution
