Fix Hardsubx OCR by hrideshmg · Pull Request #1741 · CCExtractor/ccextractor

hrideshmg · 2025-08-30T22:03:23Z

In raising this pull request, I confirm the following (please check boxes):

I have read and understood the contributors guide.
I have checked that another pull request for this purpose does not exist.
I have considered, and confirmed that this submission will be valuable to others.
I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
I give this submission freely, and claim no ownership to its content.
I have mentioned this change in the changelog.

My familiarity with the project is as follows (check one):

I have never used CCExtractor.
I have used CCExtractor just a couple of times.
I absolutely love CCExtractor, but have not contributed previously.
I am an active contributor to CCExtractor.

Hardsubx is currently broken in the Rust code on the master branch.

Fixed a segmentation fault on Linux that was caused by a null pointer dereference in dispatch_classifier_function; I've added proper error handling to fix this. While here, I also noticed that the values in the match statement for the classifier functions were incorrect (see here). The pixDilateGray() function was also being called with an incorrect argument (see here).
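The kind of error handling described above can be sketched as follows. This is a minimal illustration, not the code from this PR: `OcrError`, `OcrMode`, `mode_from_int`, `text_from_raw` and the integer values 0/1/2 are all hypothetical names and values chosen for the example. It shows a null check before a raw pointer is read, and a match that rejects unknown values instead of silently mapping them.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum OcrError {
    NullResult,
    UnknownMode(i32),
}

/// Hypothetical OCR modes; the real project defines its own constants.
#[derive(Debug, PartialEq)]
enum OcrMode {
    Frame,
    Word,
    Letter,
}

/// Map an integer option to a mode. The values 0/1/2 are assumptions
/// made for this example, not the values used by CCExtractor.
fn mode_from_int(v: i32) -> Result<OcrMode, OcrError> {
    match v {
        0 => Ok(OcrMode::Frame),
        1 => Ok(OcrMode::Word),
        2 => Ok(OcrMode::Letter),
        other => Err(OcrError::UnknownMode(other)),
    }
}

/// Copy a C string returned by a classifier into an owned String,
/// returning an error instead of dereferencing a null pointer.
fn text_from_raw(ptr: *const c_char) -> Result<String, OcrError> {
    if ptr.is_null() {
        return Err(OcrError::NullResult);
    }
    // SAFETY: ptr is non-null and, by assumption, points to a
    // NUL-terminated string that outlives this call.
    let text = unsafe { CStr::from_ptr(ptr) };
    Ok(text.to_string_lossy().into_owned())
}

fn main() {
    let owned = CString::new("subtitle").unwrap();
    println!("{:?}", text_from_raw(owned.as_ptr()));
    println!("{:?}", text_from_raw(std::ptr::null()));
    println!("{:?}", mode_from_int(7));
}
```

Returning a `Result` lets the caller decide whether a missing OCR result should skip the frame or abort, rather than crashing the process.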

CCExtractor ran after this fix, but the output was garbled. On running some tests, I found that the luminance mask was fully white. This is because the Srgb::new() function expects values in the range 0.0-1.0, whereas we were passing values in the range 0.0-255.0.
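To illustrate why unnormalized channel values produce an all-white mask, here is a small standalone sketch. It does not use the project's color library; `srgb_to_linear`, `lightness` and `lightness_from_u8` are hypothetical helpers that apply the standard sRGB transfer function and the CIE L* formula. Feeding 0-255 values into code that expects 0.0-1.0 saturates the luminance, so even a dark pixel appears fully white.

```rust
/// Convert one gamma-encoded sRGB channel (expected in 0.0..=1.0) to linear light.
fn srgb_to_linear(c: f32) -> f32 {
    if c <= 0.04045 {
        c / 12.92
    } else {
        ((c + 0.055) / 1.055).powf(2.4)
    }
}

/// CIE L* (0..=100) from sRGB channels that must already be in 0.0..=1.0.
fn lightness(r: f32, g: f32, b: f32) -> f32 {
    // Relative luminance Y using the Rec. 709 / sRGB coefficients.
    let y = 0.2126 * srgb_to_linear(r)
        + 0.7152 * srgb_to_linear(g)
        + 0.0722 * srgb_to_linear(b);
    // Clamp so that out-of-range input saturates at white.
    let y = y.clamp(0.0, 1.0);
    if y > 0.008856 {
        116.0 * y.cbrt() - 16.0
    } else {
        903.3 * y
    }
}

/// Normalize 8-bit channels to 0.0..=1.0 before computing lightness.
fn lightness_from_u8(r: u8, g: u8, b: u8) -> f32 {
    lightness(r as f32 / 255.0, g as f32 / 255.0, b as f32 / 255.0)
}

fn main() {
    let (r, g, b) = (40u8, 40u8, 40u8); // a dark gray pixel
    let wrong = lightness(r as f32, g as f32, b as f32); // 0-255 passed directly
    let right = lightness_from_u8(r, g, b); // normalized first
    println!("unnormalized L* = {:.1}, normalized L* = {:.1}", wrong, right);
}
```

With the unnormalized input the dark gray pixel reaches the maximum lightness of 100, which is consistent with the fully white mask described above; with normalized input it stays well below any plausible whiteness threshold.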

I've also enabled hardsubx for the test builds. This is to facilitate adding regression tests on the sample platform.

ccextractor-bot · 2025-08-30T23:54:35Z

CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit 4b5f68a...:

Report Name	Tests Passed
Broken	13/13
CEA-708	14/14
DVB	7/7
DVD	3/3
DVR-MS	2/2
General	27/27
Hauppage	3/3
MP4	3/3
NoCC	10/10
Options	86/86
Teletext	21/21
WTV	13/13
XDS	34/34

All tests passing on the master branch were passed completely.

Check the result page for more info.

ccextractor-bot · 2025-08-31T01:04:21Z

CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit 4b5f68a...:

Report Name	Tests Passed
Broken	13/13
CEA-708	14/14
DVB	4/7
DVD	3/3
DVR-MS	2/2
General	27/27
Hauppage	3/3
MP4	3/3
NoCC	10/10
Options	86/86
Teletext	21/21
WTV	13/13
XDS	34/34

NOTE: The following tests have been failing on the master branch as well as the PR:

ccextractor --stdout --quiet --no-fontcolor 79a51f3500..., Last passed: Never
ccextractor --stdout --quiet --no-fontcolor 767b546f96..., Last passed: Never
ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2..., Last passed: Never

Congratulations: Merging this PR would fix the following tests:

ccextractor --autoprogram --out=srt --latin1 f1422b8bfe..., Last passed: Never
ccextractor --datapid 5603 --autoprogram --out=srt --latin1 --teletext 85c7fc1ad7..., Last passed: Never
ccextractor --out=spupng c83f765c66..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 c0d2fba8c0..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 006fdc391a..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 e92a1d4d2a..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 7e4ebf7fd7..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 9256a60e4b..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 27d7a43dd6..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 297a44921a..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 efbe129086..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 eae0077731..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 e2e2b501e0..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 c6407fb294..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 --datets dcada745de..., Last passed: Never
ccextractor --autoprogram --out=srt --latin1 --tpage 398 5d5838bde9..., Last passed: Never
ccextractor --autoprogram --out=srt --latin1 --teletext --tpage 398 3b276ad8bf..., Last passed: Never
ccextractor --out=srt --latin1 f23a544ba8..., Last passed: Never
ccextractor --out=srt --latin1 10f0f77cf4..., Last passed: Never
ccextractor --out=srt --latin1 df3b4d62d3..., Last passed: Never

All tests passing on the master branch were passed completely.

Check the result page for more info.

prateekmedia

LGTM!

rboy1 · 2025-09-06T15:19:40Z

Since this doesn't appear to be an exact port from C to Rust, some of the thresholds have changed and consequently the accuracy of the OCR has changed, i.e. you will see different results on C vs Rust builds.

So here's my question: how does one tune the OCR thresholds? If the C version was working better than the Rust port, there should be some way to tweak the thresholds.

cfsmp3 · 2025-09-06T16:36:51Z

Since this doesn't appear to be an exact port from C to Rust, some of the thresholds have changed and consequently the accuracy of the OCR has changed, i.e. you will see different results on C vs Rust builds.

So here's my question: how does one tune the OCR thresholds? If the C version was working better than the Rust port, there should be some way to tweak the thresholds.

If the C version was working better, then the Rust version is not ready, that's all.
This applies to all subsystems, of course.

Feel free to share samples and how to compare both versions; that's the best way to help get us where we want to be.

hrideshmg · 2025-09-08T17:03:15Z

@rboy1 #1746 should fix the discrepancies between the Rust and C implementations. I'd also suggest playing around with --whiteness-thresh; setting it to lower values on certain files seemed to produce really good results.
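As a rough illustration of why lowering the threshold can help, the sketch below builds a binary mask from per-pixel lightness values. It assumes the threshold is a minimum lightness on a 0-100 scale that a pixel must reach to count as subtitle text; `whiteness_mask` and the sample values are hypothetical and are not taken from the CCExtractor source.

```rust
/// Build a binary mask: 255 where lightness >= thresh, otherwise 0.
/// `lightness` holds hypothetical per-pixel L* values in 0.0..=100.0.
fn whiteness_mask(lightness: &[f32], thresh: f32) -> Vec<u8> {
    lightness
        .iter()
        .map(|&l| if l >= thresh { 255 } else { 0 })
        .collect()
}

fn main() {
    // One row of pixels: background, slightly dim subtitle pixels, bright pixel, background.
    let row = [12.0, 88.0, 93.5, 97.0, 40.0];
    println!("{:?}", whiteness_mask(&row, 95.0)); // strict threshold
    println!("{:?}", whiteness_mask(&row, 85.0)); // lower threshold keeps dimmer text
}
```

A lower threshold keeps subtitle pixels that are not pure white, for example after lossy compression, at the cost of admitting more bright background.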

hrideshmg force-pushed the hardsubx_fixes branch 6 times, most recently from eb97424 to 858925a on August 30, 2025 22:33

hrideshmg added 2 commits August 31, 2025 04:38

fix: hardsubx segmentation fault

057671c

fix: hardsubx garbage output

2a54f6f

hrideshmg force-pushed the hardsubx_fixes branch from 858925a to 4c5d001 on August 30, 2025 23:09

chore: enable hardsubx on test builds

83483c4

hrideshmg force-pushed the hardsubx_fixes branch from 4c5d001 to 83483c4 on August 30, 2025 23:14

prateekmedia approved these changes Sep 2, 2025

prateekmedia merged commit 3f44115 into CCExtractor:master Sep 2, 2025
17 of 18 checks passed