Add PGS to SRT OCR conversion feature #701
- Add dropdown menu for PGS subtitle tracks with OCR option
- Auto-detect Tesseract OCR on all drives and Windows registry
- Add settings panel with dependency status display
- Support for converting image-based PGS to editable SRT
- Handles language code conversion and environment setup
- Includes comprehensive error handling and user guidance
cdgriffith
left a comment
Thank you so much for this wonderful addition!
I have suggested a few tweaks. If you are up for doing them, let me know; otherwise I can merge it and work on it myself, as you have set up a great feature I would love to add!
If you do add more please also run the pre-commit checks so it passes linting:
pre-commit install
pre-commit run --all-files
Ah, that's how you do it - thank you. Will do for future commits, and go back to see if I can do it on this PR also.
c067e1f to cc88b50
Hi @cdgriffith - My earlier commit missed the Based on our threaded convo, I think I've addressed all issues:
- Use environment variables for Windows tool detection instead of scanning all drives (LOCALAPPDATA, PROGRAMFILES, PROGRAMFILES(X86))
- Remove pgsrip_path config field and use pgsrip Python API directly
- Update dependency checks to use importlib for pgsrip library
- Fix BabelLanguage to handle both 2-letter and 3-letter ISO codes
- Update error messages and installation instructions

All changes pass pre-commit linting checks.
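For readers following along, the environment-variable lookup and the importlib-based dependency check described in this commit could look roughly like the sketch below. It is not the PR's code: the function names `find_windows_tool` and `library_available`, and the relative path passed in, are hypothetical and chosen only for illustration.

```python
import importlib.util
import os
from pathlib import Path
from typing import Mapping, Optional

# Environment variables named in the commit message above.
WINDOWS_BASE_VARS = ("LOCALAPPDATA", "PROGRAMFILES", "PROGRAMFILES(X86)")


def find_windows_tool(relative_path: str, env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the first existing <base>/<relative_path>, or None.

    `relative_path` (for example "Tesseract-OCR/tesseract.exe") is an
    assumed install layout, not something confirmed by the PR.
    """
    env = os.environ if env is None else env
    for var in WINDOWS_BASE_VARS:
        base = env.get(var)
        if not base:
            continue  # Variable unset, e.g. when not running on Windows.
        candidate = Path(base) / relative_path
        if candidate.is_file():
            return candidate
    return None


def library_available(module_name: str) -> bool:
    """Check whether a top-level module can be imported, without importing it."""
    return importlib.util.find_spec(module_name) is not None
```

Checking a handful of known base directories keeps startup fast and avoids touching unrelated drives, which is presumably why the drive scan was dropped.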
cc88b50 to 1c6c486
The glob pattern was failing when filenames contained brackets like [imdbid-tt0187738] because glob interprets [] as character classes. Changed to detect newly created .srt files by comparing before/after directory listings instead of using filename-based glob patterns. Fixes false error for files like "Blade II (2002) [imdbid-tt0187738].mkv"
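The before/after comparison described in this commit can be sketched as follows. This is a minimal illustration under assumptions: `new_srt_files` and its `convert` callback are hypothetical names, and the real change may collect the listings differently.

```python
from pathlib import Path
from typing import Callable, Set


def list_srt_files(directory: Path) -> Set[Path]:
    """Return every .srt file currently present in `directory`."""
    return {p for p in directory.iterdir() if p.suffix.lower() == ".srt"}


def new_srt_files(directory: Path, convert: Callable[[], object]) -> Set[Path]:
    """Run `convert` and return the .srt files it created in `directory`.

    No filename pattern is involved, so brackets or parentheses in the
    video name cannot be misread as glob character classes.
    """
    before = list_srt_files(directory)
    convert()
    return list_srt_files(directory) - before
```

An alternative that keeps the glob approach would be to pass the literal part of the filename through `glob.escape()` before appending the wildcard.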
Include package metadata for pgsrip, pytesseract, and babelfish in the Windows builds to fix 'No package metadata was found' error when running OCR conversion from the compiled executable.
Add collect_data_files('babelfish') to bundle ISO language code
data files needed by babelfish at runtime.
Add copy_metadata('cleanit') for pgsrip dependency.
Add collect_data_files('cleanit') to bundle YAML config files
needed by cleanit at runtime.
Add copy_metadata('trakit') for pgsrip dependency.
@cdgriffith - Ok, I think we're finally there. The working screenshots I showed in #701 (comment) were based on running via Python in a Windows command prompt. After that, I realized we needed a bunch more work and tweaks to get it working OOB via the compiled binary. So, commits f5ddccc through 4f8e347 are just that. It works beautifully now. I think it's finally ready for your ACK/NACK. Sorry for all the noise, it's been a long time since I've done a PR - but I hope this brings some extra functionality and usefulness for someone out there. I, for one, use SRT subtitles alongside every MKV I stream via Jellyfin. Without them, it's a transcode every time I start a movie just to render the Blu-ray subtitles natively. HTH. YMMV.
Include pgsrip, pytesseract, babelfish, cleanit, trakit, opencv-python, and pysrt in project dependencies to fix Windows build error where PyInstaller's copy_metadata() could not find package metadata for packages that weren't installed during the build process.
Summary of Changes in aacb011 Fixed Windows build error where PyInstaller couldn't find package metadata for OCR dependencies. Changes:
Why this fixes the build: The spec files use |
Include all babelfish.converters submodules (alpha2, alpha3b, alpha3t, name, opensubtitles) in PyInstaller hidden imports to fix 'No module named babelfish.converters.alpha2' error during OCR conversion.
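Taken together, the metadata, data-file, and hidden-import commits above amount to a spec-file fragment along these lines. This is a sketch rather than the PR's actual spec: the PR lists the babelfish converter submodules explicitly, whereas this fragment swaps in `collect_submodules()` for brevity, and the variable names `datas` and `hiddenimports` are assumed to be what gets passed on to `Analysis(...)`.

```python
# Fragment of a PyInstaller .spec file (illustrative only).
from PyInstaller.utils.hooks import collect_data_files, collect_submodules, copy_metadata

datas = []

# Package metadata, so metadata lookups succeed inside the frozen app.
# copy_metadata() only works for packages installed in the build environment.
for package in ("pgsrip", "pytesseract", "babelfish", "cleanit", "trakit"):
    datas += copy_metadata(package)

# Non-Python data files needed at runtime
# (ISO language tables for babelfish, YAML configs for cleanit).
for package in ("babelfish", "cleanit"):
    datas += collect_data_files(package)

# Converter submodules that static analysis may not discover on its own.
hiddenimports = collect_submodules("babelfish.converters")
```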
Add mkvtoolnix directory to PATH environment variable so pgsrip can find mkvextract executable when performing OCR conversion. This fixes the 'mkvextract command not found' error.
Change working directory to video folder and use relative filename when calling pgsrip to avoid issues with special characters (parentheses, brackets) in Windows paths that may cause mkvextract to fail.
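The two environment tweaks in the commits above (putting the MKVToolNix directory on PATH, and running from the video's folder with a relative filename) can be sketched with the standard library alone. The helper names `prepend_to_path` and `working_directory` are hypothetical and not taken from the PR.

```python
import os
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


def prepend_to_path(directory: Path) -> None:
    """Put `directory` first on PATH unless it is already listed."""
    entries = [e for e in os.environ.get("PATH", "").split(os.pathsep) if e]
    if str(directory) not in entries:
        os.environ["PATH"] = os.pathsep.join([str(directory), *entries])


@contextmanager
def working_directory(directory: Path) -> Iterator[None]:
    """Temporarily change the current working directory, restoring it on exit."""
    previous = Path.cwd()
    os.chdir(directory)
    try:
        yield
    finally:
        os.chdir(previous)
```

A caller would then wrap the conversion in `with working_directory(video.parent):` and hand the library only `video.name`, so the special characters in the parent folders never reach the external command line.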
@@ -1,5 +1,5 @@
 # -*- mode: python ; coding: utf-8 -*-
-from PyInstaller.utils.hooks import collect_submodules
+from PyInstaller.utils.hooks import collect_submodules, copy_metadata, collect_data_files
I did not know about those functions, handy!
Thanks! Just testing some final changes. I had to deal with detection for Subtitle Edit's tesseract installations. It works locally, testing a build now.
Check AppData/Roaming/Subtitle Edit for Tesseract installations, parse version numbers from directory names (e.g., Tesseract550), and automatically select the newest version. This ensures modern Tesseract versions are detected even when multiple versions exist.
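Picking the newest versioned directory could be done roughly as below. This is a sketch under a stated assumption: each digit in a name like Tesseract550 is treated as one version component (5.5.0), which is a guess from the single example given. The function names are hypothetical.

```python
import re
from pathlib import Path
from typing import Iterable, Optional, Tuple

# Matches directory names such as "Tesseract550" (prefix plus digits only).
_VERSION_DIR = re.compile(r"^Tesseract(\d+)$", re.IGNORECASE)


def version_key(dir_name: str) -> Optional[Tuple[int, ...]]:
    """Turn 'Tesseract550' into (5, 5, 0); return None for other names."""
    match = _VERSION_DIR.match(dir_name)
    if match is None:
        return None
    return tuple(int(digit) for digit in match.group(1))


def newest_tesseract_dir(directories: Iterable[Path]) -> Optional[Path]:
    """Return the directory with the highest parsed version, if any."""
    versioned = [(version_key(d.name), d) for d in directories]
    versioned = [(key, d) for key, d in versioned if key is not None]
    if not versioned:
        return None
    return max(versioned, key=lambda item: item[0])[1]
```

Comparing tuples rather than the raw digit strings avoids ordering surprises if the number of digits ever differs between installations.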
Initialize PATH environment variables for tesseract and mkvextract at application startup before any subprocesses are spawned. This ensures frozen PyInstaller executables can properly pass environment to subprocesses spawned by pgsrip library.
Set TEMP and TMP environment variables to standard temp directory to ensure pgsrip can create temporary folders correctly when running from frozen PyInstaller executable.
Override pgsrip's temp folder creation to work correctly in frozen PyInstaller executables. pgsrip's MediaPath.create_temp_folder() doesn't work properly when frozen, so we create our own temp folder if the one provided doesn't exist.
Ensure the monkey-patch is applied before importing Mkv class to prevent pgsrip from capturing the original read_data method in its lambda closures. This should fix PyInstaller temp folder issue.
Move the pgsrip monkey-patch to setup_ocr_environment() which runs at application startup, before any pgsrip imports. This ensures the patch is applied before pgsrip's lambda closures are created, fixing temp folder creation in PyInstaller frozen executables.
Move patch_pgsrip_for_pyinstaller() to run AFTER environment variables are set up, in case pgsrip import requires the environment to be configured first.
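The ordering concern behind these monkey-patch commits is a general Python behavior, shown here with a hypothetical stand-in class. Nothing below is pgsrip code, and whether pgsrip actually captures its method this way is the commit author's hypothesis, not something verified here.

```python
class Media:
    """Hypothetical stand-in for a library class that gets patched."""

    def read_data(self) -> str:
        return "original"


# Early binding: the function object is stored when this line runs,
# so replacing Media.read_data later has no effect on `early`.
captured = Media.read_data
early = lambda media: captured(media)

# Late binding: the attribute is looked up each time `late` is called.
late = lambda media: media.read_data()


def patched_read_data(self: Media) -> str:
    return "patched"


# The monkey-patch is applied after `captured` was taken.
Media.read_data = patched_read_data

early_result = early(Media())  # still uses the original function
late_result = late(Media())    # picks up the patch
```

If a library stores a reference like `captured` at import time, a patch only takes effect when it is applied before that import, which is the motivation given for moving the patch to application startup.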
Simplify code back to working state from source. PyInstaller exe issue is a known pgsrip bug that needs to be fixed upstream. Feature works perfectly when running from source.
Add documentation explaining that PGS to SRT OCR conversion works from source but fails in PyInstaller builds due to pgsrip temp folder bug. Include workaround instructions and requirements.
Implement OCR conversion for PGS (Presentation Graphic Stream) subtitles to SRT format using pgsrip library with auto-detection of required tools.

Features:
- Auto-detect Tesseract OCR from PATH or Subtitle Edit installations
- Auto-detect MKVToolNix (mkvextract/mkvmerge) from standard locations
- Support for multiple language codes (2-letter, 3-letter, names)
- Automatic cleanup of temporary .sup files after conversion
- Works when running FastFlix from source

Known limitation: Due to an upstream issue in pgsrip v0.1.12, this feature does not work in PyInstaller-built executables. Users needing PGS OCR should run FastFlix from source with:
python -m fastflix

Dependencies added:
- pgsrip (OCR engine for PGS subtitles)
- pytesseract (Tesseract OCR Python wrapper)
- babelfish (language code handling)
- cleanit, trakit (metadata handling)
- opencv-python, pysrt (image/subtitle processing)
a66fdfb to 2f89be5
I give up. I simply can't figure out how to get it to work on a compiled binary (it compiles cleanly, but the srt extraction fails). It works perfectly from source, though, so that's good enough for my use case.
Hey @mikeSGman, can you re-open this? I'd like to merge it to dev and play around with it to see if I can get the build working for ya! This is a great feature and I would love to have it as part of the standard build.
It might be because I squashed my commits in my source branch, but I still have the code and can give it to you if it's useful.
GitHub won’t let me reopen this PR because the source branch history was rewritten. I opened a new PR with the same changes here: #709.



Add PGS to SRT OCR Conversion Feature
Summary
This PR adds support for converting image-based PGS (Presentation Graphic Stream) subtitles to text-based SRT format using OCR (Optical Character Recognition). This feature enables users to extract Blu-ray subtitles as editable text files.
Motivation
PGS subtitles are image-based and cannot be edited or searched. Many users want to:
Features
User-Facing Changes
Dropdown Menu for PGS Subtitles
Settings Panel
Smart Dependency Detection
User-Friendly Error Messages
Technical Implementation
Files Modified
fastflix/models/config.py
- find_ocr_tool() function to locate Tesseract, mkvmerge, and pgsrip
- enable_pgs_ocr, tesseract_path, mkvmerge_path, pgsrip_path
fastflix/widgets/background_tasks.py
- _check_pgsrip_dependencies() method to verify all required tools
- _convert_sup_to_srt() method to perform OCR conversion
- use_ocr parameter to ExtractSubtitleSRT class
fastflix/widgets/panels/subtitle_panel.py
fastflix/widgets/settings.py
- update_ocr_dependency_status() method to show dependency status
FastFlix_Windows_OneFile.spec
Dependencies
Key Design Decisions
Two-Step Process: First extract .sup file using FFmpeg, then convert with pgsrip
Language Code Conversion: Automatically converts ISO 639-2/T (eng) to ISO 639-1 (en)
Environment Variable Management: Sets both TESSERACT_CMD and PATH
Automatic Cleanup: Deletes .sup file after successful .srt conversion
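As an illustration of the language code conversion decision above, a table-driven version might look like the sketch below. The PR itself relies on babelfish for this. The mapping here is a small hand-written subset, and `to_alpha2` is a hypothetical name.

```python
from typing import Optional

# Illustrative subset only; a full ISO 639 table would come from a library.
# Both terminological (fra, deu) and bibliographic (fre, ger) codes are listed.
ISO_639_2_TO_1 = {
    "eng": "en",
    "fra": "fr",
    "fre": "fr",
    "deu": "de",
    "ger": "de",
    "spa": "es",
    "jpn": "ja",
}


def to_alpha2(code: str) -> Optional[str]:
    """Normalize a 2- or 3-letter language code to ISO 639-1, or return None."""
    code = code.strip().lower()
    if len(code) == 2:
        # Accept only 2-letter codes that the table knows about.
        return code if code in ISO_639_2_TO_1.values() else None
    if len(code) == 3:
        return ISO_639_2_TO_1.get(code)
    return None
```

Returning `None` for unknown codes lets the caller decide whether to fall back to a default language or surface an error to the user.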
Testing
Tested Scenarios
Test Results
Testing Checklist
Installation Instructions for Users
Windows
Linux
macOS
Breaking Changes
None. This is a purely additive feature that's disabled by default.
Migration Guide
No migration needed. Existing users will see the new option after updating and installing dependencies.
Future Enhancements
Potential improvements for future PRs:
Related Issues
Closes #[issue-number] (if applicable)
Screenshots
Checklist