Tahweel (تحويل)

Convert PDF files to DOCX and TXT.

Tahweel features

Convert PDF files to DOCX and TXT using Google's optical character recognition (OCR) technology
Convert a single file or an entire directory of files
Get output with the same number of pages as the PDF file

Requirements

A fast internet connection, because files are uploaded to Google's servers for processing
Service Account Credentials created on Google Cloud Platform, as described at this link
Python 3.10 or later installed on your machine
The poppler-utils package installed on your operating system
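
The two local requirements can be checked before installing. The following is a minimal sketch, not part of Tahweel: the helper name `check_requirements` is hypothetical, and it assumes that finding the `pdftoppm` executable (one of the tools shipped with poppler-utils) on your PATH is a good enough sign that poppler-utils is installed.

```python
import shutil
import sys


def check_requirements(min_python=(3, 10)):
  """Return a dict describing which local prerequisites are satisfied."""
  return {
    # The requirements list asks for Python 3.10 or later.
    'python': sys.version_info[:2] >= min_python,
    # ASSUMPTION: pdftoppm on PATH is used as a proxy for poppler-utils.
    'poppler': shutil.which('pdftoppm') is not None,
  }


if __name__ == '__main__':
  for name, ok in check_requirements().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

The internet connection and the Service Account Credentials are not checked here, since verifying them requires a real call to Google's servers.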

Installing Tahweel

Via `pip`

You can install Tahweel via pip with the command: pip install tahweel

From source

Download this repository by clicking Code and then Download ZIP, or by running the following command: git clone git@github.com:ieasybooks/tahweel.git
Unzip the file if you downloaded it as a ZIP, then change into the project directory
Run the following command to install Tahweel: poetry install

Using Tahweel

Available options

Paths of PDF files or of directories containing PDF files: pass the file or directory paths directly after the tahweel command. For example: tahweel "./pdfs"
Service Account Credentials file(s): pass the path of your JSON file from Google Cloud Platform (or the paths of several files) to the --service-account-credentials option. Passing several credentials files enables parallel processing
Target file extensions: pass the extensions to look for, such as .pdf or .jpg, to the --file-extensions option. The default covers .pdf and all image extensions supported on your machine
Number of PDF-to-image conversion threads: set it with the --pdf2image-thread-count option. Lower or raise this value depending on how powerful your machine is. The default is 8
Number of image-to-text (OCR) workers: set it with the --processor-max-workers option. Lower or raise this value depending on the quality of your internet connection. The default is 8
Output layout when processing a directory: when processing an entire directory of PDF files, choose the layout by passing either tree_to_tree or side_by_side to the --dir-output-type option. tree_to_tree creates, for each output type (TXT and DOCX), a new directory with the same structure as the original directory. side_by_side creates the TXT and DOCX files next to the PDF files inside the original directory. The default is tree_to_tree
Page separator in TXT files: set the text that separates pages in TXT files with the --txt-page-separator option. The default is PAGE_SEPARATOR
Removing newlines from DOCX files: newlines can be removed from the content before it is written to the DOCX file with the --docx-remove-newlines option, which is useful if you want the DOCX file to have the same number of pages as the PDF file. The default is False
Output formats: choose the output formats with the --output-formats option. If it is omitted, both formats are produced. Available formats:
- txt
- docx
Output directory: set the output directory with the --output-dir option. If you do not set it, the output location is derived from the file and directory paths you passed to Tahweel
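
The difference between the two --dir-output-type values comes down to where each output file lands. The sketch below illustrates that with `pathlib`. It is not Tahweel's implementation: the helper `output_path` is hypothetical, and the `<input>_<format>` name used for the mirrored directory is invented for illustration, so the directory name Tahweel actually creates may differ.

```python
from pathlib import Path


def output_path(input_dir, pdf_path, fmt, dir_output_type='tree_to_tree'):
  """Illustrate where the output for one PDF would land (hypothetical naming)."""
  input_dir, pdf_path = Path(input_dir), Path(pdf_path)

  if dir_output_type == 'side_by_side':
    # The output sits next to the source PDF, inside the original directory.
    return pdf_path.with_suffix(f'.{fmt}')

  if dir_output_type == 'tree_to_tree':
    # The output goes to a separate directory that mirrors the input structure.
    # ASSUMPTION: the "<input>_<format>" directory name is invented for this sketch.
    mirror_root = input_dir.with_name(f'{input_dir.name}_{fmt}')
    return (mirror_root / pdf_path.relative_to(input_dir)).with_suffix(f'.{fmt}')

  raise ValueError(f'unknown dir output type: {dir_output_type}')


if __name__ == '__main__':
  for layout in ('tree_to_tree', 'side_by_side'):
    print(layout, output_path('pdfs', 'pdfs/books/1.pdf', 'txt', layout))
```

With side_by_side the input directory fills up with generated files, while tree_to_tree keeps the originals untouched at the cost of extra directories.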

➜ tahweel --help
usage: tahweel --service-account-credentials SERVICE_ACCOUNT_CREDENTIALS [SERVICE_ACCOUNT_CREDENTIALS ...] [--file-extensions FILE_EXTENSIONS [FILE_EXTENSIONS ...]]
               [--pdf2image-thread-count PDF2IMAGE_THREAD_COUNT] [--processor-max-workers PROCESSOR_MAX_WORKERS] [--dir-output-type {tree_to_tree,side_by_side}]
               [--txt-page-separator TXT_PAGE_SEPARATOR] [--docx-remove-newlines] [--output-formats {txt,docx} [{txt,docx} ...]] [--output-dir OUTPUT_DIR] [--skip-output-check] [-h] [--version]
               files_or_dirs_paths [files_or_dirs_paths ...]

positional arguments:
  files_or_dirs_paths   Path to the file or directory to be processed.

options:
  --service-account-credentials SERVICE_ACCOUNT_CREDENTIALS [SERVICE_ACCOUNT_CREDENTIALS ...]
                        Paths to the service account credentials JSON files. Multiple credentials will enable parallel processing.
  --file-extensions FILE_EXTENSIONS [FILE_EXTENSIONS ...]
                        Custom file extensions to search for (e.g., .pdf, .jpg). If not provided, defaults to PDF and supported image formats.
  --pdf2image-thread-count PDF2IMAGE_THREAD_COUNT
                        (int, default=8) Number of threads to use for PDF to image conversion using `pdf2image` package.
  --processor-max-workers PROCESSOR_MAX_WORKERS
                        (int, default=8) Number of threads to use while performing OCR on PDF pages.
  --dir-output-type {tree_to_tree,side_by_side}
                        Use this argument when processing a directory. `tree_to_tree` means the output will be in a new directory beside the input directory with the same structure, while
                        `side_by_side` means the output will be in the same input directory beside each file.
  --txt-page-separator TXT_PAGE_SEPARATOR
                        (str, default=PAGE_SEPARATOR) Separator to use between pages in the output TXT file.
  --docx-remove-newlines
                        (bool, default=False) Remove newlines from the output DOCX file. Useful if you want DOCX and PDF to have the same page count.
  --output-formats {txt,docx} [{txt,docx} ...]
                        Format of the output files; if not specified, `txt` and `docx` formats will be produced.
  --output-dir OUTPUT_DIR
                        (pathlib._local.Path | None, default=None) Path to the output directory. This overrides the default output directory behavior.
  --skip-output-check   (bool, default=False) Use this flag in development only to skip the output check.
  -h, --help            show this help message and exit
  --version             show program's version number and exit

Converting from the command line

Converting a single PDF file

tahweel "./pdfs/1.pdf" \
  --service-account-credentials "./service_account_credentials.json" \
  --pdf2image-thread-count 8 \
  --processor-max-workers 8 \
  --txt-page-separator PAGE_SEPARATOR
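
Because the TXT output separates pages with the --txt-page-separator text, it can be split back into pages afterwards. This is a minimal sketch with hypothetical helper names (`split_pages`, `read_pages`); it assumes the file is UTF-8 encoded and that the separator appears between pages, and it strips surrounding whitespace because the exact newlines written around the separator are not documented here.

```python
from pathlib import Path


def split_pages(text, separator='PAGE_SEPARATOR'):
  """Split the content of a TXT output into a list of page texts."""
  # Empty pages are kept so the list length still matches the page count.
  return [page.strip() for page in text.split(separator)]


def read_pages(txt_path, separator='PAGE_SEPARATOR'):
  """Read a TXT output file (UTF-8 assumed) and return its pages."""
  return split_pages(Path(txt_path).read_text(encoding='utf-8'), separator)
```

If your documents might contain the literal text PAGE_SEPARATOR, pass a more distinctive value to --txt-page-separator and use the same value when splitting.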

Converting multiple PDF files and a directory

tahweel "./pdfs/1.pdf" "./pdfs/2.pdf" "./other_pdfs" \
  --service-account-credentials "./service_account_credentials.json" \
  --pdf2image-thread-count 8 \
  --processor-max-workers 8 \
  --txt-page-separator PAGE_SEPARATOR

Converting an entire directory of files

tahweel "./pdfs" \
  --service-account-credentials "./service_account_credentials.json" \
  --pdf2image-thread-count 8 \
  --processor-max-workers 8 \
  --dir-output-type tree_to_tree \
  --txt-page-separator PAGE_SEPARATOR \
  --docx-remove-newlines
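
If you want to drive these commands from a script, the argument list can be assembled in Python and handed to `subprocess`. The sketch below uses only flags that appear in the help output above; the helpers `build_command` and `run_tahweel` are hypothetical names, and the sketch assumes the `tahweel` executable is on your PATH.

```python
import subprocess


def build_command(paths, credentials, output_formats=None, output_dir=None,
                  dir_output_type=None, docx_remove_newlines=False):
  """Build the argument list for a `tahweel` command-line call."""
  # Input paths come first, as in the examples above.
  command = ['tahweel', *map(str, paths)]
  command += ['--service-account-credentials', *map(str, credentials)]
  if output_formats:
    command += ['--output-formats', *output_formats]
  if output_dir is not None:
    command += ['--output-dir', str(output_dir)]
  if dir_output_type is not None:
    command += ['--dir-output-type', dir_output_type]
  if docx_remove_newlines:
    command.append('--docx-remove-newlines')
  return command


def run_tahweel(*args, **kwargs):
  # check=True raises CalledProcessError if tahweel exits with a non-zero status.
  return subprocess.run(build_command(*args, **kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.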

Converting from Python code

You can use Tahweel from Python code as follows:

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from tahweel.enums import TahweelType
from tahweel.managers import PdfFileManager
from tahweel.processors import GoogleDriveOcrProcessor
from tahweel.writers import DocxWriter, TxtWriter
from tqdm import tqdm


def main():
  processor = GoogleDriveOcrProcessor('./service_account_credentials.json')
  pdf_file_manager = PdfFileManager(Path('./pdfs/1.pdf'), 8)
  pdf_file_manager.to_images()

  with ThreadPoolExecutor(max_workers=8) as executor:
    content = list(
      tqdm(executor.map(processor.process, pdf_file_manager.images_paths), total=pdf_file_manager.pages_count()),
    )

  TxtWriter(pdf_file_manager.txt_file_path(TahweelType.FILE)).write(content, 'PAGE_SEPARATOR')
  DocxWriter(pdf_file_manager.docx_file_path(TahweelType.FILE)).write(content, False)


if __name__ == '__main__':
  main()
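
The snippet above collects the results of `executor.map` directly into a list of page texts. That works because `executor.map` returns results in the order of its inputs, not in the order the workers finish, so the pages stay in order as long as the image paths are in page order (an assumption about `images_paths`). The standalone sketch below demonstrates this property with a stand-in function, `fake_ocr`, which is hypothetical and does no real OCR.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_ocr(page_number):
  """Stand-in for an OCR call; later pages finish sooner on purpose."""
  time.sleep(0.01 * (5 - page_number))
  return f'text of page {page_number}'


def ocr_in_order(page_numbers, max_workers=8):
  with ThreadPoolExecutor(max_workers=max_workers) as executor:
    # executor.map yields results in input order, not completion order.
    return list(executor.map(fake_ocr, page_numbers))


if __name__ == '__main__':
  print(ocr_in_order([1, 2, 3, 4]))
```

Even though page 4 finishes first here, it still comes last in the returned list.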

Converting with Docker

If you have Docker on your machine, it is the easiest way to use Tahweel. The following command downloads the Tahweel Docker image, converts a PDF file using Google Drive OCR, and writes the results to the current directory:

docker run -it --rm -v "$PWD:/tahweel" ghcr.io/ieasybooks/tahweel \
  "./pdfs/1.pdf" \
  --service-account-credentials "./service_account_credentials.json" \
  --pdf2image-thread-count 8 \
  --processor-max-workers 8 \
  --dir-output-type tree_to_tree \
  --txt-page-separator PAGE_SEPARATOR \
  --docx-remove-newlines

You can pass any of the Tahweel options described above, but the command must be run from inside the directory that contains both the PDF files you want to convert and your Service Account Credentials file.

The project relied heavily on the ocrarian.py repository to get Tahweel done faster; may God reward those who worked on it.
