Textract-py3 #543

KyleKing · 2024-12-04T11:36:53Z

Hi! FWIW, I have been maintaining a very minimal published fork of textract to address the blocking issue with * dependencies. This PR tracks the differences for anyone who wants to use textract-py3 while the long-term maintenance of textract is resolved (#498). Once textract is released with a patch for #461, I will redirect users away from my fork, because hopefully the upstream package will resume merging PRs, patching, and releasing new versions. Unfortunately, I won't have the bandwidth to contribute beyond this fork.

Clarification: `_getStringStream` *should* return `unicode` in Python 2 and `str` in Python 3 if the requested stream exists. If it does not exist, it returns `None`, which cannot be added to bytes. This commit adds a check for `None` and returns an empty bytes string when it matches.
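A minimal sketch of the guard described above. `stream_or_empty` and the `streams` dictionary are hypothetical stand-ins for illustration, not the fork's actual code; only the "return empty bytes when the stream is `None`" behaviour comes from the commit message.

```python
from typing import Callable, Optional, Union

Stream = Optional[Union[str, bytes]]


def stream_or_empty(get_stream: Callable[[str], Stream], name: str) -> Union[str, bytes]:
    """Return the named stream, or b"" when the stream does not exist."""
    value = get_stream(name)
    if value is None:
        # A missing stream would otherwise propagate None into a concatenation.
        return b""
    return value


# Hypothetical stand-in for a message object that only has a "subject" stream.
streams = {"subject": "Quarterly report"}

present = stream_or_empty(streams.get, "subject")
missing = stream_or_empty(streams.get, "body")

# The missing stream can now be concatenated with bytes without a TypeError.
combined = b"body: " + missing
```

The point of the guard is that callers no longer need to special-case a missing stream before concatenating.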

There are small typos in:
- docs/installation.rst
- textract/exceptions.py

Fixes:
- Should read `suppressed` rather than `supressed`.
- Should read `documentation` rather than `documenation`.
- Should read `accommodated` rather than `accomodated`.

Signed-off-by: Tim Gates <[email protected]>

As of now, the txt parser reads files in text mode as UTF-8 and fails with other encodings. This makes it return a bytes object, leaving the base `decode` to figure out the encoding and act accordingly.
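A rough sketch of the idea, assuming a simplified setup: the parser step reads raw bytes, and a separate decode step picks the encoding. `read_raw` and `decode_with_fallback` are hypothetical names, and the ordered fallback list is a stand-in I chose for illustration. It is not how textract's base `decode` detects encodings.

```python
import os
import tempfile


def read_raw(path: str) -> bytes:
    """Read the file in binary mode so no encoding is assumed up front."""
    with open(path, "rb") as stream:
        return stream.read()


def decode_with_fallback(data: bytes, encodings=("utf-8", "cp1252", "latin-1")) -> str:
    """Try each candidate encoding in order and return the first clean decode."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached if every candidate failed (possible with a custom list).
    return data.decode("utf-8", errors="replace")


# A Latin-1 file that would fail if opened in text mode as UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as stream:
        stream.write("café".encode("latin-1"))
    text = decode_with_fallback(read_raw(path))
```

Splitting the read from the decode keeps the parser itself encoding-agnostic, which is the behaviour the commit message describes.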

docs: sync latest upstream changes

StevenMapes · 2025-02-17T11:34:41Z

@KyleKing any chance you could update the deps on your fork and release a new version, please? It's pinned to an older version of msg-extractor, which had a pinned dep that broke when BS4 was updated. They've patched their part now; please see
TeamMsgExtractor/msg-extractor#450

Great work on the fork, though. Keep it going, please!

KyleKing · 2025-02-17T18:03:41Z

Of course! I’m glad that you’re finding it to be useful and I’m happy to accept PRs. I’m out of town now, but I’ll release no later than the 24th!

KyleKing · 2025-02-24T02:41:14Z

My fork doesn't have any upper bound constraints on dependencies (https://github.com/KyleKing/textract-py3/blob/f3f509f8c36be05c85574ac841632693c2a19bd9/pyproject.toml#L22), so any project with Python >=3.8,<4.0 can install extract-msg==0.53.1

For textract-py3, I don't plan on dropping support for Python 3.7 yet, but the relaxed specifications shouldn't introduce artificial version limits.

anthonyhashemi · 2025-02-26T10:55:34Z

Hey @KyleKing, thanks for creating the fork and maintaining it for the moment.

I tried using it, but my colleague pointed out that one of our tests started to break because textract-py3 was installing xlrd 2.1.0. xlrd 2 and upwards removed support for anything but .xls, so textract would no longer support xlsx. I fixed it by explicitly requiring xlrd==1.2.0.

When you get the chance you may want to restrict xlrd in the toml to xlrd = ">=1.2.0,<2.0.0" (I think)
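For anyone who wants to check this at runtime, here is a small sketch of the same range test in Python. `xlrd_in_supported_range` is a hypothetical helper, and it assumes plain numeric version strings such as `1.2.0` (pre-release suffixes are not handled).

```python
def xlrd_in_supported_range(version: str) -> bool:
    """True when version satisfies >=1.2.0,<2.0.0 (numeric versions only)."""
    parts = tuple(int(piece) for piece in version.split(".")[:3])
    return (1, 2, 0) <= parts < (2, 0, 0)


pinned_ok = xlrd_in_supported_range("1.2.0")
upgraded_ok = xlrd_in_supported_range("2.1.0")
```

On Python 3.8+, the installed version could be passed in via `importlib.metadata.version("xlrd")`.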

KyleKing · 2025-02-26T13:26:35Z

Thanks @anthonyhashemi! You're right and I'll publish a release this afternoon

Long-term, maybe adapting changes to use openpyxl for xlsx from #433 would be worthwhile in the case that xlrd no longer works reliably for xlsx

Prevents upgrading xlrd

KyleKing · 2025-02-26T23:54:15Z

I've released v2.1.1 with the patch. Thanks again!

https://pypi.org/project/textract-py3

StevenMapes · 2025-03-20T14:38:04Z

Moving over to openpyxl for xlsx would be good, because the constraint on xlrd could then be removed. That would be useful for projects that also use pandas for XLS, since pandas now requires v2.0.1+ of xlrd when using the xlrd engine.
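A rough sketch of what that split could look like, assuming dispatch by file extension. All function names here are hypothetical and this is not the code from #433; the openpyxl and xlrd imports are done lazily so that only the backend actually needed has to be installed.

```python
import os


def pick_excel_backend(path: str) -> str:
    """Choose a reader library from the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".xlsx":
        return "openpyxl"
    if ext == ".xls":
        return "xlrd"
    raise ValueError(f"unsupported spreadsheet extension: {ext!r}")


def rows_from_xlsx(path):
    """Yield rows from every sheet of an .xlsx file using openpyxl."""
    import openpyxl  # lazy import: only required for .xlsx input

    workbook = openpyxl.load_workbook(path, read_only=True, data_only=True)
    try:
        for sheet in workbook.worksheets:
            for row in sheet.iter_rows(values_only=True):
                yield list(row)
    finally:
        workbook.close()


def rows_from_xls(path):
    """Yield rows from every sheet of an .xls file using xlrd."""
    import xlrd  # lazy import: only required for .xls input

    workbook = xlrd.open_workbook(path)
    for sheet in workbook.sheets():
        for index in range(sheet.nrows):
            yield sheet.row_values(index)


def iter_rows(path):
    """Route the file to the matching reader."""
    if pick_excel_backend(path) == "openpyxl":
        return rows_from_xlsx(path)
    return rows_from_xls(path)
```

With this layout xlrd is only used for .xls, so its version would no longer need an upper bound for xlsx support.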

KyleKing · 2025-03-21T01:49:06Z

That would be great and I would be open to PRs! I don’t have a use case for Excel, so I wouldn’t make the changes otherwise

TheElementalOfDestruction and others added 19 commits June 16, 2022 01:04

build: initialize poetry (#1)

c7ebf22

docs: fix changelog version (#2)

f27cd05

release: cleanup 2.0.1 release (#3)

b7c3647

Enable encoding detection for the txt parser

0123795

As of now, the txt parser reads files in text mode as UTF-8 and fails with other encodings. This makes it return a bytes object, leaving the base `decode` to figure out the encoding and act accordingly.

docs: remove reference to rtx

4c7a71b

treat ShellError when call pdf2txt.py

8d51c54

support python3.12

78f9e64

Merge pull request #4 from deanmalmgren/master

32b1b22

docs: sync latest upstream changes

Merge pull request #7 from timgates42/bugfix_typos

05e6353

Merge pull request #9 from branchvincent/py312

b0b5e25

Merge pull request #6 from LoicGrobol/patch-1

aa8eb5b

Merge pull request #8 from TheElementalOfDestruction/fix-msg-support

5b55f5c

Merge pull request #10 from dhrim/master

e9349b3

build: update dependencies

cfb59f9

Bump version: 2.0.1 → 2.1.0

ac3c17e

docs: add publish step to README

c0628b2

Merge pull request #11 from KyleKing/release

f3f509f

KyleKing marked this pull request as draft December 4, 2024 11:36

KyleKing changed the title ~~textrat-py3 Tracker~~ Textract-py3 Dec 4, 2024

build: update lockfile

0bc7e89

fix: prevent errors parsing xlsx

b812a97

Prevents upgrading xlrd

build: finish updating release documentation

12f2ece

StevenMapes mentioned this pull request Apr 3, 2025

textract (>=1.6.5,<2.0.0) requires xlrd (>=1.2.0,<1.3.0). #544

Open
