Commit ca562a7

Merge branch 'master' of github.com:aleks-v-k/textract
2 parents 124c44f + 102a584 commit ca562a7

24 files changed, +147 -84 lines changed

.gitignore (+3)

@@ -25,6 +25,9 @@ var/
 pip-log.txt
 pip-delete-this-directory.txt

+# Virtual environments
+**/venv*
+
 # Unit test / coverage reports
 htmlcov/
 .tox/

.pyup.yml (+1 -1)

@@ -1,7 +1,7 @@
 update: all
 branch: master
 schedule: "every two weeks"
-pin: True
+pin: False
 requirements:
   - requirements/python:
       updates: all

.travis.yml (+4 -3)

@@ -1,5 +1,5 @@
-sudo: required
-dist: bionic
+dist: focal
+os: linux

 language: python
 python:
@@ -9,6 +9,7 @@ python:
 # install system dependencies here with apt-get.
 before_install:
   - sudo ./provision/debian.sh
+  - python -m pip install --upgrade pip

 # install python dependencies including this package in the travis
 # virtualenv
@@ -27,7 +28,7 @@ script:
   - cd tests && make && cd -
   - nosetests --with-coverage --cover-package=textract
   - cd tests && pytest && cd -
-  - pycodestyle textract/ bin/textract
+  # - pycodestyle textract/ bin/textract
   - if [[ $TRAVIS_PYTHON_VERSION == 3.7 ]];
     then cd docs && make html && cd -;
     fi

README.rst (+1 -1)

@@ -2,7 +2,7 @@
 ..
 .. * bumpversion {major|minor|patch}
 .. * git push && git push --tags
-.. * python setup.py sdist upload
+.. * twine upload -r textract dist/*
 .. * convert into release https://github.com/deanmalmgren/textract/releases

 textract

docs/changelog.rst (+10)

@@ -10,6 +10,14 @@ latest changes in development for next release
 ----------------------------------------------

 .. THANKS FOR CONTRIBUTING; ADD YOUR UNRELEASED CHANGES HERE!
+1.6.5
+-------------------
+
+* switched epub parsing to MIT license compatible package (`#411`_ by
+  `@jhale1805`_)
+
+1.6.4
+-------------------

 * several bug fixes, including:

@@ -276,6 +284,7 @@ latest changes in development for next release
 .. _@eiotec: https://github.com/eiotec
 .. _@evfredericksen: https://github.com/evfredericksen
 .. _@jaraco: https://github.com/jaraco
+.. _@jhale1805: https://github.com/jhale1805
 .. _@jsmith-mploir: https://github.com/jsmith-mploir
 .. _@kokxx: https://github.com/Kokxx
 .. _@levivm: https://github.com/levivm
@@ -356,3 +365,4 @@ latest changes in development for next release
 .. _#149: https://github.com/deanmalmgren/textract/issues/149
 .. _#150: https://github.com/deanmalmgren/textract/issues/150
 .. _#162: https://github.com/deanmalmgren/textract/issues/162
+.. _#411: https://github.com/deanmalmgren/textract/issues/411

docs/conf.py (+1 -1)

@@ -58,7 +58,7 @@
 # built documents.
 #
 # The short X.Y version.
-release = version = "1.6.3"
+release = version = "1.6.5"

 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.

docs/index.rst (+3 -1)

@@ -6,7 +6,7 @@
 textract
 ================================

-As undesireable as it might be, more often than not there is extremely
+As undesirable as it might be, more often than not there is extremely
 useful information embedded in Word documents, PowerPoint
 presentations, PDFs, etc---so-called "dark data"---that would be
 valuable for further textual analysis and visualization. While
@@ -44,6 +44,8 @@ file types by either mentioning them on the `issue tracker

 * ``.csv`` via python builtins

+* ``.tsv`` and ``.tab`` via python builtins
+
 * ``.doc`` via `antiword`_

 * ``.docx`` via `python-docx2txt`_
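
Review note on the ``.tsv`` / ``.tab`` entry above: the parser itself is not part of the hunks shown here, so the following is only a minimal sketch of what "via python builtins" could look like, assuming the standard ``csv`` module with a tab delimiter. The function name ``extract_tsv_text`` is hypothetical and is not taken from textract.

```python
import csv
import io


def extract_tsv_text(stream):
    """Return the cell text of a tab-separated stream, one row per line.

    Hypothetical helper for illustration; not textract's actual parser.
    """
    reader = csv.reader(stream, delimiter="\t")
    # Re-join cells with tabs so the output stays readable as plain text.
    return "\n".join("\t".join(row) for row in reader)


sample = io.StringIO("name\tcount\nalpha\t1\nbeta\t2\n")
text = extract_tsv_text(sample)
print(text)
```

Reading through ``csv.reader`` rather than splitting on ``"\t"`` by hand keeps quoted cells intact, which is presumably why a builtin-based parser is enough for this format.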

docs/installation.rst (+13 -1)

@@ -43,7 +43,7 @@ pypi.

 .. code-block:: bash

-    brew cask install xquartz
+    brew install --cask xquartz
     brew install poppler antiword unrtf tesseract swig
     pip install textract

@@ -62,6 +62,18 @@ pypi.
     homebrew, you may also need to install the python
     development header files for textract to properly install.

+FreeBSD
+-------
+
+Setting up this package on FreeBSD pretty much follows the steps for
+Ubuntu / Debian while using ``pkg`` as package manager.
+
+.. code-block:: bash
+
+    pkg install lang/python38 devel/py-pip textproc/libxml2 textproc/libxslt textproc/antiword textproc/unrtf \
+    graphics/poppler print/pstotext graphics/tesseract audio/flac multimedia/ffmpeg audio/lame audio/sox \
+    graphics/jpeg-turbo
+    pip install textract

 Don't see your operating system installation instructions here?
 ---------------------------------------------------------------

provision/debian.sh (-1)

@@ -16,6 +16,5 @@ base=$(pwd)

 # Install all of the dependencies required in the examples.
 # http://docs.travis-ci.com/user/installing-dependencies/#Installing-Ubuntu-packages
-add-apt-repository ppa:mc3man/trusty-media -y
 apt-get update -qq
 sed 's/\(.*\)\#.*/\1/' < $base/requirements/debian | xargs apt-get install -y --fix-missing
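
Review note on the ``sed`` pipeline above: it strips ``#`` comments from ``requirements/debian`` before handing the remaining package names to ``apt-get``. A rough Python equivalent, as a sketch only, may make the behaviour easier to check. The function name ``strip_comments`` and the sample lines are made up for illustration.

```python
import re


def strip_comments(lines):
    """Mimic `sed 's/\\(.*\\)#.*/\\1/' | xargs`: drop comments, keep names."""
    packages = []
    for line in lines:
        # The greedy (.*) keeps everything before the LAST '#' on the line,
        # which matches the behaviour of the sed expression.
        without_comment = re.sub(r"(.*)#.*", r"\1", line, count=1)
        # xargs splits on whitespace and ignores empty lines.
        packages.extend(without_comment.split())
    return packages


sample = [
    "# parse word documents",
    "antiword",
    "",
    "python-dev  # needed by lxml",
]
print(strip_comments(sample))
```

Comment-only lines collapse to nothing, so section headers in the requirements file never reach the package manager.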

requirements/debian (-1)

@@ -10,7 +10,6 @@ make
 # these packages are required by python-docx, which depends on lxml
 # and requires these things
 python-dev
-python-pip
 libxml2-dev
 libxslt1-dev

requirements/freebsd (+36)

@@ -0,0 +1,36 @@
+# required packages
+audio/pulseaudio
+devel/git
+
+# these packages are required by python-docx, which depends on lxml
+# and requires these things
+lang/python38
+devel/py-pippython-pip
+textproc/libxml2
+textproc/libxslt
+
+# parse word documents
+textproc/antiword
+
+# parse rtf documents
+textproc/unrtf
+
+# parse image files
+graphics/tesseract
+graphics/jpeg-turbo
+
+# parse pdfs
+graphics/poppler
+
+# parse postscript files
+print/pstotext
+
+# parse audio files, with SpeechRecognition
+audio/flac
+
+# filetype conversion libs
+multimedia/ffmpeg
+audio/lame
+
+# convert audio files
+audio/sox

requirements/python (+10 -11)

@@ -1,14 +1,13 @@
 # This file contains all python dependencies that are required by the textract
 # package in order for it to properly work.

-argcomplete==1.10.0
-beautifulsoup4==4.8.0
-chardet==3.0.4
-docx2txt==0.8
-EbookLib==0.17.1
-extract-msg==0.23.1
-pdfminer.six==20181108
-python-pptx==0.6.18
-six==1.12.0
-SpeechRecognition==3.8.1
-xlrd==1.2.0
+argcomplete~=1.10.0
+beautifulsoup4~=4.8.0
+chardet==3.*
+docx2txt~=0.8
+extract-msg<=0.29.* #Last with python2 support
+pdfminer.six==20191110 #Last with python2 support
+python-pptx~=0.6.18
+six~=1.12.0
+SpeechRecognition~=3.8.1
+xlrd~=1.2.0
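
Review note on the switch from ``==`` to ``~=`` above: ``~=`` is the PEP 440 "compatible release" operator, so ``~=1.10.0`` accepts any ``1.10.x`` at or above ``1.10.0``, while ``~=0.8`` accepts any ``0.x`` at or above ``0.8``. The sketch below illustrates that rule for plain dotted numeric versions only. It is a simplification written for this note (no pre-releases, post-releases, or epochs), and the function names are hypothetical; real tooling should rely on pip or the ``packaging`` library.

```python
def parse_release(version):
    """Split a plain dotted version such as '1.10.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies_compatible_release(candidate, base):
    """Check `candidate ~= base` for simple numeric versions (sketch only)."""
    base_parts = parse_release(base)
    if len(base_parts) < 2:
        raise ValueError("'~=' needs at least two release segments")
    cand_parts = parse_release(candidate)

    # Pad both to the same length so '1.10' and '1.10.0' compare equal.
    width = max(len(cand_parts), len(base_parts))
    cand_padded = cand_parts + (0,) * (width - len(cand_parts))
    base_padded = base_parts + (0,) * (width - len(base_parts))

    # Rule: candidate >= base, and all but the last base segment must match.
    prefix = base_parts[:-1]
    return cand_padded >= base_padded and cand_padded[:len(prefix)] == prefix


print(satisfies_compatible_release("1.10.3", "1.10.0"))
print(satisfies_compatible_release("1.11.0", "1.10.0"))
```

Under this rule the two-segment pin ``docx2txt~=0.8`` is noticeably looser than the three-segment pins around it, since it admits every later ``0.x`` release.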

setup.cfg (+1 -2)

@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 1.6.3
+current_version = 1.6.5
 commit = True
 tag = True

@@ -20,4 +20,3 @@ search = THANKS FOR CONTRIBUTING; ADD YOUR UNRELEASED CHANGES HERE!
 replace = THANKS FOR CONTRIBUTING; ADD YOUR UNRELEASED CHANGES HERE!
     {new_version}
     -------------------
-

setup.py (+1 -1)

@@ -42,7 +42,7 @@ def parse_requirements(requirements_filename):

 setup(
     name=textract.__name__,
-    version="1.6.3",
+    version="1.6.5",
     description="extract text from any document. no muss. no fuss.",
     long_description=long_description,
     url=github_url,

tests/base.py (-12)

@@ -68,18 +68,6 @@ def get_filename(self, filename_root, default_filename_root):
             return filename
         return self.get_filename(default_filename_root, default_filename_root)

-    def download_file(self, url, filename):
-        if not os.path.exists(filename):
-
-            # stream the request to make sure it works correctly
-            # http://stackoverflow.com/a/16696317/564709
-            response = requests.get(url, stream=True)
-            with open(filename, 'wb') as stream:
-                for chunk in response.iter_content(chunk_size=1024):
-                    if chunk: # filter out keep-alive new chunks
-                        stream.write(chunk)
-                        stream.flush()
-
     @property
     def raw_text_filename(self):
         return self.get_filename(self.raw_text_filename_root,

tests/epub/raw_text.txt (-19)

@@ -1,11 +1,7 @@
-
 Epub testing
 With subtitle...
-
 Introduction
 Welcome here! All the text have ben generate with the Samuel L lorem ipsum.
-
-
 We happy?
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 We happy?
@@ -16,7 +12,6 @@ No man, I don't eat pork
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 Is she dead, yes or no?
 The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-
 We happy?
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 We happy?
@@ -27,7 +22,6 @@ No man, I don't eat pork
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 Is she dead, yes or no?
 The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-
 We happy?
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 We happy?
@@ -38,7 +32,6 @@ No man, I don't eat pork
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 Is she dead, yes or no?
 The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-
 We happy?
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 We happy?
@@ -49,18 +42,6 @@ No man, I don't eat pork
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 Is she dead, yes or no?
 The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-
-We happy?
-Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
-We happy?
-The lysine contingency - it's intended to prevent the spread of the animals is case they ever got off the island. Dr. Wu inserted a gene that makes a single faulty enzyme in protein metabolism. The animals can't manufacture the amino acid lysine. Unless they're continually supplied with lysine by us, they'll slip into a coma and die.
-Oh... what I'm gon' do?
-The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-No man, I don't eat pork
-Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
-Is she dead, yes or no?
-The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.
-
 We happy?
 Well, the way they make shows is, they make one show. That show's called a pilot. Then they show that show to the people who make shows, and on the strength of that one show they decide if they're going to make more shows. Some pilots get picked and become television programs. Some don't, become nothing. She starred in one of the ones that became nothing.
 We happy?

tests/ps/raw_text.txt (+2)

@@ -0,0 +1,2 @@
+Narrow Text
+How exciting!

tests/test_pdf.py (-13)

@@ -25,19 +25,6 @@ def test_tesseract_cli(self):
             method='tesseract',
         )

-    def test_large_pdf(self):
-        """Make sure extraction does not hang (issue #33)"""
-
-        # download the file
-        filename = os.path.join(self.get_extension_directory(), "large.pdf")
-        self.download_file(
-            "https://openknowledge.worldbank.org/bitstream/handle/10986/16091/9780821399378.pdf",
-            filename,
-        )
-
-        # make sure textract can successfully run
-        self.assertSuccessfulTextract(filename)
-
     def test_two_column(self):
         """Preserve two column layout in extraction"""
         filename = os.path.join(self.get_extension_directory(), 'two_column.pdf')

textract/__init__.py (+1 -1)

@@ -1,3 +1,3 @@
 from .parsers import process

-VERSION = "1.6.3"
+VERSION = "1.6.5"

textract/parsers/__init__.py (+2)

@@ -17,12 +17,14 @@
     ".htm": ".html",
     "": ".txt",
     ".log": ".txt",
+    ".tab": ".tsv",
 }

 # default encoding that is returned by the process method. specify it
 # here so the default is used on both the process function and also by
 # the command line interface
 DEFAULT_OUTPUT_ENCODING = 'utf_8'
+DEFAULT_ENCODING = 'utf_8'

 # filename format
 _FILENAME_SUFFIX = '_parser'
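
Review note on the new ``".tab": ".tsv"`` entry above: the dictionary maps alternative extensions onto the extension whose parser should handle them. The hunk does not show the dictionary's name or how it is consulted, so the sketch below uses a hypothetical name (``EXTENSION_SYNONYMS``) and a hypothetical lookup helper, and it assumes extensions are lower-cased before lookup.

```python
import os

# Hypothetical name; only the entries themselves appear in the diff.
EXTENSION_SYNONYMS = {
    ".htm": ".html",
    "": ".txt",
    ".log": ".txt",
    ".tab": ".tsv",
}


def canonical_extension(filename):
    """Return the extension whose parser would handle `filename` (sketch)."""
    # Assumption: matching is case-insensitive, so normalise to lower case.
    ext = os.path.splitext(filename)[1].lower()
    # Unknown extensions pass through unchanged.
    return EXTENSION_SYNONYMS.get(ext, ext)


for name in ("table.tab", "notes", "report.PDF"):
    print(name, "->", canonical_extension(name))
```

With a mapping like this, supporting ``.tab`` costs one dictionary entry instead of a second parser module.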

textract/parsers/audio.py (-2)

@@ -45,8 +45,6 @@ def extract(self, filename, method='', **kwargs):
                 speech = ''
             except sr.UnknownValueError:
                 speech = ''
-            except sr.RequestError as e:
-                speech = ''

         # add a newline, to make output cleaner
         speech += '\n'
