Commit c008f84

Merge branch 'release/1.4.0'
2 parents 6c9636c + 24d0596 commit c008f84

14 files changed, 330 additions and 180 deletions

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -6,4 +6,4 @@ MANIFEST
 build/
 dist/
 tika_app.egg-info/
-venv/
+venv*/
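
Note on the `.gitignore` change: the new pattern `venv*/` ignores every directory whose name starts with `venv`, where the old one only covered `venv/` itself. The sketch below approximates that with Python's `fnmatch`. Real gitignore matching has further rules (anchoring, negation, nested paths), and the directory names are invented for illustration.

```python
from fnmatch import fnmatch

# Hypothetical directory names, not taken from the repository.
candidates = ["venv", "venv2", "venv-py36", "env", "myvenv"]

old_pattern = "venv"   # old rule `venv/` (the trailing slash only means "directory")
new_pattern = "venv*"  # new rule `venv*/`

old_matches = [name for name in candidates if fnmatch(name, old_pattern)]
new_matches = [name for name in candidates if fnmatch(name, new_pattern)]

print(old_matches)
print(new_matches)
```

Names such as `myvenv` stay unignored because the wildcard sits only at the end of the pattern.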

.travis.yml

Lines changed: 12 additions & 2 deletions
@@ -9,7 +9,7 @@ python:
99
- "3.6"
1010

1111
env:
12-
- TIKA_VER="1.16"
12+
- TIKA_VER="1.18"
1313
TIKA_APP_JAR=/tmp/tika-app-${TIKA_VER}.jar
1414

1515
before_script:
@@ -37,11 +37,21 @@ script:
3737
- python -m tikapp -v
3838
- python -m tikapp -h
3939
- python -m tikapp -a -f tests/files/test.zip
40+
- python -m tikapp -a -k < tests/files/test.zip
41+
42+
deploy:
43+
provider: pypi
44+
user: fmantuano
45+
password:
46+
secure: "p7l2yeLecvW/j1zs1XvxUbqT8f1ATIuao8TmQ4vJMCKJkhsRComHHkoHs5gYSdhj5QsuTSgcrsxV0PXfdEtyCAndYaZGrySNxlNBIShX0WArDfJCTcvnMWgv+4KoPbGwH27oxmo+icqCiCML4y07aRS8IKK1L/YxodbDTnCGK2dWiYm7VZS1oMFBqZiMxvRde3nE82nqlP6U3lvTE8HFNLlUIUa4hPAOXOuL/tX3L3alpStiBqwQMcHLOgdJuU47MXugfNNa3/u/mYeq4FGY85qcYOi/nXLzD01yAl6saeQ8FXtKDIIgnDVotstyMP31t1MU2yC7fxAj3XgHMbyRC9mCRJbgveHIEbnXnka5xl2mhEKa5e+mea9d3w8XWbi9ftx661w8x1V25v6+RH40WDCszbZ4K3cneWNIC5lRLlMLZ7JUB+L/G72dsBOO4BvCfQeo04WyO7GD0klnWxTNHo28ryOQ0e1Z/v1ocnkF/3ZwbgFjp9/I+rPjwNlHb/tzgr9hyD8BshA1nUE4ZOi+EnZNcuBotysia9tJ9EncjWXv0inUj9VenNqYROrF+xaDnKQRjAQr51CTz4uLA5FqauwNNmtgWoKSZVBwjCdSBnWGLYx059bAzkdhgP4sfxvfzhNxMVDhAucBGdXPeecxrzzfHBVpHkuDJXWuBPT7vwQ="
47+
on:
48+
tags: true
49+
branch: master
4050

4151
after_success:
4252
coveralls
4353

4454
notifications:
4555
email: false
4656
slack:
47-
secure: gLkkwrBjb0jCuiqMCRM5hXPzYH+LNA5UpMcvjULKtvywVnFVren0UYVyp6h81eQhsvIZecghESTitwRwL7Ttl9/TEmiJ1fuD/pjUbRtUDQourm3zpPAlxppVaj61Hfnln6MyPW+1QIlfXpOJRl+k7RrLKTm0FxXP6TS9t9t+p97ZIVz/iOVGwYWNgeSIdxy6sdzISkMx3i3bn7tr/ILruc18wqbaBs7GzQpgjaFl0S+7PDv4vBmj/9dTYxku6G+nSuzz+Do+BXAMCdcSEn4O4HT+tYyxmkCgRqn7zM8KtAXQwNkuSdjTOZo3Pn917jZibrEj8SaqbfXW1Q2BSN30zTk6p0Y22DF26qbYcy1XX6VDL52oy9GCGth1vNGNkLh/rFQHhZfCXvaSz1jws9vrtbEtHPXaYqfA+p/Xi01N1ewLMaL8yWA9NnCNF9r178bjiKtb7TAWu7B2o2I1j3FqfvtXsQsCQvNGTUc0LFP6i3geS056J8jrtz21IXvUmAJLHY5Qx9j88/lwA2HnhJquY7pFUztjXTgl2JBsBeGgzyoERnhq75iWQATteqbBkW1t5jkivw8g5QNbwln10PQij0SvvV9Cr02W7yX2nXt77/YeHR7ddwxxTNK8xUcjJJGUdB3AAGq1R92G/rEs4WfTniS8wG0CksOqyZ2BF1Q0gdk=
57+
secure: "gLkkwrBjb0jCuiqMCRM5hXPzYH+LNA5UpMcvjULKtvywVnFVren0UYVyp6h81eQhsvIZecghESTitwRwL7Ttl9/TEmiJ1fuD/pjUbRtUDQourm3zpPAlxppVaj61Hfnln6MyPW+1QIlfXpOJRl+k7RrLKTm0FxXP6TS9t9t+p97ZIVz/iOVGwYWNgeSIdxy6sdzISkMx3i3bn7tr/ILruc18wqbaBs7GzQpgjaFl0S+7PDv4vBmj/9dTYxku6G+nSuzz+Do+BXAMCdcSEn4O4HT+tYyxmkCgRqn7zM8KtAXQwNkuSdjTOZo3Pn917jZibrEj8SaqbfXW1Q2BSN30zTk6p0Y22DF26qbYcy1XX6VDL52oy9GCGth1vNGNkLh/rFQHhZfCXvaSz1jws9vrtbEtHPXaYqfA+p/Xi01N1ewLMaL8yWA9NnCNF9r178bjiKtb7TAWu7B2o2I1j3FqfvtXsQsCQvNGTUc0LFP6i3geS056J8jrtz21IXvUmAJLHY5Qx9j88/lwA2HnhJquY7pFUztjXTgl2JBsBeGgzyoERnhq75iWQATteqbBkW1t5jkivw8g5QNbwln10PQij0SvvV9Cr02W7yX2nXt77/YeHR7ddwxxTNK8xUcjJJGUdB3AAGq1R92G/rEs4WfTniS8wG0CksOqyZ2BF1Q0gdk="

README.md

Lines changed: 26 additions & 5 deletions
@@ -8,6 +8,12 @@
 ## Overview

 tika-app-python is a wrapper for [Apache Tika App](https://tika.apache.org/).
+With this library you can analyze:
+- file on disk
+- payload in base64
+- file object (like standard input)
+
+To use file object function you should use Apache Tika version >= 1.17.

 ### Apache 2 Open Source License
 tika-app-python can be downloaded, used, and modified free of charge. It is available under the Apache 2 license.
@@ -48,7 +54,7 @@ Import `TikaApp` class:
 ```
 from tikapp import TikaApp

-tika_client = TikaApp(file_jar="/opt/tika/tika-app-1.15.jar")
+tika_client = TikaApp(file_jar="/opt/tika/tika-app-1.18.jar")
 ```

 For get **content type**:
@@ -75,7 +81,7 @@ For detect **only content**:
 tika_client.extract_only_content("your_file")
 ```

-If you want to use payload in base64, you can use the same methods with `payload` argument:
+You can analyze payload in base64 with the same methods, but passing `payload` argument:

 ```
 tika_client.detect_content_type(payload="base64_payload")
@@ -84,6 +90,14 @@ tika_client.extract_all_content(payload="base64_payload")
 tika_client.extract_only_content(payload="base64_payload")
 ```

+or you can analyze file object (like standard input) with the same methods, but passing `objectInput` argument:
+
+```
+tika_client.detect_language(objectInput="objectInput")
+tika_client.extract_all_content(objectInput="objectInput")
+tika_client.extract_only_content(objectInput="objectInput")
+```
+
 ## Usage from command-line

 If you installed tika-app-python with `pip` or `setup.py` you can use it with command-line.
@@ -97,8 +111,8 @@ The last one overwrite all the others.
 These are all swithes:

 ```
-usage: tikapp [-h] (-f FILE | -p PAYLOAD) [-j JAR] [-d] [-t] [-l] [-a]
-              [-v]
+usage: tikapp [-h] (-f FILE | -p PAYLOAD | -k) [-j JAR] [-d] [-t] [-l]
+              [-a] [-v]

 Wrapper for Apache Tika App.

@@ -107,6 +121,7 @@ optional arguments:
   -f FILE, --file FILE  File to submit (default: None)
   -p PAYLOAD, --payload PAYLOAD
                         Base64 payload to submit (default: None)
+  -k, --stdin           Enable parsing from stdin (default: False)
   -j JAR, --jar JAR     Apache Tika app JAR (default: None)
   -d, --detect          Detect document type (default: False)
   -t, --text            Output plain text content (default: False)
@@ -116,12 +131,18 @@ optional arguments:
   -v, --version         show program's version number and exit
 ```

-Example:
+Example from file on disk:

 ```shell
 $ tikapp -f example_file -a
 ```

+Example from standard input
+
+```shell
+$ tikapp -a -k < example_file
+```
+
 ## Performance tests

 These are the results of performance tests in [tests](https://github.com/fedelemantuano/tika-app-python/tree/develop/tests) folder:
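
Note on the README.md hunks: the `payload` and `objectInput` arguments can be exercised against the same file, as sketched below. Only `extract_only_content`, `payload` and `objectInput` come from the README. The helper names are hypothetical, and it is an assumption that `payload` takes the base64 data as a text string and that `objectInput` accepts a file opened in binary mode.

```python
import base64


def to_base64_payload(raw_bytes):
    """Encode raw bytes as a base64 text string."""
    return base64.b64encode(raw_bytes).decode("ascii")


def extract_both_ways(tika_client, path):
    """Run extract_only_content on one file via payload and via file object.

    `tika_client` is assumed to be a TikaApp instance (hypothetical helper,
    not part of the library).
    """
    # Variant 1: read the file and submit it as a base64 payload.
    with open(path, "rb") as handle:
        payload = to_base64_payload(handle.read())
    from_payload = tika_client.extract_only_content(payload=payload)

    # Variant 2: hand over the open file object itself
    # (the README asks for Apache Tika >= 1.17 for this mode).
    with open(path, "rb") as handle:
        from_object = tika_client.extract_only_content(objectInput=handle)

    return from_payload, from_object
```

With a real client this would be called as `extract_both_ways(TikaApp(file_jar="/opt/tika/tika-app-1.18.jar"), "your_file")`.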

README.rst

Lines changed: 70 additions & 56 deletions
@@ -1,4 +1,8 @@
1-
|PyPI version| |Build Status| |Coverage Status| |BCH compliance|
1+
`PyPI version <https://badge.fury.io/py/tika-app>`__ `Build
2+
Status <https://travis-ci.org/fedelemantuano/tika-app-python>`__
3+
`Coverage
4+
Status <https://coveralls.io/github/fedelemantuano/tika-app-python?branch=master>`__
5+
`BCH compliance <https://bettercodehub.com/>`__
26

37
tika-app-python
48
===============
@@ -7,7 +11,10 @@ Overview
711
--------
812

913
tika-app-python is a wrapper for `Apache Tika
10-
App <https://tika.apache.org/>`__.
14+
App <https://tika.apache.org/>`__. With this library you can analyze: -
15+
file on disk - payload in base64 - file object (like standard input)
16+
17+
To use file object function you should use Apache Tika version >= 1.17.
1118

1219
Apache 2 Open Source License
1320
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -31,21 +38,21 @@ Clone repository
3138

3239
::
3340

34-
git clone https://github.com/fedelemantuano/tika-app-python.git
41+
git clone https://github.com/fedelemantuano/tika-app-python.git
3542

3643
and install tika-app-python with ``setup.py``:
3744

3845
::
3946

40-
cd tika-app-python
47+
cd tika-app-python
4148

42-
python setup.py install
49+
python setup.py install
4350

4451
or use ``pip``:
4552

4653
::
4754

48-
pip install tika-app
55+
pip install tika-app
4956

5057
Usage in a project
5158
------------------
@@ -54,43 +61,52 @@ Import ``TikaApp`` class:
5461

5562
::
5663

57-
from tikapp import TikaApp
64+
from tikapp import TikaApp
5865

59-
tika_client = TikaApp(file_jar="/opt/tika/tika-app-1.15.jar")
66+
tika_client = TikaApp(file_jar="/opt/tika/tika-app-1.18.jar")
6067

6168
For get **content type**:
6269

6370
::
6471

65-
tika_client.detect_content_type("your_file")
72+
tika_client.detect_content_type("your_file")
6673

6774
For detect **language**:
6875

6976
::
7077

71-
tika_client.detect_language("your_file")
78+
tika_client.detect_language("your_file")
7279

7380
For detect **all metadata and content**:
7481

7582
::
7683

77-
tika_client.extract_all_content("your_file")
84+
tika_client.extract_all_content("your_file")
7885

7986
For detect **only content**:
8087

8188
::
8289

83-
tika_client.extract_only_content("your_file")
90+
tika_client.extract_only_content("your_file")
8491

85-
If you want to use payload in base64, you can use the same methods with
92+
You can analyze payload in base64 with the same methods, but passing
8693
``payload`` argument:
8794

8895
::
8996

90-
tika_client.detect_content_type(payload="base64_payload")
91-
tika_client.detect_language(payload="base64_payload")
92-
tika_client.extract_all_content(payload="base64_payload")
93-
tika_client.extract_only_content(payload="base64_payload")
97+
tika_client.detect_content_type(payload="base64_payload")
98+
tika_client.detect_language(payload="base64_payload")
99+
tika_client.extract_all_content(payload="base64_payload")
100+
tika_client.extract_only_content(payload="base64_payload")
101+
102+
or you can analyze file object (like standard input) with the same
103+
methods, but passing ``objectInput`` argument:
104+
105+
::
106+
107+
tika_client.detect_language(objectInput="objectInput")
108+
tika_client.extract_all_content(objectInput="objectInput")
109+
tika_client.extract_only_content(objectInput="objectInput")
94110

95111
Usage from command-line
96112
-----------------------
@@ -107,29 +123,36 @@ These are all swithes:
107123

108124
::
109125

110-
usage: tikapp [-h] (-f FILE | -p PAYLOAD) [-j JAR] [-d] [-t] [-l] [-a]
111-
[-v]
126+
usage: tikapp [-h] (-f FILE | -p PAYLOAD | -k) [-j JAR] [-d] [-t] [-l]
127+
[-a] [-v]
112128

113-
Wrapper for Apache Tika App.
129+
Wrapper for Apache Tika App.
130+
131+
optional arguments:
132+
-h, --help show this help message and exit
133+
-f FILE, --file FILE File to submit (default: None)
134+
-p PAYLOAD, --payload PAYLOAD
135+
Base64 payload to submit (default: None)
136+
-k, --stdin Enable parsing from stdin (default: False)
137+
-j JAR, --jar JAR Apache Tika app JAR (default: None)
138+
-d, --detect Detect document type (default: False)
139+
-t, --text Output plain text content (default: False)
140+
-l, --language Output only language (default: False)
141+
-a, --all Output metadata and content from all embedded files
142+
(default: False)
143+
-v, --version show program's version number and exit
144+
145+
Example from file on disk:
146+
147+
.. code:: shell
114148
115-
optional arguments:
116-
-h, --help show this help message and exit
117-
-f FILE, --file FILE File to submit (default: None)
118-
-p PAYLOAD, --payload PAYLOAD
119-
Base64 payload to submit (default: None)
120-
-j JAR, --jar JAR Apache Tika app JAR (default: None)
121-
-d, --detect Detect document type (default: False)
122-
-t, --text Output plain text content (default: False)
123-
-l, --language Output only language (default: False)
124-
-a, --all Output metadata and content from all embedded files
125-
(default: False)
126-
-v, --version show program's version number and exit
149+
$ tikapp -f example_file -a
127150
128-
Example:
151+
Example from standard input
129152

130153
.. code:: shell
131154
132-
$ tikapp -f example_file -a
155+
$ tikapp -a -k < example_file
133156
134157
Performance tests
135158
-----------------
@@ -140,25 +163,16 @@ folder:
140163

141164
::
142165

143-
(Python 2)
144-
tika_content_type() 0.704840 sec
145-
tika_detect_language() 1.592066 sec
146-
magic_content_type() 0.000215 sec
147-
tika_extract_all_content() 0.816366 sec
148-
tika_extract_only_content() 0.788667 sec
149-
150-
(Python 3)
151-
tika_content_type() 0.698357 sec
152-
tika_detect_language() 1.593452 sec
153-
magic_content_type() 0.000226 sec
154-
tika_extract_all_content() 0.785915 sec
155-
tika_extract_only_content() 0.766517 sec
156-
157-
.. |PyPI version| image:: https://badge.fury.io/py/tika-app.svg
158-
:target: https://badge.fury.io/py/tika-app
159-
.. |Build Status| image:: https://travis-ci.org/fedelemantuano/tika-app-python.svg?branch=master
160-
:target: https://travis-ci.org/fedelemantuano/tika-app-python
161-
.. |Coverage Status| image:: https://coveralls.io/repos/github/fedelemantuano/tika-app-python/badge.svg?branch=master
162-
:target: https://coveralls.io/github/fedelemantuano/tika-app-python?branch=master
163-
.. |BCH compliance| image:: https://bettercodehub.com/edge/badge/fedelemantuano/tika-app-python?branch=develop
164-
:target: https://bettercodehub.com/
166+
(Python 2)
167+
tika_content_type() 0.704840 sec
168+
tika_detect_language() 1.592066 sec
169+
magic_content_type() 0.000215 sec
170+
tika_extract_all_content() 0.816366 sec
171+
tika_extract_only_content() 0.788667 sec
172+
173+
(Python 3)
174+
tika_content_type() 0.698357 sec
175+
tika_detect_language() 1.593452 sec
176+
magic_content_type() 0.000226 sec
177+
tika_extract_all_content() 0.785915 sec
178+
tika_extract_only_content() 0.766517 sec
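
Note on the new usage line `(-f FILE | -p PAYLOAD | -k)`: that shape is what `argparse` prints for a required mutually exclusive group. The following is a minimal sketch of such a parser, not the project's actual `tikapp` entry point. Option names and help strings are copied from the README, everything else is an assumption, and only a subset of the switches is reproduced.

```python
import argparse


def build_parser():
    """Build a parser with one required, mutually exclusive input source."""
    parser = argparse.ArgumentParser(
        prog="tikapp",
        description="Wrapper for Apache Tika App.",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    # Exactly one of -f, -p, -k must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-f", "--file", help="File to submit")
    source.add_argument("-p", "--payload", help="Base64 payload to submit")
    source.add_argument("-k", "--stdin", action="store_true",
                        help="Enable parsing from stdin")

    parser.add_argument("-a", "--all", action="store_true",
                        help="Output metadata and content from all embedded files")
    return parser


# Mirrors `tikapp -a -k < example_file`, without reading stdin here.
args = build_parser().parse_args(["-a", "-k"])
print(args.stdin, args.file, args.all)
```

Passing two sources at once, for example `-f example_file -k`, makes `argparse` exit with a usage error.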

requirements.txt

Lines changed: 0 additions & 1 deletion
@@ -1,4 +1,3 @@
-chainmap
 mail-parser>=3
 python-magic
 simplejson

tests/context.py

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+# -*- coding: utf-8 -*-
+
+import sys
+import os
+sys.path.insert(0, os.path.abspath(
+    os.path.join(os.path.dirname(__file__), '..')))
+
+from tikapp import TikaApp
+from tikapp.exceptions import *
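
Note on `tests/context.py`: the file puts the repository root at the front of `sys.path`, so tests import the checked-out `tikapp` package and not an installed copy. Below is a small sketch of the same path arithmetic. The function name and the example path are hypothetical.

```python
import os
import sys


def prepend_parent_dir(module_file):
    """Insert the parent of module_file's directory at the front of sys.path."""
    root = os.path.abspath(
        os.path.join(os.path.dirname(module_file), ".."))
    sys.path.insert(0, root)
    return root


# Hypothetical location of a helper inside a checkout.
root = prepend_parent_dir(os.path.join("project", "tests", "context.py"))
print(root)
```

Using `insert(0, ...)` gives the local checkout priority over any other copy on the path, which `append` would not.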

tests/files/pdf1.pdf

27.6 KB
Binary file not shown.

tests/performance.py

Lines changed: 3 additions & 6 deletions
@@ -20,21 +20,18 @@
 from __future__ import unicode_literals
 import magic
 import os
-import sys
 import timeit

-profiling_path = os.path.realpath(os.path.dirname(__file__))
-root = os.path.join(profiling_path, '..')
-sys.path.append(root)
-from tikapp import TikaApp
+from context import TikaApp

+profiling_path = os.path.realpath(os.path.dirname(__file__))
 test_zip = os.path.join(profiling_path, "files", "lorem_ipsum.txt.zip")
 test_txt = os.path.join(profiling_path, "files", "lorem_ipsum.txt")

 try:
     TIKA_APP_JAR = os.environ["TIKA_APP_JAR"]
 except KeyError:
-    TIKA_APP_JAR = "/opt/tika/tika-app-1.15.jar"
+    TIKA_APP_JAR = "/opt/tika/tika-app-1.18.jar"


 def tika_content_type():
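
Note on the `TIKA_APP_JAR` fallback: the `try`/`except KeyError` block in this hunk behaves like a dictionary lookup with a default. The sketch below shows both forms side by side. The function names are hypothetical, and the `environ` parameter exists only so the behaviour can be checked without touching the real environment.

```python
import os

DEFAULT_JAR = "/opt/tika/tika-app-1.18.jar"


def resolve_jar(environ=os.environ):
    """Same shape as the diff: explicit KeyError fallback."""
    try:
        return environ["TIKA_APP_JAR"]
    except KeyError:
        return DEFAULT_JAR


def resolve_jar_short(environ=os.environ):
    """Equivalent lookup using a default value."""
    return environ.get("TIKA_APP_JAR", DEFAULT_JAR)
```

Both return the environment value when `TIKA_APP_JAR` is set and the default path otherwise.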
