will try to find duplicated documents in all indices known to the ES instance on localhost:9200, that look akin to 'esindexprefix-\*' while excluding all indices starting with 'excludedindex', where documents are grouped by `fingerprint` field.
18
+
19
+
*`-a` will process all indexes known to the ES instance that match the prefix and prefixseparator.
15
20
*`-b` batch size - critical for performance ES queries might take several minutes, depending on size of your indexes
16
21
*`-f` name of field that should be unique
17
22
*`-h` displays help
18
23
*`-m` number of duplicated documents with same unique field value
19
24
*`-t` document type in ES
20
-
*`--sleep 60` time between aggregation requests (gives ES time to run GC on heap)
25
+
*`--sleep 60` time between aggregation requests (gives ES time to run GC on heap), 15 seconds seems to be enough to avoid triggering ES flood protection though.
21
26
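As a rough illustration of how `-f`, `-b` and `-m` might interact, the sketch below builds a terms aggregation body that buckets documents by the unique field and asks for up to `m` hits per bucket. This shows the general technique only and is not a copy of what `dedupe.py` sends; the helper name `build_duplicate_query` and the aggregation labels `duplicates` and `docs` are hypothetical.

```python
import json


def build_duplicate_query(field, batch_size, max_dupes):
    """Build a search body that buckets documents by `field` (hypothetical helper)."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the search hits
        "aggs": {
            "duplicates": {
                "terms": {
                    "field": field,
                    "min_doc_count": 2,  # keep only values that occur at least twice
                    "size": batch_size,  # -b: at most this many buckets per request
                },
                "aggs": {
                    # -m: at most this many documents returned per bucket
                    "docs": {"top_hits": {"size": max_dupes}}
                },
            }
        },
    }


query = build_duplicate_query("fingerprint", batch_size=10, max_dupes=100)
print(json.dumps(query, indent=2))
```

With these example values a single response carries at most 10 * 100 = 1000 documents, which is why raising either parameter increases the load on the cluster.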
22
27
WARNING: Running huge bulk operations on ES cluster might influence performance of your cluster or even crash some nodes if heap
23
28
is not large enough. Increment `-b` and `-m` parameters with caution! ES returns at most `b * m` documents, eventually you might hit
@@ -28,7 +33,7 @@ A log file containing documents with unique fields is written into `/tmp/es_dedu
28
33
By design ES aggregate queries are not necessarily precise. Depending on your cluster setup, some documents won't be deleted due to
29
34
inaccurate shard statistics.
30
35
31
-
Running `$ python3 dedupe.py --check_log /tmp/es_dedupe.log --noop` will query for documents found by aggregate and queries check whether were actually
36
+
Running `$ python dedupe.py --check_log /tmp/es_dedupe.log --noop` will query for documents found by aggregate and queries check whether were actually
32
37
deleted.
33
38
```
34
39
== Starting ES deduplicator....
@@ -56,11 +61,21 @@ Deleted 276673 duplicates, in total 609802. Batch processed in 0:00:08.487847, r
56
61
```
57
62
58
63
## Requirements
64
+
For the installation use the tools provided by your operating system.
65
+
66
+
On Linux this can be one of the following: yum, dnf, apt, yast, emerge, ..
67
+
```
68
+
* Install python (2 or 3, both will work)
69
+
* Install python*ujson and python*requests for the fitting python version
70
+
```
71
+
72
+
On Windows you are pretty much on your own, but fear not, you can do the following ;-)
59
73
```
60
-
apt install python3-dev
61
-
pip3 install -r requirements.txt
74
+
* Download and install a python version from https://www.python.org/ .
75
+
* Open a console terminal and head to the repository copy of es-deduplicator, then run:
76
+
pip install -r requirements.txt
62
77
```
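To confirm that the two dependencies named above are importable after installation, a small check along these lines may help. It is a sketch that assumes Python 3.4 or newer (for `importlib.util.find_spec`); the helper name `missing_modules` is hypothetical and not part of the repository.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["ujson", "requests"])
print("missing modules:", ", ".join(missing) if missing else "none")
```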
63
78
64
79
## History
65
80
66
-
Originaly written in bash which performed terribly due to slow JSON processing with pipes and `jq`. Python with `ujson` seems to be better fitted for this task.
81
+
Originally written in bash which performed terribly due to slow JSON processing with pipes and `jq`. Python with `ujson` seems to be better fitted for this task.
dedupe.py
10 additions & 4 deletions
@@ -33,6 +33,9 @@
33
33
34
34
# out current scriptname (minus the path)
35
35
ourname=os.path.basename(__file__)
36
+
# At least Elasticsearch 6.2.2 does not support application/x-ndjson, but wants to enforce setting an explicit Content-Type. As to why Elastic wouldn't support this, I have no idea.
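The comment above concerns the `Content-Type` header sent with newline-delimited request bodies. A minimal sketch of what that could look like for a bulk delete follows, assuming the explicit `application/json` header is what the comment settles on. The helper `build_bulk_delete` and the sample index, type and id values are hypothetical, and no request is sent.

```python
import json


def build_bulk_delete(docs):
    """Return (headers, body) for a bulk delete request (hypothetical helper)."""
    lines = [
        json.dumps(
            {"delete": {"_index": d["_index"], "_type": d["_type"], "_id": d["_id"]}}
        )
        for d in docs
    ]
    # One JSON action per line; the bulk body has to end with a newline.
    body = "\n".join(lines) + "\n"
    # Assumption taken from the comment: send an explicit JSON Content-Type.
    headers = {"Content-Type": "application/json"}
    return headers, body


# Sample values are made up for illustration.
docs = [
    {"_index": "esindexprefix-1", "_type": "nginx.access", "_id": "a1"},
    {"_index": "esindexprefix-1", "_type": "nginx.access", "_id": "b2"},
]
headers, body = build_bulk_delete(docs)
print(headers)
print(body, end="")
```

The returned pair could then be passed to an HTTP client such as `requests.post(url, data=body, headers=headers)`, with `url` pointing at the cluster's `_bulk` endpoint.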