Commit 922a309

Add Dockerfile
1 parent 3ad6b3b commit 922a309

4 files changed

Lines changed: 56 additions & 2 deletions

Dockerfile

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+FROM debian:9-slim as builder
+ENV LANG C.UTF-8
+RUN apt-get update && apt-get install --no-install-recommends -y python3-pip python3-setuptools python3-dev make gcc\
+    && apt-get clean && rm -rf /var/lib/apt/lists/*
+ADD requirements.txt /tmp/
+RUN pip3 install wheel && pip3 install -r /tmp/requirements.txt
+
+FROM debian:9-slim
+ENV LANG C.UTF-8
+
+RUN apt-get update \
+    && apt-get install --no-install-recommends -y python3\
+    && apt-get clean && rm -rf /var/lib/apt/lists/*
+
+COPY --from=builder /usr/local/lib/python3.5/ /usr/local/lib/python3.5/
+#COPY --from=builder /usr/local/lib/python3.5/site-packages/ /usr/local/lib/python3.5/site-packages/
+
+RUN mkdir /app
+ADD dedupe.py /app
+ADD entrypoint.sh /app
+WORKDIR /app
+ENTRYPOINT /app/entrypoint.sh
+

Makefile

Lines changed: 22 additions & 0 deletions
@@ -1,5 +1,27 @@
+NAME ?=es-dedupe
+REGISTRY ?= deric
+
 all: clean test
 
+build:
+	docker pull `head -n 1 Dockerfile | awk '{ print $$2 }'`
+	docker build -t $(NAME) .
+
+define RELEASE
+	git tag "v$(1)"
+	git push
+	git push --tags
+	docker tag $(NAME) $(REGISTRY)/$(NAME):v$(1)
+	docker tag $(NAME) $(REGISTRY)/$(NAME):latest
+	docker push $(REGISTRY)/$(NAME)
+endef
+
+shell: build
+	docker run --entrypoint /bin/bash -it $(NAME)
+
+release: build
+	$(call RELEASE,$(v))
+
 dev:
 	pip install -r requirements.txt -r requirements-dev.txt
 
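The `build` target pulls whatever image is named on the first line of the Dockerfile before building. As a rough illustration of what that `head | awk` pipeline evaluates to, the sketch below runs it against a throwaway file. The `base_image` function name and the temporary directory are hypothetical and not part of the commit. Inside a Makefile recipe the awk field has to be written `$$2`, while in a plain shell it is `$2`.

```shell
#!/bin/sh
# Print the second word of a Dockerfile's first line, i.e. the image
# named by the first FROM instruction (hypothetical helper, not in the commit).
base_image() {
    head -n 1 "$1" | awk '{ print $2 }'
}

# Demo against a temporary Dockerfile whose first line matches this commit's.
demo_dir=$(mktemp -d)
printf 'FROM debian:9-slim as builder\nENV LANG C.UTF-8\n' > "$demo_dir/Dockerfile"
image=$(base_image "$demo_dir/Dockerfile")
echo "$image"   # prints: debian:9-slim
rm -rf "$demo_dir"
```

This approach presumably relies on `FROM` being the very first line; a leading comment or `ARG` line would make the pipeline return something other than the image name.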

README.md

Lines changed: 9 additions & 2 deletions
@@ -1,11 +1,18 @@
-# ES deduplicator
+# ES-dedupe
 
 A tool for removing duplicated documents that are grouped by some unique field (e.g. `--field Uuid`). The removal process consists of two phases:
 
 1. An aggregate query finds documents that share the same `field` value and occur at least twice. One copy of each such document is left in ES; all others are deleted via the Bulk API (almost all, usually; there's always some catch). We wait for an index update after each `DELETE` operation. Processed documents are logged to `/tmp/es_dedupe.log`.
 2. Unfortunately, aggregate queries are not necessarily exact. Based on the `/tmp/es_dedupe.log` logfile, we query for each `field` value and DELETE document copies on other shards. Depending on the number of nodes and shards in the cluster, there might still be documents that the aggregate query didn't return. To disable the 2nd step, use the `--no-chck` flag.
 
-Usage:
+## Docker
+
+Running from Docker:
+```
+docker run deric/es-dedupe -H localhost -P 9200 -i exact-index-name -f Uuid
+```
+
+## Usage
 ```
 python -u dedupe.py -H localhost -P 9200 -i exact-index-name -f Uuid > es_dedupe.log
 ```
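The first phase described in the README (group documents by a field value, keep one copy, delete the rest) can be pictured with a small local analogy. The sketch below does not talk to Elasticsearch and is not how `dedupe.py` is implemented. It only applies the same keep-one rule to a text file of `doc_id field_value` pairs, where the file layout, the sample data, and the `duplicates_to_delete` name are all assumptions made for illustration.

```shell
#!/bin/sh
# Input: one "doc_id field_value" pair per line (hypothetical sample format).
# Output: the ids of every copy after the first one for each field value,
# i.e. the documents that a keep-one-copy policy would delete.
duplicates_to_delete() {
    # seen[$2]++ is 0 (false) the first time a value appears, so only
    # the second and later copies are printed.
    awk 'seen[$2]++ { print $1 }' "$1"
}

docs=$(mktemp)
printf '%s\n' 'a1 u-1' 'a2 u-2' 'a3 u-1' 'a4 u-1' 'a5 u-3' > "$docs"
to_delete=$(duplicates_to_delete "$docs")
echo "$to_delete"   # prints a3 and a4, one per line
rm -f "$docs"
```

Values that occur only once (`u-2`, `u-3`) never produce output, which mirrors the "at least 2 occurrences" condition in the README.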

entrypoint.sh

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+#!/bin/bash
+python3 dedupe.py $@
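A note on the unquoted `$@` in `entrypoint.sh`: the shell re-splits each forwarded argument on whitespace, so a value containing a space arrives as two arguments. The sketch below shows the difference between `$@` and `"$@"`. The helper functions and the `my index` value are hypothetical and only serve the demonstration.

```shell
#!/bin/bash
# Report how many arguments a command would receive.
count_args() { echo "$#"; }

# Forward arguments the way entrypoint.sh does: unquoted, so they are re-split.
forward_unquoted() { count_args $@; }

# Forward arguments with "$@", which keeps each original argument as one word.
forward_quoted() { count_args "$@"; }

unquoted=$(forward_unquoted -i "my index" -f Uuid)
quoted=$(forward_quoted -i "my index" -f Uuid)
echo "unquoted: $unquoted, quoted: $quoted"   # prints: unquoted: 5, quoted: 4
```

For the arguments shown in the README (`-H localhost -P 9200 -i exact-index-name -f Uuid`) both forms behave the same, so this only matters if a value ever contains whitespace or glob characters.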
