Commit 6c2502b

Observability improvements, Docker optimizations, and dependency updates (#121)
## Summary

- **Observability**: implements `JsonFormatter` for structured JSON logs, eliminates duplicate log entries, and consolidates tracebacks into a single field (compatible with Grafana Loki)
- **Docker**: optimizes images (CPU-only torch, separate dev dependencies, removal of unnecessary layers, multi-arch amd64+arm64 with remote cache)
- **Dependencies**: updates Python to 3.14.4 and all packages to their latest stable versions
- **CI/CD**: adds metadata `build-args` (BUILD_DATE, VCS_REF, VERSION, TIKA_VERSION), adds a `cicd/build` trigger via PR comment, and fixes the hardcoded registry in the Tika workflow
- **Fixes**: SyntaxWarning in a regex (`al_associacao_municipios.py`), ruff updated to v0.15.10

## Changes by area

### Logs and observability (`monitoring/`, `data_extraction/`, `tasks/`)

- `JsonFormatter`: each log entry is one JSON line with all `extra={}` fields preserved
- Tracebacks emitted as a `traceback` field (single string) instead of fragmented lines
- Removes duplicate `logging.error()` calls that preceded `log_tika_error()`
- Replaces `logging.exception(e)` with `exc_info=True` on the existing `logging.error()`
- Suppresses INFO logs from `opensearchpy` and `urllib3` (too verbose)
- `log_tika_request`/`log_tika_response` demoted to DEBUG

### Docker (`Dockerfile.base`, `Dockerfile`, `Dockerfile_apache_tika`, `Makefile`)

- `torch` installed with `--index-url .../whl/cpu` before the other packages (~1.5 GB saved)
- `black`, `coverage`, `ruff` moved to `requirements-dev.txt`
- `curl` removed from the builder stage; OCI labels only in the runtime stage
- Package `tests/` directories removed before the COPY into the runtime stage
- `TIKA_VERSION` parameterized via `ARG` in the Dockerfile and a variable in the Makefile
- `REGISTRY` derived automatically from the git remote URL
- New multi-arch targets in the Makefile with remote cache

### CI/CD (`.github/workflows/`)

- `BUILD_DATE`, `VCS_REF`, `VERSION` passed as `build-args` in all workflows
- `TIKA_VERSION` passed as a `build-arg` with a configurable input in `workflow_dispatch`
- `issue_comment` trigger with `cicd/build` to start builds via PR comments
- `build_container_image.yaml`: new trigger on push to `main` for changes to the `Dockerfile`
- `build_apache_tika.yaml`: push restricted to `main`, dynamic registry via `github.repository_owner`

## Test checklist

- [ ] `cicd/build`: trigger a build via a comment on this PR
- [ ] Verify that logs appear as valid JSON (`traceback` field present on errors)
- [ ] Confirm the base image size with `docker image ls` after the build
- [ ] Test text extraction with Tika: `make retest-tika`
- [ ] Verify that `make help-build` shows the correct REGISTRY

🤖 Generated with [Claude Code](https://claude.com/claude-code)
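
The commit message describes `JsonFormatter` but its source is not shown on this page. The following is only a minimal sketch of how such a formatter could work, assuming the standard-library `logging` module; the output field names other than `traceback` and the `gazette_id` extra used in the usage note are hypothetical and not taken from the commit.

```python
import json
import logging

# Attribute names found on every LogRecord; anything else on a record
# was supplied by the caller through `extra={...}`.
_STANDARD_ATTRS = set(
    logging.LogRecord("", 0, "", 0, "", (), None).__dict__
) | {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    """Illustrative sketch: render each log record as one JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Preserve caller-supplied `extra={}` fields as top-level keys.
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        # Collapse the traceback into a single string field so that log
        # collectors do not split it into one entry per line.
        if record.exc_info:
            payload["traceback"] = self.formatException(record.exc_info)
        # `default=str` keeps non-serializable extras from raising.
        return json.dumps(payload, default=str, ensure_ascii=False)
```

With this sketch attached to a handler, a call such as `logging.error("extraction failed", extra={"gazette_id": 7}, exc_info=True)` inside an `except` block would produce one JSON line carrying both `gazette_id` and `traceback`, because `json.dumps` escapes the newlines inside the traceback string.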
2 parents b4774b5 + 247dbd4 commit 6c2502b

25 files changed

Lines changed: 791 additions & 290 deletions

.dockerignore

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+.git
+.github
+__pycache__
+.ruff_cache
+.claude
+tests
+debug
+docs
+models
+*.md
+*.pyc
+*.pyo
+.pytest_cache
+.coverage
+.env*
+.venv
+venv
+*.tar
+*.tar.gz
+.gitignore
+.gitattributes

.github/workflows/build_apache_tika.yaml

Lines changed: 65 additions & 28 deletions
@@ -1,11 +1,20 @@
 on:
   push:
+    branches:
+      - main
     paths:
       - 'Dockerfile_apache_tika'
       - '.github/workflows/build_apache_tika.yaml'
     tags:
       - "v*"
+  issue_comment:
+    types: [created]
   workflow_dispatch:
+    inputs:
+      tika_version:
+        description: 'Apache Tika version (e.g. 3.2.2)'
+        required: false
+        default: '3.2.2'

 name: Build Apache Tika container image

@@ -20,6 +29,9 @@ concurrency:
 jobs:
   test-apache-tika:
     name: Test Apache Tika on multiple architectures
+    if: |
+      (github.event_name != 'issue_comment') ||
+      (github.event.issue.pull_request != null && contains(github.event.comment.body, 'cicd/build'))
     strategy:
       fail-fast: false
       matrix:
@@ -34,26 +46,30 @@ jobs:
     timeout-minutes: 30
     steps:
       - name: Checkout code
-        uses: actions/checkout@v4
+        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4.3.1
+        with:
+          ref: ${{ github.event_name == 'issue_comment' && format('refs/pull/{0}/head', github.event.issue.number) || github.ref }}

       - name: Free up disk space
         run: ./.github/scripts/free_disk_space.sh

       - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@v3
+        uses: docker/setup-buildx-action@8d2750c68a42422c14e847fe6c8ac0403b4cbd6f # v3.12.0
         with:
           driver-opts: |
             image=moby/buildkit:v0.12.5

       - name: Build Apache Tika test image for ${{ matrix.platform }}
-        uses: docker/build-push-action@v5
+        uses: docker/build-push-action@ca052bb54ab0790a636c9b5f226502c73d547a25 # v5.4.0
         with:
           context: .
           file: ./Dockerfile_apache_tika
           platforms: ${{ matrix.platform }}
           load: ${{ matrix.platform == 'linux/amd64' }}
           cache-from: type=gha,scope=tika-test-${{ matrix.arch }}
           cache-to: type=gha,mode=min,scope=tika-test-${{ matrix.arch }}
+          build-args: |
+            TIKA_VERSION=${{ inputs.tika_version || '3.2.2' }}
           tags: |
             test-apache-tika:${{ matrix.arch }}

@@ -113,49 +129,66 @@ jobs:
     timeout-minutes: 30
     steps:
       - name: Checkout code
-        uses: actions/checkout@v4
+        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4.3.1
+        with:
+          ref: ${{ github.event_name == 'issue_comment' && format('refs/pull/{0}/head', github.event.issue.number) || github.ref }}

       - name: Free up disk space
         run: ./.github/scripts/free_disk_space.sh

       - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@v3
+        uses: docker/setup-buildx-action@8d2750c68a42422c14e847fe6c8ac0403b4cbd6f # v3.12.0
         with:
           driver-opts: |
             image=moby/buildkit:v0.12.5

       - name: Login to GitHub Container Registry
-        uses: docker/login-action@v3
+        uses: docker/login-action@c94ce9fb468520275223c153574b00df6fe4bcc9 # v3.7.0
         with:
           registry: ghcr.io
           username: ${{ github.repository_owner }}
           password: ${{ secrets.GITHUB_TOKEN }}

+      - name: Generate build metadata
+        id: meta
+        run: |
+          echo "build_date=$(date -u +"%Y-%m-%dT%H:%M:%SZ")" >> $GITHUB_OUTPUT
+          echo "vcs_ref=$(git rev-parse --short HEAD)" >> $GITHUB_OUTPUT
+          echo "version=$(git describe --tags --always 2>/dev/null || echo 'latest')" >> $GITHUB_OUTPUT
+
       - name: Build and push Apache Tika development container image
-        if: ${{ startsWith(github.ref, 'refs/heads/') }}
-        uses: docker/build-push-action@v5
+        if: ${{ startsWith(github.ref, 'refs/heads/') || github.event_name == 'issue_comment' }}
+        uses: docker/build-push-action@ca052bb54ab0790a636c9b5f226502c73d547a25 # v5.4.0
         with:
           context: .
           file: ./Dockerfile_apache_tika
           platforms: ${{ matrix.platform }}
           push: true
           cache-from: type=gha,scope=tika-main-${{ matrix.arch }}
           cache-to: type=gha,mode=max,scope=tika-main-${{ matrix.arch }}
+          build-args: |
+            TIKA_VERSION=${{ inputs.tika_version || '3.2.2' }}
+            BUILD_DATE=${{ steps.meta.outputs.build_date }}
+            VCS_REF=${{ steps.meta.outputs.vcs_ref }}
           tags: |
-            ghcr.io/okfn-brasil/querido-diario-apache-tika-server:latest-${{ matrix.arch }}
+            ghcr.io/${{ github.repository_owner }}/querido-diario-apache-tika-server:latest-${{ matrix.arch }}

       - name: Build and push Apache Tika tagged container image
         if: ${{ startsWith(github.ref, 'refs/tags/') }}
-        uses: docker/build-push-action@v5
+        uses: docker/build-push-action@ca052bb54ab0790a636c9b5f226502c73d547a25 # v5.4.0
         with:
           context: .
           file: ./Dockerfile_apache_tika
           platforms: ${{ matrix.platform }}
           push: true
           cache-from: type=gha,scope=tika-tag-${{ matrix.arch }}
           cache-to: type=gha,mode=max,scope=tika-tag-${{ matrix.arch }}
+          build-args: |
+            TIKA_VERSION=${{ inputs.tika_version || '3.2.2' }}
+            BUILD_DATE=${{ steps.meta.outputs.build_date }}
+            VCS_REF=${{ steps.meta.outputs.vcs_ref }}
           tags: |
-            ghcr.io/okfn-brasil/querido-diario-apache-tika-server:${{ github.ref_name }}-${{ matrix.arch }}
+            ghcr.io/${{ github.repository_owner }}/querido-diario-apache-tika-server:${{ github.ref_name }}-${{ matrix.arch }}

   create-apache-tika-manifest:
     name: Create Apache Tika multi-arch manifest
@@ -164,55 +197,59 @@ jobs:
     timeout-minutes: 15
     steps:
       - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@v3
+        uses: docker/setup-buildx-action@8d2750c68a42422c14e847fe6c8ac0403b4cbd6f # v3.12.0

       - name: Login to GitHub Container Registry
-        uses: docker/login-action@v3
+        uses: docker/login-action@c94ce9fb468520275223c153574b00df6fe4bcc9 # v3.7.0
         with:
           registry: ghcr.io
           username: ${{ github.repository_owner }}
           password: ${{ secrets.GITHUB_TOKEN }}

-      - name: Verify single-arch images availability (branch)
-        if: ${{ startsWith(github.ref, 'refs/heads/') }}
+      - name: Verify single-arch images availability (branch/PR)
+        if: ${{ startsWith(github.ref, 'refs/heads/') || github.event_name == 'issue_comment' }}
         run: |
+          IMAGE="ghcr.io/${{ github.repository_owner }}/querido-diario-apache-tika-server"
           for tag in latest-amd64 latest-arm64; do
             for i in {1..20}; do
-              if docker buildx imagetools inspect ghcr.io/okfn-brasil/querido-diario-apache-tika-server:$tag > /dev/null 2>&1; then
-                echo "Found ghcr.io/okfn-brasil/querido-diario-apache-tika-server:$tag";
+              if docker buildx imagetools inspect $IMAGE:$tag > /dev/null 2>&1; then
+                echo "Found $IMAGE:$tag";
                 break;
               fi
-              echo "Waiting for ghcr.io/okfn-brasil/querido-diario-apache-tika-server:$tag to be available ($i/20)...";
+              echo "Waiting for $IMAGE:$tag to be available ($i/20)...";
               sleep 3;
             done
           done

       - name: Create and push Apache Tika development manifest
-        if: ${{ startsWith(github.ref, 'refs/heads/') }}
+        if: ${{ startsWith(github.ref, 'refs/heads/') || github.event_name == 'issue_comment' }}
         run: |
+          IMAGE="ghcr.io/${{ github.repository_owner }}/querido-diario-apache-tika-server"
           docker buildx imagetools create \
-            -t ghcr.io/okfn-brasil/querido-diario-apache-tika-server:latest \
-            ghcr.io/okfn-brasil/querido-diario-apache-tika-server:latest-amd64 \
-            ghcr.io/okfn-brasil/querido-diario-apache-tika-server:latest-arm64
+            -t $IMAGE:latest \
+            $IMAGE:latest-amd64 \
+            $IMAGE:latest-arm64

       - name: Verify single-arch images availability (tag)
         if: ${{ startsWith(github.ref, 'refs/tags/') }}
         run: |
+          IMAGE="ghcr.io/${{ github.repository_owner }}/querido-diario-apache-tika-server"
           for arch in amd64 arm64; do
             for i in {1..20}; do
-              if docker buildx imagetools inspect ghcr.io/okfn-brasil/querido-diario-apache-tika-server:${{ github.ref_name }}-$arch > /dev/null 2>&1; then
-                echo "Found ghcr.io/okfn-brasil/querido-diario-apache-tika-server:${{ github.ref_name }}-$arch";
+              if docker buildx imagetools inspect $IMAGE:${{ github.ref_name }}-$arch > /dev/null 2>&1; then
+                echo "Found $IMAGE:${{ github.ref_name }}-$arch";
                 break;
               fi
-              echo "Waiting for ghcr.io/okfn-brasil/querido-diario-apache-tika-server:${{ github.ref_name }}-$arch to be available ($i/20)...";
+              echo "Waiting for $IMAGE:${{ github.ref_name }}-$arch to be available ($i/20)...";
               sleep 3;
             done
           done

       - name: Create and push Apache Tika tagged manifest
         if: ${{ startsWith(github.ref, 'refs/tags/') }}
         run: |
+          IMAGE="ghcr.io/${{ github.repository_owner }}/querido-diario-apache-tika-server"
           docker buildx imagetools create \
-            -t ghcr.io/okfn-brasil/querido-diario-apache-tika-server:${{ github.ref_name }} \
-            ghcr.io/okfn-brasil/querido-diario-apache-tika-server:${{ github.ref_name }}-amd64 \
-            ghcr.io/okfn-brasil/querido-diario-apache-tika-server:${{ github.ref_name }}-arm64
+            -t $IMAGE:${{ github.ref_name }} \
+            $IMAGE:${{ github.ref_name }}-amd64 \
+            $IMAGE:${{ github.ref_name }}-arm64
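
The workflow above gates its jobs on the triggering event and picks the ref to check out with an `&& ... ||` expression. As a reading aid only, the same decision logic can be sketched in Python. The function and parameter names are hypothetical, and the lowercase comparison assumes that the `contains` function in GitHub Actions expressions ignores case when searching strings.

```python
from typing import Optional

BUILD_KEYWORD = "cicd/build"


def should_run(event_name: str, is_pull_request: bool, comment_body: str = "") -> bool:
    """Mirror the job-level `if:` condition.

    Any event other than `issue_comment` always runs. A comment only
    runs the job when it was posted on a pull request and its body
    contains the build keyword.
    """
    if event_name != "issue_comment":
        return True
    # Assumption: `contains` is case-insensitive for strings, so compare in lowercase.
    return is_pull_request and BUILD_KEYWORD in comment_body.lower()


def checkout_ref(event_name: str, issue_number: Optional[int], github_ref: str) -> str:
    """Mirror the `ref:` expression passed to the checkout step.

    Comment-triggered runs build the pull request head; every other
    event builds the ref that triggered the workflow.
    """
    if event_name == "issue_comment":
        return f"refs/pull/{issue_number}/head"
    return github_ref
```

For example, `should_run("issue_comment", False, "cicd/build")` is false because a comment on a plain issue has no `pull_request` entry, while `checkout_ref("issue_comment", 121, "refs/heads/main")` resolves to the pull request head rather than the default branch.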

.github/workflows/build_base_image.yaml

Lines changed: 24 additions & 6 deletions
@@ -6,6 +6,8 @@ on:
       - 'requirements.txt'
       - 'Dockerfile.base'
       - '.github/workflows/build_base_image.yaml'
+  issue_comment:
+    types: [created]
   workflow_dispatch:

 name: Build base container image
@@ -21,6 +23,9 @@ concurrency:
 jobs:
   build-base-image:
     name: Build base image with dependencies
+    if: |
+      (github.event_name != 'issue_comment') ||
+      (github.event.issue.pull_request != null && contains(github.event.comment.body, 'cicd/build'))
     strategy:
       fail-fast: false
       matrix:
@@ -35,33 +40,46 @@ jobs:
     timeout-minutes: 90
     steps:
       - name: Checkout code
-        uses: actions/checkout@v4
+        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4.3.1
+        with:
+          ref: ${{ github.event_name == 'issue_comment' && format('refs/pull/{0}/head', github.event.issue.number) || github.ref }}

       - name: Free up disk space
         run: ./.github/scripts/free_disk_space.sh

       - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@v3
+        uses: docker/setup-buildx-action@8d2750c68a42422c14e847fe6c8ac0403b4cbd6f # v3.12.0
         with:
           driver-opts: |
             image=moby/buildkit:v0.12.5

       - name: Login to GitHub Container Registry
-        uses: docker/login-action@v3
+        uses: docker/login-action@c94ce9fb468520275223c153574b00df6fe4bcc9 # v3.7.0
         with:
           registry: ghcr.io
           username: ${{ github.repository_owner }}
           password: ${{ secrets.GITHUB_TOKEN }}

+      - name: Generate build metadata
+        id: meta
+        run: |
+          echo "build_date=$(date -u +"%Y-%m-%dT%H:%M:%SZ")" >> $GITHUB_OUTPUT
+          echo "vcs_ref=$(git rev-parse --short HEAD)" >> $GITHUB_OUTPUT
+          echo "version=$(git describe --tags --always 2>/dev/null || echo 'latest')" >> $GITHUB_OUTPUT
+
       - name: Build and push base image
-        uses: docker/build-push-action@v5
+        uses: docker/build-push-action@ca052bb54ab0790a636c9b5f226502c73d547a25 # v5.4.0
         with:
           context: .
           file: ./Dockerfile.base
           platforms: ${{ matrix.platform }}
           push: true
           cache-from: type=gha,scope=base-${{ matrix.arch }}
           cache-to: type=gha,mode=max,scope=base-${{ matrix.arch }}
+          build-args: |
+            BUILD_DATE=${{ steps.meta.outputs.build_date }}
+            VCS_REF=${{ steps.meta.outputs.vcs_ref }}
+            VERSION=${{ steps.meta.outputs.version }}
           tags: |
             ghcr.io/${{ github.repository }}/base:latest-${{ matrix.arch }}

@@ -72,10 +90,10 @@ jobs:
     timeout-minutes: 15
     steps:
       - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@v3
+        uses: docker/setup-buildx-action@8d2750c68a42422c14e847fe6c8ac0403b4cbd6f # v3.12.0

       - name: Login to GitHub Container Registry
-        uses: docker/login-action@v3
+        uses: docker/login-action@c94ce9fb468520275223c153574b00df6fe4bcc9 # v3.7.0
         with:
           registry: ghcr.io
           username: ${{ github.repository_owner }}
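
The "Generate build metadata" step in this workflow computes three values in shell and passes them on as `build-args`. The sketch below restates that step in Python purely to make the fallback behaviour explicit; it is not part of the commit, and the helper names (`_git`, `build_metadata`, `render_build_args`) are hypothetical.

```python
import subprocess
from datetime import datetime, timezone


def _git(*args: str) -> str:
    """Run a git command and return its trimmed stdout, or "" on any failure."""
    try:
        result = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        # git is not installed on this machine.
        return ""
    return result.stdout.strip() if result.returncode == 0 else ""


def build_metadata() -> dict[str, str]:
    """Compute the same three values as the workflow step."""
    return {
        # UTC timestamp in the same format as `date -u +"%Y-%m-%dT%H:%M:%SZ"`.
        "BUILD_DATE": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        # Short commit hash; empty when not inside a git repository.
        "VCS_REF": _git("rev-parse", "--short", "HEAD"),
        # Nearest tag or commit hash, falling back to "latest" like the shell `||`.
        "VERSION": _git("describe", "--tags", "--always") or "latest",
    }


def render_build_args(metadata: dict[str, str]) -> str:
    """Format key/value pairs as the newline-separated `build-args` block."""
    return "\n".join(f"{key}={value}" for key, value in metadata.items())
```

Calling `render_build_args(build_metadata())` yields one `KEY=value` line per argument, which is the shape the `build-args: |` block in the workflow expects.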
