Commit 680ade0

fix: switch indexing to incremental updates after exceeding the Pinecone WU limit, and add a monthly full-rebuild GitHub Action
1 parent 3bdc0a6 commit 680ade0

5 files changed

+114
-16
lines changed

.github/workflows/index-full.yml

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
1+
name: Monthly Full Index Rebuild
2+
3+
on:
4+
schedule:
5+
- cron: "0 3 1 * *" # 03:00 UTC on the 1st of every month
6+
workflow_dispatch: {}
7+
8+
jobs:
9+
run:
10+
runs-on: ubuntu-latest
11+
12+
steps:
13+
- uses: actions/checkout@v4
14+
15+
- uses: actions/setup-python@v5
16+
with:
17+
python-version: "3.11"
18+
19+
- name: Install dependencies
20+
run: |
21+
python -m pip install --upgrade pip
22+
pip install --upgrade "setuptools<75" wheel
23+
pip install -r requirements-indexer.txt -v
24+
25+
- name: Run full rebuild
26+
env:
27+
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
28+
DB_HOST: ${{ secrets.DB_HOST }}
29+
DB_USER: ${{ secrets.DB_USER }}
30+
DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
31+
DB_NAME: ${{ secrets.DB_NAME }}
32+
DB_PORT: ${{ secrets.DB_PORT }}
33+
PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
34+
PINECONE_INDEX: ${{ secrets.PINECONE_INDEX }}
35+
PINECONE_NAMESPACE: ${{ secrets.PINECONE_NAMESPACE }}
36+
# Full rebuild settings
37+
INDEX_MODE: full
38+
PYTHONPATH: ${{ github.workspace }}/src
39+
run: |
40+
python -m uosai.indexer.index
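
The `schedule` trigger above fires at 03:00 UTC on day 1 of each month (GitHub evaluates cron expressions in UTC, and scheduled runs can start late under load). As a rough illustration only, the next firing time of `0 3 1 * *` can be computed like this; `next_monthly_run` is a hypothetical helper, not part of this commit.

```python
from datetime import datetime, timezone


def next_monthly_run(now: datetime) -> datetime:
    """Return the next firing time of cron '0 3 1 * *' strictly after `now`.

    Hypothetical helper for illustration; `now` must be timezone-aware.
    """
    now = now.astimezone(timezone.utc)
    # Candidate: 03:00 UTC on day 1 of the current month.
    candidate = now.replace(day=1, hour=3, minute=0, second=0, microsecond=0)
    if candidate <= now:
        # Already passed this month, so roll over to the first of next month.
        if now.month == 12:
            candidate = candidate.replace(year=now.year + 1, month=1)
        else:
            candidate = candidate.replace(month=now.month + 1)
    return candidate


if __name__ == "__main__":
    print(next_monthly_run(datetime.now(timezone.utc)))
```

Manual runs through `workflow_dispatch` are unaffected by the schedule, so a full rebuild can still be triggered on demand.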

.github/workflows/index.yml

Lines changed: 18 additions & 5 deletions
@@ -1,11 +1,20 @@
1-
name: Daily Index Rebuild
1+
name: Incremental Index Update
22

33
on:
44
workflow_run:
5-
workflows: ["Daily UOS Notices"]
5+
workflows: ["Daily UOS Notices"]
66
types:
77
- completed
8-
workflow_dispatch: {}
8+
workflow_dispatch:
9+
inputs:
10+
mode:
11+
description: 'Indexing mode (incremental or full)'
12+
required: false
13+
default: 'incremental'
14+
type: choice
15+
options:
16+
- incremental
17+
- full
918

1019
jobs:
1120
run:
@@ -29,7 +38,7 @@ jobs:
2938
pip install --upgrade "setuptools<75" wheel
3039
pip install -r requirements-indexer.txt -v
3140
32-
- name: Run indexer
41+
- name: Run indexer (incremental update)
3342
env:
3443
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
3544
DB_HOST: ${{ secrets.DB_HOST }}
@@ -40,5 +49,9 @@ jobs:
4049
PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
4150
PINECONE_INDEX: ${{ secrets.PINECONE_INDEX }}
4251
PINECONE_NAMESPACE: ${{ secrets.PINECONE_NAMESPACE }}
52+
# Incremental update settings
53+
INDEX_MODE: ${{ github.event.inputs.mode || 'incremental' }}
54+
INCREMENTAL_DAYS: 14
55+
PYTHONPATH: ${{ github.workspace }}/src
4356
run: |
44-
python scripts/run_indexer.py
57+
python -m uosai.indexer.index
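
When this workflow is started by `workflow_run`, no dispatch inputs exist, so the expression `github.event.inputs.mode || 'incremental'` falls back to `incremental`. The workflow also sets `INCREMENTAL_DAYS: 14`, which overrides the in-code default of 7. A minimal sketch of how those environment values resolve on the Python side, assuming a hypothetical `load_settings` helper that mirrors the `os.getenv` defaults in `index.py`:

```python
from typing import Mapping, NamedTuple


class IndexerSettings(NamedTuple):
    mode: str
    incremental_days: int


def load_settings(env: Mapping[str, str]) -> IndexerSettings:
    """Resolve indexer settings from an environment-like mapping.

    Hypothetical helper: the defaults mirror index.py
    (INDEX_MODE="incremental", INCREMENTAL_DAYS="7").
    """
    mode = env.get("INDEX_MODE", "incremental").lower()
    days = int(env.get("INCREMENTAL_DAYS", "7"))
    return IndexerSettings(mode, days)


if __name__ == "__main__":
    # Values as the incremental workflow would export them.
    print(load_settings({"INDEX_MODE": "incremental", "INCREMENTAL_DAYS": "14"}))
    # Nothing set, e.g. a local run: the in-code defaults apply.
    print(load_settings({}))
```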

src/uosai/common/utils.py

Lines changed: 18 additions & 3 deletions
@@ -179,18 +179,31 @@ def get_vectorstore() -> PineconeVectorStore:
179179

180180

181181
def upsert_docs(docs: List[Document], rebuild: bool = False) -> int:
182+
"""Upsert documents into Pinecone.
183+
184+
Args:
185+
docs: List of documents to upsert.
186+
rebuild: If True, delete everything and re-insert; if False (default), perform an incremental update.
187+
188+
Returns:
189+
Number of documents upserted.
190+
191+
Note:
192+
- rebuild=False (incremental mode): documents with an existing ID are overwritten, new ones are inserted (saves WUs)
193+
- rebuild=True (full rebuild): all documents are deleted and re-inserted (recommended once a month)
194+
"""
182195
if not PINECONE_API_KEY:
183196
raise RuntimeError("PINECONE_API_KEY missing")
184197

185198
pc = Pinecone(api_key=PINECONE_API_KEY)
186199
ensure_pinecone_index(pc, PINECONE_INDEX, EMBED_DIM)
187200
vs = get_vectorstore()
188201

189-
# Used to delete everything on the first batch only
202+
# Delete everything only when rebuild=True (consumes a large number of WUs)
190203
if rebuild:
191204
idx = pc.Index(PINECONE_INDEX)
192205
ns_repr = "__default__" if PINECONE_NS is None else PINECONE_NS
193-
print(f"[pinecone] delete_all namespace={ns_repr}")
206+
print(f"[pinecone] FULL REBUILD: delete_all namespace={ns_repr}")
194207
try:
195208
if PINECONE_NS is None:
196209
idx.delete(delete_all=True) # default namespace
@@ -199,8 +212,10 @@ def upsert_docs(docs: List[Document], rebuild: bool = False) -> int:
199212
except Exception as e:
200213
# A nonexistent namespace has nothing to delete and may return 404; just print a warning
201214
print(f"[pinecone] delete_all warning: {e}")
215+
else:
216+
print(f"[pinecone] INCREMENTAL UPDATE: upserting {len(docs)} documents")
202217

203-
# Upsert
218+
# Upsert (overwrite when the ID already exists, insert when it is new)
204219
ids = []
205220
for i, d in enumerate(docs):
206221
m = d.metadata or {}
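
Incremental mode relies on upsert semantics: writing a vector under an existing ID replaces it, so re-indexing the same notice does not create duplicates as long as chunk IDs are deterministic. The hunk above is cut off before the IDs are built, so the following is only a sketch under assumptions: `make_chunk_id` and `FakeIndex` are hypothetical stand-ins, not the project's ID scheme or the Pinecone client.

```python
from typing import Dict, List, Tuple


def make_chunk_id(notice_id: int, chunk_index: int) -> str:
    """Hypothetical deterministic ID: same notice and chunk position, same ID."""
    return f"notice-{notice_id}-chunk-{chunk_index}"


class FakeIndex:
    """In-memory stand-in for a vector index with overwrite-on-upsert behavior."""

    def __init__(self) -> None:
        self.vectors: Dict[str, str] = {}

    def upsert(self, items: List[Tuple[str, str]]) -> int:
        for vector_id, text in items:
            self.vectors[vector_id] = text  # overwrite if present, insert if new
        return len(items)


def chunks_for(notice_id: int, chunks: List[str]) -> List[Tuple[str, str]]:
    return [(make_chunk_id(notice_id, i), text) for i, text in enumerate(chunks)]


index = FakeIndex()
# First indexing run: the notice splits into three chunks.
index.upsert(chunks_for(101, ["intro v1", "body v1", "footer v1"]))
# The notice is edited and now splits into two chunks.
index.upsert(chunks_for(101, ["intro v2", "body v2"]))

# Chunks 0 and 1 were overwritten; chunk 2 is a stale leftover.
print(sorted(index.vectors.items()))
```

The leftover third chunk illustrates one reason a periodic full rebuild is still useful: an incremental upsert never removes vectors whose source content shrank or disappeared.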

src/uosai/indexer/README.md

Lines changed: 0 additions & 1 deletion
@@ -1 +0,0 @@
1-
Indexing pipeline (split/embed/upsert)

src/uosai/indexer/index.py

Lines changed: 38 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,19 +1,48 @@
11
# src/uosai/indexer/index.py
22
import os, sys, time, traceback
3-
from datetime import datetime
3+
from datetime import datetime, timedelta
44

55
# Common utilities
6-
from uosai.common.utils import fetch_all_rows, row_to_doc, split_docs, upsert_docs
6+
from uosai.common.utils import fetch_all_rows, fetch_rows_since, row_to_doc, split_docs, upsert_docs
77

88
BATCH_SIZE = int(os.getenv("BATCH_SIZE", "200"))
99
BATCH_SLEEP_SEC = float(os.getenv("BATCH_SLEEP_SEC", "0.8")) # rate-limit mitigation
1010

11+
# Incremental update settings
12+
INDEX_MODE = os.getenv("INDEX_MODE", "incremental") # "incremental" or "full"
13+
INCREMENTAL_DAYS = int(os.getenv("INCREMENTAL_DAYS", "7")) # look-back window (last N days) for incremental updates
14+
1115
def log(msg: str) -> None:
1216
print(f"[indexer {datetime.now():%Y-%m-%d %H:%M:%S}] {msg}")
1317

1418
def main() -> int:
15-
log("Full rebuild start")
16-
rows = fetch_all_rows()
19+
"""Main entry point for indexing.
20+
21+
Environment variables:
22+
INDEX_MODE: "incremental" (incremental update, default) or "full" (full rebuild)
23+
INCREMENTAL_DAYS: process the last N days of data in incremental mode (default: 7)
24+
25+
WU savings:
26+
- incremental: upsert only the last N days of data (roughly 30-300 WUs per day)
27+
- full: delete everything and re-insert (about 6,000 WUs; recommended once a month)
28+
"""
29+
mode = INDEX_MODE.lower()
30+
31+
if mode == "full":
32+
log("=== FULL REBUILD MODE ===")
33+
log("WARNING: This will consume significant WUs!")
34+
rows = fetch_all_rows()
35+
rebuild = True
36+
elif mode == "incremental":
37+
log(f"=== INCREMENTAL UPDATE MODE (last {INCREMENTAL_DAYS} days) ===")
38+
since_date = (datetime.now() - timedelta(days=INCREMENTAL_DAYS)).strftime("%Y-%m-%d")
39+
log(f"Fetching notices since: {since_date}")
40+
rows = fetch_rows_since(since_date)
41+
rebuild = False
42+
else:
43+
log(f"ERROR: Invalid INDEX_MODE={INDEX_MODE}. Use 'incremental' or 'full'")
44+
return 1
45+
1746
if not rows:
1847
log("No rows found")
1948
return 0
@@ -24,14 +53,16 @@ def main() -> int:
2453
total = 0
2554
for i in range(0, len(docs), BATCH_SIZE):
2655
batch = docs[i:i+BATCH_SIZE]
27-
# Delete everything on the first batch only
28-
n = upsert_docs(batch, rebuild=(i == 0))
56+
# Delete everything on the first batch, and only in full mode
57+
should_rebuild = rebuild and (i == 0)
58+
n = upsert_docs(batch, rebuild=should_rebuild)
2959
total += n
3060
log(f"Upsert batch {i//BATCH_SIZE+1}: {n} chunks (cum {total})")
3161
if i + BATCH_SIZE < len(docs) and BATCH_SLEEP_SEC > 0:
3262
time.sleep(BATCH_SLEEP_SEC)
3363

34-
log(f"Full rebuild done: chunks={total}")
64+
mode_label = "Full rebuild" if rebuild else "Incremental update"
65+
log(f"{mode_label} done: chunks={total}")
3566
return total
3667

3768
if __name__ == "__main__":
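
The control flow added to `main()` comes down to two decisions: which rows to fetch (everything, or rows since a cutoff date) and which batch, if any, may wipe the index. The sketch below isolates that logic so it can be checked without a database or Pinecone. `plan_run` and `batch_flags` are hypothetical names introduced for illustration, and `fetch_rows_since` is left out because its definition does not appear in the hunks shown here.

```python
from datetime import datetime, timedelta
from typing import Iterator, Optional, Tuple


def plan_run(mode: str, days: int, now: datetime) -> Tuple[bool, Optional[str]]:
    """Return (rebuild, since_date) for a mode; raise ValueError if it is unknown."""
    mode = mode.lower()
    if mode == "full":
        return True, None  # fetch all rows and wipe the index first
    if mode == "incremental":
        # Same cutoff format as main(): a YYYY-MM-DD string N days back.
        since = (now - timedelta(days=days)).strftime("%Y-%m-%d")
        return False, since
    raise ValueError(f"Invalid INDEX_MODE={mode!r}. Use 'incremental' or 'full'")


def batch_flags(n_docs: int, batch_size: int, rebuild: bool) -> Iterator[Tuple[int, int, bool]]:
    """Yield (start, end, should_rebuild) for each batch.

    Only the first batch of a full run carries should_rebuild=True, so the
    index is deleted at most once per run.
    """
    for start in range(0, n_docs, batch_size):
        end = min(start + batch_size, n_docs)
        yield start, end, rebuild and start == 0


if __name__ == "__main__":
    rebuild, since = plan_run("incremental", 14, datetime(2025, 1, 20, 9, 0))
    print(rebuild, since)
    for start, end, wipe in batch_flags(450, 200, rebuild=True):
        print(start, end, wipe)
```

Raising on an unknown mode is a design choice of this sketch; the committed code logs an error and returns 1 instead.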
