Commit d8e2b0a

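1. Improved the KeyBERT word segmentation results.
2. Added progress bars to time-consuming work such as tag processing and timestamp processing.
3. The pipeline can skip building the base image and package against a fixed base image instead, which speeds up builds.
4. Hid the build image and merged the build and run images, which speeds up image export and reduces image registry usage.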
1 parent dcd02e4 commit d8e2b0a

6 files changed

Lines changed: 159 additions & 38 deletions

.github/workflows/build-and-deploy.yml

Lines changed: 29 additions & 1 deletion
@@ -31,6 +31,10 @@ jobs:
     env:
       REGISTRY: ${{ secrets.REGISTRY || 'ghcr.io' }}
       IMAGE_NAME: ${{ secrets.IMAGE_NAME || github.repository }}
+      # Config sources (priority): workflow_dispatch inputs -> repo variables.
+      # For push events, only repo variables (vars.*) apply.
+      SKIP_BASE_IMAGE: ${{ vars.SKIP_BASE_IMAGE || 'false' }}
+      FIXED_BASE_IMAGE_REF: ${{ vars.BASE_IMAGE_REF || '' }}
 
 
     steps:
@@ -44,9 +48,20 @@ jobs:
           set -euo pipefail
           registry="${REGISTRY}"
           image_name="${IMAGE_NAME}"
+          skip_base_image_raw="${SKIP_BASE_IMAGE}"
+          fixed_base_image_ref="${FIXED_BASE_IMAGE_REF}"
           # GHCR requires lowercase image names
           image_name_lc="${image_name,,}"
 
+          # Normalize boolean-ish values from env.
+          shopt -s nocasematch
+          if [[ "${skip_base_image_raw}" == "true" || "${skip_base_image_raw}" == "1" || "${skip_base_image_raw}" == "yes" ]]; then
+            skip_base_image="true"
+          else
+            skip_base_image="false"
+          fi
+          shopt -u nocasematch
+
           # Version source: `vsersion` (repo file). Fallback keeps workflow usable.
           if [[ -f vsersion ]]; then
             version="$(tr -d '\r\n' < vsersion)"
@@ -58,12 +73,24 @@ jobs:
           echo "image=${registry}/${image_name_lc}" >> "$GITHUB_OUTPUT"
           echo "version=${version}" >> "$GITHUB_OUTPUT"
           echo "iteration=${iteration}" >> "$GITHUB_OUTPUT"
+          echo "skip_base_image=${skip_base_image}" >> "$GITHUB_OUTPUT"
 
           echo "base_tag=base-${version}-${iteration}" >> "$GITHUB_OUTPUT"
           echo "build_tag=build-${version}-${iteration}" >> "$GITHUB_OUTPUT"
           echo "run_tag=run-${version}-${iteration}" >> "$GITHUB_OUTPUT"
 
-          echo "base_image_ref=${registry}/${image_name_lc}:base-${version}-${iteration}" >> "$GITHUB_OUTPUT"
+          computed_base_image_ref="${registry}/${image_name_lc}:base-${version}-${iteration}"
+          if [[ "${skip_base_image}" == "true" ]]; then
+            if [[ -z "${fixed_base_image_ref}" ]]; then
+              echo "ERROR: SKIP_BASE_IMAGE=true but FIXED_BASE_IMAGE_REF/base_image_ref is empty" >&2
+              exit 2
+            fi
+            base_image_ref="${fixed_base_image_ref}"
+          else
+            base_image_ref="${computed_base_image_ref}"
+          fi
+
+          echo "base_image_ref=${base_image_ref}" >> "$GITHUB_OUTPUT"
           echo "build_image_ref=${registry}/${image_name_lc}:build-${version}-${iteration}" >> "$GITHUB_OUTPUT"
           echo "run_image_ref=${registry}/${image_name_lc}:run-${version}-${iteration}" >> "$GITHUB_OUTPUT"
 
@@ -81,6 +108,7 @@ jobs:
           password: ${{ secrets.REGISTRY_PASSWORD }}
 
       - name: Build & push base image
+        if: ${{ steps.vars.outputs.skip_base_image != 'true' }}
         uses: docker/build-push-action@v6
         with:
           context: .
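
For readers following the shell logic in this workflow, the boolean normalization and the base-image selection can be sketched in Python. This is an illustrative sketch only, not part of the commit: `normalize_bool` and `resolve_base_image_ref` are hypothetical names, and the registry, image name, version, and iteration values in the usage note are made-up samples.

```python
def normalize_bool(raw: str) -> bool:
    """Mirror the workflow's case-insensitive match on 'true', '1', or 'yes'.

    Like the shell comparison, no whitespace is stripped; anything else is false.
    """
    return raw.lower() in {"true", "1", "yes"}


def resolve_base_image_ref(
    registry: str,
    image_name: str,
    version: str,
    iteration: str,
    skip_raw: str,
    fixed_ref: str,
) -> str:
    """Pick the base image reference the later build stages should use."""
    # GHCR requires lowercase image names.
    image_name_lc = image_name.lower()
    computed = f"{registry}/{image_name_lc}:base-{version}-{iteration}"
    if normalize_bool(skip_raw):
        # Skipping the base build only makes sense when a fixed ref is supplied.
        if not fixed_ref:
            raise ValueError("SKIP_BASE_IMAGE=true but FIXED_BASE_IMAGE_REF is empty")
        return fixed_ref
    return computed
```

With `skip_raw="false"`, calling `resolve_base_image_ref("ghcr.io", "Estom/Hexo-Blog", "v1.0.0", "7", "false", "")` yields the computed per-build tag; with `skip_raw="YES"` and a non-empty `fixed_ref`, the fixed reference is returned unchanged, and an empty `fixed_ref` raises, which corresponds to the `exit 2` branch in the workflow.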

.vscode/launch.json

Lines changed: 17 additions & 2 deletions
@@ -6,13 +6,28 @@
     "configurations": [
 
         {
-            "name": "Python数据处理",
+            "name": "Python数据处理 (.venv)",
             "type": "debugpy",
             "request": "launch",
             "cwd": "${workspaceFolder}",
             "program": "${file}",
             "console": "integratedTerminal",
-            "args": "--verbose --git-batch=true --tag-count=0 --tag-method=textrank"
+            "python": "${workspaceFolder}/.venv/bin/python",
+            "env": {
+                "HF_ENDPOINT": "https://hf-mirror.com"
+            },
+            "args": "--verbose --git-batch=true --tag-count=3 --tag-method=keybert"
+        },
+        {
+            "name": "Python数据处理 (uv run + attach)",
+            "type": "debugpy",
+            "request": "attach",
+            "connect": {
+                "host": "127.0.0.1",
+                "port": 5678
+            },
+            "preLaunchTask": "uv: run hexo-proc with debugpy",
+            "justMyCode": true
         }
     ]
 }

.vscode/tasks.json

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+{
+    "version": "2.0.0",
+    "tasks": [
+        {
+            "label": "uv: run hexo-proc with debugpy",
+            "type": "shell",
+            "command": "HF_ENDPOINT=https://hf-mirror.com uv run python -m debugpy --listen 5678 --wait-for-client -m hexo_proc --verbose --git-batch=true --tag-count=3 --tag-method=keybert",
+            "isBackground": true,
+            "problemMatcher": []
+        }
+    ]
+}

TODO.md

Lines changed: 4 additions & 5 deletions
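@@ -31,13 +31,12 @@
 - [ ] Optimize execution performance
 - [x] Process git timestamps in batch. This greatly improves processing speed and avoids having Python repeatedly invoke bash commands. Optimize git timestamp extraction: the current approach is far too inefficient, with repeated system command invocations and reads that seriously slow things down.
 - [x] Support multiple tag-processing strategies. Introduce tfidf, textrank, a small keybert model, and others. textrank offers the best balance of speed and quality. Optimize the tag generation approach.
-- [ ] keybert word segmentation quality is poor.
-- [ ] Add a simple progress bar to tag processing and timestamp processing.
+- [x] keybert word segmentation quality is poor.
+- [x] Add a simple progress bar to tag processing and timestamp processing.
 - [ ] Incremental data processing. Implement incremental builds: given a new commit and the front matter that has already been generated, regenerate only part of it. On each run, check which files changed between the previously processed commit and the current commit, then process those.
-- [ ] Tag processing causes the pipeline to exceed its memory limit and fail, which is rather absurd.
+- [x] The pipeline supports skipping the base image build to speed up the build process.
 - [x] The git batch command still directly fills up the ssh buffer, causing a disconnect and then a forced exit. It turned out the AI had written an infinite loop!
-- [ ] The pipeline still has some problems: for some reason the final image can no longer be built, and it appears to hang while exporting the final image. Fix this first.
-- [x] Consider hiding the build image, or just packaging the run image, by putting the build image and run image in one Dockerfile and building in two stages. Keep the current packaging mode unchanged.
+- [x] Consider hiding the build image, x just packaging the run image, by putting the build image and run image in one Dockerfile and building in two stages. Keep the current packaging mode unchanged.
 - [x] Improve log output. Some intermediate steps print too many logs, while others print none.
 - [x] Ignore the directories that are no longer maintained in ghignore.
 - [ ] ICP filing (备案)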

docker.sh

Lines changed: 1 addition & 0 deletions
@@ -18,3 +18,4 @@ docker run -d --name notes -p 80:80 ghcr.io/estom/hexo-blog:run-v1.0.0
 
 # Update submodules; this triggers the submodule download
 # git submodule update --recursive
+uv run hexo-proc --verbose --git-batch=true --tag-method=keybert --tag-count=3 --tag-budget=100

processor/process_posts.py

Lines changed: 96 additions & 30 deletions
@@ -41,6 +41,67 @@ def build_index(
 ) -> Dict[str, List[str]]: ...
 
 
+@dataclass
+class ProgressPrinter:
+    """Print progress in both interactive (TTY) and non-interactive environments.
+
+    - TTY: renders a single-line progress bar updated in-place.
+    - Non-TTY (CI/server): prints on every 1% progress change.
+    """
+
+    enabled: bool
+    prefix: str
+    width: int = 24
+    tty_stream = sys.stderr
+    _last_pct: int = -1
+    _is_tty: bool = False
+
+    def __post_init__(self) -> None:
+        if not self.enabled:
+            self._is_tty = False
+            return
+        self._is_tty = bool(getattr(self.tty_stream, "isatty", lambda: False)())
+
+    def _render_bar(self, done: int, total: int) -> str:
+        if total <= 0:
+            return "[" + ("-" * self.width) + "] 0/0"
+        frac = max(0.0, min(1.0, done / float(total)))
+        filled = int(round(frac * self.width))
+        bar = "#" * filled + "-" * (self.width - filled)
+        pct = int(frac * 100)
+        return f"[{bar}] {done}/{total} {pct:3d}%"
+
+    def update(self, done: int, total: int) -> None:
+        if not self.enabled:
+            return
+        if total < 0:
+            total = 0
+        if done < 0:
+            done = 0
+        if total > 0 and done > total:
+            done = total
+
+        if self._is_tty:
+            msg = f"\r{self.prefix} {self._render_bar(done, total)}"
+            self.tty_stream.write(msg)
+            self.tty_stream.flush()
+            return
+
+        pct = int((done / max(1, total)) * 100)
+        # Non-interactive: print every 1% change.
+        if pct != self._last_pct:
+            print(f"{self.prefix}: {pct}% ({done}/{total})")
+            self._last_pct = pct
+
+    def close(self, total: int) -> None:
+        if not self.enabled:
+            return
+        self.update(total, total)
+        if self._is_tty:
+            self.tty_stream.write("\n")
+            self.tty_stream.flush()
+
+
 def _clean_text_for_tags(text: str) -> str:
     # Remove common non-content segments to reduce noisy tokens.
     clean = text
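
The non-TTY branch of `ProgressPrinter.update` prints only when the integer percentage changes, which bounds the log volume in CI regardless of how many files are processed. The following standalone sketch reproduces just that throttling rule; `throttled_reports` is a hypothetical helper written for illustration and does not exist in the repository.

```python
from typing import List


def throttled_reports(total: int) -> List[str]:
    """Return the lines a non-TTY reporter would emit for done = 1..total.

    A line is emitted only when the truncated integer percentage differs
    from the last one reported, matching the `pct != self._last_pct` check.
    """
    lines: List[str] = []
    last_pct = -1
    for done in range(1, total + 1):
        pct = int((done / max(1, total)) * 100)
        if pct != last_pct:
            lines.append(f"{pct}% ({done}/{total})")
            last_pct = pct
    return lines
```

For three files this produces one line per file (`33%`, `66%`, `100%`), while for a thousand files it produces at most 101 lines, one per distinct percentage from 0 to 100.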
@@ -117,15 +178,18 @@ def _build_tfidf_tag_index(
         ) from exc
 
     # Build corpus: jieba tokens joined by spaces.
+    progress_corpus = ProgressPrinter(enabled=verbose, prefix='[tags tfidf] corpus')
     rel_paths: List[str] = []
     corpus: List[str] = []
-    for p in files:
+    for i, p in enumerate(files, start=1):
         rel = p.relative_to(target_root).as_posix()
         rel_paths.append(rel)
         raw = _read_text_best_effort(p)
         body = _strip_front_matter_if_any(raw)
         tokens = _jieba_tokenize(body, dedupe=True)
         corpus.append(' '.join(tokens))
+        progress_corpus.update(i, len(files))
+    progress_corpus.close(len(files))
 
     vectorizer = TfidfVectorizer(
         # We already tokenized; keep tokens as-is.
@@ -139,11 +203,13 @@ def _build_tfidf_tag_index(
 
     used_global: Set[str] = set()
     tags_by_path: Dict[str, List[str]] = {}
+    progress_pick = ProgressPrinter(enabled=verbose, prefix='[tags tfidf] pick')
 
     for i, rel in enumerate(rel_paths):
         row = X.getrow(i)
         if row.nnz == 0:
             tags_by_path[rel] = []
+            progress_pick.update(i + 1, len(rel_paths))
             continue
         # Sort terms by TF-IDF score descending.
         pairs = sorted(zip(row.indices, row.data), key=lambda x: x[1], reverse=True)
@@ -164,6 +230,8 @@ def _build_tfidf_tag_index(
             if len(picked) >= tag_count:
                 break
         tags_by_path[rel] = picked
+        progress_pick.update(i + 1, len(rel_paths))
+    progress_pick.close(len(rel_paths))
 
     if verbose:
         print(f"[tags] unique={len(used_global)}/{max_unique_tags}")
@@ -190,11 +258,12 @@ def _build_textrank_tag_index(
             "或:pip install textrank4zh"
         ) from exc
     print(f"[tags] start building TextRank index for {len(files)} files...")
+    progress = ProgressPrinter(enabled=verbose, prefix='[tags textrank]')
 
     used_global: Set[str] = set()
     tags_by_path: Dict[str, List[str]] = {}
 
-    for p in files:
+    for i, p in enumerate(files, start=1):
         rel = p.relative_to(target_root).as_posix()
         raw = _read_text_best_effort(p)
         body = _strip_front_matter_if_any(raw)
@@ -239,6 +308,8 @@ def _build_textrank_tag_index(
             if len(picked) >= tag_count:
                 break
         tags_by_path[rel] = picked
+        progress.update(i, len(files))
+    progress.close(len(files))
 
     if verbose:
         print(f"[tags] end building TextRank index, unique={len(used_global)}/{max_unique_tags}")
@@ -280,6 +351,7 @@ def _build_keybert_tag_index(
         ) from exc
 
     print(f"[tags] start building KeyBERT index for {len(files)} files...")
+    progress = ProgressPrinter(enabled=verbose, prefix='[tags keybert]')
 
     # A multilingual model works for Chinese and English mixed content.
     # Keep it as an internal default to avoid expanding CLI surface.
@@ -297,11 +369,16 @@ def _build_keybert_tag_index(
     used_global: Set[str] = set()
     tags_by_path: Dict[str, List[str]] = {}
 
-    for p in files:
+    for i, p in enumerate(files, start=1):
         rel = p.relative_to(target_root).as_posix()
         raw = _read_text_best_effort(p)
         body = _strip_front_matter_if_any(raw)
         clean = _clean_text_for_tags(body)
+        tokens = _jieba_tokenize(clean, dedupe=False)
+        if not tokens:
+            tags_by_path[rel] = []
+            continue
+        doc = ' '.join(tokens)
 
         # KeyBERT returns list[(keyword, score)]. Use keyphrase_ngram_range=(1,1)
         # to align with the existing per-tag token behavior.
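
The added lines pre-tokenize the cleaned text and join the tokens with spaces, presumably so that a whitespace-based keyword extractor sees word boundaries in Chinese text; the hunk does not show how `doc` is consumed afterwards. A minimal sketch of that preparation step follows, under clear assumptions: `toy_tokenize` is a hypothetical regex stand-in for the repository's `_jieba_tokenize` (it treats every CJK character as its own token, which jieba does not), and `prepare_doc` is an illustrative name.

```python
import re
from typing import Callable, List, Optional


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a real segmenter such as jieba.

    Returns runs of ASCII word characters, plus each CJK character
    as a separate token. Punctuation and whitespace are dropped.
    """
    return re.findall(r"[A-Za-z0-9_]+|[\u4e00-\u9fff]", text)


def prepare_doc(
    text: str,
    tokenize: Callable[[str], List[str]] = toy_tokenize,
) -> Optional[str]:
    """Join tokens with spaces; return None when nothing is left to tag."""
    tokens = tokenize(text)
    if not tokens:
        # Mirrors the early `continue` in the diff: no tokens, no tags.
        return None
    return " ".join(tokens)
```

Passing a real segmenter as `tokenize` would keep multi-character Chinese words together; the stand-in only demonstrates the join-and-skip-empty shape of the change.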
@@ -351,6 +428,8 @@ def _build_keybert_tag_index(
                 break
 
         tags_by_path[rel] = picked
+        progress.update(i, len(files))
+    progress.close(len(files))
 
     if verbose:
         print(f"[tags] unique={len(used_global)}/{max_unique_tags}")
@@ -782,6 +861,8 @@ def _build_git_time_index(repo_dir: Path, interest_paths: Set[str], date_kind: s
     created: Dict[str, int] = {}
     done: Set[str] = set()
     current_ts: Optional[int] = None
+    progress = ProgressPrinter(enabled=verbose, prefix='[git] batch')
+    last_filled = -1
 
     try:
         with tempfile.NamedTemporaryFile(prefix='git-log-', suffix='.txt', delete=False, mode='w', encoding='utf-8') as fp:
@@ -856,6 +937,14 @@ def _build_git_time_index(repo_dir: Path, interest_paths: Set[str], date_kind: s
                     done.add(final)
                     if len(done) == len(interest_paths):
                         break
+
+                    filled = 0
+                    for k in interest_paths:
+                        if k in updated and k in created:
+                            filled += 1
+                    if filled != last_filled:
+                        progress.update(filled, len(interest_paths))
+                        last_filled = filled
     finally:
         if out_path:
             try:
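
In the git batch index, progress is measured by how many paths of interest have both a created and an updated timestamp, and the printer is called only when that count changes. The sketch below isolates that rule for illustration; `count_filled` and `progress_events` are hypothetical helpers, and the snapshots in the usage note are made-up sample data rather than real `git log` output.

```python
from typing import Dict, Iterable, List, Set, Tuple

Timestamps = Dict[str, int]


def count_filled(interest_paths: Set[str], created: Timestamps, updated: Timestamps) -> int:
    """Count paths that already have both timestamps."""
    return sum(1 for k in interest_paths if k in updated and k in created)


def progress_events(
    interest_paths: Set[str],
    snapshots: Iterable[Tuple[Timestamps, Timestamps]],
) -> List[int]:
    """Return the filled counts that would be reported, one per change.

    Each snapshot is a (created, updated) pair as it might look after
    one step of parsing; repeated counts are suppressed.
    """
    events: List[int] = []
    last_filled = -1
    for created, updated in snapshots:
        filled = count_filled(interest_paths, created, updated)
        if filled != last_filled:
            events.append(filled)
            last_filled = filled
    return events
```

Given two paths and four snapshots whose filled counts are 0, 1, 1, and 2, only three updates are reported. Note that each count rescans every path of interest, so the cost per step grows with the number of tracked files.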
@@ -869,6 +958,7 @@ def _build_git_time_index(repo_dir: Path, interest_paths: Set[str], date_kind: s
         u = updated.get(k)
         if c is not None and u is not None:
             result[k] = (c, u)
+    progress.close(len(interest_paths))
     return result
 
 
@@ -1032,40 +1122,16 @@ def process_directory(cfg: ProcessConfig) -> int:
         if cfg.verbose:
             print(f"[git] batch index: filled={len(idx)}/{len(missing)}")
 
-    def render_bar(done: int, total_count: int, width: int = 24) -> str:
-        if total_count <= 0:
-            return "[" + ("-" * width) + "] 0/0"
-        frac = max(0.0, min(1.0, done / float(total_count)))
-        filled = int(round(frac * width))
-        bar = "#" * filled + "-" * (width - filled)
-        pct = int(frac * 100)
-        return f"[{bar}] {done}/{total_count} {pct:3d}%"
-
-    show_progress = (not cfg.verbose)
-    is_tty = bool(getattr(sys.stderr, "isatty", lambda: False)())
-    last_reported_pct = -1
+    progress = ProgressPrinter(enabled=cfg.verbose, prefix='[process]')
 
     count = 0
     for i, p in enumerate(files, start=1):
         if process_file(cfg, p):
             count += 1
             if cfg.verbose:
                 print(f"[ok] {p.relative_to(cfg.target_dir).as_posix()}")
-
-        if show_progress:
-            if is_tty:
-                msg = f"\r{render_bar(i, total)}"
-                sys.stderr.write(msg)
-                sys.stderr.flush()
-            else:
-                # Non-interactive output: print on percentage change (1% steps) and at the end.
-                pct = int((i / max(1, total)) * 100)
-                if pct != last_reported_pct and (pct % 5 == 0 or i == total):
-                    print(f"progress: {pct}% ({i}/{total})")
-                    last_reported_pct = pct
-
-    if show_progress and is_tty:
-        sys.stderr.write("\n")
+        progress.update(i, total)
+    progress.close(total)
 
     print(f"processed: {count} files, target={cfg.target_dir}")
     return 0
