Commit 8d6cad6

Author: lin.zhang
[to #CZPDEV-24907] upgrade README.md
1 parent cb9dddc commit 8d6cad6

1,237 files changed

Lines changed: 145431 additions & 87 deletions

Makefile

Lines changed: 11 additions & 1 deletion
@@ -1,4 +1,4 @@
-.PHONY: help clean test test-unit test-integration test-integration-bin build build-fat install upload dev lint format generate-skills tag
+.PHONY: help clean test test-unit test-integration test-integration-bin build build-fat install upload dev lint format generate-skills sync-lakehouse-doc tag

 help:
 	@echo "cz-cli Makefile commands:"
@@ -10,6 +10,7 @@ help:
 	@echo " make lint - Run code linting (ruff)"
 	@echo " make format - Format code (ruff format)"
 	@echo " make generate-skills - Regenerate bundled SKILL.md from Click command tree"
+	@echo " make sync-lakehouse-doc - Sync lakehouse_doc repo to skills/lakehouse-doc/references"
 	@echo " make build - Build distribution packages"
 	@echo " make build-fat - Build standalone binaries (supports multi-version)"
 	@echo " make install - Install package in editable mode"
@@ -67,6 +68,15 @@ generate-skills:
 	python scripts/generate_skills.py
 	@echo "✅ Skill docs generated"

+LAKEHOUSE_DOC_SRC ?= $(HOME)/IdeaProjects/lakehouse_doc
+LAKEHOUSE_DOC_DST := cz_cli/skills/lakehouse-doc/references
+
+sync-lakehouse-doc:
+	@echo "📚 Syncing lakehouse-doc from $(LAKEHOUSE_DOC_SRC)..."
+	@mkdir -p $(LAKEHOUSE_DOC_DST)
+	rsync -a --delete --exclude='.git' --exclude='.idea' --exclude='.topwrite' --exclude='asset' --exclude='.DS_Store' --exclude='*.png' $(LAKEHOUSE_DOC_SRC)/ $(LAKEHOUSE_DOC_DST)/
+	@echo "✅ Synced $$(find $(LAKEHOUSE_DOC_DST) -type f | wc -l | tr -d ' ') files"
+
 build-pkg: clean generate-skills
 	@echo "📦 Building distribution packages..."
 	python -m build
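
Review note on `sync-lakehouse-doc`: the `rsync -a --delete --exclude=...` call mirrors the source repo into the skill's `references/` directory, skipping the listed names and removing files that no longer exist upstream. The sketch below approximates that behavior in Python so the exclude and delete semantics are easier to reason about. It is an illustration under stated assumptions, not what the Makefile runs: `mirror` and `is_excluded` are hypothetical names, and the sketch ignores empty directories, symlinks, and file permissions, all of which `rsync -a` handles.

```python
import fnmatch
import shutil
from pathlib import Path

# Patterns copied from the rsync command in the Makefile hunk.
EXCLUDES = [".git", ".idea", ".topwrite", "asset", ".DS_Store", "*.png"]


def is_excluded(rel: Path) -> bool:
    """True if any path component matches one of the exclude patterns."""
    return any(
        fnmatch.fnmatchcase(part, pattern)
        for part in rel.parts
        for pattern in EXCLUDES
    )


def mirror(src: Path, dst: Path) -> int:
    """Copy non-excluded files from src to dst and delete stale files in dst.

    Returns the number of files copied.
    """
    dst.mkdir(parents=True, exist_ok=True)
    wanted = {
        p.relative_to(src)
        for p in src.rglob("*")
        if p.is_file() and not is_excluded(p.relative_to(src))
    }
    for rel in wanted:
        target = dst / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src / rel, target)

    # Like --delete without --delete-excluded: excluded names that already
    # exist in dst are left alone. Reverse order visits children first.
    for p in sorted(dst.rglob("*"), reverse=True):
        rel = p.relative_to(dst)
        if is_excluded(rel):
            continue
        if p.is_file() and rel not in wanted:
            p.unlink()
        elif p.is_dir() and not any(p.iterdir()):
            p.rmdir()
    return len(wanted)
```

One consequence worth keeping in mind: because `--delete` is used, any file edited by hand under `references/` is overwritten or removed on the next sync.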

cz_cli/SKILL.template.md

Lines changed: 20 additions & 2 deletions
@@ -48,9 +48,27 @@ Consult **Command Risk Reference** below to determine whether a command qualifie
 The following companion skills can be installed alongside `cz-cli` via `cz-cli install-skills`:

 - **lakehouse-python-sdk** — Lakehouse Python/Shell task engineering: develop, rewrite, optimize, troubleshoot. Covers connector, BulkLoad, IGS, Studio params, datasource, and CREATE TABLE DDL; MUST trigger on: develop/write/create/modify/rewrite/optimize Python or Shell task; BulkLoad batch upload; IGS realtime ingest; connector query/write; Python task error/diagnosis; CREATE TABLE / partition / bucket / index DDL.
+- **lakehouse-doc** — ClickZetta Lakehouse official documentation. Covers SQL syntax, data types, functions, DDL/DML commands, dynamic tables, materialized views, access control, VCluster, data lake, AI functions, etc. When the user writes or asks about SQL syntax or Lakehouse dialect specifics, **MUST** consult lakehouse-doc skill references first to ensure accuracy.

 **If no row matches**: proceed with cz-cli commands directly.

+### Rule 2.1 — MUST consult lakehouse-doc skill for SQL commands
+
+When the user request involves any of the following, **MUST** consult **lakehouse-doc** skill references before answering or generating SQL:
+
+- Writing, modifying, or optimizing SQL statements (DDL / DML / DQL)
+- Asking about ClickZetta Lakehouse SQL dialect syntax, keywords, or function usage
+- Using data types, type casting, or datetime formats
+- Creating or altering tables, views, materialized views, dynamic tables, external tables
+- Data import/export (COPY INTO, PUT/GET, BulkLoad, Pipe)
+- Access control (GRANT / REVOKE), roles, permissions
+- VCluster configuration and management
+- Index creation and usage (inverted index, BloomFilter, vector index)
+- AI functions, vector search, semantic views
+- Information Schema system view queries
+
+**Rationale**: ClickZetta Lakehouse SQL dialect differs from standard SQL and other databases. Relying on general knowledge may produce incorrect syntax. Consulting lakehouse-doc significantly improves answer accuracy and confidence.
+
 ### Rule 3 — Development ends at save; execution requires separate authorization

 When the user says "develop", "write", "create", or "modify" a task, the work is **complete once the script is saved successfully**.
@@ -80,7 +98,7 @@ The Python SDK(connector/igs/bulkload) is using from clickzetta import connect i

 ### Rule 6 — Flow nodes use Flow-specific tools exclusively

-When the operation target is a Flow task or any of its child nodes (`task_type=500`, or user mentions "组合任务 / flow / 工作流"):
+When the operation target is a Flow task or any of its child nodes (`task_type=500`, or user mentions "composite task / flow / workflow"):

 - **MUST** use Flow-specific commands: `task flow node-detail`, `task flow node-save`, `task flow node-save-config`, `task flow bind`, `task flow submit`, etc.
 - **MUST NOT** use `task save`, `task save-config`, `task detail`, or `task online` on Flow child nodes — these tools are for regular (non-Flow) tasks only and will produce incorrect results or errors.
@@ -90,7 +108,7 @@ When the operation target is a Flow task or any of its child nodes (`task_type=5

 Responses from `task`, `runs`, and `executions` commands may include a `studio_url` field. When present, surface it in the end to the user so they can open the resource directly in Studio.

-Display as a Markdown hyperlink: `[在 Studio 中查看](https://...)`. Show all studio_url values returned across all commands in the same response — do not deduplicate.
+Display as a Markdown hyperlink: `[View in Studio](https://...)`. Show all studio_url values returned across all commands in the same response — do not deduplicate.

 ### Rule 8 — Maximize execution efficiency: parallel and chained commands

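
Review note on the `studio_url` rule in the last hunk above: the rule asks for one Markdown hyperlink per returned `studio_url`, in order, with no deduplication. A minimal sketch of that rendering follows. It assumes each command response is a flat dict with an optional top-level `studio_url` key; that shape, the function name `studio_links`, and the example URLs are assumptions for illustration and are not taken from cz-cli.

```python
def studio_links(responses: list[dict]) -> list[str]:
    """Return one Markdown link per studio_url, in order, without deduplicating."""
    return [
        f"[View in Studio]({resp['studio_url']})"
        for resp in responses
        if resp.get("studio_url")
    ]


# Hypothetical responses: two commands returned the same URL, one returned none.
responses = [
    {"studio_url": "https://example.invalid/task/1"},
    {"task_id": 2},
    {"studio_url": "https://example.invalid/task/1"},
]
links = studio_links(responses)
```

Here `links` holds two entries even though the URLs are identical, which matches the "do not deduplicate" wording.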

cz_cli/commands/sql.py

Lines changed: 6 additions & 71 deletions
@@ -16,6 +16,7 @@
 from cz_cli.connection_ctx import connection_kwargs_from_ctx
 from cz_cli.logger import log_operation
 from cz_cli.masking import mask_rows
+from clickzetta.connector.v0.utils import split_sql as _split_sql_statements

 _WRITE_RE = re.compile(
     r"^\s*(INSERT|UPDATE|DELETE|REPLACE|ALTER|CREATE|DROP|TRUNCATE|RENAME|FORK)\b",
@@ -32,67 +33,6 @@
 DEFAULT_TRUNCATE_LEN = 3000


-def _split_sql_statements(sql: str) -> list[str]:
-    """Split *sql* on semicolons outside '...', \"...\", and `...`."""
-    text = sql.strip()
-    if not text:
-        return []
-    parts: list[str] = []
-    buf: list[str] = []
-    n = len(text)
-    i = 0
-    in_single = in_double = in_backtick = False
-
-    def flush() -> None:
-        s = "".join(buf).strip()
-        buf.clear()
-        if s:
-            parts.append(s)
-
-    while i < n:
-        c = text[i]
-        if in_backtick:
-            buf.append(c)
-            if c == "`":
-                in_backtick = False
-            i += 1
-            continue
-        if in_single:
-            buf.append(c)
-            if c == "'":
-                if i + 1 < n and text[i + 1] == "'":
-                    buf.append(text[i + 1])
-                    i += 2
-                    continue
-                in_single = False
-            i += 1
-            continue
-        if in_double:
-            buf.append(c)
-            if c == '"':
-                in_double = False
-            elif c == "\\" and i + 1 < n:
-                buf.append(text[i + 1])
-                i += 2
-                continue
-            i += 1
-            continue
-        if c == "'":
-            in_single = True
-            buf.append(c)
-        elif c == '"':
-            in_double = True
-            buf.append(c)
-        elif c == "`":
-            in_backtick = True
-            buf.append(c)
-        elif c == ";":
-            flush()
-        else:
-            buf.append(c)
-        i += 1
-    flush()
-    return parts if parts else [text]


 def _extract_tables_from_sql(sql: str) -> list[str]:
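
Review note on the hunk above: the hand-written splitter is replaced by the SDK's `split_sql`, whose behavior is not visible in this diff. A small set of characterization cases can show whether a replacement splitter still keeps semicolons inside quotes and backticks intact. The sketch below is one way to do that. `reference_split`, `CASES`, and `check_splitter` are hypothetical names, and `reference_split` is a deliberately simplified stand-in: unlike the removed function, it does not handle backslash escapes inside double quotes and returns an empty list for input that contains only semicolons.

```python
from typing import Callable


def reference_split(sql: str) -> list[str]:
    """Simplified splitter: break on semicolons outside '...', "..." and `...`."""
    parts: list[str] = []
    buf: list[str] = []
    quote = None  # the quote character we are currently inside, if any
    for ch in sql:
        if quote:
            buf.append(ch)
            if ch == quote:
                quote = None
        elif ch in "'\"`":
            quote = ch
            buf.append(ch)
        elif ch == ";":
            parts.append("".join(buf).strip())
            buf = []
        else:
            buf.append(ch)
    parts.append("".join(buf).strip())
    return [p for p in parts if p]


# Inputs mapped to the statements a quote-aware splitter should return.
CASES = {
    "SELECT 1; SELECT 2": ["SELECT 1", "SELECT 2"],
    "SELECT 'a;b'": ["SELECT 'a;b'"],
    "SELECT `x;y` FROM t;": ["SELECT `x;y` FROM t"],
    "SELECT 'it''s;ok'; SELECT 2": ["SELECT 'it''s;ok'", "SELECT 2"],
}


def check_splitter(split: Callable[[str], list[str]]) -> list[str]:
    """Return the inputs for which *split* disagrees with the expected output."""
    return [sql for sql, want in CASES.items() if split(sql) != want]


mismatches = check_splitter(reference_split)
```

Passing the imported `_split_sql_statements` to `check_splitter` in a test would flag any of these cases where the SDK function behaves differently, assuming it also returns a list of stripped statement strings.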
@@ -463,14 +403,13 @@ def _execute(
     if has_user_limit:
         user_limit = _extract_limit_value(sql_text)

-    # Prepare hints
-    hints = {}
-    if timeout:
-        hints["sdk.job.timeout"] = timeout
-
     try:
         cursor = conn.cursor()
         try:
+            # Set timeout via SET statement
+            if timeout:
+                cursor.execute(f"SET sdk.job.timeout={timeout}")
+
             # Set user-provided SQL flags
             for key, value in flags.items():
                 cursor.execute(f"SET {key}={value}")
@@ -495,11 +434,7 @@ def _execute(
                 if variables:
                     exec_stmt = exec_stmt % variables

-                # Execute with hints
-                if hints:
-                    cursor.execute(exec_stmt, hints=hints)
-                else:
-                    cursor.execute(exec_stmt)
+                cursor.execute(exec_stmt)

                 if cursor.description is not None:
                     last_description = cursor.description
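
Review note on the two `_execute` hunks above: the timeout moves from a per-call `hints` argument to a `SET sdk.job.timeout=...` statement issued once, before the user-provided flags and before any statement runs. The sketch below records the resulting statement order with a stand-in cursor. `RecordingCursor`, `run_with_settings`, and the flag name `some.flag` are hypothetical; only the `SET` statement text comes from the diff, and whether the setting persists for the rest of the session is up to the SDK and is not shown here.

```python
from typing import Optional


class RecordingCursor:
    """Hypothetical stand-in for a DB-API cursor; records what it is asked to run."""

    def __init__(self) -> None:
        self.executed: list[str] = []

    def execute(self, statement: str) -> None:
        self.executed.append(statement)


def run_with_settings(
    cursor: RecordingCursor,
    statements: list[str],
    timeout: Optional[int] = None,
    flags: Optional[dict[str, str]] = None,
) -> None:
    """Issue the timeout and flag SET statements first, then each statement."""
    if timeout:
        cursor.execute(f"SET sdk.job.timeout={timeout}")
    for key, value in (flags or {}).items():
        cursor.execute(f"SET {key}={value}")
    for stmt in statements:
        cursor.execute(stmt)


cursor = RecordingCursor()
run_with_settings(cursor, ["SELECT 1", "SELECT 2"], timeout=30, flags={"some.flag": "1"})
```

After the call, `cursor.executed` starts with the two `SET` statements, followed by the two queries, with no per-statement hints involved.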
Lines changed: 107 additions & 0 deletions
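@@ -0,0 +1,107 @@
+---
+name: lakehouse-doc
+description: "Official ClickZetta Lakehouse documentation knowledge base. For any Lakehouse question (writing SQL, looking up syntax, functions, or data types, DDL/DML, dynamic tables, permissions, virtual clusters, data lake, AI functions, and so on), you MUST consult this skill's references/ directory."
+---
+
+# lakehouse-doc
+
+Official ClickZetta Lakehouse documentation. Locate the relevant document under `references/` by file name, based on the user's question.
+
+## references/ directory structure
+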
+```
+references/
+├── *.md                          # 778 main documents (named by topic; see the index below)
+├── eco_integration/              # Ecosystem tool integrations (12 docs)
+│   ├── dbt.md, superset.md, datagrip-lakehouse.md, trino.md ...
+├── java_reference/               # Java SDK (5 docs)
+│   ├── java-sdk-summary.md, jdbc.md, realtime-upload.md, client.md ...
+├── python_reference/             # Python SDK (3 docs)
+│   ├── connector.md, sqlalchemy.md, python-sdk-summary.md
+├── opensource/                   # Open-source tools (1 doc)
+│   └── travel.md
+└── sql_functions/                # SQL function reference (339 docs)
+    ├── aggregate_functions/      # Aggregate functions (52): count.md, sum.md, avg.md ...
+    ├── window_functions/         # Window functions (19): row_number.md, rank.md, lag.md ...
+    ├── table_functions/          # Table functions (9): table_changes.md ...
+    ├── context_functions/        # Context functions (8): current_user.md ...
+    └── scalar_functions/         # Scalar functions (339)
+        ├── datetime_functions/   # Date and time (66)
+        ├── string_functions/     # String (70)
+        ├── math_functions/       # Math (55)
+        ├── nested_functions/     # Nested types (45)
+        ├── bitmap_functions/     # Bitmap (29)
+        ├── json_functions/       # JSON (14)
+        ├── conditional_functions/# Conditional (14)
+        ├── high_order_functions/ # Higher-order (12)
+        ├── vector_functions/     # Vector (11)
+        ├── ip_functions/         # IP (6)
+        ├── search_functions/     # Search (6)
+        ├── hash_functions/       # Hash (5)
+        ├── geo_functions/        # Geo (3)
+        ├── bitwise_functions/    # Bitwise (2)
+        └── partition/            # Partition (1)
+```
+
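+## Document index (llms.txt)
+
+# ClickZetta Lakehouse Documentation (LLM navigation)
+
+> ClickZetta Lakehouse is a fully managed lakehouse platform, built from scratch on cloud-native design principles. Through **compute-storage separation**, a **Serverless elastic architecture**, **open storage formats**, and **AI-optimized tooling**, it gives enterprises a unified platform for data warehousing, data lakes, real-time processing, and BI reporting.
+
+## Quick start
+
+- [Overview](https://www.yunqi.tech/documents/Overview): Introduces ClickZetta Lakehouse's compute-storage separation architecture, Serverless compute, open data formats, and main use cases.
+- [Product concepts](https://www.yunqi.tech/documents/Concepts): Introduces ClickZetta Lakehouse's compute-storage separation architecture, Serverless compute, open data formats, and main use cases.
+- [Getting started](https://www.yunqi.tech/documents/Tutorials): Walks through data import, SQL queries, and data visualization to complete the full flow from data ingestion to analysis and presentation.
+
+## User guide
+
+- [Studio](https://www.yunqi.tech/documents/studio_manual): Develop and manage data through a web UI, with support for connecting data sources, SQL queries, job orchestration, result visualization, and asset catalog browsing.
+- [Object model](https://www.yunqi.tech/documents/object_model_design): Introduces the core concepts of the ClickZetta Lakehouse object model, including the definitions and hierarchy of catalogs, databases, tables, views, materialized views, functions, and shares.
+- [Data ingestion](https://www.yunqi.tech/documents/Ingestion): Import data from local files, databases, Kafka, and other sources; covers core concepts, configuration steps, and worked examples.
+- [Data transformation](https://www.yunqi.tech/documents/Transformation): Explains the core concepts, key configuration, and typical steps of data transformation, with examples and caveats.
+- [Data analysis](https://www.yunqi.tech/documents/Analysis): End-to-end guide from data import and SQL queries to visual analysis, covering data source connections, SQL syntax, function usage, and result export.
+- [Security](https://www.yunqi.tech/documents/data_security): User management, access control, audit logs, and other security features.
+- [Data sharing](https://www.yunqi.tech/documents/data_share): Explains the core concepts, key configuration, and typical steps of data sharing.
+- [Private network connection](https://www.yunqi.tech/documents/connect_to_Lakehouse): Configure endpoints for secure private-network access to cloud services across VPCs or from an on-premises IDC.
+- [Benchmarks](https://www.yunqi.tech/documents/benchmark): Core concepts, key configuration, and typical steps for performance testing.
+- [Ecosystem tools](https://www.yunqi.tech/documents/tools): Core concepts, key configuration, and typical steps for ecosystem tools.
+- [Insight](https://www.yunqi.tech/documents/Lakehouse_Insight): Connect to a ClickZetta Lakehouse data source, create datasets, and build BI reports and dashboards by drag and drop.
+
+## SQL manual
+
+- [SQL commands](https://www.yunqi.tech/documents/sql-reference): Complete syntax reference for DDL, DML, DQL, and other SQL commands.
+- [Data types](https://www.yunqi.tech/documents/data-type): Definitions of exact numeric, floating-point, string, datetime, boolean, and other data types.
+- [SQL functions](https://www.yunqi.tech/documents/functions): Core concepts and usage examples for SQL functions.
+- [SQL usage guide](https://www.yunqi.tech/documents/considerations-for-using-sql): Considerations and best practices for using SQL.
+
+## Developer manual
+
+- [Java SDK reference](https://www.yunqi.tech/documents/java-sdk-refer): Core concepts, key configuration, and typical steps for the Java SDK.
+- [Python SDK reference](https://www.yunqi.tech/documents/python-sdk-refer): Core concepts, key configuration, and typical steps for the Python SDK.
+
+## Hands-on tutorials
+
+- [Managing objects and organizing data efficiently](https://www.yunqi.tech/documents/data_org): Creating and managing data objects; catalog organization, permissions, and lifecycle policies.
+- [Data import and export in practice](https://www.yunqi.tech/documents/practice_data_import_and_export): Steps and examples for importing and exporting across multiple data sources.
+- [Data query and analysis in practice](https://www.yunqi.tech/documents/practice_data_analysis): End-to-end guide from data import to visual analysis.
+- [Building and operating ELT pipelines](https://www.yunqi.tech/documents/ELT_practice): Building enterprise-grade ELT pipelines, covering development, testing, deployment, and failure recovery.
+- [Optimizing compute resources](https://www.yunqi.tech/documents/OptimizingComputingResources): Virtual cluster configuration, elastic scaling policies, and resource monitoring.
+- [Performance experience](https://www.yunqi.tech/documents/performence_test): Performance testing methods, tuning advice, and monitoring metrics.
+- [Building a Modern Data Stack](https://www.yunqi.tech/documents/ModernDataStackWithEcosystemTools): Core components and architecture patterns of the modern data stack.
+- [AI application development](https://www.yunqi.tech/documents/REMOTEFUNCTION): The AI application development flow from data preparation and model training to service deployment.
+- [Security and compliance auditing](https://www.yunqi.tech/documents/security_compliance_audit_guide): Permission management, SQL audit logs, data masking policies, and compliance configuration.
+- [Usage and cost management](https://www.yunqi.tech/documents/cost_management): Usage details, cost breakdown, billing models, and budget management.
+
+## Lakehouse AI
+
+- [Lakehouse AI overview](https://www.yunqi.tech/documents/LakehouseAI_overview): Unstructured data management, AI external functions, multimodal retrieval, a Python development framework, and conversational analytics.
+- [Data preparation for AI](https://www.yunqi.tech/documents/Server_data_for_AI): Seamless combination of vector search, full-text search, and structured data analysis.
+- [AI functions](https://www.yunqi.tech/documents/AI_function_in_SQL): Create and use AI functions, with support for calling external AI services from Python/Java.
+- [Zettapark](https://www.yunqi.tech/documents/LakehousePythonZettapark): API reference for the Python development framework.
+- [AI + BI unified workflow](https://www.yunqi.tech/documents/unifiedWorkflow): Generate SQL queries and visualizations through natural-language interaction.
+- [AI Gateway](https://www.yunqi.tech/documents/AIGateway): Unified access, routing, load balancing, rate limiting, and circuit breaking.
+- [DataGPT](https://www.yunqi.tech/documents/datagpt_intro): Ask questions in natural language to generate SQL directly and get visual charts.
+- [Lakehouse MCP Server](https://www.yunqi.tech/documents/LakehouseMCPServer): Exposes lakehouse capabilities to AI assistants via MCP.
+- [AI ecosystem](https://www.yunqi.tech/documents/AI_eco): Integrations with PyTorch, TensorFlow, MLflow, LangChain, and others.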
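
Review note on the skill file above: it tells the agent to locate documents under `references/` by file name. One possible reading of that lookup is a case-insensitive substring match on the file stem, sketched below. `find_docs` is a hypothetical helper written for this note, not part of cz-cli, and the directory layout in the usage lines only mirrors the tree shown in the skill file.

```python
from pathlib import Path


def find_docs(references: Path, keyword: str) -> list[Path]:
    """Return Markdown files under *references* whose file name contains *keyword*.

    Paths are returned relative to *references*, sorted for stable output.
    """
    needle = keyword.lower()
    return sorted(
        p.relative_to(references)
        for p in references.rglob("*.md")
        if needle in p.stem.lower()
    )
```

For example, `find_docs(Path("cz_cli/skills/lakehouse-doc/references"), "row_number")` would return the relative path of `row_number.md` if the synced tree contains it. Matching on file names alone will miss topics whose documents are named differently from the user's wording, so a content search may still be needed as a fallback.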
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+.topwrite/assets/** filter=lfs diff=lfs merge=lfs -text
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+pages:
+  stage: deploy
+  script:
+    - mkdir .public
+    - cp -r * .public
+    - mv .public public
+  artifacts:
+    paths:
+      - public
+  only:
+    - gh-pages
