feat: Add Structure Note with VectifyAI PageIndex #329
txxxxz wants to merge 5 commits into HKUDS:main
…note # Conflicts: # web/components/sidebar/SidebarShell.tsx # web/lib/latex.ts
Pull request overview
Adds the end-to-end Structure Note workspace to DeepTutor, switching the indexing backend to VectifyAI/PageIndex and wiring up the full pipeline (upload/KB source → page indexing → section planning → markdown generation → image filling → PDF rendering) across backend APIs, frontend UI, tests, and docs.
Changes:
- Introduces backend Structure Note subsystem (router, manager/service modules, storage/pathing, PDF renderer) powered by VectifyAI/PageIndex.
- Adds frontend Structure Note workspace integration (sidebar entry, API client, i18n strings, perf route budget, e2e audit).
- Adds focused test coverage for the new service + API contract, and updates runtime path/config guards.
Reviewed changes
Copilot reviewed 55 out of 56 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| web/tests/e2e/structure-note.audit.ts | Adds Playwright accessibility/UX smoke coverage for the Structure Note page. |
| web/scripts/route_budgets.mjs | Adds /structure-note route budget + formatting updates. |
| web/package-lock.json | Lockfile metadata update (marks fsevents as dev). |
| web/locales/zh/app.json | Adds Structure Note UI strings; modifies/removes some existing translations. |
| web/locales/en/app.json | Adds Structure Note UI strings; modifies/removes some existing translations. |
| web/lib/structure-note-api.ts | New typed client API wrapper for Structure Note routes + client-side caching. |
| web/lib/latex.ts | Adjusts markdown/LaTeX normalization behavior and code style. |
| web/components/ui/Button.tsx | Minor UI behavior/style tweak (transition + formatting). |
| web/components/sidebar/SidebarShell.tsx | Adds Structure Note nav entry and localizes brand text. |
| tests/services/test_structure_note_service.py | Adds unit/integration tests for Structure Note pipeline pieces (planning/chunking/markdown/images). |
| tests/services/test_runtime_storage_guard.py | Adds runtime path confinement assertion for Structure Note workspace. |
| tests/services/test_path_service.py | Extends output allowlist tests for Structure Note final artifacts/images. |
| tests/api/test_structure_note_router.py | Adds router contract tests for jobs/projects/KB sources/retry flows. |
| requirements/server.txt | Adds WeasyPrint dependency for server-side PDF export. |
| requirements/cli.txt | Adds Markdown/LiteLLM/PyPDF2 + pins pydantic <2.12 + bumps PyMuPDF. |
| pyproject.toml | Mirrors dependency additions and server extra updates. |
| docs/zh/index.md | Adds Chinese docs home page. |
| docs/zh/guide/troubleshooting.md | Adds Chinese troubleshooting page. |
| docs/zh/guide/pre-config.md | Adds Chinese pre-config guide. |
| docs/zh/guide/local-start.md | Adds Chinese local install guide. |
| docs/zh/guide/local-conda-cursor.md | Adds Chinese conda+Cursor setup guide. |
| docs/zh/guide/data-preparation.md | Adds Chinese data preparation guide. |
| docs/zh/features/overview.md | Adds Chinese features overview. |
| docs/testdoc/structure-note-technical-plan.md | Adds internal technical plan doc for Structure Note. |
| docs/testdoc/structure-note-prd.md | Adds internal PRD doc for Structure Note. |
| docs/roadmap.md | Adds roadmap doc page. |
| docs/index.md | Adds English docs home page. |
| docs/guide/troubleshooting.md | Adds English troubleshooting page. |
| docs/guide/pre-config.md | Adds English pre-config guide. |
| docs/guide/local-start.md | Adds English local install guide. |
| docs/guide/data-preparation.md | Adds English data preparation guide. |
| docs/features/overview.md | Adds English features overview page. |
| deeptutor/services/structure_note/tree_builder.py | Adds LLM-driven section tree builder with fallback logic. |
| deeptutor/services/structure_note/storage.py | Adds artifact/project persistence + retention policy logic. |
| deeptutor/services/structure_note/renderer.py | Adds Markdown→HTML→PDF rendering (Markdown + WeasyPrint) + citation manifest output. |
| deeptutor/services/structure_note/planner.py | Adds DocumentPlan builder from section tree + page-grounded evidence. |
| deeptutor/services/structure_note/page_index.py | Integrates VectifyAI/PageIndex submodule; extracts page artifacts via PyMuPDF; maps to internal models. |
| deeptutor/services/structure_note/normalizer.py | Adds PDF/PPT/PPTX normalization (LibreOffice conversion). |
| deeptutor/services/structure_note/models.py | Adds Structure Note Pydantic models/enums for artifacts, plans, chunks, citations, placeholders. |
| deeptutor/services/structure_note/markdown_postprocessor.py | Adds renderer-compatible markdown/math normalization + validation. |
| deeptutor/services/structure_note/image_pipeline.py | Adds placeholder scanning + crop/fallback image extraction + image citations. |
| deeptutor/services/structure_note/difficulty.py | Adds difficulty presets controlling chunk windows and instructions. |
| deeptutor/services/structure_note/__init__.py | Exports Structure Note public surface. |
| deeptutor/services/setup/__init__.py | Adds /structure-note to sidebar ordering + documents workspace directory in init docstring. |
| deeptutor/services/path_service.py | Adds workspace dirs + public output allowlist for Structure Note final/pdf/md and images. |
| deeptutor/services/config/loader.py | Adds default main.yaml baseline and injects Structure Note runtime paths. |
| deeptutor/services/__init__.py | Exposes structure_note module via lazy import. |
| deeptutor/api/routers/system.py | Adds Structure Note to runtime topology. |
| deeptutor/api/routers/structure_note.py | Adds Structure Note API: jobs/projects/KB source listing/retry + SSE stream endpoint. |
| deeptutor/api/routers/settings.py | Adds /structure-note to default sidebar ordering. |
| deeptutor/api/main.py | Registers the Structure Note router. |
| .gitmodules | Adds VectifyAI/PageIndex as a git submodule. |
Files not reviewed (1)
- web/package-lock.json: Language not supported
"Delete this entry?": "Delete this entry?",
"Failed to load entries": "Failed to load entries",
"Retry": "Retry",
"Your pick": "Your pick",
"No categories yet.": "No categories yet.",
"Original Session": "Original Session"
"Original Session": "Original Session",
The translation keys "Delete" and "Retry" were removed from this locale file, but they are still used in the UI (e.g., t("Delete") in web/app/(utility)/knowledge/page.tsx and t("Retry") in multiple pages). This will cause missing translations (fallback to the key). Please re-add these keys (and any other removed, still-referenced keys) to app.json or move them to the appropriate shared namespace used by those components.
"Delete this entry?": "确定删除此条目吗?",
"Failed to load entries": "加载失败",
"Retry": "重试",
"Your pick": "你的选择",
"No categories yet.": "暂无分类。",
"Original Session": "原始会话"
"Original Session": "原始会话",
The translation keys "Delete" and "Retry" were removed from this locale file, but they are still used in the UI (e.g., t("Delete") in web/app/(utility)/knowledge/page.tsx and t("Retry") in multiple pages). This will cause missing translations (fallback to the key). Please re-add these keys (and any other removed, still-referenced keys) to app.json or move them to the appropriate shared namespace used by those components.
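One way to catch this class of regression before merge is a small check that compares the keys referenced in the UI source against the keys present in a locale file. The sketch below is only illustrative: `referenced_keys` and `missing_keys` are hypothetical helper names, and it assumes that keys are passed to `t(...)` as plain string literals and that `app.json` is a flat object, as the excerpt above suggests.

```python
import json
import re

# Matches t("Key") or t('Key'). Assumption: keys are plain string literals,
# not template strings or variables.
_T_CALL = re.compile(r"""\bt\(\s*(["'])(.+?)\1\s*\)""")


def referenced_keys(source: str) -> set[str]:
    """Collect every translation key passed to t(...) in a source string."""
    return {match.group(2) for match in _T_CALL.finditer(source)}


def missing_keys(source: str, locale_json: str) -> set[str]:
    """Return keys used in the source that are absent from a flat locale file."""
    locale = json.loads(locale_json)
    return referenced_keys(source) - set(locale)
```

Run over each page component and each locale file, a non-empty result would point at keys such as "Delete" that are still referenced but no longer defined.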
html_body = markdown(html_ready_markdown, extensions=["extra", "fenced_code", "tables", "toc"])
html = (
    "<!doctype html><html><head><meta charset='utf-8'>"
    f"<title>{title}</title><style>{_STYLE}</style>"
    "</head><body>"
    f"{html_body}</body></html>"
)

pdf_path = final_dir / "final.pdf"
HTML(string=html, base_url=str(job_dir)).write_pdf(str(pdf_path))
markdown() will pass through raw HTML, and WeasyPrint will fetch referenced resources (e.g., <img src=...>, CSS url(...)) when rendering. Since this Markdown is ultimately LLM-generated and grounded in user-provided documents, this can enable SSRF (http/https fetches) and local file reads (file://...) during PDF render. Consider: (1) sanitizing/escaping raw HTML in the Markdown-to-HTML step, and (2) providing a restrictive url_fetcher to WeasyPrint that only allows local, whitelisted paths under job_dir (block remote schemes and file:// outside the workspace).
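The second suggestion, a restrictive fetcher, could look roughly like the sketch below. It is not the PR's code: `is_allowed_resource` and `make_url_fetcher` are hypothetical names, the check assumes POSIX-style paths, and it rejects every scheme other than `file` (including `data:` URIs), which may be stricter than the renderer needs.

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def is_allowed_resource(url: str, job_dir: Path) -> bool:
    """Return True only for file:// URLs that resolve inside job_dir."""
    parsed = urlparse(url)
    if parsed.scheme != "file":
        return False  # rejects http, https, ftp, data, and anything else
    if parsed.netloc not in ("", "localhost"):
        return False  # rejects file URLs that name a remote host
    # resolve() collapses ".." segments and symlinks before the containment test
    candidate = Path(unquote(parsed.path)).resolve()
    root = job_dir.resolve()
    return candidate == root or root in candidate.parents


def make_url_fetcher(job_dir: Path, delegate):
    """Wrap a fetcher so it is only called for resources inside job_dir."""

    def fetch(url, *args, **kwargs):
        if not is_allowed_resource(url, job_dir):
            raise ValueError(f"Blocked resource URL: {url}")
        return delegate(url, *args, **kwargs)

    return fetch
```

Assuming the installed WeasyPrint version exposes `default_url_fetcher` and the `url_fetcher` argument of `HTML`, the wrapper would be passed as `HTML(string=html, base_url=str(job_dir), url_fetcher=make_url_fetcher(job_dir, default_url_fetcher))`. This only limits what the renderer may fetch; raw HTML in the Markdown would still need sanitizing separately.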
with open(target_path, "wb") as handle:
    for chunk in iter(lambda: file.file.read(8192), b""):
        written_bytes += len(chunk)
        if written_bytes > DocumentValidator.MAX_FILE_SIZE:
            raise HTTPException(status_code=400, detail="Uploaded file exceeds the size limit.")
        handle.write(chunk)
This endpoint is async, but _save_upload reads the upload via file.file.read(...) (blocking file I/O) in a tight loop. For large uploads this can block the event loop and degrade API responsiveness. Prefer using await file.read(...) (Starlette’s async UploadFile API) or streaming the copy via asyncio.to_thread/background thread to keep the event loop responsive.
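A minimal sketch of the thread-offloading variant follows, using only the standard library. The names `UploadTooLargeError`, `copy_with_limit`, and `save_upload` are hypothetical, and the size limit is passed in as a parameter rather than read from `DocumentValidator`. The blocking copy loop is the same as in the excerpt; it simply runs in a worker thread via `asyncio.to_thread` (Python 3.9+).

```python
import asyncio
from pathlib import Path
from typing import BinaryIO


class UploadTooLargeError(Exception):
    """Raised when the upload exceeds the configured size limit."""


def copy_with_limit(
    source: BinaryIO, target_path: Path, max_bytes: int, chunk_size: int = 8192
) -> int:
    """Blocking chunked copy that enforces a size limit; returns bytes written."""
    written = 0
    try:
        with open(target_path, "wb") as handle:
            for chunk in iter(lambda: source.read(chunk_size), b""):
                written += len(chunk)
                if written > max_bytes:
                    raise UploadTooLargeError(f"upload exceeds {max_bytes} bytes")
                handle.write(chunk)
    except UploadTooLargeError:
        # Do not leave a truncated file behind when the limit is hit.
        target_path.unlink(missing_ok=True)
        raise
    return written


async def save_upload(source: BinaryIO, target_path: Path, max_bytes: int) -> int:
    """Run the blocking copy in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(copy_with_limit, source, target_path, max_bytes)
```

In the router, the handler would presumably call `await save_upload(file.file, target_path, DocumentValidator.MAX_FILE_SIZE)` and translate `UploadTooLargeError` into the existing `HTTPException`.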
uvicorn[standard]>=0.24.0
websockets>=12.0
python-multipart>=0.0.6
WeasyPrint>=62.0
Adding WeasyPrint as a runtime dependency can break installs on some Linux environments unless required system libraries (e.g., Pango/Cairo/GLib) are present. If this repo’s install path relies on pip install -r requirements/server.txt in CI/Docker, please ensure the corresponding OS packages are installed (and document them in install scripts/docs), or consider making PDF export an optional extra with a clearer installation path.
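If PDF export were made optional, the renderer could probe for a usable backend at call time instead of importing WeasyPrint at module load. The sketch below assumes that a missing system library surfaces as `ImportError` or `OSError` when `weasyprint` is imported; `load_pdf_backend` is a hypothetical helper name, not something in the PR.

```python
def load_pdf_backend():
    """Return (HTML class, None) if WeasyPrint is usable, else (None, reason)."""
    try:
        # Deferred import: fails if the package or its native libraries are absent.
        from weasyprint import HTML
    except (ImportError, OSError) as exc:
        return None, f"PDF export unavailable: {exc}"
    return HTML, None
```

A caller could then skip the PDF step and still deliver the Markdown output, reporting the returned reason to the user, when the first element is `None`.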
Summary
Validation
- .venv/bin/python -m pytest tests/services/test_structure_note_service.py tests/api/test_structure_note_router.py -q
- .venv/bin/python -m ruff check deeptutor/services/structure_note/page_index.py deeptutor/services/structure_note/manager.py tests/services/test_structure_note_service.py tests/api/test_structure_note_router.py
- npx eslint components/sidebar/SidebarShell.tsx lib/latex.ts
- git diff --check upstream/main...HEAD
Notes
- DEEPTUTOR_PAGEINDEX_MODEL.
- git submodule update --init --recursive third_party/PageIndex before running Structure Note PageIndex.