feat: Add Structure Note with VectifyAI PageIndex #329

Open
txxxxz wants to merge 5 commits into HKUDS:main from txxxxz:feat/add-structure-note

txxxxz commented Apr 15, 2026

Summary

  • Add the end-to-end Structure Note feature, including the service pipeline, API routes, web workspace page, docs, and focused tests.
  • Make VectifyAI/PageIndex the default page indexing backend for Structure Note instead of relying on the previous self-built Page Index path.
  • Use PageIndex's Graph-RAG-oriented document structure to improve section organization, note planning, and the wording of the final generated note.
  • Keep DeepTutor's local page-text and image-evidence extraction, then map PageIndex structures into DeepTutor section trees, document plans, chunks, and rendered notes.
  • Invalidate legacy Structure Note caches when old in-house Page Index artifacts are detected so generated notes are rebuilt consistently with the new PageIndex backend.
  • Add the required PageIndex runtime dependencies and pin pydantic below 2.12 to avoid the current Gradio dependency conflict.

Validation

  • .venv/bin/python -m pytest tests/services/test_structure_note_service.py tests/api/test_structure_note_router.py -q
  • .venv/bin/python -m ruff check deeptutor/services/structure_note/page_index.py deeptutor/services/structure_note/manager.py tests/services/test_structure_note_service.py tests/api/test_structure_note_router.py
  • npx eslint components/sidebar/SidebarShell.tsx lib/latex.ts
  • git diff --check upstream/main...HEAD
  • VectifyAI PageIndex import smoke test

Notes

  • PageIndex uses LiteLLM, so Structure Note PageIndex runs need a LiteLLM-compatible model configuration or DEEPTUTOR_PAGEINDEX_MODEL.
  • New checkouts need git submodule update --init --recursive third_party/PageIndex before running Structure Note PageIndex.
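
The two notes above can be turned into a small preflight check. This is a minimal sketch, not code from the PR: the helper name `pageindex_preflight` is hypothetical, and how DeepTutor resolves the model when `DEEPTUTOR_PAGEINDEX_MODEL` is unset is not shown here, so the sketch only reports that case.

```python
import os
from pathlib import Path


def pageindex_preflight(repo_root, env=None):
    """Return a list of problems; an empty list means the checks passed.

    Hypothetical helper: only the submodule path and the environment
    variable name are taken from the PR notes.
    """
    env = os.environ if env is None else env
    problems = []

    # An uninitialized git submodule is either absent or an empty directory.
    submodule = Path(repo_root) / "third_party" / "PageIndex"
    if not submodule.is_dir() or not any(submodule.iterdir()):
        problems.append(
            "third_party/PageIndex is missing or empty; run "
            "git submodule update --init --recursive third_party/PageIndex"
        )

    # The notes name this variable as one way to choose the PageIndex model.
    if not env.get("DEEPTUTOR_PAGEINDEX_MODEL"):
        problems.append(
            "DEEPTUTOR_PAGEINDEX_MODEL is unset; a LiteLLM-compatible "
            "model configuration is needed instead"
        )
    return problems
```

Calling `pageindex_preflight(".")` from the repository root before a Structure Note run would surface both setup problems in one place.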

Copilot AI review requested due to automatic review settings April 15, 2026 06:27
txxxxz changed the title from "Add Structure Note with VectifyAI PageIndex" to "feat: Add Structure Note with VectifyAI PageIndex" Apr 15, 2026

Copilot AI left a comment

Pull request overview

Adds the end-to-end Structure Note workspace to DeepTutor, switching the indexing backend to VectifyAI/PageIndex and wiring up the full pipeline (upload/KB source → page indexing → section planning → markdown generation → image filling → PDF rendering) across backend APIs, frontend UI, tests, and docs.

Changes:

  • Introduces backend Structure Note subsystem (router, manager/service modules, storage/pathing, PDF renderer) powered by VectifyAI/PageIndex.
  • Adds frontend Structure Note workspace integration (sidebar entry, API client, i18n strings, perf route budget, e2e audit).
  • Adds focused test coverage for the new service + API contract, and updates runtime path/config guards.

Reviewed changes

Copilot reviewed 55 out of 56 changed files in this pull request and generated 5 comments.

File Description
web/tests/e2e/structure-note.audit.ts Adds Playwright accessibility/UX smoke coverage for the Structure Note page.
web/scripts/route_budgets.mjs Adds /structure-note route budget + formatting updates.
web/package-lock.json Lockfile metadata update (marks fsevents as dev).
web/locales/zh/app.json Adds Structure Note UI strings; modifies/removes some existing translations.
web/locales/en/app.json Adds Structure Note UI strings; modifies/removes some existing translations.
web/lib/structure-note-api.ts New typed client API wrapper for Structure Note routes + client-side caching.
web/lib/latex.ts Adjusts markdown/LaTeX normalization behavior and code style.
web/components/ui/Button.tsx Minor UI behavior/style tweak (transition + formatting).
web/components/sidebar/SidebarShell.tsx Adds Structure Note nav entry and localizes brand text.
tests/services/test_structure_note_service.py Adds unit/integration tests for Structure Note pipeline pieces (planning/chunking/markdown/images).
tests/services/test_runtime_storage_guard.py Adds runtime path confinement assertion for Structure Note workspace.
tests/services/test_path_service.py Extends output allowlist tests for Structure Note final artifacts/images.
tests/api/test_structure_note_router.py Adds router contract tests for jobs/projects/KB sources/retry flows.
requirements/server.txt Adds WeasyPrint dependency for server-side PDF export.
requirements/cli.txt Adds Markdown/LiteLLM/PyPDF2 + pins pydantic <2.12 + bumps PyMuPDF.
pyproject.toml Mirrors dependency additions and server extra updates.
docs/zh/index.md Adds Chinese docs home page.
docs/zh/guide/troubleshooting.md Adds Chinese troubleshooting page.
docs/zh/guide/pre-config.md Adds Chinese pre-config guide.
docs/zh/guide/local-start.md Adds Chinese local install guide.
docs/zh/guide/local-conda-cursor.md Adds Chinese conda+Cursor setup guide.
docs/zh/guide/data-preparation.md Adds Chinese data preparation guide.
docs/zh/features/overview.md Adds Chinese features overview.
docs/testdoc/structure-note-technical-plan.md Adds internal technical plan doc for Structure Note.
docs/testdoc/structure-note-prd.md Adds internal PRD doc for Structure Note.
docs/roadmap.md Adds roadmap doc page.
docs/index.md Adds English docs home page.
docs/guide/troubleshooting.md Adds English troubleshooting page.
docs/guide/pre-config.md Adds English pre-config guide.
docs/guide/local-start.md Adds English local install guide.
docs/guide/data-preparation.md Adds English data preparation guide.
docs/features/overview.md Adds English features overview page.
deeptutor/services/structure_note/tree_builder.py Adds LLM-driven section tree builder with fallback logic.
deeptutor/services/structure_note/storage.py Adds artifact/project persistence + retention policy logic.
deeptutor/services/structure_note/renderer.py Adds Markdown→HTML→PDF rendering (Markdown + WeasyPrint) + citation manifest output.
deeptutor/services/structure_note/planner.py Adds DocumentPlan builder from section tree + page-grounded evidence.
deeptutor/services/structure_note/page_index.py Integrates VectifyAI/PageIndex submodule; extracts page artifacts via PyMuPDF; maps to internal models.
deeptutor/services/structure_note/normalizer.py Adds PDF/PPT/PPTX normalization (LibreOffice conversion).
deeptutor/services/structure_note/models.py Adds Structure Note Pydantic models/enums for artifacts, plans, chunks, citations, placeholders.
deeptutor/services/structure_note/markdown_postprocessor.py Adds renderer-compatible markdown/math normalization + validation.
deeptutor/services/structure_note/image_pipeline.py Adds placeholder scanning + crop/fallback image extraction + image citations.
deeptutor/services/structure_note/difficulty.py Adds difficulty presets controlling chunk windows and instructions.
deeptutor/services/structure_note/__init__.py Exports Structure Note public surface.
deeptutor/services/setup/__init__.py Adds /structure-note to sidebar ordering + documents workspace directory in init docstring.
deeptutor/services/path_service.py Adds workspace dirs + public output allowlist for Structure Note final/pdf/md and images.
deeptutor/services/config/loader.py Adds default main.yaml baseline and injects Structure Note runtime paths.
deeptutor/services/__init__.py Exposes structure_note module via lazy import.
deeptutor/api/routers/system.py Adds Structure Note to runtime topology.
deeptutor/api/routers/structure_note.py Adds Structure Note API: jobs/projects/KB source listing/retry + SSE stream endpoint.
deeptutor/api/routers/settings.py Adds /structure-note to default sidebar ordering.
deeptutor/api/main.py Registers the Structure Note router.
.gitmodules Adds VectifyAI/PageIndex as a git submodule.
Files not reviewed (1)
  • web/package-lock.json: Language not supported

Comment thread web/locales/en/app.json
Comment on lines 883 to +887
  "Delete this entry?": "Delete this entry?",
  "Failed to load entries": "Failed to load entries",
  "Retry": "Retry",
  "Your pick": "Your pick",
  "No categories yet.": "No categories yet.",
- "Original Session": "Original Session"
+ "Original Session": "Original Session",

Copilot AI Apr 15, 2026

The translation keys "Delete" and "Retry" were removed from this locale file, but they are still used in the UI (e.g., t("Delete") in web/app/(utility)/knowledge/page.tsx and t("Retry") in multiple pages). This will cause missing translations (fallback to the key). Please re-add these keys (and any other removed, still-referenced keys) to app.json or move them to the appropriate shared namespace used by those components.
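
A check along these lines could catch removed-but-still-referenced keys before merge. It is a rough sketch under stated assumptions: the helper name `missing_keys` is hypothetical, and it only recognizes literal `t("...")` or `t('...')` calls, so keys built dynamically or read from another namespace would need separate handling.

```python
import re

# Matches t("Key") or t('Key') with a literal string as the first argument.
T_CALL = re.compile(r"""\bt\(\s*(["'])(.+?)\1""")


def missing_keys(source_text, locale):
    """Return, sorted, the keys used via t(...) in source_text but absent from locale."""
    used = {match.group(2) for match in T_CALL.finditer(source_text)}
    return sorted(used.difference(locale))
```

Running it over each page's source text against the parsed contents of `web/locales/en/app.json` and `web/locales/zh/app.json` would list keys such as "Delete" and "Retry" if they are no longer defined.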

Comment thread web/locales/zh/app.json
Comment on lines 883 to +887
  "Delete this entry?": "确定删除此条目吗?",
  "Failed to load entries": "加载失败",
  "Retry": "重试",
  "Your pick": "你的选择",
  "No categories yet.": "暂无分类。",
- "Original Session": "原始会话"
+ "Original Session": "原始会话",

Copilot AI Apr 15, 2026

The translation keys "Delete" and "Retry" were removed from this locale file, but they are still used in the UI (e.g., t("Delete") in web/app/(utility)/knowledge/page.tsx and t("Retry") in multiple pages). This will cause missing translations (fallback to the key). Please re-add these keys (and any other removed, still-referenced keys) to app.json or move them to the appropriate shared namespace used by those components.

Comment on lines +158 to +167
html_body = markdown(html_ready_markdown, extensions=["extra", "fenced_code", "tables", "toc"])
html = (
    "<!doctype html><html><head><meta charset='utf-8'>"
    f"<title>{title}</title><style>{_STYLE}</style>"
    "</head><body>"
    f"{html_body}</body></html>"
)

pdf_path = final_dir / "final.pdf"
HTML(string=html, base_url=str(job_dir)).write_pdf(str(pdf_path))

Copilot AI Apr 15, 2026

markdown() will pass through raw HTML, and WeasyPrint will fetch referenced resources (e.g., <img src=...>, CSS url(...)) when rendering. Since this Markdown is ultimately LLM-generated and grounded in user-provided documents, this can enable SSRF (http/https fetches) and local file reads (file://...) during PDF render. Consider: (1) sanitizing/escaping raw HTML in the Markdown-to-HTML step, and (2) providing a restrictive url_fetcher to WeasyPrint that only allows local, whitelisted paths under job_dir (block remote schemes and file:// outside the workspace).
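
Suggestion (2) could look roughly like the sketch below. It is not the PR's code: `make_job_dir_fetcher` is a hypothetical helper, and it assumes the WeasyPrint fetcher contract in which a `url_fetcher` callable receives a URL and returns a dict with a `string` and `mime_type`. That contract should be checked against the pinned WeasyPrint version. The sketch is deliberately strict: it rejects every non-`file` scheme, including `data:` URIs.

```python
import mimetypes
from pathlib import Path
from urllib.parse import urlsplit
from urllib.request import url2pathname


def make_job_dir_fetcher(job_dir):
    """Build a fetcher that only serves files located under job_dir."""
    root = Path(job_dir).resolve()

    def fetch(url, *args, **kwargs):
        parts = urlsplit(url)
        # Block http(s), ftp, data:, and anything else that is not file://.
        if parts.scheme != "file":
            raise ValueError(f"blocked non-file URL: {url}")
        # Resolve symlinks and ".." before testing containment.
        path = Path(url2pathname(parts.path)).resolve()
        if not path.is_relative_to(root):  # Python 3.9+
            raise ValueError(f"blocked path outside workspace: {url}")
        mime, _ = mimetypes.guess_type(path.name)
        return {
            "string": path.read_bytes(),
            "mime_type": mime or "application/octet-stream",
        }

    return fetch


# Assumed wiring in the renderer (HTML accepts a url_fetcher argument):
# HTML(string=html, base_url=str(job_dir),
#      url_fetcher=make_job_dir_fetcher(job_dir)).write_pdf(str(pdf_path))
```

Raw HTML in the Markdown would still need sanitizing or escaping separately. The fetcher only limits what the renderer may load.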

Comment on lines +74 to +80
with open(target_path, "wb") as handle:
    for chunk in iter(lambda: file.file.read(8192), b""):
        written_bytes += len(chunk)
        if written_bytes > DocumentValidator.MAX_FILE_SIZE:
            raise HTTPException(status_code=400, detail="Uploaded file exceeds the size limit.")
        handle.write(chunk)

Copilot AI Apr 15, 2026

This endpoint is async, but _save_upload reads the upload via file.file.read(...) (blocking file I/O) in a tight loop. For large uploads this can block the event loop and degrade API responsiveness. Prefer using await file.read(...) (Starlette’s async UploadFile API) or streaming the copy via asyncio.to_thread/background thread to keep the event loop responsive.
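
One possible shape for the non-blocking variant is sketched below. It uses only the standard library so it stands alone: `save_upload` and `UploadTooLarge` are hypothetical names, and `read` stands in for an awaitable chunk reader such as `UploadFile.read`. The router would translate the exception into its existing 400 response.

```python
import asyncio
from pathlib import Path


class UploadTooLarge(Exception):
    """Raised when the streamed upload exceeds max_bytes."""


async def save_upload(read, target_path, max_bytes, chunk_size=8192):
    """Stream chunks from the awaitable read(n) into target_path.

    Disk writes run in a worker thread so the event loop stays free.
    Returns the number of bytes written.
    """
    target = Path(target_path)
    written = 0
    try:
        # Opening the file is brief, so it is done inline here.
        with target.open("wb") as handle:
            while chunk := await read(chunk_size):
                written += len(chunk)
                if written > max_bytes:
                    raise UploadTooLarge(f"upload exceeds {max_bytes} bytes")
                await asyncio.to_thread(handle.write, chunk)
    except UploadTooLarge:
        # Do not leave a truncated file behind.
        target.unlink(missing_ok=True)
        raise
    return written
```

Removing the partial file on overflow is a design choice added in this sketch. The excerpt under review does not show what the original code does with it.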

Comment thread requirements/server.txt
uvicorn[standard]>=0.24.0
websockets>=12.0
python-multipart>=0.0.6
WeasyPrint>=62.0

Copilot AI Apr 15, 2026

Adding WeasyPrint as a runtime dependency can break installs on some Linux environments unless required system libraries (e.g., Pango/Cairo/GLib) are present. If this repo’s install path relies on pip install -r requirements/server.txt in CI/Docker, please ensure the corresponding OS packages are installed (and document them in install scripts/docs), or consider making PDF export an optional extra with a clearer installation path.
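
If PDF export becomes an optional extra, the import could be guarded as in the sketch below. `load_optional` is a hypothetical helper, not part of the PR. It catches `OSError` as well as `ImportError` because a package installed without its native libraries can fail at import time when loading them.

```python
import importlib


def load_optional(module_name):
    """Import module_name, returning None if it or its native libraries are unavailable."""
    try:
        return importlib.import_module(module_name)
    except (ImportError, OSError):
        return None


# Assumed usage in the renderer: skip PDF output when the import fails.
# weasyprint = load_optional("weasyprint")
# pdf_enabled = weasyprint is not None
```

With a guard like this, the Markdown output could still be produced on hosts without Pango, and the API could report that PDF export is unavailable rather than failing at startup.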

pancacake added the enhancement label Apr 15, 2026