feat: subscription historical filters + PDF supply-price extraction + local PDF save #15

Open
roughian wants to merge 2 commits into tae0y:main from roughian:feat/subscription-pdf-supply-prices

roughian commented May 12, 2026

Summary

Three related enhancements for the 청약 (housing subscription) tool group:

  1. Historical / pending-occupancy filters on get_apt_subscription_info

    • rcrit_pblanc_de_from / rcrit_pblanc_de_to (YYYY-MM-DD): server-side filter on 모집공고일 (recruitment notice date) via odcloud cond[] syntax. Lets users pull past notices (e.g. "notices up to July 2021").
    • mvn_prearnge_ym_from (YYYYMM) + only_pending_occupancy flag: filter to complexes (단지) with a future MVN_PREARNGE_YM (i.e. move-in is still pending, so no actual transaction price has been recorded yet).
    • Each item is enriched with is_pre_occupancy and expected_move_in_year_month derived fields.
    • applied_filters echo in the response for LLM traceability.
  2. New tool get_apt_subscription_supply_prices

    • Accepts house_manage_no (looks up PBLANC_URL via odcloud) or pblanc_url directly.
    • Downloads the notice PDF (15 s timeout, 25 MB cap, magic-byte check) and extracts the sale price per unit type (평형별 분양가) via pdfplumber.
    • Extraction strategy: tables first, detected by header keywords for price (공급금액/공급가격/분양가), unit type (주택형/타입/모델/평형), and exclusive area (전용면적); regex fallback on price-keyword pages.
    • KRW values are normalized to 만원 (units of 10,000 won) with a 1억 (100 million won) threshold heuristic; results are deduplicated by (unit_type, exclusive_area_sqm).
  3. New tool download_subscription_pdf (added in follow-up commit)

    • Saves the notice PDF to a user-specified local directory (Claude Desktop stdio mode).
    • save_dir is required and ~-expanded; the directory is auto-created.
    • filename is sanitised (path separators, .., control chars removed; whitespace collapsed to _).
    • Default filename: {HOUSE_NM}_{HOUSE_MANAGE_NO}.pdf.
    • overwrite=False (default) appends a numeric suffix (_01, _02, ...) on collision; overwrite=True replaces the existing file.
    • Interactive flow: the tool's docstring and resources/custom-instructions-ko.md instruct the LLM to ask the user for save_dir before calling, rather than guessing a default. (Claude Desktop has no Skill/AskUserQuestion UI; this is the idiomatic equivalent.)
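
The filter behaviour in item 1 can be sketched as follows. This is a minimal illustration, not the PR's code: the function names are hypothetical, the `::GTE` / `::LTE` operator spelling and the upper-case column names are assumptions about the odcloud `cond[]` syntax, and treating the current month as still "pending" is an assumption as well.

```python
from __future__ import annotations

from datetime import date


def build_filter_params(
    rcrit_pblanc_de_from: str | None = None,
    rcrit_pblanc_de_to: str | None = None,
    mvn_prearnge_ym_from: str | None = None,
    only_pending_occupancy: bool = False,
    today: date | None = None,
) -> dict[str, str]:
    """Map the optional filters to cond[] query parameters (assumed syntax)."""
    today = today or date.today()
    params: dict[str, str] = {}
    if rcrit_pblanc_de_from:
        params["cond[RCRIT_PBLANC_DE::GTE]"] = rcrit_pblanc_de_from
    if rcrit_pblanc_de_to:
        params["cond[RCRIT_PBLANC_DE::LTE]"] = rcrit_pblanc_de_to

    ym_floor = mvn_prearnge_ym_from
    if only_pending_occupancy:
        current_ym = f"{today.year:04d}{today.month:02d}"
        # Use the stricter (later) bound; YYYYMM strings sort chronologically.
        ym_floor = max(current_ym, ym_floor) if ym_floor else current_ym
    if ym_floor:
        params["cond[MVN_PREARNGE_YM::GTE]"] = ym_floor
    return params


def enrich_item(item: dict, today: date | None = None) -> dict:
    """Add the two derived fields computed from MVN_PREARNGE_YM."""
    today = today or date.today()
    ym = str(item.get("MVN_PREARNGE_YM") or "")
    valid = len(ym) == 6 and ym.isdigit()
    current_ym = f"{today.year:04d}{today.month:02d}"
    return {
        **item,
        # Output format "YYYY-MM" is an assumption for readability.
        "expected_move_in_year_month": f"{ym[:4]}-{ym[4:]}" if valid else None,
        "is_pre_occupancy": valid and ym >= current_ym,
    }
```

Passing `today` explicitly keeps both helpers deterministic in tests; the returned `params` dict could also double as the `applied_filters` echo.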

New utility module

  • src/real_estate/common_utils/pdf_parser.py: extract_text() / extract_supply_prices() / SupplyPrice dataclass. Guards: 25 MB / 200 pages / OCR-required detection.
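
To make the normalization and deduplication steps concrete, here is a small sketch. The `SupplyPrice` fields beyond `unit_type` and `exclusive_area_sqm`, the helper names, and the reading of the 1억 threshold (values of 100,000,000 or more are taken to be in won, smaller values to be in 만원 already) are all assumptions, not the module's actual definitions.

```python
import re
from dataclasses import dataclass

ONE_EOK_WON = 100_000_000  # 1억 won


@dataclass(frozen=True)
class SupplyPrice:
    unit_type: str
    exclusive_area_sqm: float
    price_manwon: int  # hypothetical field name


def normalize_to_manwon(raw: str) -> int:
    """Parse a price cell and return the amount in 만원 (10,000 won)."""
    digits = re.sub(r"[^\d]", "", raw)  # drop commas, spaces, unit suffixes
    value = int(digits)  # raises ValueError if the cell has no digits
    if value >= ONE_EOK_WON:
        # Assumed heuristic: a figure this large is written in won.
        return value // 10_000
    return value  # assumed to be in 만원 already


def dedupe_prices(prices: list[SupplyPrice]) -> list[SupplyPrice]:
    """Keep the first entry per (unit_type, exclusive_area_sqm) key."""
    seen: dict[tuple[str, float], SupplyPrice] = {}
    for price in prices:
        seen.setdefault((price.unit_type, price.exclusive_area_sqm), price)
    return list(seen.values())
```

A threshold like this misreads tables printed in 천원 (thousand won), which is one reason regression fixtures from real notices would be useful.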

New helpers in _helpers.py

  • _download_pdf: streaming GET with 15 s timeout, 25 MB cap, content-type / magic-byte check.
  • _sanitize_filename_component, _resolve_save_dir, _next_available_path: used by download_subscription_pdf.
  • _current_year_month, _validate_pblanc_date, _validate_year_month: used by the historical filter.
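
The filename handling described for download_subscription_pdf might look roughly like this. It is a sketch under assumptions: the function names drop the leading underscore to mark them as illustrative, the `fallback` value and the exact character set removed are guesses, and the real helpers may order the steps differently.

```python
import re
from pathlib import Path


def sanitize_filename_component(name: str, fallback: str = "notice") -> str:
    """Return a name that is safe to use as a single path component."""
    name = re.sub(r"\s+", "_", name.strip())       # collapse whitespace to _
    name = re.sub(r"[\x00-\x1f\x7f]", "", name)    # remaining control chars
    name = re.sub(r'[\\/:*?"<>|]', "", name)       # separators, Windows-unsafe
    name = name.replace("..", "")                  # parent-directory markers
    return name.strip("._") or fallback


def next_available_path(
    path: Path, overwrite: bool = False, max_suffix: int = 99
) -> Path:
    """Return path itself, or path with an _01.._99 suffix on collision."""
    if overwrite or not path.exists():
        return path
    for i in range(1, max_suffix + 1):
        candidate = path.with_name(f"{path.stem}_{i:02d}{path.suffix}")
        if not candidate.exists():
            return candidate
    raise FileExistsError(f"no free suffix left for {path}")
```

Note that checking `exists()` and writing afterwards is not atomic; for a single-user stdio tool that is usually acceptable, but opening the file with mode `"xb"` would close the gap.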

Dependencies

  • pdfplumber==0.11.4 added (pure-Python, pdfminer.six + pypdfium2 transitively). No system packages required, friendly for Docker/HTTP-mode deployments.

Test plan

  • uv run pytest tests/mcp_server/test_subscription.py tests/common_utils/test_pdf_parser.py: 54 tests pass
  • uv run ruff check: clean on changed files
  • uv run pyright on changed files: 0 errors
  • Manual smoke test with a real 청약 공고 (subscription notice) PDF (recommended before merging)
  • Manual smoke test of download_subscription_pdf in Claude Desktop stdio mode

Notes / follow-ups

  • _download_pdf accepts arbitrary URLs (via pblanc_url). A follow-up patch is recommended to add a host allowlist and private-IP blocklist for SSRF defence-in-depth before deploying in HTTP mode.
  • The regex fallback \d{2,3}[A-Z]?T? can collide with numbers that are not unit types (평형) when tables are absent. Recommend adding regression fixtures from real PDFs once available.
  • The interactive save_dir prompt is policy-based (docstring + Project Instructions), not hard-enforced. If stricter UX becomes important, splitting into a two-step propose/confirm tool pair is an option.
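
As a starting point for the SSRF follow-up mentioned in the first note, a pre-flight URL check could look like the sketch below. Everything here is hypothetical: `check_pdf_url`, the injectable `resolve` parameter, and the allowlist contents are illustrative, and the real notice host would need to be confirmed before use.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_ip(addr: str) -> bool:
    """True if addr is not private, loopback, link-local, or otherwise special."""
    ip = ipaddress.ip_address(addr)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def check_pdf_url(url: str, allowed_hosts: set[str], resolve=socket.getaddrinfo) -> None:
    """Raise ValueError unless url is https, allowlisted, and resolves publicly."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError("only https URLs are accepted")
    host = (parts.hostname or "").lower()
    if host not in allowed_hosts:
        raise ValueError(f"host not in allowlist: {host!r}")
    # Check every resolved address, not just the first one.
    for info in resolve(host, parts.port or 443, proto=socket.IPPROTO_TCP):
        addr = info[4][0]
        if not is_public_ip(addr):
            raise ValueError(f"{host} resolves to a non-public address: {addr}")
```

This check alone does not stop DNS rebinding, because the HTTP client resolves the name again when it connects; redirects would also need to be re-validated or disabled.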

🤖 Generated with Claude Code

roughian and others added 2 commits May 12, 2026 20:08
…orical filters

get_apt_subscription_info:
- Add rcrit_pblanc_de_from/_to (YYYY-MM-DD) and mvn_prearnge_ym_from
  (YYYYMM) filters, mapped to odcloud cond[] syntax for server-side
  filtering of past notices.
- Add only_pending_occupancy flag that uses the stricter of the current
  year-month and the user-supplied mvn_prearnge_ym_from.
- Enrich items with is_pre_occupancy and expected_move_in_year_month
  derived fields from MVN_PREARNGE_YM.
- Echo applied_filters back in the response for LLM traceability.

get_apt_subscription_supply_prices (new):
- Resolve PBLANC_URL via HOUSE_MANAGE_NO lookup or accept it directly.
- Download the notice PDF with a 25 MB cap and PDF magic-byte check.
- Extract per-unit-type sale prices (평형별 분양가) via pdfplumber: tables
  first (header keyword detection), regex fallback on price-keyword pages.
  KRW values are normalized to 만원 with a 1억-threshold heuristic,
  deduplicated by (unit_type, exclusive_area_sqm).

Common utilities:
- New pdf_parser.py with extract_text() / extract_supply_prices() and a
  SupplyPrice dataclass. Guards: 25 MB / 200 pages / OCR-required.

Tests:
- 26 new subscription test cases (filters, derived fields, supply-price
  tool) plus 7 pdf_parser unit tests using a mocked pdfplumber.

Docs:
- README-ko.md and CLAUDE.md updated with the new tool, filters, and
  utility module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New MCP tool that saves the 청약 공고 (subscription notice) PDF to a user-specified local
directory (Claude Desktop stdio mode). Designed so the LLM asks the user
for `save_dir` before calling instead of guessing a path.

- download_subscription_pdf(save_dir, house_manage_no=None,
  pblanc_url=None, filename=None, overwrite=False)
- Reuses _download_pdf for the network step (15s timeout, 25 MB cap).
- New helpers in _helpers.py:
  - _sanitize_filename_component: strips path separators, ../, and
    Windows-unsafe characters; collapses whitespace to underscores.
  - _resolve_save_dir: expands ~ and resolves to absolute path.
  - _next_available_path: appends _01.._99 suffix on collision.
- Filename defaults to "{HOUSE_NM}_{HOUSE_MANAGE_NO}.pdf" when looked up.
- Directory is auto-created (mkdir parents=True, exist_ok=True).

Interactive guidance (per Codex consultation, Claude Desktop has no
Skill/AskUserQuestion equivalent):
- Tool docstring instructs the LLM to ask the user for save_dir before
  calling rather than silently picking a default.
- resources/custom-instructions-ko.md gains a section on the same
  interaction policy with example prompts.

Tests:
- 11 new cases covering missing inputs, default filename, lookup +
  filename composition, path-traversal sanitisation, suffix collision,
  overwrite=True, and auto-mkdir.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@roughian roughian changed the title from "feat: subscription historical filters + PDF supply-price extraction" to "feat: subscription historical filters + PDF supply-price extraction + local PDF save" May 12, 2026
tae0y commented May 12, 2026

Owner

Hello, and thank you for the contribution.
I'd like to discuss the direction in an issue first, and then review the PR.
When you're ready, please give an overview in the relevant issue,
or open a new issue for it.

@tae0y tae0y self-requested a review May 12, 2026 12:57
@tae0y tae0y added the feature New feature or request label May 12, 2026
@tae0y tae0y linked an issue May 12, 2026 that may be closed by this pull request