feat(ingest): capture entity URLs from Markdown link embeds #1

Open
ntokaeva wants to merge 1 commit into AndyShaman:main from ntokaeva:feat/extract-entity-urls

Problem

Forwarded Telegram posts of the form

📱 Watch on YouTube

are very common: channels write them as Markdown link embeds,
[Watch on YouTube](https://youtu.be/...). Telegram delivers the
URL only in the url field of a MessageEntity of type text_link; the
plain message text holds the visible label without the link.

The current ingest pipeline reads only the visible body
(msg.text or msg.caption), so the URL is silently dropped. Anything
downstream (vault sync, transcript pipelines, search by URL) cannot
recover it.

Solution

New column notes.extracted_urls (idempotent migration, TEXT /
JSON array). On every ingest path that takes a message we now lift
URLs from both entities and caption_entities:

  • text_link → entity.url
  • url → slice of the visible text by offset/length

URLs are kept in document order, deduplicated, written as a JSON
array on the note row. NULL stays NULL on messages without relevant
entities so legacy rows and external tooling that doesn't know about
the column see no change.

Coverage

  • src/bot/handlers/channel.py: text / web / youtube / document / post
  • src/bot/handlers/media_group.py: albums; URLs merge across every
    message in the group
  • src/core/ingest.py: ingest_text and ingest_document accept
    extracted_urls: list[str] | None
  • src/core/notes.py: JSON (de)serialization, single mapper for SELECTs
  • src/core/models.py: Pydantic Note.extracted_urls: list[str] | None
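
For the album case, merging per-message URL lists could look roughly like the sketch below. The function name `merge_extracted_urls` is hypothetical and the actual logic in media_group.py may differ; the sketch only assumes that each message in the group has already produced either a list of URLs or `None`.

```python
from typing import Iterable, Optional


def merge_extracted_urls(
    per_message: Iterable[Optional[list[str]]],
) -> Optional[list[str]]:
    """Merge URL lists from every message in an album, keeping first-seen order."""
    merged: dict[str, None] = {}  # dict preserves insertion order
    for urls in per_message:
        for url in urls or ():
            merged.setdefault(url, None)
    # Keep None (not []) when no message in the group carried a URL.
    return list(merged) or None
```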

Tests

9 new cases in tests/test_entity_urls.py:

  • text_link / url entities extracted from text + caption
  • order preserved, duplicates dropped
  • non-link entity types (bold, mention, hashtag) ignored
  • migration adds the column on a fresh DB
  • DB roundtrip with URLs returns the list intact
  • DB roundtrip without URLs keeps the column SQL NULL (not the string "null")
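
The NULL-versus-"null" case in the last bullet is easy to get wrong because `json.dumps(None)` returns the four-character string `"null"`. A minimal sketch of the roundtrip, assuming hypothetical helpers `dump_urls` and `load_urls` and a cut-down `notes` table (the real schema and mapper in src/core/notes.py are not shown in this PR description):

```python
import json
import sqlite3
from typing import Optional


def dump_urls(urls: Optional[list[str]]) -> Optional[str]:
    # Short-circuit so None (or an empty list) is stored as SQL NULL,
    # not as the JSON text "null" or "[]".
    return json.dumps(urls, ensure_ascii=False) if urls else None


def load_urls(raw: Optional[str]) -> Optional[list[str]]:
    return json.loads(raw) if raw is not None else None


def roundtrip(urls: Optional[list[str]]):
    """Write one note to an in-memory DB and read it back."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE notes (id INTEGER PRIMARY KEY, extracted_urls TEXT)"
    )
    conn.execute(
        "INSERT INTO notes (extracted_urls) VALUES (?)", (dump_urls(urls),)
    )
    raw, is_null = conn.execute(
        "SELECT extracted_urls, extracted_urls IS NULL FROM notes"
    ).fetchone()
    conn.close()
    return raw, bool(is_null), load_urls(raw)
```

Checking `extracted_urls IS NULL` in SQL, rather than comparing the Python value alone, is what distinguishes a true NULL from a stored `"null"` string.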

Full suite: 440 passing.

Compatibility

  • Migration is idempotent (_column_exists check), safe on running DBs
  • Legacy code reading notes still works — the field is Optional
  • Other callers (neighbors.py, sync_deleted.py) keep their own
    SELECTs without the new column; the field stays None via the
    Pydantic default
  • No new dependencies

🤖 Generated with Claude Code

Telegram messages with `[label](URL)` Markdown embeds expose the URL
only via the `url` field of a `MessageEntity` of type `text_link`; the
plain text holds the visible label without the link. Until now the bot
read only the visible body, so forwarded posts like "📱 Watch on
YouTube" arrived in the DB with the video URL silently dropped.

This change introduces a JSON column `notes.extracted_urls` and lifts
URLs from both message-level and caption-level entities (types
`text_link` and `url`) before each ingest path stores the note. The
column stays NULL on messages with no relevant entities, so legacy
rows and downstream tools that don't know about the column see no
change.

Paths covered:
- channel.py: text/web/youtube + document + post
- media_group.py: albums merge URLs across every message in the group

Tests added: 9 new cases pin the extraction contract, migration
behaviour and the roundtrip through SQLite (including the NULL-stays-
NULL case for plain text). Full suite: 440 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>