feat(ingest): capture entity URLs from Markdown link embeds #1
Open
ntokaeva wants to merge 1 commit into
Telegram messages with `[label](URL)` Markdown embeds expose the URL only via `MessageEntity.text_link.url`; the plain text holds the visible label without the link. Until now the bot read only the visible body, so forwarded posts like "📱 Смотреть на YouTube" ("Watch on YouTube") arrived in the DB with the video URL silently dropped.

This change introduces a JSON column `notes.extracted_urls` and lifts URLs from both message-level and caption-level entities (types `text_link` and `url`) before each ingest path stores the note. The column stays NULL on messages with no relevant entities, so legacy rows and downstream tools that don't know about the column see no change.

Paths covered:
- channel.py: text/web/youtube + document + post
- media_group.py: albums merge URLs across every message in the group

Tests added: 9 new cases pin the extraction contract, migration behaviour and the roundtrip through SQLite (including the NULL-stays-NULL case for plain text). Full suite: 440 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Problem
Forwarded Telegram posts of the form "📱 Смотреть на YouTube" ("Watch on YouTube") are very common: channels write them as Markdown link embeds such as `[Смотреть на YouTube](https://youtu.be/...)`. Telegram delivers the URL only on `MessageEntity.text_link.url`; the plain message text holds the visible label without the link.

The current ingest pipeline reads only the visible body (`msg.text or msg.caption`), so the URL is silently dropped. Anything downstream (vault sync, transcript pipelines, search by URL) cannot recover it.
Solution
New column `notes.extracted_urls` (idempotent migration, TEXT/JSON array). On every ingest path that takes a message we now lift URLs from both `entities` and `caption_entities`:
- `text_link` → `entity.url`
- `url` → slice of the visible text by `offset`/`length`

URLs are kept in document order, deduplicated, and written as a JSON array on the note row. NULL stays NULL on messages without relevant entities, so legacy rows and external tooling that doesn't know about the column see no change.
Coverage
- `src/bot/handlers/channel.py`: text / web / youtube / document / post
- `src/bot/handlers/media_group.py`: albums; URLs merge across every message in the group
- `src/core/ingest.py`: `ingest_text` and `ingest_document` accept `extracted_urls: list[str] | None`
- `src/core/notes.py`: JSON (de)serialization, single mapper for SELECTs
- `src/core/models.py`: Pydantic `Note.extracted_urls: list[str] | None`
Tests
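9 new cases in `tests/test_entity_urls.py` pin the extraction contract, the migration behaviour, and the round trip through SQLite, including the case where NULL stays NULL for plain text (not the string `"null"`).
Full suite: 440 passing.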
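As a rough illustration of the round-trip property these tests pin down, the sketch below stores one note with URLs and one without, then reads both back. The helper names `dump_urls` and `load_urls` and the reduced `notes` schema are assumptions made for the example; the real mapper lives in `src/core/notes.py` and may look different.

```python
import json
import sqlite3
from typing import Optional


def dump_urls(urls: Optional[list[str]]) -> Optional[str]:
    """Serialize for the TEXT column; None or an empty list becomes SQL NULL."""
    return json.dumps(urls) if urls else None


def load_urls(raw: Optional[str]) -> Optional[list[str]]:
    """Deserialize the column value; SQL NULL comes back as None."""
    return json.loads(raw) if raw is not None else None


conn = sqlite3.connect(":memory:")
# Reduced, hypothetical schema: only the columns this example needs.
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT, extracted_urls TEXT)")
conn.execute(
    "INSERT INTO notes (body, extracted_urls) VALUES (?, ?)",
    ("with link", dump_urls(["https://youtu.be/VIDEO_ID"])),
)
conn.execute(
    "INSERT INTO notes (body, extracted_urls) VALUES (?, ?)",
    ("plain text", dump_urls(None)),
)

rows = conn.execute(
    "SELECT extracted_urls, extracted_urls IS NULL FROM notes ORDER BY id"
).fetchall()
with_link, plain = rows
```

Passing `None` straight to `json.dumps` would store the four-character string `null` instead of SQL NULL, which is why the sketch short-circuits before serializing.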
Compatibility
- Migration is idempotent (`_column_exists` check), safe on running DBs
- The new field is `Optional`
- Other readers (`neighbors.py`, `sync_deleted.py`) keep their own `SELECT`s without the new column; the field stays `None` via the Pydantic default
🤖 Generated with Claude Code