This file is the build contract for an AI agent working in this repo.
Goal:
- build a local-first Discord guild crawler
- mirror all guild data the configured bot can access
- import classifiable Discord Desktop cache messages without user tokens, including DMs
- store it in SQLite
- support fast text search, semantic search, and raw SQL
- support one-shot backfill and long-running live sync
This spec is intentionally detailed so an agent can keep shipping without re-asking foundational questions.
discrawl is a Go CLI that mirrors Discord guild data into local SQLite.
V1 scope:
- one guild at a time
- all accessible text channels
- all accessible announcement channels
- all accessible forum channels and their posts
- all accessible public threads
- all accessible private threads
- archived thread coverage
- full message history
- desktop-local import from cached Discord Desktop artifacts, with proven DMs stored under @me
- current member snapshot
- FTS5 search
- optional OpenAI embeddings with local vector search
- raw SQL access
Out of scope for V1:
- remote/API personal-account DM crawling
- Discord user-token automation/selfbot flows
- reactions as primary indexed entities
- attachment blob downloads by default
- cross-guild unified sync UX
- write-back or moderation actions
These are settled unless the user explicitly changes them:
- config format: TOML
- config location: platform-native XDG config dir, e.g. ${XDG_CONFIG_HOME:-~/.config}/discrawl/config.toml on Linux
- DB location: platform-native XDG data dir, e.g. ${XDG_DATA_HOME:-~/.local/share}/discrawl/discrawl.db on Linux
- cache dir: platform-native XDG cache dir, e.g. ${XDG_CACHE_HOME:-~/.cache}/discrawl/
- log dir: platform-native XDG state dir, e.g. ${XDG_STATE_HOME:-~/.local/state}/discrawl/logs/
- legacy installs with ~/.discrawl/config.toml continue to load that config when the new default config file does not exist, even if XDG env vars are present
- token source: DISCORD_BOT_TOKEN or configured env var, then optional OS keyring fallback
- guild model: one guild in CLI UX, multi-guild-ready schema
- search: hybrid, with FTS first and embeddings optional
- embedding provider: OpenAI
- API key source: OPENAI_API_KEY from shell env
- message retention: current canonical row + append-only event log
- member retention: current snapshot only
- files: metadata only in DB, fetch binaries later on demand
- reactions: not important for V1
- polls: flatten into text during normalization
An agent should assume:
- repo path: ~/Projects/discrawl
- shell: zsh
- Go is installed and modern
- user is Peter
- user keeps many secrets in ~/.profile
- platform-native Discrawl config file
- platform-native Discrawl SQLite database
- ~/.profile
Do not store raw API keys in repo files.
Expected source:
- env var OPENAI_API_KEY
Typical place to discover it locally:
- ~/.profile
The code should read the env var at runtime, not copy the value into config by default.
Important Discord facts that drive the schema:
- channels and threads are closely related; threads should be stored as channels
- forum posts are threads under a forum parent
- message history is paginated and must be backfilled incrementally
- live updates come from Gateway events, not from polling alone
- personal DMs are only supported through desktop-local cache import
- desktop cache messages without a provable channel/guild route are skipped rather than stored as unknown data
- archived public and private threads must be enumerated explicitly
- private archived thread access may require elevated bot perms like Manage Threads
- guild
- categories
- channels
- threads
- members
- messages
- message lifecycle events
- category
- text
- announcement
- forum
- thread public
- thread private
- thread announcement
Voice channels can be mirrored as metadata rows, but there is no need to crawl message history because there is none.
Use SQLite.
Requirements:
- WAL mode
- foreign keys on
- FTS5 enabled
- vector extension optional
At minimum:
- guilds
- channels
- members
- messages
- message_events
- sync_state
- embedding_jobs
- message_fts
Optional once vectors are wired:
- message_embeddings
Suggested shape:
create table guilds (
id text primary key,
name text not null,
icon text,
raw_json text not null,
updated_at text not null
);
Threads should live in the same table.
Suggested shape:
create table channels (
id text primary key,
guild_id text not null,
parent_id text,
kind text not null,
name text not null,
topic text,
position integer,
is_nsfw integer not null default 0,
is_archived integer not null default 0,
is_locked integer not null default 0,
is_private_thread integer not null default 0,
thread_parent_id text,
archive_timestamp text,
raw_json text not null,
updated_at text not null
);
Suggested shape:
create table members (
guild_id text not null,
user_id text not null,
username text not null,
global_name text,
display_name text,
nick text,
discriminator text,
avatar text,
bot integer not null default 0,
joined_at text,
role_ids_json text not null,
raw_json text not null,
updated_at text not null,
primary key (guild_id, user_id)
);
Suggested shape:
create table messages (
id text primary key,
guild_id text not null,
channel_id text not null,
author_id text,
message_type integer not null,
created_at text not null,
edited_at text,
deleted_at text,
content text not null,
normalized_content text not null,
reply_to_message_id text,
pinned integer not null default 0,
has_attachments integer not null default 0,
raw_json text not null,
updated_at text not null
);
Suggested shape:
create table message_events (
event_id integer primary key autoincrement,
guild_id text not null,
channel_id text not null,
message_id text not null,
event_type text not null,
event_at text not null,
payload_json text not null
);
Suggested shape:
create table sync_state (
scope text primary key,
cursor text,
updated_at text not null
);
Examples of scope:
- guild:<guild_id>:members
- channel:<channel_id>:messages
- tail:<guild_id>
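The scope keys for sync_state follow a small fixed grammar, so building them through helpers avoids typos in hand-written strings. The sketch below is illustrative only; the function names are hypothetical and not part of any existing package.

```go
package main

import "fmt"

// Hypothetical helpers that build sync_state.scope keys in the three
// formats listed in the spec.

func membersScope(guildID string) string {
	return fmt.Sprintf("guild:%s:members", guildID)
}

func channelMessagesScope(channelID string) string {
	return fmt.Sprintf("channel:%s:messages", channelID)
}

func tailScope(guildID string) string {
	return fmt.Sprintf("tail:%s", guildID)
}

func main() {
	fmt.Println(membersScope("1456350064065904867"))
	fmt.Println(channelMessagesScope("42"))
	fmt.Println(tailScope("1456350064065904867"))
}
```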
Suggested shape:
create table embedding_jobs (
message_id text primary key,
state text not null,
attempts integer not null default 0,
updated_at text not null
);
Recommended pattern:
- content table = messages
- FTS virtual table = message_fts
- keep it updated explicitly, not by fragile magic
Suggested columns:
- message_id
- guild_id
- channel_id
- author_id
- author_name
- channel_name
- content
Support three modes:
- fts
- semantic
- hybrid
Default:
- hybrid when embeddings are enabled
- fts otherwise
FTS is mandatory.
It should be good enough that the tool is useful before embeddings exist.
Expected use cases:
- exact terms
- commands
- stack traces
- URLs
- model names
- channel names
- user names
Embeddings are optional but planned from day one.
Recommended provider:
- OpenAI text-embedding-3-small
Implementation guidance:
- batch embedding jobs
- keep embedding generation out of the hot sync path
- store vectors locally
- semantic search should degrade gracefully when vectors are absent
Prefer SQLite-local vector search so the whole product stays portable.
Recommended direction:
- sqlite-vec
This can be wired after the base crawler and FTS system work.
Design goals:
- simple for humans
- composable for scripts
- obvious nouns and verbs
- no secrets in flags
Usage:
discrawl [global flags] <command> [args]
Global flags:
- -h, --help
- --version
- --config <path>
- --json
- --plain
- -q, --quiet
- -v, --verbose
- --no-color
Commands:
- init
- sync
- tail
- wiretap
- search
- sql
- members
- channels
- status
- doctor
Purpose:
- create the platform-native Discrawl config file
- discover accessible Discord guilds
- persist guild id and DB path
Expected flags:
- --guild <id>
- --db <path>
- --with-embeddings
Purpose:
- one-shot crawl
Expected flags:
- --full
- --since <timestamp>
- --concurrency <n>
- --with-embeddings
Requirements:
- idempotent
- restart-safe
- shows progress on stderr
Purpose:
- live sync from Gateway
Expected flags:
- --repair-every <duration>
- --with-embeddings
Requirements:
- reconnect automatically
- write checkpoints
- periodic repair sync
Purpose:
- import Discord Desktop cache artifacts into the local archive
- make cached personal DMs searchable under synthetic guild id @me
Expected flags:
- --path <dir>
- --dry-run
- --watch-every <duration>
- --max-file-bytes <bytes>
- --full-cache
Requirements:
- never use Discord user tokens
- never extract or persist auth tokens from desktop cache
- scan bounded local files only
- default to route-bearing HTTP cache entries; exhaustive Chromium cache scans require explicit full-cache mode
- store sanitized raw metadata, not full arbitrary cache blobs
Purpose:
- query mirrored messages
Expected flags:
- --mode fts|semantic|hybrid
- --channel <name-or-id>
- --author <name-or-id>
- --limit <n>
- --json
- --plain
Purpose:
- run read-only SQL
Requirements:
- support query arg or stdin
- block non-read-only statements by default
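One way to approach the "block non-read-only statements by default" requirement is a conservative first-pass filter in front of the database call. The sketch below is an assumption about how that could look, not the project's actual guard: isReadOnly is a hypothetical name, and it deliberately rejects anything it cannot trivially classify (including WITH, since a CTE can front a write, and any query containing an inner semicolon). Opening the SQLite connection in read-only mode would be a stronger, complementary safeguard.

```go
package main

import (
	"fmt"
	"strings"
)

// isReadOnly is a hypothetical, deliberately strict pre-check.
// It accepts a single statement whose first keyword is SELECT or EXPLAIN
// and rejects everything else, erring on the side of false negatives.
func isReadOnly(query string) bool {
	q := strings.TrimSpace(query)
	q = strings.TrimSuffix(q, ";")
	if q == "" || strings.Contains(q, ";") {
		return false // empty input, or more than one statement
	}
	switch strings.ToUpper(strings.Fields(q)[0]) {
	case "SELECT", "EXPLAIN":
		return true
	}
	return false
}

func main() {
	for _, q := range []string{
		"select count(*) from messages;",
		"DELETE FROM messages",
		"select 1; drop table messages",
	} {
		fmt.Printf("%-35q read-only=%v\n", q, isReadOnly(q))
	}
}
```

A keyword filter like this gives fast, readable error messages at the CLI layer, while the read-only connection remains the actual enforcement boundary.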
Subcommands:
- list
- show <user-id>
- search <query>
Subcommands:
- list
- show <channel-id>
Must show:
- guild id
- guild name if known
- db path
- total channels
- total threads
- total messages
- total members
- last sync time
- last tail event time
- embedding backlog
Must check:
- config file readable
- Discord token env var readable unless live access is disabled
- Discord auth valid
- guild reachable
- DB openable
- FTS present
- vector extension present if configured
Format:
- TOML
Location:
- platform-native Discrawl config file
Suggested shape:
version = 1
guild_id = "1456350064065904867"
db_path = "~/.local/share/discrawl/discrawl.db"
cache_dir = "~/.cache/discrawl"
log_dir = "~/.local/state/discrawl/logs"
[discord]
token_source = "env"
token_env = "DISCORD_BOT_TOKEN"
token_keyring_service = "discrawl"
token_keyring_account = "discord_bot_token"
[sync]
concurrency = 4
repair_every = "6h"
full_history = true
[search]
default_mode = "hybrid"
[search.embeddings]
enabled = true
provider = "openai"
model = "text-embedding-3-small"
api_key_env = "OPENAI_API_KEY"
batch_size = 64
Config precedence:
- flags
- environment
- config file
Environment variables:
- DISCRAWL_CONFIG
- DISCORD_BOT_TOKEN
- OPENAI_API_KEY
Do not:
- put bot tokens in git
- put API keys in git
- print secrets in normal logs
Do:
- load bot token from env
- fall back to the configured OS keyring item when env is empty
- load OpenAI key from env
- redact secrets in debug and doctor output
- load config
- resolve token
- fetch bot identity
- fetch guild metadata
- fetch guild channels
- fetch active threads
- enumerate archived public threads per parent channel
- enumerate archived private threads per parent channel
- fetch member snapshot
- backfill messages for every crawlable channel and thread
- normalize message content
- upsert messages
- append message_events where relevant
- update FTS rows
- enqueue embedding jobs
- write checkpoints
Use REST pagination with before.
Rules:
- fetch newest page first for incremental runs
- fetch oldest via repeated before paging for full runs
- stop when no messages remain
- handle rate limits centrally
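The full-run rule above (page backwards with before until nothing remains) reduces to a short loop. The sketch below assumes pages arrive newest-first, so the last element of each page is the next before cursor; fetchPage, backfill, and fakeChannel are hypothetical names, and the in-memory fake replaces the real REST call. Rate limiting and checkpoint writes are omitted.

```go
package main

import "fmt"

type message struct{ ID string }

// fetchPage abstracts one REST call: return up to limit messages older than
// before (empty string means "start from the newest"), newest first.
type fetchPage func(before string, limit int) ([]message, error)

// backfill walks a channel from newest to oldest and hands each page to handle.
// It stops when a page comes back empty and returns the number of messages seen.
func backfill(fetch fetchPage, limit int, handle func([]message) error) (int, error) {
	before, total := "", 0
	for {
		page, err := fetch(before, limit)
		if err != nil {
			return total, err
		}
		if len(page) == 0 {
			return total, nil // no messages remain
		}
		if err := handle(page); err != nil {
			return total, err
		}
		total += len(page)
		before = page[len(page)-1].ID // oldest message in this page
	}
}

// fakeChannel serves pages from an in-memory, newest-first list of ids.
func fakeChannel(ids []string) fetchPage {
	return func(before string, limit int) ([]message, error) {
		start := 0
		for i, id := range ids {
			if before != "" && id == before {
				start = i + 1
				break
			}
		}
		end := start + limit
		if end > len(ids) {
			end = len(ids)
		}
		var out []message
		for _, id := range ids[start:end] {
			out = append(out, message{ID: id})
		}
		return out, nil
	}
}

func main() {
	fetch := fakeChannel([]string{"5", "4", "3", "2", "1"})
	n, err := backfill(fetch, 2, func(p []message) error {
		fmt.Println("page:", p)
		return nil
	})
	fmt.Println(n, err) // 5 <nil>
}
```

Persisting the before cursor in sync_state after each handled page is what would make such a loop restart-safe.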
Use Gateway events for:
- new messages
- edited messages
- deleted messages
- channel updates
- thread updates
- member updates
Tail should:
- upsert live state
- append lifecycle events
- keep retrying on disconnect
- periodically run repair sync
normalized_content should flatten Discord payloads into searchable text.
Include:
- message content
- embed titles and descriptions where helpful
- poll question and answers
- attachment filenames
- referenced message hints if available
Do not overcomplicate:
- reactions can be ignored
- attachment binary contents are not indexed in V1
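As a rough illustration of the normalization rules above, the following sketch flattens a message into newline-separated searchable text. The rawMessage, embed, and poll types are simplified, hypothetical stand-ins for the real Discord payload, and referenced-message hints are left out for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified, hypothetical views of the Discord payload fields that feed
// normalized_content.
type embed struct{ Title, Description string }

type poll struct {
	Question string
	Answers  []string
}

type rawMessage struct {
	Content         string
	Embeds          []embed
	Poll            *poll
	AttachmentNames []string
}

// normalize joins every non-empty text fragment with newlines, in a fixed
// order: content, embeds, poll, attachment filenames.
func normalize(m rawMessage) string {
	var parts []string
	add := func(s string) {
		if s = strings.TrimSpace(s); s != "" {
			parts = append(parts, s)
		}
	}
	add(m.Content)
	for _, e := range m.Embeds {
		add(e.Title)
		add(e.Description)
	}
	if m.Poll != nil {
		add(m.Poll.Question)
		for _, a := range m.Poll.Answers {
			add(a)
		}
	}
	for _, name := range m.AttachmentNames {
		add(name)
	}
	return strings.Join(parts, "\n")
}

func main() {
	fmt.Println(normalize(rawMessage{
		Content:         "build is red again",
		Poll:            &poll{Question: "Revert?", Answers: []string{"yes", "no"}},
		AttachmentNames: []string{"trace.log"},
	}))
}
```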
Members matter for AI workflows.
Expected use cases:
- “who is this user”
- “find messages by this person”
- “find maintainers”
- “find everyone with a display name containing X”
At minimum, store:
- user id
- username
- display name
- nick
- roles
- bot flag
cmd/discrawl/
internal/cli/
internal/config/
internal/discord/
internal/store/
internal/search/
internal/syncer/
internal/embed/
Responsibilities:
- internal/cli: command wiring, output modes
- internal/config: parse and validate config
- internal/discord: REST + Gateway client wrappers
- internal/store: SQLite schema, migrations, queries
- internal/search: FTS and result ranking
- internal/syncer: full sync and repair orchestration
- internal/embed: embedding queue and provider integration
Reasonable picks:
- Discord client: github.com/bwmarrin/discordgo
- TOML parser: something small and maintained
- SQLite driver: pick one path and stay consistent
- vector search: sqlite-vec
Guidance:
- keep dependency count low
- prefer boring stable libraries
- avoid frameworks
- config loader
- init
- status
- DB open + migrations
- guild metadata sync
- channel sync
- member sync
- full message backfill
- incremental checkpoints
- FTS indexing
- search
- sql
- members
- channels
- tail
- reconnect logic
- repair loop
- embedding queue
- vector search
- hybrid ranking
For an AI agent to finish the product without external memory, this repo should contain:
- this spec
- README with user-facing overview
- schema and migration files
- config sample
- CLI contract
- implementation package layout
- token discovery rules
- API key discovery rules
- milestone order
This file is the authoritative engineering spec for now.
discrawl digest provides a per-channel activity summary over a lookback window.
Example usage:
discrawl digest
discrawl digest --since 7d
discrawl digest --since 30d --guild 123456789012345678
discrawl digest --channel general --top-n 5
discrawl --json digest --since 72h
Behavior:
- window defaults to 7d when --since is omitted
- --since accepts Go durations (72h, 30m) and Nd shorthand (7d, 30d)
- --guild filters by guild_id; empty means no guild filter
- --channel accepts channel id or exact channel name
- per-channel metrics include messages, replies, and active_authors
- top posters are ranked by message count using member display fallback order: display_name -> nick -> global_name -> username -> author_id -> unknown
- top mentions are ranked from mention_events and include all target types (user and role)
- channels are sorted by message count descending, then channel name ascending
- JSON output returns a Digest object with channel rows and totals; plain output emits one tab-separated row per channel
discrawl analytics is a subcommand group for activity-style queries.
Example usage:
discrawl analytics
discrawl analytics quiet --since 30d
discrawl analytics quiet --guild 123456789012345678
discrawl analytics trends --weeks 8
discrawl analytics trends --weeks 12 --channel general
discrawl --json analytics quiet --since 60d
discrawl --json analytics trends --weeks 4
Behavior:
- analytics quiet defaults to 30d lookback and supports --guild
- analytics quiet includes top-level text/announcement channels with no messages at all
- quiet rows are sorted with never-active channels first, then by longest silence
- analytics trends defaults to 8 weeks and supports --guild plus --channel (id or exact name)
- analytics trends buckets messages into Monday-start UTC weeks and zero-fills missing weeks for every returned message-capable channel
- trends rows are sorted by total messages descending, then channel name ascending