This guide walks through creating a new expert from scratch. We'll use githubscarf as the running example — it was the first expert built greenfield against the expert contract.
An expert is two things:
A connector — an always-running process that owns two-way access to its data source. It manages polling, auth, and rate limits. It pushes records into PearScarf and can write back to the source when needed.
A knowledge package — natural language files that describe what the expert knows about its source: record types, entity types, how to extract facts, and how the LLM agent should behave.
experts/githubscarf/
├── __init__.py              # package docstring
├── manifest.yaml            # declares everything about the expert
├── pyproject.toml           # Python package metadata
├── .env.example             # required credentials template
├── github_connect.py        # API client + tools + ingest_record()
├── github_ingest.py         # background polling loop
├── schemas/
│   ├── github_pr.json       # JSON Schema for PR records
│   └── github_issue.json    # JSON Schema for issue records
├── knowledge/
│   ├── agent.md             # LLM agent system prompt
│   ├── extraction.md        # source-specific extraction guidance
│   ├── entities/
│   │   └── repository.md    # new entity type definition (optional)
│   └── records/
│       ├── github_pr.md     # PR record documentation
│       └── github_issue.md
└── eval/                    # evaluation dataset (optional)
The manifest is the contract between the expert and PearScarf.
name: githubscarf
version: 0.1.0
source_type: github
author: pearshape-ai
description: GitHub expert for PearScarf
relevancy_check: skip
record_types:
  - github_pr
  - github_issue
schemas:
  github_pr: schemas/github_pr.json
  github_issue: schemas/github_issue.json
dedup_key: github_id
new_entity_types:
  - name: repository        # omit if no new entity types
# new_entity_types: []      # alternative: an empty list when there are none
identifier_patterns: []
tools: github_connect.py
ingester: github_ingest.py
eval: eval/
Key fields:
- `relevancy_check`: required. `skip` auto-marks every record relevant on save (use for internal/trusted sources). `required` means the expert owns classification: its `ingest_record` may run an internal hard filter and pass `classification="noise"` for unambiguous noise, or leave it unset to let the framework default to `pending_triage` so the triage agent can judge it. Required experts should also ship a `knowledge/relevancy.md` describing source-specific signal vs noise; the triage agent loads it into its LLM check.
- `record_types`: the record type strings this expert produces. Must be globally unique (prefix with source name to avoid collisions: `github_issue` not `issue`).
- `schemas`: maps each record type to a JSON Schema file. Used to create typed tables at install time.
- `tools`: module with `get_tools(ctx)` entry point. Called at startup.
- `ingester`: module with `start(ctx)` entry point. Called when `--poll` is active.
- `new_entity_types`: entity types not in the base schema. Operator approves on install.
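The rules in the key-field list can be checked before running an install. The following is a minimal sketch, not PearScarf's actual validator: `check_manifest` and `REQUIRED_KEYS` are hypothetical names, and the manifest is passed in as an already-parsed dict so the example needs no YAML library.

```python
from typing import Any

# Keys this sketch treats as mandatory. Only relevancy_check is stated as
# required in the guide; the other two are an assumption of this example.
REQUIRED_KEYS = ("relevancy_check", "record_types", "schemas")


def check_manifest(manifest: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means no issue was found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in manifest]

    if manifest.get("relevancy_check") not in ("skip", "required"):
        problems.append("relevancy_check must be 'skip' or 'required'")

    source = manifest.get("source_type", "")
    schemas = manifest.get("schemas", {})
    for record_type in manifest.get("record_types", []):
        # Prefixing with the source name keeps record types globally unique.
        if not record_type.startswith(f"{source}_"):
            problems.append(f"{record_type}: not prefixed with '{source}_'")
        # Every record type needs a schema so its typed table can be created.
        if record_type not in schemas:
            problems.append(f"{record_type}: no schema declared")
    return problems


manifest = {
    "name": "githubscarf",
    "source_type": "github",
    "relevancy_check": "skip",
    "record_types": ["github_pr", "github_issue"],
    "schemas": {
        "github_pr": "schemas/github_pr.json",
        "github_issue": "schemas/github_issue.json",
    },
}
print(check_manifest(manifest))
```

Running a check like this in CI catches an unprefixed record type such as `issue` before the installer's own conflict checks do.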
One JSON Schema (draft-07) per record type. Properties become columns in the typed table.
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "github_id": {"type": "integer"},
    "number": {"type": "integer"},
    "title": {"type": "string"},
    "body": {"type": "string"},
    "state": {"type": "string"},
    "author": {"type": "string"},
    "url": {"type": "string"},
    "created_at": {"type": "string"},
    "updated_at": {"type": "string"}
  },
  "required": ["github_id", "number", "title"]
}
The connect module is the core of the expert. It provides:
- API client: wraps the data source's API
- `ingest_record(data)`: saves a record via `ctx.storage.save_record()`
- `get_tools()`: returns `BaseTool` instances for the LLM agent
- Module-level `get_tools(ctx)`: entry point called by PearScarf at startup
from __future__ import annotations

import json  # needed by ingest_record() to serialize the raw payload
from typing import TYPE_CHECKING, Any

from pearscarf.tools import BaseTool

if TYPE_CHECKING:
    from pearscarf.expert_context import ExpertContext


class GitHubConnect:
    def __init__(self, ctx: ExpertContext) -> None:
        self._ctx = ctx
        self._token = ctx.config.get("GITHUB_TOKEN", "")
        self._repo = ctx.config.get("GITHUB_REPO", "")

    def ingest_record(self, data: dict) -> str | None:
        """Save a record. Returns record_id or None on duplicate."""
        raw = json.dumps(data)
        content = f"Issue #{data['number']}: {data['title']}\n..."
        metadata = {"github_id": data["github_id"]}  # ... plus other structured fields
        return self._ctx.storage.save_record(
            "github_issue", raw, content=content, metadata=metadata,
            dedup_key=str(data["github_id"]),
        )

    def get_tools(self) -> list[BaseTool]:
        # MyTool is a placeholder for your own BaseTool subclasses.
        return [MyTool(self), ...]


# Module entry point: pearscarf calls this at startup
def get_tools(ctx: ExpertContext) -> GitHubConnect:
    return GitHubConnect(ctx)
The context is everything the expert receives. No pearscarf internals.
- `ctx.storage.save_record(type, raw, content, metadata, dedup_key)`: save a record. `raw` = original data, `content` = LLM-ready string, `metadata` = structured fields as dict.
- `ctx.storage.get_record(id)`: look up a record
- `ctx.storage.mark_relevant(id)`: mark a record for extraction
- `ctx.bus.send(session_id, to_agent, content)`: send a message
- `ctx.bus.create_session(summary)`: create a new session
- `ctx.log.write(agent, event_type, message)`: log an event
- `ctx.config`: dict from `env/.<name>.env`
- `ctx.expert_name`: the expert's registered name
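Because the expert only ever talks to the context, it can be exercised without a running PearScarf. The sketch below is an in-memory stand-in for testing. `FakeStorage` and `make_fake_context` are hypothetical names, the record ID format is invented, and the duplicate handling (returning `None` for a repeated `dedup_key`) is inferred from the `ingest_record` docstring, not from the real storage layer.

```python
import json
from types import SimpleNamespace


class FakeStorage:
    """In-memory stand-in for ctx.storage (hypothetical, for tests only)."""

    def __init__(self):
        self.records = {}
        self._seen = set()

    # Parameter names mirror the documented signature, hence `type` and `id`.
    def save_record(self, type, raw, content=None, metadata=None,
                    dedup_key=None, classification=None):
        if dedup_key is not None:
            key = (type, dedup_key)
            if key in self._seen:
                return None  # assumed duplicate behaviour
            self._seen.add(key)
        record_id = f"{type}_{len(self.records) + 1}"  # invented ID format
        self.records[record_id] = {
            "type": type, "raw": raw, "content": content,
            "metadata": metadata or {}, "classification": classification,
        }
        return record_id

    def get_record(self, id):
        return self.records.get(id)


def make_fake_context(config):
    """Build an object with the same attribute names the guide lists for ctx."""
    events = []
    log = SimpleNamespace(
        write=lambda agent, event_type, message: events.append((agent, event_type, message))
    )
    return SimpleNamespace(storage=FakeStorage(), log=log, config=dict(config),
                           expert_name="githubscarf", events=events)


ctx = make_fake_context({"GITHUB_TOKEN": "placeholder", "GITHUB_REPO": "example-org/example-repo"})
issue = {"github_id": 101, "number": 7, "title": "Fix login"}
args = dict(raw=json.dumps(issue), content="Issue #7: Fix login",
            metadata={"github_id": 101}, dedup_key="101")
first = ctx.storage.save_record("github_issue", **args)
second = ctx.storage.save_record("github_issue", **args)  # same dedup_key
ctx.log.write(ctx.expert_name, "action", f"Ingested {first}")
```

A connect class built against `ctx` should accept this object unchanged, which makes `ingest_record` easy to unit test.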
Tools are BaseTool subclasses with name, description, input_schema, and execute():
class GitHubListIssuesTool(BaseTool):
    name = "github_list_issues"
    description = "List issues from the GitHub repository."
    input_schema = {
        "type": "object",
        "properties": {
            "state": {"type": "string", "enum": ["open", "closed", "all"]},
        },
    }

    def __init__(self, connect: GitHubConnect) -> None:
        self._connect = connect

    def execute(self, **kwargs: Any) -> str:
        issues = self._connect.list_issues(state=kwargs.get("state", "open"))
        if not issues:
            return "No issues found."
        return "\n".join(f"- #{i['number']}: {i['title']}" for i in issues)
The ingester subclasses `pearscarf.consumer.Consumer`. The consumer base owns the poll loop and thread lifecycle; the module exposes `start(ctx)` as a thin entry point the expert registry calls.
Each ingested record lands with classification='pending_triage' (via connect.ingest_record(...)), which the Triage consumer picks up automatically — the ingester does not touch the bus.
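To make the division of labour concrete, here is a minimal sketch of how a consumer-style base class could drive `_next()` and `_handle()` from a daemon thread. `SketchConsumer` and `ListConsumer` are hypothetical; this is not the real `pearscarf.consumer.Consumer`, whose internals may differ.

```python
import threading


class SketchConsumer:
    """Hypothetical stand-in for a poll-loop base class."""

    def __init__(self, poll_interval):
        self.poll_interval = poll_interval
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _next(self):
        raise NotImplementedError  # subclasses return an item or None

    def _handle(self, item):
        raise NotImplementedError  # subclasses process one item

    def _run(self):
        while not self._stop_event.is_set():
            item = self._next()
            if item is None:
                # Nothing to do: sleep, but wake early if stop() is called.
                self._stop_event.wait(self.poll_interval)
                continue
            self._handle(item)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop_event.set()
        self._thread.join()


class ListConsumer(SketchConsumer):
    """Drains a fixed list once, then idles."""

    def __init__(self, items, poll_interval=0.01):
        super().__init__(poll_interval)
        self._items = list(items)
        self.handled = []
        self.drained = threading.Event()

    def _next(self):
        if self._items:
            return self._items.pop(0)
        self.drained.set()  # every earlier item has already been handled
        return None

    def _handle(self, item):
        self.handled.append(item)


consumer = ListConsumer(["a", "b", "c"])
consumer.start()
consumer.drained.wait(timeout=2)
consumer.stop()
print(consumer.handled)
```

With a base class shaped like this, a subclass supplies only the two hooks and a `start(ctx)` wrapper, which matches the structure of the ingester in this guide.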
from __future__ import annotations

from typing import TYPE_CHECKING

from pearscarf.consumer import Consumer

if TYPE_CHECKING:
    from pearscarf.expert_context import ExpertContext


class GithubIngest(Consumer):
    name = "githubscarf"
    default_poll_interval = 300.0

    def __init__(self, ctx, poll_interval=None):
        from githubscarf.github_connect import GitHubConnect

        if poll_interval is None:
            poll_interval = float(ctx.config.get("GITHUB_POLL_INTERVAL", self.default_poll_interval))
        super().__init__(poll_interval=poll_interval)
        self._ctx = ctx
        self._connect = GitHubConnect(ctx)
        self._pending = []

    def _next(self):
        # Drain the buffered batch first; fetch a new one only when it is empty.
        if self._pending:
            return self._pending.pop(0)
        issues = self._connect.list_issues()
        if not issues:
            return None
        self._pending = list(issues)
        return self._pending.pop(0) if self._pending else None

    def _handle(self, issue):
        rid = self._connect.ingest_record(issue)
        if rid:
            self._ctx.log.write(self._ctx.expert_name, "action", f"Ingested {rid}")


def start(ctx: "ExpertContext"):
    consumer = GithubIngest(ctx)
    consumer.start()
    return consumer._thread
`knowledge/agent.md` is the system prompt for the LLM agent. It tells the agent what tools are available and how to behave:
You are a GitHub expert agent. You access GitHub through the REST API.
Your job is to manage pull requests and issues — list, read, search as requested.
IMPORTANT: Use the reply tool to send results back. Your text responses
are only logged internally.
`knowledge/extraction.md` holds source-specific extraction guidance: it tells the Extraction consumer what to extract from this source's records and what to ignore. PearScarf automatically combines this with its universal extraction rules and entity type definitions when processing your records:
## GitHub extraction guidance
Extract from the body, not metadata fields. Look for decisions,
blockers, people mentioned by name. Ignore bot comments and
template boilerplate.
Files under `knowledge/entities/` are only needed if the expert introduces new entity types. Each file defines one type:
**repository**
A GitHub repository — a codebase with its own issues, PRs, and contributors.
Extract when: the record references a specific repo by name.
Do not extract when: "repository" is used generically.
Files under `knowledge/records/` document each record type's fields and identity rules. They are not used in prompt composition; they exist for human reference and agent context.
`knowledge/relevancy.md` holds source-specific prose guidance for the triage agent. It describes what looks like signal, what looks like noise, and what should be flagged as uncertain for this expert's records. The triage agent loads it alongside the deployment onboarding when classifying `pending_triage` records.
Relevancy guidance for <source> records.
## Keep as relevant
- Direct human correspondence with a known sender or recipient
- Replies in an ongoing thread
- Messages referencing a known project or deal
## Discard as noise
- Marketing, newsletters, sales outreach past the hard filter
- Transactional system mail unrelated to operational work
## Flag as uncertain
- Unknown sender, personally written, no anchor in the world
- Anything where the signal is subtle
Experts with `relevancy_check: skip` don't need this file.
Inside your connect class, add a method like _is_noise(raw, metadata) -> bool and call it in ingest_record. Pass classification="noise" to save_record on a hit; pass nothing (None) on a miss and the framework will default to pending_triage. Keep the filter deterministic and conservative — only flag unambiguous automated/bulk patterns. Everything borderline should pass through to the triage agent, which has world context the hard filter lacks.
Ship a .env.example with required vars:
GITHUB_TOKEN=
GITHUB_REPO=
GITHUB_POLL_INTERVAL=300
On install, PearScarf copies this to env/.githubscarf.env. The operator fills in values. At startup, build_context() loads them into ctx.config.
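For local testing it can help to load the same file format yourself. This is a minimal sketch of turning `KEY=value` lines into a dict and spotting unfilled values. `parse_env` and `missing_vars` are hypothetical helpers, not part of PearScarf, and `build_context()` may parse the file differently. The repository value is a placeholder.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


def missing_vars(config: dict[str, str], required: list[str]) -> list[str]:
    """Return the required names that are absent or left empty."""
    return [name for name in required if not config.get(name)]


env_text = """\
# credentials for githubscarf
GITHUB_TOKEN=
GITHUB_REPO=example-org/example-repo
GITHUB_POLL_INTERVAL=300
"""
config = parse_env(env_text)
print(missing_vars(config, ["GITHUB_TOKEN", "GITHUB_REPO"]))
```

Values stay strings, which is why the ingester in this guide wraps `GITHUB_POLL_INTERVAL` in `float(...)`.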
psc install ./experts/githubscarf
The install runs 7 validation stages:
- Package locatable
- Manifest valid
- Knowledge contract (agent.md, extraction.md, entity files)
- Entry points (tools module imports, ingester module imports)
- Conflict checks (source_type unique, record_types not claimed)
- Identifier patterns valid
- Eval dataset (non-blocking warning if missing)
On success: DB rows created, typed tables created, credentials scaffolded.
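The knowledge contract stage can be approximated locally before you run the installer. The sketch below assumes that `agent.md`, `extraction.md`, and one file per new entity type are all expected. `check_knowledge_contract` is a hypothetical helper, and the real installer may apply different or looser rules.

```python
import tempfile
from pathlib import Path


def check_knowledge_contract(expert_dir: Path, new_entity_types: list[str]) -> list[str]:
    """Return the expected knowledge files that are missing, relative to expert_dir."""
    knowledge = expert_dir / "knowledge"
    expected = [knowledge / "agent.md", knowledge / "extraction.md"]
    # One definition file per entity type declared in the manifest.
    expected += [knowledge / "entities" / f"{name}.md" for name in new_entity_types]
    return [path.relative_to(expert_dir).as_posix() for path in expected if not path.is_file()]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "knowledge" / "entities").mkdir(parents=True)
    (root / "knowledge" / "agent.md").write_text("You are a GitHub expert agent.")
    missing = check_knowledge_contract(root, ["repository"])

print(missing)
```

Here only `agent.md` exists, so the other two expected files are reported as missing.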
# Fill in credentials
vi env/.githubscarf.env
# Verify
psc expert list
psc expert inspect githubscarf
# Run
psc run --poll
- Startup: `start_system()` calls `get_tools(ctx)` on your connect module. The returned connect instance is cached by record type. If `agent.md` exists, an `ExpertBot` starts for your expert.
- Polling: if `--poll` is set, `start(ctx)` is called on your ingester module. Your polling loop runs as a daemon thread.
- Messages: when the assistant sends a message to your expert (by name), the `ExpertBot` dispatches it to a per-session `ExpertAgent` with your tools.
- Ingestion: your `ingest_record()` writes to the generic `records` table and your typed table. Triage classifies the record (relevant/noise); Extraction picks up relevant records and extracts entities/facts using your `extraction.md`.
- Architecture — system design, startup flow, prompt composition
- Getting Started — installation and first run
- Data Model — entities, facts, graph schema