
The System Agent

In Task 2 you built an agent that reads documentation. But documentation can be outdated — the real system is the source of truth. In this task you will give your agent a new tool (query_api) so it can talk to your deployed backend, and teach it to answer two new kinds of questions: static system facts (framework, ports, status codes) and data-dependent queries (item count, scores).

What you will add

You will add a query_api tool to the agent you built in Task 2. The agentic loop stays the same — you are just adding one more tool the LLM can call. The agent can now send requests to your deployed backend in addition to reading files.

CLI interface

Same rules as Task 2. The only change: source is now optional (system questions may not have a wiki source).

uv run agent.py "How many items are in the database?"
{
  "answer": "There are 120 items in the database.",
  "tool_calls": [
    {"tool": "query_api", "args": {"method": "GET", "path": "/items/"}, "result": "{\"status_code\": 200, ...}"}
  ]
}

New tool: query_api

Call your deployed backend API. Register it as a function-calling schema alongside your existing tools.

  • Parameters: method (string — GET, POST, etc.), path (string — e.g., /items/), body (string, optional — JSON request body).
  • Returns: JSON string with status_code and body.
  • Authentication: use LMS_API_KEY from .env.docker.secret (the backend key, not the LLM key).

Update your system prompt so the LLM knows when to use wiki tools vs query_api vs read_file on source code.

Note: Two distinct keys: LMS_API_KEY (in .env.docker.secret) protects your backend endpoints. LLM_API_KEY (in .env.agent.secret) authenticates with your LLM provider. Don't mix them up.
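To make this concrete, here is a minimal sketch of the schema and implementation, assuming the OpenAI-style function-calling format from Task 2 and the requests library. The schema name, parameter descriptions, and the bearer-token header are illustrative assumptions; adapt them to however your backend actually authenticates.

```python
import json
import os

import requests  # assumption: requests (or httpx) is already a dependency

# Illustrative name; the structure follows the OpenAI-style
# function-calling format used for the Task 2 tools.
QUERY_API_SCHEMA = {
    "type": "function",
    "function": {
        "name": "query_api",
        "description": "Send an HTTP request to the deployed backend API "
                       "and return the status code and response body.",
        "parameters": {
            "type": "object",
            "properties": {
                "method": {"type": "string", "description": "HTTP method, e.g. GET or POST"},
                "path": {"type": "string", "description": "API path, e.g. /items/"},
                "body": {"type": "string", "description": "Optional JSON request body"},
            },
            "required": ["method", "path"],
        },
    },
}


def query_api(method: str, path: str, body: str | None = None) -> str:
    base_url = os.environ.get("AGENT_API_BASE_URL", "http://localhost:42002")
    headers = {
        # Assumption: the backend expects the key as a bearer token.
        # Adjust this header to match your backend's actual auth scheme.
        "Authorization": f"Bearer {os.environ['LMS_API_KEY']}",
        "Content-Type": "application/json",
    }
    resp = requests.request(
        method=method.upper(),
        url=base_url.rstrip("/") + path,
        headers=headers,
        data=body,
        timeout=30,
    )
    # Return a JSON string so the result can be passed straight back to the LLM.
    return json.dumps({"status_code": resp.status_code, "body": resp.text})
```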

Environment variables

Your agent must read all configuration from environment variables, not hardcoded values. The .env.agent.secret and .env.docker.secret files are local conveniences — the autochecker will inject its own values when evaluating your agent.

| Variable | Purpose | Source |
|---|---|---|
| LLM_API_KEY | LLM provider API key | .env.agent.secret |
| LLM_API_BASE | LLM API endpoint URL | .env.agent.secret |
| LLM_MODEL | Model name | .env.agent.secret |
| LMS_API_KEY | Backend API key for query_api auth | .env.docker.secret |
| AGENT_API_BASE_URL | Base URL for query_api (default: http://localhost:42002) | Optional, defaults to localhost |

Important

The autochecker runs your agent with different LLM credentials and a different backend URL. If you hardcode any of these values, your agent will fail the autochecker evaluation.
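One way to centralize this is a small configuration block read once at startup. This is a sketch: python-dotenv is an optional local convenience assumed here, and any equivalent loading mechanism works as long as environment variables injected by the autochecker take precedence.

```python
import os

from dotenv import load_dotenv  # optional local convenience; not required in CI

# load_dotenv does not override variables that are already set, so values
# injected by the autochecker always win over the local secret files.
load_dotenv(".env.agent.secret")
load_dotenv(".env.docker.secret")

LLM_API_KEY = os.environ["LLM_API_KEY"]    # fail fast if a required key is missing
LLM_API_BASE = os.environ["LLM_API_BASE"]
LLM_MODEL = os.environ["LLM_MODEL"]
LMS_API_KEY = os.environ["LMS_API_KEY"]
# The only variable with a documented default:
AGENT_API_BASE_URL = os.environ.get("AGENT_API_BASE_URL", "http://localhost:42002")
```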

Pass the benchmark

Once query_api works, run the evaluation benchmark locally and iterate until your agent passes.

uv run run_eval.py

The script runs your agent against 10 local questions covering all question classes (wiki lookup, system facts, data queries, bug diagnosis, reasoning). Each failing question prints a feedback hint.

  ✓ [1/10] According to the project wiki, what steps are needed to protect a branch?
  ✓ [2/10] What Python web framework does this project use?
  ✓ [3/10] How many items are in the database?

  ✗ [4/10] Query the /analytics/completion-rate endpoint for lab-99...
    feedback: Try GET /analytics/completion-rate?lab=lab-99. Read the error, then find the buggy line.

3/10 passed

Fix the failing question, re-run, move on to the next one.

Benchmark questions (open set)

These are the 10 questions run_eval.py tests locally. There are two grading modes:

  • Keyword match — the answer must contain one or more of the listed keywords.
  • LLM judge — the autochecker sends your answer to an LLM grader with a rubric. Used for open-ended reasoning questions where keywords alone are not enough. run_eval.py falls back to a basic length check locally; the bot uses the full LLM judge.

Your agent must also use the listed tool(s) — calling the wrong tool fails the check even if the answer text is correct.

| # | Question | Grading | Expected | Tools required |
|---|---|---|---|---|
| 0 | According to the project wiki, what steps are needed to protect a branch on GitHub? | keyword | branch, protect | read_file |
| 1 | What does the project wiki say about connecting to your VM via SSH? Summarize the key steps. | keyword | ssh / key / connect | read_file |
| 2 | What Python web framework does this project's backend use? Read the source code to find out. | keyword | FastAPI | read_file |
| 3 | List all API router modules in the backend. What domain does each one handle? | keyword | items, interactions, analytics, pipeline | list_files |
| 4 | How many items are currently stored in the database? Query the running API to find out. | keyword | a number > 0 | query_api |
| 5 | What HTTP status code does the API return when you request /items/ without an authentication header? | keyword | 401 / 403 | query_api |
| 6 | Query /analytics/completion-rate for a lab with no data (e.g., lab-99). What error do you get, and what is the bug in the source code? | keyword | ZeroDivisionError / division by zero | query_api, read_file |
| 7 | The /analytics/top-learners endpoint crashes for some labs. Query it, find the error, and read the source code to explain what went wrong. | keyword | TypeError / None / NoneType / sorted | query_api, read_file |
| 8 | Read docker-compose.yml and the backend Dockerfile. Explain the full journey of an HTTP request from the browser to the database and back. | LLM judge | must trace ≥4 hops: Caddy → FastAPI → auth → router → ORM → PostgreSQL | read_file |
| 9 | Read the ETL pipeline code. Explain how it ensures idempotency — what happens if the same data is loaded twice? | LLM judge | must identify the external_id check and explain that duplicates are skipped | read_file |
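For intuition about what the keyword-graded questions reward, the check boils down to a containment test on the answer plus a tool check on the trace. This is a rough conceptual sketch of the idea, not the actual run_eval.py implementation:

```python
def keyword_pass(answer: str, keywords: list[str]) -> bool:
    # Passes if the answer mentions at least one expected keyword.
    answer = answer.lower()
    return any(kw.lower() in answer for kw in keywords)


def tools_pass(tool_calls: list[dict], required: set[str]) -> bool:
    # Every required tool must appear at least once in the trace.
    used = {call["tool"] for call in tool_calls}
    return required <= used
```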

Note

The autochecker tests your agent with 10 additional hidden questions not present in run_eval.py. These include multi-step challenges that require chaining tools (e.g., query an API error, then read the source code to diagnose the bug). You need a genuinely working agent — not hard-coded answers.

Note

How the autochecker scores your agent:

  • Locally, run_eval.py checks answers with simple keyword matching.
  • The autochecker bot uses the same keyword checks, but for open-ended reasoning questions (e.g., "explain the request lifecycle") it uses LLM-based judging with a rubric — a stricter and more accurate evaluation.
  • The bot also verifies that your agent used the correct tools (e.g., query_api for data questions, read_file for code questions).
  • You need to pass a minimum threshold overall (local + hidden questions combined).

Debugging workflow

| Symptom | Likely cause | Fix |
|---|---|---|
| Agent doesn't use a tool when it should | Tool description too vague for the LLM | Improve the tool's description in the schema |
| Tool called but returns an error | Bug in tool implementation | Fix the tool code, test it in isolation |
| Tool called with wrong arguments | LLM misunderstands the schema | Clarify parameter descriptions |
| Agent times out | Too many tool calls or slow LLM | Reduce max iterations, try a faster model |
| Agent crashes with AttributeError: 'NoneType' | LLM returns content: null when it makes tool calls | Use (msg.get("content") or "") instead of msg.get("content", "") — the field is present but null, not missing |
| Agent reads the same file in a loop | File is too large and gets truncated, LLM can't find the answer | Increase the content limit sent back to the LLM |
| Answer is close but doesn't match | Phrasing doesn't contain an expected keyword | Adjust the system prompt to be more precise |
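The AttributeError row deserves a concrete illustration, because the key is present but its value is null:

```python
# A typical assistant message that contains tool calls:
msg = {"role": "assistant", "content": None, "tool_calls": [{"id": "call_1"}]}

msg.get("content", "")      # returns None: the key exists, so the default is never used
(msg.get("content") or "")  # returns "": treats an explicit null as empty text
```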

Deliverables

1. Plan (plans/task-3.md)

Before writing code, create plans/task-3.md. Describe how you will define the query_api tool schema, handle authentication, and update the system prompt.

After running the benchmark once, add your initial score, first failures, and iteration strategy.

2. Tool and agent updates (update agent.py)

Add query_api as a function-calling schema, implement it with authentication, and update the system prompt. Then iterate until the benchmark passes.

3. Documentation (update AGENT.md)

Update AGENT.md to document the query_api tool, its authentication, how the LLM decides between wiki and system tools, lessons learned from the benchmark, and your final eval score. At least 200 words.

4. Tests (2 more tests)

Add 2 regression tests for system agent tools. Example questions:

  • "What framework does the backend use?" → expects read_file in tool_calls.
  • "How many items are in the database?" → expects query_api in tool_calls.

Acceptance criteria

  • plans/task-3.md exists with the implementation plan and benchmark diagnosis.
  • agent.py defines query_api as a function-calling schema.
  • query_api authenticates with LMS_API_KEY from environment variables.
  • The agent reads all LLM config (LLM_API_KEY, LLM_API_BASE, LLM_MODEL) from environment variables.
  • The agent reads AGENT_API_BASE_URL from environment variables (defaults to http://localhost:42002).
  • The agent answers static system questions correctly (framework, ports, status codes).
  • The agent answers data-dependent questions with plausible values.
  • run_eval.py passes all 10 local questions.
  • AGENT.md documents the final architecture and lessons learned (at least 200 words).
  • 2 tool-calling regression tests exist and pass.
  • The agent passes the autochecker bot benchmark.
  • Git workflow: issue [Task] The System Agent, branch, PR with Closes #..., partner approval, merge.