Although CLI environments are primarily designed for software and website development, they also work well for creating evidence reviews (ERs).
They offer several advantages for our use case:
- Automatic loading of skills
- Automatic loading of settings/permissions
- Automatic loading of MCP servers
- Proper handling of @agents, avoiding session bias and context bloat
- Multiple, parallel sessions
To guard against malicious MCP code, prompt injection, and potential errors in the CLI itself, we have two main security measures in place:
- Sandboxing of all MCPs in Docker containers
- Sandboxing of the CLI itself using Docker SBX
We tested three environments for ER creation and auditing, as well as model evaluation. The repository provides the configuration for all of them.
- Claude Code - Opus
- OpenAI Codex - GPT
- OpenCode - Grok & Gemini (or any other model that can use MCP servers)
- Check out this project
  git clone https://github.com/forever-healthy/AI4L
- Or download and expand AI4L-main.zip if you don't want to use git.
Tip
On macOS, use CMD-SHIFT-. to show hidden files and access .claude, .opencode, or .mcp.json
- prompts/ - prompt files used by the agents and skills, including the main AI4L.md prompt
- creation/ - created ERs & QA audits, tmp files & trash
- docs/ - documentation files, including this one
- examples/ - example ERs and audits created with the CLI environments
- CLAUDE.md - project global instructions, also read by OpenCode & Codex
- .claude/ - related to Claude Code, including agents, skills, and settings
- .mcp.json - configuration for the local MCP servers (used by Claude Code)
- .opencode/ - related to OpenCode, including agents, skills, and settings
- AGENTS.md - project global instructions for Codex
- .codex/ - related to OpenAI Codex, including agents, skills, and settings
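Because several of these entries are hidden files, it can be handy to confirm that a fresh checkout contains all of them. The following is a minimal sketch, not part of the repository: the list of names is taken from the layout above, while the helper name `missing_entries` is made up for this illustration.

```python
from pathlib import Path

# Names taken from the layout described above. The check itself is only an
# illustration and is not shipped with the project.
EXPECTED = [
    "prompts", "creation", "docs", "examples",
    "CLAUDE.md", ".claude", ".mcp.json", ".opencode", "AGENTS.md", ".codex",
]


def missing_entries(root):
    """Return the expected files/directories that do not exist under root."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_entries(".")
    print("missing:", missing if missing else "none")
```

Run it from the project root; an empty result means every entry listed above was found.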
For security reasons, we only run our CLIs in a dedicated Docker sandbox.
Docker SBX:
- Protects against prompt injection by nefarious sites or MCP servers
- Protects the local environment from Claude Code/Codex/OpenCode errors
- Provides the containers for the MCP servers, so there is no need to install Docker Desktop separately
It is installed with brew and needs a one-time login with a Docker account:
brew install docker/tap/sbx
sbx login
cd .../AI4L
sbx run claude
sbx run opencode
sbx run codex
Important
We strongly recommend running the CLIs only within the sandbox!!
We are using three local MCP servers with tools that significantly improve the quality of review creation and audit.
The MCPs are configured in:
- Claude Code: .mcp.json
- OpenCode: .opencode/opencode.json
- Codex: .codex/config.toml
They are automatically loaded at startup and then run in isolated Docker containers to protect against malicious code.
There is no need to install Docker Desktop separately, as the Docker SBX will handle the containers for the MCP servers.
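To see which servers a configuration declares, you can read the file directly. The sketch below assumes that .mcp.json follows the usual Claude Code layout with a top-level `mcpServers` object; the server names and Docker arguments in `SAMPLE` are hypothetical placeholders, not the entries shipped in this repository.

```python
import json


def mcp_server_names(config_text):
    """Return the sorted server names declared under "mcpServers"."""
    config = json.loads(config_text)
    return sorted(config.get("mcpServers", {}))


# Hypothetical example content; the real .mcp.json in the repository may
# use different names, commands, and arguments.
SAMPLE = """
{
  "mcpServers": {
    "example-fetch": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "example/fetch-image"]
    },
    "example-pubmed": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "example/pubmed-image"]
    }
  }
}
"""

if __name__ == "__main__":
    print(mcp_server_names(SAMPLE))
```

To inspect the real file, pass the contents of .mcp.json instead of `SAMPLE`. The OpenCode and Codex files use their own formats and are not covered by this sketch.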
- Local requests to websites are less likely to be blocked
- Is set to ignore robots.txt
- Is set to a Chrome user agent to avoid blocks from sites that disallow bots or have aggressive bot detection
cyanheads/clinicaltrialsgov-mcp-server
- Local requests to clinicaltrials.gov are less likely to be blocked
- Local requests to PubMed are less likely to be blocked
- To increase rate limits: Get a free account with NCBI > get a free API key
- Save the NCBI_API_KEY & NCBI_ADMIN_EMAIL in .env to use the API key for increased rate limits
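A quick way to confirm that both settings are present is to parse the file and look for empty or missing values. This is a minimal sketch that assumes .env uses plain `KEY=value` lines; the two variable names come from the text above, while the helper names and the placeholder value are made up for illustration.

```python
REQUIRED = ("NCBI_API_KEY", "NCBI_ADMIN_EMAIL")


def parse_env(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_ncbi_settings(text):
    """Return the required NCBI settings that are absent or empty."""
    values = parse_env(text)
    return [key for key in REQUIRED if not values.get(key)]


# Placeholder content only; never commit a real API key.
SAMPLE = "NCBI_API_KEY=your-key-here\nNCBI_ADMIN_EMAIL=\n"

if __name__ == "__main__":
    print(missing_ncbi_settings(SAMPLE))
```

With the sample above, the email is reported as missing because its value is empty.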
Claude Code includes built-in MCP servers for PubMed & Clinical Trials. However, they need credentials and make their requests from Anthropic's data centers, and such requests are often blocked by target websites. We recommend disabling them and using the local versions instead.
/mcp > `claude.ai Clinical Trials` & `claude.ai Clinical PubMed` > `disabled`
First startup might take a while:
- On first startup, sbx needs to build the container for the CLI and the containers for individual MCP servers.
- Some MCPs might be displayed as failed on the first run
- You can check the status of MCP servers with /mcp
- Usually, containers are fully instantiated after a short while
- On subsequent startups, all MCP containers are cached and will launch and connect instantly
We are currently using Opus / Claude Code for all creation/audit/fixing agents and Sonnet for workflow processing.
Sonnet is more cost-effective and faster for processing tasks that don't require the advanced capabilities of Opus.
You can use this simple prompt to create your first ER and iteratively audit it with the /er skill:
Claude Code / OpenCode
/er full Tadalafil
Codex
$er full Tadalafil
It will create an ER for Tadalafil (Cialis), audit it, fix the findings, and audit it again, repeating until it reaches a 100% pass rate or the maximum number of audits defined.
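The control flow of that audit and fix cycle can be sketched as a small loop. This is only an illustration of the stopping rule, not the skill's implementation: `audit` and `fix` are hypothetical callables standing in for the agents, and skipping the fix after the last allowed audit is an assumption made here.

```python
def iterate(audit, fix, max_audits):
    """Audit, then fix and re-audit, until 100% pass or max_audits is reached.

    audit() returns a pass rate in percent; fix() applies the findings.
    Returns the list of pass rates, one per audit performed.
    """
    history = []
    for attempt in range(1, max_audits + 1):
        pass_rate = audit()
        history.append(pass_rate)
        if pass_rate >= 100.0:
            break
        # Assumption: no fix after the final allowed audit, since it could
        # not be re-audited anyway.
        if attempt < max_audits:
            fix()
    return history


if __name__ == "__main__":
    # Simulated audits that improve after each fix.
    rates = iter([80.0, 95.0, 100.0])
    fixes = []
    print(iterate(lambda: next(rates), lambda: fixes.append("fix"), max_audits=5))
    print(len(fixes), "fixes applied")
```

In the simulated run the loop stops at the third audit, after two fixes, because the pass rate has reached 100%.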
We have included our /er skill, which allows for easy creation and auditing of evidence reviews. The skill is automatically loaded by the CLIs on startup.
We have implemented the following /er commands (not case sensitive) in the skill:
- /er create [ <intervention> [ for|to|as|: <goal> ] ] - Create evidence review for <intervention> to achieve <goal>. The default goal is "for Health & Longevity". Results are saved in [creation_dir] with filenames constructed as defined in AI4L.md
- /er audit [ <filename> | <intervention> | all ] - Review a specific ER, all unaudited ERs, or the latest ER if nothing is specified. Can also be used to create audits for all ERs generated with other environments/models and saved in [creation_dir].
- /er fix [ <filename> | <intervention> ] - Review a specific ER, or the latest ER if nothing is specified, and fix the findings.
- /er iterate [ <filename> | <intervention> ] - Loop (audit → fix) until an audit shows 100% pass rate (up to [max_audits])
- /er full [ <intervention> ] - Create ER for <intervention> → Loop (audit → fix)
- /er compare [ <intervention> ] - Compare all ERs for <intervention>, or use the latest <intervention> worked on, if none specified. Allows comparison of ERs across different models for the same intervention. Very helpful when combined with the "/er audit all" command to gain a comprehensive understanding of the models' relative performance.
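To make the bracket notation concrete, here is a small sketch that reads an `/er create` line according to the grammar listed above. It only illustrates how the notation reads, not how the skill processes input: the function name `parse_create` is made up, splitting at the first separator is an assumption, and what the skill does when no intervention is given is not specified here.

```python
import re

# The documented default is "for Health & Longevity"; the leading "for" is
# the separator, so only the goal text is stored.
DEFAULT_GOAL = "Health & Longevity"

# Separators from the grammar: for | to | as | :
_SEPARATOR = re.compile(r"\s+(?:for|to|as)\s+|\s*:\s*", re.IGNORECASE)


def parse_create(command):
    """Split '/er create <intervention> [sep <goal>]' into (intervention, goal)."""
    tokens = command.strip().split(None, 2)
    if (
        len(tokens) < 2
        or tokens[0].lower() not in ("/er", "$er")  # $er is the Codex form
        or tokens[1].lower() != "create"
    ):
        raise ValueError("not an /er create command")
    rest = tokens[2].strip() if len(tokens) > 2 else ""
    if not rest:
        # Intervention is optional in the grammar; the resulting behavior
        # is not described in this document.
        return None, DEFAULT_GOAL
    # Assumption: the first separator found ends the intervention.
    parts = _SEPARATOR.split(rest, maxsplit=1)
    intervention = parts[0].strip()
    goal = parts[1].strip() if len(parts) > 1 and parts[1].strip() else DEFAULT_GOAL
    return intervention, goal


if __name__ == "__main__":
    print(parse_create("/er create Tadalafil"))
    print(parse_create("/er create Rapamycin for muscle preservation"))
    print(parse_create("/ER CREATE Creatine: cognition"))
```

An intervention name that itself contains one of the separator words would be split too early by this sketch, which is one reason to treat it as an illustration of the notation only.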
- Zed as a lightweight editor
- MacDown for visually editing .md files
- Markdown Reader Chrome Extension as an easy-to-use markdown viewer
Check out our Lessons Learned document for insights and best practices.