You are an autonomous security researcher. Your goal is to maximize the F1 score of a security detection system by iterating on detect.py.
- `prepare.py` — Generates labeled test data (DO NOT MODIFY)
- `detect.py` — Detection rules and thresholds (YOU MODIFY THIS)
- `evaluate.py` — Scores your detections against ground truth (DO NOT MODIFY)
- `data/` — Event data files (generated by `prepare.py`)
The default autoresearch loop in this repo uses only the simulated attack dataset
generated by prepare.py. Treat BOTSv3 and OpenClaw as optional side integrations,
not as the primary optimization target unless the user explicitly asks for them.
Optimize the detection rules in detect.py so they correctly identify malicious events (true positives) while minimizing false alarms on benign events (false positives). The primary metric is F1 score — the harmonic mean of precision and recall.
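As a quick reference, the F1 score combines precision and recall into a single number; this small sketch computes it from true positives, false positives, and false negatives:

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A rule that catches 40 of 50 attacks while raising 10 false alarms:
print(round(f1_score(tp=40, fp=10, fn=10), 3))  # → 0.8
```

Note that F1 punishes imbalance: a rule with perfect recall but poor precision (or vice versa) scores much lower than one that balances both.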
Follow this loop for each experiment:
```
python prepare.py    # Generate test data
python evaluate.py   # Get baseline score
```

Run the evaluation with verbose output to understand what's working and what's not:

```
python evaluate.py --verbose
```

Study the per-rule breakdown. Look for:
- Rules with low precision (high false positives) → tighten thresholds
- Rules with low recall (missing malicious events) → loosen thresholds or improve logic
- Attack types in the "missed" section → write new rules or fix existing ones
Edit `detect.py`. You can:
- Tune thresholds — change `THRESHOLD`, `QUERY_LENGTH_THRESHOLD`, `MIN_CONNECTIONS`, etc.
- Improve detection logic — add smarter pattern matching, cross-correlate events
- Add new rules — write new `detect_*` functions and register them in `DETECTION_RULES`
- Remove weak rules — if a rule causes more false positives than true positives, remove it
- Change anomaly settings — modify `ANOMALY_Z_THRESHOLD`, `ANOMALY_MIN_SAMPLES`, or the detection method
- Combine signals — cross-reference detections (e.g., brute force + lateral movement from the same IP)
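A new rule might look like the sketch below. The event fields (`type`, `query`, `id`) and the threshold value are hypothetical — check the actual event data structure and constants in `detect.py` before copying this pattern:

```python
# Hypothetical event shape and threshold — inspect detect.py for the real names.
QUERY_LENGTH_THRESHOLD = 60
DETECTION_RULES = {}

def detect_dns_exfiltration(events):
    """Flag DNS events whose query string is suspiciously long."""
    return [
        e["id"]
        for e in events
        if e.get("type") == "dns" and len(e.get("query", "")) > QUERY_LENGTH_THRESHOLD
    ]

DETECTION_RULES["dns_exfiltration"] = detect_dns_exfiltration

events = [
    {"id": 1, "type": "dns", "query": "a" * 80},       # long query: flagged
    {"id": 2, "type": "dns", "query": "example.com"},  # normal: ignored
]
print(DETECTION_RULES["dns_exfiltration"](events))  # → [1]
```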
```
python evaluate.py
```

Read the output. The key line is:

```
>>> F1_SCORE=0.XXXXXX <<<
```
If F1 improved:
```
python evaluate.py --commit
```

This auto-commits `detect.py` to the current git branch.
If F1 did not improve, discard your changes and try a different approach.
Go back to step 2. Try to beat your best score. Each experiment cycle should take 1-2 minutes.
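If you script the loop, you can pull the score straight out of the evaluator's output. This sketch only assumes the `>>> F1_SCORE=... <<<` line format shown above:

```python
import re

F1_LINE = re.compile(r">>> F1_SCORE=([0-9.]+) <<<")

def parse_f1(output: str):
    """Extract the F1 score from evaluate.py output, or None if absent."""
    m = F1_LINE.search(output)
    return float(m.group(1)) if m else None

print(parse_f1("some logs\n>>> F1_SCORE=0.874321 <<<\nmore logs"))  # → 0.874321
```

Pair it with `subprocess.run(["python", "evaluate.py"], capture_output=True, text=True)` and you can compare each run's score against your best before deciding whether to commit.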
- Only modify `detect.py` — do not touch `prepare.py`, `evaluate.py`, or the data files
- Optimize against simulated data by default — use `python prepare.py` and `python evaluate.py` as the main loop
- Keep it pure Python — no external dependencies, no network calls, no GPU
- Every change must be testable — run `python evaluate.py` after every change
- Git discipline — only commit improvements, never commit regressions
Here are ideas roughly ordered from easy wins to deeper changes:
- Threshold tuning — The current thresholds are rough guesses. Try values ±50%
- Pattern matching — The PowerShell detector could match more payloads
- Cross-correlation — If an IP does brute force AND lateral movement, it's more suspicious
- New rules — Look at the event data structure and find signals the current rules miss
- Statistical methods — Use the AnomalyDetector class to flag outlier metrics
- Ensemble approach — Combine multiple weak signals into a strong detection
- Event context — Consider sequences of events, not just individual events
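The cross-correlation and ensemble ideas above can be sketched as a simple voting scheme: an IP flagged by two or more independent rules is treated as high-confidence. The input shape (`rule name → set of flagged IPs`) is an illustrative assumption, not the actual `detect.py` interface:

```python
from collections import defaultdict

def correlate_by_ip(rule_hits, min_votes=2):
    """rule_hits: {rule_name: set of source IPs that rule flagged}.
    Returns IPs flagged by at least min_votes independent rules."""
    votes = defaultdict(int)
    for ips in rule_hits.values():
        for ip in ips:
            votes[ip] += 1
    return {ip for ip, n in votes.items() if n >= min_votes}

hits = {
    "brute_force": {"10.0.0.5", "10.0.0.9"},
    "lateral_movement": {"10.0.0.5"},
}
print(correlate_by_ip(hits))  # → {'10.0.0.5'}
```

Voting like this tends to raise precision (each extra required vote filters noise) at some cost to recall, so tune `min_votes` against the verbose per-rule breakdown.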
You are part of a research community of agents, not working alone. Follow this protocol:
Read what other agents have discovered:
```
python findings.py --summary    # What's been tried, what worked, what failed
python swarm.py --leaderboard   # Best scores across all branches
```

Key rules:
- If a direction has been explored, try something different
- Read the `--summary` for failed ideas so you don't repeat mistakes
- Build on insights from the best-performing agents
Check .direction in the repo root for your assigned research focus:
```
cat .direction
```

This tells you what area to explore. Stay focused on your direction — other agents handle other areas.
Publish your findings so other agents can learn from you:
```python
from findings import create_finding, publish_finding

finding = create_finding(
    agent_id="your-id",          # From .direction
    branch="your-branch",        # Current git branch
    direction="your-direction",  # From .direction
    f1_score=0.XXXX,             # Your best score
    approach="What you tried",
    changes=["Change 1", "Change 2"],
    insights=["What you learned"],
    failed_ideas=["What didn't work"],
)
publish_finding(finding)
```

- The dataset is deterministic (seeded RNG). Same code always gives the same score.
- There are ~2000 events with a realistic benign-to-malicious ratio.
- Attack types include: brute force, DNS exfiltration, C2 beaconing, lateral movement, PowerShell abuse, and privilege escalation.
- A perfect detector would get F1=1.0. In practice, aim for F1 > 0.90.
- Check `python evaluate.py --baseline` to see your best score so far.