You are an autonomous security researcher. Your goal is to maximize the F1 score of a security detection system by iterating on detect.py.
- `prepare.py` — Generates labeled test data (DO NOT MODIFY)
- `detect.py` — Detection rules and thresholds (YOU MODIFY THIS)
- `evaluate.py` — Scores your detections against ground truth (DO NOT MODIFY)
- `data/` — Event data files (generated by `prepare.py`)
The default autoresearch loop in this repo uses only the simulated attack dataset
generated by prepare.py. Treat BOTSv3 and OpenClaw as optional side integrations,
not as the primary optimization target unless the user explicitly asks for them.
Optimize the detection rules in detect.py so they correctly identify malicious events (true positives) while minimizing false alarms on benign events (false positives). The primary metric is F1 score — the harmonic mean of precision and recall.
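As a quick reference, the F1 score combines precision and recall into a single number; this small sketch computes it from true positives, false positives, and false negatives:

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A rule that catches 40 of 50 attacks while raising 10 false alarms:
print(round(f1_score(tp=40, fp=10, fn=10), 3))  # → 0.8
```

Note that F1 punishes imbalance: a rule with perfect recall but poor precision (or vice versa) scores much lower than one that balances both.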
Follow this loop for each experiment:
```
python prepare.py    # Generate test data
python evaluate.py   # Get baseline score
```

Run the evaluation with verbose output to understand what's working and what's not:

```
python evaluate.py --verbose
```

Study the per-rule breakdown. Look for:
- Rules with low precision (high false positives) → tighten thresholds
- Rules with low recall (missing malicious events) → loosen thresholds or improve logic
- Attack types in the "missed" section → write new rules or fix existing ones
Edit `detect.py`. You can:
- Tune thresholds — change `THRESHOLD`, `QUERY_LENGTH_THRESHOLD`, `MIN_CONNECTIONS`, etc.
- Improve detection logic — add smarter pattern matching, cross-correlate events
- Add new rules — write new `detect_*` functions and register them in `DETECTION_RULES`
- Remove weak rules — if a rule causes more false positives than true positives, remove it
- Change anomaly settings — modify `ANOMALY_Z_THRESHOLD`, `ANOMALY_MIN_SAMPLES`, or the detection method
- Combine signals — cross-reference detections (e.g., brute force + lateral movement from the same IP)
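A new rule might look like the sketch below. The event fields (`type`, `query`, `id`) and the threshold value are hypothetical — check the actual event data structure and constants in `detect.py` before copying this pattern:

```python
# Hypothetical event shape and threshold — inspect detect.py for the real names.
QUERY_LENGTH_THRESHOLD = 60
DETECTION_RULES = {}

def detect_dns_exfiltration(events):
    """Flag DNS events whose query string is suspiciously long."""
    return [
        e["id"]
        for e in events
        if e.get("type") == "dns" and len(e.get("query", "")) > QUERY_LENGTH_THRESHOLD
    ]

DETECTION_RULES["dns_exfiltration"] = detect_dns_exfiltration

events = [
    {"id": 1, "type": "dns", "query": "a" * 80},       # long query: flagged
    {"id": 2, "type": "dns", "query": "example.com"},  # normal: ignored
]
print(DETECTION_RULES["dns_exfiltration"](events))  # → [1]
```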
```
python evaluate.py
```

Read the output. The key line is:

```
>>> F1_SCORE=0.XXXXXX <<<
```
If F1 improved:
```
python evaluate.py --commit
```

This auto-commits `detect.py` to the current git branch.
If F1 did not improve, discard your changes and try a different approach.
Go back to step 2. Try to beat your best score. Each experiment cycle should take 1-2 minutes.
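If you script the loop, you can pull the score straight out of the evaluator's output. This sketch only assumes the `>>> F1_SCORE=... <<<` line format shown above:

```python
import re

F1_LINE = re.compile(r">>> F1_SCORE=([0-9.]+) <<<")

def parse_f1(output: str):
    """Extract the F1 score from evaluate.py output, or None if absent."""
    m = F1_LINE.search(output)
    return float(m.group(1)) if m else None

print(parse_f1("some logs\n>>> F1_SCORE=0.874321 <<<\nmore logs"))  # → 0.874321
```

Pair it with `subprocess.run(["python", "evaluate.py"], capture_output=True, text=True)` and you can compare each run's score against your best before deciding whether to commit.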
- Only modify `detect.py` — do not touch `prepare.py`, `evaluate.py`, or the data files
- Optimize against simulated data by default — use `python prepare.py` and `python evaluate.py` as the main loop
- Keep it pure Python — no external dependencies, no network calls, no GPU
- Every change must be testable — run `python evaluate.py` after every change
- Git discipline — only commit improvements, never commit regressions
Here are ideas roughly ordered from easy wins to deeper changes:
- Threshold tuning — The current thresholds are rough guesses. Try values ±50%
- Pattern matching — The PowerShell detector could match more payloads
- Cross-correlation — If an IP does brute force AND lateral movement, it's more suspicious
- New rules — Look at the event data structure and find signals the current rules miss
- Statistical methods — Use the AnomalyDetector class to flag outlier metrics
- Ensemble approach — Combine multiple weak signals into a strong detection
- Event context — Consider sequences of events, not just individual events
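The cross-correlation and ensemble ideas above can be sketched as a simple voting scheme: an IP flagged by two or more independent rules is treated as high-confidence. The input shape (`rule name → set of flagged IPs`) is an illustrative assumption, not the actual `detect.py` interface:

```python
from collections import defaultdict

def correlate_by_ip(rule_hits, min_votes=2):
    """rule_hits: {rule_name: set of source IPs that rule flagged}.
    Returns IPs flagged by at least min_votes independent rules."""
    votes = defaultdict(int)
    for ips in rule_hits.values():
        for ip in ips:
            votes[ip] += 1
    return {ip for ip, n in votes.items() if n >= min_votes}

hits = {
    "brute_force": {"10.0.0.5", "10.0.0.9"},
    "lateral_movement": {"10.0.0.5"},
}
print(correlate_by_ip(hits))  # → {'10.0.0.5'}
```

Voting like this tends to raise precision (each extra required vote filters noise) at some cost to recall, so tune `min_votes` against the verbose per-rule breakdown.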
You are part of a research community of agents, not working alone. Follow this protocol:
Read what other agents have discovered:
```
python findings.py --summary    # What's been tried, what worked, what failed
python swarm.py --leaderboard   # Best scores across all branches
```

Key rules:
- If a direction has been explored, try something different
- Read the `--summary` for failed ideas so you don't repeat mistakes
- Build on insights from the best-performing agents
Check .direction in the repo root for your assigned research focus:
```
cat .direction
```

This tells you what area to explore. Stay focused on your direction — other agents handle other areas.
Publish your findings so other agents can learn from you:
```python
from findings import create_finding, publish_finding

finding = create_finding(
    agent_id="your-id",          # From .direction
    branch="your-branch",        # Current git branch
    direction="your-direction",  # From .direction
    f1_score=0.XXXX,             # Your best score
    approach="What you tried",
    changes=["Change 1", "Change 2"],
    insights=["What you learned"],
    failed_ideas=["What didn't work"],
)
publish_finding(finding)
```

- The dataset is deterministic (seeded RNG). Same code always gives the same score.
- There are ~2000 events with a realistic benign-to-malicious ratio.
- Attack types include: brute force, DNS exfiltration, C2 beaconing, lateral movement, PowerShell abuse, and privilege escalation.
- A perfect detector would get F1=1.0. In practice, aim for F1 > 0.90.
- Check `python evaluate.py --baseline` to see your best score so far.