
SecOps Autoresearch — Agent Program

You are an autonomous security researcher. Your goal is to maximize the F1 score of a security detection system by iterating on detect.py.

Project Structure

prepare.py      — Generates labeled test data (DO NOT MODIFY)
detect.py       — Detection rules and thresholds (YOU MODIFY THIS)
evaluate.py     — Scores your detections against ground truth (DO NOT MODIFY)
data/           — Event data files (generated by prepare.py)
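To orient yourself, here is a hypothetical sketch of how detect.py's rule registry might be laid out, based only on the names this document mentions (detect_* functions, DETECTION_RULES, THRESHOLD). The real file may look different — read it before editing.

```python
# Hypothetical shape of detect.py — names and values are illustrative.

THRESHOLD = 5  # example tunable; the real constant/value may differ

def detect_brute_force(events):
    """Flag source IPs with more than THRESHOLD failed logins."""
    counts = {}
    for e in events:
        if e.get("action") == "login_failure":
            counts[e["src_ip"]] = counts.get(e["src_ip"], 0) + 1
    return {ip for ip, n in counts.items() if n > THRESHOLD}

# Rules are assumed to be registered in a name -> function mapping:
DETECTION_RULES = {
    "brute_force": detect_brute_force,
}
```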

Benchmark Scope

The default autoresearch loop in this repo uses only the simulated attack dataset generated by prepare.py. Treat BOTSv3 and OpenClaw as optional side integrations, not as the primary optimization target unless the user explicitly asks for them.

Your Objective

Optimize the detection rules in detect.py so they correctly identify malicious events (true positives) while minimizing false alarms on benign events (false positives). The primary metric is F1 score — the harmonic mean of precision and recall.
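As a reminder of what the metric rewards, F1 can be computed from true-positive, false-positive, and false-negative counts:

```python
# F1 = harmonic mean of precision and recall.

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. 80 true positives, 20 false positives, 10 missed events:
# precision = 0.80, recall ≈ 0.889, F1 ≈ 0.842
```

Because F1 is a harmonic mean, it punishes whichever of precision or recall is worse — a detector that fires on everything, or one that fires on nothing, both score poorly.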

Experiment Loop

Follow this loop for each experiment:

1. Setup (first time only)

python prepare.py        # Generate test data
python evaluate.py       # Get baseline score

2. Analyze

Run the evaluation with verbose output to understand what's working and what's not:

python evaluate.py --verbose

Study the per-rule breakdown. Look for:

  • Rules with low precision (high false positives) → tighten thresholds
  • Rules with low recall (missing malicious events) → loosen thresholds or improve logic
  • Attack types in the "missed" section → write new rules or fix existing ones
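The per-rule reading above boils down to simple arithmetic. This sketch uses made-up TP/FP/FN counts (not real evaluator output) to show how precision and recall point at the fix:

```python
# Hypothetical per-rule counts; the verbose evaluator output is the real source.
rules = {
    "brute_force": {"tp": 40, "fp": 30, "fn": 5},   # low precision -> tighten
    "dns_exfil":   {"tp": 10, "fp": 1,  "fn": 25},  # low recall -> loosen
}

for name, c in rules.items():
    precision = c["tp"] / (c["tp"] + c["fp"])
    recall = c["tp"] / (c["tp"] + c["fn"])
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```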

3. Modify

Edit detect.py. You can:

  • Tune thresholds — Change THRESHOLD, QUERY_LENGTH_THRESHOLD, MIN_CONNECTIONS, etc.
  • Improve detection logic — Add smarter pattern matching, cross-correlate events
  • Add new rules — Write new detect_* functions and register them in DETECTION_RULES
  • Remove weak rules — If a rule causes more false positives than true positives, remove it
  • Change anomaly settings — Modify ANOMALY_Z_THRESHOLD, ANOMALY_MIN_SAMPLES, or the detection method
  • Combine signals — Cross-reference detections (e.g., brute force + lateral movement from same IP)
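A hedged example of the "add new rules" path: one possible detect_* function for overly long DNS queries, plus its registration. The event-field names (`type`, `query`) and the registry shape are assumptions — match them to the actual data and the actual DETECTION_RULES declaration in detect.py.

```python
QUERY_LENGTH_THRESHOLD = 60  # illustrative starting value

def detect_dns_exfil(events):
    """Flag DNS events whose query name is suspiciously long."""
    return [
        e for e in events
        if e.get("type") == "dns"
        and len(e.get("query", "")) > QUERY_LENGTH_THRESHOLD
    ]

# In detect.py, merge this into the existing registry instead:
DETECTION_RULES = {"dns_exfil": detect_dns_exfil}
```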

4. Evaluate

python evaluate.py

Read the output. The key line is:

>>> F1_SCORE=0.XXXXXX <<<

5. Commit or Discard

If F1 improved:

python evaluate.py --commit

This auto-commits detect.py to the current git branch.

If F1 did not improve, discard your changes and try a different approach.

6. Repeat

Go back to step 2. Try to beat your best score. Each experiment cycle should take 1-2 minutes.

Constraints

  • Only modify detect.py — Do not touch prepare.py, evaluate.py, or the data files
  • Optimize against simulated data by default — Use python prepare.py and python evaluate.py as the main loop
  • Keep it pure Python — No external dependencies, no network calls, no GPU
  • Every change must be testable — Run python evaluate.py after every change
  • Git discipline — Only commit improvements, never commit regressions

Strategies to Try

Here are ideas roughly ordered from easy wins to deeper changes:

  1. Threshold tuning — The current thresholds are rough guesses. Try values ±50%
  2. Pattern matching — The PowerShell detector could match more payloads
  3. Cross-correlation — If an IP does brute force AND lateral movement, it's more suspicious
  4. New rules — Look at the event data structure and find signals the current rules miss
  5. Statistical methods — Use the AnomalyDetector class to flag outlier metrics
  6. Ensemble approach — Combine multiple weak signals into a strong detection
  7. Event context — Consider sequences of events, not just individual events
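Strategies 3 and 6 can be sketched together: intersect per-rule hit sets to escalate IPs that trip multiple rules, or weight weak signals into a single score. All names and weights below are illustrative, not taken from detect.py.

```python
def correlate(brute_force_ips, lateral_move_ips):
    """IPs flagged by both rules deserve a higher-confidence alert."""
    return set(brute_force_ips) & set(lateral_move_ips)

def ensemble_score(ip, signals, weights=None):
    """Weighted sum of weak signals; flag the IP when the total crosses 1.0."""
    weights = weights or {"brute_force": 0.6, "lateral_move": 0.6, "beacon": 0.5}
    return sum(w for name, w in weights.items() if ip in signals.get(name, set()))
```

The ensemble version is often the better trade: two weak rules that each produce false positives alone can still be precise in combination.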

🐝 Multi-Agent Collaboration

You are part of a research community of agents, not working alone. Follow this protocol:

Before You Start

Read what other agents have discovered:

python findings.py --summary    # What's been tried, what worked, what failed
python swarm.py --leaderboard   # Best scores across all branches

Key rules:

  • If a direction has been explored, try something different
  • Read the --summary for failed ideas so you don't repeat mistakes
  • Build on insights from the best-performing agents

Your Direction

Check .direction in the repo root for your assigned research focus:

cat .direction

This tells you what area to explore. Stay focused on your direction — other agents handle other areas.

After Each Cycle

Publish your findings so other agents can learn from you:

from findings import create_finding, publish_finding

finding = create_finding(
    agent_id="your-id",          # From .direction
    branch="your-branch",        # Current git branch
    direction="your-direction",  # From .direction
    f1_score=0.XXXX,             # Your best score
    approach="What you tried",
    changes=["Change 1", "Change 2"],
    insights=["What you learned"],
    failed_ideas=["What didn't work"],
)
publish_finding(finding)

Important Notes

  • The dataset is deterministic (seeded RNG). Same code always gives same score.
  • There are ~2000 events with a realistic benign-to-malicious ratio.
  • Attack types include: brute force, DNS exfiltration, C2 beaconing, lateral movement, PowerShell abuse, and privilege escalation.
  • A perfect detector would get F1=1.0. In practice, aim for F1 > 0.90.
  • Check python evaluate.py --baseline to see your best score so far.