
Conversation


@epappas epappas commented Dec 23, 2025

Summary

Implements production-grade runtime security monitoring for the Basilica GPU rental marketplace using Trivy (vulnerability scanning) and Falco (runtime threat detection).

Changes

  • Trivy Operator v0.27.3: Scans all namespaces including user workloads (u-*), SBOM generation enabled
  • Falco v0.39.2: GPU-specific threat detection with 12 custom rules (crypto mining, container escape, model theft)
  • Falcosidekick v2.28.0: HA deployment with webhook automation for CRITICAL alerts
  • Network policies: Microsegmentation for security namespaces
  • Ansible playbook: Automated deployment with readiness verification

Key Features

  • User namespace vulnerability scanning
  • ML workflow false positive reduction (wandb, mlflow, dvc exclusions)
  • Automated incident response via webhook to basilica-api (a configuration sketch follows this list)
  • Modern eBPF (CO-RE) driver for K3s compatibility
  • PodDisruptionBudgets for maintenance operations
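
A minimal sketch of the CRITICAL-only webhook forwarding in the Falcosidekick container spec (env var names follow upstream Falcosidekick conventions; the exact values in this PR's manifest may differ):

    # Falcosidekick env (excerpt): forward only CRITICAL-and-above events to basilica-api
    env:
      - name: WEBHOOK_ADDRESS
        value: "http://basilica-api.basilica-system.svc.cluster.local:8080"
      - name: WEBHOOK_MINIMUMPRIORITY
        value: "critical"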

Summary by CodeRabbit

Release Notes

  • New Features
    • Added automated deployment for an integrated security stack including vulnerability scanning and runtime threat detection
    • Implemented cluster network policies to control inter-pod and cross-namespace traffic
    • Added resource quotas and limits for workload governance
    • Configured role-based access controls for security operations

* Add trivy-system and falco namespaces with security labels
* Add resource quotas for trivy-system (2 CPU/4Gi) and falco (3 CPU/8Gi)
* Add limit ranges to prevent resource exhaustion in security namespaces
* Add network policies for Trivy operator (ingress/egress restrictions)
* Add network policies for Falco daemonset and Falcosidekick
* Configure Prometheus scraping ingress and DNS/API egress rules
* Enable webhook egress from Falcosidekick to basilica-api
* Add Trivy CRDs for vulnerability/config/secret/compliance reports
* Add ClusterRole with least-privilege RBAC for Trivy operator
* Deploy Trivy operator v0.27.3 with SBOM generation enabled
* Configure scanning for all non-system namespaces (u-* included)
* Set 30m scan timeout for large ML container images
* Enable concurrent scan limit of 10 jobs
* Add startup/liveness/readiness probes for reliability
* Configure non-root security context with read-only filesystem
* Add PodDisruptionBudget for maintenance operations
* Add Falco ClusterRole with minimal RBAC for pod/event access
* Configure Falco v0.39.2 with modern eBPF (CO-RE) driver
* Deploy DaemonSet on GPU nodes only (nodeSelector: basilica.ai/node-type=gpu)
* Add 12 GPU-specific detection rules for multi-tenant security (an example rule is sketched after this list):
  - Crypto mining detection (xmrig, stratum patterns)
  - Container escape attempts (nsenter, unshare)
  - GPU driver tampering (/dev/nvidia* writes)
  - Model theft detection (GPU memory reads)
  - Privilege escalation (sudo/setuid)
  - Data exfiltration (cloud storage uploads)
  - Reverse shell patterns
* Add ML workflow exclusions to reduce false positives (wandb, mlflow, dvc)
* Deploy Falcosidekick v2.28.0 with HA (2 replicas, anti-affinity)
* Configure webhook automation for CRITICAL alerts to basilica-api
* Add startup/liveness/readiness probes for reliability
* Add PodDisruptionBudgets for maintenance operations
* Add deploy-security.yml playbook for Trivy and Falco deployment
* Implement proper deployment ordering (namespaces -> CRDs -> RBAC -> workloads)
* Add wait conditions for Deployment and DaemonSet readiness (300s timeout)
* Verify Falco DaemonSet coverage matches GPU node count
* Include rollout status verification for all components
* Configure kubectl context from inventory variables
* Add deployment summary with status reporting
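
For illustration, a minimal sketch of what one of the GPU detection rules could look like in Falco rule syntax; the list contents and output format are assumptions based on the patterns named above, not the exact contents of rules-gpu.yaml:

    - list: crypto_miner_binaries
      items: [xmrig, ethminer, t-rex]

    - rule: Crypto Mining in GPU Workload
      desc: Known miner binary or stratum pool URL spawned inside a user (u-*) namespace
      condition: >
        spawned_process and container and
        (proc.name in (crypto_miner_binaries) or
         proc.cmdline contains "stratum+tcp") and
        k8s.ns.name startswith "u-"
      output: >
        Possible crypto mining in GPU workload
        (command=%proc.cmdline container=%container.id ns=%k8s.ns.name)
      priority: CRITICAL
      tags: [gpu, cryptomining]
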
@epappas epappas self-assigned this Dec 23, 2025

coderabbitai bot commented Dec 23, 2025

Walkthrough

This PR introduces a comprehensive security stack deployment for Kubernetes clusters, adding an Ansible playbook for orchestration, network policies, resource quotas, Falco DaemonSet with Falcosidekick, GPU-specific security rules, and Trivy operator components with complete RBAC configurations.

Changes

  • Ansible Orchestration (orchestrator/ansible/playbooks/deploy-security.yml): Playbook automating sequential deployment of Trivy and Falco to K3s clusters with cluster health verification, namespace/RBAC setup, operator/DaemonSet deployment, readiness checks, and status reporting. Includes error handling for cluster readiness and rollout timeouts.
  • Common Security Governance (orchestrator/k8s/security/common/network-policies.yaml, orchestrator/k8s/security/common/resource-quotas.yaml): NetworkPolicy resources restricting ingress/egress for trivy-operator, falco, and falcosidekick pods across namespaces with Prometheus, DNS, API server, and registry allowlisting. ResourceQuota and LimitRange configurations for CPU, memory, and pod limits in trivy-system and falco namespaces.
  • Falco Security Deployment (orchestrator/k8s/security/falco/namespace.yaml, rbac.yaml, configmap.yaml, daemonset.yaml, falcosidekick-deployment.yaml, rules-gpu.yaml): Namespace, ServiceAccount, and ClusterRole/Binding for Falco with permissions across core/apps/basilica.ai resources. ConfigMap with Falco engine config (modern_ebpf, gRPC, webserver, HTTP output to Falcosidekick). DaemonSet with GPU nodeSelector, privileged hostNetwork/hostPID, health probes, and multiple volume mounts. Falcosidekick Deployment (2 replicas, pod anti-affinity, non-root security context, webhook integration). ConfigMap with 10+ GPU rental security rules detecting crypto mining, container escape, driver tampering, privilege escalation, and data exfiltration.
  • Trivy Operator Deployment (orchestrator/k8s/security/trivy/namespace.yaml, rbac.yaml, operator-deployment.yaml): Namespace with restricted pod-security labels. RBAC ClusterRole/Binding and Role/Binding granting trivy-operator access to pods, nodes, secrets, jobs, custom Aqua/Security reports, and basilica.ai resources. Deployment with non-root securityContext, Prometheus scraping, vulnerability scanning config, health probes, resource constraints, and exclusion rules for namespaces.

Sequence Diagram(s)

sequenceDiagram
    participant Ansible
    participant K3sCluster
    participant TrivyOp as Trivy Operator
    participant FalcoDaemon as Falco DaemonSet
    participant FSK as Falcosidekick
    
    Ansible->>K3sCluster: Verify cluster health & nodes
    alt Cluster NotReady
        Ansible--xK3sCluster: Fail
    end
    
    Ansible->>K3sCluster: Create trivy-system namespace
    Ansible->>K3sCluster: Apply trivy RBAC
    Ansible->>K3sCluster: Deploy Trivy Operator
    K3sCluster->>TrivyOp: Pod startup
    Ansible->>K3sCluster: Wait for operator readiness
    Ansible->>K3sCluster: Verify Trivy Operator status
    
    Ansible->>K3sCluster: Create falco namespace
    Ansible->>K3sCluster: Apply falco RBAC
    Ansible->>K3sCluster: Apply Falco ConfigMap & GPU rules
    Ansible->>K3sCluster: Deploy Falco DaemonSet
    K3sCluster->>FalcoDaemon: DaemonSet rollout (all nodes)
    
    Ansible->>K3sCluster: Deploy Falcosidekick
    K3sCluster->>FSK: Pod startup (2 replicas)
    Ansible->>K3sCluster: Wait for Falcosidekick readiness
    
    FalcoDaemon->>FSK: Forward security events (port 2801)
    FSK->>Ansible: Webhook integration ready
    
    Ansible->>K3sCluster: Apply network policies & quotas
    Ansible->>K3sCluster: Fetch & verify pod status
    Ansible->>Ansible: Log success & next steps

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hop hop, a security stack so grand!
Trivy and Falco, now hand in hand,
GPU rules guarding, network flows tight,
Manifests dancing in Kubernetes light!
Basilica's fortified, all systems aligned—
One playbook, infinite defenses designed! 🔐

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title 'feat(k3s) Add automated security scanners' directly and accurately summarizes the main change: implementing automated security scanning infrastructure (Trivy + Falco) for the K3s cluster.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (6)
orchestrator/k8s/security/falco/daemonset.yaml (1)

168-177: Reconsider maxUnavailable: 3 for security DaemonSet.

Allowing 3 Falco pods to be unavailable simultaneously means ~15% of GPU nodes (assuming 20 nodes) could lack runtime security monitoring during maintenance operations.

For a security-critical component, consider reducing to maxUnavailable: 1 or using minAvailable to ensure broader coverage:

🔎 Proposed adjustment
 apiVersion: policy/v1
 kind: PodDisruptionBudget
 metadata:
   name: falco
   namespace: falco
 spec:
-  maxUnavailable: 3
+  maxUnavailable: 1
   selector:
     matchLabels:
       app: falco

Or alternatively, ensure at least 90% coverage:

 apiVersion: policy/v1
 kind: PodDisruptionBudget
 metadata:
   name: falco
   namespace: falco
 spec:
-  maxUnavailable: 3
+  minAvailable: "90%"
   selector:
     matchLabels:
       app: falco
orchestrator/k8s/security/trivy/operator-deployment.yaml (2)

62-65: Validate scan performance parameters for cluster scale.

With 30-minute scan timeout and 10 concurrent scans, ensure these values align with your cluster workload density and available resources.

Consider:

  • 10 concurrent scans × 512Mi memory = ~5Gi memory usage from scan jobs
  • For clusters with >100 workloads, scans may queue significantly
  • ResourceQuota limits might be reached with concurrent scans

Would you like me to generate a script to estimate scan job resource requirements based on typical workload counts?
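
For reference, these two parameters map onto the upstream trivy-operator environment variables roughly as follows (a sketch; the manifest in this PR may set them differently):

    env:
      - name: OPERATOR_SCAN_JOB_TIMEOUT           # generous timeout for large ML images
        value: "30m"
      - name: OPERATOR_CONCURRENT_SCAN_JOBS_LIMIT # caps parallel scan jobs
        value: "10"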


162-171: PDB with minAvailable: 0 provides no availability guarantee.

Setting minAvailable: 0 means the Trivy operator can be completely unavailable during disruptions, which may delay vulnerability scanning.

For a single-replica deployment, consider removing the PDB entirely (Kubernetes won't prevent disruption anyway) or document why zero availability during maintenance is acceptable:

🔎 Alternative approaches

Option 1: Remove PDB (single replica can't guarantee availability):

# Remove this PodDisruptionBudget - single replica provides no HA benefit

Option 2: Document the intent:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: trivy-operator
  namespace: trivy-system
  annotations:
    description: "Allows complete unavailability during maintenance. Scans will resume after disruption."
spec:
  minAvailable: 0
  selector:
    matchLabels:
      app: trivy-operator

Option 3: If high availability is needed, increase replicas and set appropriate minAvailable:

# In Deployment:
spec:
  replicas: 2
# In PDB:
spec:
  minAvailable: 1
orchestrator/k8s/security/falco/falcosidekick-deployment.yaml (1)

53-58: Consider: Hardcoded webhook URL reduces flexibility.

The webhook address is hardcoded to basilica-api.basilica-system.svc.cluster.local:8080. Consider externalizing this via a ConfigMap or Secret for easier environment-specific configuration without manifest changes.

🔎 Alternative using ConfigMap reference
            - name: WEBHOOK_ADDRESS
              valueFrom:
                configMapKeyRef:
                  name: falcosidekick-config
                  key: webhook-address
orchestrator/k8s/security/common/network-policies.yaml (1)

131-143: Naming: Policy name is misleading.

The NetworkPolicy is named falcosidekick-egress but defines both Ingress and Egress policy types. Consider renaming to falcosidekick or falcosidekick-network-policy for clarity.

🔎 Suggested rename
 metadata:
-  name: falcosidekick-egress
+  name: falcosidekick
   namespace: falco
orchestrator/k8s/security/falco/rules-gpu.yaml (1)

63-83: Minor: Redundant condition in container escape detection.

The condition checks both proc.name in (container_escape_tools) (which includes nsenter and unshare) AND separately checks proc.cmdline contains "nsenter" or proc.cmdline contains "unshare". The cmdline checks are redundant since the process name check already covers these binaries.

🔎 Simplified condition
       condition: >
         spawned_process and
         container and
-        (proc.name in (container_escape_tools) or
-         proc.cmdline contains "nsenter" or
-         proc.cmdline contains "unshare") and
+        proc.name in (container_escape_tools) and
         k8s.ns.name startswith "u-"

If you intend to catch these tools invoked via shell (e.g., sh -c "nsenter ..."), keep the cmdline checks but add a comment explaining the intent.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fbcdecb and 24db0d5.

📒 Files selected for processing (13)
  • orchestrator/ansible/playbooks/deploy-security.yml
  • orchestrator/k8s/security/common/network-policies.yaml
  • orchestrator/k8s/security/common/resource-quotas.yaml
  • orchestrator/k8s/security/falco/configmap.yaml
  • orchestrator/k8s/security/falco/daemonset.yaml
  • orchestrator/k8s/security/falco/falcosidekick-deployment.yaml
  • orchestrator/k8s/security/falco/namespace.yaml
  • orchestrator/k8s/security/falco/rbac.yaml
  • orchestrator/k8s/security/falco/rules-gpu.yaml
  • orchestrator/k8s/security/trivy/crds.yaml
  • orchestrator/k8s/security/trivy/namespace.yaml
  • orchestrator/k8s/security/trivy/operator-deployment.yaml
  • orchestrator/k8s/security/trivy/rbac.yaml
🧰 Additional context used
🪛 Checkov (3.2.334)
orchestrator/k8s/security/trivy/rbac.yaml

[high] 8-50: Minimize ClusterRoles that grant control over validating or mutating admission webhook configurations

(CKV_K8S_155)

orchestrator/k8s/security/falco/daemonset.yaml

[medium] 1-152: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[high] 1-152: Container should not be privileged

(CKV_K8S_16)


[medium] 1-152: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 1-152: Containers should not share the host process ID namespace

(CKV_K8S_17)


[medium] 1-152: Containers should not share the host network namespace

(CKV_K8S_19)

🔇 Additional comments (31)
orchestrator/k8s/security/trivy/namespace.yaml (1)

1-10: LGTM! Well-configured namespace with appropriate security posture.

The namespace correctly applies restricted Pod Security Standards, which is appropriate for the Trivy operator that doesn't require privileged access. The labeling follows Kubernetes best practices.

orchestrator/k8s/security/falco/namespace.yaml (1)

1-10: LGTM! Privileged pod security is required for Falco's runtime threat detection.

The privileged Pod Security Standards are necessary for Falco to perform system-level monitoring using eBPF and access host resources. This is the correct configuration for runtime security monitoring tools.

orchestrator/k8s/security/common/resource-quotas.yaml (2)

31-73: Resource limits are well-structured with appropriate defaults.

The LimitRange configurations for both namespaces establish sensible defaults and boundaries that align with the expected workload characteristics.


16-29: Memory limit quota is adequate; no change needed.

The Falco DaemonSet specifies 1Gi memory limit per pod, and Falcosidekick specifies 512Mi per pod (2 replicas). For 20 GPU nodes, the calculation is:

  • 20 Falco pods × 1Gi = 20Gi
  • 2 Falcosidekick replicas × 512Mi = 1Gi
  • Total: 21Gi

The current limits.memory: "24Gi" provides a 3Gi buffer, which is sufficient operational headroom.

orchestrator/k8s/security/falco/daemonset.yaml (4)

26-30: LGTM! Appropriate security context for runtime monitoring.

The privileged settings (hostNetwork, hostPID, system-node-critical priority) are necessary for Falco's runtime threat detection capabilities. These allow Falco to:

  • Monitor system calls and kernel events
  • Access container runtime socket
  • Observe network traffic
  • Track process execution

The static analysis warnings about privileged access are expected false positives for security monitoring tools.


104-105: K3s-specific containerd socket path is correct.

The socket path /run/k3s/containerd/containerd.sock is specific to K3s distributions, correctly distinguishing from standard /run/containerd/containerd.sock.


63-66: CO-RE eBPF is correctly configured via engine.kind: modern_ebpf in falco.yaml.

The environment variables FALCO_BPF_PROBE: "" and SKIP_DRIVER_LOADER: "true" are legacy settings that don't control the engine type; the actual CO-RE eBPF enablement comes from the engine.kind: modern_ebpf configuration in the ConfigMap (falco.yaml, line 15). This requires:

  • Kernel 5.8+ with BTF support (/sys/kernel/btf/vmlinux)
  • Required capabilities: CAP_SYS_BPF, CAP_SYS_PERFMON, CAP_SYS_RESOURCE, CAP_SYS_PTRACE (currently granted via privileged mode)

The configuration is correct for CO-RE eBPF.


32-33: The GPU-only nodeSelector for Falco is intentional and appropriate for this architecture. Basilica is a decentralized GPU compute platform where Falco monitors GPU-specific security threats (cryptocurrency mining, container escapes, data exfiltration). The architecture does not have traditional "non-GPU worker nodes"—it consists of GPU compute nodes and control-plane nodes, both of which have monitoring coverage:

  • GPU nodes: Falco provides runtime security monitoring for GPU rental threats
  • Control-plane: Trivy operator handles vulnerability scanning
  • All nodes: Alloy DaemonSet (no nodeSelector) provides observability and metrics collection

Creating a separate DaemonSet for non-existent node types would be unnecessary. The current design provides appropriate security coverage for the platform's architecture.

Likely an incorrect or invalid review comment.

orchestrator/k8s/security/trivy/operator-deployment.yaml (2)

26-32: LGTM! Excellent security hardening.

The security context follows best practices with:

  • Non-root execution (UID/GID 1000)
  • Seccomp RuntimeDefault profile
  • Proper fsGroup for volume permissions

56-57: Verify user namespace scanning coverage.

The exclusion list omits user namespaces (matching u-* pattern), enabling vulnerability scanning of user workloads as stated in PR objectives.

This configuration correctly balances scanning user deployments while avoiding overhead on system namespaces.
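
For clarity, the scoping typically looks like this with the upstream trivy-operator environment variables (the exact exclusion list below is an assumption):

    env:
      - name: OPERATOR_TARGET_NAMESPACES   # empty string = watch all namespaces
        value: ""
      - name: OPERATOR_EXCLUDE_NAMESPACES  # system namespaces only; u-* stays in scope
        value: "kube-system,trivy-system,falco"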

orchestrator/k8s/security/falco/rbac.yaml (1)

1-40: LGTM! Appropriately scoped RBAC with least-privilege principles.

The RBAC configuration grants Falco the minimum permissions needed for runtime monitoring:

  • Read-only access to cluster resources for context enrichment
  • Event creation for security incident reporting
  • Access to custom Basilica GPU resources for domain-specific monitoring

The permissions follow the principle of least privilege without unnecessary write access to sensitive resources.

orchestrator/ansible/playbooks/deploy-security.yml (2)

106-122: Excellent DaemonSet deployment verification.

The rollout status check combined with coverage verification ensures that Falco pods are deployed to all targeted GPU nodes before proceeding. This prevents silent failures where the DaemonSet is created but pods aren't running.
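
A sketch of this verification pattern in Ansible (task names and the jsonpath are illustrative, not the playbook's exact tasks):

    - name: Wait for Falco DaemonSet rollout
      ansible.builtin.command: kubectl rollout status daemonset/falco -n falco --timeout=300s
      changed_when: false

    - name: Read Falco DaemonSet status
      ansible.builtin.command: kubectl get daemonset falco -n falco -o jsonpath={.status}
      register: falco_ds
      changed_when: false

    - name: Assert Falco covers every targeted GPU node
      ansible.builtin.assert:
        that:
          - (falco_ds.stdout | from_json).numberReady == (falco_ds.stdout | from_json).desiredNumberScheduled
        fail_msg: Falco DaemonSet is not ready on all GPU nodes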


167-177: Helpful operational summary with actionable next steps.

The summary provides clear next steps for operators to verify the deployment and access documentation. This improves the operational experience.

orchestrator/k8s/security/falco/configmap.yaml (4)

14-18: Modern eBPF (CO-RE) configuration is correct for K3s.

The modern_ebpf engine with appropriate buffer settings enables CO-RE (Compile Once - Run Everywhere) eBPF, which provides better compatibility across kernel versions without requiring driver compilation.

This aligns with the PR objective of using CO-RE driver for K3s compatibility.
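
The corresponding stanza in falco.yaml typically reads as follows (buffer values here are illustrative, not necessarily this PR's settings):

    engine:
      kind: modern_ebpf
      modern_ebpf:
        buf_size_preset: 4        # ring-buffer size preset
        cpus_for_each_buffer: 2   # one ring buffer shared per 2 CPUs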


28-31: Validate output rate limiting for your threat model.

The rate limiting is set to 1 event/second with burst of 10. This is quite restrictive and may drop alerts during high-activity periods (e.g., crypto mining detection across multiple containers).

For a GPU rental marketplace with potential crypto mining threats, consider:

  • Increasing rate to 5-10 events/second
  • Increasing max_burst to 50-100
  • Monitoring dropped events in production

The current settings prioritize preventing alert floods over capturing all security events.
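
If raised, the change is a two-line edit to the outputs section of falco.yaml (values below reflect the suggested range, not the PR's current settings):

    outputs:
      rate: 5        # events/second, up from 1
      max_burst: 50  # burst allowance, up from 10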


42-47: Syscall event drop handling is production-ready.

The configuration logs and alerts on syscall drops with appropriate rate limiting (3.33% = ~1 alert per 30 drops), preventing alert fatigue while maintaining visibility into potential data loss.


68-72: Falcosidekick integration correctly configured.

The HTTP output to falcosidekick.falco.svc.cluster.local:2801 uses cluster-local DNS for reliable communication, and the user agent correctly identifies the Falco version.

orchestrator/k8s/security/trivy/rbac.yaml (3)

16-19: Good: Least-privilege pattern for secrets access.

Restricting secret access to specific resourceNames (basilica-r2-credentials, basilica-registry-credentials) at cluster scope is the correct approach for limiting exposure while allowing the operator to scan images from private registries.


70-72: Consider: Namespace-scoped Role grants broader secret access than ClusterRole.

The ClusterRole (lines 16-19) correctly restricts secrets to specific resourceNames, but this namespace-scoped Role grants full CRUD on all secrets within trivy-system. While this is likely needed for the operator to manage its own configuration secrets, verify that this scope is intentional and document the rationale.
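
One way to tighten this, sketched with a hypothetical secret name. Note that Kubernetes cannot scope create (or list/watch) by resourceNames, so create needs its own rule:

    rules:
      - apiGroups: [""]
        resources: ["secrets"]
        verbs: ["get", "update", "delete"]
        resourceNames: ["trivy-operator-config"]  # hypothetical config secret
      - apiGroups: [""]
        resources: ["secrets"]
        verbs: ["create"]  # create cannot be restricted by resourceNames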


47-49: Acknowledge: Webhook configuration permissions flagged by static analysis.

The static analysis tool (CKV_K8S_155) correctly flags this ClusterRole for granting control over validating webhook configurations, which is a legitimate least-privilege concern. However, Trivy operator requires these permissions to manage admission control webhooks for image validation at deploy time. The update verb is essential for runtime webhook configuration management, and create is needed if the operator provisions webhook objects.

If admission control webhooks are not needed in your deployment, remove the create and update verbs to comply with the security check.

orchestrator/k8s/security/falco/falcosidekick-deployment.yaml (2)

23-28: LGTM: Robust security context configuration.

The security configuration follows Kubernetes hardening best practices:

  • Non-root user with explicit UID/GID
  • RuntimeDefault seccomp profile
  • No privilege escalation
  • Read-only root filesystem
  • All capabilities dropped

This is well-suited for a production security component.

Also applies to: 94-99


126-135: LGTM: PDB configuration appropriate for HA.

With 2 replicas and minAvailable: 1, the PDB allows rolling updates and voluntary disruptions while maintaining availability. This aligns well with the anti-affinity configuration.

orchestrator/k8s/security/common/network-policies.yaml (3)

44-63: Note: Egress rules allow any destination on specified ports.

Using to: [] (empty selector) allows traffic to any destination on the specified ports. This is appropriate for API server access (since the API server IP varies) and DNS resolution, but be aware this permits egress to any endpoint on ports 443 and 6443, not just the Kubernetes API server.

For tighter control, consider using CIDR-based selectors for the API server if the cluster IP ranges are known and stable.
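
A sketch of the tighter CIDR-pinned variant (the address below assumes K3s defaults and must be verified against the actual cluster):

    egress:
      - to:
          - ipBlock:
              cidr: 10.43.0.1/32   # K3s default kubernetes Service ClusterIP
        ports:
          - protocol: TCP
            port: 443
          - protocol: TCP
            port: 6443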


122-129: LGTM: Falco-to-Falcosidekick communication correctly scoped.

The egress rule correctly uses podSelector without namespaceSelector since both Falco and Falcosidekick are in the same falco namespace. Traffic is restricted to port 2801.


1-30: LGTM: Trivy ingress policy well-defined.

The ingress rules correctly restrict access to:

  • Prometheus scraping from basilica-system on port 8080
  • Health checks from kube-system on port 9090

Using kubernetes.io/metadata.name label selector is reliable as it's automatically set by Kubernetes.

orchestrator/k8s/security/falco/rules-gpu.yaml (6)

10-11: LGTM: Comprehensive crypto miner detection list.

The list covers major cryptocurrency mining software including xmrig, ethminer, t-rex, and other popular GPU miners. This is a solid foundation for detecting unauthorized mining in GPU rentals.


22-23: Consider: kubectl in container escape tools may cause false positives.

While kubectl can be misused for container escape, it's also commonly used for legitimate operations. Users debugging their pods or running Kubernetes-aware applications might trigger this rule. Consider whether this is intentional or if you need an allowlist for specific use cases.
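
If an allowlist is preferred, one possible shape, with hypothetical parent-process names and reusing the rule's existing container_escape_tools list:

    - list: allowed_k8s_client_parents
      items: [ci-runner, basilica-agent]   # hypothetical trusted parents

    # Amended condition: ignore tools spawned by an allowlisted parent
    condition: >
      spawned_process and container and
      proc.name in (container_escape_tools) and
      not proc.pname in (allowed_k8s_client_parents) and
      k8s.ns.name startswith "u-"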


127-146: Potential issue: Privilege escalation condition logic may be inverted.

The condition not user.name = "root" excludes alerts when the user is already root. However, detecting sudo/su usage when already root is less concerning than when a non-root user attempts it. The logic appears correct for the intent, but the condition syntax should use != for clarity:

-        not user.name = "root"
+        user.name != "root"

Also, line 134 mixes proc.cmdline contains "chmod" with proc.args contains "+s". In Falco, proc.args is typically the raw argument list while proc.cmdline is the full command string. Verify this condition works as intended.


31-39: LGTM: ML workflow exclusions for false positive reduction.

The ml_workflow_tools, ml_parent_processes, and ml_checkpoint_patterns lists are well-thought-out for reducing false positives in legitimate ML workflows. Including tools like wandb, mlflow, dvc, and parent processes like jupyter and ray aligns with the PR objectives.


251-279: LGTM: Bulk data transfer rule with appropriate exclusions.

The rule effectively balances detection of model theft with allowances for legitimate checkpointing. Excluding ml_workflow_tools, ml_parent_processes, and files containing checkpoint patterns prevents alerts on routine ML operations while still flagging suspicious transfers of model weights (.pt, .pth, .safetensors, .gguf).


281-301: LGTM: Reverse shell detection patterns.

The rule covers common reverse shell patterns including:

  • Interactive bash (bash -i)
  • Bash TCP redirects (/dev/tcp/)
  • Named pipe attacks (mkfifo)
  • Netcat with exec (nc -e)

Priority CRITICAL is appropriate for this threat level.

@epappas epappas merged commit 75a0e1c into main Dec 23, 2025
14 checks passed
@epappas epappas deleted the feat/k3s-automated-security branch December 23, 2025 15:14