| 1 | +# DocsGPT Public Threat Model |
| 2 | + |
| 3 | +**Classification:** Public |
| 4 | +**Last updated:** 2026-04-15 |
| 5 | +**Applies to:** Open-source and self-hosted DocsGPT deployments |
| 6 | + |
| 7 | +## 1) Overview |
| 8 | + |
| 9 | +DocsGPT ingests content (files/URLs/connectors), indexes it, and answers queries via LLM-backed APIs and optional tools. |
| 10 | + |
| 11 | +Core components: |
| 12 | +- Backend API (`application/`) |
| 13 | +- Workers/ingestion (`application/worker.py` and related modules) |
| 14 | +- Datastores (MongoDB/Redis/vector stores) |
| 15 | +- Frontend (`frontend/`) |
| 16 | +- Optional extensions/integrations (`extensions/`) |
| 17 | + |
| 18 | +## 2) Scope and assumptions |
| 19 | + |
| 20 | +In scope: |
| 21 | +- Application-level threats in this repository. |
| 22 | +- Local and internet-exposed self-hosted deployments. |
| 23 | + |
| 24 | +Assumptions: |
| 25 | +- Internet-facing instances enable auth and use strong secrets. |
| 26 | +- Datastores/internal services are not publicly exposed. |
| 27 | + |
| 28 | +Out of scope: |
| 29 | +- Cloud hardware/provider compromise. |
| 30 | +- Security guarantees of external LLM vendors. |
| 31 | +- Full security audits of third-party systems targeted by tools (external DBs/MCP servers/code-exec APIs). |
| 32 | + |
| 33 | +## 3) Security objectives |
| 34 | + |
| 35 | +- Protect document/conversation confidentiality. |
| 36 | +- Preserve integrity of prompts, agents, tools, and indexed data. |
| 37 | +- Maintain API/worker availability. |
| 38 | +- Enforce tenant isolation in authenticated deployments. |
| 39 | + |
| 40 | +## 4) Assets |
| 41 | + |
| 42 | +- Documents, attachments, chunks/embeddings, summaries. |
| 43 | +- Conversations, agents, workflows, prompt templates. |
| 44 | +- Secrets (JWT secret, `INTERNAL_KEY`, provider/API/OAuth credentials). |
| 45 | +- Operational capacity (worker throughput, queue depth, model quota/cost). |
| 46 | + |
| 47 | +## 5) Trust boundaries and untrusted input |
| 48 | + |
| 49 | +Trust boundaries: |
| 50 | +- Internet ↔ Frontend |
| 51 | +- Frontend ↔ Backend API |
| 52 | +- Backend ↔ Workers/internal APIs |
| 53 | +- Backend/workers ↔ Datastores |
| 54 | +- Backend ↔ External LLM/connectors/remote URLs |
| 55 | + |
| 56 | +Untrusted input includes API payloads, file uploads, remote URLs, OAuth/webhook data, retrieved content, and LLM/tool arguments. |
| 57 | + |
| 58 | +## 6) Main attack surfaces |
| 59 | + |
| 60 | +1. Auth/authz paths and sharing tokens. |
| 61 | +2. File upload + parsing pipeline. |
| 62 | +3. Remote URL fetching and connectors (SSRF risk). |
| 63 | +4. Agent/tool execution from LLM output. |
| 64 | +5. Template/workflow rendering. |
| 65 | +6. Frontend rendering + token storage. |
| 66 | +7. Internal service endpoints (`INTERNAL_KEY`). |
| 67 | +8. High-impact integrations (SQL tool, generic API tool, remote MCP tools). |
| 68 | + |
| 69 | +## 7) Key threats and expected mitigations |
| 70 | + |
| 71 | +### A. Auth/authz misconfiguration |
| 71 | +- Threat: weak/no auth or leaked tokens lead to broad data access. |
| 73 | +- Mitigations: require auth for public deployments, short-lived tokens, rotation/revocation, least-privilege sharing. |
| 74 | + |
| 75 | +### B. Untrusted file ingestion |
| 76 | +- Threat: malicious files/archives trigger traversal, parser exploits, or resource exhaustion. |
| 77 | +- Mitigations: strict path checks, archive safeguards, file limits, patched parser dependencies. |
| 78 | + |
| 79 | +### C. SSRF/outbound abuse |
| 80 | +- Threat: URL loaders/tools access private/internal/metadata endpoints. |
| 81 | +- Mitigations: validate URLs + redirects, block private/link-local ranges, apply egress controls/allowlists. |
| 82 | + |
| 83 | +### D. Prompt injection + tool abuse |
| 84 | +- Threat: retrieved text manipulates model behavior and causes unsafe tool calls. |
| 84 | +- Principle: never rely on the model to "choose correctly" under adversarial input; enforce limits outside the model. |
| 86 | +- Mitigations: treat retrieved/model output as untrusted, enforce tool policies, only expose tools explicitly assigned by the user/admin to that agent, separate system instructions from retrieved content, audit tool calls. |
| 87 | + |
| 88 | +### E. Dangerous tool capability chaining (SQL/API/MCP) |
| 89 | +- Threat: write-capable SQL credentials allow destructive queries. |
| 90 | +- Threat: API tool can trigger side effects (infra/payment/webhook/code-exec endpoints). |
| 91 | +- Threat: remote MCP tools may expose privileged operations. |
| 92 | +- Mitigations: read-only-by-default credentials, destination allowlists, explicit approval for write/exec actions, per-tool policy enforcement + logging. |
| 93 | + |
| 94 | +### F. Frontend/XSS + token theft |
| 94 | +- Threat: XSS can steal tokens held in browser storage and call APIs on the user's behalf. |
| 96 | +- Mitigations: reduce unsafe rendering paths, strong CSP, scoped short-lived credentials. |
| 97 | + |
| 98 | +### G. Internal endpoint exposure |
| 99 | +- Threat: weak/unset `INTERNAL_KEY` enables internal API abuse. |
| 100 | +- Mitigations: fail closed, require strong random keys, keep internal APIs private. |
| 101 | + |
| 102 | +### H. DoS and cost abuse |
| 102 | +- Threat: request floods, large ingestion jobs, and expensive prompts/crawls exhaust worker capacity or model quota. |
| 104 | +- Mitigations: rate limits, quotas, timeouts, queue backpressure, usage budgets. |
| 105 | + |
| 106 | +## 8) Example attacker stories |
| 107 | + |
| 107 | +- Internet-exposed deployment runs with weak/no auth and is subject to unauthorized data access and abuse. |
| 109 | +- Intranet deployment intentionally using weak/no auth is vulnerable to insider misuse and lateral-movement abuse. |
| 110 | +- Crafted archive attempts path traversal during extraction. |
| 111 | +- Malicious URL/redirect chain targets internal services. |
| 112 | +- Poisoned document causes data exfiltration through tool calls. |
| 113 | +- Over-privileged SQL/API/MCP tool performs destructive side effects. |
| 114 | + |
| 115 | +## 9) Severity calibration |
| 116 | + |
| 117 | +- **Critical:** unauthenticated public data access; prompt-injection-driven exfiltration; SSRF to sensitive internal endpoints. |
| 117 | +- **High:** cross-tenant leakage; persistent token compromise; over-privileged destructive tools. |
| 119 | +- **Medium:** DoS/cost amplification and non-critical information disclosure. |
| 120 | +- **Low:** minor hardening gaps with limited impact. |
| 121 | + |
| 122 | +## 10) Baseline controls for public deployments |
| 123 | + |
| 124 | +1. Enforce authentication and secure defaults. |
| 124 | +2. Set/rotate strong secrets (JWT secret, `INTERNAL_KEY`, encryption keys). |
| 125 | +3. Restrict CORS and place the API behind a hardened reverse proxy. |
| 127 | +4. Add rate limiting/quotas for answer/upload/crawl/token endpoints. |
| 128 | +5. Enforce URL+redirect SSRF protections and egress restrictions. |
| 129 | +6. Apply upload/archive/parsing hardening. |
| 130 | +7. Require least-privilege tool credentials and auditable tool execution. |
| 131 | +8. Monitor auth failures, tool anomalies, ingestion spikes, and cost anomalies. |
| 132 | +9. Keep dependencies/images patched and scanned. |
| 133 | +10. Validate multi-tenant isolation with explicit tests. |
| 134 | + |
| 135 | +## 11) Maintenance |
| 136 | + |
| 137 | +Review this model after major auth, ingestion, connector, tool, or workflow changes. |
| 138 | + |
| 139 | +## References |
| 140 | + |
| 141 | +- [OWASP Top 10 for LLM Applications](https://owasp.org/www-project-top-10-for-large-language-model-applications/) |
| 142 | +- [OWASP ASVS](https://owasp.org/www-project-application-security-verification-standard/) |
| 143 | +- [STRIDE overview](https://learn.microsoft.com/azure/security/develop/threat-modeling-tool-threats) |
| 144 | +- [DocsGPT SECURITY.md](../SECURITY.md) |