Add comprehensive systemd service setup instructions to README.md by Copilot · Pull Request #3 · fabriziosalmi/certmate

Copilot · 2025-06-27T12:35:11Z

Summary

Adds detailed instructions for setting up CertMate as a systemd service to the main README.md documentation, addressing the need for production deployment guidance on Linux systems.

Changes Made

Added a new section "🔧 Setting Up CertMate as a Systemd Service" that includes:

📝 Complete Service Configuration

Example certmate.service file with proper systemd configuration
Uses the exact service file format requested in the issue

🚀 Step-by-Step Setup Instructions

User Creation - Creating a dedicated certmate system user for security
Application Deployment - Moving CertMate to /opt/certmate
Permissions Setup - Setting proper ownership and permissions
Service File Creation - Creating the systemd service file
Environment Configuration - Setting up API bearer token and other environment variables
Service Management - Enabling, starting, and checking service status

✅ Additional Benefits

Explains the advantages of using systemd (auto-start, auto-restart, monitoring, logging)
Provides common service management commands for ongoing operations
Includes security tips for environment variable management

Example Service Configuration

[Unit]
Description=Certmate Service
After=network.target

[Service]
User=certmate
Group=certmate
WorkingDirectory=/opt/certmate
ExecStart=/usr/bin/python3 /opt/certmate/app.py
Restart=always
Environment="API_BEARER_TOKEN=your_secret_token"

[Install]
WantedBy=multi-user.target

Why This Change

This addresses a common deployment scenario where users need to run CertMate as a production service on Linux systems. The systemd integration provides:

Reliability - Automatic restart on failure
Boot Integration - Starts automatically when system boots
Process Management - Standard systemctl commands for management
Security - Runs under dedicated user account
Logging - Integrated with systemd journal

Impact

Zero breaking changes - Only adds new documentation
Follows existing style - Matches README.md formatting and emoji usage
Comprehensive coverage - Addresses all aspects of systemd deployment
Production ready - Provides robust service configuration

This makes CertMate easier to deploy in production environments while maintaining the same high-quality documentation standards established in the project.

💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

* docs: neutralize DNS provider counts in README and docs/ README.md and six pages under docs/ cited DNS provider counts as 22, 22+, 23, or 24, all pointing to the same canonical table in docs/dns-providers.md. Replace the explicit numbers in prose with neutral wording so the docs do not need to be touched every time a provider is added or removed. The exact list remains in the table on docs/dns-providers.md, which is the single source of truth. Matches the wiki cleanup pushed in the certmate.wiki repo. * fix(certificates): lock metadata RMW in record_*_deployment_status `record_backend_deployment_status` and `record_browser_deployment_status` both did `_load_metadata` → mutate → `_save_metadata` without holding the per-domain lock that `_get_domain_lock(domain)` already provides (used by cert issuance, renewal, and revocation paths in the same class). If a backend deployment-check request and a browser deployment-check request landed in two worker threads at the same time, both would load the same `deployment_status` snapshot, each would set its own key (`backend` or `browser`), each would save — and whichever saved second would silently drop the other's update. The next page render would show stale info until the next status check happened to win the race. Fix: wrap both methods' RMW in `with self._get_domain_lock(domain):`. Same lock, same per-domain granularity, no global contention. Regression test in `tests/test_metadata_concurrent_writes.py` uses a `threading.Barrier(2)` to maximise the race window and runs 50 iterations. With the lock both keys are deterministically present on every iteration; without it, the test reliably catches a lost-write within a handful of iterations. * fix(certificates): quarantine corrupt metadata.json instead of clobbering it _load_metadata used to catch Exception broadly, log a warning, and return {}. The combination of the broad catch and the empty-dict fallback meant that an unparseable metadata.json (truncated write, bit flip, manual edit gone wrong) would be returned as `{}` to callers — and the next _save_metadata would write the empty dict back to the same path. The only copy of the data was destroyed silently on first read. Split the handler: - json.JSONDecodeError → rename the file to `metadata.json.corrupt-<utc-timestamp>` before returning `{}`. The corrupted bytes survive for forensics and the next save lands on a fresh path. Logged as ERROR, not warning, so an operator scanning logs sees it. - OSError → keep the previous behaviour (warning + `{}`). The file itself is fine; a retry might succeed; quarantining would be wrong. Also collapse a divergent inline reader in get_certificate_info that was doing the same `open + json.load + except Exception` dance — it now calls _load_metadata so JSON corruption is handled consistently in one place instead of two. Regression tests in tests/test_metadata_corruption_handling.py pin: - corrupt JSON → file is quarantined and renamed - subsequent _save_metadata writes a fresh file without touching the quarantine (the whole point of the rename) - missing file → empty dict, no quarantine artefact - OSError on read → empty dict, no quarantine (file may still be valid) * fix(health): surface scheduler-setup failure reason on /health When setup_scheduler raised an exception (sqlite locked, corrupt jobstore, dependency missing), the only signals were a single ERROR line in the application log and a Python RuntimeWarning emitted to stderr. /health reduced the failure to a bare `scheduler: not_running` with no context, indistinguishable from "scheduler intentionally not configured". Operators who don't tail logs had no way to tell that automatic certificate renewal had silently stopped working — the audit list flagged this as "admin doesn't know". Capture the setup outcome on the AppContainer: container.scheduler_status = { "state": "uninitialized" | "running" | "failed", "error": str | None, "timestamp": utc-iso str | None, } Mirror it into managers['scheduler_status'] so /health can read it without poking at the container directly. /health now reports, when state=failed: checks.scheduler == "failed" checks.scheduler_error == "<exception message>" checks.scheduler_failed_at == "<utc iso>" status == "degraded" Pre-existing /health behaviour (scheduler missing / not running for non- exception reasons) is unchanged: still reports "not_running" + degraded. The new "failed" branch is strictly additional context, not a contract break for any consumer that only looked at the top-level status. Regression tests in tests/test_scheduler_status_health.py pin the three branches: failed-with-reason, running, and not-set (back-compat). * fix(deployer): close bash parameter-expansion bypass in safe_vars regex The deploy-hook safe-vars regex was `\$\{?CERTMATE_[A-Z_]+\}?` — the optional brace at the head AND tail meant it matched both `${CERTMATE_FOO` (no closing brace) and `${CERTMATE_FOO}` (closed) on equal terms. Bash parameter-expansion operators (`:-`, `:=`, `:+`, `:?`, `//`, `%`, `#`) always sit between the variable name and the closing brace, so when an operator + payload + close was appended the regex still consumed the `${CERTMATE_FOO` head and left only the harmless-looking tail: Input: ${CERTMATE_FOO:-/etc/passwd} After strip: __SAFE__:-/etc/passwd} Dangerous: no metacharacter matches the lone `:-…}` tail -> validator returned OK; at runtime bash expanded the form to /etc/passwd whenever CERTMATE_FOO was unset (which it always is — CertMate only sets a specific allowlist of CERTMATE_* names). Same trick worked with the other expansion operators: ${CERTMATE_PATH:-/etc/shadow} ${CERTMATE_FOO:+rm -rf /} ${CERTMATE_FOO//a/b} Fix the regex to require the closing brace IMMEDIATELY after the variable name. Only two forms now strip-substitute: $CERTMATE_NAME ${CERTMATE_NAME} Anything with operators between the name and the brace does NOT match safe_vars; the unchanged `\$\{` rule in _DANGEROUS_SHELL then catches it and the validator rejects with "contains dangerous shell metacharacters". Empirically verified: 10 bypass attempts (the six expansion operators, the unclosed-brace form, the colon-equals/plus/question variants, plus an unrelated $VAR) now all BLOCK. 10 legitimate $CERTMATE_* forms (documented in the Q&A discussion #180 / wiki Deploy-Hooks page) still PASS — including underscored names, concatenation, and the cp/cert-copy patterns real hooks use. Parametrized regression test in tests/test_deploy_hook_param_expansion_bypass.py pins both directions so the next refactor of safe_vars cannot reopen the hole. * chore(hardening): defensive fixes from MEDIUM audit list Three small, low-risk hardening items the MEDIUM list raised. None of them close an observed bug; each addresses a latent edge case. 1. factory.setup_scheduler — verify SQLite WAL took effect. `PRAGMA journal_mode=WAL` does NOT raise when the filesystem rejects WAL (NFS, certain network mounts, old FAT). SQLite silently falls back to the previous journal mode and concurrency drops with no signal. After issuing the PRAGMA we now read it back and log a warning if the effective mode is not WAL. Correctness was never at risk; the only change is that "scheduler feels slow" stops being a mystery. 2. deployer._run_hook — coerce timeout to int defensively. save_config already does `int(hook.get('timeout', DEFAULT_TIMEOUT))` on the write path, so under normal flow `timeout` arrives as int. But a hand-edited settings.json, an older config schema migrated forward, or a programmatically constructed hook could carry a string or None — `max(str, 1)` raises TypeError in Python 3 and would crash the renewal worker. The read path now tries `int()` with a DEFAULT_TIMEOUT fallback on (TypeError, ValueError), mirroring the write path. Parametrized regression test in tests/test_deploy_hook_timeout_coercion.py covers string '30', unparseable string, None, and float. 3. docs/installation.md — note on NFS-backed data/ directory. CertMate uses standard blocking Python I/O for everything in `data/`. On a stalled NFS server those reads can hang indefinitely; on filesystems that don't support WAL the scheduler store falls back. Document the recommended mount options (`soft,timeo=30,retrans=3`) and where to look for the WAL-fallback warning. Local disk remains the recommended baseline. These three are bundled in one commit on user direction — they are related (defense-in-depth against operational edge cases) and small enough that splitting them would inflate the audit-list churn without adding traceability. * fix(file_operations): initialise temp_file before mkstemp to avoid UnboundLocalError safe_file_write defined `temp_file = Path(_tmp_name)` only AFTER mkstemp() returned, and then referenced `temp_file.exists()` inside both `except (PermissionError, OSError)` and `except Exception` handlers as a cleanup guard. If mkstemp() itself raised — parent directory unwritable, disk full, sandbox restriction, anything that fails before _tmp_name is bound — control jumped straight to one of those except blocks and `temp_file` was never defined. The result: the original OSError was masked by an UnboundLocalError stack trace, and the operator log showed the masking error instead of the actual filesystem problem. Concretely: "cannot find local variable 'temp_file'" replaced "[Errno 28] No space left on device". Fix: initialise `temp_file = None` at the top of the function and gate the cleanup on `if temp_file is not None and temp_file.exists():` in both handlers. The original error now logs verbatim and the return is False (matching every other failure path in this method). Regression test in tests/test_safe_file_write_unbound_temp_file.py mocks mkstemp to raise OSError and asserts: - safe_file_write returns False (not raise) - the simulated OSError message appears in the log - the happy path still works unchanged * docs(installation): document BEHIND_PROXY for reverse-proxy deployments `BEHIND_PROXY=true` is supported in factory.py:347 — it enables Werkzeug's ProxyFix so request.remote_addr resolves to the original client IP via X-Forwarded-* headers. Without it, every request appears to come from the proxy's IP, which collapses per-client rate limiting into a single bucket and obscures the actual client in audit-log entries and "invalid API token attempt from X" warnings. The env var was undocumented across docs/ and README; operators running CertMate behind Nginx / Traefik / Cloudflare had no in-tree hint that the flag existed and were left with degraded rate limiting silently. Add two pieces of documentation: 1. A `BEHIND_PROXY=true` entry in the Environment Variables block of docs/installation.md, with a short justification. 2. A new "Behind a reverse proxy" subsection under Production Deployment with the rationale, when NOT to enable it (the proxy is the trust boundary — without one, X-Forwarded-For is attacker- spoofable), and the required Nginx header-forwarding snippet. No code change. * perf(settings): request-scoped cache for load_settings to kill N+1 disk reads A typical /api/certificates request was calling settings_manager.load_settings() ~100 times when the dashboard listed 50 certificates: once at the top of the handler, once inside get_certificate_info per domain, and once more inside _parse_certificate_info per domain. Each call opened settings.json, parsed JSON, ran the migration/default-fill logic, and acquired the SettingsManager RLock — for the same on-disk bytes, on every call, under the same request. The fix caches the parsed settings on `flask.g` for the duration of one HTTP request. The first call hits disk; subsequent calls return a deepcopy of the cached dict. The cache is invalidated by save_settings and atomic_update so a route that mutates settings and reads again sees the new values. Outside a Flask request context (scheduler, deploy worker, tests, startup) the cache no-ops — behaviour is byte-identical to before. deepcopy on every return preserves the pre-fix semantic that a caller can mutate the returned dict without affecting any other reader. The canonical cached version stays clean; only the returned dict is disposable. The deepcopy of ~5KB settings is ~100us; the round-trip through disk + JSON parse + migration was ~2-5ms — net win of 15-30ms per typical request, more on hosts with many certs. This closes audit findings #1 (N+1 settings reads) and #6 (15+ load_settings per endpoint), and indirectly mitigates #10 (RLock contention) by reducing the acquisition count 15-100x. Regression tests in tests/test_settings_request_scoped_cache.py pin six contracts: - within one request: 10 load_settings calls -> 1 disk read - save_settings inside the request invalidates the cache - atomic_update inside the request invalidates the cache - two distinct requests each trigger their own fresh read - outside a request context: every call hits disk (no leaked cache) - a caller that mutates the returned dict does not pollute the cache * perf(backup): single iterdir pass in create_unified_backup create_unified_backup walked cert_dir twice: once to build the list of domain names embedded in the backup metadata, then again to iterate each domain and copy its files into the zip. On local SSD the wasted N stat() calls are negligible; on NFS or spinning disk they add up proportionally to the number of certificates. Collect the directory Path objects in a single iterdir pass and reuse the list for both the metadata-build step and the zip-write step. The zip output is byte-identical to before. Audit finding #2. Regression test in tests/test_unified_backup_one_pass_iteration.py - pins iterdir-call-count to exactly 1 by patching Path.iterdir with a counter (a future maintenance edit that re-introduces the second pass will trip the test instead of slipping through review) - pins the functional contract: settings.json + backup_metadata.json present, every domain's cert files copied, non-directory artefacts at cert_dir top level correctly skipped * perf(probe): tighten TLS probe timeout to 3s with slow-probe instrumentation The deployment-status check (_probe_https_certificate) opens a raw TCP socket and finishes a TLS handshake for each domain the dashboard wants to verify. The probe runs in a Flask worker thread; the worker is blocked for the full timeout if the target is unreachable. The previous default of 5s combined with sync workers and multiple unreachable hosts could stall an operator's dashboard for tens of seconds — a single hot deployment-status refresh against 10 down hosts could saturate every worker in a 4-worker gunicorn for half a minute. Changes: 1. Default timeout lowered from 5s to 3s. Reachable hosts answer in <250ms typically, so 3s is comfortably above the legitimate range while halving the worst-case stall per probe. 2. New env var CERTMATE_TLS_PROBE_TIMEOUT_SECONDS (clamped to [1,30]) so deployments behind genuinely slow targets can raise the budget without code edits. 3. Probes that take longer than 1s now log a WARNING with the host and the elapsed time. Operators get a direct signal of which target is slow, without reproducing the dashboard stall to grep for the cause. Audit finding #3. Does NOT address the deeper architecture question ("should probes be async / done off the request thread entirely") — that's a separate, bigger change. This is the minimum hardening that caps the per-request worst case. Regression tests in tests/test_tls_probe_timeout.py pin: - the default of 3s - env var override - clamping at both extremes ([1,30]) and fallback on garbage input - a >1s probe logs the warning, a fast probe does not - timeout=None reads the env var so deployment-time tuning works * perf(rate-limit): bound login-attempt dicts so IP rotation cannot grow them forever The login-attempt rate limiter uses module-level defaultdicts keyed by client IP and by username. Per-IP attempt lists are already bounded by _LOGIN_RATE_LIMIT_IP (the limit kicks in at 5 attempts/min so each list never grows past ~5 entries). But the DICT itself acquired one entry per unique IP that ever attempted, and entries were never removed — not even after the window passed and the per-IP list was trimmed to []. A botnet rotating through millions of source IPs would grow the dict unbounded. Practical exposure on CertMate (homelab admin tool) is low, but the defense-in-depth fix is small: - Two soft caps: _MAX_TRACKED_IPS=10000, _MAX_TRACKED_USERS=10000. - _sweep_empty_buckets() drops entries whose attempt list is empty, triggered from _check_login_rate_limit when either dict crosses its cap. Non-empty buckets (active rate-limit windows) are preserved. - The sweep is O(n) and only runs when over cap, so under normal load it's a single len() comparison per call. Worst case after the fix: 10K active attackers × 5 entries × ~16 bytes of timestamp = ~800KB. Bounded. After 60s of attacker inactivity, the lists trim to [] and the next sweep removes them. Audit finding #11. Regression tests in tests/test_login_rate_limit_dict_bounded.py pin four contracts: - under cap: sweep is a no-op - over cap: empty buckets removed, active ones preserved - the sweep is actually wired into _check_login_rate_limit (a sweep that no caller invokes would be dead code) - end-to-end: rate limiting still blocks correctly after the sweep * perf(renewal): thread settings through check_renewals to skip N+1 disk reads The request-scoped settings cache (commit f0b77c3) only fires inside a Flask request context. APScheduler's renewal job runs in a background thread with no request context, so every domain in check_renewals re-triggered a full settings.json load through get_certificate_info -> _parse_certificate_info. For a 50-cert deployment that's 51 disk reads + JSON parses + migration runs in a single hourly tick — negligible at homelab scale, wasteful at any larger scale, and trivial to fix. Threading: - get_certificate_info gains an optional `settings=None` parameter - _parse_certificate_info gains the same - both fall back to load_settings() when not supplied so every existing call site (15+ across the codebase) keeps working - check_renewals loads settings ONCE at the top and passes it down on every per-domain get_certificate_info call Audit finding #12. Two existing tests in tests/test_per_cert_renewal.py were monkey- patching get_certificate_info with a one-arg lambda; they now accept the optional settings kwarg so the new call site (`settings=settings`) doesn't trip them. Regression tests in tests/test_renewal_job_settings_reuse.py pin three contracts: - check_renewals reads settings exactly ONCE regardless of domain count - get_certificate_info called without settings still reloads on its own (existing dashboard endpoint contract preserved) - a caller-supplied settings dict is honoured, not silently discarded * chore(ux): three small UX hardening items from UI audit list The UI audit's CRITICAL/HIGH list of 12 findings contained zero clear bugs after verification — `escapeHtml` is called on every interpolated domain, the Promise chains have proper catch-and-show-UI fallbacks, login CSRF is a non-issue against an admin tool with SameSite=Strict. Three items were genuine defense-in-depth though, and they are bundled here on user direction: 1. SSE retry give-up (templates/base.html, audit #1). `sse.onerror` reconnected with exponential backoff capped at 30s, but never stopped. A logged-out tab would keep firing a 401 every 30s for the lifetime of the open page. Add a `consecutiveFailures` counter; after 10 in a row (~3 minutes of exponential retries) the client gives up and writes a console.warn telling the user to refresh. `onopen` resets the counter so legitimate transient network blips still recover. 2. MutationObserver readyState guard (static/js/certmate.js, audit #2). The modal-focus observer was attached inside a `document.addEventListener('DOMContentLoaded', ...)` with no fallback. If certmate.js loads AFTER DOMContentLoaded (defer / dyn import / future hot-reload), the event already fired and the listener never runs — modals would lose their Esc / backdrop / focus-trap behaviour. Mirror the readyState pattern that CM.refreshRole already uses above: if loading, listen; otherwise attach immediately. 3. CertMate.confirm() before /api/web/cache/clear (static/js/settings.js, audit #11). `clearDeploymentCache()` POSTed straight to the server with no confirm dialog — the operation isn't destructive (next dashboard load re-probes every cert) but a misclick still wastes a probe round and flickers every cert's deployment badge. Wrap in the same CertMate.confirm pattern that the dashboard's invalidateAllCache already uses for consistency. The other 9 findings from the audit were all false positives / YAGNI: domain XSS (escapeHtml present), Promise rejection swallowing (nested catch present), Promise.all without catch (every batchPromise has its own catch so Promise.all cannot reject), login CSRF (no session cookie yet so middleware skips by design), innerHTML mass reflow (single-batched is already optimal), no-debounce on cmd palette (in-memory filter with slice(0,10), sub-ms), CSV race (theoretical single-picker, no practical risk), loading panel stuck (openCertDetail has no async fetch — data is in-memory), token format check (server-side is the real layer). No tests added: each change is either UX-only or guards a future-proof loading-order edge case that has no current trigger. * fix(tests): rewrite stale UI test assertions against v2.5.x templates tests/test_ui.py was last updated in v2.4.10 (commit 834dd9b). v2.5.0/ v2.5.1 then shipped 51 template rewrites — including the help-page overhaul (RELEASE_NOTES.md `fix(help): rewrite for user help, drop marketing`) and the dashboard toggle that hides the create form behind #toggleCreateForm (QW-15). Four UI tests asserted strings/elements that no longer exist in the shipping UI and have been failing on main since v2.5.0 was merged. They're caught now because this branch runs the full e2e suite before the release PR. The four fixes: 1. test_help_navigation: asserted `text=Getting Started` which is from the pre-v2.5.0 quick-links grid. The v2.5.0 rewrite replaced the grid with a section-nav strip; the first content section is now "Quick Start". Match the current `nav[aria-label='Help sections']` plus the Quick Start h3. 2. test_create_certificate_ui: tried `#domain.fill(...)` directly, but the create form has been hidden behind #toggleCreateForm since QW-15. Additionally, the first-run setup-wizard overlay (created by static/js/setup-wizard.js when setup_completed is False, which is always the case in a freshly-started test container) intercepts pointer events on the whole viewport — every subsequent click was timing out. Fix: remove the wizard overlay via page.evaluate(), then click the toggle button before filling #domain. 3. test_docker_quick_start_visible → test_quick_start_section_visible: the old "Docker Quick Start" card is gone from the rewritten help page. Re-pointed at the stable `section#quick-start` anchor + its h3 text. 4. test_first_steps_visible → test_troubleshooting_section_visible: the old "First Steps" card is gone. Re-pointed at the stable `section#troubleshooting` anchor — the diagnostic section users reach when something breaks. Each new selector is anchored to an HTML id that's stable across ordinary UI refactors (section ids are part of the in-page anchor links and the section search filter), so future help-page wording changes can land without breaking these tests again. Verified locally against the v2.5.x test container: 4/4 pass, full e2e suite 112 passed + 2 xfailed + 1 skipped, zero failures. * test(private_ca): 34 unit tests from 0% to deep coverage modules/core/private_ca.PrivateCAGenerator is the root of trust for every client certificate CertMate issues. A regression here invalidates every downstream cert at once — and the file was sitting at 0% unit coverage, so any silent break would only surface through e2e tests (where it'd be diagnosed as "client certs aren't validating" with no pointer at the CA). This adds 34 tests against the real `cryptography` primitives — no mocks of crypto operations, no monkey-patched key generation. Tests are grouped into seven sections; each assertion is named after the specific class of compromise it blocks rather than the line of code it touches. Coverage focus: - TestInitialization (4 tests): idempotency, the force=True backup contract, failure paths. The idempotency test specifically pins that re-initialising on an existing CA does NOT regenerate — historically a cause of "all my certs suddenly fail" outages. - TestCACertificateShape (8 tests, the most critical): RSA-4096, self- signed, basicConstraints CA=TRUE+critical, KeyUsage cert+CRL sign, ~10-year validity, CertMate subject, signature verifies against its own public key, private key file is 0o600 (POSIX only). - TestLoadedStateGuard (3 tests): sign/CRL operations must return None when CA isn't loaded — never silently succeed against a broken state. - TestSignCertificateRequest (10 tests): preserves CSR subject; issuer matches CA subject; validity matches days_valid; signature verifies against CA public key (the load-bearing assertion that proves the CA actually signs); SAN extension propagated; SubjectKeyIdentifier + AuthorityKeyIdentifier present (RFC 5280 §4.2); clientAuth EKU applied AND does not implicitly grant serverAuth (privilege- escalation guard); unknown EKU strings ignored gracefully. - TestGenerateCRL (6 tests): empty revocation list works; specified serials present in CRL byte-for-byte; CRL issuer matches CA subject; CRL signature verifies against CA public key (the only thing between revocation enforcement and a silent bypass); persisted to disk; get_crl_pem returns None pre-generation. - TestMetadata (1 test): round-trip pins serial number string-encoding (big-int can't survive JSON as a number). - TestPEMExport (2 tests): get_ca_cert_pem parseable, export_ca_cert writes valid PEM. Performance: module-scoped fixture for the 4096-bit RSA keygen means one keygen (~5s) shared across 34 tests instead of 34 × 5s. Total file runtime: ~5s. * test(csr_handler): 38 unit tests from 0% to deep coverage modules/core/csr_handler.CSRHandler is the entry-point for the client- certificate issuance flow — every malformed, oversized, or malicious CSR passes through validate_csr_pem() before PrivateCA signs it. A bug in this validator either locks out legitimate users (false reject) or lets through input that crashes the signer (false accept). The file was at 0% unit coverage so any regression would only surface as a mysterious "client cert won't issue" e2e failure. Tests group by method, with one assertion per failure class: - TestValidateCSRPem (7 tests): valid PEM accepted with CSR object returned (so callers don't re-parse); empty / None / garbage / truncated PEM rejected; CSR without CommonName explicitly rejected (otherwise the signer would mint an unidentifiable cert). One test pins the validate→sign handoff contract: validator MUST return the parsed CSR object, not force callers to re-parse the PEM. - TestGetCSRInfo (7 tests): extracts CN, organization, country, key size, signature algorithm; DNS+email SANs surfaced; missing fields default to empty string (not KeyError); no-SAN case returns []. - TestCreateCSR (13 tests): defaults round-trip through validate_csr_pem (catches any signer/validator drift); 4096-bit key path; invalid key sizes rejected (six parametrized: 512, 1024, 3072, 8192, 0, -1); empty/oversized CN rejected (RFC 5280: CN ≤64 chars); FIVE control- character CN attacks rejected (NUL byte injection, LDAP newline, CR/tab/DEL — each can corrupt downstream log parsers, certbot stderr handling, or LDAP integrations); SAN ceiling at exactly 100 (the documented limit); SAN extension persisted in produced CSR; KeyUsage marked critical AND key_cert_sign=False (a client cert that can sign other certs would be a privilege escalation). - TestPersistence (5 tests): save writes both files at the expected paths and extensions; saved private key is 0o600 (POSIX); load round- trip preserves CN; missing file returns False; garbage file returns False. All tests use real CSRs built with the `cryptography` library — no mocked crypto operations. Total runtime: ~3.3s. * test(ocsp_crl): 20 unit tests from 0% to deep coverage modules/core/ocsp_crl.{OCSPResponder, CRLManager} together are the revocation infrastructure. If either returns the wrong answer for a revoked serial, a compromised client cert keeps validating against trust stores that consult OCSP / CRL — a silent security bypass with no log line. The file was at 0% unit coverage so any regression here would only manifest as "revocation isn't working" after the fact. Test groups, each pinning a specific failure mode: - TestOCSPGetCertStatus (5 tests): the most security-critical method in the file. 'good' for active, 'revoked' with metadata for revoked, 'unknown' for missing serial (NEVER 'good' — that would let an attacker present a forged cert and have OCSP say yes), reason defaults to 'unspecified' rather than crashing, manager exception returns 'unknown' not 'good' (silent-blessing-during-outage guard). - TestOCSPGenerateResponse (4 tests): 'good' response has no revocation fields (privacy + format compliance); 'revoked' response carries revocation_time + revocation_reason; malformed input gives internal_error response without raising; the literal 'CertMate OCSP Responder' name is contract for external monitoring. - TestCRLGetRevokedSerials (4 tests): only revoked surfaced; corrupted serial strings (non-int, None) skipped instead of poisoning the whole CRL; serial 0 dropped (RFC 5280 §4.1.2.2 forbids it); DB failure returns [] not raises (empty CRL is safer than no CRL endpoint). - TestCRLLifecycle (7 tests): update_crl produces parseable PEM with and without revocations; DER round-trip through PEM preserves revoked entries; get_crl_pem auto-generates if missing (otherwise a freshly-started instance 404s forever on /crl); get_crl_info reports 'no_crl' on full failure path rather than crashing; revoked count and issuer surfaced; get_crl_der returns None (never partial bytes) when CRL unavailable. Tests share a module-scoped real PrivateCAGenerator (one keygen for all 20 tests). MagicMock for the client_cert_manager — its contract is just `list_client_certificates(revoked=...)` so faking it is faithful. Total runtime: ~0.6s. * test(storage_backends): 56 unit tests targeting cross-cutting helpers + lifecycle modules/core/storage_backends was at ~25% — the LocalFileSystemBackend got incidental coverage from other tests, but the cross-cutting helpers that ALL five backends rely on (retry decorator, transient-error heuristic, domain validation) were unverified. Cloud-SDK request paths deliberately stay out of scope here — exhaustive Azure/AWS/Vault SDK mocking adds maintenance debt without proving the CertMate-layer logic; the e2e suite + manual config tests cover that. Focus areas: - TestIsTransient (8 tests): 429/5xx marked transient (rate limits + upstream errors); 401/400 marked NON-transient (stale creds + bad requests must fail fast, not retry); boto3-style response metadata unwrapped; message keyword fallback for SDKs that don't expose status_code. The 401-not-transient assertion is the load-bearing guard against a retry loop on expired credentials. - TestWithRetry (4 tests): single attempt success, retry-then-recover, non-transient errors propagate on first attempt (no retry, no sleep), exhausted retries re-raise last exception. - TestValidateStorageDomain (12 tests): valid domains pass (5 shapes), path-traversal / null-byte / spaces / forward-slash / backslash rejected (7 shapes). Every cloud backend funnels through this — a bypass here would let one tenant write to another's secret name. - TestLocalFileSystemBackend (9 tests): full store→retrieve round-trip; private key file is 0o600 (POSIX); metadata file is 0o600; missing domain → None; list returns only domain dirs WITH cert.pem (empty dirs filtered); delete removes whole tree; delete missing → False; backend_name contract. - TestAzureSecretNameSanitisation (4 tests): only allowed chars in output; DISTINCT inputs produce distinct outputs (the CRC32 suffix is the collision guard — without it 'my-app.example.com' and 'my.app-example.com' both sanitize to the same secret name and one cert overwrites the other); 127-char Azure cap respected; deterministic. - TestCloudBackendConstructorValidation (7 tests): each cloud backend rejects empty required fields at __init__ time so misconfiguration fails fast instead of producing a half-initialised client. Tests every Azure required field (4 parametrized); AWS credentials; Vault url/token; Infisical client_id. - TestStorageManagerDispatch (5 tests): factory picks local/azure correctly; unknown backend type falls back to local (typo doesn't take down renewal); failed cloud-backend init falls back to local (misconfigured Vault doesn't crash the worker); lazy + idempotent initialization (load_settings called exactly once, same backend instance returned across calls). Total runtime: ~0.2s. * test(certificates): 12 unit tests for concurrent issuance lock + DNS alias paths Targets two specific gaps the audit list flagged in modules/core/ certificates.py: 1. Concurrent-issuance non-blocking lock (line 561). create_certificate raises RuntimeError if another thread is already issuing for the same domain. Without this guard, two simultaneous renewals would race over data/certs/<domain>/ and trigger LE rate-limit denials. Five tests: - first acquire of a fresh domain lock succeeds (sanity) - locks are per-domain, not global (otherwise unrelated cert ops would serialise needlessly) - same-domain lock reused (a fresh lock per call would defeat the non-blocking guard entirely) - second create_certificate while a lock is already held raises RuntimeError with the exact "already in progress" message - end-to-end multi-threaded race with threading.Barrier(2): two threads call create_certificate; exactly one hits the lock barrier, exactly one passes through (and is allowed to fail downstream for unrelated reasons — we only assert the barrier behaviour) 2. check_dns_alias_records DNS failure paths (line 319). The function must surface a structured status (ok / missing / mismatch / error) rather than raise or silently report 'ok'. Seven tests: - ok when CNAME target matches - mismatch when CNAME points elsewhere (the 'cert renewal silently fails because alias drifted' scenario) - missing when no CNAME at source (most common misconfig) - error when DoH resolver itself raises (Cloudflare DoH down / egress firewall blocked); status='error', message in error field - SAN domains each get their own check entry, deduplicated - overall ok only if every per-source check is ok - trailing-dot normalisation: DNS records end with '.', config usually doesn't — both forms must compare equal The third gap from the audit (renewal-job N+1 settings reads) is already covered by commit 7f7b41f's tests/test_renewal_job_settings_reuse.py. Total runtime: ~0.14s. * chore: bump version to v2.5.3 + prepend release notes 20 atomic commits across this branch — security/perf/UI/docs/coverage audit response. Honest precision summary in the release notes: 8 true positives fixed, 10 hardening items shipped, 35 false positives documented + deferred. +160 unit tests across 5 new files. e2e suite verified green (112 passed, 0 failures, real Cloudflare DNS-01 issuance + Playwright UI).

Initial plan

45d5806

Copilot AI assigned Copilot and fabriziosalmi Jun 27, 2025

Copilot started work on behalf of fabriziosalmi June 27, 2025 12:35 View session

fabriziosalmi marked this pull request as ready for review June 27, 2025 12:36

fabriziosalmi closed this Jun 27, 2025

fabriziosalmi deleted the copilot/fix-cf7831d3-fddb-430b-8739-3a4506b3d4ee branch June 27, 2025 12:37

Copilot AI changed the title ~~[WIP] Add systemd service setup instructions to README~~ Add comprehensive systemd service setup instructions to README.md Jun 27, 2025

Copilot AI requested a review from fabriziosalmi June 27, 2025 12:38

Copilot finished work on behalf of fabriziosalmi June 27, 2025 12:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add comprehensive systemd service setup instructions to README.md#3

Add comprehensive systemd service setup instructions to README.md#3
Copilot wants to merge 1 commit into
mainfrom
copilot/fix-cf7831d3-fddb-430b-8739-3a4506b3d4ee

Copilot AI commented Jun 27, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Copilot AI commented Jun 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes Made

📝 Complete Service Configuration

🚀 Step-by-Step Setup Instructions

✅ Additional Benefits

Example Service Configuration

Why This Change

Impact

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented Jun 27, 2025 •

edited

Loading