Add comprehensive systemd service setup instructions to README.md#3
Closed
Copilot wants to merge 1 commit into
Closed
Add comprehensive systemd service setup instructions to README.md#3Copilot wants to merge 1 commit into
Copilot wants to merge 1 commit into
Conversation
Copilot
AI
changed the title
[WIP] Add systemd service setup instructions to README
Add comprehensive systemd service setup instructions to README.md
Jun 27, 2025
fabriziosalmi
added a commit
that referenced
this pull request
May 16, 2026
* docs: neutralize DNS provider counts in README and docs/
README.md and six pages under docs/ cited DNS provider counts as 22, 22+, 23,
or 24, all pointing to the same canonical table in docs/dns-providers.md.
Replace the explicit numbers in prose with neutral wording so the docs do not
need to be touched every time a provider is added or removed. The exact list
remains in the table on docs/dns-providers.md, which is the single source of
truth. Matches the wiki cleanup pushed in the certmate.wiki repo.
* fix(certificates): lock metadata RMW in record_*_deployment_status
`record_backend_deployment_status` and `record_browser_deployment_status`
both did `_load_metadata` → mutate → `_save_metadata` without holding the
per-domain lock that `_get_domain_lock(domain)` already provides (used by
cert issuance, renewal, and revocation paths in the same class).
If a backend deployment-check request and a browser deployment-check
request landed in two worker threads at the same time, both would load
the same `deployment_status` snapshot, each would set its own key
(`backend` or `browser`), each would save — and whichever saved second
would silently drop the other's update. The next page render would show
stale info until the next status check happened to win the race.
Fix: wrap both methods' RMW in `with self._get_domain_lock(domain):`.
Same lock, same per-domain granularity, no global contention.
Regression test in `tests/test_metadata_concurrent_writes.py` uses a
`threading.Barrier(2)` to maximise the race window and runs 50
iterations. With the lock both keys are deterministically present on
every iteration; without it, the test reliably catches a lost-write
within a handful of iterations.
* fix(certificates): quarantine corrupt metadata.json instead of clobbering it
_load_metadata used to catch Exception broadly, log a warning, and return
{}. The combination of the broad catch and the empty-dict fallback meant
that an unparseable metadata.json (truncated write, bit flip, manual edit
gone wrong) would be returned as `{}` to callers — and the next
_save_metadata would write the empty dict back to the same path. The only
copy of the data was destroyed silently on first read.
Split the handler:
- json.JSONDecodeError → rename the file to
`metadata.json.corrupt-<utc-timestamp>` before returning `{}`. The
corrupted bytes survive for forensics and the next save lands on a
fresh path. Logged as ERROR, not warning, so an operator scanning
logs sees it.
- OSError → keep the previous behaviour (warning + `{}`). The file
itself is fine; a retry might succeed; quarantining would be wrong.
Also collapse a divergent inline reader in get_certificate_info that was
doing the same `open + json.load + except Exception` dance — it now calls
_load_metadata so JSON corruption is handled consistently in one place
instead of two.
Regression tests in tests/test_metadata_corruption_handling.py pin:
- corrupt JSON → file is quarantined and renamed
- subsequent _save_metadata writes a fresh file without touching the
quarantine (the whole point of the rename)
- missing file → empty dict, no quarantine artefact
- OSError on read → empty dict, no quarantine (file may still be valid)
* fix(health): surface scheduler-setup failure reason on /health
When setup_scheduler raised an exception (sqlite locked, corrupt jobstore,
dependency missing), the only signals were a single ERROR line in the
application log and a Python RuntimeWarning emitted to stderr. /health
reduced the failure to a bare `scheduler: not_running` with no context,
indistinguishable from "scheduler intentionally not configured". Operators
who don't tail logs had no way to tell that automatic certificate renewal
had silently stopped working — the audit list flagged this as "admin
doesn't know".
Capture the setup outcome on the AppContainer:
container.scheduler_status = {
"state": "uninitialized" | "running" | "failed",
"error": str | None,
"timestamp": utc-iso str | None,
}
Mirror it into managers['scheduler_status'] so /health can read it without
poking at the container directly. /health now reports, when state=failed:
checks.scheduler == "failed"
checks.scheduler_error == "<exception message>"
checks.scheduler_failed_at == "<utc iso>"
status == "degraded"
Pre-existing /health behaviour (scheduler missing / not running for non-
exception reasons) is unchanged: still reports "not_running" + degraded.
The new "failed" branch is strictly additional context, not a contract
break for any consumer that only looked at the top-level status.
Regression tests in tests/test_scheduler_status_health.py pin the three
branches: failed-with-reason, running, and not-set (back-compat).
* fix(deployer): close bash parameter-expansion bypass in safe_vars regex
The deploy-hook safe-vars regex was `\$\{?CERTMATE_[A-Z_]+\}?` — the
optional brace at the head AND tail meant it matched both `${CERTMATE_FOO`
(no closing brace) and `${CERTMATE_FOO}` (closed) on equal terms. Bash
parameter-expansion operators (`:-`, `:=`, `:+`, `:?`, `//`, `%`, `#`)
always sit between the variable name and the closing brace, so when an
operator + payload + close was appended the regex still consumed the
`${CERTMATE_FOO` head and left only the harmless-looking tail:
Input: ${CERTMATE_FOO:-/etc/passwd}
After strip: __SAFE__:-/etc/passwd}
Dangerous: no metacharacter matches the lone `:-…}` tail
-> validator returned OK; at runtime bash expanded the form to
/etc/passwd whenever CERTMATE_FOO was unset (which it always is —
CertMate only sets a specific allowlist of CERTMATE_* names).
Same trick worked with the other expansion operators:
${CERTMATE_PATH:-/etc/shadow}
${CERTMATE_FOO:+rm -rf /}
${CERTMATE_FOO//a/b}
Fix the regex to require the closing brace IMMEDIATELY after the variable
name. Only two forms now strip-substitute:
$CERTMATE_NAME
${CERTMATE_NAME}
Anything with operators between the name and the brace does NOT match
safe_vars; the unchanged `\$\{` rule in _DANGEROUS_SHELL then catches it
and the validator rejects with "contains dangerous shell metacharacters".
Empirically verified: 10 bypass attempts (the six expansion operators,
the unclosed-brace form, the colon-equals/plus/question variants, plus
an unrelated $VAR) now all BLOCK. 10 legitimate $CERTMATE_* forms
(documented in the Q&A discussion #180 / wiki Deploy-Hooks page) still
PASS — including underscored names, concatenation, and the cp/cert-copy
patterns real hooks use.
Parametrized regression test in
tests/test_deploy_hook_param_expansion_bypass.py pins both directions
so the next refactor of safe_vars cannot reopen the hole.
* chore(hardening): defensive fixes from MEDIUM audit list
Three small, low-risk hardening items the MEDIUM list raised. None of
them close an observed bug; each addresses a latent edge case.
1. factory.setup_scheduler — verify SQLite WAL took effect.
`PRAGMA journal_mode=WAL` does NOT raise when the filesystem rejects
WAL (NFS, certain network mounts, old FAT). SQLite silently falls back
to the previous journal mode and concurrency drops with no signal.
After issuing the PRAGMA we now read it back and log a warning if the
effective mode is not WAL. Correctness was never at risk; the only
change is that "scheduler feels slow" stops being a mystery.
2. deployer._run_hook — coerce timeout to int defensively.
save_config already does `int(hook.get('timeout', DEFAULT_TIMEOUT))`
on the write path, so under normal flow `timeout` arrives as int.
But a hand-edited settings.json, an older config schema migrated
forward, or a programmatically constructed hook could carry a string
or None — `max(str, 1)` raises TypeError in Python 3 and would crash
the renewal worker. The read path now tries `int()` with a
DEFAULT_TIMEOUT fallback on (TypeError, ValueError), mirroring the
write path. Parametrized regression test in
tests/test_deploy_hook_timeout_coercion.py covers string '30',
unparseable string, None, and float.
3. docs/installation.md — note on NFS-backed data/ directory.
CertMate uses standard blocking Python I/O for everything in `data/`.
On a stalled NFS server those reads can hang indefinitely; on
filesystems that don't support WAL the scheduler store falls back.
Document the recommended mount options (`soft,timeo=30,retrans=3`)
and where to look for the WAL-fallback warning. Local disk remains
the recommended baseline.
These three are bundled in one commit on user direction — they are
related (defense-in-depth against operational edge cases) and small
enough that splitting them would inflate the audit-list churn without
adding traceability.
* fix(file_operations): initialise temp_file before mkstemp to avoid UnboundLocalError
safe_file_write defined `temp_file = Path(_tmp_name)` only AFTER mkstemp()
returned, and then referenced `temp_file.exists()` inside both
`except (PermissionError, OSError)` and `except Exception` handlers as a
cleanup guard. If mkstemp() itself raised — parent directory unwritable,
disk full, sandbox restriction, anything that fails before _tmp_name is
bound — control jumped straight to one of those except blocks and
`temp_file` was never defined.
The result: the original OSError was masked by an UnboundLocalError stack
trace, and the operator log showed the masking error instead of the
actual filesystem problem. Concretely: "cannot find local variable
'temp_file'" replaced "[Errno 28] No space left on device".
Fix: initialise `temp_file = None` at the top of the function and gate
the cleanup on `if temp_file is not None and temp_file.exists():` in
both handlers. The original error now logs verbatim and the return is
False (matching every other failure path in this method).
Regression test in tests/test_safe_file_write_unbound_temp_file.py mocks
mkstemp to raise OSError and asserts:
- safe_file_write returns False (not raise)
- the simulated OSError message appears in the log
- the happy path still works unchanged
* docs(installation): document BEHIND_PROXY for reverse-proxy deployments
`BEHIND_PROXY=true` is supported in factory.py:347 — it enables Werkzeug's
ProxyFix so request.remote_addr resolves to the original client IP via
X-Forwarded-* headers. Without it, every request appears to come from the
proxy's IP, which collapses per-client rate limiting into a single bucket
and obscures the actual client in audit-log entries and "invalid API
token attempt from X" warnings.
The env var was undocumented across docs/ and README; operators running
CertMate behind Nginx / Traefik / Cloudflare had no in-tree hint that
the flag existed and were left with degraded rate limiting silently.
Add two pieces of documentation:
1. A `BEHIND_PROXY=true` entry in the Environment Variables block of
docs/installation.md, with a short justification.
2. A new "Behind a reverse proxy" subsection under Production
Deployment with the rationale, when NOT to enable it (the proxy is
the trust boundary — without one, X-Forwarded-For is attacker-
spoofable), and the required Nginx header-forwarding snippet.
No code change.
* perf(settings): request-scoped cache for load_settings to kill N+1 disk reads
A typical /api/certificates request was calling
settings_manager.load_settings() ~100 times when the dashboard listed 50
certificates: once at the top of the handler, once inside
get_certificate_info per domain, and once more inside
_parse_certificate_info per domain. Each call opened settings.json,
parsed JSON, ran the migration/default-fill logic, and acquired the
SettingsManager RLock — for the same on-disk bytes, on every call,
under the same request.
The fix caches the parsed settings on `flask.g` for the duration of one
HTTP request. The first call hits disk; subsequent calls return a
deepcopy of the cached dict. The cache is invalidated by save_settings
and atomic_update so a route that mutates settings and reads again sees
the new values. Outside a Flask request context (scheduler, deploy
worker, tests, startup) the cache no-ops — behaviour is byte-identical
to before.
deepcopy on every return preserves the pre-fix semantic that a caller
can mutate the returned dict without affecting any other reader. The
canonical cached version stays clean; only the returned dict is
disposable. The deepcopy of ~5KB settings is ~100us; the round-trip
through disk + JSON parse + migration was ~2-5ms — net win of 15-30ms
per typical request, more on hosts with many certs.
This closes audit findings #1 (N+1 settings reads) and #6 (15+
load_settings per endpoint), and indirectly mitigates #10 (RLock
contention) by reducing the acquisition count 15-100x.
Regression tests in tests/test_settings_request_scoped_cache.py pin six
contracts:
- within one request: 10 load_settings calls -> 1 disk read
- save_settings inside the request invalidates the cache
- atomic_update inside the request invalidates the cache
- two distinct requests each trigger their own fresh read
- outside a request context: every call hits disk (no leaked cache)
- a caller that mutates the returned dict does not pollute the cache
* perf(backup): single iterdir pass in create_unified_backup
create_unified_backup walked cert_dir twice: once to build the list of
domain names embedded in the backup metadata, then again to iterate
each domain and copy its files into the zip. On local SSD the wasted
N stat() calls are negligible; on NFS or spinning disk they add up
proportionally to the number of certificates.
Collect the directory Path objects in a single iterdir pass and reuse
the list for both the metadata-build step and the zip-write step. The
zip output is byte-identical to before.
Audit finding #2.
Regression test in tests/test_unified_backup_one_pass_iteration.py
- pins iterdir-call-count to exactly 1 by patching Path.iterdir with
a counter (a future maintenance edit that re-introduces the second
pass will trip the test instead of slipping through review)
- pins the functional contract: settings.json + backup_metadata.json
present, every domain's cert files copied, non-directory artefacts
at cert_dir top level correctly skipped
* perf(probe): tighten TLS probe timeout to 3s with slow-probe instrumentation
The deployment-status check (_probe_https_certificate) opens a raw TCP
socket and finishes a TLS handshake for each domain the dashboard wants
to verify. The probe runs in a Flask worker thread; the worker is
blocked for the full timeout if the target is unreachable.
The previous default of 5s combined with sync workers and multiple
unreachable hosts could stall an operator's dashboard for tens of
seconds — a single hot deployment-status refresh against 10 down hosts
could saturate every worker in a 4-worker gunicorn for half a minute.
Changes:
1. Default timeout lowered from 5s to 3s. Reachable hosts answer in
<250ms typically, so 3s is comfortably above the legitimate range
while halving the worst-case stall per probe.
2. New env var CERTMATE_TLS_PROBE_TIMEOUT_SECONDS (clamped to [1,30])
so deployments behind genuinely slow targets can raise the budget
without code edits.
3. Probes that take longer than 1s now log a WARNING with the host
and the elapsed time. Operators get a direct signal of which
target is slow, without reproducing the dashboard stall to grep
for the cause.
Audit finding #3. Does NOT address the deeper architecture question
("should probes be async / done off the request thread entirely") —
that's a separate, bigger change. This is the minimum hardening that
caps the per-request worst case.
Regression tests in tests/test_tls_probe_timeout.py pin:
- the default of 3s
- env var override
- clamping at both extremes ([1,30]) and fallback on garbage input
- a >1s probe logs the warning, a fast probe does not
- timeout=None reads the env var so deployment-time tuning works
* perf(rate-limit): bound login-attempt dicts so IP rotation cannot grow them forever
The login-attempt rate limiter uses module-level defaultdicts keyed by
client IP and by username. Per-IP attempt lists are already bounded by
_LOGIN_RATE_LIMIT_IP (the limit kicks in at 5 attempts/min so each list
never grows past ~5 entries). But the DICT itself acquired one entry
per unique IP that ever attempted, and entries were never removed —
not even after the window passed and the per-IP list was trimmed to [].
A botnet rotating through millions of source IPs would grow the dict
unbounded. Practical exposure on CertMate (homelab admin tool) is low,
but the defense-in-depth fix is small:
- Two soft caps: _MAX_TRACKED_IPS=10000, _MAX_TRACKED_USERS=10000.
- _sweep_empty_buckets() drops entries whose attempt list is empty,
triggered from _check_login_rate_limit when either dict crosses its
cap. Non-empty buckets (active rate-limit windows) are preserved.
- The sweep is O(n) and only runs when over cap, so under normal load
it's a single len() comparison per call.
Worst case after the fix: 10K active attackers × 5 entries × ~16 bytes
of timestamp = ~800KB. Bounded. After 60s of attacker inactivity, the
lists trim to [] and the next sweep removes them.
Audit finding #11. Regression tests in
tests/test_login_rate_limit_dict_bounded.py pin four contracts:
- under cap: sweep is a no-op
- over cap: empty buckets removed, active ones preserved
- the sweep is actually wired into _check_login_rate_limit (a sweep
that no caller invokes would be dead code)
- end-to-end: rate limiting still blocks correctly after the sweep
* perf(renewal): thread settings through check_renewals to skip N+1 disk reads
The request-scoped settings cache (commit f0b77c3) only fires inside a
Flask request context. APScheduler's renewal job runs in a background
thread with no request context, so every domain in check_renewals
re-triggered a full settings.json load through get_certificate_info ->
_parse_certificate_info. For a 50-cert deployment that's 51 disk reads
+ JSON parses + migration runs in a single hourly tick — negligible at
homelab scale, wasteful at any larger scale, and trivial to fix.
Threading:
- get_certificate_info gains an optional `settings=None` parameter
- _parse_certificate_info gains the same
- both fall back to load_settings() when not supplied so every
existing call site (15+ across the codebase) keeps working
- check_renewals loads settings ONCE at the top and passes it down
on every per-domain get_certificate_info call
Audit finding #12.
Two existing tests in tests/test_per_cert_renewal.py were monkey-
patching get_certificate_info with a one-arg lambda; they now accept
the optional settings kwarg so the new call site
(`settings=settings`) doesn't trip them.
Regression tests in tests/test_renewal_job_settings_reuse.py pin three
contracts:
- check_renewals reads settings exactly ONCE regardless of domain count
- get_certificate_info called without settings still reloads on its
own (existing dashboard endpoint contract preserved)
- a caller-supplied settings dict is honoured, not silently discarded
* chore(ux): three small UX hardening items from UI audit list
The UI audit's CRITICAL/HIGH list of 12 findings contained zero clear
bugs after verification — `escapeHtml` is called on every interpolated
domain, the Promise chains have proper catch-and-show-UI fallbacks,
login CSRF is a non-issue against an admin tool with SameSite=Strict.
Three items were genuine defense-in-depth though, and they are bundled
here on user direction:
1. SSE retry give-up (templates/base.html, audit #1).
`sse.onerror` reconnected with exponential backoff capped at 30s,
but never stopped. A logged-out tab would keep firing a 401 every
30s for the lifetime of the open page. Add a `consecutiveFailures`
counter; after 10 in a row (~3 minutes of exponential retries) the
client gives up and writes a console.warn telling the user to
refresh. `onopen` resets the counter so legitimate transient
network blips still recover.
2. MutationObserver readyState guard (static/js/certmate.js, audit #2).
The modal-focus observer was attached inside a
`document.addEventListener('DOMContentLoaded', ...)` with no
fallback. If certmate.js loads AFTER DOMContentLoaded (defer / dyn
import / future hot-reload), the event already fired and the
listener never runs — modals would lose their Esc / backdrop /
focus-trap behaviour. Mirror the readyState pattern that
CM.refreshRole already uses above: if loading, listen; otherwise
attach immediately.
3. CertMate.confirm() before /api/web/cache/clear (static/js/settings.js,
audit #11).
`clearDeploymentCache()` POSTed straight to the server with no
confirm dialog — the operation isn't destructive (next dashboard
load re-probes every cert) but a misclick still wastes a probe
round and flickers every cert's deployment badge. Wrap in the same
CertMate.confirm pattern that the dashboard's invalidateAllCache
already uses for consistency.
The other 9 findings from the audit were all false positives /
YAGNI: domain XSS (escapeHtml present), Promise rejection swallowing
(nested catch present), Promise.all without catch (every batchPromise
has its own catch so Promise.all cannot reject), login CSRF (no
session cookie yet so middleware skips by design), innerHTML mass
reflow (single-batched is already optimal), no-debounce on cmd
palette (in-memory filter with slice(0,10), sub-ms), CSV race
(theoretical single-picker, no practical risk), loading panel stuck
(openCertDetail has no async fetch — data is in-memory), token
format check (server-side is the real layer).
No tests added: each change is either UX-only or guards a future-proof
loading-order edge case that has no current trigger.
* fix(tests): rewrite stale UI test assertions against v2.5.x templates
tests/test_ui.py was last updated in v2.4.10 (commit 834dd9b). v2.5.0/
v2.5.1 then shipped 51 template rewrites — including the help-page
overhaul (RELEASE_NOTES.md `fix(help): rewrite for user help, drop
marketing`) and the dashboard toggle that hides the create form behind
#toggleCreateForm (QW-15). Four UI tests asserted strings/elements
that no longer exist in the shipping UI and have been failing on main
since v2.5.0 was merged. They're caught now because this branch runs
the full e2e suite before the release PR.
The four fixes:
1. test_help_navigation: asserted `text=Getting Started` which is from
the pre-v2.5.0 quick-links grid. The v2.5.0 rewrite replaced the
grid with a section-nav strip; the first content section is now
"Quick Start". Match the current `nav[aria-label='Help sections']`
plus the Quick Start h3.
2. test_create_certificate_ui: tried `#domain.fill(...)` directly,
but the create form has been hidden behind #toggleCreateForm since
QW-15. Additionally, the first-run setup-wizard overlay (created
by static/js/setup-wizard.js when setup_completed is False, which
is always the case in a freshly-started test container) intercepts
pointer events on the whole viewport — every subsequent click was
timing out. Fix: remove the wizard overlay via page.evaluate(),
then click the toggle button before filling #domain.
3. test_docker_quick_start_visible → test_quick_start_section_visible:
the old "Docker Quick Start" card is gone from the rewritten help
page. Re-pointed at the stable `section#quick-start` anchor + its
h3 text.
4. test_first_steps_visible → test_troubleshooting_section_visible:
the old "First Steps" card is gone. Re-pointed at the stable
`section#troubleshooting` anchor — the diagnostic section users
reach when something breaks.
Each new selector is anchored to an HTML id that's stable across
ordinary UI refactors (section ids are part of the in-page anchor
links and the section search filter), so future help-page wording
changes can land without breaking these tests again.
Verified locally against the v2.5.x test container: 4/4 pass, full
e2e suite 112 passed + 2 xfailed + 1 skipped, zero failures.
* test(private_ca): 34 unit tests from 0% to deep coverage
modules/core/private_ca.PrivateCAGenerator is the root of trust for every
client certificate CertMate issues. A regression here invalidates every
downstream cert at once — and the file was sitting at 0% unit coverage,
so any silent break would only surface through e2e tests (where it'd be
diagnosed as "client certs aren't validating" with no pointer at the CA).
This adds 34 tests against the real `cryptography` primitives — no mocks
of crypto operations, no monkey-patched key generation. Tests are grouped
into seven sections; each assertion is named after the specific class of
compromise it blocks rather than the line of code it touches.
Coverage focus:
- TestInitialization (4 tests): idempotency, the force=True backup
contract, failure paths. The idempotency test specifically pins that
re-initialising on an existing CA does NOT regenerate — historically a
cause of "all my certs suddenly fail" outages.
- TestCACertificateShape (8 tests, the most critical): RSA-4096, self-
signed, basicConstraints CA=TRUE+critical, KeyUsage cert+CRL sign,
~10-year validity, CertMate subject, signature verifies against its
own public key, private key file is 0o600 (POSIX only).
- TestLoadedStateGuard (3 tests): sign/CRL operations must return None
when CA isn't loaded — never silently succeed against a broken state.
- TestSignCertificateRequest (10 tests): preserves CSR subject; issuer
matches CA subject; validity matches days_valid; signature verifies
against CA public key (the load-bearing assertion that proves the
CA actually signs); SAN extension propagated; SubjectKeyIdentifier
+ AuthorityKeyIdentifier present (RFC 5280 §4.2); clientAuth EKU
applied AND does not implicitly grant serverAuth (privilege-
escalation guard); unknown EKU strings ignored gracefully.
- TestGenerateCRL (6 tests): empty revocation list works; specified
serials present in CRL byte-for-byte; CRL issuer matches CA subject;
CRL signature verifies against CA public key (the only thing between
revocation enforcement and a silent bypass); persisted to disk;
get_crl_pem returns None pre-generation.
- TestMetadata (1 test): round-trip pins serial number string-encoding
(big-int can't survive JSON as a number).
- TestPEMExport (2 tests): get_ca_cert_pem parseable, export_ca_cert
writes valid PEM.
Performance: module-scoped fixture for the 4096-bit RSA keygen means
one keygen (~5s) shared across 34 tests instead of 34 × 5s. Total file
runtime: ~5s.
* test(csr_handler): 38 unit tests from 0% to deep coverage
modules/core/csr_handler.CSRHandler is the entry-point for the client-
certificate issuance flow — every malformed, oversized, or malicious
CSR passes through validate_csr_pem() before PrivateCA signs it. A bug
in this validator either locks out legitimate users (false reject) or
lets through input that crashes the signer (false accept). The file
was at 0% unit coverage so any regression would only surface as a
mysterious "client cert won't issue" e2e failure.
Tests group by method, with one assertion per failure class:
- TestValidateCSRPem (7 tests): valid PEM accepted with CSR object
returned (so callers don't re-parse); empty / None / garbage / truncated
PEM rejected; CSR without CommonName explicitly rejected (otherwise
the signer would mint an unidentifiable cert). One test pins the
validate→sign handoff contract: validator MUST return the parsed
CSR object, not force callers to re-parse the PEM.
- TestGetCSRInfo (7 tests): extracts CN, organization, country, key
size, signature algorithm; DNS+email SANs surfaced; missing fields
default to empty string (not KeyError); no-SAN case returns [].
- TestCreateCSR (13 tests): defaults round-trip through validate_csr_pem
(catches any signer/validator drift); 4096-bit key path; invalid key
sizes rejected (six parametrized: 512, 1024, 3072, 8192, 0, -1);
empty/oversized CN rejected (RFC 5280: CN ≤64 chars); FIVE control-
character CN attacks rejected (NUL byte injection, LDAP newline,
CR/tab/DEL — each can corrupt downstream log parsers, certbot stderr
handling, or LDAP integrations); SAN ceiling at exactly 100 (the
documented limit); SAN extension persisted in produced CSR; KeyUsage
marked critical AND key_cert_sign=False (a client cert that can sign
other certs would be a privilege escalation).
- TestPersistence (5 tests): save writes both files at the expected
paths and extensions; saved private key is 0o600 (POSIX); load round-
trip preserves CN; missing file returns False; garbage file returns
False.
All tests use real CSRs built with the `cryptography` library — no
mocked crypto operations. Total runtime: ~3.3s.
* test(ocsp_crl): 20 unit tests from 0% to deep coverage
modules/core/ocsp_crl.{OCSPResponder, CRLManager} together are the
revocation infrastructure. If either returns the wrong answer for a
revoked serial, a compromised client cert keeps validating against
trust stores that consult OCSP / CRL — a silent security bypass with
no log line. The file was at 0% unit coverage so any regression here
would only manifest as "revocation isn't working" after the fact.
Test groups, each pinning a specific failure mode:
- TestOCSPGetCertStatus (5 tests): the most security-critical method
in the file. 'good' for active, 'revoked' with metadata for revoked,
'unknown' for missing serial (NEVER 'good' — that would let an
attacker present a forged cert and have OCSP say yes), reason
defaults to 'unspecified' rather than crashing, manager exception
returns 'unknown' not 'good' (silent-blessing-during-outage guard).
- TestOCSPGenerateResponse (4 tests): 'good' response has no
revocation fields (privacy + format compliance); 'revoked' response
carries revocation_time + revocation_reason; malformed input gives
internal_error response without raising; the literal
'CertMate OCSP Responder' name is contract for external monitoring.
- TestCRLGetRevokedSerials (4 tests): only revoked surfaced; corrupted
serial strings (non-int, None) skipped instead of poisoning the whole
CRL; serial 0 dropped (RFC 5280 §4.1.2.2 forbids it); DB failure
returns [] not raises (empty CRL is safer than no CRL endpoint).
- TestCRLLifecycle (7 tests): update_crl produces parseable PEM with
and without revocations; DER round-trip through PEM preserves
revoked entries; get_crl_pem auto-generates if missing (otherwise a
freshly-started instance 404s forever on /crl); get_crl_info reports
'no_crl' on full failure path rather than crashing; revoked count
and issuer surfaced; get_crl_der returns None (never partial bytes)
when CRL unavailable.
Tests share a module-scoped real PrivateCAGenerator (one keygen for
all 20 tests). MagicMock for the client_cert_manager — its contract is
just `list_client_certificates(revoked=...)` so faking it is faithful.
Total runtime: ~0.6s.
* test(storage_backends): 56 unit tests targeting cross-cutting helpers + lifecycle
modules/core/storage_backends was at ~25% — the LocalFileSystemBackend
got incidental coverage from other tests, but the cross-cutting helpers
that ALL five backends rely on (retry decorator, transient-error
heuristic, domain validation) were unverified. Cloud-SDK request paths
deliberately stay out of scope here — exhaustive Azure/AWS/Vault SDK
mocking adds maintenance debt without proving the CertMate-layer
logic; the e2e suite + manual config tests cover that.
Focus areas:
- TestIsTransient (8 tests): 429/5xx marked transient (rate limits +
upstream errors); 401/400 marked NON-transient (stale creds + bad
requests must fail fast, not retry); boto3-style response metadata
unwrapped; message keyword fallback for SDKs that don't expose
status_code. The 401-not-transient assertion is the load-bearing
guard against a retry loop on expired credentials.
- TestWithRetry (4 tests): single attempt success, retry-then-recover,
non-transient errors propagate on first attempt (no retry, no sleep),
exhausted retries re-raise last exception.
- TestValidateStorageDomain (12 tests): valid domains pass (5 shapes),
path-traversal / null-byte / spaces / forward-slash / backslash
rejected (7 shapes). Every cloud backend funnels through this — a
bypass here would let one tenant write to another's secret name.
- TestLocalFileSystemBackend (9 tests): full store→retrieve round-trip;
private key file is 0o600 (POSIX); metadata file is 0o600; missing
domain → None; list returns only domain dirs WITH cert.pem (empty
dirs filtered); delete removes whole tree; delete missing → False;
backend_name contract.
- TestAzureSecretNameSanitisation (4 tests): only allowed chars in
output; DISTINCT inputs produce distinct outputs (the CRC32 suffix
is the collision guard — without it 'my-app.example.com' and
'my.app-example.com' both sanitize to the same secret name and one
cert overwrites the other); 127-char Azure cap respected;
deterministic.
- TestCloudBackendConstructorValidation (7 tests): each cloud backend
rejects empty required fields at __init__ time so misconfiguration
fails fast instead of producing a half-initialised client. Tests
every Azure required field (4 parametrized); AWS credentials; Vault
url/token; Infisical client_id.
- TestStorageManagerDispatch (5 tests): factory picks local/azure
correctly; unknown backend type falls back to local (typo doesn't
take down renewal); failed cloud-backend init falls back to local
(misconfigured Vault doesn't crash the worker); lazy + idempotent
initialization (load_settings called exactly once, same backend
instance returned across calls).
Total runtime: ~0.2s.
* test(certificates): 12 unit tests for concurrent issuance lock + DNS alias paths
Targets two specific gaps the audit list flagged in modules/core/
certificates.py:
1. Concurrent-issuance non-blocking lock (line 561). create_certificate
raises RuntimeError if another thread is already issuing for the same
domain. Without this guard, two simultaneous renewals would race over
data/certs/<domain>/ and trigger LE rate-limit denials. Five tests:
- first acquire of a fresh domain lock succeeds (sanity)
- locks are per-domain, not global (otherwise unrelated cert ops
would serialise needlessly)
- same-domain lock reused (a fresh lock per call would defeat the
non-blocking guard entirely)
- second create_certificate while a lock is already held raises
RuntimeError with the exact "already in progress" message
- end-to-end multi-threaded race with threading.Barrier(2): two
threads call create_certificate; exactly one hits the lock barrier,
exactly one passes through (and is allowed to fail downstream for
unrelated reasons — we only assert the barrier behaviour)
2. check_dns_alias_records DNS failure paths (line 319). The function
must surface a structured status (ok / missing / mismatch / error)
rather than raise or silently report 'ok'. Seven tests:
- ok when CNAME target matches
- mismatch when CNAME points elsewhere (the 'cert renewal silently
fails because alias drifted' scenario)
- missing when no CNAME at source (most common misconfig)
- error when DoH resolver itself raises (Cloudflare DoH down /
egress firewall blocked); status='error', message in error field
- SAN domains each get their own check entry, deduplicated
- overall ok only if every per-source check is ok
- trailing-dot normalisation: DNS records end with '.', config
usually doesn't — both forms must compare equal
The third gap from the audit (renewal-job N+1 settings reads) is
already covered by commit 7f7b41f's tests/test_renewal_job_settings_reuse.py.
Total runtime: ~0.14s.
* chore: bump version to v2.5.3 + prepend release notes
20 atomic commits across this branch — security/perf/UI/docs/coverage
audit response. Honest precision summary in the release notes: 8 true
positives fixed, 10 hardening items shipped, 35 false positives
documented + deferred. +160 unit tests across 5 new files. e2e suite
verified green (112 passed, 0 failures, real Cloudflare DNS-01 issuance
+ Playwright UI).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds detailed instructions for setting up CertMate as a systemd service to the main README.md documentation, addressing the need for production deployment guidance on Linux systems.
Changes Made
Added a new section "🔧 Setting Up CertMate as a Systemd Service" that includes:
📝 Complete Service Configuration
certmate.servicefile with proper systemd configuration🚀 Step-by-Step Setup Instructions
certmatesystem user for security/opt/certmate✅ Additional Benefits
Example Service Configuration
Why This Change
This addresses a common deployment scenario where users need to run CertMate as a production service on Linux systems. The systemd integration provides:
Impact
This makes CertMate easier to deploy in production environments while maintaining the same high-quality documentation standards established in the project.
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.