Add PHP shared-hosting fingerprint signals to schema-1 ping #3

@rumblefrog

Problem

Schema-1's env.* block today captures php, db_engine, db_version, web_server, os_family — enough to slice installs by language/database/server, but nothing that distinguishes a panel running on a constrained shared-hosting environment from one on a VPS, container, or dedicated box. That distinction matters for product decisions:

  • Feature roadmap. Anything that wants to shell out (background ban sync, scheduled exports, large image processing) is dead-on-arrival on a host with disable_functions covering proc_open/exec. Knowing what fraction of installs sit there changes how aggressively we'd ship those features.
  • Default tuning. Memory/execution limits inform sane defaults for paginated queries, batch sizes, and timeouts. If the median install runs at memory_limit=128M, the panel shouldn't default to 5K-row exports.
  • Support triage. "Why doesn't X work?" reports correlate strongly with shared-host signal sets. Shipping the signal once, in the ping, beats asking every reporter to run phpinfo().

There is no single reliable indicator for "this is a shared host" — it's a heuristic mix at best. Established loaders (ionCube Loader Wizard, SourceGuardian's installer, Zend Guard) and installers (Composer, WordPress Site Health, Drupal Status Report) all collect a similar bundle of weak signals and combine them. We should do the same: ship the raw fingerprint inputs in the ping, derive any "host kind" label query-time in SQL.

Design

Signals to add

Sourced from what the PHP loader installers and WP Site Health collect, narrowed to what's both (a) useful for a shared-host inference and (b) cheap and side-effect-free for a daily ping:

| Signal | Slot | PHP source | Notes |
| --- | --- | --- | --- |
| env.sapi | blob | php_sapi_name() | fpm-fcgi, apache2handler, cgi-fcgi, cli, litespeed, … Bounded cardinality. |
| env.memory_limit_mb | double | parse ini_get('memory_limit') | -1 → null. "512M" → 512. "1G" → 1024. |
| env.max_execution_time | double | (int) ini_get('max_execution_time') | 0 is meaningful (no limit) and distinct from absent. |
| env.disable_functions_count | double | count(array_filter(explode(',', ini_get('disable_functions')))) | Count, not the raw string; see Why no raw paths or raw disable_functions string below. |
| env.zts | bit | PHP_ZTS === 1 | Thread-safety. |
| env.php_64bit | bit | PHP_INT_SIZE === 8 | Useful corroborator; cheap. |
| env.open_basedir_set | bit | ini_get('open_basedir') !== '' | Strongest single shared-host signal. Boolean only; the value would leak the home dir path. |
| env.allow_url_fopen | bit | (bool) ini_get('allow_url_fopen') | Often disabled on hardened shared hosts. |
| env.opcache_loaded | bit | extension_loaded('Zend OPcache') | Tuning signal. |
| env.suhosin_loaded | bit | extension_loaded('suhosin') | Niche but a strong shared-host marker when present. |
| env.posix_available | bit | function_exists('posix_geteuid') | Its absence is itself a usable signal. |
| env.host_panel_cpanel | bit | @is_dir('/usr/local/cpanel') | Wrap with @ to avoid open_basedir warnings. |
| env.host_panel_plesk | bit | @is_dir('/usr/local/psa') | Same. |
| env.host_panel_directadmin | bit | @is_dir('/usr/local/directadmin') | Same. |
| env.docroot_user_home | bit | DOCUMENT_ROOT matches ~^/home/[^/]+/(public_html\|domains/.+/public_html)~ | Boolean derivation; the raw path leaks the cPanel/DA username. |
| env.sapi_per_user | bit | SAPI is fpm-fcgi/cgi-fcgi AND posix_geteuid() matches the owner of __FILE__ and isn't a known service uid (0, 33, 48, 82) | Per-user FPM / suEXEC / mod_ruid2 fingerprint. Gated on POSIX availability; emits 0 when POSIX is missing (no false positive). |

Total: 1 new blob, 3 new doubles, 12 new bits.
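
The env.memory_limit_mb parsing rule in the table can be sketched as follows. This is only an illustration of the rule, written in TypeScript for easy unit testing; the real parser would live panel-side in PHP. The function name parseMemoryLimitMb is hypothetical, and treating an unparseable value as absent (null) is an assumption, not something this issue specifies.

```typescript
// Hypothetical helper: converts a PHP memory_limit ini string to whole megabytes.
// PHP shorthand suffixes K, M, G are powers of 1024; a bare number is a byte count.
function parseMemoryLimitMb(raw: string): number | null {
  const value = raw.trim();
  if (value === "" || value === "-1") return null; // -1 means "no limit"
  const match = /^(\d+)\s*([kmg])?$/i.exec(value);
  if (match === null) return null; // assumption: unparseable values are treated as absent
  const amount = Number(match[1]);
  const suffix = (match[2] ?? "").toLowerCase();
  const bytesPerUnit: Record<string, number> = {
    "": 1,
    k: 1024,
    m: 1024 ** 2,
    g: 1024 ** 3,
  };
  // Round down to whole megabytes so the slot stays an integer-valued double.
  return Math.floor((amount * bytesPerUnit[suffix]) / 1024 ** 2);
}
```

With this sketch, "512M" maps to 512, "1G" to 1024, a bare "134217728" (bytes) to 128, and "-1" to null.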

Why ship raw signals, not a derived env.host_kind label

Considered shipping a single panel-derived env.host_kind ∈ {shared, vps, dedicated, container, unknown} blob. Rejected: locks the heuristic to whatever panel version emitted the ping, and the heuristic is exactly the thing we'd want to retune as we see what real installs look like. Raw signals + SQL CASE WHEN keeps the threshold tunable without a panel deploy. Document the recommended SQL pattern in the README so the first analyst doesn't reinvent it.

Slot allocation (proposed lock layout)

The README notes 10/20 blobs, 10/20 doubles committed today; this proposal uses 11/20 blobs, 13/20 doubles, 27/53 bits — well clear of the ceilings.

Pre-deploy reorder OK. Per #1's Locked contract section, schema-1 is mutable until first deploy; wrangler deploy has not been run against a real account yet. After this issue lands, the CONTRIBUTING.md append-only rule takes over and the layout below is frozen.

{
  "blobs": [
    "instance_id",
    "panel.version",
    "panel.git",
    "panel.theme",
    "env.php",
    "env.sapi",
    "env.os_family",
    "env.web_server",
    "env.db_engine",
    "env.db_version",
    "extras"
  ],
  "doubles": [
    "schema",
    "panel_features_bits",
    "env.memory_limit_mb",
    "env.max_execution_time",
    "env.disable_functions_count",
    "scale.admins",
    "scale.servers_enabled",
    "scale.bans_active",
    "scale.bans_total",
    "scale.comms_active",
    "scale.comms_total",
    "scale.submissions_30d",
    "scale.protests_30d"
  ],
  "bits": [
    "panel.dev",
    "features.submit",
    "features.protest",
    "features.comms",
    "features.kickit",
    "features.exportpublic",
    "features.publiccomments",
    "features.steamlogin",
    "features.normallogin",
    "features.groupbanning",
    "features.friendsbanning",
    "features.adminrehashing",
    "features.smtp_configured",
    "features.steam_api_key_set",
    "features.geoip_present",
    "env.zts",
    "env.php_64bit",
    "env.open_basedir_set",
    "env.allow_url_fopen",
    "env.opcache_loaded",
    "env.suhosin_loaded",
    "env.posix_available",
    "env.host_panel_cpanel",
    "env.host_panel_plesk",
    "env.host_panel_directadmin",
    "env.docroot_user_home",
    "env.sapi_per_user"
  ]
}

Reorder rationale:

  • Blobs. Inserted env.sapi after env.php. Reordered the existing four env.* blobs to group OS/web first, database last — this is purely cosmetic and is the only window we'll ever have to do it. extras stays last by convention.
  • Doubles. Moved the three new env.* doubles into positions 3–5, between the meta doubles (schema, panel_features_bits) and the scale.* block. Analysts reading column 5 of an AE row should see env signals adjacent to env signals, not interleaved with scale counters.
  • Bits. Pure append at positions 15–26. Existing 0–14 unchanged.
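
As a rough check of the layout above, the bounds and duplicate rules and the name-to-position lookup could be sketched like this. The names LockLayout, LIMITS, and validateLayout are hypothetical and are not taken from test/layout.test.ts; the arrays are copied from the proposed lock layout in this issue.

```typescript
interface LockLayout {
  blobs: string[];
  doubles: string[];
  bits: string[];
}

// Ceilings quoted in this issue: at most 20 blobs, 20 doubles, 53 bits.
const LIMITS = { blobs: 20, doubles: 20, bits: 53 } as const;

// Returns human-readable problems; an empty list means the layout passes.
function validateLayout(lock: LockLayout): string[] {
  const problems: string[] = [];
  for (const kind of ["blobs", "doubles", "bits"] as const) {
    const names = lock[kind];
    if (names.length > LIMITS[kind]) {
      problems.push(`${kind}: ${names.length} exceeds ${LIMITS[kind]}`);
    }
    const seen = new Set<string>();
    for (const name of names) {
      if (seen.has(name)) problems.push(`${kind}: duplicate ${name}`);
      seen.add(name);
    }
  }
  return problems;
}

// The proposed layout from this issue.
const proposedLock: LockLayout = {
  blobs: [
    "instance_id", "panel.version", "panel.git", "panel.theme", "env.php", "env.sapi",
    "env.os_family", "env.web_server", "env.db_engine", "env.db_version", "extras",
  ],
  doubles: [
    "schema", "panel_features_bits", "env.memory_limit_mb", "env.max_execution_time",
    "env.disable_functions_count", "scale.admins", "scale.servers_enabled",
    "scale.bans_active", "scale.bans_total", "scale.comms_active", "scale.comms_total",
    "scale.submissions_30d", "scale.protests_30d",
  ],
  bits: [
    "panel.dev", "features.submit", "features.protest", "features.comms",
    "features.kickit", "features.exportpublic", "features.publiccomments",
    "features.steamlogin", "features.normallogin", "features.groupbanning",
    "features.friendsbanning", "features.adminrehashing", "features.smtp_configured",
    "features.steam_api_key_set", "features.geoip_present", "env.zts", "env.php_64bit",
    "env.open_basedir_set", "env.allow_url_fopen", "env.opcache_loaded",
    "env.suhosin_loaded", "env.posix_available", "env.host_panel_cpanel",
    "env.host_panel_plesk", "env.host_panel_directadmin", "env.docroot_user_home",
    "env.sapi_per_user",
  ],
};

// The 0-based index in the bits array is the shift amount a query would use.
const openBasedirShift = proposedLock.bits.indexOf("env.open_basedir_set");
```

Here validateLayout(proposedLock) returns an empty list and openBasedirShift is 17. Looking positions up by name rather than hard-coding them keeps tests valid if the layout is reordered again before first deploy.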

Wire shape

Adds the new fields to the existing env sub-block of the panel's emitted body. All fields are optional per the optionality rule:

{
  "schema": 1,
  "instance_id": "…",
  "panel": { "…": "…" },
  "env": {
    "php": "8.2",
    "sapi": "fpm-fcgi",
    "db_engine": "mariadb",
    "db_version": "10.11",
    "web_server": "litespeed",
    "os_family": "linux",
    "memory_limit_mb": 256,
    "max_execution_time": 30,
    "disable_functions_count": 7,
    "zts": false,
    "php_64bit": true,
    "open_basedir_set": true,
    "allow_url_fopen": true,
    "opcache_loaded": true,
    "suhosin_loaded": false,
    "posix_available": true,
    "host_panel_cpanel": true,
    "host_panel_plesk": false,
    "host_panel_directadmin": false,
    "docroot_user_home": true,
    "sapi_per_user": true
  },
  "scale": { "…": "…" },
  "features": { "…": "…" }
}

env.web_server (existing) absorbs the LiteSpeed signal — LiteSpeed shows up in $_SERVER['SERVER_SOFTWARE'] and panel-side normalization should map it into the existing enum (apache, nginx, litespeed, iis, caddy, …). No new blob needed for that.
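
To make the optionality rule for the wire shape concrete, here is a minimal sketch of how a body with missing env fields could map onto slots. The helpers readDotted, readDouble, and packBits are hypothetical stand-ins; the repo's actual readPath and bit-pack code may be shaped differently.

```typescript
type Json = { [key: string]: unknown };

// Resolves a dotted lock name such as "env.sapi" against a nested body.
function readDotted(body: Json, path: string): unknown {
  let node: unknown = body;
  for (const key of path.split(".")) {
    if (typeof node !== "object" || node === null) return undefined;
    node = (node as Json)[key];
  }
  return node;
}

// A double slot is null when the field is absent; a present 0 stays 0.
function readDouble(body: Json, path: string): number | null {
  const value = readDotted(body, path);
  return typeof value === "number" && Number.isFinite(value) ? value : null;
}

// Packs booleans into one number; absent or non-true fields contribute 0.
// Powers of two are added instead of using | and <<, which truncate to
// 32 bits in JavaScript and would not reach the 53-bit ceiling.
function packBits(body: Json, bitNames: string[]): number {
  let packed = 0;
  bitNames.forEach((name, index) => {
    if (readDotted(body, name) === true) packed += 2 ** index;
  });
  return packed;
}

// A sparse body: the env block exists but carries almost none of the new fields.
const sparseBody: Json = { schema: 1, env: { php: "8.2", max_execution_time: 0 } };
const sparseMemory = readDouble(sparseBody, "env.memory_limit_mb");
const sparseTimeout = readDouble(sparseBody, "env.max_execution_time");
const sparseBits = packBits(sparseBody, ["env.zts", "env.open_basedir_set"]);
```

In this sketch sparseMemory is null, sparseTimeout is 0 (present and meaningful, not absent), and sparseBits is 0.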

Why no raw paths or raw disable_functions string

  • DOCUMENT_ROOT under cPanel/DirectAdmin is /home/<username>/public_html — the username is identifying. Ship the boolean (env.docroot_user_home) instead.
  • $_SERVER['SERVER_SOFTWARE'] can include build banners that uniquely fingerprint a host provider's custom Apache/LiteSpeed build. The shared-host-relevant content (cPanel/Plesk/DA presence, LiteSpeed-vs-Apache) is captured by the dedicated bits and the existing env.web_server enum.
  • disable_functions raw string is low-cardinality but is itself a host-provider fingerprint (each major shared host has a recognisable list). The count is enough for the heuristic; promote a per-function bitmask later if a query proves it's needed.
  • open_basedir value would leak the home directory path. Boolean only.

This is the same posture the rest of the schema takes (scale.* raw counts vs. buckets, etc.) — minimise per-install identifying surface inside the data we already commit to collecting.
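
The reductions described above (raw path to boolean, raw list to count) could look roughly like this. It is a TypeScript illustration of logic that would really run panel-side in PHP; the function names are hypothetical. One small difference from the PHP expression in the signal table is that this sketch trims whitespace around entries before counting.

```typescript
// Same pattern as the env.docroot_user_home row, written as a JavaScript RegExp.
const USER_HOME_DOCROOT = /^\/home\/[^\/]+\/(public_html|domains\/.+\/public_html)/;

// Only the boolean leaves the host; the path (and the username in it) does not.
function docrootUserHome(documentRoot: string): boolean {
  return USER_HOME_DOCROOT.test(documentRoot);
}

// Counts non-empty entries in a comma-separated disable_functions value.
function disableFunctionsCount(raw: string): number {
  return raw
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name !== "").length;
}
```

For example, /home/alice/public_html and /home/alice/domains/example.com/public_html both yield true, /var/www/html yields false, and "exec,proc_open,shell_exec" yields a count of 3.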

Recommended SQL macro for host_kind

To document in the README alongside featureFlag(name):

-- hostKind(): heuristic shared-vs-not classification computed query-time from
-- the env.* fingerprint signals. Tuneable without a panel/Worker deploy by
-- editing this query.
SELECT
  CASE
    WHEN ((toUInt64(double2) >> 17) & 1) = 1                       -- env.open_basedir_set
      AND (
        ((toUInt64(double2) >> 22) & 1) = 1                        -- env.host_panel_cpanel
        OR ((toUInt64(double2) >> 23) & 1) = 1                     -- env.host_panel_plesk
        OR ((toUInt64(double2) >> 24) & 1) = 1                     -- env.host_panel_directadmin
        OR ((toUInt64(double2) >> 25) & 1) = 1                     -- env.docroot_user_home
        OR ((toUInt64(double2) >> 26) & 1) = 1                     -- env.sapi_per_user
      )
      THEN 'shared'
    WHEN double3 IS NOT NULL AND double3 <= 256                    -- env.memory_limit_mb
      AND double4 IS NOT NULL AND double4 > 0 AND double4 <= 60    -- env.max_execution_time (0 = no limit)
      THEN 'constrained'
    ELSE 'unconstrained'
  END AS host_kind,
  count() AS pings
FROM telemetry
WHERE timestamp > now() - INTERVAL 7 DAY
GROUP BY host_kind
ORDER BY pings DESC;

Bit positions (>> 17, etc.) come from the new lock file; they're stable for the life of schema-1 once this lands.
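
For unit-testing the heuristic outside the database, the CASE expression can be mirrored in a few lines of TypeScript. This is a sketch under assumptions: the EnvRow shape and function names are hypothetical, and the bit positions are hard-coded from the proposed lock layout. It treats max_execution_time = 0 as "no limit" and so never counts it as constrained, following the note in the signal table.

```typescript
interface EnvRow {
  bits: number; // the packed bit field that the SQL shifts and masks
  memoryLimitMb: number | null;
  maxExecutionTime: number | null;
}

// Bit positions from the proposed lock layout in this issue.
const BIT = {
  openBasedirSet: 17,
  hostPanelCpanel: 22,
  hostPanelPlesk: 23,
  hostPanelDirectadmin: 24,
  docrootUserHome: 25,
  sapiPerUser: 26,
} as const;

// Division instead of >> so positions above 31 would also work.
function bitAt(packed: number, position: number): boolean {
  return Math.floor(packed / 2 ** position) % 2 === 1;
}

function hostKind(row: EnvRow): "shared" | "constrained" | "unconstrained" {
  const corroborated =
    bitAt(row.bits, BIT.hostPanelCpanel) ||
    bitAt(row.bits, BIT.hostPanelPlesk) ||
    bitAt(row.bits, BIT.hostPanelDirectadmin) ||
    bitAt(row.bits, BIT.docrootUserHome) ||
    bitAt(row.bits, BIT.sapiPerUser);
  if (bitAt(row.bits, BIT.openBasedirSet) && corroborated) return "shared";

  const lowMemory = row.memoryLimitMb !== null && row.memoryLimitMb <= 256;
  const shortTimeout =
    row.maxExecutionTime !== null &&
    row.maxExecutionTime > 0 && // 0 means "no limit", so it is not a constraint
    row.maxExecutionTime <= 60;
  return lowMemory && shortTimeout ? "constrained" : "unconstrained";
}
```

A row with open_basedir and the cPanel bit set classifies as shared; open_basedir alone with a 128 MB limit and a 30 s timeout classifies as constrained; a panel bit without open_basedir falls through to the limit checks.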

Acceptance criteria

  • schema/1.lock.json updated to the layout above; npm test -- -u regenerates test/__snapshots__/ae.test.ts.snap with the new positions; no other snapshot drift.
  • test/layout.test.ts still green: bounds (≤ 20/20/53), no duplicates, README AE-layout block byte-equal to schema/1.lock.json.
  • src/schema.ts's schema1Env extended with the 16 new optional fields, all .optional(). The whole env sub-schema stays .passthrough().
  • src/ae.ts requires no new mapping logic: every new field reads through the existing readPath / bit-pack path. (Confirm by adding a test: a payload with all 16 new env fields populated round-trips through mapDataPoint to the expected blob/double/bit positions.)
  • test/fixtures.ts's canonicalPayload includes all 16 new env fields with realistic shared-host-ish values; the snapshot reflects them.
  • Sparse-payload tests cover the new fields: posting a body with the existing canonical env block but none of the new fields returns 204; the new doubles serialize to null, the new bits serialize to 0, env.sapi blob serializes to null.
  • Per-feature bit-position tests for at least 3 of the new bits using lock.bits.indexOf(name) (mirrors the existing features.kickit test).
  • README.md's AE-layout block updated to match the new lock file.
  • README.md's example curl body (the one in Local dev) extended with the new env.* fields.
  • README.md gains a Deriving host_kind query-time sub-section under AE layout with the SQL macro above and the bit-position rationale.
  • CONTRIBUTING.md's rule 1 (append-only) is unchanged — this is the last legal reorder before that rule turns load-bearing.
  • All four CI gates pass: npm run typecheck, npm run lint, npm test, npm run deploy:dry-run.

Out of scope

  • Panel-side implementation. The PHP code that gathers these signals lives in sbpp/sourcebans-pp and lands in a paired panel PR (see Notes). This issue is the consumer-side schema work only.
  • Per-disable_functions bitmask. Defer until a query proves the count alone isn't enough. Adding it later is a clean append (one new double or a few bits).
  • Shipping env.host_kind as a typed blob. Rejected — derive query-time per the rationale above.
  • Shipping $_SERVER['SERVER_SOFTWARE'] verbatim. Rejected — see Why no raw paths …. If a panel sends it as an unknown key it'll land in extras and we can decide later.
  • Detecting Windows-flavoured shared hosting (Plesk-on-Windows, IIS shared). The bits cover Linux conventions; Windows shared is a long tail and the existing env.os_family already separates the cohorts. Worth its own follow-up if data shows enough Windows installs to matter.
  • Backfilling existing AE rows. N/A — nothing deployed yet.

Open questions (resolve in the PR, not the issue)

  • env.php_64bit — keep or drop? Cheap to ship but has near-zero analytical use in 2026 (32-bit PHP is rare). Argument for keeping: it's a reliable corroborator for old-OS detection, and we have plenty of bit slots. Argument for dropping: bits are append-only once shipped, so a useless bit is a permanent waste of one position. Lean: keep, but defensible either way.
  • env.allow_url_include — add now or defer? It's adjacent to allow_url_fopen and a security signal, but only weakly correlated with shared hosting. Defer (lands in extras if a panel emits it; promote later if useful).
  • "Known service uid" set for env.sapi_per_user. Proposed: {0, 33, 48, 82} (root, www-data, apache, www on FreeBSD). Different distributions use different uids; the panel-side helper should probably accept any uid that owns __FILE__ AND is ≥ 1000 as the per-user signal, rather than enumerating service uids. Decide in the panel PR — the wire field is just a bit, the panel chooses how to populate it.
  • Should env.memory_limit_mb be the parsed integer or the raw string? Parsed wins for query ergonomics (numeric aggregations, percentiles). Raw wins for round-tripping unusual values. Parsed is proposed; the parser is ~5 lines of PHP.
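
For the "known service uid" question above, the two candidate rules can be compared side by side. This is a TypeScript sketch of logic that would live in the panel's PHP helper; the SapiProbe shape and both function names are hypothetical, and the default threshold of 1000 is the value floated in the question, not a settled choice.

```typescript
interface SapiProbe {
  sapi: string;
  posixAvailable: boolean;
  euid: number | null; // effective uid, null when POSIX is missing
  fileOwnerUid: number | null; // owner of the panel file
}

const PER_USER_SAPIS = new Set(["fpm-fcgi", "cgi-fcgi"]);
const KNOWN_SERVICE_UIDS = new Set([0, 33, 48, 82]);

// Shared precondition: POSIX present, per-user-capable SAPI, and the process
// runs as the uid that owns the file. Returns that uid, or null otherwise.
function matchingEuid(p: SapiProbe): number | null {
  if (!p.posixAvailable || p.euid === null) return null;
  if (!PER_USER_SAPIS.has(p.sapi)) return null;
  return p.euid === p.fileOwnerUid ? p.euid : null;
}

// Variant A: enumerate service uids and exclude them.
function sapiPerUserByDenylist(p: SapiProbe): boolean {
  const uid = matchingEuid(p);
  return uid !== null && !KNOWN_SERVICE_UIDS.has(uid);
}

// Variant B: accept only uids at or above a regular-user threshold.
function sapiPerUserByThreshold(p: SapiProbe, minUserUid = 1000): boolean {
  const uid = matchingEuid(p);
  return uid !== null && uid >= minUserUid;
}
```

The variants agree for an ordinary account such as uid 1001 and for the enumerated service uids. They diverge for a service account that is not in the enumerated set, for example a hypothetical uid 101: the denylist reports per-user, the threshold does not.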

Notes

  • This is an additive + permitted-reorder change against schema-1, made possible by the fact that nothing is deployed. After this PR merges and the Worker actually ships, the CONTRIBUTING.md rule 1 (append-only) is in effect and a similar reorder would require a schema bump.
  • Lockstep panel-side work needed in sbpp/sourcebans-pp before first deploy (open as its own issue / paired PR; do not merge here without it):
    1. Re-vendor web/includes/telemetry/schema-1.lock.json from the new lock file in this repo.
    2. Add an Sbpp\Telemetry\HostFingerprint helper that computes the 16 new env.* fields, with each individual probe wrapped to be safe under open_basedir / disable_functions / missing POSIX (the latter is what the env.posix_available boolean is for; the helper must not throw when POSIX is absent).
    3. Wire the helper into Telemetry::collect() so the fields populate the env sub-block.
    4. Extend the panel's TelemetrySchemaParityTest (PHPUnit) to include the new field names — the existing parity assertion already deep-equals against the vendored lock file, so it'll fail loudly until the helper is in place.
  • Source survey (what real PHP loaders/installers collect, for context on field choice):
    • ionCube Loader Wizard: php_sapi_name, PHP_VERSION, PHP_INT_SIZE, PHP_ZTS, PHP_OS, extension_dir, loaded Zend extensions, dl() availability.
    • SourceGuardian Loader Installer: same shape, plus a uid/gid check on the loader file.
    • Composer (composer diagnose): php_sapi_name, disable_functions, allow_url_fopen, open_basedir, suhosin.executor.*, memory_limit, xdebug.mode, OPcache state, proc_open availability.
    • WordPress Site Health (class-wp-debug-data.php): php_sapi_name, PHP_VERSION, PHP_INT_SIZE, memory_limit, max_execution_time, max_input_vars, upload_max_filesize, disable_functions, $_SERVER['SERVER_SOFTWARE'], loaded extensions, plus get_filesystem_method() === 'direct' as the de-facto "are we shared" check.
    • Drupal Status Report: adds apache_get_modules() when available; trusted-host check.
  • The signal set above is the intersection of what those tools actually use and what's safe to ship under this repo's privacy contract — every raw-string / raw-path field they collect is reduced to a boolean here for that reason.
