A symptom-keyed index for things that surface during Periscope installs, upgrades, and day-2 operation. Most feature-specific issues are documented inline with the feature — this page indexes them so an operator can grep by what they saw, plus adds cross-cutting items (chart-versions OOM, network changes, scanner false-positives) that don't fit a single feature doc.
If you don't find your symptom here, the single most useful diagnostic
source is the Periscope pod's slog stream: every privileged action
gets a structured line, and every request carries an
X-Request-Id that grep-joins to the matching audit row.
| Symptom | Where |
|---|---|
| "Open Shell" button missing on a pod page | pod-exec.md §8 |
| Shell button missing on a cluster page header | cluster-shell.md §9 |
| WebSocket upgrade fails / instant disconnect | pod-exec.md |
| `E_FORBIDDEN` on cluster-shell click | cluster-shell.md §9 |
| Shell session pod stays `Pending` >30s | cluster-shell.md §9, also see below |
| Helm command not in `cluster_shell_close.commands[]` | Upgraded? On v1.1.5+ helm is audited; pre-v1.1.5 it isn't (CHANGELOG) |
| Node shell button missing on a node page | node-shell-ssm.md §7 |
| Node shell `AccessDenied` on `StartSession` (`document/...`) | node-shell-ssm.md §7: split the policy into two `StartSession` statements |
| Node shell preflight: node not `Online` / `E_NODE_NOT_EC2` | node-shell-ssm.md §7 |
| Node shell `AccessDenied` on `AssumeRole` though `aud` looks right | node-shell-ssm.md §7: trust policy can't gate on `groups` |
| `E_REAUTH_REQUIRED` opening a node shell | node-shell-ssm.md §7: `id_token` stale, sign in again |
| Tier user gets 403 on everything | cluster-rbac.md |
| `groups` claim missing / empty | okta.md / auth0.md |
| Login bounces to IdP forever | okta.md / auth0.md |
| Refresh fails after N hours | okta.md / auth0.md |
| `Callback URL mismatch` / `Sign-in redirect URI mismatch` | okta.md / auth0.md |
| Audit page shows nothing / `auditEnabled: false` | audit.md |
| `X-Audit-Scope: self` for an admin user | audit.md |
| Audit DB grows past `maxSizeMB` | audit.md |
| SSE drops every ~60 seconds | watch-streams.md §7 |
| Watch stream returns 404 immediately | watch-streams.md §7 |
| Watch stream cap reached | watch-streams.md §7 |
| Page updates feel polled, not live | watch-streams.md §7 |
| EKS "Drift not computed" | eks-upgrade-readiness.md |
| EKS Upgrade Insights "load failed" | eks-upgrade-readiness.md |
| Agent registration `tls: certificate required` | agent-onboarding.md |
| Agent registration `SPKI pin mismatch` | agent-onboarding.md |
| Agent registration `registration rejected (HTTP 401)` | agent-onboarding.md |
| Agent connects, then drops every few seconds | agent-onboarding.md |
| Cluster card stuck "unreachable" | agent-onboarding.md |
| Periscope pod OOMKilled when listing chart versions | below |
| Artifact Hub flags HIGH/MEDIUM CVE on a clean image | below |
| Agent tunnel reconnect churn after the central LB changes IP | below |
| Local microk8s: TLS cert error after switching networks | below |
| Agent can't connect after a full reinstall (`connection refused`) | below |

Symptom: SPA fetches `/api/clusters/{c}/helm/chart/versions` against a public helm repo with a large `index.yaml` (`prometheus-community`, `bitnami`, `ingress-nginx`, etc.) and the Periscope pod restarts. `kubectl describe` shows `Last State: Terminated`, `Reason: OOMKilled`, `Exit Code: 137`.

Cause: the handler reads the entire `index.yaml` into memory and `yaml.Unmarshal`s it in one shot. For `prometheus-community` (5–10 MB `index.yaml`), the burst hits ~600–700 MiB RSS, well past the chart default `resources.limits.memory: 512Mi`. The pod is OOM-killed mid-request; ingress-nginx surfaces this as a 503.

Fix: bump the memory limit while a streaming-parse follow-up lands. `1Gi` is comfortable headroom for every public helm repo we've benchmarked:

    # values.yaml
    resources:
      requests:
        memory: 256Mi
      limits:
        memory: 1Gi

Apply with `helm upgrade periscope ... --reuse-values --set resources.limits.memory=1Gi`.

Notes:

- The 5-minute `chartVersionsCache` only helps the second request; the first fetch after pod start still pays the full burst.
- Affects all backends equally (in-cluster, agent, eks, kubeconfig).
- Steady-state Periscope RSS is ~200 MiB; the bump is for the chart-versions burst alone.
- The streaming-parse fix tracks under the helm-chart-versions endpoint refactor; once it lands, the `1Gi` bump can revert.
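
Before raising the limit, it can be worth confirming that the restart really was an OOM kill. The following is a minimal sketch, not part of Periscope: it reads `kubectl describe pod` output on stdin and prints the reason recorded under `Last State:`. The helper name `last_termination_reason` and the trimmed sample layout are assumptions for illustration.

```shell
#!/bin/sh
# Print the Reason recorded under "Last State:" in `kubectl describe pod`
# output read from stdin. The helper name is hypothetical.
last_termination_reason() {
  awk '
    /Last State:/        { in_last = 1; next }   # start of the block we care about
    in_last && /Reason:/ { print $2; exit }      # first Reason after it
  '
}

# Demo on a trimmed sample. In practice, pipe in the output of
# `kubectl -n <namespace> describe pod <periscope-pod>`.
last_termination_reason <<'EOF'
    State:          Running
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
EOF
```

If the printed reason is anything other than `OOMKilled`, the memory bump above is unlikely to be the right fix.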

Symptom: Operator clicks the shell button. The drawer opens, the WebSocket connects, but the terminal sits at "starting…" for 30 seconds. `kubectl -n periscope-system describe pod periscope-shell-*` shows `Failed: ErrImagePull` / `ImagePullBackOff` against `ghcr.io/gnana997/periscope-shell:<version>`.

Cause: the cluster's container runtime can't reach `ghcr.io`, typically because the cluster is air-gapped, the egress runs through a corporate proxy that doesn't allow `ghcr.io`, or the proxy needs auth not configured on the kubelet.

Fix: mirror the shell image into the registry the cluster can reach, then point the chart at it:

    # Server chart (values.yaml)
    clusterShell:
      enabled: true
      image:
        repository: registry.internal.acme.com/periscope/periscope-shell
        tag: v1.1.5  # match your installed periscope version
        pullPolicy: IfNotPresent

    # Agent chart (per managed cluster): only needed if you use the
    # agent backend and the kubelet on managed clusters also can't reach
    # ghcr.io. Same value, applied separately because the agent and
    # server can be sized for different network postures.

The shell image is opt-in (default `clusterShell.enabled: false`), so operators who don't use cluster shell never pull it. Once mirrored, operators on locked-down clusters get the same in-browser experience without punching a hole through their egress policy.

Notes:

- The `clusterShell.podStartTimeoutSeconds` (default 30s) determines how long Periscope waits before returning `E_SHELL_POD_TIMEOUT`. Bump it if your mirror is slow on first pull, but the underlying issue is registry reachability, not timeout tuning.
- The shell image is multi-arch (linux/amd64 + linux/arm64), so a one-time `crane copy` or equivalent into your mirror works for both.
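
The one-time mirror step could look roughly like the sketch below. It assumes `crane` is installed and already authenticated to both registries; the `MIRROR` and `VERSION` defaults are placeholders copied from the example values, and the `DRY_RUN` switch and function name are illustrative conventions, not Periscope features. By default it only prints the command it would run.

```shell
#!/bin/sh
# Sketch: mirror the cluster-shell image into an internal registry.
# MIRROR and VERSION are placeholders; set DRY_RUN=0 to actually copy.
SRC_REPO="ghcr.io/gnana997/periscope-shell"
MIRROR="${MIRROR:-registry.internal.acme.com/periscope/periscope-shell}"
VERSION="${VERSION:-v1.1.5}"
DRY_RUN="${DRY_RUN:-1}"

mirror_shell_image() {
  # Without a --platform filter, crane copies the whole multi-arch index.
  set -- crane copy "${SRC_REPO}:${VERSION}" "${MIRROR}:${VERSION}"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"          # dry run: show the command only
  else
    "$@"               # real run: execute it
  fi
}

mirror_shell_image
```

Run it once with the defaults to check the source and destination references, then again with `DRY_RUN=0` from a host that can reach both `ghcr.io` and the mirror.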

Symptom: The agent log on a managed cluster shows a steady cadence of:

    agent.dial_attempt server=wss://...:8443/api/agents/connect
    agent.dial_failed err=...
    agent.reconnect_in seconds=8

The central server's cluster registry shows the cluster as "unreachable" or flickering between connected and unreachable.

Cause: the agent's `serverURL` was set at install time. If the LB / Ingress IP behind that URL changes (for example, an AWS NLB rotates a node IP, the DNS-cache TTL for a hostname-based `serverURL` expires and the name resolves to a new address, or the central Periscope LB was reprovisioned), there's a brief window where the agent's cached DNS result or TCP connection is wrong.

Why it usually self-heals: the agent's reconnect supervisor (`internal/tunnel`) backs off `[0, 1s, 3s, 8s]` with a max 30s ceiling and re-dials from scratch each cycle. DNS re-resolves on each dial. A 1-2 minute window of churn is the typical signature of an LB rotation; longer means something else broke.

When to act:

- Hostname-based `serverURL`: usually nothing to do; DNS TTL expires and reconnect succeeds. If your DNS TTL is unusually long (>5 min), consider lowering it on the Periscope server's record to shorten the churn window.
- IP-based `serverURL` (rare in production; typically dev rigs or temporary topologies): you'll need to `helm upgrade` the agent with the new IP. After the upgrade, the agent reads the new value from the rendered env and reconnects on the next dial.

Diagnostic: turn on agent debug logging to see the per-attempt detail (DNS resolution, TCP error, TLS handshake). See agent-onboarding.md → Turn on debug logging on the agent.
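
To judge whether the churn is a short LB rotation or something longer-lived, a quick tally of dial outcomes over a recent window can help. This is a rough sketch: it matches the `agent.dial_attempt` and `agent.dial_failed` event names from the sample log lines, but the exact log layout, the function name, and the sample lines in the demo are assumptions.

```shell
#!/bin/sh
# Summarise dial outcomes from agent log text read on stdin.
# Matches event names only, so it works on plain or JSON-formatted lines.
dial_summary() {
  awk '
    /agent\.dial_attempt/ { attempts++ }
    /agent\.dial_failed/  { failures++ }
    END { printf "attempts=%d failures=%d\n", attempts, failures }
  '
}

# Demo with made-up sample lines. In practice, pipe in a recent window of
# the agent pod log, e.g. from `kubectl logs --since=5m <agent-pod>`.
dial_summary <<'EOF'
agent.dial_attempt server=wss://example:8443/api/agents/connect
agent.dial_failed err=timeout
agent.reconnect_in seconds=8
agent.dial_attempt server=wss://example:8443/api/agents/connect
EOF
```

A window where failures keep pace with attempts for well over two minutes points away from a simple LB rotation.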

Symptom: After uninstalling the central Periscope helm release and reinstalling it (e.g. testing a fresh-install flow on local microk8s or kind), the agent on a managed cluster crashloops with:

    "err":"state: registration: POST register:
    Post \"http://192.168.0.6:31429/api/agents/register\":
    dial tcp 192.168.0.6:31429: connect: connection refused"

The IP looks right but the port doesn't answer.

Cause: when the central Periscope service is exposed as `service.type: NodePort` (typical for local-dev rigs), Kubernetes allocates a new random NodePort on every fresh install. The agent's saved `serverURL` / `registrationURL` point at the old NodePorts, which the new service doesn't bind. TCP gets refused.

In-place `helm upgrade --reuse-values` preserves NodePorts because the Service object isn't recreated. Only a full uninstall → install triggers re-allocation.

Fix:

    # 1. Look up the new NodePorts
    microk8s kubectl -n periscope get svc periscope -o yaml \
      | grep -A2 'nodePort:'
    # Example output:
    #   nodePort: 30733   (port 8080: HTTP / registration)
    #   nodePort: 32167   (port 8443: tunnel)

    # 2. Update the agent values
    sed -i \
      -e 's|registrationURL: http://.*|registrationURL: http://<host>:30733|' \
      -e 's|serverURL: wss://.*|serverURL: wss://<host>:32167|' \
      kind-agent-values.yaml

    # 3. Helm upgrade the agent; it reconnects on the new URL
    helm upgrade periscope-agent \
      oci://ghcr.io/gnana997/charts/periscope-agent --version 1.1.5 \
      -n periscope -f kind-agent-values.yaml

If the agent had not yet completed registration (TCP refused at the register step), the bootstrap token is still unused, so you can keep it. If registration succeeded against the old install before the reinstall, the new central server doesn't have the agent's cert in its registry, so mint a fresh token from `POST /api/agents/tokens` and re-register.
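
Steps 1 and 2 of the fix can be made less error-prone by generating the two URL lines from the port mapping instead of editing them by hand. The sketch below assumes the 8080 (registration) and 8443 (tunnel) service ports from the example output in the fix; the function name and the tab-separated input format are illustrative choices, not part of the chart.

```shell
#!/bin/sh
# Turn "port<TAB>nodePort" lines on stdin into agent values lines.
# 8080 = HTTP / registration, 8443 = tunnel (per the example output).
render_agent_urls() {
  host=$1
  awk -v host="$host" '
    $1 == 8080 { printf "registrationURL: http://%s:%s\n", host, $2 }
    $1 == 8443 { printf "serverURL: wss://%s:%s\n", host, $2 }
  '
}

# Demo with the NodePorts from the example. Against a live cluster the
# input could come from:
#   microk8s kubectl -n periscope get svc periscope \
#     -o jsonpath='{range .spec.ports[*]}{.port}{"\t"}{.nodePort}{"\n"}{end}'
printf '8080\t30733\n8443\t32167\n' | render_agent_urls 192.168.0.6
```

The two printed lines can then be pasted into `kind-agent-values.yaml` before running the `helm upgrade` in step 3.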

To avoid: for local-dev rigs that you might re-install repeatedly, pin the NodePort by patching the Service after install:

    microk8s kubectl -n periscope patch svc periscope --type=json -p='[
      {"op":"replace","path":"/spec/ports/0/nodePort","value":31429},
      {"op":"replace","path":"/spec/ports/1/nodePort","value":31410}
    ]'

The chart doesn't currently expose `service.nodePorts.*` overrides (values issue on the backlog). For production, use a LoadBalancer or Ingress; NodePorts would be cluster-internal only and not referenced from the agent values, so this issue doesn't apply.

Symptom: After moving your laptop between networks (wifi → hotspot, office → home, etc.), `kubectl port-forward` against a local microk8s cluster fails with:

    error: error upgrading connection: error dialing backend: tls: failed
    to verify certificate: x509: certificate is valid for 192.168.0.6,
    172.17.0.1, 172.18.0.1, ..., not 172.20.10.3

The new IP (`172.20.10.3` here) is wherever your laptop landed after the network swap.

Cause: this is a snap-microk8s + mobile-laptop interaction, not a Periscope issue. Three ingredients combine to produce it:

- microk8s's apiserver binds to `0.0.0.0` (every interface, including ones that come and go with your network).
- The apiserver TLS cert was minted at install time with the host's IPs at that moment: your wifi IP became a SAN, but your hotspot IP didn't exist yet.
- microk8s runs an interface-watcher daemon that rewrites your kubeconfig's `server:` line when it sees a new primary IP. So `kubectl` now dials the new IP, the apiserver answers, and Go's x509 verifier rejects the cert.

Production clusters don't combine these three. Cloud-managed K8s (EKS, GKE, AKS) reaches the apiserver via stable DNS, and the cert covers the DNS name. kind binds to localhost only and never touches your host IPs. k3s defaults to a localhost-only kubeconfig. Only snap-microk8s combined with a moving laptop trips all three.

Fix (non-destructive, ~30 seconds of apiserver downtime, no workload disruption):

    sudo microk8s refresh-certs -e server.crt

This regenerates the apiserver cert with current host IPs in the SAN list. Pods keep running through the brief apiserver restart; kubelets reconnect; in-flight `kubectl` calls retry transparently.
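
To confirm this is the SAN mismatch described here and not some other TLS failure, the rejected IP can be pulled out of the error text and compared with your laptop's current address. This is a small sketch that assumes the error wording shown in the symptom; the function name is hypothetical.

```shell
#!/bin/sh
# Extract the IP that the x509 error reports as missing from the SAN list.
# Reads the error text on stdin; wrapped lines are joined first.
missing_san_ip() {
  tr '\n' ' ' |
    sed -n 's/.*certificate is valid for .*, not \([0-9.][0-9.]*\).*/\1/p'
}

# Demo with the error from the symptom above.
missing_san_ip <<'EOF'
error: error upgrading connection: error dialing backend: tls: failed
to verify certificate: x509: certificate is valid for 192.168.0.6,
172.17.0.1, 172.18.0.1, ..., not 172.20.10.3
EOF

# To list the SANs on the current cert, something like the following should
# work with OpenSSL 1.1.1 or newer (cert path assumed to sit next to
# csr.conf.template):
#   sudo openssl x509 -noout -ext subjectAltName \
#     -in /var/snap/microk8s/current/certs/server.crt
```

If the extracted IP matches the address your laptop picked up on the new network, `refresh-certs` is the right fix.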

Durable fix (one-time setup, ~10 minutes): pin the apiserver to a stable hostname your network state can't change:

- Add a hosts entry: `echo "127.0.0.1 periscope-dev.local" | sudo tee -a /etc/hosts`
- Edit `/var/snap/microk8s/current/certs/csr.conf.template`, add `DNS.5 = periscope-dev.local`
- `sudo microk8s refresh-certs -e server.crt`
- Update kubeconfig's `server:` line to `https://periscope-dev.local:16443`
- For Periscope-agent installs, set `agent.serverURL=wss://periscope-dev.local:31410` (or whatever your tunnel port is)

After this, you can swap networks freely without breaking kubectl, helm, or the agent tunnel.

Note for cluster operators: the durable fix is what production clusters do anyway (stable hostname, cert pinned to the hostname). A laptop dev rig is the only place this gets tricky.

See also: local-dev.md if it exists, otherwise this entry is the canonical place.

Almost every entry on this page can be confirmed (or ruled out) by grepping the Periscope pod's slog stream for matching keys. The binary writes structured JSON; relevant fields per scenario:
| Scenario | What to grep |
|---|---|
| OOM during chart-versions | helm chart versions (handler name) + check pod RestartCount |
| Cluster-shell pod won't start | cluster_shell.pod_create, cluster_shell.pod_wait_ready |
| Node shell won't open | ssm_shell.* (handler), node_shell (startup config line); the STS/SSM AccessDenied is in the audit ssm_session_open failure + the client error frame's message |
| Agent tunnel issues | tunnel.*, agent.session_event, agent.dial_* |
| OIDC / auth issues | auth.*, look for groups field on the line |
| Watch streams | watch.stream_open, watch.stream_close |
| Audit pipeline | audit:, especially around startup |
All log lines carry an X-Request-Id that the audit pipeline also
records on the corresponding audit row — one ID grepped across the
two surfaces gives a complete end-to-end trace.
For agent-backed clusters, the same request_id appears in both
the central server's stdout AND the agent's stdout (when
agent.logLevel: debug), so an end-to-end trace across three logs
is one grep of the same UUID.
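
The cross-log trace described above can be sketched as a single helper over saved log files. The function name, file names, and sample log lines below are made up for illustration; only the `request_id` field name comes from this page.

```shell
#!/bin/sh
# Print every line mentioning one request ID across the given log files,
# each prefixed with the file it came from.
trace_request() {
  id=$1
  shift
  grep -H -F -- "$id" "$@"   # -F: match the ID literally, -H: show file name
}

# Demo with two throwaway files standing in for saved server and agent logs.
tmp=$(mktemp -d)
printf '%s\n' '{"msg":"watch.stream_open","request_id":"abc-123"}' \
              '{"msg":"watch.stream_close","request_id":"zzz-999"}' > "$tmp/server.log"
printf '%s\n' '{"msg":"agent.session_event","request_id":"abc-123"}' > "$tmp/agent.log"
trace_request abc-123 "$tmp/server.log" "$tmp/agent.log"
rm -rf "$tmp"
```

Saving each pod's log to a file first (for example with `kubectl logs`) keeps the search repeatable while you narrow down the request.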
| File | What's there |
|---|---|
| pod-exec.md §8 | Pod exec session troubleshooting (4 entries) |
| cluster-shell.md §9 | Cluster shell troubleshooting (8 entries) |
| agent-onboarding.md | Agent registration + tunnel troubleshooting |
| audit.md | Audit pipeline issues |
| watch-streams.md | SSE watch stream issues |
| eks-upgrade-readiness.md | EKS Upgrade Insights / AMI drift |
| cluster-rbac.md | Tier mode / RBAC binding mistakes |
| okta.md | Okta-specific OIDC mistakes |
| auth0.md | Auth0-specific OIDC mistakes |