We have three environments: dev, staging, prod.
dev:
- Our development environment.
- We test and break things by ourselves in dev.
- Runs the latest commit of branch `main` for every microservice.
staging:
- Our quality assurance environment.
- As identical to production as possible.
- The development environments of the other systems we integrate with, e.g. Waltti Raportointi or Waltti Vehicle Registry, connect to staging.
- New APC counting systems or new versions of old APC counting systems connect to staging before they are accepted into production.
prod:
- Our production environment.
- Production data is what is shown to passengers and Waltti customers, e.g. transit planners in municipalities.
- The production environments of the other systems we integrate with, e.g. Waltti Raportointi or Waltti Vehicle Registry, connect to prod.
The following table describes the differences in naming between the environments.
Property | pilot dev | pilot prod | dev | staging | prod | common |
---|---|---|---|---|---|---|
Description | Pilot phase development environment | Pilot phase production environment | Productization phase development environment | Productization phase staging environment | Productization phase production environment | Common, shared things like DNS management for productization phase |
Method | ClickOps | ClickOps | IaC | IaC | IaC | IaC |
GCP project | apc-sandbox | apc-sandbox | apc-sandbox | apc-staging | apc-sandbox | |
GCP resource region, e.g. for K8s | europe-west3 (Frankfurt, Germany) | europe-west3 (Frankfurt, Germany) | europe-west3 (Frankfurt, Germany) | europe-west3 (Frankfurt, Germany) | europe-west3 (Frankfurt, Germany) | |
K8s cluster | sandbox-autopilot | sandbox-autopilot | prototype | staging | N/A | |
K8s namespace | dev | sandbox | dev | staging | N/A | |
CloudAMQP team | Waltti | Waltti | Waltti | Waltti | N/A | |
CloudAMQP instance name | dev-mqtt | dev-mqtt | sandbox-mqtt | staging-mqtt | N/A | |
CloudAMQP instance tags | dev | dev | sandbox | staging | N/A | |
CloudAMQP GCP region | europe-west3 (Frankfurt, Germany) | europe-west3 (Frankfurt, Germany) | europe-west3 (Frankfurt, Germany) | europe-west3 (Frankfurt, Germany) | N/A | |
CloudAMQP MQTT hostname | burly-gold-finch.rmq3.cloudamqp.com | burly-gold-finch.rmq3.cloudamqp.com | lively-cobalt-wasp.rmq5.cloudamqp.com | crisp-green-hippo.rmq2.cloudamqp.com | N/A | |
MQTT broker CNAME | dev.mqtt.apc.lmj.fi | dev.mqtt.apc.lmj.fi | mqtt-dev.apc.waltti.fi | mqtt-staging.apc.waltti.fi | mqtt.apc.waltti.fi | N/A |
StreamNative organization | waltti | waltti | waltti | waltti | N/A | |
StreamNative instance | waltti | waltti | alpha | beta | N/A | |
StreamNative cluster | pulsar | pulsar | sandbox | staging | N/A | |
StreamNative tenant | apc-dev | apc-sandbox | apc-sandbox | apc-staging | N/A | |
StreamNative service URL | pulsar+ssl://pulsar.waltti.snio.cloud:6651 | pulsar+ssl://pulsar.waltti.snio.cloud:6651 | pulsar+ssl://sandbox.waltti.snio.cloud:6651 | pulsar+ssl://staging.waltti.snio.cloud:6651 | N/A | |
StreamNative OAuth 2.0 audience | urn:sn:pulsar:waltti:waltti | urn:sn:pulsar:waltti:waltti | urn:sn:pulsar:waltti:alpha | urn:sn:pulsar:waltti:beta | N/A | |
StreamNative GCP region | europe-west3 (Frankfurt, Germany) | europe-west3 (Frankfurt, Germany) | europe-west1 (St. Ghislain, Belgium) | europe-west1 (St. Ghislain, Belgium) | N/A | |
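
To avoid retyping values from the table, a service or script could read them from one lookup. The following is a minimal sketch that covers only the StreamNative rows for dev and staging, since the table lists prod as N/A. The names `PulsarSettings`, `PULSAR_SETTINGS` and `pulsar_settings` are hypothetical and not part of any existing codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PulsarSettings:
    """StreamNative connection details for one environment (hypothetical type)."""

    tenant: str
    service_url: str
    oauth2_audience: str


# Values copied from the dev and staging columns of the table above.
PULSAR_SETTINGS = {
    "dev": PulsarSettings(
        tenant="apc-sandbox",
        service_url="pulsar+ssl://sandbox.waltti.snio.cloud:6651",
        oauth2_audience="urn:sn:pulsar:waltti:alpha",
    ),
    "staging": PulsarSettings(
        tenant="apc-staging",
        service_url="pulsar+ssl://staging.waltti.snio.cloud:6651",
        oauth2_audience="urn:sn:pulsar:waltti:beta",
    ),
}


def pulsar_settings(environment: str) -> PulsarSettings:
    """Return the settings for an environment, failing loudly on unknown names."""
    try:
        return PULSAR_SETTINGS[environment]
    except KeyError:
        raise ValueError(f"no Pulsar settings recorded for {environment!r}") from None
```

Raising on an unknown environment, rather than falling back to a default, keeps a typo from silently pointing a service at the wrong cluster.
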
- Always pull the latest `edge` tag into the dev environment.
- Run e2e tests in dev.
- The e2e tests are a separate entity: they test multiple microservices together, not a single microservice.
- Pull the latest service versions that passed e2e tests in the dev environment into the staging environment, for example every Wednesday morning at 09:00 local time.
- In staging manifests, however, use `sha-*` tags instead of `edge`.
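
The promotion step from dev to staging could be sketched roughly as follows. This assumes that each e2e run records which commit of every microservice it tested and that staging images are tagged `sha-<short commit SHA>`. The `E2eRun` record and the function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Mapping, Optional, Sequence


@dataclass(frozen=True)
class E2eRun:
    """One e2e test run in dev (hypothetical record)."""

    finished_at: datetime
    passed: bool
    versions: Mapping[str, str]  # microservice name -> short commit SHA


def staging_image_tags(runs: Sequence[E2eRun]) -> Optional[Dict[str, str]]:
    """Return service -> sha-* tag for the latest passing run, or None if none passed."""
    passing = [run for run in runs if run.passed]
    if not passing:
        return None
    latest = max(passing, key=lambda run: run.finished_at)
    # Pin every service to the exact commit that was tested together.
    return {service: f"sha-{sha}" for service, sha in latest.versions.items()}
```

Taking all versions from a single passing run, instead of the newest passing commit per service, keeps staging on a combination of services that was actually tested together.
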
In staging and prod, monitor:
- log message counts of Pulsar clients for different error levels
- message rates of different Pulsar topics
- Pulsar topic storage usage
- cloud resource consumption
If staging or prod has problems according to monitoring, notify developers.
If staging has had no problems for, e.g., a week, manually copy the staging image versions into prod and update the `prod` tag in the manifest repo.
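
The "no problems for a week" rule can be written down as a small check to run before the manual copy. This is a sketch under assumptions: the one-week quiet period is only the example value from the text, and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def ready_for_prod(
    staging_deployed_at: datetime,
    last_problem_at: Optional[datetime],
    now: datetime,
    quiet_period: timedelta = timedelta(days=7),  # example value, not a fixed policy
) -> bool:
    """Return True if staging has run the current versions without problems long enough."""
    # The quiet period starts at the deployment or at the last problem, whichever is later.
    quiet_since = staging_deployed_at
    if last_problem_at is not None and last_problem_at > quiet_since:
        quiet_since = last_problem_at
    return now - quiet_since >= quiet_period
```

A problem that predates the current staging deployment does not reset the clock, because it concerns older image versions.
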
Our services should survive if external services go down.
If we write broken logic, it has to survive the following before reaching prod:
- CI tests, e.g. type checking, linting and unit tests,
- e2e tests in dev,
- e.g. a week in staging without tripping alarms.
If the broken logic breaks the data flow so that a downstream microservice chokes on the broken messages it consumes from Pulsar, the easiest remedy in dev and staging is probably to fix the logic bug and forcefully empty the Pulsar topics that contain the broken messages.
If the broken logic reaches prod, we might need to fix the downstream service with a kludge.
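
For dev and staging, emptying a topic as described above might look like the following sketch. It assumes that the Apache Pulsar `pulsar-admin` CLI is installed and already configured for the target cluster, and that clearing the backlog of the consuming subscription is sufficient. The managed StreamNative setup may require a different tool. The function names and the example topic and subscription are hypothetical.

```python
import subprocess
from typing import List


def clear_backlog_command(topic: str, subscription: str) -> List[str]:
    """Build a pulsar-admin call that skips all unacknowledged messages of a subscription."""
    return [
        "pulsar-admin", "topics", "clear-backlog",
        "--subscription", subscription,
        topic,
    ]


def empty_topic(environment: str, topic: str, subscription: str, dry_run: bool = True) -> List[str]:
    """Clear a subscription backlog in dev or staging; never in prod."""
    if environment not in ("dev", "staging"):
        raise ValueError("refusing to empty topics outside dev and staging")
    command = clear_backlog_command(topic, subscription)
    if not dry_run:
        # Assumes pulsar-admin is on PATH and authenticated against the right cluster.
        subprocess.run(command, check=True)
    return command


# Hypothetical topic and subscription names, for illustration only.
print(empty_topic("dev", "persistent://apc-sandbox/example/broken-topic", "example-subscription"))
```

The default `dry_run=True` only returns the command, so it can be reviewed before anything is deleted.
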