Spring Boot 2.2→2.7 upgrade removed javax.validation from the default web starter; id-gen-service needs spring-boot-starter-validation added explicitly. Spring Boot 2.4+ also dropped the JUnit 4 vintage engine from the test starter; migrated both services' test files from @RunWith/@Ignore to the JUnit 5 equivalents (@ExtendWith/@Disabled). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
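A minimal sketch of the added dependency (standard coordinates; the version is managed by the Spring Boot parent):

```xml
<!-- Bean Validation is no longer pulled in by spring-boot-starter-web,
     so the starter must be declared explicitly -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-validation</artifactId>
</dependency>
```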
yarn install fails on node:alpine images because some dependencies require git for resolution. Adding git via apk before yarn install. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Some deps (digitalcredentials/jsonld.js) use ssh://git@github.com/ URLs. Alpine CI runners have no SSH keys, so rewrite to HTTPS at build time. Also add openssh-client so git can handle any remaining SSH probes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
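A sketch of the Dockerfile change, assuming an alpine-based build stage (the yarn invocation and layer layout are illustrative):

```dockerfile
# git is needed for yarn to resolve git-hosted dependencies;
# openssh-client handles any remaining SSH probes
RUN apk add --no-cache git openssh-client

# CI runners have no SSH keys, so rewrite ssh:// GitHub URLs
# (e.g. digitalcredentials/jsonld.js) to HTTPS before resolution
RUN git config --global url."https://github.com/".insteadOf "ssh://git@github.com/" \
 && yarn install
```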
postgres:latest now resolves to 17.x which has stricter initialization timing. Pin to postgres:14 (stable LTS) and add start_period so health check failures during container startup don't count as retries. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
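The compose change can be sketched as follows (the probe command and timing values are illustrative, not the repo's exact file):

```yaml
db:
  image: postgres:14          # pinned; postgres:latest now resolves to 17.x
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U postgres"]
    interval: 10s
    timeout: 5s
    retries: 10
    start_period: 30s         # failures in this window don't count as retries
```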
1. wait_for_port.sh used [[ ]] (bash) but the Makefile calls it via 'sh' (dash on Ubuntu). The counter comparison never triggered an exit, so the loop ran for 6h until the GH Actions timeout. Fixed: use POSIX [ ] syntax. 2. The Keycloak healthcheck tested management port 9990 (up in ~10s), but the HTTP server on 8080 takes ~29s. Registry started before 8080 was ready, failed to fetch OIDC config, and crashed. Fixed: test 8080. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Port 8080 (JBoss HTTP) takes 3-5+ min on CI; health check timed out at 5.5 min before Keycloak was ready, blocking registry from starting. Revert to port 9990 (management port, up in <10s) which was working before. The Makefile already calls wait_for_port.sh 8080 after docker-compose up completes, handling actual HTTP readiness. Also remove start_period from db: postgres:14 initializes fast and start_period was causing docker-compose v1 to start keycloak before postgres finished its first health check. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
wait_for_port.sh: ((i=i+1)) is bash-only and does not update the variable in the parent shell when run via sh/dash, causing an infinite loop. Replace with POSIX-compatible i=$((i+1)). TenantJwtDecoder / CustomJwtDecoders: SecurityConfig.addManager() was calling CustomJwtDecoders.fromOidcIssuerLocation() at Spring Boot startup, which eagerly fetched the OIDC discovery document from http://keycloak:8080/auth/realms/sunbird-rc/.well-known/openid-configuration. Keycloak's WildFly management port (9990) becomes healthy ~16s after container start, but the HTTP port (8080) is only ready ~10s later. Registry starts immediately after keycloak:service_healthy (9990), so the OIDC fetch raced and failed, preventing Spring Boot from starting. Fix by deferring the HTTP call to the first actual JWT decode using double-checked locking in TenantJwtDecoder. The issuer key is set to oidcIssuerLocation directly (OIDC spec requires the discovery doc issuer to match the location URL). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Increase wait_for_port.sh timeout from 60 to 90 retries (600s→900s) to give services more headroom in slow CI environments. Add --max-time 5 to curl to prevent hangs on half-open connections, and suppress the noisy progress bar with -s. Add a 'Dump container logs on failure' workflow step that runs on failure to show registry/keycloak/db logs — previously there was no way to diagnose why a service failed to start. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
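A POSIX-sh sketch of the resulting retry loop; the function name, arguments, and demo probes are illustrative, not the script's real interface (the real probe is a curl against a TCP port):

```shell
#!/bin/sh
# POSIX-compatible retry loop: [ ] instead of the bash-only [[ ]],
# and i=$((i+1)) instead of ((i=i+1)), so it behaves the same under
# dash/ash as under bash.
wait_for() {
  probe=$1
  max_retries=${2:-90}            # 90 retries, one sleep each
  i=0
  while ! eval "$probe" >/dev/null 2>&1; do
    i=$((i+1))                    # POSIX arithmetic updates i in dash too
    if [ "$i" -ge "$max_retries" ]; then
      echo "timed out after $i attempts"
      return 1
    fi
    sleep 1                       # the real script sleeps 10s (90 x 10s = 900s)
  done
  echo "ready after $i retries"
}

# The real probe would be something like:
#   curl -s --max-time 5 "http://localhost:$PORT/"
# (-s suppresses the progress bar; --max-time 5 prevents hangs on
#  half-open connections)
wait_for true        # succeeds immediately: "ready after 0 retries"
wait_for false 3     # demo of the timeout path
```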
The Maven pom has maven.compiler.target=11, producing class file version 55 (Java 11). The old base image frolvlad/alpine-java:jdk8-slim runs Java 8, which only supports class file versions up to 52. This caused an UnsupportedClassVersionError on every registry container start, making port 8081 permanently inaccessible in CI. Switch to eclipse-temurin:11-jre-alpine which is a well-maintained JRE 11 Alpine image and matches the compiler target. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
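The base-image swap, sketched (the jar path is hypothetical):

```dockerfile
# eclipse-temurin:11-jre-alpine runs class file version 55 (Java 11),
# matching maven.compiler.target=11; frolvlad/alpine-java:jdk8-slim
# tops out at version 52 (Java 8) and throws UnsupportedClassVersionError
FROM eclipse-temurin:11-jre-alpine
COPY target/registry.jar /app/registry.jar
ENTRYPOINT ["java", "-jar", "/app/registry.jar"]
```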
Without KAFKA_LISTENERS, Confluent Kafka derives bind addresses from KAFKA_ADVERTISED_LISTENERS. The INTERNAL listener was advertised as kafka:9092 (the container hostname IP), so Kafka only bound to that IP — not 127.0.0.1. The health check command runs inside the container and targets localhost:9092, which was never listening, so the check always failed and the metrics service (which depends on kafka being healthy) could never start. Add KAFKA_LISTENERS with 0.0.0.0 bindings so both listeners are reachable on all interfaces inside the container. Also add a 30s start_period so Kafka has time to fully initialize before health check retries start counting. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
confluentinc/cp-kafka:latest is now version 8.2.0, which removed Zookeeper support entirely. The existing KAFKA_ZOOKEEPER_CONNECT config is not recognized, so Kafka never starts, its health check always fails, and the metrics service (which depends on kafka:service_healthy) can never launch. Pin both cp-kafka and cp-zookeeper to 7.6.1, the last stable 7.x release that supports Zookeeper-based configuration. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
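Combined with the listener fix, the Kafka service ends up roughly like this; listener names, ports, the security-protocol map, and the health check probe are assumptions, not the repo's exact compose file:

```yaml
zookeeper:
  image: confluentinc/cp-zookeeper:7.6.1   # pinned: cp-*:latest (8.x) drops Zookeeper
kafka:
  image: confluentinc/cp-kafka:7.6.1
  environment:
    KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    KAFKA_ADVERTISED_LISTENERS: INTERNAL://kafka:9092,EXTERNAL://localhost:9094
    KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
    KAFKA_INTER_BROKER_LISTENER_NAME: INTERNAL
    # Without KAFKA_LISTENERS, bind addresses are derived from the
    # advertised listeners, so localhost:9092 inside the container
    # never listens and the in-container health check can't connect
    KAFKA_LISTENERS: INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9094
  healthcheck:
    test: ["CMD-SHELL", "kafka-broker-api-versions --bootstrap-server localhost:9092"]
    interval: 15s
    timeout: 10s
    retries: 10
    start_period: 30s        # give the broker time to fully initialize
```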
…ests The second E2E suite tests http://localhost:8070/v1/metrics but the Makefile only waited for keycloak (8080) and registry (8081). Metrics starts after both kafka and registry are healthy, so it may not be ready by the time Karate reaches that test. Add wait_for_port.sh 8070 to ensure the metrics service is accepting connections first. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four related fixes for the second (async) test suite:
1. CustomJwtDecoderProviderConfigurationUtils: the lazy OIDC decoder's RestTemplate had no timeout. Under CI load with 16 services, the OIDC discovery call to keycloak:8080 could take >30s, blocking all 5 parallel Karate threads until Karate's own 30s timeout fired. Add connectTimeout=10s and readTimeout=15s via SimpleClientHttpRequestFactory so the call fails fast and retries.
2. docker-compose-v1.yml: metrics depends_on now includes clickhouse:service_healthy. The metrics service calls log.Fatal on any ClickHouse query error, killing the process. Without the dependency, metrics started before ClickHouse was ready, received its first Kafka metric, tried to insert, got a connection error, and crashed, making port 8070 disappear after wait_for_port passed.
3. karate-config.js: raise Karate's readTimeout from the default 30s to 90s and connectTimeout to 10s. With 16+ containers running on a shared CI runner, operations that take 1-5s in the first test suite (6 containers) can take 30-60s in the second (16 containers). The old 30s Karate timeout was too tight for this environment.
4. Makefile: add an explicit OIDC endpoint readiness check (curl --retry on .well-known/openid-configuration) before both test suites. This ensures the lazy decoder's very first call finds Keycloak fully initialized, not just TCP-reachable on port 8080.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
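The Makefile readiness gate might look like this (the target name, host port, and retry flags are illustrative; the realm path comes from the commit above, and recipe lines must be tab-indented in the real Makefile). --retry-connrefused matters here because plain --retry does not retry refused connections:

```make
wait-oidc:
	curl -sf --retry 30 --retry-delay 5 --retry-connrefused --max-time 10 \
		http://localhost:8080/auth/realms/sunbird-rc/.well-known/openid-configuration \
		> /dev/null
```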
The clickhouse health check had no interval/timeout/retries/start_period, so it used Docker defaults (retries=3, interval=30s). If ClickHouse takes more than 90s to be ready on CI (3 × 30s), it becomes permanently unhealthy, blocking metrics from starting since we added clickhouse:service_healthy as a metrics dependency. Also fix the multiline YAML for the test command (was split across two lines as a plain scalar) — replace with the unambiguous list form. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…JwtException Remove the 10s/15s SimpleClientHttpRequestFactory timeouts from CustomJwtDecoderProviderConfigurationUtils: when Keycloak is slow to respond, the failed OIDC fetch surfaces as an IllegalArgumentException out of TenantJwtDecoder.decode(), which Spring Security's BearerTokenAuthenticationFilter does not catch (it only handles OAuth2AuthenticationException), causing Tomcat threads to hang and Karate tests to time out after 90s. The Makefile already waits for the OIDC endpoint before running tests, making the timeout unnecessary. Also add try/catch in TenantJwtDecoder.decode() to wrap any init exception as BadJwtException so Spring Security always handles it cleanly with a 401 response. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…estart policy The metrics container was crashing during docker-compose up, causing make test to fail at Makefile:43. Root cause: confluent-kafka-go's NewConsumer() calls panic(err) on failure; since StartConsumer ran in a goroutine without recovery, any panic propagated to the runtime and killed the process. Fixes:
- Wrap the kafka.StartConsumer goroutine with defer/recover so a Kafka consumer failure logs an error instead of crashing the HTTP server
- Add a healthcheck (/health, 15s interval) to metrics in docker-compose so Docker can detect when it is truly ready, and restart:on-failure so Docker Engine recovers if it does exit
- Expand CI failure diagnostics to include metrics, clickhouse, and kafka logs and print the exit status of all containers for easier post-mortem
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
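The compose side of the fix might look like this; the probe command depends on what the metrics image ships, so wget here is an assumption:

```yaml
metrics:
  restart: on-failure        # Docker Engine restarts the container if it exits
  healthcheck:
    test: ["CMD-SHELL", "wget -qO- http://localhost:8070/health || exit 1"]
    interval: 15s
    timeout: 5s
    retries: 10
```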
…/kafka-go (pure Go)
Root cause: confluent-kafka-go v1.9.2 bundles a pre-compiled librdkafka_musl_linux.a
that is x86-64 only (ELF EM:62). On ARM64 the Docker build fails with
"file in wrong format". In CI (x86-64) the build succeeds but the CGO binary
has runtime instability that causes the container to become unhealthy.
Fix: replace confluent-kafka-go with segmentio/kafka-go v0.4.47, a pure Go
Kafka client. Consequences:
- CGO_ENABLED=0: no gcc/musl-dev needed in Dockerfile; binary is statically
linked and works on both ARM64 (Mac M-series) and x86-64 (CI/prod)
- kafka-go's FetchMessage retries on broker unavailability instead of panicking,
so a transient Kafka hiccup no longer crashes the server
- Manual offset commit preserved via FetchMessage+CommitMessages pattern
Verified locally (ARM64 Docker): image builds, container starts, /health returns
{"status":"UP"}, /v1/metrics returns {}.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ead-safe With ASYNC_ENABLED=true, concurrent threads (Kafka consumers, @Async methods) raced in getInstance() before the HashMap was populated, causing multiple SqlgGraph.open() calls that attempted to CREATE the same PostgreSQL sequence (V_graph_global_config_ID_seq), resulting in a duplicate key constraint error that crashed the registry container.
- DBProviderFactory: HashMap -> ConcurrentHashMap, getInstance() -> synchronized, putIfAbsent -> put (redundant under synchronized)
- docker-compose-v1.yml: add restart:on-failure and start_period:60s to registry
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… port 8123 BusyBox wget in the clickhouse alpine image exits 0 even on connection refused, making the health check always pass (false positive). Additionally, port 8123 (HTTP interface) is not exposed by this image -- only port 9000 (native protocol). Replace with: clickhouse-client --query "SELECT 1" 2>/dev/null | grep -q "^1" This correctly exits 1 when the server is unavailable and 0 when ready. Also halve the interval (30s->15s) and double retries (10->20) so Docker has more attempts within the same total window before marking the container unhealthy. Verified locally: container reaches (healthy) status within 35 seconds. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
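In compose list form, the resulting health check looks roughly like this (the timeout and start_period values are assumptions):

```yaml
clickhouse:
  healthcheck:
    # list form avoids the ambiguous multiline plain scalar;
    # clickhouse-client exits non-zero when the server is unavailable,
    # unlike BusyBox wget against the unexposed HTTP port
    test: ["CMD-SHELL", "clickhouse-client --query 'SELECT 1' 2>/dev/null | grep -q '^1'"]
    interval: 15s
    timeout: 5s
    retries: 20
```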
This PR addresses security vulnerabilities across the Sunbird RC application stack through dependency upgrades (Spring Boot 2.7, a supported JRE 11 base image, pinned Postgres/Kafka/Zookeeper versions), environment configuration improvements, and Docker container hardening, together with the CI stability fixes needed to keep the end-to-end test suites green.