Security: Fix CVEs and Vulnerability Mitigation Across Dependencies #362

Draft
sntiwari1 wants to merge 28 commits into main from cve_fix
@sntiwari1
Collaborator

This PR addresses security vulnerabilities and implements CVE fixes across the Sunbird RC application stack. The changes include dependency updates, environment configuration improvements, and Docker container security enhancements.

sntiwari1 and others added 28 commits May 6, 2026 13:26
Spring Boot 2.2→2.7 upgrade removed javax.validation from the default
web starter; id-gen-service needs spring-boot-starter-validation added
explicitly. Spring Boot 2.4+ also dropped JUnit 4 vintage from the test
starter; migrated both services' test files from @RunWith/@Ignore to the
JUnit 5 equivalents (@ExtendWith/@Disabled).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
yarn install fails on node:alpine images because some dependencies
require git for resolution. Adding git via apk before yarn install.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Some deps (digitalcredentials/jsonld.js) use ssh://git@github.com/ URLs.
Alpine CI runners have no SSH keys, so rewrite to HTTPS at build time.
Also add openssh-client so git can handle any remaining SSH probes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
postgres:latest now resolves to 17.x which has stricter initialization
timing. Pin to postgres:14 (a stable major release) and add start_period
so health check failures during container startup don't count as retries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. wait_for_port.sh used [[ ]] (bash) but Makefile calls it via 'sh'
   (dash on Ubuntu). The counter never triggered exit, looping forever
   for 6h until GH Actions timeout. Fixed: use POSIX [ ] syntax.

2. Keycloak healthcheck tested management port 9990 (up in ~10s) but
   the HTTP server on 8080 takes ~29s. Registry started before 8080
   was ready, failed to fetch OIDC config, crashed. Fixed: test 8080.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Port 8080 (JBoss HTTP) takes 3-5+ min on CI; health check timed out
at 5.5 min before Keycloak was ready, blocking registry from starting.

Revert to port 9990 (management port, up in <10s) which was working
before. The Makefile already calls wait_for_port.sh 8080 after
docker-compose up completes, handling actual HTTP readiness.

Also remove start_period from db: postgres:14 initializes fast and
start_period was causing docker-compose v1 to start keycloak before
postgres finished its first health check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
wait_for_port.sh: ((i=i+1)) is bash-only and does not update the
variable in the parent shell when run via sh/dash, causing an infinite
loop. Replace with POSIX-compatible i=$((i+1)).

TenantJwtDecoder / CustomJwtDecoders: SecurityConfig.addManager() was
calling CustomJwtDecoders.fromOidcIssuerLocation() at Spring Boot
startup, which eagerly fetched the OIDC discovery document from
http://keycloak:8080/auth/realms/sunbird-rc/.well-known/openid-configuration.
Keycloak's WildFly management port (9990) becomes healthy ~16s after
container start, but the HTTP port (8080) is only ready ~10s later.
Registry starts immediately after keycloak:service_healthy (9990),
so the OIDC fetch raced and failed, preventing Spring Boot from
starting. Fix by deferring the HTTP call to the first actual JWT
decode using double-checked locking in TenantJwtDecoder. The issuer
key is set to oidcIssuerLocation directly (OIDC spec requires the
discovery doc issuer to match the location URL).
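
The deferred initialization described above can be sketched in plain Java.
This is a minimal illustration of double-checked locking, not the PR's
TenantJwtDecoder: LazyDelegate, LazyDelegateDemo and the Supplier-based
factory are hypothetical names, and the supplier stands in for the OIDC
discovery call.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical stand-in for a lazily initialized decoder (not the PR's class). */
final class LazyDelegate<T> {
    private final Supplier<T> factory; // the expensive call, e.g. OIDC discovery
    private volatile T delegate;       // volatile gives safe publication across threads

    LazyDelegate(Supplier<T> factory) {
        this.factory = factory;
    }

    T get() {
        T local = delegate;            // first check, without taking the lock
        if (local == null) {
            synchronized (this) {
                local = delegate;      // second check, under the lock
                if (local == null) {
                    // If the factory throws, delegate stays null and the
                    // next caller tries again.
                    local = factory.get();
                    delegate = local;
                }
            }
        }
        return local;
    }
}

final class LazyDelegateDemo {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        LazyDelegate<String> lazy = new LazyDelegate<>(() -> {
            calls.incrementAndGet();
            return "decoder";
        });
        System.out.println(calls.get()); // 0: nothing is fetched at construction
        lazy.get();
        lazy.get();
        System.out.println(calls.get()); // 1: the factory ran exactly once
    }
}
```

Constructing the object does no I/O, so application startup no longer
depends on Keycloak's HTTP port being ready; the cost moves to the first
request that needs a decoder.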

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Increase wait_for_port.sh timeout from 60 to 90 retries (600s→900s)
to give services more headroom in slow CI environments. Add --max-time 5
to curl to prevent hangs on half-open connections, and suppress the
noisy progress bar with -s.

Add a 'Dump container logs on failure' workflow step that runs on
failure to show registry/keycloak/db logs — previously there was no
way to diagnose why a service failed to start.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Maven pom has maven.compiler.target=11, producing class file version
55 (Java 11). The old base image frolvlad/alpine-java:jdk8-slim runs
Java 8, which only supports class file versions up to 52. This caused
an UnsupportedClassVersionError on every registry container start,
making port 8081 permanently inaccessible in CI.

Switch to eclipse-temurin:11-jre-alpine which is a well-maintained
JRE 11 Alpine image and matches the compiler target.
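
The version arithmetic behind this failure can be checked with a few lines
of Java. This is only an illustrative sketch; ClassFileVersions and its
methods are hypothetical helpers, relying on the fact that from Java 5
onward the class file major version is the release number plus 44.

```java
/** Hypothetical helper illustrating class file version compatibility. */
final class ClassFileVersions {
    private static final int OFFSET = 44; // major version = Java release + 44 (Java 5+)

    /** Class file major version produced when targeting the given release. */
    static int majorFor(int javaRelease) {
        return javaRelease + OFFSET;
    }

    /** A runtime can load classes whose major version does not exceed its own. */
    static boolean canRun(int runtimeRelease, int classMajor) {
        return classMajor <= majorFor(runtimeRelease);
    }

    public static void main(String[] args) {
        int compiled = majorFor(11);              // maven.compiler.target=11
        System.out.println(compiled);             // 55
        System.out.println(canRun(8, compiled));  // false: UnsupportedClassVersionError
        System.out.println(canRun(11, compiled)); // true
    }
}
```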

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without KAFKA_LISTENERS, Confluent Kafka derives bind addresses from
KAFKA_ADVERTISED_LISTENERS. The INTERNAL listener was advertised as
kafka:9092 (the container hostname IP), so Kafka only bound to that
IP — not 127.0.0.1. The health check command runs inside the container
and targets localhost:9092, which was never listening, so the check
always failed and the metrics service (which depends on kafka being
healthy) could never start.

Add KAFKA_LISTENERS with 0.0.0.0 bindings so both listeners are
reachable on all interfaces inside the container. Also add a 30s
start_period so Kafka has time to fully initialize before health
check retries start counting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
confluentinc/cp-kafka:latest is now version 8.2.0, which removed
Zookeeper support entirely. The existing KAFKA_ZOOKEEPER_CONNECT
config is not recognized, so Kafka never starts, its health check
always fails, and the metrics service (which depends on
kafka:service_healthy) can never launch.

Pin both cp-kafka and cp-zookeeper to 7.6.1, the last stable 7.x
release that supports Zookeeper-based configuration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ests

The second E2E suite tests http://localhost:8070/v1/metrics but the
Makefile only waited for keycloak (8080) and registry (8081). Metrics
starts after both kafka and registry are healthy, so it may not be
ready by the time Karate reaches that test. Add wait_for_port.sh 8070
to ensure the metrics service is accepting connections first.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four related fixes for the second (async) test suite:

1. CustomJwtDecoderProviderConfigurationUtils: the lazy OIDC decoder's
   RestTemplate had no timeout. Under CI load with 16 services, the
   OIDC discovery call to keycloak:8080 could take >30s, blocking all
   5 parallel Karate threads until Karate's own 30s timeout fired.
   Add connectTimeout=10s and readTimeout=15s via
   SimpleClientHttpRequestFactory so the call fails fast and retries.

2. docker-compose-v1.yml: metrics depends_on now includes
   clickhouse:service_healthy. The metrics service calls log.Fatal
   on any ClickHouse query error, killing the process. Without the
   dependency, metrics started before ClickHouse was ready, received
   its first Kafka metric, tried to insert, got a connection error,
   and crashed — making port 8070 disappear after wait_for_port passed.

3. karate-config.js: raise Karate's readTimeout from the default 30s
   to 90s and connectTimeout to 10s. With 16+ containers running on a
   shared CI runner, operations that take 1-5s in the first test suite
   (6 containers) can take 30-60s in the second (16 containers). The
   old 30s Karate timeout was too tight for this environment.

4. Makefile: add explicit OIDC endpoint readiness check
   (curl --retry on .well-known/openid-configuration) before both test
   suites. This ensures the lazy decoder's very first call finds
   Keycloak fully initialized, not just TCP-reachable on port 8080.
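
As a rough sketch of the timeout idea in item 1, the snippet below sets a
10s connect timeout and a 15s request timeout on an OIDC discovery call.
Note that it uses the JDK's java.net.http client rather than the
RestTemplate and SimpleClientHttpRequestFactory named above, so it is an
analogue, not the change made in this commit. OidcDiscoveryTimeouts and its
methods are hypothetical names.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Hypothetical sketch: bounded OIDC discovery call using the JDK HTTP client. */
final class OidcDiscoveryTimeouts {
    static final Duration CONNECT_TIMEOUT = Duration.ofSeconds(10);
    // The JDK client has no separate read timeout; the per-request timeout
    // is the closest analogue and is used here in its place.
    static final Duration REQUEST_TIMEOUT = Duration.ofSeconds(15);

    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(CONNECT_TIMEOUT)
                .build();
    }

    static HttpRequest discoveryRequest(String issuer) {
        return HttpRequest.newBuilder(URI.create(issuer + "/.well-known/openid-configuration"))
                .timeout(REQUEST_TIMEOUT)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = discoveryRequest("http://keycloak:8080/auth/realms/sunbird-rc");
        // Only prints the target; no network call is made in this demo.
        System.out.println(request.uri());
    }
}
```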

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The clickhouse health check had no interval/timeout/retries/start_period,
so it used Docker defaults (retries=3, interval=30s). If ClickHouse
takes more than 90s to be ready on CI (3 × 30s), it becomes permanently
unhealthy, blocking metrics from starting since we added
clickhouse:service_healthy as a metrics dependency.

Also fix the multiline YAML for the test command (was split across
two lines as a plain scalar) — replace with the unambiguous list form.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…JwtException

Removing the 10s/15s SimpleClientHttpRequestFactory timeout from CustomJwtDecoderProviderConfigurationUtils
because when Keycloak is slow to respond, IllegalArgumentException propagates out of TenantJwtDecoder.decode()
uncaught by Spring Security's BearerTokenAuthenticationFilter (which only catches OAuth2AuthenticationException),
causing Tomcat threads to hang and Karate tests to time out after 90s.

The Makefile already waits for the OIDC endpoint before running tests, making the timeout unnecessary.
Added try/catch in TenantJwtDecoder.decode() to wrap any init exception as BadJwtException so Spring
Security always handles it cleanly with a 401 response.
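
The wrapping pattern can be sketched without Spring on the classpath. In
this illustration DecoderUnavailableException is a hypothetical stand-in
for BadJwtException, and WrappingDecoder is a hypothetical class, not the
PR's TenantJwtDecoder; the point is only that any runtime failure during
initialization is rethrown as the one exception type the caller handles.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical stand-in for Spring Security's BadJwtException. */
class DecoderUnavailableException extends RuntimeException {
    DecoderUnavailableException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Hypothetical decoder that converts any init failure into one known type. */
final class WrappingDecoder {
    private final Supplier<Function<String, String>> init; // e.g. lazy OIDC setup

    WrappingDecoder(Supplier<Function<String, String>> init) {
        this.init = init;
    }

    String decode(String token) {
        Function<String, String> delegate;
        try {
            delegate = init.get();
        } catch (RuntimeException e) {
            // Rethrow as the type the caller's filter knows how to handle.
            throw new DecoderUnavailableException(
                    "Unable to initialise JWT decoder: " + e.getMessage(), e);
        }
        return delegate.apply(token);
    }

    public static void main(String[] args) {
        WrappingDecoder decoder = new WrappingDecoder(() -> {
            throw new IllegalArgumentException("issuer unreachable");
        });
        try {
            decoder.decode("token");
        } catch (DecoderUnavailableException e) {
            System.out.println(e.getCause().getClass().getSimpleName()); // IllegalArgumentException
        }
    }
}
```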

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…estart policy

The metrics container was crashing during docker-compose up, causing make test
to fail at Makefile:43. Root cause: confluent-kafka-go's NewConsumer() calls
panic(err) on failure; since StartConsumer ran in a goroutine without recovery,
any panic propagated to the runtime and killed the process.

Fixes:
- Wrap kafka.StartConsumer goroutine with defer/recover so a Kafka consumer
  failure logs an error instead of crashing the HTTP server
- Add healthcheck (/health, 15s interval) to metrics in docker-compose so
  Docker can detect when it is truly ready, and restart:on-failure so Docker
  Engine recovers if it does exit
- Expand CI failure diagnostics to include metrics, clickhouse, kafka logs
  and print exit status of all containers for easier post-mortem

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…/kafka-go (pure Go)

Root cause: confluent-kafka-go v1.9.2 bundles a pre-compiled librdkafka_musl_linux.a
that is x86-64 only (ELF EM:62). On ARM64 the Docker build fails with
"file in wrong format". In CI (x86-64) the build succeeds but the CGO binary
has runtime instability that causes the container to become unhealthy.

Fix: replace confluent-kafka-go with segmentio/kafka-go v0.4.47, a pure Go
Kafka client. Consequences:
- CGO_ENABLED=0: no gcc/musl-dev needed in Dockerfile; binary is statically
  linked and works on both ARM64 (Mac M-series) and x86-64 (CI/prod)
- kafka-go's FetchMessage retries on broker unavailability instead of panicking,
  so a transient Kafka hiccup no longer crashes the server
- Manual offset commit preserved via FetchMessage+CommitMessages pattern

Verified locally (ARM64 Docker): image builds, container starts, /health returns
{"status":"UP"}, /v1/metrics returns {}.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ead-safe

With ASYNC_ENABLED=true, concurrent threads (Kafka consumers, @Async methods)
raced in getInstance() before the HashMap was populated, causing multiple
SqlgGraph.open() calls that attempted to CREATE the same PostgreSQL sequence
(V_graph_global_config_ID_seq), resulting in a duplicate key constraint error
that crashed the registry container.

- DBProviderFactory: HashMap -> ConcurrentHashMap, getInstance() -> synchronized,
  putIfAbsent -> put (redundant under synchronized)
- docker-compose-v1.yml: add restart:on-failure and start_period:60s to registry
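
A minimal sketch of the race fix, assuming a factory that caches one
provider per connection URL. ProviderFactorySketch, its open() method and
the opens counter are hypothetical; open() stands in for SqlgGraph.open()
and the counter exists only to show that concurrent callers trigger a
single open.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

/** Hypothetical factory; not the PR's DBProviderFactory. */
final class ProviderFactorySketch {
    private final Map<String, Object> providers = new ConcurrentHashMap<>();
    final AtomicInteger opens = new AtomicInteger(); // counts expensive opens

    // synchronized makes check-then-create atomic, so only one thread
    // can ever run open() for a given URL.
    synchronized Object getInstance(String connectionUrl) {
        Object provider = providers.get(connectionUrl);
        if (provider == null) {
            provider = open(connectionUrl);
            providers.put(connectionUrl, provider);
        }
        return provider;
    }

    private Object open(String connectionUrl) {
        opens.incrementAndGet(); // stand-in for the real, non-idempotent open
        return new Object();
    }

    public static void main(String[] args) {
        ProviderFactorySketch factory = new ProviderFactorySketch();
        IntStream.range(0, 64).parallel().forEach(i -> factory.getInstance("db"));
        System.out.println(factory.opens.get()); // 1
    }
}
```

Without the synchronized keyword, two threads could both see a missing
entry and both call open(), which is the duplicate-sequence failure
described above.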

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… port 8123

BusyBox wget in the clickhouse alpine image exits 0 even on connection refused,
making the health check always pass (false positive). Additionally, port 8123
(HTTP interface) is not exposed by this image -- only port 9000 (native protocol).

Replace with: clickhouse-client --query "SELECT 1" 2>/dev/null | grep -q "^1"
This correctly exits 1 when the server is unavailable and 0 when ready.
Also halve the interval (30s->15s) and double retries (10->20) so Docker has
more attempts within the same total window before marking the container unhealthy.

Verified locally: container reaches (healthy) status within 35 seconds.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>