Skemium is a Java CLI tool that generates and compares Debezium CDC (Change Data Capture) Avro schemas from database tables. It detects whether database schema changes will break Debezium CDC production by comparing current vs. next versions of Avro schemas using Confluent Schema Registry compatibility checks.
- Group/Artifact: `com.github.kafkesc:skemium`
- Current version: Defined in `pom.xml` (`<version>` tag)
- License: Apache 2.0
- Main class: `io.snyk.skemium.SkemiumMain`
| Command | Purpose |
|---|---|
| `generate` | Connects to a PostgreSQL database, extracts table schemas, converts to Avro, saves to directory |
| `compare` | Compares two directories of generated Avro schemas for compatibility |
| `compare-files` | Compares two individual `.avsc` files for compatibility (supports external type resolution via `--include-schema`) |
- JDK 21+ (AdoptOpenJDK recommended for development; GraalVM for native builds)
- Maven 3.9+
- Docker (required for tests — Testcontainers spins up PostgreSQL)
- Optional: asdf (run `asdf install` to get exact versions from `.tool-versions`)
- Optional: Taskfile (installed via asdf, provides shortcut tasks)
| Tool | Version |
|---|---|
| Maven | 3.9.12 |
| Java | adoptopenjdk-21.0.9+10.0.LTS |
| Snyk CLI | 1.1302.0 |
| Taskfile | 3.46.4 |
GraalVM is commented out in .tool-versions — uncomment java oracle-graalvm-21.0.8 (and comment out the adoptopenjdk line) only for native binary builds.
Agent rule: Always prefer `Taskfile.yml` tasks over raw `mvn` commands. Use `task <name>` when a matching task exists. Only fall back to crafting a direct `mvn` (or other) command when no Taskfile task covers the need.
```bash
task package                    # clean + build + test (mvn clean package)
task package.uber-jar           # clean + uber-jar, skips tests
task package.native-executable  # clean + native binary, skips tests (requires GraalVM)
task clean                      # mvn clean
task tag-version -- X.Y.Z       # Set version in pom.xml + git commit + git tag
task snyk.test                  # Run Snyk security scans
```

```bash
mvn test        # Run tests only (no Taskfile task for test-only)
mvn -B package  # CI-style build (batch mode)
```

```
src/main/java/io/snyk/skemium/
├── SkemiumMain.java                 # Entry point, Picocli root command
├── BaseCommand.java                 # Abstract base for all commands (logging verbosity)
├── BaseComparisonCommand.java       # Abstract base for compare commands (compatibility, output, CI mode)
├── GenerateCommand.java             # `generate` subcommand
├── CompareCommand.java              # `compare` subcommand
├── CompareFilesCommand.java         # `compare-files` subcommand
├── CompareResult.java               # Result record for `compare`
├── CompareFilesResult.java          # Result record for `compare-files`
├── avro/
│   └── TableAvroSchemas.java        # Core data type: key/value/envelope Avro schemas for a table
├── cli/
│   └── ManifestReader.java          # Reads JAR MANIFEST.MF for version info (singleton pattern)
├── db/
│   ├── DatabaseKind.java            # Enum of supported DBs (currently only POSTGRES)
│   ├── TableSchemaFetcher.java      # Interface for fetching table schemas (extends AutoCloseable)
│   ├── CatalogSchemaAndTableTopicNamingStrategy.java  # Topic naming: <catalog>.<schema>.<table>
│   └── postgres/
│       ├── PostgresTableSchemaFetcher.java   # PostgreSQL implementation
│       └── PostgresSchemaRefreshable.java    # Package-local wrapper exposing hidden Debezium refresh()
├── helpers/
│   ├── Avro.java                    # Kafka Connect → Avro schema conversion + schema file generation
│   ├── Git.java                     # JGit helper for local repo info (commit, branch, tag)
│   ├── JSON.java                    # Jackson JSON serialization helpers (pretty, compact, from)
│   └── SchemaRegistry.java          # Compatibility checking + schema equality (JSON normalization)
└── meta/
    └── MetadataFile.java            # `.skemium.meta.json` metadata record
```
```
src/test/java/io/snyk/skemium/
├── WithPostgresContainer.java       # Base class for tests needing PostgreSQL (Testcontainers)
├── TestHelper.java                  # Test utilities (defines RESOURCES path constant)
├── GenerateCommandTest.java         # Integration tests for `generate`
├── CompareCommandTest.java          # Integration tests for `compare` + CI mode
├── CompareFilesCommandTest.java     # Tests for `compare-files` + CI mode + --include-schema
├── CompareFilesResultTest.java      # Unit tests for compare-files result logic + multi-schema
├── avro/TableAvroSchemasTest.java   # Tests for Avro schema handling
├── helpers/SchemaRegistryTest.java  # Tests for compatibility checking
├── meta/MetadataFileTest.java       # Tests for metadata serialization
└── db/postgres/                     # PostgreSQL-specific tests
```
```
src/test/resources/
├── db_schema/chinook.initdb.sql     # Test fixture: Chinook sample database
├── schema_employee/                 # Test fixture: employee table schemas
├── schema_employee_invalid_checksum/
├── schema_change-no_changes/        # Test fixture: identical current/next
├── schema_change-backward_compatible/
├── schema_change-non_backward_compatible/
├── schema_change-compatible_with_table_addition/
├── schema_change-key_added/
├── schema_change-key_removed/
└── compare-files/                   # Test fixture: individual .avsc files
    ├── valid-schemas/               # person-v1, v2-compatible, v2-incompatible, v1-reordered
    ├── invalid-schemas/             # malformed.json, empty.avsc, invalid-avro.avsc
    └── multi-schema/                # External type resolution fixtures (issue-type + event-with-issue-ref)
```
```
schemas/                                  # Avro schemas for Skemium's own output formats
├── skemium.generate.meta.avsc            # Schema for `.skemium.meta.json`
├── skemium.compare.result.avsc           # Schema for `compare` JSON output
└── skemium.compare-files.result.avsc     # Schema for `compare-files` JSON output
```
- Java 21 features used: `record` types, text blocks (`"""`), `switch` expressions, Java doc comments with `///` syntax
- Package structure: `io.snyk.skemium` with sub-packages `avro`, `cli`, `db`, `db.postgres`, `helpers`, `meta`
- Logging: SLF4J with Logback; every class gets `private static final Logger LOG = LoggerFactory.getLogger(ClassName.class);`
- Null annotations: `@Nonnull` and `@Nullable` from `javax.annotation`
- CLI framework: Picocli; commands are `Callable<Integer>`, options use `@Option`, parameters use `@Parameters`
- JSON serialization: Jackson with `@JsonProperty` annotations; `JSON.java` helper provides `pretty()` and `compact()` methods
- Data classes: Java `record` types are preferred (e.g., `CompareResult`, `CompareFilesResult`, `TableAvroSchemas`, `MetadataFile`)
- Static factory methods: Records use `static build(...)` methods rather than exposing constructors directly
- Enum naming: `SCREAMING_SNAKE_CASE` (e.g., `DatabaseKind.POSTGRES`)
- Constants: `private static final` with descriptive names
- Singletons: `ManifestReader` uses a `public static final SINGLETON` field pattern
All commands follow this structure:
- Extend `BaseCommand` (or `BaseComparisonCommand` for comparison commands)
- Implement `Callable<Integer>` (return 0 for success, 1 for failure)
- In `call()`: first call `setLogLevelFromVerbosity()`, then `validate()`, then `logInput()`
- Annotate with `@Command(name = "...", ...)` with consistent heading format strings
- CLI options support environment variable fallback via `defaultValue = "${env:VAR_NAME}"`
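
As a rough illustration of the structure above, the following plain-Java sketch shows the order of calls inside `call()` and the 0/1 return convention. It deliberately leaves out the Picocli annotations so that it compiles with the JDK alone. `SketchBaseCommand`, `SketchCommand`, the `input` field, and the `run()` hook are hypothetical names, not classes or methods from the Skemium source, and mapping every exception to exit code 1 is an assumption about error handling.

```java
import java.util.concurrent.Callable;

// Hypothetical stand-in for the shared base class; the real BaseCommand may differ.
abstract class SketchBaseCommand implements Callable<Integer> {
    // Number of times -v was given (Picocli would populate this in a real command).
    int verbosity = 0;

    void setLogLevelFromVerbosity() {
        // A real implementation would reconfigure the logging backend here.
    }

    // Throws IllegalArgumentException when the inputs are unusable.
    abstract void validate();

    // Logs the effective inputs before any work is done.
    abstract void logInput();

    // The command-specific work (hypothetical hook, named for this sketch only).
    abstract void run() throws Exception;

    @Override
    public Integer call() {
        setLogLevelFromVerbosity();
        try {
            validate();
            logInput();
            run();
            return 0; // success
        } catch (Exception e) {
            System.err.println("Command failed: " + e.getMessage());
            return 1; // application error (assumed mapping)
        }
    }
}

// Hypothetical command that only checks that its single input is non-blank.
final class SketchCommand extends SketchBaseCommand {
    final String input;

    SketchCommand(String input) {
        this.input = input;
    }

    @Override
    void validate() {
        if (input == null || input.isBlank()) {
            throw new IllegalArgumentException("input must not be blank");
        }
    }

    @Override
    void logInput() {
        System.out.println("input = " + input);
    }

    @Override
    void run() {
        System.out.println("processing " + input);
    }
}
```

Keeping `validate()` ahead of `logInput()` means a command never logs inputs it is about to reject, which is one plausible reason for the ordering described above.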
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Application error (incompatibility found, parsing failure, connection error) |
| 2 | Picocli parameter validation error (missing/invalid parameters) |
Controlled by the `-v` flag (repeatable). Default level is ERROR:

```
<none>   → ERROR
-v       → WARN
-vv      → INFO
-vvv     → DEBUG
-vvvv+   → TRACE
```
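
The mapping above can be sketched as a lookup that clamps at the most verbose level. This is a minimal illustration, not the project's implementation: `VerbositySketch` and `levelFor` are hypothetical names, and the levels are returned as plain strings instead of Logback `Level` objects so the example needs only the JDK.

```java
import java.util.List;

final class VerbositySketch {
    // Levels in the order listed above; the index is the number of -v occurrences.
    private static final List<String> LEVELS =
            List.of("ERROR", "WARN", "INFO", "DEBUG", "TRACE");

    // Maps a -v count to a level name, clamping anything past -vvvv to TRACE.
    static String levelFor(int vCount) {
        if (vCount < 0) {
            throw new IllegalArgumentException("vCount must not be negative");
        }
        return LEVELS.get(Math.min(vCount, LEVELS.size() - 1));
    }

    public static void main(String[] args) {
        for (int count = 0; count <= 5; count++) {
            System.out.println(count + " x -v -> " + levelFor(count));
        }
    }
}
```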
Generated schema files follow the pattern: DB_NAME.DB_SCHEMA.DB_TABLE.EXTENSION
- Key schema: `.key.avsc`
- Value schema: `.val.avsc`
- Envelope schema: `.env.avsc`
- Checksum: `.sha256`
- Metadata: `.skemium.meta.json`
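
To make the naming pattern concrete, the sketch below derives the per-table file names from a database name, schema, and table. `SchemaFileNamesSketch` and `fileNamesFor` are hypothetical helpers written for this example. The metadata file is left out because this section does not say how it is named relative to a table, and whether a `.key.avsc` file is written for a table without a primary key is not covered here either.

```java
import java.util.List;

final class SchemaFileNamesSketch {
    // Per-table extensions taken from the list above (metadata file intentionally omitted).
    private static final List<String> EXTENSIONS =
            List.of(".key.avsc", ".val.avsc", ".env.avsc", ".sha256");

    // Builds DB_NAME.DB_SCHEMA.DB_TABLE.EXTENSION for each per-table extension.
    static List<String> fileNamesFor(String dbName, String dbSchema, String dbTable) {
        String identifier = String.join(".", dbName, dbSchema, dbTable);
        return EXTENSIONS.stream().map(ext -> identifier + ext).toList();
    }

    public static void main(String[] args) {
        fileNamesFor("chinook", "public", "artist").forEach(System.out::println);
    }
}
```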
- JUnit 5 (Jupiter) with `junit-jupiter-api` and `junit-jupiter-params`
- Testcontainers for PostgreSQL integration tests (requires Docker running)
- Tests that need a database extend `WithPostgresContainer`, which manages a PostgreSQL 17.2 container with the Chinook sample database
- Database: `chinook` (Chinook sample DB, a music store schema)
- User: `chinook-db-user` / Password: `chinook-db-pass`
- Init script: `src/test/resources/db_schema/chinook.initdb.sql`
- PostgreSQL configured with `wal_level=logical` for Debezium CDC
Schema change test scenarios are in src/test/resources/schema_change-*/ with current/ and next/ subdirectories. Each contains pre-generated Avro schemas and .skemium.meta.json metadata files.
Two test methods regenerate Avro schema files from Java record annotations:
- `CompareCommandTest.refreshCompareResultFileSchema()` → writes `schemas/skemium.compare.result.avsc`
- `CompareFilesCommandTest.refreshSchemaComparisonResultFileSchema()` → writes `schemas/skemium.compare-files.result.avsc`
These call Avro.saveAvroSchemaForType() which generates schemas from @JsonProperty-annotated record classes. CI checks for uncommitted changes after tests via git diff --exit-code, so if schema files change, they must be committed.
- `TestHelper.RESOURCES`: `Path.of("src", "test", "resources")` constant for locating test fixtures
- `WithPostgresContainer.createPostgresContainerConfiguration()`: creates a Debezium `Configuration` for the test container
```bash
mvn test          # All tests
mvn test -pl .    # Single module (this is a single-module project)
```

There is no separate unit-vs-integration test split: all tests run together. Some tests are slow because they spin up Docker containers.
All dependency versions are centralized in `pom.xml` `<properties>` using the `ver.` prefix convention:

```xml
<ver.avro>1.12.1</ver.avro>
<ver.jackson>2.21.3</ver.jackson>
<ver.debezium>3.4.3.Final</ver.debezium>
```

Dependabot is configured (`.github/dependabot.yml`) for automated dependency updates. Configuration:
- Weekly schedule for both Maven and GitHub Actions
- Ignores major version bumps
- Groups all dependencies into a single PR per ecosystem
- Commit prefix: `chore` (conventional commits)
| Library | Purpose |
|---|---|
| `debezium-core` + `debezium-connector-postgres` | Database schema extraction |
| `kafka-connect-avro-converter` (Confluent) | Avro schema conversion and compatibility checking |
| `picocli` + `picocli-codegen` | CLI framework + annotation processing |
| `jackson-*` (core, databind, annotations, jsr310, avro) | JSON/Avro serialization |
| `jgit` | Git repository info for metadata |
| `commons-codec` | SHA256 checksums |
| `commons-compress` | Transitive/compression support |
| `commons-lang3` | String/object utilities |
| `guava` (transitive via Confluent) | `Sets.difference()` for table comparison |
| `testcontainers` | PostgreSQL test containers |
The project uses the Confluent Maven repository (https://packages.confluent.io/maven/) in addition to Maven Central. This is required for the kafka-connect-avro-converter dependency.
Runs on PRs to main and pushes to main:
- Gitleaks: Secret scanning (skips dependabot PRs)
- Snyk: Dependency vulnerability scanning with SARIF upload to GitHub Code Scanning (skips dependabot PRs)
- Build & Test: `mvn -B package`, then `git diff --exit-code` to ensure no uncommitted changes, then `dorny/test-reporter` for JUnit test results
Runs Snyk/ProdsecOrb security scans and secrets scanning. Uses cimg/openjdk:21.0.9 Docker image.
Triggered by pushing a vX.Y.Z tag to main. Two separate workflows:
release-uberjar.yaml: Builds and uploads skemium-VERSION-jar-with-dependencies.jar to GitHub Releases.
release-binaries.yaml: Builds native binaries on 4 platform combinations using GraalVM:
| Runner | OS | Arch |
|---|---|---|
| `ubuntu-24.04` | linux | x86_64 |
| `ubuntu-24.04-arm` | linux | aarch64 |
| `macos-15-intel` | macos | x86_64 |
| `macos-latest` | macos | aarch64 |
Native binary naming: skemium-VERSION-OS-ARCH (e.g., skemium-1.2.2-linux-x86_64). Uses -O3 -march=compatibility for broad CPU compatibility.
```bash
task tag-version -- X.Y.Z                    # Updates pom.xml, commits, creates git tag
git push origin main --follow-tags --tags    # Push tag triggers release workflows
```

- Default branch: `main`
- Branch naming: Use conventional commit type prefixes: `feat/`, `fix/`, `chore/`, `docs/`, `refactor/`, `build/`, `ci/`, `test/`, `perf/`, `style/`, `revert/`
- Commit messages: Follow Conventional Commits
- Pre-commit hooks: Gitleaks for secret detection (`.pre-commit-config.yaml`)
- Code owners: `@snyk/data-backend` and `@snyk/productinfra_data-backend`
- Rebasing: Changes to `main` should be tracked by rebasing (per `CONTRIBUTING.md`)
- Docker required for tests: Most tests use Testcontainers and will fail without Docker running.
- Schema files are auto-generated during tests: Some test methods regenerate the `schemas/*.avsc` files. CI checks that these are committed. If you change a `record` type that has a corresponding schema file, you must run tests and commit the updated `.avsc` files.
- Jackson annotations version quirk: `jackson-annotations` uses a separate version property (`ver.jackson-annotations`) because its version string differs from other Jackson modules (e.g., `2.21` vs `2.21.1`). See the comment in `pom.xml`.
- `TOPIC_PREFIX` is required but unused: When creating Debezium configurations, `TOPIC_PREFIX` must be set even though Skemium doesn't actually produce to Kafka. See `GenerateCommand.java:217`.
- Only PostgreSQL is supported: The `DatabaseKind` enum and `TableSchemaFetcher` interface are designed for multiple database types, but only `POSTGRES` is implemented.
- GraalVM needed for native builds only: Regular development uses AdoptOpenJDK 21. Switch to GraalVM (uncomment in `.tool-versions`) only when building native binaries.
- Avro field ordering matters for equality: `SchemaRegistry.checkSchemaEquality()` normalizes JSON (sorts object keys and record fields by name) before comparing, because Avro's `Schema.equals()` is order-sensitive.
- Key schema can be null: Tables without a `PRIMARY KEY` have a `null` key schema. All code paths must handle this (see `TableAvroSchemas`, `SchemaRegistry.checkCompatibility()`).
- Checksum validation on load: `TableAvroSchemas.loadFrom()` validates SHA256 checksums when loading schemas from disk. Invalid checksums throw `IOException`. Missing checksum files log a warning but continue.
- Environment variable fallback: All CLI options support environment variable configuration via Picocli's `${env:VAR_NAME}` defaultValue syntax (e.g., `DB_HOSTNAME`, `DB_PORT`, `COMPATIBILITY`, `CI_MODE`).
- `PostgresSchemaRefreshable` is intentionally package-local: It extends `PostgresSchema` to expose the `refresh()` method which is hidden inside the Debezium Postgres Connector library. This is necessary to load table schemas. Don't make it public. Its constructor must mirror the upstream `PostgresSchema` constructor, which changed in Debezium 3.4 to take a `CdcSourceTaskContext<PostgresConnectorConfig>` (instead of a bare `PostgresConnectorConfig`) plus a `CustomConverterRegistry`. `PostgresTableSchemaFetcher` constructs both inline (`new CdcSourceTaskContext<>(rawConfig, connectorConfig, connectorConfig.getCustomMetricTags())` and `new CustomConverterRegistry(null)`; `null` yields an empty converter list). Future Debezium upgrades may change this signature again; check `io.debezium.connector.postgresql.PostgresSchema` and `PostgresConnectorTask#start()` in the Debezium sources for the canonical construction pattern.
- Topic naming strategy: `CatalogSchemaAndTableTopicNamingStrategy` formats topics as `<catalog>.<schema>.<table>` (e.g., `chinook.public.artist`). This naming is also used as the table identifier (`TableAvroSchemas.identifier()`).
- Tests mutate the database: `CompareCommandTest` runs `ALTER TABLE` statements against the Testcontainers PostgreSQL instance to test schema change detection. These changes persist within the container's lifecycle, so test ordering may matter if tests share the same container.
- Case-insensitive enum parsing: `SkemiumMain` configures Picocli with `.setCaseInsensitiveEnumValuesAllowed(true)`, so CLI enum options like `--compatibility` accept any case.
- `Avro.kafkaConnectSchemaToAvroSchema()` unwraps unions: Kafka Connect maps records to `UNION[NULL, RECORD]` by default. The helper extracts the non-null subtype. If the input is `null`, it returns `null` (for tables without primary keys).
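
The checksum behaviour described in the list above (mismatch throws `IOException`, missing checksum file only warns) can be sketched with the JDK alone. This is an illustration under stated assumptions, not the code in `TableAvroSchemas.loadFrom()`: `ChecksumSketch`, `sha256Hex`, and `verify` are hypothetical names, and the sketch assumes the `.sha256` file holds the hex digest of a single data file, which this document does not confirm.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

final class ChecksumSketch {
    // Returns the lowercase hex SHA-256 digest of the given bytes.
    static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory for every Java platform implementation.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    // Returns true when the checksum matched, false when no checksum file exists
    // (validation skipped with a warning), and throws IOException on a mismatch.
    // Assumption: the checksum file contains the hex digest of dataFile's bytes.
    static boolean verify(Path dataFile, Path checksumFile) throws IOException {
        if (!Files.exists(checksumFile)) {
            System.err.println("WARN: no checksum file at " + checksumFile + ", skipping validation");
            return false;
        }
        String expected = Files.readString(checksumFile, StandardCharsets.UTF_8).trim();
        String actual = sha256Hex(Files.readAllBytes(dataFile));
        if (!expected.equalsIgnoreCase(actual)) {
            throw new IOException("Checksum mismatch for " + dataFile
                    + ": expected " + expected + " but computed " + actual);
        }
        return true;
    }
}
```

Returning a boolean for the "skipped" case keeps the warning path distinguishable from a successful check without treating a missing checksum as an error.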