This is the canonical reference for working with the DataHub codebase. It applies to all coding agents (Claude Code, Cursor, Codex CLI, Devin, etc.) and human developers alike.
Build and test:
./gradlew build # Build entire project
./gradlew check # Run all tests and linting
./gradlew format # Format all code (Java, Markdown, GraphQL, YAML)
# Note that each directory typically has a build.gradle file, but the available tasks follow similar conventions.
# Java code.
./gradlew spotlessApply # Java code formatting
# Python code.
./gradlew :metadata-ingestion:testQuick # Fast Python unit tests
./gradlew :metadata-ingestion:lint # Python linting (ruff, mypy)
./gradlew :metadata-ingestion:lintFix # Python linting auto-fix (ruff only)
# Markdown, GraphQL, YAML formatting
./gradlew :datahub-web-react:mdPrettierWrite # Format markdown files
./gradlew :datahub-web-react:graphqlPrettierWrite # Format GraphQL schemas
./gradlew :datahub-web-react:githubActionsPrettierWrite # Format GitHub Actions

If you are using git worktrees, exclude the following task, as it can cause git-related failures when running any Gradle command:
./gradlew ... -x generateGitPropertiesGlobal
IMPORTANT: Verifying Python code changes:
- ALWAYS use `./gradlew :metadata-ingestion:lintFix` to verify Python code changes
- NEVER use `python3 -m py_compile`; it doesn't catch style issues or type errors
- NEVER use `ruff` or `mypy` commands directly; use the Gradle task instead
- lintFix runs ruff formatting and fixing automatically, ensuring code quality
- For smoke-test changes, the lintFix command will also check those files
CRITICAL: Always use Gradle tasks for formatting and linting. Never use npm/yarn/npx commands directly.
Format everything:
./gradlew format # Format all code (Java, Markdown, GraphQL, YAML)
./gradlew formatChanged # Format only changed files (faster)

Format specific file types:
# Markdown files
./gradlew :datahub-web-react:mdPrettierWrite # Format all markdown
./gradlew :datahub-web-react:mdPrettierCheck # Check markdown formatting
# GraphQL schemas
./gradlew :datahub-web-react:graphqlPrettierWrite # Format GraphQL files
./gradlew :datahub-web-react:graphqlPrettierCheck # Check GraphQL formatting
# GitHub Actions YAML
./gradlew :datahub-web-react:githubActionsPrettierWrite # Format workflow files
./gradlew :datahub-web-react:githubActionsPrettierCheck # Check workflow files
# Java code
./gradlew spotlessApply # Format Java code
# Python code
./gradlew :metadata-ingestion:lintFix # Format and fix Python code
./gradlew :metadata-ingestion:lint # Check Python formatting

If you see CI failures like:
- `markdown_format / markdown_format_check (pull_request)` - use `./gradlew :datahub-web-react:mdPrettierWrite`
- `graphql_prettier_check` - use `./gradlew :datahub-web-react:graphqlPrettierWrite`
- `spotlessJavaCheck` - use `./gradlew spotlessApply`
- Python linting failures - use `./gradlew :metadata-ingestion:lintFix`
Never do this:
npx prettier --write "docs/**/*.md" # WRONG - bypasses Gradle
yarn prettier --write # WRONG - bypasses Gradle
npm run format # WRONG - bypasses Gradle

Always do this:
./gradlew :datahub-web-react:mdPrettierWrite # CORRECT - uses Gradle
./gradlew format # CORRECT - formats everything

Why Gradle tasks:
- Consistent configuration: Gradle tasks use the project's Prettier config
- Pre-commit hook integration: Gradle tasks match what CI runs
- Dependency management: Ensures correct tool versions
- Cross-platform: Works reliably across all environments
Java SDK v2 integration tests:
See metadata-integration/java/datahub-client/CLAUDE.md for detailed integration test documentation.
DataHub is a schema-first, event-driven metadata platform with three core layers:
- GMS (Generalized Metadata Service): Java/Spring backend handling metadata storage and REST/GraphQL APIs
- Frontend: React/TypeScript application consuming GraphQL APIs
- Ingestion Framework: Python CLI and connectors for extracting metadata from data sources
- Event Streaming: Kafka-based real-time metadata change propagation
- `metadata-models/`: Avro/PDL schemas defining the metadata model
- `metadata-service/`: Backend services, APIs, and business logic
- `datahub-web-react/`: Frontend React application
- `metadata-ingestion/`: Python ingestion framework and CLI
- `datahub-graphql-core/`: GraphQL schema and resolvers
Most of the non-frontend modules are written in Java. The modules written in Python are:
- `metadata-ingestion/`
- `datahub-actions/`
- `metadata-ingestion-modules/airflow-plugin/`
- `metadata-ingestion-modules/gx-plugin/`
- `metadata-ingestion-modules/dagster-plugin/`
- `metadata-ingestion-modules/prefect-plugin/`
Each Python module has a Gradle setup similar to `metadata-ingestion/` (documented above).
- Entities: Core objects (Dataset, Dashboard, Chart, CorpUser, etc.)
- Aspects: Metadata facets (Ownership, Schema, Documentation, etc.)
- URNs: Unique identifiers (`urn:li:dataset:(urn:li:dataPlatform:mysql,db.table,PROD)`)
- MCE/MCL: Metadata Change Events/Logs for updates
- Entity Registry: YAML config defining entity-aspect relationships (`metadata-models/src/main/resources/entity-registry.yml`)
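A dataset URN is a plain string with the fixed structure shown above. As an illustrative sketch only (the ingestion SDK provides its own URN builder helpers; this hand-rolled function is not the real implementation):

```python
def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Illustrative sketch of the dataset URN structure.

    The real ingestion framework has its own builder utilities;
    this just shows how the pieces compose.
    """
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


print(make_dataset_urn("mysql", "db.table"))
# urn:li:dataset:(urn:li:dataPlatform:mysql,db.table,PROD)
```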
IMPORTANT: Validation must work across all APIs (GraphQL, OpenAPI, RestLI).
- Never add validation in API-specific layers (GraphQL resolvers, REST controllers) - this only protects one API
- Always implement AspectPayloadValidators in `metadata-io/src/main/java/com/linkedin/metadata/aspect/validation/`
- Register as Spring beans in `SpringStandardPluginConfiguration.java`
- Follow existing patterns: see `SystemPolicyValidator.java` and `PolicyFieldTypeValidator.java` as examples
- Schema changes in `metadata-models/` trigger code generation across all languages
- Backend changes in `metadata-service/` and other Java modules expose new REST/GraphQL APIs
- Frontend changes in `datahub-web-react/` consume GraphQL APIs
- Ingestion changes in `metadata-ingestion/` emit metadata to backend APIs
- This is production code - maintain high quality
- Follow existing patterns within each module
- Generate appropriate unit tests
- Use type annotations everywhere (Python/TypeScript)
- Java: Use Spotless formatting, Spring Boot patterns, TestNG/JUnit Jupiter for tests
- Python: Use ruff for linting/formatting, pytest for testing, pydantic for configs
- Type Safety: Everything must have type annotations; avoid `Any`, use specific types (`Dict[str, int]`, `TypedDict`)
- Data Structures: Prefer dataclasses/pydantic for internal data; return dataclasses over tuples
- Code Quality: Avoid global state, use named arguments, don't re-export in `__init__.py`, refactor repetitive code
- Error Handling: Robust error handling with layers of protection for known failure points
- TypeScript: Use Prettier formatting, strict types (no `any`), React Testing Library
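The Python conventions above (specific types, dataclasses over tuples, named arguments) can be sketched as follows; all names here are hypothetical, for illustration only:

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TableStats:
    """Internal data carrier: a dataclass, not a bare tuple."""

    row_count: int
    column_types: Dict[str, str]  # specific types, never Any


def summarize(rows: int, types: Dict[str, str]) -> TableStats:
    # Returning a dataclass lets call sites read named fields
    # instead of positional tuple indices.
    return TableStats(row_count=rows, column_types=types)


# Named arguments at the call site, per the guidelines above.
stats = summarize(rows=42, types={"id": "int", "name": "string"})
print(stats.row_count)  # 42
```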
Always use semantic color tokens from `datahub-web-react/src/conf/theme/colorThemes/types.ts`. Never use hardcoded hex values, `REDESIGN_COLORS`, `ANTD_GRAY`, or direct alchemy `colors.gray[X]` imports.
In styled-components (no import needed — theme is available via props):
background: ${(props) => props.theme.colors.bg};
color: ${(props) => props.theme.colors.text};
border: 1px solid ${(props) => props.theme.colors.border};

In React component bodies:
import { useTheme } from 'styled-components';
const theme = useTheme();
<Icon color={theme.colors.icon} />

For alchemy components (`<Text>`, `<Icon>`, etc.), do not pass color/colorLevel props. Let them inherit from themed parent styled-components.
Do not import from:
- `datahub-web-react/src/alchemy-components/theme/foundations/colors.ts` (raw palette, only used internally by the theme)
- `REDESIGN_COLORS` or `ANTD_GRAY` from `entityV2/shared/constants.ts`
Only add comments that provide real value beyond what the code already expresses.
Do NOT add comments for:
- Obvious operations (`# Get user by ID`, `// Create connection`)
- What the code does when it's self-evident (`# Loop through items`, `// Set variable to true`)
- Restating parameter names or return types already in signatures
- Basic language constructs (`# Import modules`, `// End of function`)
DO add comments for:
- Why something is done, especially non-obvious business logic or workarounds
- Context about external constraints, API quirks, or domain knowledge
- Warnings about gotchas, performance implications, or side effects
- References to tickets, RFCs, or external documentation that explain decisions
- Complex algorithms or mathematical formulas that aren't immediately clear
- Temporary solutions with TODOs and context for future improvements
Examples:
# Good: Explains WHY and provides context
# Use a 30-second timeout because Snowflake's query API can hang indefinitely
# on large result sets. See issue #12345.
connection_timeout = 30
# Bad: Restates what's obvious from code
# Set connection timeout to 30 seconds
connection_timeout = 30

- Python: Tests go in the `tests/` directory alongside `src/`; use `assert` statements
- Java: Tests alongside source in `src/test/`
- Frontend: Tests in `__tests__/` or `.test.tsx` files
- Smoke tests go in the `smoke-test/` directory
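For Python, a minimal pytest-style test file using plain `assert` statements might look like this (the module and function names are hypothetical, for illustration; in a real test the function would be imported from `src/`):

```python
# tests/test_string_utils.py  (hypothetical layout)


def strip_prefix(value: str, prefix: str) -> str:
    # Stand-in for a function that would normally live under src/.
    return value[len(prefix):] if value.startswith(prefix) else value


def test_strip_prefix_removes_match():
    assert strip_prefix("urn:li:dataset", "urn:li:") == "dataset"


def test_strip_prefix_leaves_non_match():
    assert strip_prefix("dataset", "urn:li:") == "dataset"
```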
IMPORTANT: Quality over quantity. Avoid AI-generated test anti-patterns that create maintenance burden without providing real value.
Focus on behavior, not implementation:
- Test what the code does (business logic, edge cases that occur in production)
- Don't test how it does it (implementation details, private fields via reflection)
- Don't test third-party libraries work correctly (Spring, Micrometer, Kafka clients, etc.)
- Don't test Java/Python language features (`synchronized` methods are thread-safe, `@Nonnull` parameters reject nulls)
Avoid these specific anti-patterns:
- Testing null inputs on `@Nonnull`/`@NonNull` annotated parameters
- Verifying exact error message wording (creates brittleness during refactoring)
- Testing every possible input variation (case sensitivity x whitespace x special chars = maintenance nightmare)
- Using reflection to verify private implementation details
- Redundant concurrency testing on `synchronized` methods
- Testing obvious getter/setter behavior without business logic
- Testing Lombok-generated code (`@Data`, `@Builder`, `@Value` classes) - you're testing Lombok's code generator, not your logic
- Testing that annotations exist on classes - if required annotations are missing, the framework/compiler will fail at startup, not in your tests
Appropriate test scope:
- Simple utilities (enums, string parsing, formatters): ~50-100 lines of focused tests
- Happy path for each method
- One example of invalid input per method
- Edge cases likely to occur in production
- Complex business logic: Test proportional to risk and complexity
- Integration points and system boundaries
- Security-critical operations
- Error handling for realistic failure scenarios
- Warning sign: If tests are 5x+ the size of implementation, reconsider scope
Examples of low-value tests to avoid:
// BAD: Testing @Nonnull contract (framework's job)
@Test
public void testNullParameterThrowsException() {
assertThrows(NullPointerException.class,
() -> service.process(null)); // parameter is @Nonnull
}
// BAD: Testing Lombok-generated code
@Test
public void testBuilderSetsAllFields() {
MyConfig config = MyConfig.builder()
.field1("value1")
.field2("value2")
.build();
assertEquals(config.getField1(), "value1");
assertEquals(config.getField2(), "value2");
}
// BAD: Testing that annotations exist
@Test
public void testConfigurationAnnotations() {
assertNotNull(MyConfig.class.getAnnotation(Configuration.class));
assertNotNull(MyConfig.class.getAnnotation(ComponentScan.class));
}
// If @Configuration is missing, Spring won't load the context - you don't need a test for this
// BAD: Exact error message (brittle)
assertEquals(exception.getMessage(),
"Unsupported database type 'oracle'. Only PostgreSQL and MySQL variants are supported.");
// BAD: Redundant variations
assertEquals(DatabaseType.fromString("postgresql"), DatabaseType.POSTGRES);
assertEquals(DatabaseType.fromString("PostgreSQL"), DatabaseType.POSTGRES);
assertEquals(DatabaseType.fromString("POSTGRESQL"), DatabaseType.POSTGRES);
assertEquals(DatabaseType.fromString(" postgresql "), DatabaseType.POSTGRES);
// ... 10 more case/whitespace variations
// GOOD: Focused behavioral test
@Test
public void testFromString_ValidInputsCaseInsensitive() {
assertEquals(DatabaseType.fromString("postgresql"), DatabaseType.POSTGRES);
assertEquals(DatabaseType.fromString("POSTGRESQL"), DatabaseType.POSTGRES);
assertEquals(DatabaseType.fromString(" postgresql "), DatabaseType.POSTGRES);
}
@Test
public void testFromString_InvalidInputThrows() {
assertThrows(IllegalArgumentException.class,
() -> DatabaseType.fromString("oracle"));
}
// GOOD: Testing YOUR custom validation logic on a Lombok class
@Test
public void testCustomValidation() {
assertThrows(IllegalArgumentException.class,
() -> MyConfig.builder().field1("invalid").build().validate());
}When in doubt: Ask "Does this test protect against a realistic regression?" If not, skip it.
Critical test: metadata-io/src/test/java/com/linkedin/metadata/system_info/collectors/PropertiesCollectorConfigurationTest.java
This test prevents sensitive data leaks by requiring explicit classification of all configuration properties as either sensitive (redacted) or non-sensitive (visible in system info).
When adding new configuration properties: The test will fail with clear instructions on which classification list to add your property to. Refer to the test file's comprehensive documentation for template syntax and examples.
This is a mandatory security guardrail - never disable or skip this test.
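Conceptually, the sensitive/non-sensitive split works like a redaction filter over configuration properties. A simplified Python sketch of the idea (the actual guardrail is the Java test above, and the keyword list here is purely illustrative):

```python
# Illustrative keyword list; the real test uses explicit
# classification lists, not substring matching.
SENSITIVE_KEYS = {"password", "secret", "token", "key"}


def redact(properties: dict) -> dict:
    # Any property whose name mentions a sensitive keyword is replaced
    # with a placeholder before being exposed in system info.
    return {
        name: "***REDACTED***"
        if any(k in name.lower() for k in SENSITIVE_KEYS)
        else value
        for name, value in properties.items()
    }


print(redact({"db.password": "hunter2", "server.port": 8080}))
# {'db.password': '***REDACTED***', 'server.port': 8080}
```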
- Follow Conventional Commits format for commit messages
- Breaking Changes: Always update `docs/how/updating-datahub.md`. Write entries for non-technical audiences, reference the PR number, and focus on what users need to change rather than internal implementation details
When creating PRs, follow the template in .github/pull_request_template.md:
PR Title Format (from Contributing Guide):
<type>[optional scope]: <description>
Types: feat, fix, refactor, docs, test, perf, style, build, ci
Example: feat(parser): add ability to parse arrays
Checklist (verify before submitting):
- PR conforms to the Contributing Guideline (especially PR Title Format)
- Links to related issues (if applicable)
- Tests added/updated (if applicable)
- Docs added/updated (if applicable)
- Breaking changes documented in `docs/how/updating-datahub.md`
Use scripts/dev/datahub-dev.sh for ALL environment operations.
Do NOT use ./gradlew quickstartDebug directly — always use the wrapper script.
A stdlib-only Python CLI for agent-driven development. No venv needed — runs with system python3.
Always use the shell wrapper as the entry point:
scripts/dev/datahub-dev.sh <command>

Run `scripts/dev/datahub-dev.sh --help` to see all available subcommands (start, setup, frontend, status, wait, rebuild, test, flag list/get, env, sync-flags, reset, nuke).
- Setup (once): `scripts/dev/datahub-dev.sh setup` installs the Python dev environment (provides the `datahub` CLI). For frontend work, also run `scripts/dev/datahub-dev.sh setup frontend`.
1. Start: `scripts/dev/datahub-dev.sh start`
2. Code: Make changes to Java/Python/frontend code
3. Rebuild: `scripts/dev/datahub-dev.sh rebuild --wait`
4. Test: `scripts/dev/datahub-dev.sh test <test-path>`
5. Iterate: Repeat steps 2–4
Frontend hot-reload: Run scripts/dev/datahub-dev.sh frontend to start the React dev server with hot-reload (instead of rebuilding the frontend container).
Worktree note: All Gradle commands inside the tool already pass -x generateGitPropertiesGlobal
to avoid git-related failures in worktrees.
| Source directory | Container |
|---|---|
| `metadata-service/` | `datahub-gms` |
| `datahub-graphql-core/` | `datahub-gms` |
| `metadata-io/` | `datahub-gms` |
| `datahub-frontend/` | `datahub-frontend-react` |
| `metadata-jobs/mce-consumer-job/` | `datahub-mce-consumer` |
| `metadata-jobs/mae-consumer-job/` | `datahub-mae-consumer` |
| `metadata-models/` | All (triggers full rebuild + code generation) |
Set any env var for DataHub containers via env set + env restart:
scripts/dev/datahub-dev.sh env set KEY=VALUE
scripts/dev/datahub-dev.sh env restart # required — changes take effect on restart
scripts/dev/datahub-dev.sh env list # show current vars and pending_restart status

Do NOT manually edit .env files, use `docker compose -e`, or `export` variables yourself; always use the wrapper.
All flag changes require a container restart. Use env set + env restart:
scripts/dev/datahub-dev.sh env set SHOW_BROWSE_V2=true
scripts/dev/datahub-dev.sh env restart

`flag list` and `flag get` are read-only inspection tools; they show the current live values from the running server but do not change anything.
The flag manifest at scripts/generated/flag-classification.json is auto-generated
(gitignored). Run scripts/dev/datahub-dev.sh sync-flags after adding fields to FeatureFlags.java
or after a fresh clone.
When to use each:
- `reset`: GMS returns 503 and doesn't recover, frontend shows "Unable to connect", tests fail with connection errors
- `nuke --keep-data`: Containers in restart loops, port conflicts, `reset` didn't fix it
- `nuke`: ES index corruption, MySQL schema issues after model changes, PDL model changes needing a clean slate, `nuke --keep-data` didn't fix it
Set AGENT_MODE=1 to get machine-readable JSON test reports at smoke-test/build/test-report.json:
AGENT_MODE=1 scripts/dev/datahub-dev.sh test tests/test_system_info.py

These commands work against any DataHub instance (local dev, staging, or production). Provide connection details via environment variables:
export DATAHUB_GMS_URL=http://localhost:8080 # or your instance URL
export DATAHUB_GMS_TOKEN=<your-token> # omit if auth is not required

`datahub init` writes `~/.datahubenv` with the GMS URL and an access token. Run it once before using any other CLI commands that require authentication.
# Quickstart: local instance with default credentials
datahub init --username datahub --password datahub
# Full agent best-practices guide (defaults, env vars, all scenarios)
datahub init --agent-context

`datahub graphql` executes queries and mutations against the DataHub GraphQL API and can introspect the live schema to discover available operations.
# Discover what's available
datahub graphql --list-operations --format json
# Inspect a specific operation's arguments
datahub graphql --describe dataset --format json
# Preview a query before executing
datahub graphql --query "{ me { corpUser { urn } } }" --dry-run
# Execute a query
datahub graphql --query "{ me { corpUser { urn username } } }" --format json

For full agent best practices (discovery, dry-run, error codes, common recipes):

datahub graphql --agent-context

Essential reading:
- `docs/architecture/architecture.md` - System architecture overview
- `docs/modeling/metadata-model.md` - How metadata is modeled
- `docs/what-is-datahub/datahub-concepts.md` - Core concepts (URNs, entities, etc.)
External docs:
- https://docs.datahub.com/docs/developers - Official developer guide
- https://demo.datahub.com/ - Live demo environment
Gradle tasks manage all venvs automatically. Never create, activate, or pip-install into them manually. When running smoke tests outside Gradle: smoke-test/venv/bin/python -m pytest ...
- Entity Registry is defined in YAML, not code (`entity-registry.yml`)
- All metadata changes flow through the event streaming system
- GraphQL schema is generated from backend GMS APIs