Thanks for considering contributing to build-kg! This project turns any topic into a structured knowledge graph in your own PostgreSQL database, and there are many ways to help -- from adding domain profiles to improving the pipeline itself.
| Contribution | Difficulty | Impact |
|---|---|---|
| Add a domain profile for your industry | Low | High -- unlocks a new domain with pre-built ontology |
| Add ID extraction patterns for a jurisdiction | Low | Medium -- improves provision ID quality |
| Fix a bug | Varies | High |
| Improve documentation | Low | Medium |
| Add a new jurisdiction | Low | Medium -- expands country support |
| Improve chunking or parsing | Medium-High | High -- better graph quality |
| Improve ontology generation | Medium | High -- better auto-generated graph structures |
To get set up for development:

```bash
# 1. Fork and clone the repository
git clone https://github.com/YOUR_USERNAME/build-kg.git
cd build-kg

# 2. Install all dependencies (venv, packages, browser)
make setup

# 3. Start the database
docker compose -f db/docker-compose.yml up -d

# 4. Configure environment
cp .env.example .env
# Edit .env with your settings

# 5. Initialize the graph
python -m build_kg.setup_graph

# 6. Run tests to verify
pytest tests/ -v
```

The repository is organized as follows:

```
build-kg/
├── src/build_kg/                     # Main package
│   ├── config.py                     # Configuration from .env
│   ├── crawl.py                      # Web crawler (Crawl4AI)
│   ├── chunk.py                      # Document chunker (Unstructured)
│   ├── load.py                       # Database loader
│   ├── parse.py                      # Sync parser (Anthropic or OpenAI)
│   ├── parse_batch.py                # Batch parser (Batch API)
│   ├── setup_graph.py                # AGE graph setup
│   ├── verify.py                     # Setup verification
│   ├── id_extractors.py              # Regex-based ID extraction
│   ├── domain.py                     # Domain profile system
│   └── domains/                      # YAML domain profiles
│       ├── default.yaml
│       ├── food-safety.yaml
│       ├── financial-aml.yaml
│       └── data-privacy.yaml
├── db/                               # Database setup
├── docs/                             # Documentation
├── examples/                         # Example manifests and data
├── tests/                            # Test suite
├── AGENTS.md                         # OpenAI Codex instructions
├── .claude/skills/                   # Claude Code / Kiro / Qoder / Antigravity skill
│   └── build-kg/
│       └── SKILL.md                  # /build-kg skill definition (Agent Skills standard)
├── .github/copilot-instructions.md   # GitHub Copilot instructions
├── .cursor/rules/build-kg.mdc        # Cursor rules
└── .windsurf/rules/build-kg.md       # Windsurf rules
```
We use Ruff for linting:

```bash
# Check for issues
ruff check src/ tests/

# Auto-fix what's possible
ruff check --fix src/ tests/
```

- Line length: 120 characters
- Rules: E (pycodestyle errors), F (pyflakes), I (isort), W (pycodestyle warnings)
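For reference, the settings above could be expressed in `pyproject.toml` roughly as follows. This is a sketch only -- the repository's actual configuration file and table layout may differ:

```toml
# Hypothetical pyproject.toml excerpt mirroring the lint rules described above.
[tool.ruff]
line-length = 120

[tool.ruff.lint]
select = ["E", "F", "I", "W"]
```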
```bash
# Run all tests
pytest tests/ -v

# Run a specific test file
pytest tests/test_id_extractors.py -v

# Run domain profile tests
pytest tests/test_domain.py -v
```

Tests are designed to run without a database connection. Integration tests that require a running database are skipped automatically if the DB is not available.
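The auto-skip behavior can be approximated with a small availability probe. This is a hypothetical sketch -- the function name, host, and port defaults are assumptions, not the suite's actual mechanism:

```python
import socket

def db_available(host: str = "localhost", port: int = 5432, timeout: float = 0.25) -> bool:
    """Return True if something is listening on the given host/port.

    Hypothetical probe; the real test suite may detect the database differently.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# A test module could then opt out when the DB is down, e.g.:
#   @pytest.mark.skipif(not db_available(), reason="database not available")
```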
- Fork the repository and create a feature branch from `main`
- Make your changes and ensure tests pass
- Run the linter: `ruff check src/ tests/`
- Write tests for new functionality
- Submit a PR with a clear description of what changed and why
This is one of the highest-impact contributions you can make. Each new profile unlocks build-kg for an entirely new domain with a pre-built ontology.
1. Create `src/build_kg/domains/your-domain.yaml` using `food-safety.yaml` as a template
2. Set `extends: default` to inherit base configuration
3. Define domain-specific configuration:
   - Ontology -- the graph structure:
     - `ontology.nodes` -- node types with labels, descriptions, and properties
     - `ontology.edges` -- edge types with source/target labels and descriptions
     - `ontology.root_node` -- primary node type that maps 1:1 to source fragments
     - `ontology.json_schema` -- expected LLM output JSON format
   - Parsing -- what the LLM extracts:
     - `parsing.requirement_types` -- e.g., `[consent, data_processing, breach_notification]` for privacy
     - `parsing.target_signal_examples` -- e.g., `[data.retention_period, consent.mechanism]`
     - `parsing.scope_examples` -- e.g., `[data_controller, data_processor, data_subject]`
   - ID Patterns -- regex for domain-specific IDs:
     - `id_patterns.patterns` -- e.g., GDPR Article patterns, FATF Recommendation patterns
     - `id_patterns.authority_priorities` -- which patterns to try first for each authority
   - Discovery -- how the `/build-kg` skill finds sources:
     - `discovery.search_templates` -- search queries for finding sources
     - `discovery.sub_domains` -- checklist of sub-areas to cover
4. Add tests to `tests/test_domain.py` to verify the profile loads correctly
5. Update the profile table in `README.md`
6. Test with `/build-kg <your topic>` using your new domain profile (set `DOMAIN=your-domain` in `.env`)
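Putting the steps together, a minimal profile might look roughly like this. This is an illustrative sketch only: the field names follow the descriptions above, but every value (and the exact YAML shape expected by `domain.py`) is an assumption to check against `food-safety.yaml`:

```yaml
# Hypothetical src/build_kg/domains/pharma.yaml -- all values are illustrative.
extends: default

ontology:
  nodes:
    - label: Provision
      description: A single regulatory provision
  edges:
    - type: CITES
      source: Provision
      target: Provision
      description: One provision cites another
  root_node: Provision

parsing:
  requirement_types: [clinical_trial, labeling, gmp]
  target_signal_examples: [trial.phase, label.warning_text]
  scope_examples: [manufacturer, sponsor, investigator]

id_patterns:
  patterns:
    cfr_section: '21 CFR \d+\.\d+'

discovery:
  search_templates:
    - "FDA {topic} regulations"
  sub_domains: [clinical_trials, manufacturing, labeling]
```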
Example profile ideas:
- `pharma` -- FDA drug regulations, clinical trial requirements, GMP
- `environmental` -- EPA regulations, emissions standards, waste management
- `telecom` -- FCC rules, spectrum licensing, net neutrality
- `construction` -- building codes, safety standards, permits
- `aviation` -- FAA regulations, airworthiness, pilot licensing
- `maritime` -- IMO conventions, port state control, SOLAS
The `jurisdiction` field is a freeform `TEXT` column in the database, so no schema changes are needed to support a new country. To add support for a new jurisdiction:

- Add authority-specific regex patterns to `src/build_kg/id_extractors.py` if the jurisdiction uses a unique ID format
- Update the jurisdiction list in `.claude/skills/build-kg/SKILL.md`
To add new regex patterns for regulatory ID formats:
- Edit `src/build_kg/id_extractors.py`
- Add patterns to `ProvisionIDExtractor.PATTERNS`: `'your_pattern_name': re.compile(r'your_regex_here'),`
- Add an authority mapping to `AUTHORITY_PATTERNS`: `'Authority Name': ['your_pattern_name', 'other_patterns'],`
- Add format rules to `ProvisionIDValidator.FORMAT_RULES` (optional)
- Add test cases to `tests/test_id_extractors.py`
When reporting a bug, please include:
- What happened: Describe the error or unexpected behavior
- What you expected: What should have happened instead
- How to reproduce: Step-by-step commands to reproduce the issue
- Environment: Python version, OS, Docker version
- Error output: Full traceback or error message
We follow the Contributor Covenant Code of Conduct. Be kind, be respectful, be constructive.