Add database table schemas documentation for LLM agents #71

@edmundmiller

Overview

To help LLM agents (like Claude Code) better understand and work with the nf-core stats data, we should document the database table schemas in a machine-readable format.

Current situation

  • LLM agents working with this project need to understand the data structure
  • Evidence.dev pages reference tables like github_traffic_stats, github_contributor_stats, etc.
  • SQL queries in sources/nfcore_db/ and queries/ directories reference various table schemas
  • No centralized schema documentation exists for automated tools

Proposed solution

Create comprehensive table schema documentation that includes:

1. Schema documentation file

  • Create docs/database-schemas.md or similar
  • Document all main tables with column descriptions, types, and sample data
  • Include relationships between tables
  • Add notes about data collection frequency and sources
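
As a rough sketch of how one section of such a file might be generated, the Python below renders a column list as a markdown table. The `Column` type, the `render_table` helper, and the example column names are hypothetical placeholders chosen for illustration; they are not the actual github_traffic_stats schema, which would need to be taken from the database or the pipeline code.

```python
from dataclasses import dataclass


@dataclass
class Column:
    """One documented column (hypothetical structure)."""
    name: str
    type: str
    description: str
    sample: str


def render_table(table: str, columns: list[Column]) -> str:
    """Render one table's columns as a markdown section."""
    lines = [
        f"### {table}",
        "",
        "| Column | Type | Description | Sample |",
        "| --- | --- | --- | --- |",
    ]
    for col in columns:
        lines.append(
            f"| {col.name} | {col.type} | {col.description} | {col.sample} |"
        )
    return "\n".join(lines)


# Placeholder columns for illustration only; the real schema may differ.
example = render_table(
    "github_traffic_stats",
    [
        Column("example_repo", "VARCHAR", "Hypothetical repository name", "nf-core/rnaseq"),
        Column("example_views", "BIGINT", "Hypothetical daily view count", "42"),
    ],
)
print(example)
```

Keeping the column list as data rather than hand-written markdown would make it easier to regenerate the file when a table changes.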

2. Key tables to document

From the existing SQL queries and pipeline code, prioritize:

  • github_traffic_stats (repository views/clones)
  • github_contributor_stats (contributor activity by week)
  • github_issue_stats (issues and pull requests)
  • nfcore_pipelines (repository metadata)
  • slack_messages (Slack channel activity)
  • slack_members (Slack membership stats)
  • org_members (GitHub organization members)

3. Machine-readable format considerations

  • Use consistent markdown tables
  • Include JSON schema definitions if helpful
  • Consider adding dlt schema exports
  • Make it easy for LLMs to parse and understand
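
If JSON Schema definitions turn out to be useful, one possible approach is to derive them from the same column descriptions used for the markdown tables. The sketch below assumes a simple hypothetical type mapping (`SQL_TO_JSON`) and made-up column names; the `to_json_schema` helper is illustrative and not part of this repository or of dlt.

```python
import json

# Hypothetical mapping from SQL column types to JSON Schema types.
SQL_TO_JSON = {
    "VARCHAR": "string",
    "BIGINT": "integer",
    "DOUBLE": "number",
    "BOOLEAN": "boolean",
    "TIMESTAMP": "string",
}


def to_json_schema(table: str, columns: list[tuple[str, str, str]]) -> dict:
    """Build a JSON Schema object from (name, sql_type, description) tuples."""
    properties = {}
    for name, sql_type, description in columns:
        prop = {
            # Unknown SQL types fall back to "string" in this sketch.
            "type": SQL_TO_JSON.get(sql_type.upper(), "string"),
            "description": description,
        }
        if sql_type.upper() == "TIMESTAMP":
            prop["format"] = "date-time"
        properties[name] = prop
    return {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "title": table,
        "type": "object",
        "properties": properties,
    }


# Placeholder columns for illustration only; the real schema may differ.
schema = to_json_schema(
    "slack_members",
    [
        ("example_timestamp", "TIMESTAMP", "Hypothetical collection time"),
        ("example_total", "BIGINT", "Hypothetical total member count"),
    ],
)
print(json.dumps(schema, indent=2))
```

The resulting JSON could sit next to the markdown tables so that agents can read whichever form they parse more reliably.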

Benefits

  • LLM agents can write better SQL queries
  • Faster development when creating new visualizations
  • Better understanding of available data for new features
  • Improved onboarding for developers
  • Self-documenting codebase

Acceptance criteria

  • Create schema documentation covering all major tables
  • Include column names, types, descriptions, and sample values
  • Document table relationships and foreign keys
  • Add data collection notes (frequency, source APIs)
  • Update CLAUDE.md to reference the schema documentation
  • Ensure documentation is easily discoverable and maintainable

This will significantly improve the ability of Claude Code and other LLM agents to understand and work with the nf-core stats data effectively.
