This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
This is the find-datalad-repos project - a Python package that discovers and tracks DataLad repositories across multiple hosting platforms (GitHub, OSF, GIN, hub.datalad.org, ATRIS). The tool searches for repositories that are either DataLad datasets or have used datalad run commands, generating comprehensive reports and maintaining an up-to-date registry.
Use the available development virtual environment:

    source venvs/dev-pycharm/bin/activate

Linting, type checking, and tests:

    # Run all tox environments (lint + typing)
    tox
    # Run only linting
    tox -e lint
    # Run only type checking
    tox -e typing
    # Run tests (currently commented out in tox.ini)
    tox -e py3

Running the search and diff reports:

    # Search for repositories across all hosts
    tox -e run
    # Search specific hosts (github,osf,gin,hub.datalad.org,atris)
    tox -e run -- --hosts github,gin
    # Generate diff reports
    tox -e diff

Entry points:

    # Main command
    find-datalad-repos
    # Diff command
    diff-datalad-repos

Core modules:

- `core.py`: Defines abstract base classes `Updater` and `Searcher` with generic types for different repository hosts
- `record.py`: Central `RepoRecord` model that manages collections of repositories from all hosts
- Host-specific modules: Each platform has its own updater implementation:
  - `github.py`: GitHub API integration with abuse detection handling
  - `gin.py`: GIN platform support
  - `osf.py`: Open Science Framework integration
- `tables.py`: Report generation and markdown table formatting
- `readmes.py`: README file generation for discovered repositories
- `util.py`: Shared utilities including Git operations and status management

Data flows through four stages:

- Search: Host-specific `Searcher` classes query APIs for DataLad repositories
- Collection: `Updater` classes process search results and update repository records
- Storage: All discoveries are stored in `datalad-repos.json`
- Reporting: Markdown tables and individual README files are generated in the `READMEs/` directory
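
The four stages above can be sketched end to end. This is a minimal illustration, not the project's code: the `Repo` fields, the function names, and the canned search results are all hypothetical, and the real schema of `datalad-repos.json` may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class Repo:
    # Hypothetical record shape; the real JSON schema may differ.
    name: str
    url: str
    dataset: bool  # True if the repository is a DataLad dataset
    run: bool      # True if it contains `datalad run` commits


def search() -> List[Repo]:
    # Stand-in for a host-specific Searcher; returns canned results.
    return [
        Repo("b/data", "https://example.org/b/data", True, False),
        Repo("a/analysis", "https://example.org/a/analysis", False, True),
    ]


def collect(records: Dict[str, Repo], found: List[Repo]) -> None:
    # Stand-in for an Updater: merge hits into the records, keyed by URL.
    for repo in found:
        records[repo.url] = repo


def store(records: Dict[str, Repo]) -> str:
    # Serialize the records sorted by name so the output is stable.
    ordered = sorted(records.values(), key=lambda r: r.name)
    return json.dumps([asdict(r) for r in ordered], indent=2)


def report(records: Dict[str, Repo]) -> str:
    # Render a markdown table with one row per repository.
    lines = ["| Repository | Dataset | Run |", "| --- | --- | --- |"]
    for r in sorted(records.values(), key=lambda r: r.name):
        dataset = "yes" if r.dataset else "no"
        run = "yes" if r.run else "no"
        lines.append(f"| [{r.name}]({r.url}) | {dataset} | {run} |")
    return "\n".join(lines)
```

Keying the records by URL makes repeated runs idempotent: a repository found twice overwrites its own entry instead of being duplicated.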

Key features:

- Multi-platform support: Searches GitHub, OSF, GIN, hub.datalad.org, and ATRIS
- DataLad detection: Identifies both DataLad datasets and repositories using `datalad run`
- Rate limiting: Built-in delays and abuse detection handling for GitHub API
- Automated reporting: Generates comprehensive markdown reports and individual repository summaries
- Registry integration: Outputs consumed by https://registry.datalad.org/
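
To illustrate the rate-limiting item, here is one common way to handle a host that asks clients to slow down: retry with exponentially growing delays. This is a generic sketch, not the logic in `github.py`; the `RateLimited` exception, the `call_with_backoff` helper, and its parameters are hypothetical names chosen for the example.

```python
import time
from typing import Callable, TypeVar

R = TypeVar("R")


class RateLimited(Exception):
    """Hypothetical signal that the host asked the client to slow down."""


def call_with_backoff(
    request: Callable[[], R],
    retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> R:
    # Try the request up to `retries` times, doubling the delay after
    # each rate-limited attempt (base_delay, 2*base_delay, 4*base_delay, ...).
    for attempt in range(retries):
        try:
            return request()
        except RateLimited:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * 2 ** attempt)
    raise ValueError("retries must be at least 1")
```

Passing `sleep` as a parameter keeps the helper testable: a test can record the requested delays instead of actually waiting.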

Configuration and outputs:

- Environment variables: `GIN_TOKEN`, `GITHUB_TOKEN`, `HUB_DATALAD_ORG_TOKEN`
- Organization grouping: Maintained in `github-orgs.json` with group classification ("ours" vs "public")
- Output files: `datalad-repos.json` (data), `README.md` (main report), `READMEs/` (individual reports)
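
As a rough sketch of how the token variables listed above might be read, the helper below collects them from the environment. The variable names come from this document; the `load_tokens` helper and the host-to-variable mapping are assumptions made for illustration, and the package may read its tokens differently.

```python
import os
from typing import Dict, Mapping, Optional

# Assumed mapping from host name to the environment variable holding its token.
TOKEN_VARS = {
    "gin": "GIN_TOKEN",
    "github": "GITHUB_TOKEN",
    "hub.datalad.org": "HUB_DATALAD_ORG_TOKEN",
}


def load_tokens(env: Mapping[str, str] = os.environ) -> Dict[str, Optional[str]]:
    # Return one entry per host; unset or empty variables become None
    # so callers can tell which hosts lack credentials.
    return {host: env.get(var) or None for host, var in TOKEN_VARS.items()}
```

For example, `load_tokens({"GITHUB_TOKEN": "abc"})` yields a token for `github` and `None` for the other two hosts.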

The codebase relies on generic types with TypeVars (`T`, `U`, `S`) so that each repository host implementation stays type-safe while sharing one abstract interface.
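
The pattern can be sketched as a pair of generic abstract base classes. The class names `Searcher` and `Updater` come from this document, but everything else here is an assumption: the roles given to `T`, `U`, and `S`, the method names, and the demo subclasses are hypothetical and do not reproduce `core.py`.

```python
from abc import ABC, abstractmethod
from typing import Generic, Iterator, List, TypeVar

T = TypeVar("T")                    # stored repository record (assumed role)
U = TypeVar("U")                    # raw search hit (assumed role)
S = TypeVar("S", bound="Searcher")  # concrete searcher type (assumed role)


class Searcher(ABC, Generic[U]):
    @abstractmethod
    def search(self) -> Iterator[U]:
        """Yield raw hits from one host."""


class Updater(ABC, Generic[T, U, S]):
    def __init__(self, searcher: S) -> None:
        self.searcher = searcher
        self.records: List[T] = []

    @abstractmethod
    def convert(self, hit: U) -> T:
        """Turn one raw hit into a stored record."""

    def update(self) -> List[T]:
        # Shared logic lives in the base class; hosts only supply
        # search() and convert().
        for hit in self.searcher.search():
            self.records.append(self.convert(hit))
        return self.records


class DemoSearcher(Searcher[str]):
    def search(self) -> Iterator[str]:
        yield from ["org/a", "org/b"]


class DemoUpdater(Updater[dict, str, DemoSearcher]):
    def convert(self, hit: str) -> dict:
        return {"name": hit}
```

With this layout a type checker can verify that `DemoUpdater.convert` accepts exactly what `DemoSearcher.search` yields, which is the kind of cross-host consistency the TypeVars are meant to enforce.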