This file provides guidance to AI coding agents (Claude Code, Codex, Cursor, Copilot, and others) when working in this repository. It is loaded into agent context automatically — keep it concise.
Flow-Data-Sources is a Python scraper that crawls Flow ecosystem documentation sites, GitHub repositories, and GitHub org discussions, converting them to Markdown for use in RAG pipelines, chatbots, and other AI knowledge bases. A GitHub Actions workflow runs the scraper daily at 00:00 UTC and commits the refreshed output back to the repo.
Setup (Python 3.7+ per README; CI uses 3.9 in .github/workflows/scrape-and-push.yml):
- `pip install requests beautifulsoup4 markdownify`: install the three runtime dependencies
Run:
- `python scraper.py`: crawl every URL in `SITES` (scraper.py:12) and write Markdown to `scraped_docs/`
- `python merge.py`: produce three files in `merged_docs/`: `all_merged.md`, `essentials_merged.md`, `cadence_docs_merged.md`
There is no Makefile, no test suite, and no linter configured.
- `scraper.py`: single-file crawler with three modes, dispatched by `crawl_site()` (scraper.py:392):
  - Normal sites (non-GitHub): domain-limited BFS, HTML stripped of `<script>`/`<style>`/`<noscript>`, converted via `markdownify`
  - GitHub repos: walks `/tree/(main|master)/...` and `/blob/(main|master)/...`, downloads files via `raw.githubusercontent.com`, wraps non-Markdown files in triple backticks
  - GitHub org discussions (`/orgs/<org>/discussions`): paginates the listing, scrapes each thread to `discussion_<id>.md`
- `merge.py`: walks `scraped_docs/` and produces three merged outputs; folder allowlists `CADENCE_DOCS` (merge.py:11) and `ESSENTIALS` (merge.py:70) control which dirs feed the filtered merges
- `scraped_docs/`: per-site output directories named by `sanitize_folder_name()` (scraper.py:95) (e.g., `github_com_onflow_flow_ft/`)
- `merged_docs/`: final merged artifacts consumed downstream
- `.github/workflows/scrape-and-push.yml`: daily cron (`0 0 * * *`) that installs deps, runs `scraper.py` then `merge.py`, and runs `git add -A && git push`
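The three-way dispatch described for `scraper.py` can be pictured as a small URL classifier. This is a minimal sketch, not the code in `crawl_site()`: the function name `classify_url`, the mode labels, and the exact matching rules are assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern for an org discussions listing: /orgs/<org>/discussions
_DISCUSSIONS_RE = re.compile(r"^/orgs/[^/]+/discussions/?$")


def classify_url(url: str) -> str:
    """Return 'discussions', 'github_repo', or 'normal' for a start URL."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host in ("github.com", "www.github.com"):
        # Check the more specific discussions path before the generic repo case.
        if _DISCUSSIONS_RE.match(parsed.path):
            return "discussions"
        return "github_repo"
    # Anything that is not on GitHub goes through the domain-limited BFS.
    return "normal"
```

Checking the discussions pattern first matters because both GitHub modes share the same host; only the path distinguishes them.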
- Edit the site list in `scraper.py` at the `SITES` constant (scraper.py:12). Watch the trailing entries: `"https://github.com/fixes-world/fixes"` at scraper.py:67 is missing a trailing comma, so Python implicitly concatenates it with the discussions URL on scraper.py:71. Fix this before adding new entries between them
- `scrape_github_file_blob()` (scraper.py:210) only saves files whose extension is in `ALLOWED_EXTENSIONS = {".cdc", ".md", ".sol"}` (scraper.py:81) or whose basename contains `readme`; the filter itself is at scraper.py:223
- BFS is bounded: `MAX_PAGES_PER_SITE = 200` per site (scraper.py:75) and `MAX_DISCUSSION_THREADS = 500` for the discussions crawl (scraper.py:85)
- Non-`.md` files are wrapped in triple backticks before being saved (scraper.py:234) so merged output keeps code blocks intact; do not double-wrap when adding new handlers
- `merge.py` emits a literal `------------ FILE_DIVIDER ------------` separator between files (merge.py:97); downstream consumers rely on this marker
- `merge.py` `CADENCE_DOCS`/`ESSENTIALS` lists use sanitized folder names (lowercase, non-alphanumerics collapsed to `_`, per `sanitize_folder_name()` at scraper.py:95). When adding a site to `SITES`, run the scraper once and copy the exact folder name from `scraped_docs/` before adding it to these lists
- The daily workflow runs `git add -A` on `ubuntu-latest`: any untracked file in the repo root will be committed by CI
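The missing-comma pitfall in `SITES` is easy to reproduce. The sketch below shows how Python fuses two adjacent string literals into one list entry, plus a small check that flags such entries. The list `SITES_BUGGY`, the helper `find_fused_urls`, and the `example-org` discussions URL are hypothetical stand-ins, not names from `scraper.py`.

```python
# Hypothetical stand-in for the tail of SITES; the second URL is a placeholder.
SITES_BUGGY = [
    "https://github.com/fixes-world/fixes"  # <- no trailing comma here
    "https://github.com/orgs/example-org/discussions",
]


def find_fused_urls(sites):
    """Return entries that contain more than one '://' scheme marker."""
    return [s for s in sites if s.count("://") > 1]


# Adjacent string literals are concatenated at compile time,
# so the list holds one fused string instead of two URLs.
fused = find_fused_urls(SITES_BUGGY)
```

Running a check like this against the real list before committing a `SITES` change would surface a dropped comma without needing a full crawl.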
- `scraped_docs/`: regenerated every run of `scraper.py`
- `merged_docs/`: regenerated every run of `merge.py`
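Because the files in `merged_docs/` are rebuilt on every run, a downstream consumer would typically re-split them on the `FILE_DIVIDER` marker each time. The following is a minimal sketch of that consumer side; the helper name `split_merged` is hypothetical, and it assumes only that the divider string appears verbatim between files, not any particular surrounding whitespace.

```python
from typing import List

# Literal separator that merge.py writes between files.
FILE_DIVIDER = "------------ FILE_DIVIDER ------------"


def split_merged(text: str) -> List[str]:
    """Split merged output on the divider and drop empty chunks."""
    chunks = (chunk.strip() for chunk in text.split(FILE_DIVIDER))
    return [chunk for chunk in chunks if chunk]
```

A caller could pass in the contents of `merged_docs/all_merged.md` read as UTF-8 text and feed each returned chunk to an indexer separately.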