Note: generated with assistance from Claude AI
Prompt: Draft a github issue to scrape the web for NF research tools (animal models, cell lines, antibodies, genetic reagents, biobanks) and compile information into tables on Synapse such as these: https://www.synapse.org/Synapse:syn26338068/tables/
Problem
Researchers waste time hunting across multiple websites, catalogs, and databases to find NF (Neurofibromatosis) research tools—animal models, cell lines, antibodies, genetic reagents, and biobanks. Information is scattered, incomplete, and quickly becomes outdated. We need automated scraping and curation to compile these resources into searchable, structured tables on Synapse (similar to https://www.synapse.org/Synapse:syn26338068/tables/).
Goal
Build a web scraping pipeline that:
- Discovers and extracts NF-relevant research tools from public repositories, vendor catalogs, and resource centers.
- Structures the data into normalized tables (one per resource type).
- Publishes to Synapse with automatic updates and provenance tracking.
- Enables discovery via filters, search, and direct links to original sources.
User stories
- Researcher: "I need an NF1-deficient mouse model for schwannoma studies" → searches the Animal Models table, filters by NF1 + schwannoma, finds MGI/MMRRC entries with ordering links.
- Core facility manager: "Which anti-neurofibromin antibodies are validated?" → opens Antibodies table, sorts by validation status and citations.
- Data curator: New resources appear on JAX, ATCC, or Antibodypedia → scraper auto-detects, extracts metadata, appends to Synapse tables, logs provenance.
- Consortium coordinator: Exports the full Biobanks table as CSV for a grant application or resource-sharing agreement.
Scope (MVP)
1. Resource types & target sources
| Resource type | Target sources (examples) |
|---|---|
| Animal models | MGI, JAX, MMRRC, EMMA, NF-specific model repositories |
| Cell lines | ATCC, Cellosaurus, CVCL, NF Foundation cell repositories, published supplementary tables |
| Antibodies | Antibodypedia, CiteAb, vendor catalogs (Abcam, CST, etc.), validation databases |
| Genetic reagents | Addgene, DNASU, Horizon/PerkinElmer (shRNA/CRISPR), published plasmid repositories |
| Biobanks | CTF Biobank, NF registries, dbGaP/EGA metadata, institutional repositories |
2. Extraction & normalization
For each resource, capture:
Common fields (all types)
- Resource ID: unique identifier (MGI ID, ATCC number, Addgene ID, etc.)
- Name/Description: official name, aliases
- NF relevance: NF1, NF2, Schwannomatosis, or general
- Gene/Mutation: gene symbol, specific mutation/variant (e.g., NF1 exon 31 deletion)
- Source URL: direct link to catalog entry
- Availability: commercial, academic, restricted, discontinued
- Last verified: date scraped/updated
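As a sketch, the common fields above could map onto a normalized record like this (field and value names are illustrative, not a fixed schema):

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class NFResource:
    """Common fields shared by all resource types (names are illustrative)."""
    resource_id: str                  # unique ID: MGI ID, ATCC number, Addgene ID, etc.
    name: str                         # official name
    aliases: list[str] = field(default_factory=list)
    nf_relevance: str = "general"     # "NF1", "NF2", "Schwannomatosis", or "general"
    gene: Optional[str] = None        # gene symbol
    mutation: Optional[str] = None    # e.g. "NF1 exon 31 deletion"
    source_url: str = ""              # direct link to catalog entry
    availability: str = "unknown"     # commercial / academic / restricted / discontinued
    last_verified: Optional[date] = None  # date scraped/updated

# Hypothetical entry, with placeholder ID and URL:
model = NFResource(
    resource_id="MGI:0000001",
    name="Nf1 conditional knockout (hypothetical)",
    gene="NF1",
    nf_relevance="NF1",
    source_url="https://example.org/models/1",
)
```

Type-specific fields (below) would extend this base record per table.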
Type-specific fields
Animal models
- Species, strain background
- Genetic modification (knockout, knock-in, conditional, reporter)
- Phenotype summary
- Repository & stock number
- RRID (Research Resource Identifier)
Cell lines
- Species, tissue/organ
- Disease status (tumor-derived, patient-derived, engineered)
- Culture conditions (adherent/suspension, media)
- STR profile/authentication status
- RRID
Antibodies
- Target protein/epitope
- Host species, clonality (monoclonal/polyclonal)
- Applications (WB, IHC, IF, IP, etc.)
- Validation evidence (citations, images, KO controls)
- Vendor(s) & catalog numbers
- RRID
Genetic reagents
- Reagent type (plasmid, shRNA, CRISPR guide, cDNA)
- Vector backbone, selectable markers
- Sequence availability (GenBank, full sequence file)
- Addgene/DNASU ID, RRID
Biobanks
- Institution/network
- Sample types (tissue, DNA, blood, iPSCs, etc.)
- Cohort size, NF subtype distribution
- Access procedure (open, application-based, consortium-only)
- Associated datasets/accessions (if any)
3. Scraping architecture
Pipeline components:
- Crawler: Headless browser (Playwright/Puppeteer) or HTTP client (requests/httpx) for static pages.
- Parsers: Per-source extractors (BeautifulSoup, lxml, or vendor APIs where available).
- NF filter: Keyword/entity recognition to flag NF-relevant entries (gene symbols: NF1, NF2, SMARCB1, LZTR1; disease terms: neurofibromatosis, schwannoma, MPNST, etc.).
- Deduplication: Match by RRID, exact name, or sequence similarity (for genetic reagents).
- Change detection: Compare to previous snapshots; flag new/updated/discontinued resources.
- Synapse uploader: Append rows to existing tables or create new versions; log provenance (source URL, scrape timestamp).
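The NF filter component could start as a simple keyword/entity matcher over scraped text, using the gene symbols and disease terms listed above (a minimal sketch; a production version would likely add ontology-backed entity recognition):

```python
import re

# Gene symbols and disease terms from the NF filter component above.
NF_GENES = {"NF1", "NF2", "SMARCB1", "LZTR1"}
NF_TERMS = {
    "neurofibromatosis", "schwannomatosis", "schwannoma",
    "neurofibroma", "mpnst", "malignant peripheral nerve sheath tumor",
}

def is_nf_relevant(text: str) -> bool:
    """Flag an entry as NF-relevant if it mentions an NF gene symbol
    (word-bounded, case-insensitive to catch mouse-style 'Nf1') or an
    NF disease term. Case-insensitive gene matching trades some false
    positives for recall; borderline hits go to manual review."""
    if any(re.search(rf"\b{g}\b", text, re.IGNORECASE) for g in NF_GENES):
        return True
    lowered = text.lower()
    return any(term in lowered for term in NF_TERMS)
```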
Scheduling:
- Initial bulk scrape for each source.
- Weekly/monthly incremental updates (configurable per source).
- Manual trigger via API or admin UI.
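Change detection between scheduled runs can be sketched as a diff of two snapshots keyed by a stable identifier (RRID where available), yielding the new/updated/discontinued flags described above:

```python
def diff_snapshot(
    previous: dict[str, dict],
    current: dict[str, dict],
) -> tuple[set[str], set[str], set[str]]:
    """Compare two scrape snapshots keyed by RRID (or another stable ID).
    Returns (new, updated, discontinued) key sets: keys only in the
    current snapshot are new, keys only in the previous one are
    discontinued, and shared keys whose field values differ are updated."""
    prev_keys, curr_keys = set(previous), set(current)
    new = curr_keys - prev_keys
    discontinued = prev_keys - curr_keys
    updated = {k for k in prev_keys & curr_keys if previous[k] != current[k]}
    return new, updated, discontinued

prev = {"RRID:AB_0000001": {"name": "a"}, "RRID:AB_0000002": {"name": "b"}}
curr = {"RRID:AB_0000001": {"name": "a, v2"}, "RRID:AB_0000003": {"name": "c"}}
new, updated, discontinued = diff_snapshot(prev, curr)
```

Keying by RRID also gives the deduplication component a natural merge key across sources.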
4. Synapse table schema
One table per resource type, e.g.:
- `syn_NF_AnimalModels`
- `syn_NF_CellLines`
- `syn_NF_Antibodies`
- `syn_NF_GeneticReagents`
- `syn_NF_Biobanks`
Columns match normalized fields above + provenance columns:
`sourceURL`, `scrapedDate`, `lastVerified`, `changeLog`
Access: Open (read) by default; curators have edit rights to flag errors or add manual entries.
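Before a row is appended, the uploader would stamp it with the provenance columns above. A minimal sketch of that step (the upload itself would go through the `synapseclient` Python package's Table/Schema API; the column names below follow the list in this section):

```python
from datetime import date

PROVENANCE_COLUMNS = ("sourceURL", "scrapedDate", "lastVerified", "changeLog")

def with_provenance(row: dict, source_url: str, change: str = "new") -> dict:
    """Return a copy of a normalized row with the provenance columns
    filled in, ready to append to the matching Synapse table."""
    today = date.today().isoformat()
    return {
        **row,
        "sourceURL": source_url,
        "scrapedDate": today,
        "lastVerified": today,
        "changeLog": change,
    }

row = with_provenance(
    {"resourceId": "RRID:AB_0000001", "name": "anti-neurofibromin (hypothetical)"},
    source_url="https://example.org/catalog/1",
)
```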
5. User interface (minimal MVP)
- Browse & filter: Synapse table view with column filters, sorting, keyword search.
- Export: CSV/JSON download.
- Provenance links: Each row links back to original catalog entry.
- Change log: Per-row history (new, updated fields, deprecated).
6. Provenance & quality
- Every row includes `sourceURL` and `scrapedDate`.
- Automated validation checks:
- Non-null required fields (ID, name, source).
- Valid URLs (HTTP 200 check).
- RRID format validation.
- Flagging system for curators to mark errors or request manual review.
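The RRID-format and URL checks could be sketched as follows (the RRID pattern is deliberately loose, covering common prefixes such as `AB_`, `CVCL_`, `IMSR_JAX:`, and `Addgene_`; tighten it per registry if stricter validation is needed):

```python
import re
from urllib.request import Request, urlopen

# Loose RRID pattern: "RRID:" followed by a registry prefix and an accession.
RRID_RE = re.compile(r"^RRID:[A-Za-z]+[_:][A-Za-z0-9_:.\-]+$")

def valid_rrid(rrid: str) -> bool:
    """RRID format validation check."""
    return bool(RRID_RE.match(rrid))

def url_ok(url: str, timeout: float = 10.0) -> bool:
    """HTTP 200 check for sourceURL fields, using a HEAD request."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False
```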
Acceptance criteria
- ✅ Scraper successfully extracts ≥50 NF-relevant entries per resource type from at least 2 target sources each.
- ✅ Data is normalized and loaded into 5 Synapse tables (one per resource type) with all required fields populated.
- ✅ Each row includes provenance (source URL, scrape date).
- ✅ Incremental update run detects ≥1 new or changed entry and appends/updates the table without duplicating existing rows.
- ✅ Tables are publicly browsable and support filtering by NF subtype, gene, and availability.
- ✅ Export to CSV works for all tables.
- ✅ Change log or version history is visible (Synapse native versioning or custom field).
- ✅ No broken source URLs in the final tables (automated link checker passes).
Nice-to-haves (post-MVP)
- API access: REST or GraphQL endpoint for programmatic queries.
- Ontology mapping: Link to standard ontologies (MGI, CLO, BTO, EFO) for interoperability.
- Citation extraction: Auto-link to PubMed IDs for validation papers or original publications.
- Community contributions: Allow researchers to submit new resources or corrections via a form → curator review queue.
- Integration with NF Data Portal: Cross-link tools to related datasets, publications, or clinical trials.
- Alerts: Notify users when new resources matching their saved filters appear.
- Validation scoring: Rank antibodies/cell lines by validation strength (e.g., # citations, knockout controls).
Risks & mitigations
| Risk | Mitigation |
|---|---|
| Website structure changes break scrapers | Modular parsers per source; automated smoke tests; fallback to manual review. |
| Access restrictions (paywalls, CAPTCHAs) | Respect robots.txt; use official APIs where possible; manual seed for restricted sites. |
| Data licensing/terms of use violations | Review each source's ToS; only scrape publicly accessible data; attribute sources. |
| False positives (non-NF resources) | Refine NF keyword filter; manual review sample before bulk upload. |
| Stale data | Automated monthly refresh; flag entries not re-verified in >6 months. |
| Duplicate entries across sources | RRID-based deduplication; manual curator merge tool. |
Open questions
- Which Synapse project/folder should host these tables? (Coordinate with existing NF portal structure.)
- Who are the designated curators for manual review and corrections?
- Do we need IRB/data-use agreements for biobank metadata scraping?
- Should we scrape vendor pricing, or link only?
- Preferred notification channel for new/updated resources (email digest, Slack bot, RSS)?
- Integration priority: Should this feed into the consensus evidence synthesis system (issue #XXX)?
Related issues / references
- Example Synapse tables: https://www.synapse.org/Synapse:syn26338068/tables/
- RRID initiative: https://scicrunch.org/resources