Note: generated with assistance from Claude AI
Prompt: Draft a github issue to scrape the web for NF research tools (animal models, cell lines, antibodies, genetic reagents, biobanks) and compile information into tables on Synapse such as these: https://www.synapse.org/Synapse:syn26338068/tables/
Problem
Researchers waste time hunting across multiple websites, catalogs, and databases to find NF (Neurofibromatosis) research tools—animal models, cell lines, antibodies, genetic reagents, and biobanks. Information is scattered, incomplete, and quickly becomes outdated. We need automated scraping and curation to compile these resources into searchable, structured tables on Synapse (similar to https://www.synapse.org/Synapse:syn26338068/tables/).
Goal
Build a web scraping pipeline that:
- Discovers and extracts NF-relevant research tools from public repositories, vendor catalogs, and resource centers.
- Structures the data into normalized tables (one per resource type).
- Publishes to Synapse with automatic updates and provenance tracking.
- Enables discovery via filters, search, and direct links to original sources.
User stories
- Researcher: "I need an NF1-deficient mouse model for schwannoma studies" → searches the Animal Models table, filters by NF1 + schwannoma, finds MGI/MMRRC entries with ordering links.
- Core facility manager: "Which anti-neurofibromin antibodies are validated?" → opens Antibodies table, sorts by validation status and citations.
- Data curator: New resources appear on JAX, ATCC, or Antibodypedia → scraper auto-detects, extracts metadata, appends to Synapse tables, logs provenance.
- Consortium coordinator: Exports the full Biobanks table as CSV for a grant application or resource-sharing agreement.
Scope (MVP)
1. Resource types & target sources
| Resource type | Target sources (examples) |
|---|---|
| Animal models | MGI, JAX, MMRRC, EMMA, NF-specific model repositories |
| Cell lines | ATCC, Cellosaurus, CVCL, NF Foundation cell repositories, published supplementary tables |
| Antibodies | Antibodypedia, CiteAb, vendor catalogs (Abcam, CST, etc.), validation databases |
| Genetic reagents | Addgene, DNASU, Horizon/PerkinElmer (shRNA/CRISPR), published plasmid repositories |
| Biobanks | CTF Biobank, NF registries, dbGaP/EGA metadata, institutional repositories |
2. Extraction & normalization
For each resource, capture:
Common fields (all types)
- Resource ID: unique identifier (MGI ID, ATCC number, Addgene ID, etc.)
- Name/Description: official name, aliases
- NF relevance: NF1, NF2, Schwannomatosis, or general
- Gene/Mutation: gene symbol, specific mutation/variant (e.g., NF1 exon 31 deletion)
- Source URL: direct link to catalog entry
- Availability: commercial, academic, restricted, discontinued
- Last verified: date scraped/updated
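As a sketch, the common fields above could map onto a normalized record like this (field and value names are illustrative, not a fixed schema):

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class NFResource:
    """Common fields shared by all resource types (names are illustrative)."""
    resource_id: str                  # unique ID: MGI ID, ATCC number, Addgene ID, etc.
    name: str                         # official name
    aliases: list[str] = field(default_factory=list)
    nf_relevance: str = "general"     # "NF1", "NF2", "Schwannomatosis", or "general"
    gene: Optional[str] = None        # gene symbol
    mutation: Optional[str] = None    # e.g. "NF1 exon 31 deletion"
    source_url: str = ""              # direct link to catalog entry
    availability: str = "unknown"     # commercial / academic / restricted / discontinued
    last_verified: Optional[date] = None  # date scraped/updated

# Hypothetical entry, with placeholder ID and URL:
model = NFResource(
    resource_id="MGI:0000001",
    name="Nf1 conditional knockout (hypothetical)",
    gene="NF1",
    nf_relevance="NF1",
    source_url="https://example.org/models/1",
)
```

Type-specific fields (below) would extend this base record per table.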
Type-specific fields
Animal models
- Species, strain background
- Genetic modification (knockout, knock-in, conditional, reporter)
- Phenotype summary
- Repository & stock number
- RRID (Research Resource Identifier)
Cell lines
- Species, tissue/organ
- Disease status (tumor-derived, patient-derived, engineered)
- Culture conditions (adherent/suspension, media)
- STR profile/authentication status
- RRID
Antibodies
- Target protein/epitope
- Host species, clonality (monoclonal/polyclonal)
- Applications (WB, IHC, IF, IP, etc.)
- Validation evidence (citations, images, KO controls)
- Vendor(s) & catalog numbers
- RRID
Genetic reagents
- Reagent type (plasmid, shRNA, CRISPR guide, cDNA)
- Vector backbone, selectable markers
- Sequence availability (GenBank, full sequence file)
- Addgene/DNASU ID, RRID
Biobanks
- Institution/network
- Sample types (tissue, DNA, blood, iPSCs, etc.)
- Cohort size, NF subtype distribution
- Access procedure (open, application-based, consortium-only)
- Associated datasets/accessions (if any)
3. Scraping architecture
Pipeline components:
- Crawler: Headless browser (Playwright/Puppeteer) or HTTP client (requests/httpx) for static pages.
- Parsers: Per-source extractors (BeautifulSoup, lxml, or vendor APIs where available).
- NF filter: Keyword/entity recognition to flag NF-relevant entries (gene symbols: NF1, NF2, SMARCB1, LZTR1; disease terms: neurofibromatosis, schwannoma, MPNST, etc.).
- Deduplication: Match by RRID, exact name, or sequence similarity (for genetic reagents).
- Change detection: Compare to previous snapshots; flag new/updated/discontinued resources.
- Synapse uploader: Append rows to existing tables or create new versions; log provenance (source URL, scrape timestamp).
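The NF filter component could start as a simple keyword/entity matcher over scraped text, using the gene symbols and disease terms listed above (a minimal sketch; a production version would likely add ontology-backed entity recognition):

```python
import re

# Gene symbols and disease terms from the NF filter component above.
NF_GENES = {"NF1", "NF2", "SMARCB1", "LZTR1"}
NF_TERMS = {
    "neurofibromatosis", "schwannomatosis", "schwannoma",
    "neurofibroma", "mpnst", "malignant peripheral nerve sheath tumor",
}

def is_nf_relevant(text: str) -> bool:
    """Flag an entry as NF-relevant if it mentions an NF gene symbol
    (word-bounded, case-insensitive to catch mouse-style 'Nf1') or an
    NF disease term. Case-insensitive gene matching trades some false
    positives for recall; borderline hits go to manual review."""
    if any(re.search(rf"\b{g}\b", text, re.IGNORECASE) for g in NF_GENES):
        return True
    lowered = text.lower()
    return any(term in lowered for term in NF_TERMS)
```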
Scheduling:
- Initial bulk scrape for each source.
- Weekly/monthly incremental updates (configurable per source).
- Manual trigger via API or admin UI.
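Change detection between scheduled runs can be sketched as a diff of two snapshots keyed by a stable identifier (RRID where available), yielding the new/updated/discontinued flags described above:

```python
def diff_snapshot(
    previous: dict[str, dict],
    current: dict[str, dict],
) -> tuple[set[str], set[str], set[str]]:
    """Compare two scrape snapshots keyed by RRID (or another stable ID).
    Returns (new, updated, discontinued) key sets: keys only in the
    current snapshot are new, keys only in the previous one are
    discontinued, and shared keys whose field values differ are updated."""
    prev_keys, curr_keys = set(previous), set(current)
    new = curr_keys - prev_keys
    discontinued = prev_keys - curr_keys
    updated = {k for k in prev_keys & curr_keys if previous[k] != current[k]}
    return new, updated, discontinued

prev = {"RRID:AB_0000001": {"name": "a"}, "RRID:AB_0000002": {"name": "b"}}
curr = {"RRID:AB_0000001": {"name": "a, v2"}, "RRID:AB_0000003": {"name": "c"}}
new, updated, discontinued = diff_snapshot(prev, curr)
```

Keying by RRID also gives the deduplication component a natural merge key across sources.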
4. Synapse table schema
One table per resource type, e.g.:
- `syn_NF_AnimalModels`
- `syn_NF_CellLines`
- `syn_NF_Antibodies`
- `syn_NF_GeneticReagents`
- `syn_NF_Biobanks`
Columns match normalized fields above + provenance columns:
`sourceURL`, `scrapedDate`, `lastVerified`, `changeLog`
Access: Open (read) by default; curators have edit rights to flag errors or add manual entries.
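Before a row is appended, the uploader would stamp it with the provenance columns above. A minimal sketch of that step (the upload itself would go through the `synapseclient` Python package's Table/Schema API; the column names below follow the list in this section):

```python
from datetime import date

PROVENANCE_COLUMNS = ("sourceURL", "scrapedDate", "lastVerified", "changeLog")

def with_provenance(row: dict, source_url: str, change: str = "new") -> dict:
    """Return a copy of a normalized row with the provenance columns
    filled in, ready to append to the matching Synapse table."""
    today = date.today().isoformat()
    return {
        **row,
        "sourceURL": source_url,
        "scrapedDate": today,
        "lastVerified": today,
        "changeLog": change,
    }

row = with_provenance(
    {"resourceId": "RRID:AB_0000001", "name": "anti-neurofibromin (hypothetical)"},
    source_url="https://example.org/catalog/1",
)
```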
5. User interface (minimal MVP)
- Browse & filter: Synapse table view with column filters, sorting, keyword search.
- Export: CSV/JSON download.
- Provenance links: Each row links back to original catalog entry.
- Change log: Per-row history (new, updated fields, deprecated).
6. Provenance & quality
- Every row includes `sourceURL` and `scrapedDate`.
- Automated validation checks:
- Non-null required fields (ID, name, source).
- Valid URLs (HTTP 200 check).
- RRID format validation.
- Flagging system for curators to mark errors or request manual review.
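The RRID-format and URL checks could be sketched as follows (the RRID pattern is deliberately loose, covering common prefixes such as `AB_`, `CVCL_`, `IMSR_JAX:`, and `Addgene_`; tighten it per registry if stricter validation is needed):

```python
import re
from urllib.request import Request, urlopen

# Loose RRID pattern: "RRID:" followed by a registry prefix and an accession.
RRID_RE = re.compile(r"^RRID:[A-Za-z]+[_:][A-Za-z0-9_:.\-]+$")

def valid_rrid(rrid: str) -> bool:
    """RRID format validation check."""
    return bool(RRID_RE.match(rrid))

def url_ok(url: str, timeout: float = 10.0) -> bool:
    """HTTP 200 check for sourceURL fields, using a HEAD request."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False
```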
Acceptance criteria
- ✅ Scraper successfully extracts ≥50 NF-relevant entries per resource type from at least 2 target sources each.
- ✅ Data is normalized and loaded into 5 Synapse tables (one per resource type) with all required fields populated.
- ✅ Each row includes provenance (source URL, scrape date).
- ✅ Incremental update run detects ≥1 new or changed entry and appends/updates the table without duplicating existing rows.
- ✅ Tables are publicly browsable and support filtering by NF subtype, gene, and availability.
- ✅ Export to CSV works for all tables.
- ✅ Change log or version history is visible (Synapse native versioning or custom field).
- ✅ No broken source URLs in the final tables (automated link checker passes).
Nice-to-haves (post-MVP)
- API access: REST or GraphQL endpoint for programmatic queries.
- Ontology mapping: Link to standard ontologies (MGI, CLO, BTO, EFO) for interoperability.
- Citation extraction: Auto-link to PubMed IDs for validation papers or original publications.
- Community contributions: Allow researchers to submit new resources or corrections via a form → curator review queue.
- Integration with NF Data Portal: Cross-link tools to related datasets, publications, or clinical trials.
- Alerts: Notify users when new resources matching their saved filters appear.
- Validation scoring: Rank antibodies/cell lines by validation strength (e.g., # citations, knockout controls).
Risks & mitigations
| Risk | Mitigation |
|---|---|
| Website structure changes break scrapers | Modular parsers per source; automated smoke tests; fallback to manual review. |
| Access restrictions (paywalls, CAPTCHAs) | Respect robots.txt; use official APIs where possible; manual seed for restricted sites. |
| Data licensing/terms of use violations | Review each source's ToS; only scrape publicly accessible data; attribute sources. |
| False positives (non-NF resources) | Refine NF keyword filter; manual review sample before bulk upload. |
| Stale data | Automated monthly refresh; flag entries not re-verified in >6 months. |
| Duplicate entries across sources | RRID-based deduplication; manual curator merge tool. |
Open questions
- Which Synapse project/folder should host these tables? (Coordinate with existing NF portal structure.)
- Who are the designated curators for manual review and corrections?
- Do we need IRB/data-use agreements for biobank metadata scraping?
- Should we scrape vendor pricing, or link only?
- Preferred notification channel for new/updated resources (email digest, Slack bot, RSS)?
- Integration priority: Should this feed into the consensus evidence synthesis system (issue #XXX)?
Related issues / references
- Example Synapse tables: https://www.synapse.org/Synapse:syn26338068/tables/
- RRID initiative: https://scicrunch.org/resources