- Date: December 7-8, 2025
- Processing Time: ~14 minutes for 13 papers
- System: AI-CoScientist DD-RAPTOR Pipeline with Multi-Provider LLM

Successfully processed 13 out of 14 NeurIPS 2025 papers through the DD-RAPTOR pipeline, creating a comprehensive hierarchical knowledge base with:
- 1,161 Level 0 chunks (512-token sections with overlap)
- 53 Level 1 summaries (section-level abstractions)
- 13 Level 2 summaries (paper-level abstractions)
- Total: 1,227 searchable knowledge nodes
All processed papers are stored in:
- JSON format: /home/juke/git/AI-CoScientist/data/reference_papers/neurips_2025_processed/
- ChromaDB: /home/juke/git/AI-CoScientist/chromadb_data_neurips2025/ (30 MB)
- Brain Foundation Models: A Survey on
  - File: 2503.00580_brain_foundation_models_survey.pdf
  - Chunks: 42 L0, 2 L1
  - Total Words: 10,308
  - Status: ✅ Success
- Foundation Model in Biomedicine
  - File: 2503.02104_biomedical_foundation_model_survey.pdf
  - Chunks: 69 L0, 5 L1
  - Total Words: 5,641
  - Status: ✅ Success
- MMaDA: Multimodal Large Diffusion Language Models
  - File: 2505.15809_mmada_multimodal_diffusion.pdf
  - Chunks: 62 L0, 2 L1
  - Total Words: 15,918
  - Status: ✅ Success
- Brain Imaging Foundation Models, Are We There Yet?
  - File: 2506.13306_brain_imaging_foundation_models.pdf
  - Chunks: 74 L0, 3 L1
  - Total Words: 18,913
  - Status: ✅ Success
- Foundation and Large-Scale AI Models in Neuroscience
  - File: 2510.16658_foundation_ai_neuroscience.pdf
  - Chunks: 106 L0, 3 L1
  - Total Words: 24,853
  - Status: ✅ Success
- Eagle 2.5: Boosting Long-Context Post-Training
  - File: 2504.15271_Eagle2.5.pdf
  - Chunks: 66 L0, 4 L1
  - Total Words: 16,190
  - Status: ✅ Success
- ModuLM: Enabling Modular and Multimodal
  - File: 2506.00880_ModuLM.pdf
  - Chunks: 47 L0, 4 L1
  - Total Words: 10,699
  - Status: ✅ Success
- NeurIPS 2025 E2LM Competition: Early Training
  - File: 2506.07731_E2LM.pdf
  - Chunks: 49 L0, 2 L1
  - Total Words: 10,483
  - Status: ✅ Success
- Training a Scientific Reasoning Model for Chemistry
  - File: 2506.17238_ScientificReasoningChemistry.pdf
  - Chunks: 70 L0, 6 L1
  - Total Words: 18,226
  - Status: ✅ Success
- The Evolving Role of Large Language Models in Scientific Innovation
  - File: 2507.11810_LLMsScientificInnovation.pdf
  - Chunks: 186 L0, 5 L1
  - Total Words: 36,410
  - Status: ✅ Success
- Cross-Domain EEG
  - File: 2508.15716_CrossDomainEEG.pdf
  - Chunks: 44 L0, 4 L1
  - Total Words: 11,028
  - Status: ✅ Success
- A Survey of Scientific Large Language Models
  - File: 2508.21148_ScientificLLMsSurvey.pdf (Largest: 34 MB PDF)
  - Chunks: 267 L0, 10 L1
  - Total Words: 96,629
  - Status: ✅ Success
- NeurIPT: Foundation Model for Neural Interfaces
  - File: 2510.16548_NeurIPT.pdf
  - Chunks: 79 L0, 3 L1
  - Total Words: 17,681
  - Status: ✅ Success
- PRIMT
  - File: 2509.15607_PRIMT.pdf (17 MB)
  - Status: ❌ Failed
  - Reason: All LLM providers failed during section parsing
  - Notes: Gemini blocked content (Finish reason: 2), Anthropic organization disabled
| Level | Description | Count | Average per Paper |
|---|---|---|---|
| L0 | Text Chunks (512 tokens) | 1,161 | 89.3 |
| L1 | Section Summaries | 53 | 4.1 |
| L2 | Paper Summaries | 13 | 1.0 |
| Total | All Nodes | 1,227 | 94.4 |
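
The per-paper averages in the table are the node counts divided by the 13 successfully processed papers. As a quick sanity check, the arithmetic can be reproduced with a few lines of Python; the variable names are illustrative and only the figures reported above are used:

```python
# Node counts per level, as reported in the table above.
node_counts = {"L0": 1161, "L1": 53, "L2": 13}
papers_processed = 13

total_nodes = sum(node_counts.values())
averages = {
    level: round(count / papers_processed, 1)
    for level, count in node_counts.items()
}
average_total = round(total_nodes / papers_processed, 1)

print(total_nodes)    # 1227
print(averages)       # {'L0': 89.3, 'L1': 4.1, 'L2': 1.0}
print(average_total)  # 94.4
```
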
- Total Words: 292,969
- Average per Paper: 22,536 words
- Largest Paper: Scientific LLMs Survey (96,629 words)
- Smallest Paper: Biomedical Foundation Model (5,641 words)
- JSON Files: 34 MB total
- ChromaDB: 30 MB
- Average JSON per Paper: ~2.6 MB
The DD-RAPTOR pipeline creates a 3-level hierarchical knowledge structure:
Level 2 (Paper Summary)
│
├─ Level 1 (Section Summaries)
│ ├─ Abstract Summary
│ ├─ Introduction Summary
│ ├─ Methods Summary
│ └─ ...
│
└─ Level 0 (Text Chunks)
  ├─ Chunk 1 (512 tokens)
  ├─ Chunk 2 (512 tokens, 50-token overlap)
  └─ ...
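
The Level 0 layer relies on fixed-size windows that overlap with their neighbors. The sketch below illustrates that windowing under stated assumptions: whitespace-style string tokens stand in for whatever tokenizer the pipeline actually uses (the report does not name it), and `chunk_tokens` is a hypothetical helper, not the pipeline's own function.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 512, overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of chunk_size; each window repeats the
    last `overlap` tokens of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Illustrative input: 1,200 placeholder tokens (assumption, not real paper text).
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [512, 512, 276]
```

With a 512-token window and a 50-token overlap, each new chunk starts 462 tokens after the previous one, so the final chunk is usually shorter than the rest.
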
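Because the Gemini and Anthropic providers failed, the pipeline fell back to OpenAI GPT-4o, which generated all summaries:
- Gemini 2.5 Pro: ❌ Empty response, logged as blocked content (finish reason: 2; in the Gemini API's FinishReason enum, 2 is MAX_TOKENS, so the output token limit may be the actual cause)
- Anthropic Claude Sonnet: ❌ Organization disabled
- OpenAI GPT-4o: ✅ Effective provider (generated all summaries)
- DeepSeek: ⚠️ Not needed (OpenAI succeeded)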
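
The provider behavior above amounts to trying each LLM in turn until one returns usable text. The following is a minimal sketch of such a fallback chain, not the pipeline's actual code: the provider order follows the list above, and the four provider functions are hypothetical stand-ins that simply mimic the reported outcomes.

```python
from typing import Callable, List, Tuple

Provider = Tuple[str, Callable[[str], str]]


def summarize_with_fallback(prompt: str, providers: List[Provider]) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, text) from the first success."""
    errors = []
    for name, call in providers:
        try:
            text = call(prompt)
            if text and text.strip():
                return name, text
            errors.append(f"{name}: empty response")
        except Exception as exc:  # any provider error moves on to the next one
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All LLM providers failed: " + "; ".join(errors))


# Hypothetical stand-ins mirroring the outcomes reported above.
def gemini(prompt: str) -> str:
    return ""  # empty response


def anthropic(prompt: str) -> str:
    raise PermissionError("organization disabled")


def openai(prompt: str) -> str:
    return f"summary of: {prompt}"


def deepseek(prompt: str) -> str:
    return "unused"


providers = [("gemini", gemini), ("anthropic", anthropic),
             ("openai", openai), ("deepseek", deepseek)]
used, summary = summarize_with_fallback("Methods section", providers)
print(used)  # openai
```

If every provider in the list fails, the helper raises an error that collects each provider's failure reason, which matches the kind of message recorded for the one paper that could not be processed.
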
- Model: SciBERT (allenai/scibert_scivocab_uncased)
- Embeddings Generated: 1,227 total
- Dimensionality: 768 (SciBERT standard)
The ChromaDB instance contains 3 collections:
| Collection | Name | Description | Documents |
|---|---|---|---|
| L0 | neurips_2025_L0 | Original text chunks | 1,161 |
| L1 | neurips_2025_L1 | Section summaries | 53 |
| L2 | neurips_2025_L2 | Paper summaries | 13 |
/home/juke/git/AI-CoScientist/chromadb_data_neurips2025/
├── 92b93fd7-ea54-4ca3-9086-5a18ef4ef486/ (neurips_2025_L0)
├── 9527efca-ea60-499a-8cc5-f1abb3c41325/ (neurips_2025_L1)
├── bf0f76d4-5c94-4475-b941-77ad2cbcf453/ (neurips_2025_L2)
└── chroma.sqlite3 (26 MB metadata database)
Each chunk includes rich metadata:
- paper_id: Unique identifier
- paper_title: Full paper title
- section: Section name
- section_order: Order in paper
- journal: "NeurIPS 2025"
- year: 2025
- chunk_index: Position in section
- total_chunks: Total chunks in section
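
These fields map naturally onto a small typed record. The sketch below is one way to model them; the class name `ChunkMetadata`, the field types, the zero-based `chunk_index`, and the example values are all assumptions made for illustration, since the report lists only the field names and their meaning.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ChunkMetadata:
    paper_id: str
    paper_title: str
    section: str
    section_order: int
    journal: str
    year: int
    chunk_index: int   # assumed zero-based position within the section
    total_chunks: int

    def __post_init__(self) -> None:
        if not 0 <= self.chunk_index < self.total_chunks:
            raise ValueError("chunk_index must lie in [0, total_chunks)")


# Placeholder values for illustration only.
meta = ChunkMetadata(
    paper_id="example-paper",
    paper_title="An Example Paper Title",
    section="Methods",
    section_order=3,
    journal="NeurIPS 2025",
    year=2025,
    chunk_index=0,
    total_chunks=12,
)
record = asdict(meta)  # flat dict of str/int values
print(sorted(record))
```

Keeping the record flat, with only string and integer values, suits vector stores such as ChromaDB, which accept scalar metadata values.
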
The ChromaDB knowledge base now enables:
- Semantic Search: Find relevant papers by meaning, not keywords
- Multi-Level Retrieval: Search at chunk, section, or paper level
- Contextual Queries: Locate specific concepts across all papers
- Hierarchical Navigation: Drill down from summaries to detailed chunks
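
Hierarchical navigation can be pictured as a two-step search: rank the paper summaries first, then rank only the chunks that belong to the best-matching paper. The sketch below shows the idea in plain Python with toy 3-dimensional vectors and cosine similarity; the real knowledge base stores 768-dimensional SciBERT embeddings in ChromaDB, and every identifier and vector here is made up for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], items: List[dict], k: int) -> List[dict]:
    """Return the k items whose embedding is most similar to the query."""
    return sorted(items, key=lambda item: cosine(query, item["embedding"]), reverse=True)[:k]


# Toy stand-ins for the L2 (paper summary) and L0 (chunk) collections.
paper_summaries = [
    {"paper_id": "paper_a", "embedding": [0.9, 0.1, 0.0]},
    {"paper_id": "paper_b", "embedding": [0.1, 0.9, 0.2]},
]
chunks = [
    {"paper_id": "paper_a", "chunk_index": 0, "embedding": [0.8, 0.2, 0.1]},
    {"paper_id": "paper_a", "chunk_index": 1, "embedding": [0.2, 0.1, 0.9]},
    {"paper_id": "paper_b", "chunk_index": 0, "embedding": [0.0, 1.0, 0.0]},
]

query = [1.0, 0.0, 0.0]
best_paper = top_k(query, paper_summaries, k=1)[0]                      # step 1: paper level
paper_chunks = [c for c in chunks if c["paper_id"] == best_paper["paper_id"]]
best_chunk = top_k(query, paper_chunks, k=1)[0]                         # step 2: chunk level
print(best_paper["paper_id"], best_chunk["chunk_index"])  # paper_a 0
```

Against the ChromaDB collections, the second step would likely correspond to querying the L0 collection with a `where` filter on `paper_id`.
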
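```python
from chromadb import PersistentClient

# Connect to the persisted ChromaDB instance
client = PersistentClient(path="chromadb_data_neurips2025")

# Note: query_texts is embedded by the collection's embedding function, which
# must produce the same 768-dimensional SciBERT vectors stored at load time;
# otherwise pass precomputed vectors via query_embeddings instead.

# Search Level 2 (paper summaries)
collection_l2 = client.get_collection("neurips_2025_L2")
paper_results = collection_l2.query(
    query_texts=["brain foundation models for neuroimaging"],
    n_results=5,
)

# Search Level 0 (detailed chunks)
collection_l0 = client.get_collection("neurips_2025_L0")
chunk_results = collection_l0.query(
    query_texts=["transformer architecture for EEG signals"],
    n_results=10,
)
```

- Success Rate: 92.9% (13/14 papers)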
- Processing Speed: ~1.1 minutes per paper
- Chunks per Paper: 89.3 average
- Quality: All 13 papers have complete 3-level hierarchies
- ✅ All embeddings successfully generated (SciBERT)
- ✅ All L0 chunks have proper metadata
- ✅ All L1 summaries linked to their source L0 chunks
- ✅ All L2 summaries linked to section summaries
- ✅ ChromaDB integrity verified
- Retry PRIMT Paper: Re-run with different LLM settings or manual processing
- Test Search Functionality: Validate retrieval quality with sample queries
- Integration: Connect ChromaDB to main AI-CoScientist RAG system
- Cross-Reference Analysis: Link related concepts across papers
- Citation Extraction: Parse and index paper citations
- Figure/Table Processing: Extract and index visual elements
- Quality Metrics: Implement RAGAS evaluation on retrieval
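
One simple way to validate retrieval with sample queries, before adopting a fuller framework such as RAGAS, is hit rate at k: the fraction of queries for which the expected paper appears among the top k results. The sketch below is an illustrative metric only; the function name, query labels, and paper identifiers are hypothetical.

```python
from typing import Dict, List


def hit_rate_at_k(retrieved: Dict[str, List[str]], expected: Dict[str, str], k: int) -> float:
    """Fraction of queries whose expected paper_id is among the first k retrieved ids."""
    if not expected:
        return 0.0
    hits = sum(
        1 for query, paper_id in expected.items()
        if paper_id in retrieved.get(query, [])[:k]
    )
    return hits / len(expected)


# Hypothetical ranked results per sample query, best match first.
retrieved = {
    "q1": ["paper_a", "paper_b", "paper_c"],
    "q2": ["paper_c", "paper_a", "paper_b"],
}
expected = {"q1": "paper_a", "q2": "paper_b"}

print(hit_rate_at_k(retrieved, expected, k=1))  # 0.5
print(hit_rate_at_k(retrieved, expected, k=3))  # 1.0
```
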
- /home/juke/git/AI-CoScientist/scripts/process_neurips_2025_papers.py
  - Main processing pipeline
  - Multi-provider LLM integration
  - RAPTOR hierarchy construction
- /home/juke/git/AI-CoScientist/scripts/load_neurips_2025_chromadb.py
  - ChromaDB collection creation
  - JSON to vector store loading
- JSON Files (13 files, 34 MB total)
  - Location: /home/juke/git/AI-CoScientist/data/reference_papers/neurips_2025_processed/
  - Format: Complete paper data with embeddings
- Processing Results
  - Location: /home/juke/git/AI-CoScientist/data/reference_papers/neurips_2025_processed/processing_results.json
  - Contains: Success/failure tracking, statistics
- ChromaDB Instance (30 MB)
  - Location: /home/juke/git/AI-CoScientist/chromadb_data_neurips2025/
  - Collections: 3 (L0, L1, L2)
  - Total Documents: 1,227
The NeurIPS 2025 paper processing pipeline successfully created a comprehensive, searchable knowledge base covering state-of-the-art research in:
- Brain foundation models and neuroimaging
- Biomedical AI and foundation models
- Scientific LLMs and reasoning
- Multimodal diffusion models
- Neural interfaces and EEG processing
- Scientific innovation with LLMs
This knowledge base is now ready for integration with the AI-CoScientist research automation system, enabling advanced literature review, hypothesis generation, and cross-domain knowledge synthesis.
- Total Processing Time: ~14 minutes
- Total Knowledge Nodes: 1,227
- Total Coverage: 292,969 words across 13 cutting-edge papers

Report Generated: December 8, 2025, 02:43 UTC