
Multimodal KS: Image-based Dataset Discovery via OCR & LLM Refinement #16

@HARSHDIPSAHA

Description


Currently, the KnowledgeSpace AI agent primarily relies on text-based queries. However, much valuable neuroscience metadata is "trapped" within non-editable formats like research paper figures, tables, and presentation screenshots. I propose a multimodal feature that allows users to upload images and automatically extract refined search queries to discover datasets in the KnowledgeSpace ecosystem.


1. Proposed Solution

I have developed a prototype for an end-to-end pipeline that includes:

  • Frontend

    • React-based upload interface with a dynamic, auto-expanding search area to handle long scientific queries.
  • Backend

    • FastAPI endpoint that utilizes Pytesseract for raw text extraction.


  • Intelligence Layer:

    • A refinement step using Gemini 2.0 Flash-Lite to perform zero-shot Named Entity Recognition (NER).
  • Refinement Logic:

    • The prompt is optimized to move away from conversational explanations (e.g., “Here are your options…”) and instead produce a clean, comma-separated list of extracted entities.
  • Impact:

    • Prevents query pollution by ensuring the backend search engine receives only high-signal scientific terms.
    • Significantly improves the relevance and precision of discovered datasets.
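The refinement logic above can be sketched as a small post-processing step. This is a minimal illustration, not the project's actual code: the `parse_entity_list` helper name is an assumption, and the LLM call itself is left out because only the output cleanup is shown here.

```python
import re

def parse_entity_list(llm_response: str) -> list[str]:
    """Turn a raw LLM reply into a clean, deduplicated entity list.

    Hypothetical helper for the refinement step: strips a conversational
    lead-in such as "Here are your options: ..." and keeps only the
    comma-separated scientific terms, preventing query pollution.
    """
    # Drop any conversational framing before the first colon, if present.
    text = re.sub(r"^[^:]*:\s*", "", llm_response.strip(), count=1)
    entities: list[str] = []
    for token in text.split(","):
        token = token.strip().strip(".")
        if token and token not in entities:
            entities.append(token)
    return entities
```

In this sketch the high-signal terms survive unchanged while filler text and duplicates are discarded before the query reaches the search backend.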

2. Technical Stack

  • Python 3.12 & FastAPI
  • React (TypeScript)
  • Pytesseract & Pillow
  • Google GenAI SDK (Gemini 2.0 Flash-Lite)

3. Example Use Cases

Example 1

  • Input Image (from a research paper table):
  • Refined OCR Output:
    • Rattus norvegicus
    • Striatum
    • Two-photon
    • Sprague-Dawley
    • Motor Cortex
    • Patch-clamp
  • Assistant Discovery Results:
    • IonChannelGenealogy: Kir2 channel model (Rattus norvegicus, Striatum)
    • EBRAINS: MiniVess 3D vasculature (Rattus norvegicus, Two-photon)
    • NeuroMorpho.Org: Neuron morphology (Sprague-Dawley, frontal neocortex)

Example 2

  • Input Image (from an actual research paper):
  • Refined OCR Output:
    • Sst-Cre
    • Ai32
    • Vip-Cre
    • CA1
  • Assistant Discovery Results:
    • IonChannelGenealogy: Relevant ion channel and electrophysiology datasets (Sst-Cre, CA1)
    • EBRAINS: Modeling and experimental datasets (Vip-Cre, Ai32)

The refined entities are injected directly into the existing keyword + vector retrieval pipeline.
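One plausible way to perform that injection, shown as a sketch only: the URL and the `q` parameter name are assumptions for illustration, since the actual KnowledgeSpace retrieval API is not detailed in this issue.

```python
from urllib.parse import urlencode

def build_search_query(
    entities: list[str],
    base_url: str = "https://knowledge-space.org/search",  # assumed URL
) -> str:
    """Join refined entities into a single keyword query URL.

    Illustrative only: the real pipeline may instead pass the terms to a
    vector retriever rather than a URL-based keyword search.
    """
    query = " ".join(entities)
    return f"{base_url}?{urlencode({'q': query})}"
```

For example, the Example 2 entities would collapse into one URL-encoded keyword query rather than four separate searches.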

I am planning to apply for GSoC 2026 with INCF and would like to lead the implementation of this feature. I have already set up a local development environment and verified the core logic. I would appreciate any feedback from the mentors and am ready to submit a PR!
