
Multimodal KS: Image-based Dataset Discovery via OCR & LLM Refinement #16

@HARSHDIPSAHA

Description


Currently, the KnowledgeSpace AI agent primarily relies on text-based queries. However, much valuable neuroscience metadata is "trapped" within non-editable formats like research paper figures, tables, and presentation screenshots. I propose a multimodal feature that allows users to upload images and automatically extract refined search queries to discover datasets in the KnowledgeSpace ecosystem.


1. Proposed Solution

I have developed a prototype for an end-to-end pipeline that includes:

  • Frontend

    • React-based upload interface with a dynamic, auto-expanding search area to handle long scientific queries.
  • Backend

    • FastAPI endpoint that utilizes Pytesseract for raw text extraction.


  • Intelligence Layer:

    • A refinement step using Gemini 2.0 Flash-Lite to perform zero-shot Named Entity Recognition (NER).
  • Refinement Logic:

    • The prompt is optimized to move away from conversational explanations (e.g., “Here are your options…”) and instead produce a clean, comma-separated list of extracted entities.
  • Impact:

    • Prevents query pollution by ensuring the backend search engine receives only high-signal scientific terms.
    • Significantly improves the relevance and precision of discovered datasets.
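The refinement logic above can be sketched as a small post-processing step. This is a minimal illustration, not the project's actual code: the `parse_entity_list` helper name is an assumption, and the LLM call itself is left out because only the output cleanup is shown here.

```python
import re

def parse_entity_list(llm_response: str) -> list[str]:
    """Turn a raw LLM reply into a clean, deduplicated entity list.

    Hypothetical helper for the refinement step: strips a conversational
    lead-in such as "Here are your options: ..." and keeps only the
    comma-separated scientific terms, preventing query pollution.
    """
    # Drop any conversational framing before the first colon, if present.
    text = re.sub(r"^[^:]*:\s*", "", llm_response.strip(), count=1)
    entities: list[str] = []
    for token in text.split(","):
        token = token.strip().strip(".")
        if token and token not in entities:
            entities.append(token)
    return entities
```

In this sketch the high-signal terms survive unchanged while filler text and duplicates are discarded before the query reaches the search backend.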

2. Technical Stack

  • Python 3.12 & FastAPI
  • React (TypeScript)
  • Pytesseract & Pillow
  • Google GenAI SDK (Gemini 2.0 Flash-Lite)

3. Example Use Cases

Example 1

  • Input Image (from a research paper table):
  • Refined OCR Output:
    • Rattus norvegicus
    • Striatum
    • Two-photon
    • Sprague-Dawley
    • Motor Cortex
    • Patch-clamp
  • Assistant Discovery Results:
    • IonChannelGenealogy: Kir2 channel model (Rattus norvegicus, Striatum)
    • EBRAINS: MiniVess 3D vasculature (Rattus norvegicus, Two-photon)
    • NeuroMorpho.Org: Neuron morphology (Sprague-Dawley, frontal neocortex)

Example 2

  • Input Image (from an actual research paper):
  • Refined OCR Output:
    • Sst-Cre
    • Ai32
    • Vip-Cre
    • CA1
  • Assistant Discovery Results:
    • IonChannelGenealogy: Relevant ion channel and electrophysiology datasets (Sst-Cre, CA1)
    • EBRAINS: Modeling and experimental datasets (Vip-Cre, Ai32)

The refined entities are injected directly into the existing keyword + vector retrieval pipeline.
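One plausible way to perform that injection, shown as a sketch only: the URL and the `q` parameter name are assumptions for illustration, since the actual KnowledgeSpace retrieval API is not detailed in this issue.

```python
from urllib.parse import urlencode

def build_search_query(
    entities: list[str],
    base_url: str = "https://knowledge-space.org/search",  # assumed URL
) -> str:
    """Join refined entities into a single keyword query URL.

    Illustrative only: the real pipeline may instead pass the terms to a
    vector retriever rather than a URL-based keyword search.
    """
    query = " ".join(entities)
    return f"{base_url}?{urlencode({'q': query})}"
```

For example, the Example 2 entities would collapse into one URL-encoded keyword query rather than four separate searches.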

I am planning to apply for GSoC 2026 with INCF and would like to lead the implementation of this feature. I have already set up a local development environment and verified the core logic. I would appreciate any feedback from the mentors and am ready to submit a PR!
