Memory inefficiency in RDF parsing and DuckDB ingestion pipeline #35

Description

@savacano28

Context

The current RDF -> COTTAS conversion pipeline may rely on fully materializing parsed RDF data (triples/quads) in memory before ingestion into DuckDB.

This leads to high memory usage when processing large RDF datasets (potentially millions or billions of triples), which limits scalability and can cause performance degradation or memory exhaustion.

Although this is an internal change to the pipeline, it directly benefits end users working with large RDF files by improving stability, scalability, and usability.

Use case

A user wants to convert large RDF files into the COTTAS format.

During this process:

  • The RDF file is parsed
  • Triples/quads are temporarily stored in memory
  • Data is then loaded into DuckDB and exported

For large inputs, this workflow may exceed available memory, making the conversion unreliable or impossible on standard machines.

Current Workaround

Users currently need to:

  • Manually split RDF datasets into smaller chunks
  • Process each chunk separately

This workaround is inconvenient, error-prone, and adds significant overhead when working with large-scale datasets.

Proposed Solution

Refactor the RDF parsing and ingestion pipeline to support streaming or batched processing:

  • Introduce iterator-based or streaming RDF parsing instead of fully materializing all triples/quads in memory
  • Avoid collecting all parsed data into large in-memory structures (e.g., Vec)
  • Implement batch insertion into DuckDB during parsing
  • Ensure memory usage remains stable regardless of RDF input size

This would allow processing of arbitrarily large RDF datasets without requiring proportional memory increases.
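As a rough illustration of the batching idea, here is a minimal sketch that uses only the standard library. The names `ingest_in_batches` and `example_batch_sizes` are hypothetical and not taken from the current codebase, and the `flush` closure stands in for whatever DuckDB insertion step the pipeline ends up using (for example an appender or a batched INSERT, which is an assumption here). The point is only that a single fixed-capacity buffer is reused, so memory is bounded by the batch size rather than by the input size.

```rust
/// Feeds `rows` to `flush` in batches of at most `batch_size`, reusing one buffer.
/// Returns the total number of rows handed to `flush`.
/// Hypothetical helper: `flush` is a placeholder for the real DuckDB insertion step.
fn ingest_in_batches<T, I, F>(rows: I, batch_size: usize, mut flush: F) -> usize
where
    I: IntoIterator<Item = T>,
    F: FnMut(&[T]),
{
    assert!(batch_size > 0, "batch_size must be at least 1");
    let mut buffer: Vec<T> = Vec::with_capacity(batch_size);
    let mut total = 0;
    for row in rows {
        buffer.push(row);
        if buffer.len() == batch_size {
            flush(&buffer);
            total += buffer.len();
            buffer.clear(); // keeps the allocation, drops the rows
        }
    }
    // Flush the final, possibly partial, batch.
    if !buffer.is_empty() {
        flush(&buffer);
        total += buffer.len();
    }
    total
}

/// Example: five stand-in triples with a batch size of two.
fn example_batch_sizes() -> Vec<usize> {
    let rows = (0..5).map(|i| (format!("<s{i}>"), "<p>".to_string(), format!("<o{i}>")));
    let mut sizes = Vec::new();
    ingest_in_batches(rows, 2, |batch| sizes.push(batch.len()));
    sizes
}
```

Because `rows` can be any iterator, the same helper works whether the triples come from a streaming parser or from a test fixture. The batch size becomes the main tuning knob between memory use and insertion overhead.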

Additional Information

Areas to investigate:

  • parse_rdf_file implementation (does it currently return a full in-memory collection?)
  • load_into_duckdb interface (does it require full dataset materialization?)
  • Potential use of streaming RDF parsers or chunked iterators
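To make the two interface questions above concrete, the following sketch shows one possible shape for a streaming parser and a loader that accepts any iterator. Everything here is an assumption: `parse_rdf_stream` and `load_streaming` are hypothetical names, not the existing `parse_rdf_file` or `load_into_duckdb` signatures, and the line splitter is deliberately naive (simple N-Triples-like lines only). A real implementation would delegate to a proper streaming RDF parser.

```rust
use std::io::BufRead;

// Stand-in triple type; the pipeline's real type may differ.
type Triple = (String, String, String);

/// Hypothetical streaming parser: yields one result per statement line
/// instead of returning a fully materialized collection.
fn parse_rdf_stream<R: BufRead>(reader: R) -> impl Iterator<Item = Result<Triple, String>> {
    reader.lines().enumerate().filter_map(|(idx, line)| {
        let line = match line {
            Ok(l) => l,
            Err(e) => return Some(Err(format!("line {}: {e}", idx + 1))),
        };
        let text = line.trim();
        // Skip blank lines and comments.
        if text.is_empty() || text.starts_with('#') {
            return None;
        }
        // Naive split: subject, predicate, then the rest of the line as the object.
        let mut parts = text.trim_end_matches('.').trim_end().splitn(3, ' ');
        match (parts.next(), parts.next(), parts.next()) {
            (Some(s), Some(p), Some(o)) => {
                Some(Ok((s.to_string(), p.to_string(), o.to_string())))
            }
            _ => Some(Err(format!("line {}: expected three terms", idx + 1))),
        }
    })
}

/// Hypothetical loader: consumes triples one at a time and stops at the first error,
/// so it never needs the whole dataset in memory.
fn load_streaming<I>(triples: I) -> Result<usize, String>
where
    I: IntoIterator<Item = Result<Triple, String>>,
{
    let mut loaded = 0;
    for triple in triples {
        let _row = triple?; // a real loader would append this row to DuckDB here
        loaded += 1;
    }
    Ok(loaded)
}
```

With this shape, `load_streaming(parse_rdf_stream(reader))` pulls one line at a time from the reader, and a parse error surfaces as soon as the offending line is reached.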

Recommended benchmarking:

  • Memory usage before vs after changes
  • Execution time impact
  • Throughput on large RDF datasets
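For the memory comparison, one low-dependency option is a counting global allocator that records peak heap usage around a piece of work. The sketch below is an assumption about how such a benchmark could be set up, not an existing part of the project: `PeakAlloc`, `peak_bytes_during`, `materialized`, and `batched` are hypothetical names, the workloads are synthetic stand-ins for real parsing, and the counters are process-wide, so measurements are only meaningful when nothing else allocates heavily at the same time.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};

static CURRENT: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

/// Wraps the system allocator and tracks live and peak heap bytes.
struct PeakAlloc;

unsafe impl GlobalAlloc for PeakAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            let now = CURRENT.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(now, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        CURRENT.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static GLOBAL: PeakAlloc = PeakAlloc;

/// Returns the peak number of extra heap bytes live while `work` runs.
fn peak_bytes_during<F: FnOnce()>(work: F) -> usize {
    let baseline = CURRENT.load(Ordering::Relaxed);
    PEAK.store(baseline, Ordering::Relaxed);
    work();
    PEAK.load(Ordering::Relaxed).saturating_sub(baseline)
}

/// Synthetic "before" workload: collect every row, then process.
fn materialized(n: u64) -> u64 {
    let all: Vec<[u64; 3]> = (0..n).map(|i| [i, i + 1, i + 2]).collect();
    black_box(&all);
    all.iter().map(|t| t[0]).sum()
}

/// Synthetic "after" workload: process rows in fixed-size batches.
fn batched(n: u64, batch_size: usize) -> u64 {
    let mut buffer: Vec<[u64; 3]> = Vec::with_capacity(batch_size);
    let mut sum = 0;
    for i in 0..n {
        buffer.push([i, i + 1, i + 2]);
        if buffer.len() == batch_size {
            sum += buffer.iter().map(|t| t[0]).sum::<u64>();
            black_box(&buffer);
            buffer.clear();
        }
    }
    // Process the final partial batch, if any.
    sum + buffer.iter().map(|t| t[0]).sum::<u64>()
}
```

Wrapping each variant in `peak_bytes_during` gives a before/after number that can be reported alongside wall-clock time and throughput. For whole-process figures on real inputs, an external tool that reports resident set size is a reasonable complement.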
