Memory inefficiency in RDF parsing and DuckDB ingestion pipeline

## Context

The current RDF -> COTTAS conversion pipeline may rely on fully materializing parsed RDF data (triples/quads) in memory before ingestion into DuckDB.

If so, memory usage grows in proportion to input size. For large RDF datasets (potentially millions or billions of triples), this limits scalability and can cause performance degradation or memory exhaustion.

Although this is an internal change to the pipeline, it directly benefits end users working with large RDF files: conversions become more stable and scale to larger inputs.

## Use case

A user wants to convert large RDF files into the COTTAS format.

During this process:
- The RDF file is parsed
- Triples/quads are temporarily stored in memory
- Data is then loaded into DuckDB and exported

For large inputs, this workflow may exceed available memory, making the conversion unreliable or impossible on standard machines.

## Current Workaround

Users currently need to:
- Manually split RDF datasets into smaller chunks
- Process each chunk separately

This workaround is inconvenient, error-prone, and adds significant overhead when working with large-scale datasets.

## Proposed Solution

Refactor the RDF parsing and ingestion pipeline to support streaming or batched processing:

- Introduce iterator-based or streaming RDF parsing instead of fully materializing all triples/quads in memory
- Avoid collecting all parsed data into large in-memory structures (e.g., `Vec`)
- Implement batch insertion into DuckDB during parsing
- Ensure peak memory usage is bounded by the batch size rather than by the RDF input size

This would allow processing of arbitrarily large RDF datasets without requiring proportional memory increases.
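The batched approach could look roughly like the sketch below. It assumes the pipeline is written in Rust (as the `Vec` mention suggests) and uses only the standard library. `Triple`, `parse_line`, and `stream_in_batches` are hypothetical names, not the project's actual API, and the naive line splitter only stands in for a real streaming RDF parser. The `sink` closure marks where a batched DuckDB insert would happen (for example through DuckDB's Appender API).

```rust
use std::io::{BufRead, BufReader, Read};

/// Hypothetical triple type; the real pipeline's type may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct Triple {
    pub subject: String,
    pub predicate: String,
    pub object: String,
}

/// Naive stand-in for a real streaming RDF parser.
/// Only handles simple N-Triples-like lines of the form `<s> <p> <o> .`
/// with single spaces between terms.
pub fn parse_line(line: &str) -> Option<Triple> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    let line = line.strip_suffix('.')?.trim_end();
    let mut parts = line.splitn(3, char::is_whitespace);
    let subject = parts.next()?.to_string();
    let predicate = parts.next()?.to_string();
    let object = parts.next()?.trim().to_string();
    Some(Triple { subject, predicate, object })
}

/// Reads triples line by line and hands them to `sink` in batches of at
/// most `batch_size`. Only one batch is held in memory at a time, so peak
/// memory depends on `batch_size`, not on the size of the input.
/// Returns the total number of triples passed to `sink`.
pub fn stream_in_batches<R, F>(
    reader: R,
    batch_size: usize,
    mut sink: F,
) -> std::io::Result<usize>
where
    R: Read,
    F: FnMut(&[Triple]),
{
    assert!(batch_size > 0, "batch_size must be positive");
    let mut batch = Vec::with_capacity(batch_size);
    let mut total = 0;
    for line in BufReader::new(reader).lines() {
        if let Some(triple) = parse_line(&line?) {
            batch.push(triple);
            if batch.len() == batch_size {
                sink(&batch); // e.g. one batched insert into DuckDB
                total += batch.len();
                batch.clear(); // reuse the allocation for the next batch
            }
        }
    }
    // Flush the final, possibly partial batch.
    if !batch.is_empty() {
        sink(&batch);
        total += batch.len();
    }
    Ok(total)
}
```

A caller would pass a file handle (or any `Read`) together with a closure that performs the insert, e.g. `stream_in_batches(file, 100_000, |batch| insert(batch))`, where `insert` is whatever the ingestion layer provides. The batch size is a tuning parameter: larger batches reduce per-insert overhead at the cost of higher peak memory.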

## Additional Information

Areas to investigate:
- `parse_rdf_file` implementation (does it currently return a full in-memory collection?)
- `load_into_duckdb` interface (does it require full dataset materialization?)
- Potential use of streaming RDF parsers or chunked iterators

Recommended benchmarking:
- Memory usage before vs after changes
- Execution time impact
- Throughput on large RDF datasets
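For the memory comparison, one option is a counting global allocator that records peak heap usage around a call. The sketch below again assumes a Rust codebase and uses only the standard library; `PeakAlloc` and `peak_bytes_during` are hypothetical helper names. Note that it only sees allocations made through Rust's global allocator, so memory allocated natively inside DuckDB would likely not be counted and should be measured separately (for example via the process's resident set size).

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Wraps the system allocator and tracks live and peak heap bytes.
pub struct PeakAlloc;

static CURRENT: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for PeakAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            // Update the live byte count, then raise the peak if needed.
            let now = CURRENT.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(now, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        CURRENT.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static GLOBAL: PeakAlloc = PeakAlloc;

/// Runs `f` and returns its result together with the peak number of heap
/// bytes allocated above the level that was live when `f` started.
pub fn peak_bytes_during<T>(f: impl FnOnce() -> T) -> (T, usize) {
    let baseline = CURRENT.load(Ordering::Relaxed);
    PEAK.store(baseline, Ordering::Relaxed);
    let out = f();
    let peak = PEAK.load(Ordering::Relaxed).saturating_sub(baseline);
    (out, peak)
}
```

Wrapping the conversion entry point in `peak_bytes_during` before and after the refactor, on the same input, gives a direct comparison. With a streaming pipeline the reported peak should stay roughly flat as the input grows, whereas a fully materializing pipeline should show it growing with the number of triples.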

