Context
The current RDF -> COTTAS conversion pipeline may rely on fully materializing parsed RDF data (triples/quads) in memory before ingestion into DuckDB.
This leads to high memory usage when processing large RDF datasets (potentially millions or billions of triples), which limits scalability and can cause performance degradation or memory exhaustion.
This is a technical improvement to the pipeline, but it directly impacts end users working with large RDF files by improving stability, scalability, and usability.
Use Case
A user wants to convert large RDF files into the COTTAS format.
During this process:
- The RDF file is parsed
- Triples/quads are temporarily stored in memory
- Data is then loaded into DuckDB and exported
For large inputs, this workflow may exceed available memory, making the conversion unreliable or impossible on standard machines.
Current Workaround
Users currently need to:
- Manually split RDF datasets into smaller chunks
- Process each chunk separately
This workaround is inconvenient, error-prone, and adds significant overhead when working with large-scale datasets.
Proposed Solution
Refactor the RDF parsing and ingestion pipeline to support streaming or batched processing:
- Introduce iterator-based or streaming RDF parsing instead of fully materializing all triples/quads in memory
- Avoid collecting all parsed data into large in-memory structures (e.g., Vec)
- Implement batch insertion into DuckDB during parsing
- Ensure memory usage remains stable regardless of RDF input size
This would allow processing of arbitrarily large RDF datasets without requiring proportional memory increases.
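The batched approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: `Triple`, `BatchSink`, `CountingSink`, `parse_line`, and `stream_into_sink` are hypothetical names, the line splitter is a deliberately naive stand-in for a real streaming RDF parser, and the sink trait stands in for whatever batched DuckDB insertion the pipeline ends up using. The point it shows is that only one batch is held in memory at a time, so peak memory depends on the batch size rather than on the input size.

```rust
use std::io::{self, BufRead};

/// Hypothetical triple type; the pipeline's real representation is not shown in this issue.
#[derive(Debug, Clone, PartialEq)]
pub struct Triple {
    pub subject: String,
    pub predicate: String,
    pub object: String,
}

/// Hypothetical sink abstraction standing in for batched insertion into DuckDB.
pub trait BatchSink {
    fn insert_batch(&mut self, batch: &[Triple]) -> io::Result<()>;
}

/// Naive N-Triples-style line splitter, for illustration only.
/// It is NOT a compliant RDF parser; blank, comment, and malformed lines yield `None`.
fn parse_line(line: &str) -> Option<Triple> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    // Drop the terminating '.' and split into subject, predicate, and the rest (object).
    let body = line.strip_suffix('.')?.trim_end();
    let (subject, rest) = body.split_once(char::is_whitespace)?;
    let (predicate, object) = rest.trim_start().split_once(char::is_whitespace)?;
    Some(Triple {
        subject: subject.to_string(),
        predicate: predicate.to_string(),
        object: object.trim().to_string(),
    })
}

/// Reads the input line by line and flushes to the sink every `batch_size` triples.
/// At most `batch_size` triples are buffered at any time. Returns the total inserted.
pub fn stream_into_sink<R: BufRead, S: BatchSink>(
    reader: R,
    sink: &mut S,
    batch_size: usize,
) -> io::Result<usize> {
    assert!(batch_size > 0, "batch_size must be positive");
    let mut batch: Vec<Triple> = Vec::with_capacity(batch_size);
    let mut total = 0usize;
    for line in reader.lines() {
        if let Some(triple) = parse_line(&line?) {
            batch.push(triple);
            if batch.len() == batch_size {
                sink.insert_batch(&batch)?;
                total += batch.len();
                batch.clear(); // keeps the allocation, drops the triples
            }
        }
    }
    // Flush the final, possibly partial batch.
    if !batch.is_empty() {
        sink.insert_batch(&batch)?;
        total += batch.len();
    }
    Ok(total)
}

/// Sink used only to observe batching behaviour in this sketch.
#[derive(Default)]
pub struct CountingSink {
    pub batches: usize,
    pub rows: usize,
    pub largest_batch: usize,
}

impl BatchSink for CountingSink {
    fn insert_batch(&mut self, batch: &[Triple]) -> io::Result<()> {
        self.batches += 1;
        self.rows += batch.len();
        self.largest_batch = self.largest_batch.max(batch.len());
        Ok(())
    }
}
```

In a real implementation the sink would wrap the DuckDB connection, and the batch size would be a tunable trading insertion overhead against peak memory. Whether `parse_rdf_file` and `load_into_duckdb` can be adapted to this shape is one of the open questions listed under Additional Information.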
Additional Information
Areas to investigate:
- parse_rdf_file implementation (does it currently return a full in-memory collection?)
- load_into_duckdb interface (does it require full dataset materialization?)
- Potential use of streaming RDF parsers or chunked iterators
Recommended benchmarking:
- Memory usage before vs after changes
- Execution time impact
- Throughput on large RDF datasets
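One possible way to collect the memory and timing numbers listed above, using only the standard library, is a counting global allocator combined with a wall-clock timer. The sketch below is an assumption about how such a harness might look, not an existing part of the project: `TrackingAllocator`, `Measurement`, and `measure` are hypothetical names. It tracks heap bytes requested through Rust's allocator only, so memory allocated natively by DuckDB would not be visible to it and would need an external tool or OS-level measurement.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::{Duration, Instant};

// Bytes currently allocated and the highest value observed so far.
static CURRENT: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical allocator wrapper that forwards to the system allocator
/// while tracking current and peak heap usage.
pub struct TrackingAllocator;

unsafe impl GlobalAlloc for TrackingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            let now = CURRENT.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(now, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        CURRENT.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static GLOBAL: TrackingAllocator = TrackingAllocator;

/// Result of one measured run.
pub struct Measurement {
    /// Peak heap growth (in bytes) above the level at the start of the run.
    pub peak_bytes: usize,
    /// Wall-clock duration of the run.
    pub elapsed: Duration,
}

/// Runs `work`, returning its result together with peak heap growth and elapsed time.
/// Allocations made by other threads during the run are counted as well.
pub fn measure<T>(work: impl FnOnce() -> T) -> (T, Measurement) {
    let baseline = CURRENT.load(Ordering::Relaxed);
    PEAK.store(baseline, Ordering::Relaxed); // reset the peak for this run
    let start = Instant::now();
    let out = work();
    let elapsed = start.elapsed();
    let peak_bytes = PEAK.load(Ordering::Relaxed).saturating_sub(baseline);
    (out, Measurement { peak_bytes, elapsed })
}
```

Wrapping the existing conversion and the streaming variant in `measure` on the same input would give comparable before/after figures. Throughput can then be derived as triples processed divided by `elapsed.as_secs_f64()`.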