This Scala application provides a workflow for converting RDF/Turtle data to JSON-LD and Parquet format for use in a data lake and LDES server.
```
src/main/scala/
├── TurtleTransformer.scala   # Main Scala application with all conversion logic
├── OwlToShaclGenerator.scala # Generates SHACL shapes from the OWL ontology
├── ShaclValidator.scala      # Validates RDF models against SHACL shapes
└── OntologySorter.scala      # Sorts ontologies and creates structural/disjoint subsets

src/main/resources/
├── be/vlaanderen/omgeving/riepr/
│   └── data/
│       ├── id/
│       │   ├── jsonld/
│       │   │   ├── frame.json   # JSON-LD framing configuration
│       │   │   └── context.json # JSON-LD context configuration
│       │   └── rule/
│       │       └── domain-range-subproperty.rules # Reasoning rules
│       └── ns/
│           └── riepr/
│               ├── riepr.ttl           # RIEPR domain ontology
│               └── rieprAlignments.ttl # RIEPR ontology alignments
├── generated-shapes.ttl # Auto-generated SHACL shapes
├── logback.xml          # Logging configuration
├── be/  # Additional ontology files
├── net/ # Network-related ontologies
└── org/ # Organization-related ontologies

src/main/input/
├── activiteit/  # Activity data
├── bedrijf/     # Company data
├── exploitant/  # Operator data
└── installatie/ # Installation data

src/main/output/
├── json/    # JSON output
├── jsonld/  # JSON-LD output
├── parquet/ # Parquet output
└── turtle/  # Inferred Turtle output

src/test/scala/
├── ShaclGenerationTest.scala              # SHACL generation tests
├── ShaclGenerationAndValidationTest.scala # Combined SHACL tests
├── ShaclTestRunner.scala                  # SHACL test runner
├── ShaclGenerationSpec.scala              # SHACL generation specifications
├── ShaclTestUtil.scala                    # SHACL test utilities
├── ShaclValidatorTest.scala               # SHACL validator tests
└── TurtleTransformerTest.scala            # Turtle transformer tests
```
The application follows this conversion workflow:

1. **Ontology Sorting and Subsetting**
   - Uses `OntologySorter` to create structural and disjoint subsets from the complete ontology
   - Generates three ontology variants: complete, structural subset, and disjoint subset
   - Structural subset contains `subPropertyOf`, `subClassOf`, `inverseOf`, `domain`, and `range` relationships
   - Disjoint subset contains `disjointWith` relationships
2. **Turtle Processing with Reasoning**
   - Loads ontologies from `.ttl` files in `src/main/resources/`, including:
     - Combined ontologies (SSN-SOSA, PROV-O, P-Plan, GeoSPARQL, DBO)
     - RIEPR domain ontology from `src/main/resources/be/vlaanderen/omgeving/riepr/data/ns/riepr/riepr.ttl`
     - RIEPR ontology alignments from `src/main/resources/be/vlaanderen/omgeving/riepr/data/ns/riepr/rieprAlignments.ttl`
   - Loads reasoning rules from `src/main/resources/be/vlaanderen/omgeving/riepr/data/id/rule/domain-range-subproperty.rules`
   - Processes Turtle files from `src/main/input/` recursively
   - Applies reasoning using Jena's `GenericRuleReasoner`
   - Writes inferred triples to `src/main/output/turtle/`
3. **JSON-LD Conversion**
   - Converts inferred RDF to JSON-LD
   - Applies JSON-LD framing using `src/main/resources/be/vlaanderen/omgeving/riepr/data/id/jsonld/frame.json`
   - Uses the JSON-LD context from `src/main/resources/be/vlaanderen/omgeving/riepr/data/id/jsonld/context.json`
   - Writes JSON-LD files to `src/main/output/jsonld/`
4. **JSON and Parquet Conversion**
   - Extracts `@graph` arrays from JSON-LD files
   - Writes JSON arrays to `src/main/output/json/`
   - Converts JSON arrays to Parquet files in `src/main/output/parquet/`
   - Generates and saves the Parquet schema as `_schema.json` in each output directory
5. **SHACL Validation**
   - Generates SHACL shapes from the OWL ontology using `OwlToShaclGenerator`
   - Validates inferred models against the generated SHACL shapes using `ShaclValidator`
   - Provides detailed validation reports with error messages
   - Saves the generated SHACL shapes to `src/main/resources/generated-shapes.ttl`
6. **OWL Validation**
   - Validates models using Jena's OWL Mini reasoner
   - Checks logical consistency and class hierarchies
   - Provides detailed error messages for non-conformant data
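The rules file itself is not reproduced in this README, but domain/range/sub-property inference rules in Jena's `GenericRuleReasoner` syntax typically take the following shape (illustrative rule names and bodies, not necessarily the project's exact rules):

```
[domainRule:      (?p rdfs:domain ?c) (?x ?p ?y) -> (?x rdf:type ?c)]
[rangeRule:       (?p rdfs:range ?c)  (?x ?p ?y) -> (?y rdf:type ?c)]
[subPropertyRule: (?p rdfs:subPropertyOf ?q) (?x ?p ?y) -> (?x ?q ?y)]
```

Each rule fires when all premises on the left match, asserting the triple on the right into the inferred model.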
```bash
# Check for Java 11
java --version

# If the version is not 11, install it, e.g.:
sudo apt install openjdk-11-jdk
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
. ~/.bashrc
```

The application is a standalone Scala program that can be run directly:

```bash
export PATH=$JAVA_HOME/bin:$PATH
mvn compile exec:java
```

**Important note:** This application requires Java 11 because of Spark 3.5.1 compatibility. If you encounter `UnsupportedOperationException: getSubject is not supported` errors, make sure you are running Java 11 rather than Java 17 or 21.
The application uses Apache Spark for Parquet conversion. Ensure you have Spark properly configured in your environment.
The application uses the following key resources:

- Ontologies: located in `src/main/resources/`, including:
  - Combined ontologies: SSN-SOSA, PROV-O, P-Plan, GeoSPARQL, DBO
  - RIEPR domain ontology: `src/main/resources/be/vlaanderen/omgeving/riepr/data/ns/riepr/riepr.ttl`
  - RIEPR ontology alignments: `src/main/resources/be/vlaanderen/omgeving/riepr/data/ns/riepr/rieprAlignments.ttl`
- JSON-LD Frame: `src/main/resources/be/vlaanderen/omgeving/riepr/data/id/jsonld/frame.json`
- JSON-LD Context: `src/main/resources/be/vlaanderen/omgeving/riepr/data/id/jsonld/context.json`
- Reasoning Rules: `src/main/resources/be/vlaanderen/omgeving/riepr/data/id/rule/domain-range-subproperty.rules`
- Generated SHACL Shapes: `src/main/resources/generated-shapes.ttl`
- Logging Configuration: `src/main/resources/logback.xml`
- Input Data: `src/main/input/` (recursively processes all `.ttl` files)
- Output Data: `src/main/output/` (`json`, `jsonld`, `parquet`, and `turtle` directories)
- Ontology Subsets: automatically created by `OntologySorter` (complete, structural, disjoint)
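The project's `frame.json` is not reproduced here. In general, a JSON-LD frame combines a context with type selectors that decide which nodes become the roots of the framed trees. A purely illustrative example (hypothetical `ex:` names, not the actual frame):

```json
{
  "@context": {
    "ex": "https://example.org/ns#"
  },
  "@type": "ex:Installatie",
  "@embed": "@always"
}
```

With `"@embed": "@always"`, referenced nodes are embedded inline rather than left as bare `@id` references, which is what produces the nested JSON structure consumed by the Parquet step.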
Key dependencies include:
- Scala 2.13 - Programming language
- Apache Jena - RDF processing and reasoning
- JSONLD-Java - JSON-LD conversion
- Jackson - JSON processing
- Apache Spark - Parquet file format support
- Apache Parquet - Parquet file format
```bash
mvn clean compile
```

```bash
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
mvn exec:java -Dexec.mainClass="be.vlaanderen.omgeving.riepr.TurtleTransformer"
```

```
Turtle Files (+ Ontologies + Rules)
        ↓ (Jena Reasoning)
Inferred Turtle Files
        ↓ (JSON-LD Conversion)
JSON-LD Files
        ↓ (JSON Extraction)
JSON Array Files
        ↓ (Spark Parquet Conversion)
Parquet Files
```
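Because each stage can fail on empty or invalid input, the per-file pipeline composes naturally as a chain of `Option`-returning stages. A toy sketch with stand-in stages (illustrative names and types, not the real signatures):

```scala
object PipelineSketch {
  // Toy stand-ins for the real stages; each returns None on failure.
  def parseTurtle(src: String): Option[String] =
    Option(src).filter(_.nonEmpty)

  def inferTriples(model: String): Option[String] =
    Some(model + " +inferred")

  def toJsonLd(model: String): Option[String] =
    Some(s"""{"@graph":["$model"]}""")

  // The whole pipeline short-circuits as soon as any stage yields None.
  def process(src: String): Option[String] =
    parseTurtle(src).flatMap(inferTriples).flatMap(toJsonLd)
}
```

An empty input yields `None` instead of an exception, which matches the graceful error handling described in the notes below.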
- Ontology Sorting: Automatically sorts and subsets ontologies for efficient processing
- Recursive File Processing: Automatically finds and processes all `.ttl` files in the input directory and its subdirectories
- Reasoning: Applies Jena reasoning rules to infer additional triples
- JSON-LD Framing: Uses JSON-LD framing to create structured JSON output
- Parquet Conversion: Converts JSON data to efficient Parquet format using Spark with schema preservation
- Multiple Output Formats: Generates Turtle, JSON, JSON-LD, and Parquet outputs
- SHACL Validation: Generates SHACL shapes from OWL ontology and validates data
- OWL Validation: Validates models using Jena's OWL reasoner
- Comprehensive Testing: Includes Scala tests, Java integration tests, and Python utility scripts
- Error Handling: Graceful handling of empty or invalid inputs with detailed error messages
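The recursive file discovery can be sketched with `java.nio` alone (illustrative; the actual `listTurtleFiles()` implementation may differ):

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

object TurtleFileSketch {
  // Walks the directory tree under `root` and collects every regular
  // file whose name ends in .ttl, mirroring the recursive processing
  // of src/main/input/ described above.
  def listTurtleFiles(root: Path): List[Path] = {
    val stream = Files.walk(root)
    try stream.iterator().asScala
      .filter(p => Files.isRegularFile(p) && p.toString.endsWith(".ttl"))
      .toList
    finally stream.close() // Files.walk holds directory handles until closed
  }
}
```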
The application consists of four main Scala files:
The main application file containing the core conversion logic:
- `loadFrame()`: Loads the JSON-LD framing configuration
- `loadOntology()`: Loads RDF ontologies
- `listTurtleFiles()`: Recursively finds all `.ttl` files
- `parseTurtle()`: Parses Turtle files into Jena models
- `inferTriples()`: Applies reasoning to infer additional triples
- `modelToJsonLd()`: Converts RDF models to JSON-LD
- `frameJsonLd()`: Applies JSON-LD framing
- `extractGraph()`: Extracts `@graph` arrays from framed JSON-LD
- `writeGraphToParquet()`: Converts JSON to Parquet using Spark and saves the schema
- `writeModelToTurtle()`: Writes inferred models to Turtle format
- `writeJson()`: Writes JSON output files
- `validateModel()`: Validates models using OWL reasoning
- `processModel()`: Complete processing pipeline for individual models
- `main()`: Main entry point with the complete workflow
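As a toy illustration of the `extractGraph()` step, here the parsed JSON-LD is modeled as a plain Scala `Map` rather than a Jackson tree (names and types are illustrative, not the real signature):

```scala
object GraphExtractSketch {
  // Returns the @graph array of a framed JSON-LD document if present
  // and non-empty, else None, matching the Option-based error handling
  // described in the notes.
  def extractGraph(framed: Map[String, Any]): Option[List[Any]] =
    framed.get("@graph").collect { case g: List[_] if g.nonEmpty => g }
}
```

A framed document without a `@graph` key, or with an empty one, simply yields `None` and is skipped downstream.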
Generates SHACL shapes from OWL ontology:
- `generate()`: Main method that creates a SHACL model from the OWL ontology
- `generateNodeShape()`: Creates a SHACL NodeShape for each OWL class
- `generatePropertyShape()`: Creates SHACL property shapes from OWL restrictions
- `createPath()`: Handles property paths, including inverse properties
- `createOrList()`: Creates SHACL OR constraints from OWL union classes
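For a rough picture of the output, a NodeShape generated for an OWL class with a property restriction might look along these lines (hypothetical `ex:` names; the actual `generated-shapes.ttl` will differ):

```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <https://example.org/ns#> .

ex:InstallatieShape a sh:NodeShape ;
    sh:targetClass ex:Installatie ;
    sh:property [
        sh:path ex:heeftExploitant ;   # inverse properties would use [ sh:inversePath ... ]
        sh:class ex:Exploitant ;
        sh:minCount 1
    ] .
```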
Validates RDF models against SHACL shapes:
- `loadShapes()`: Loads SHACL shapes from a file
- `validate()`: Validates a model against the SHACL shapes
- `printReport()`: Prints validation results in a readable format
Sorts and subsets ontologies for efficient processing:
- `completeOntology`: Loads and combines all ontology files
- `structuralSubset`: Extracts structural relationships (`subPropertyOf`, `subClassOf`, `inverseOf`, `domain`, `range`)
- `disjointSubset`: Extracts disjointness relationships (`disjointWith`)
- `extractStructuralSubset()`: Private method for structural subset extraction
- `extractDisjointSubset()`: Private method for disjoint subset extraction
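In spirit, the two subset extractors are predicate filters over the ontology's triples. A stdlib-only sketch with plain tuples standing in for Jena statements (the real methods operate on Jena `Model`s):

```scala
object SubsetSketch {
  type Triple = (String, String, String) // (subject, predicate, object)

  // Predicates kept by the structural subset, per the workflow above.
  val structuralPredicates: Set[String] =
    Set("rdfs:subPropertyOf", "rdfs:subClassOf", "owl:inverseOf",
        "rdfs:domain", "rdfs:range")

  def structuralSubset(ontology: Set[Triple]): Set[Triple] =
    ontology.filter { case (_, p, _) => structuralPredicates.contains(p) }

  def disjointSubset(ontology: Set[Triple]): Set[Triple] =
    ontology.filter { case (_, p, _) => p == "owl:disjointWith" }
}
```

The payoff is that reasoning and disjointness checking each run against a small focused model instead of the complete ontology.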
The application performs two types of validation:
Uses Jena's OWL reasoner to validate the inferred model against the reasoning ontology. This validation checks for logical consistency and class hierarchies.
The application automatically generates SHACL shapes from the OWL ontology using the OwlToShaclGenerator and then validates the inferred model against these shapes using the ShaclValidator. This process includes:
1. **Shape Generation**: The `OwlToShaclGenerator.generate()` method creates SHACL shapes from OWL restrictions:
   - Converts OWL classes to SHACL NodeShapes
   - Translates OWL property restrictions to SHACL property shapes
   - Handles inverse properties using SHACL `inversePath`
   - Creates OR constraints from OWL union classes
2. **Validation**: The `ShaclValidator.validate()` method validates the inferred model against the generated SHACL shapes and produces a detailed validation report.
3. **Reporting**: The `ShaclValidator.printReport()` method outputs validation results in a readable format, showing conformance status and detailed error messages for non-conformant data.
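For comparison, a standalone SHACL validation call against Jena's `org.apache.jena.shacl` module looks roughly like this. This is a sketch assuming Jena's SHACL API; the project's `ShaclValidator` may be implemented differently:

```scala
import org.apache.jena.rdf.model.Model
import org.apache.jena.shacl.ShaclValidator
import org.apache.jena.shacl.lib.ShLib

object ShaclSketch {
  // Validates a data model against a shapes model with Jena's SHACL
  // engine and prints one entry per violation (focus node, path, message).
  def validate(shapes: Model, data: Model): Boolean = {
    val report = ShaclValidator.get().validate(shapes.getGraph, data.getGraph)
    if (!report.conforms()) ShLib.printReport(report)
    report.conforms()
  }
}
```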
Both validation processes provide detailed error messages when validation fails, including:
- Focus nodes (the specific RDF nodes that failed validation)
- Property paths where validation failed
- Descriptive error messages
- The application preserves the directory structure of input files in output directories
- Each input Turtle file generates corresponding output files in all formats
- The workflow ensures that the final Parquet files contain the complete inferred data
- Reasoning includes domain/range inference, subproperty inference, and other rules defined in the rules file
- The application handles empty or invalid inputs gracefully by returning `Option`/`None` values
- Validation results are logged with detailed error messages for debugging
- Parquet files include schema information saved as `_schema.json` in each output directory
- Ontology sorting improves performance by creating focused subsets for different processing needs
- The application supports both individual file processing and consolidated processing of all input data