Collaborator

@Bankso Bankso commented Sep 9, 2025

This PR adds three scripts that handle the inputs and outputs of LLM-based agents that apply graph-based RAG via serialized RDF triples.

  • Added script build_template_ttl.py

    • Converts a metadata template CSV into a TTL file that defines the template. The TTL file can be used as input for the arachne agent and would be available as a target. The only requirement for the CSV is that the column names are listed in the first row.
  • Added script csv_to_ttl.py

    • Converts a schematic data model CSV to RDF triples and serializes them to a TTL file. The TTL file can be used as a graph input for the arachne agent.
  • Added script process_arachne_mapping.py

    • Takes a JSON-formatted mapping, generated by the arachne agent, as input. Requires access to the metadata manifest processed by the arachne agent.
    • Returns a TSV file of metadata built from the mapping JSON.
  • Updated requirements.txt to reflect the current environment

Bankso added 30 commits May 8, 2025 16:30
The current expectation is that this information will be pulled from Synapse tables, using an input CSV that denotes "Component" and "Table Synapse Id"
Example input metadata reference sheet for map_to_crdc.py
Add else: continue to the column-populating loop to reduce indentation
Replace valid_values with "_" in the for-loop variable assignment
Addresses an issue where creating a Dataset or adding files to one is "Forbidden", which appears to occur when a large number of files is added to a single dataset. The changes in this commit determine the number of groups needed to stay within a predefined file maximum, create the additional dataset(s) required for the file groups, then add one file group to each dataset.
Don't track utils/example_files
Additional updates to add large numbers of files to datasets. The script will now create an appropriate number of groups based on the number of files, taking into account the number of datasets that will be used to store the files. Newly created datasets are added as new rows to the DSP and include the values from the source row.
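
The grouping step described in the two dataset commits above could look roughly like the sketch below. The function name `split_into_groups`, the `max_files` value, and the placeholder IDs are assumptions; the actual limit and the Synapse calls that create datasets and attach files are not shown.

```python
def split_into_groups(file_ids: list[str], max_files: int) -> list[list[str]]:
    """Split file IDs into consecutive groups of at most max_files each."""
    if max_files < 1:
        raise ValueError("max_files must be at least 1")
    return [
        file_ids[start:start + max_files]
        for start in range(0, len(file_ids), max_files)
    ]


# Placeholder IDs; real inputs would be Synapse file identifiers.
files = [f"syn{i}" for i in range(10)]
groups = split_into_groups(files, max_files=4)

# One dataset would be needed per group.
print(len(groups), [len(group) for group in groups])
```

Each resulting group would then be added to its own dataset, so the number of groups is also the number of datasets required.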
New features are being logged in a separate branch
Initial version, created by ChatGPT, based on an example JSON mapping file.
Initial version; generates valid TTL input for the arachne agent from this input CSV: https://docs.google.com/spreadsheets/d/1LLpSIFAh12YdKnGfzXMxGpoKCaEH90nDx-QvncaIJlk/edit?gid=264257960#gid=264257960

The sheet linked above contains a version of the MC2 Center data model, which has been expanded and packaged for conversion to triples.

Currently planning an update to the script that allows the use of a standard schematic-compatible data model CSV as the primary input.
Add try/except block to create output dir if not defined in arachne mapping output
Add a function that generates serialized RDF triples, representing a metadata template, from a metadata template CSV.
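
A minimal sketch of the template conversion, assuming only that column names sit in the first CSV row. It writes Turtle by hand with the standard library; the `ex:` vocabulary (`ex:Template`, `ex:Column`, `ex:partOf`), the base IRI, and the function name are hypothetical and do not reflect the tags used in build_template_ttl.py.

```python
import csv
import io


def template_csv_to_ttl(csv_text: str, template_name: str,
                        base: str = "http://example.org/template/") -> str:
    """Emit Turtle for a template whose columns come from the CSV header row.

    Assumes column names contain only letters, digits, spaces, and
    underscores, so they can be turned into prefixed names without escaping.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    lines = [
        f"@prefix ex: <{base}> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
        f"ex:{template_name} a ex:Template .",
    ]
    for column in header:
        label = column.strip()
        local = label.replace(" ", "_")
        lines.append(f"ex:{local} a ex:Column ;")
        lines.append(f'    rdfs:label "{label}" ;')
        lines.append(f"    ex:partOf ex:{template_name} .")
    return "\n".join(lines) + "\n"


# Only the header row is read; data rows are ignored.
ttl = template_csv_to_ttl("Study Name,File Format\nfoo,bar\n", "ExampleTemplate")
print(ttl)
```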
Added a couple of functions to support extraction of RDF content directly from a schematic data model CSV. The functions use the upcoming version of the MC2 Center data model (v12.0.0), which includes the new columnType column and CDE:Public Id references in the Property column, where applicable.
- update typing and docstrings
- allow base tag to be defined at input; use base tag to construct other RDF tags that are not external references
No longer being developed
- add function to convert TSV from CRDC Model Navigator into serialized RDF
- add function to encode CRDC Type to valid RDF types
- add code to document valid/permissible values in TTL triples
- add input routing to direct CRDC models and schematic models to specific processing functions
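
The type-encoding step in the list above might be sketched as a lookup from a type label to an XSD datatype IRI. The labels in `TYPE_TO_XSD` and the fallback to `xsd:string` are assumptions for illustration; the type names in the CRDC Model Navigator export and the choices made in the script may differ.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical mapping; the type labels used by the CRDC export may differ.
TYPE_TO_XSD = {
    "string": "string",
    "integer": "integer",
    "number": "decimal",
    "boolean": "boolean",
    "datetime": "dateTime",
}


def encode_type(crdc_type: str) -> str:
    """Return an XSD datatype IRI for a type label, defaulting to xsd:string."""
    key = crdc_type.strip().lower()
    return XSD + TYPE_TO_XSD.get(key, "string")


print(encode_type("Integer"))
print(encode_type("some unrecognized type"))
```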
Retain CDE full name in description for CRDC model attributes
@Bankso Bankso force-pushed the add-crdc-mapping-script branch from 4c43fab to 2a76a40 October 3, 2025 20:12
Bankso added 8 commits October 3, 2025 13:31
This is now a common function, since it relies on the subset argument, not the processed input.
-bg flag toggles whether the template is visualized and saved as a PNG
In a schematic-based model, if 'primary_key' is listed in 'Properties', 'is_key' will be 'true'
If the primary key of another model is listed in Properties (in the form component_id, e.g., 'Study_id'), a triple will be created to document the linkage. This is used by graphing software to render diagrams.
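
The two rules in the commits above can be sketched as plain tuples rather than a real RDF graph. The `is_key` name and the `component_id` convention come from the commit messages; the `links_to` predicate, the function name, and the sample attributes are hypothetical.

```python
def key_triples(attribute: str, properties: list[str],
                known_components: set[str]) -> list[tuple[str, str, str]]:
    """Derive key and linkage triples from an attribute's Properties entries."""
    triples = []
    for prop in properties:
        if prop == "primary_key":
            triples.append((attribute, "is_key", "true"))
        elif prop.endswith("_id") and prop[:-3] in known_components:
            # 'Study_id' refers to the primary key of the 'Study' component.
            triples.append((attribute, "links_to", prop[:-3]))
    return triples


components = {"Study", "Biospecimen"}  # hypothetical component names
print(key_triples("Biospecimen Key", ["primary_key"], components))
print(key_triples("Study Key", ["Study_id"], components))
```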
2 participants