Add scripts to support metadata mapping via LLM #120

Bankso · 2025-09-09T22:43:01Z

This PR adds three scripts that handle the inputs and outputs of LLM-based agents that apply graph-based RAG via serialized RDF triples.

Added script build_template_ttl.py
- Converts a metadata template CSV to a ttl file defining the template. The ttl file can be used as input for the arachne agent and would be available as a target. The only requirement for the CSV is that the column names are listed in the first row.
Added script csv_to_ttl.py
- Converts a schematic data model CSV to RDF triples and serializes them to a ttl file. The ttl file can be used as a graph input for the arachne agent.
Added script process_arachne_mapping.py
- Takes a JSON-formatted mapping, generated by the arachne agent, as input. Requires access to the metadata manifest processed by the arachne agent.
- Returns a TSV file of metadata, built from the mapping JSON.
Updated requirements.txt to reflect the current environment
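
To illustrate the mapping step described for process_arachne_mapping.py, here is a minimal sketch of turning a JSON mapping plus manifest rows into TSV text. The mapping shape used here (`{"target_column": "source_column" or null}`) and the function name `apply_mapping` are assumptions made for illustration; the JSON produced by the arachne agent and the script's real interface may look different.

```python
import csv
import io
import json


def apply_mapping(mapping_json: str, manifest_rows: list[dict]) -> str:
    """Build TSV text whose columns are the mapping's target names."""
    # Hypothetical mapping shape: {"target_column": "source_column" or null}.
    mapping = json.loads(mapping_json)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(mapping), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in manifest_rows:
        record = {}
        for target, source in mapping.items():
            # Unmapped targets and missing source values become empty cells.
            value = row.get(source) if source else None
            record[target] = "" if value is None else value
        writer.writerow(record)
    return out.getvalue()
```

For example, `apply_mapping('{"study_id": "Study Key"}', [{"Study Key": "S1"}])` produces a header line with `study_id` followed by a row containing `S1`.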

Addresses an issue where creating/adding files to a Dataset is "Forbidden", which appears to show up when large numbers of files are being added to a single dataset. The changes in this commit will identify the number of groups required to fit within a pre-defined file max, create additional dataset(s) required for the file groups, then add one file group to each dataset.
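
The grouping step described above can be sketched as follows. This is only an illustration of splitting files so that each group fits under a cap; the function name `split_into_groups` and its parameters are hypothetical, and the actual file maximum and dataset creation calls used by the script are not shown here.

```python
import math


def split_into_groups(file_ids: list[str], max_per_dataset: int) -> list[list[str]]:
    """Split file IDs into the fewest groups that each fit under the cap."""
    if max_per_dataset < 1:
        raise ValueError("max_per_dataset must be at least 1")
    # Number of datasets needed so that no dataset exceeds the cap.
    n_groups = math.ceil(len(file_ids) / max_per_dataset)
    return [
        file_ids[i * max_per_dataset:(i + 1) * max_per_dataset]
        for i in range(n_groups)
    ]
```

Each returned group would then be added to its own dataset, with additional datasets created when more than one group is needed.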

Additional updates to add large numbers of files to datasets. The script will now create an appropriate number of groups based on the number of files, taking into account the number of datasets that will be used to store the files. Newly created datasets are added as new rows to the DSP and include the values from the source row.

Initial version. Generates valid ttl input for the arachne agent, based on this input CSV: https://docs.google.com/spreadsheets/d/1LLpSIFAh12YdKnGfzXMxGpoKCaEH90nDx-QvncaIJlk/edit?gid=264257960#gid=264257960
The sheet linked above contains a version of the MC2 Center data model, which has been expanded and packaged for conversion to triples. Currently planning an update to the script that allows the use of a standard schematic-compatible data model CSV as the primary input.

Added a couple of functions to support extraction of RDF content directly from a schematic data model CSV. The functions use the upcoming version of the MC2 Center data model (v12.0.0), which includes the new columnType column and CDE:Public Id references in the Property column, where applicable.

- add function to convert TSV from CRDC Model Navigator into serialized RDF
- add function to encode CRDC Type to valid RDF types
- add code to document valid/permissible values in TTL triples
- add input routing to direct CRDC models and schematic models to specific processing functions
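
As a rough illustration of the type-encoding item above, a lookup from CRDC type names to XSD datatypes might look like the sketch below. The CRDC type names in the table, the fallback to `xsd:string`, and the function name `encode_crdc_type` are all assumptions; the values that actually appear in a CRDC Model Navigator export may differ.

```python
# Hypothetical lookup table: the CRDC type names on the left are assumed,
# while the values on the right are standard XSD datatypes.
CRDC_TO_XSD = {
    "string": "xsd:string",
    "integer": "xsd:integer",
    "number": "xsd:decimal",
    "boolean": "xsd:boolean",
    "datetime": "xsd:dateTime",
}


def encode_crdc_type(crdc_type: str) -> str:
    """Return an XSD datatype CURIE for a CRDC type name."""
    # Unknown types fall back to xsd:string (an assumption for this sketch).
    return CRDC_TO_XSD.get(crdc_type.strip().lower(), "xsd:string")
```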

In a schematic-based model, if 'primary_key' is listed in 'Properties', 'is_key' will be 'true'. If the primary key of another model is listed in Properties (in the form component_id, e.g., 'Study_id'), a triple will be created to document the linkage. This is used by graphing software to render diagrams.
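
A minimal sketch of the key handling described above follows. The `ex:` prefix, the predicate names `ex:is_key` and `ex:links_to`, and the function name `key_triples` are placeholders invented for this example, not the tags used by csv_to_ttl.py.

```python
def key_triples(attribute: str, properties: list[str], components: set[str]) -> list[str]:
    """Return Turtle lines for key status and cross-component links."""
    subject = f"ex:{attribute}"
    # 'is_key' is true only when 'primary_key' is listed in Properties.
    is_key = "primary_key" in properties
    lines = [f'{subject} ex:is_key "{str(is_key).lower()}" .']
    for prop in properties:
        # A property of the form '<Component>_id' is treated as a reference
        # to that component's primary key.
        if prop.endswith("_id"):
            target = prop[: -len("_id")]
            if target in components:
                lines.append(f"{subject} ex:links_to ex:{target} .")
    return lines
```

With `components={"Study", "Biospecimen"}`, an attribute whose Properties are `["primary_key", "Study_id"]` yields one key triple and one linkage triple pointing at `ex:Study`.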

Bankso added 30 commits May 8, 2025 16:30

Create map_to_crdc.py

07ecd83

Generalize functions and simplify inputs

0b61f16

Current expectation is that this information will be pulled from Synapse tables, using an input CSV that denotes "Component", "Table Synapse Id"

Update .gitignore

fd29230

Add script description

4aa1226

Create example_input_map_to_crdc.csv

7d6ede0

Example input metadata reference sheet for map_to_crdc.py

Update map_to_crdc.py

de410d5

Add else: continue to the column-populating loop to decrease indent count. Replace valid_values with "_" in the for loop variable assignment.

Update example_input_map_to_crdc.csv

1f9a2bf

Update map_to_crdc.py

1dc4136

Refactor code to pull info from datasets

161ac28

Update example_input_map_to_crdc.csv

3cca9a5

Update table_to_annotations.py

7b37f9d

Update build_datasets.py

4979407

Update .gitignore

5343535

Don't track utils/example_files

Moved crdc example input to example_files folder

589dde0

Update map_to_crdc.py

b0fbf78

Revert changes to build_datasets.py

7419461

New features are being logged in a separate branch

Create process_arachne_mapping.py

998de6e

Initial version, created by ChatGPT, based off of example JSON mapping file.

Update process_arachne_mapping.py

bcdf917

Improve handling when value is None

031502e

Update process_arachne_mapping.py

a8e137b

Add try/except block to create output dir if not defined in arachne mapping output

Create build_template_ttl.py

77f8dee

Add a function that generates serialized RDF triples, representing a metadata template, from a metadata template CSV.
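
The idea behind this commit can be sketched as below: read only the header row of the template CSV and emit one set of triples per column. The `ex:` prefix, the `ex:Template`, `ex:hasColumn`, and `ex:label` terms, and the function name `template_to_ttl` are hypothetical stand-ins; the real script defines its own tags. The sketch also assumes column names contain no characters that need escaping in Turtle beyond spaces.

```python
import csv
import io


def template_to_ttl(csv_text: str, template_name: str,
                    base: str = "http://example.org/") -> str:
    """Serialize a template's column names as Turtle triples."""
    # Only the first row (the column names) is required.
    header = next(csv.reader(io.StringIO(csv_text)))
    lines = [f"@prefix ex: <{base}> .", "", f"ex:{template_name} a ex:Template ."]
    for column in header:
        label = column.strip()
        # Replace spaces so the column name can serve as a local name.
        local = label.replace(" ", "_")
        lines.append(f"ex:{template_name} ex:hasColumn ex:{local} .")
        lines.append(f'ex:{local} ex:label "{label}" .')
    return "\n".join(lines) + "\n"
```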

Update csv_to_ttl.py

d0762ae

- update typing and docstrings
- allow base tag to be defined at input; use base tag to construct other RDF tags that are not external references

Remove map_to_crdc.py from branch

1ce7c68

No longer being developed

Update requirements.txt

7ae79ea

Update csv_to_ttl.py

2802068

Retain CDE full name in description for CRDC model attributes

Bankso added 21 commits September 10, 2025 16:28

Give 1.0.0 as default

16a6f5d

Move file once all RDF has been added

c8eb704

Update naming conventions

5c57047

Update naming for output directory

628d390

Parse name and version from filename

8e09dc6

For GC manifest TSVs
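
One way to parse a name and version from a filename is sketched below. The filename convention assumed here (`<name>_v<version>.tsv`) is a guess for illustration, as is the use of 1.0.0 as the fallback version when none is found; the function name `parse_name_and_version` is also hypothetical.

```python
import re
from pathlib import Path

# Assumed convention: "<name>_v<major>.<minor>[.<patch>]" before the extension.
_VERSIONED = re.compile(r"^(?P<name>.+?)[_-]v?(?P<version>\d+(?:\.\d+)+)$")


def parse_name_and_version(path: str, default_version: str = "1.0.0") -> tuple[str, str]:
    """Split a manifest filename into (name, version)."""
    stem = Path(path).stem
    match = _VERSIONED.match(stem)
    if match:
        return match["name"], match["version"]
    # No recognizable version in the filename: fall back to the default.
    return stem, default_version
```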

Update description

d100205

Update description

b8b1372

Add missing " ."

8c8301d

Fix Valid Value parsing to build list

4bd93d2

Adjust quoting to address read errors

8657a5e

Fix quoting and don't write CDE if TBD

d157e7b

Remove "default_literal" reference

47b79f0

Add functions to render model

eda3a4c

Model can be rendered as a multidirectional graph in an interactive window (by passing arg -ig) OR it can be displayed as a table diagram and saved as a PNG (by passing arg -bg)

Add code to extract schematic model components

352b96e

Can now build visualizations of graphs based on selected components of a schematic model, by passing one or more component names to arg "-s" at runtime
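
The runtime arguments mentioned in these commits (-ig, -bg, and -s) could be declared with argparse roughly as follows. This is a sketch of the flag handling only; the help text, the function name `build_parser`, and the choice to leave -ig and -bg independent rather than mutually exclusive are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the visualization and subset arguments (illustrative only)."""
    parser = argparse.ArgumentParser(description="Render a data model graph")
    parser.add_argument("-ig", action="store_true",
                        help="show the model as an interactive graph")
    parser.add_argument("-bg", action="store_true",
                        help="build a table diagram and save it as a PNG")
    parser.add_argument("-s", nargs="+", default=None, metavar="COMPONENT",
                        help="one or more component names to include")
    return parser
```

Calling `build_parser().parse_args(["-bg", "-s", "Study", "Biospecimen"])` sets `bg` to true and collects the two component names into a list.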

Refactor to use prefixes in RDF

0df5b93

Continue RDF prefix refactor

5fea4f4

Adjust node formatting

059ec73

Fixes syntax errors in dot strings, allows rendering of graph

Replace CRDC tags with CURIEs, adjust attribute filtering

8ed4f09

Update subset function to handle conditional attributes

791bb69

Adjust subset output functions and docstrings

424271d

Adjust output naming

2a76a40

node_name is used to label output ttl

Bankso force-pushed the add-crdc-mapping-script branch from 4c43fab to 2a76a40 on October 3, 2025 20:12

Bankso added 8 commits October 3, 2025 13:31

Adjust input argument references

b2aab9f

Move node_name function to main, adjust ins and outs

c4bf3b9

This is now a common function, since it relies on the subset argument, not the processed input.

Pass base_ref to ttl header

fe13a79

Refactor to use prefixes

ce07275

Add visualization function

6aab19d

Update docstrings, add bg flag

9785d33

-bg flag toggles if the template is visualized and saved as a PNG

Use pydotplus; add graph engine input

0d301d9
