Skip to content

nf-osi/portal-cleanup-crew

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

29 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Synapse Data Curation Assistant

This project provides an interactive command-line tool to assist with data curation and management tasks in Synapse.org data portals. It uses an AI-powered agentic system (built with CrewAI) to identify and correct metadata annotation errors.

Features

  • Interactive Correction: An interactive workflow to correct Synapse data annotations based on a provided data model (e.g., a JSON-LD schema).
  • AI-Powered Suggestions: Uses a Large Language Model (LLM) to intelligently find and suggest corrections for invalid or non-standard annotation values.
  • User-in-the-Loop: Puts the human user in control. The user reviews, modifies, and approves every change before it is applied to Synapse.
  • Configurable: Easily configured to work with different Synapse views, data models, and LLMs.

Annotation Correction Workflow

The primary workflow is for correcting annotations in a Synapse File View. It follows these steps:

  1. Column Iteration: The tool iterates through each column of the specified Synapse File View.
  2. Agent Investigation: For each column, an AI agent: a. Determines the list of valid values from the linked data model. b. Finds all unique values in the Synapse column. c. Compares the two lists and generates a correction plan for any discrepancies (e.g., typos, non-standard terms).
  3. Interactive Review: The tool presents the agent's plan to the user. For each proposed change, the user can:
    • Accept the suggestion.
    • Provide a different correction.
    • Reject the suggestion. The user can also provide corrections for values the agent couldn't map, or choose to skip them.
  4. Final Approval: After the review is complete, the tool presents a summary of all the changes that will be made. The user must give a final 'yes' to proceed.
  5. Execution: Upon approval, the tool executes the plan, updating all relevant entities in Synapse in parallel.

Setup

  1. Clone the repository and install dependencies:

    git clone <repository-url>
    cd <repository-directory>
    pip install -r requirements.txt
  2. Configure Credentials:

    • Make a copy of example_creds.yaml and name it creds.yaml.
    • Edit creds.yaml and add your API key for the LLM provider. The default is configured for OpenRouter.
      # creds.yaml
      llm:
        credentials:
          OPENROUTER_API_KEY: "YOUR_API_KEY_HERE"
  3. Configure the Tool:

    • Open config.yaml.
    • Under annotation_corrector, set the main_fileview to the Synapse ID of the view you want to curate.
    • Ensure the data_model_url points to the correct JSON-LD data model for your project.
  4. Log in to Synapse:

    • The tool can use an existing Synapse configuration file (~/.synapseConfig). You can create one by running synapse login in your terminal and following the prompts.
    • Alternatively, the tool will prompt you to enter your Synapse username and password/personal access token when it starts.

Usage

Run the main script from the root of the project:

python src/main.py

The application will start, and you can select the "Correct Synapse Annotations" task from the menu to begin the workflow.

  • Python 3.10+
  • Flask
  • pandas (1.3.0 - 1.5.x)
  • numpy
  • difflib (standard library)

Limitations

  • Large files (>100MB) may be slow to process in the browser
  • Complex validation rules beyond simple value matching are not supported
  • The application does not validate relationship constraints or cross-field validations

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •