# Orion/config/vmware_labbench.yaml (Genentech/Orion, main)

# Formerly called: anthropic_filemap.yaml
# This template is heavily inspired by Anthropic's computer use demo, but you can use
# it with any LM.
agent:
  run_mode: swe_gui
  max_trajectory_length: 200
  use_workflow_memory: false
  update_workflow_memory: false
  templates:
    system_template: |-
      You are a helpful assistant that can interact with a computer to solve tasks. You are good at understanding scientific content, using software, and browsing the internet like a biologist.
      Try to be as efficient as possible and give the final answer once you have enough information. Don't perform excessive double-checking.
      When you are finished, use bash echo to print the following in your response:
      The answer is A/B/C/D: [actual answer text]. **Task Completed**
    instance_template: |-
      <working_dir>
      {{working_dir}}
      </working_dir>
      Consider the following problem statement:

      <pr_description>
      {{problem_statement}}
      </pr_description>

      Whenever you view an image, you must analyze it and write down your key observations in the subsequent thought before proceeding to the next step.

    next_step_template: |-
      OBSERVATION:
      {{observation}}
    next_step_no_output_template: |-
      Your command ran successfully and did not produce any output.
  tools:
    type: vmware
    execution_timeout: 1200
    total_execution_timeout: 72000
    env_variables:
      PAGER: cat
      MANPAGER: cat
      LESS: -R
      PIP_PROGRESS_BAR: 'off'
      TQDM_DISABLE: '1'
      GIT_PAGER: cat
      LOCAL_DEV_PATH: "/home/user/Documents/WorkingDir"
      PL_POLARS_FILE_CACHING: '0'
    bundles:
      - path: tools/registry
      - path: tools/edit_anthropic
      - path: tools/image_tools
      # - path: tools/cellprofiler
      # - path: tools/napari_viewer_tools
      # - path: tools/pdf_tools
      # - path: tools/jump
      - path: tools/x11_utils
      # - path: tools/module_search
    # Enable subagents for GUI software automation
    enable_subagents: true
    subagent_bundles:
      # - path: subagents/cellprofiler
      - path: subagents/chromium
      - path: subagents/image_viewer
    registry_variables:
      USE_FILEMAP: 'true'
      SUBMIT_REVIEW_MESSAGES:
        - |
          Thank you for your work on this issue. Please carefully follow the steps below to help review your changes.

          1. If you made any changes to your code after running the reproduction script, please run the reproduction script again.
            If the reproduction script is failing, please revisit your changes and make sure they are correct.
            If you have already removed your reproduction script, please ignore this step.
          2. Remove your reproduction script (if you haven't done so already).
          3. If you have modified any TEST files, please revert them to the state they had before you started fixing the issue.
            You can do this with `git checkout -- /path/to/test/file.py`. Use below <diff> to find the files you need to revert.
          4. Run the submit command again to confirm.

          Here is a list of all of your changes:

          <diff>
          {{diff}}
          </diff>
    enable_bash_tool: true
    parse_function:
      type: function_calling
  history_processors:
    - type: keep_important
      last_n_messages: 50
      last_n_tagged_messages: 3

  # Desktop agent configuration for GUI automation
  desktop_agent:
    platform: "ubuntu"
    model: "claude-sonnet-4-5-20250929"  # Bedrock alternative: "bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0"
    max_tokens: 1500
    top_p: 0.9
    temperature: 0.5
    action_space: "computer_13"  # "computer_13" or "pyautogui"
    observation_type: "screenshot"
    max_trajectory_length: 5
    a11y_tree_max_tokens: 10000

  reflection_template: |
    PREVIOUS ATTEMPT ANALYSIS - QuPath Learning

    Based on the given trajectory, provide a structured reflection on:

    1. WORKFLOW LEARNED: Summarize what QuPath workflow or techniques were learned
    2. QUALITY ASSESSMENT: Give a detailed assessment of the quality of the learning attempt, including what was done well and what could be improved
    3. NEXT STEPS: Specific improvements and things to learn for the next attempt

    Provide a concise technical summary and reflection focusing on issues and improvements:

  summary_template: |
    Generate a single sentence in this format:
    The answer is A/B/C/D: [actual answer text].
    This summary should concisely capture the final answer to the problem statement based on all the information gathered during the trajectory.

  subagent_summary_template: |
    Based on the subagent interactions in the trajectory, provide a structured summary.
    Include all files that have been generated so far and the observations from the GUI interactions.
    Include every detail that would help a biologist understand what was observed, along with any knowledge-intensive observations, for future reference.

  workflow_template: |
    WORKFLOW EXTRACTION - QuPath using GUI

    Based on the successful trajectory, extract a structured workflow in this format:

    Step 1: Description of step 1
    Step 2: Description of step 2
    ...

  evaluation:
    type: cellprofiler
    ground_truth_dir: "/Users/machang/Documents/research-work/CellMMAgent/metrics/ground_truth"
    compare_dir: "/Users/machang/Documents/research-work/CellMMAgent/CellDev/measurements"
    threshold: 0.5

  # retry_loop:
  #   type: score
  #   max_attempts: 1
  #   accept_score: 0.8
  #   cost_limit: 20.0
  #   max_trajectory_length: 100
  #   model:
  #     name: claude-sonnet-4-20250514
  #     per_instance_cost_limit: 50.0
  #     temperature: 0.0
  #     max_output_tokens: 500
  #   reviewer_config:
  #     type: reviewer
  #     system_template: "You are a helpful assistant."
  #     instance_template: |
  #       This is a tutorial learning task, so the score is not important and acts as a placeholder.
  #       The goal is to provide a score between 0.0 and 1.0 based on the unit test results in format <score>{your score here}</score>.

  #       Unit test results:
  #       {{test}}

  #       Generate Review:
  #     traj_formatter:
  #       filter: []
  #       output_filter: []
  #       item_template: "Model: {{response}}\n\nObservation: {{observation}}"
  #       only_show_last_n_output: 0