This experiment was designed to show that noWorkflow can work in collaborative environments, preserving its functions while helping scientists who work on the same experiment.
We decided to simulate a scenario where two scientists, Bob and Alice, work on a script that checks whether precipitation in Rio de Janeiro remains constant across years. The code is modified along the way so that the experiment runs successfully on the last two trials.
The steps we took in this simulation are listed below:
- First, we set up the server and create the experiment on it
- Then Bob runs the first trial on his local computer and pushes it to the server
- With the experiment and its first trial on the server, Alice pulls it, modifies the code, runs it on her local computer, and finally pushes it to the server
- Seeing that there is a new trial on the server, Bob pulls it, but he restores the experiment to the first trial. Then he modifies the code, runs it with a parameter, and pushes it to the server.
- This time, Alice is the one who notices the new trial and pulls the experiment to her local machine. Afterward, she runs the experiment with a parameter different from the one Bob used, and she pushes the trial to the server, ending our simulation.
Although our simulation of a collaborative environment was successful, we need a way to judge whether the collected provenance meets researchers' needs. So, we decided to use the First Provenance Challenge and its questions to check whether noWorkflow's collected provenance remains useful in a collaborative environment.
However, there are two problems with using the First Provenance Challenge and its questions directly. The first is that the challenge is about a specific workflow that differs from ours. The second is that the questions target that workflow, and the key word is 'workflow': it is not a script like our experiment. Therefore, we decided to adapt the questions of the First Provenance Challenge to cover scripts in general.
Before adapting the First Provenance Challenge's questions to scripts, we propose some notations to differentiate between the aspects of a script. They are:
- T is the set of all trials in an experiment
- s is a script
- F is the set of all functions in a script s
- f is one function of F (f∈F)
- PF is the set of all parameters of all functions F in a script s
- pf is one parameter of PF (pf∈PF)
- P is the set of all parameters a script s receives when it is called
- p is one parameter of P (p∈P)
- OS is the set of all outputs of a script s
- os is one output of OS (os∈OS)
- OSF is the set of all outputs (return values) of the functions of a script
- osf is one output (return value) of OSF (osf∈OSF)
- d is a date
- a is an annotation
- ac is the content of an annotation a
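To make the notation concrete, it can be sketched as plain Python sets for one hypothetical trial. Every name below (read, average, out.png, and so on) is an invented example, not data from our experiment:

```python
# Hypothetical illustration of the notation; all names are invented examples.
T = {"trial-1", "trial-2"}                  # all trials in the experiment
s = "experiment.py"                         # a script
F = {"read", "average", "plot"}             # all functions in script s
f = "read"                                  # one function of F (f ∈ F)
PF = {("read", "path"), ("average", "xs")}  # all parameters of all functions
pf = ("read", "path")                       # one parameter of PF (pf ∈ PF)
P = {"2"}                                   # parameters s received when called
OS = {"out.png"}                            # all outputs of script s
OSF = {("read", "raw-data")}                # all return values of functions of s
d = "2024-07-18"                            # a date

# Membership checks mirror the set notation used in the queries below.
assert f in F and pf in PF
```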
| Query number | First Provenance Challenge Query | Query change to general script |
|---|---|---|
| 1 | Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed etc. | Given an output os (os∈OS) of a script s, get only the functions F' (F'⊆F) that generated or changed os. |
| 2 | Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean. | Given an output os (os∈OS) of a script s, get the functions F' (F'⊆F) that generated or changed os, excluding everything prior to a function f (f∈F and f∈F'). |
| 3 | Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic. | --------- |
| 4 | Find all invocations of procedure align_warp using a twelfth order nonlinear 1365 parameter model (see model menu describing possible values of parameter "-m 12" of align_warp) that ran on a Monday. | Given a function f(f∈F) of a script s, get all calls of f that had pf(pf∈PF) as a parameter and were executed on the date d. |
| 5 | Find all Atlas Graphic images outputted from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095. The contents of a header file can be extracted as text using the scanheader AIR utility. | Get all the outputs OS of all trials T that had the parameter p(p∈P) as an input. |
| 6 | Find all output averaged images of softmean (average) procedures, where the warped images taken as input were align_warped using a twelfth order nonlinear 1365 parameter model, i.e. "where softmean was preceded in the workflow, directly or indirectly, by an align_warp procedure with argument -m 12." | Find all outputs(return) OSF of f(f∈F) if f is preceded by f'(f'∈F) and f' received the parameter pf(pf∈PF) |
| 7 | A user has run the workflow twice, in the second instance replacing each procedures (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs. The exact level of detail in the difference that is detected by a system is up to each participant. | Given two scripts s and s' implementing the same experiment, find the differences between the executions and functions of s and s'. |
| 8 | A user has annotated some anatomy images with a key-value pair center=UChicago. Find the outputs of align_warp where the inputs are annotated with center=UChicago. | Find the outputs(return) OSF of f(f∈F) where f received a parameter pf(pf∈PF) and pf was annotated with a specific annotation a. |
| 9 | A user has annotated some atlas graphics with key-value pair where the key is studyModality. Find all the graphical atlas sets that have metadata annotation studyModality with values speech, visual or audio, and return all other annotations to these files. | Find all the outputs OS of all trials T where the outputs were annotated. These annotations must have a specific content ac. Return all the other annotations of all OS that fulfill the requirement. |
Notice that we did not adapt Query number 3 to contemplate scripts. That is because Query number 3 refers to Stages that are exclusive to the First Provenance Challenge's workflow. We could arbitrarily split the script used in our experiment into stages, but the resulting query would then cover only our script rather than scripts in general.
Now that we have the First Provenance Challenge queries adapted to scripts, we can use noWorkflow's collected provenance and features to answer them.
1 - Given an output os (os∈OS) of a script s, get only the functions F' (F'⊆F) that generated or changed os.
This query can be answered with noWorkflow's dataflow visualization: generating the trial's dataflow graph and following the edges that lead to os shows only the functions that generated or changed it.
2 - Given an output os (os∈OS) of a script s, get the functions F' (F'⊆F) that generated or changed os, excluding everything prior to a function f (f∈F and f∈F').
The dataflow visualization also answers this query: after generating the graph, clicking the node of the function f restricts the view, excluding everything prior to f.
4 - Given a function f(f∈F) of a script s, get all calls of f that had pf(pf∈PF) as a parameter and were executed on the date d.
To answer this query, we first must select a trial. In our case, we selected the trial with the id "1700f88d-4c0b-4ee3-afaa-045606f7823f". Then we must select the date d; we chose 07/18/2024 at 9 a.m. We also must select a parameter and a function: we went with the parameter "p14.dat" and the function "read".
So, we want to find all calls of "read" with the parameter "p14.dat" that were executed on 07/18/2024 at 9 a.m. in the trial "1700f88d-4c0b-4ee3-afaa-045606f7823f". For that, we can use a SQL query. First, we query the evaluation table to get "p14.dat" as an evaluation; then we check which evaluations depend on it. After getting the evaluation's code_component id, we retrieve the code_component name (how the function is called in the code) and finally query all its calls, joining with the trial table so we can filter by date.
```sql
SELECT *
FROM code_component AS c
LEFT JOIN trial AS t ON c.trial_id == t.id
WHERE t.start LIKE "2024-07-18 09%" AND c.name == (
    SELECT c2.name
    FROM code_component AS c2
    WHERE c2.id IN (
        SELECT e1.code_component_id
        FROM evaluation AS e1
        WHERE e1.id IN (
            SELECT d.dependent_id
            FROM dependency AS d
            WHERE d.dependency_id IN (
                SELECT e.id
                FROM evaluation AS e
                WHERE e.repr == "'p14.dat'"
                  AND e.trial_id == "1700f88d-4c0b-4ee3-afaa-045606f7823f")
              AND d.trial_id == "1700f88d-4c0b-4ee3-afaa-045606f7823f"
              AND d.type == "argument")
          AND e1.trial_id == "1700f88d-4c0b-4ee3-afaa-045606f7823f")
      AND c2.trial_id == "1700f88d-4c0b-4ee3-afaa-045606f7823f"
      AND c2.name LIKE "%read%")
```
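As a sanity check, the nested-subquery strategy can be exercised against a toy in-memory database. The schema below is a simplified mock of the noWorkflow tables the query touches (only the relevant columns, matching the code_component id directly rather than by name), and every row is invented for illustration:

```python
import sqlite3

# Mock of the four tables the query touches; columns are a simplified
# subset of noWorkflow's schema, and all rows are invented.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE trial (id TEXT, start TEXT);
    CREATE TABLE code_component (id INTEGER, trial_id TEXT, name TEXT);
    CREATE TABLE evaluation (id INTEGER, trial_id TEXT,
                             code_component_id INTEGER, repr TEXT);
    CREATE TABLE dependency (trial_id TEXT, dependent_id INTEGER,
                             dependency_id INTEGER, type TEXT);
""")
TRIAL = "1700f88d-4c0b-4ee3-afaa-045606f7823f"
cur.execute("INSERT INTO trial VALUES (?, ?)", (TRIAL, "2024-07-18 09:00:12"))
# code_component 10 is the call to read(); evaluation 1 is the literal
# 'p14.dat', and evaluation 2 is the read() call that received it.
cur.execute("INSERT INTO code_component VALUES (10, ?, 'read')", (TRIAL,))
cur.execute("INSERT INTO evaluation VALUES (1, ?, 99, ?)", (TRIAL, "'p14.dat'"))
cur.execute("INSERT INTO evaluation VALUES (2, ?, 10, NULL)", (TRIAL,))
cur.execute("INSERT INTO dependency VALUES (?, 2, 1, 'argument')", (TRIAL,))

rows = cur.execute("""
    SELECT c.name, t.start
    FROM code_component AS c
    JOIN trial AS t ON c.trial_id = t.id
    WHERE t.start LIKE '2024-07-18 09%' AND c.id IN (
        SELECT e1.code_component_id FROM evaluation AS e1
        WHERE e1.trial_id = :t AND e1.id IN (
            SELECT d.dependent_id FROM dependency AS d
            WHERE d.trial_id = :t AND d.type = 'argument'
              AND d.dependency_id IN (
                SELECT e.id FROM evaluation AS e
                WHERE e.repr = :p AND e.trial_id = :t)))
""", {"t": TRIAL, "p": "'p14.dat'"}).fetchall()
print(rows)  # -> [('read', '2024-07-18 09:00:12')]
```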
5 - Get all the outputs OS of all trials T that had the parameter p(p∈P) as an input.
Before answering this query, we must decide the value of the parameter. In our experiment, only two trials had parameters passed to them when their scripts were called: "2" and "1". Arbitrarily, we chose the parameter "2".
After deciding the parameter, a SQL query over three tables can answer Query number 5. We query the file_access table to get the outputs, because it registers accesses to files, including writes, which are the outputs of a trial. We also query the activation table, which registers function calls, to make sure the file accesses we retrieve are the right ones. Finally, we check the argument table, which records whether the trial received a parameter when it was executed and which parameter it was.
```sql
SELECT *
FROM file_access AS f, activation AS ac, argument AS ar
WHERE ac.id == f.activation_id
  AND ar.trial_id == f.trial_id
  AND f.trial_id == ac.trial_id
  AND (f.mode == "w" OR f.mode == "w+b")
  AND f.name NOT LIKE "nul"
  AND ar.name == "argv"
  AND ar.value LIKE "%'2'%"
```
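The same kind of sanity check works here. Below, the three tables are mocked with one trial that received the parameter "2" and one that received "1"; all table rows are invented for illustration:

```python
import sqlite3

# Simplified mock of file_access, activation, and argument; rows are invented.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE file_access (trial_id TEXT, activation_id INTEGER,
                              name TEXT, mode TEXT);
    CREATE TABLE activation (trial_id TEXT, id INTEGER);
    CREATE TABLE argument (trial_id TEXT, name TEXT, value TEXT);
""")
# Trial t1 was called with parameter '2' and wrote out.png; t2 received '1'.
cur.execute("INSERT INTO activation VALUES ('t1', 1)")
cur.execute("INSERT INTO file_access VALUES ('t1', 1, 'out.png', 'w')")
cur.execute("INSERT INTO argument VALUES ('t1', 'argv', ?)",
            ("['experiment.py', '2']",))
cur.execute("INSERT INTO activation VALUES ('t2', 1)")
cur.execute("INSERT INTO file_access VALUES ('t2', 1, 'out2.png', 'w')")
cur.execute("INSERT INTO argument VALUES ('t2', 'argv', ?)",
            ("['experiment.py', '1']",))

rows = cur.execute("""
    SELECT f.trial_id, f.name
    FROM file_access AS f, activation AS ac, argument AS ar
    WHERE ac.id = f.activation_id AND ar.trial_id = f.trial_id
      AND f.trial_id = ac.trial_id
      AND (f.mode = 'w' OR f.mode = 'w+b') AND f.name NOT LIKE 'nul'
      AND ar.name = 'argv' AND ar.value LIKE ?
""", ("%'2'%",)).fetchall()
print(rows)  # -> [('t1', 'out.png')] : only the trial that received '2'
```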
6 - Find all outputs(return) OSF of f(f∈F) if f is preceded by f'(f'∈F) and f' received the parameter pf(pf∈PF)
7 - Given two scripts contemplating the same experiment, s and s', find the difference between s and s' executions and functions.
This query can be answered in noWorkflow's trial graph visualization: selecting a trial node and shift-clicking a second one compares the two trials, showing the differences between their executions and functions.
8 - Find the outputs(return) OSF of f(f∈F) where f received a parameter pf(pf∈PF) and pf was annotated with a specific annotation a.
This is a query that noWorkflow presently can't answer. It has no feature for putting annotations in files, nor does it support them, meaning it can't know whether a parameter of a function was annotated, nor can it annotate one itself.
9 - Find all the outputs OS of all trials T where the outputs were annotated. These annotations must have a specific content ac. Return all the other annotations of all OS that fulfill the requirement.
This is a query that noWorkflow presently can't answer. It has no feature for putting annotations in files, nor does it support them, meaning it can't know whether a parameter of a function was annotated, can't annotate one itself, nor can it read an annotation's content.