Description
NGFF generation
Generation takes place on the pilot-zarr1-dev or pilot-zarr2-dev machines.
We need to generate NGFF data with https://github.com/IDR/bioformats2raw/releases/tag/v0.6.0-24, which has ZarrReader fixes, including those required for .pattern file data.
Install bioformats2raw via conda:
conda create -n bioformats2raw python=3.9
conda activate bioformats2raw
conda install -c ome bioformats2raw
This is just for getting the dependencies installed. Get the actual bioformats2raw build from the link above and unzip it into your home directory.
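For example, a minimal sketch of fetching and unpacking it (the exact asset name on that release is an assumption; check the release page linked above):
cd ~
wget https://github.com/IDR/bioformats2raw/releases/download/v0.6.0-24/bioformats2raw-0.6.0-24.zip  # asset name assumed
unzip bioformats2raw-0.6.0-24.zip  # should yield ~/bioformats2raw-0.6.0-24/bin/bioformats2raw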
We need to generate NGFF Filesets under the /data volume.
Create directories for the idr project and memo files (if not already there), and change into the idr directory. For example, for idr0051:
cd /data
sudo mkdir idr0051
sudo chown yourname idr0051
sudo mkdir memo
sudo chown yourname memo
cd idr0051
Find out where the pattern, screen or companion files are. For example: /nfs/bioimage/drop/idr0051-fulton-tailbudlightsheet/patterns/
Then run the conversion (using the bioformats2raw from above) in a screen (it is long running):
NB: it may be useful to convert a single Fileset to zarr initially, to determine its size on disk and tell whether you have enough space to convert all the others at once. If not, you may have to convert a smaller number at a time, then zip and upload them to BioStudies before deleting to make space available.
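For example, a minimal sketch of converting just the first pattern file and checking its size (assuming the idr0051 paths used below):
FIRST=`ls /nfs/bioimage/drop/idr0051-fulton-tailbudlightsheet/patterns/ | head -n 1`
~/bioformats2raw-0.6.0-24/bin/bioformats2raw --memo-directory ../memo /nfs/bioimage/drop/idr0051-fulton-tailbudlightsheet/patterns/$FIRST ${FIRST%.*}.ome.zarr
du -sh ${FIRST%.*}.ome.zarr  # size of one converted Fileset on disk
df -h /data                  # space remaining on the volume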
NB: please make sure that the --memo-directory specified here is writable by you.
screen -S idr0051ngff
for i in `ls /nfs/bioimage/drop/idr0051-fulton-tailbudlightsheet/patterns/`; do echo $i; ~/bioformats2raw-0.6.0-24/bin/bioformats2raw --memo-directory ../memo /nfs/bioimage/drop/idr0051-fulton-tailbudlightsheet/patterns/$i ${i%.*}.ome.zarr; done
($i is the pattern file; ${i%.*}.ome.zarr strips the .pattern file extension and adds .ome.zarr. This should work for pattern, screen and also companion file extensions.)
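A quick illustration of that parameter expansion (with a hypothetical file name):
i=plate1.pattern
echo ${i%.*}.ome.zarr  # prints: plate1.ome.zarr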
Upload to EBI s3 for testing
Upload 1 or 2 Plates or Images to EBI's s3, so we can validate that the data can be viewed and imported on s3.
Create a bucket from a local aws install: once aws is installed, just run aws configure and enter the Access key and Secret key; use the defaults for the other options.
$ aws --endpoint-url https://uk1s3.embassy.ebi.ac.uk s3 mb s3://idr0010
make_bucket: idr0010
Then update the policy and CORS config as described at https://github.com/IDR/deployment/blob/master/docs/object-store.md#policy (NB: replace idr0000 with e.g. idr0010 in the sample config).
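A sketch of applying those configs with the aws CLI (policy.json and cors.json here are hypothetical local files containing the sample configs from that page, with the bucket name substituted):
$ aws --endpoint-url https://uk1s3.embassy.ebi.ac.uk s3api put-bucket-policy --bucket idr0010 --policy file://policy.json
$ aws --endpoint-url https://uk1s3.embassy.ebi.ac.uk s3api put-bucket-cors --bucket idr0010 --cors-configuration file://cors.json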
Upload the data using mc, installed on the dev servers where the data is generated:
$ ssh pilot-zarr1-dev
$ wget https://dl.min.io/client/mc/release/linux-amd64/mc
$ chmod +x mc
$ ./mc config host add uk1s3 https://uk1s3.embassy.ebi.ac.uk
Enter Access Key: X8GE11ZK************
Enter Secret Key:
Added `uk1s3` successfully.
$ /home/wmoore/mc cp -r idr0010/ uk1s3/idr0010/zarr
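To upload just one or two Filesets for initial testing (as suggested above), you can copy a single zarr instead of the whole study (plate name hypothetical):
$ ./mc cp -r idr0010/plate1.ome.zarr uk1s3/idr0010/zarr/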
You should now be able to view and do some validation of the data with ome-ngff-validator and vizarr. E.g.:
https://ome.github.io/ome-ngff-validator/?source=https://uk1s3.embassy.ebi.ac.uk/idr0025/zarr/10x+images+plate+3.ome.zarr
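The validator URL is just the s3 address of the zarr passed as the source parameter, so for any uploaded Fileset you can build it like this (bucket and path assumed):
$ ZARR=https://uk1s3.embassy.ebi.ac.uk/idr0010/zarr/plate1.ome.zarr
$ echo "https://ome.github.io/ome-ngff-validator/?source=${ZARR}"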
Submission to BioStudies
Once the NGFF data has been validated to your satisfaction, we can upload to BioStudies.
We need to create a .zip file for each .ome.zarr Fileset.
Where space is short, it can be useful to use -m to move files into the zip and delete the originals. For a single zarr this looks like $ zip -mr image.ome.zarr.zip image.ome.zarr.
To zip all the zarr Filesets for a study, e.g.:
screen -S idr0010_zip
cd idr0010
for i in */; do zip -mr "${i%/}.zip" "$i"; done
This will create zips in the same dir as the zarrs, but we want a directory that contains just the zips for upload...
mkdir idr0010
mv *.zip idr0010/
Upload via Aspera, using the "secret directory".
Login to BioStudies with the IDR account.
Click on the FTP/Aspera button at https://www.ebi.ac.uk/biostudies/submissions/files
# install...
$ wget https://ak-delivery04-mul.dhe.ibm.com/sar/CMA/OSA/08q6g/0/ibm-aspera-cli-3.9.6.1467.159c5b1-linux-64-release.sh
$ chmod +x ibm-aspera-cli-3.9.6.1467.159c5b1-linux-64-release.sh
$ bash ibm-aspera-cli-3.9.6.1467.159c5b1-linux-64-release.sh
$ cd .aspera/cli/bin
$ ./ascp -P33001 -i ../etc/asperaweb_id_dsa.openssh -d /path/to/idr00xx [email protected]:xx/xxxxxxxxxxxxxxxxxxxxxxx
Here is some JavaScript you can run in the browser console to get the file names in the submission table:
let names = [];
[].forEach.call(document.querySelectorAll("div [role='row'] .ag-cell[col-id='name']"), function(div) {
  names.push(div.innerHTML.trim());
});
console.log(names.join("\n"));
console.log(names.length);
Create a tsv file that lists all the filesets for the submission, with the first column named Files. See https://www.ebi.ac.uk/bioimage-archive/help-file-list/. E.g. idr0054_files.tsv:
Files
idr0054/Tonsil 1.ome.zarr.zip
idr0054/Tonsil 2.ome.zarr.zip
idr0054/Tonsil 3.ome.zarr.zip
Upload this to the same location as above (via FTP or using the web UI).
This is used to specify which files are to be used in the submission.
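A sketch for generating such a file list from the directory of zips created above (idr0054 layout assumed):
echo "Files" > idr0054_files.tsv
for z in idr0054/*.zip; do echo "$z" >> idr0054_files.tsv; done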
You should be able to see all the uploaded files at https://www.ebi.ac.uk/biostudies/submissions/files
Create a new submission at https://www.ebi.ac.uk/biostudies/submissions/
- TBD: Name the submission idr00xx NGFF...
- Check for an existing submission for this study (with raw data). Existing IDR studies can be found with https://www.ebi.ac.uk/biostudies/BioImages/studies?facet.link_type=image+data+resource.
- Add links from this submission:
  - To the IDR itself, using link_type: image data resource
  - To the existing BIA submission if it exists (link_type?)
- The idr00XX_files.tsv file list created above can be added to the submission under the Study Component section, which is at the bottom of the submission form.
Once submitted, we need to ask EBI to process the submission: unzip each zarr and upload the data to s3. BioStudies will assign a uuid to each and provide a mapping from each zip file to uuid.zarr as csv:
Tonsil 2.ome.zarr.zip, https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD704/36cb5355-5134-4bdc-bde6-4e693055a8f9/36cb5355-5134-4bdc-bde6-4e693055a8f9.zarr/0
Tonsil 1.ome.zarr.zip, https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD704/5583fe0a-bbe6-4408-ab96-756e8e96af55/5583fe0a-bbe6-4408-ab96-756e8e96af55.zarr/0
Tonsil 3.ome.zarr.zip, https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD704/3b4a8721-1a28-4bc4-8443-9b6e145efbe9/3b4a8721-1a28-4bc4-8443-9b6e145efbe9.zarr/0
Spreadsheet for keeping track of the submission status: https://docs.google.com/spreadsheets/d/1P3dn-uL9KzE9O7XAKhpL8fUMTG3LWedMgjzSdnfAjQ4/edit#gid=0
This needs to be used to create the necessary symlinks below.
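A sketch of turning that csv into the mounted paths used for the symlink targets (mapping.csv is a hypothetical file name for the mapping above):
# each row: <name>.ome.zarr.zip, https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD704/<uuid>/<uuid>.zarr/0
while IFS=, read -r zipname url; do
  url=$(echo $url)                                 # trim the space after the comma
  path=/${url#https://uk1s3.embassy.ebi.ac.uk/}    # /bia-integrator-data/S-BIAD704/<uuid>/<uuid>.zarr/0
  echo "$zipname -> ${path%/0}"                    # location under the goofys mount below
done < mapping.csv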
If not already done, mount the bia-integrator-data bucket on the server machine and check to see if the files are available:
$ sudo mkdir /bia-integrator-data && sudo /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other bia-integrator-data /bia-integrator-data
$ ls /bia-integrator-data/S-BIAD704
36cb5355-5134-4bdc-bde6-4e693055a8f9 3b4a8721-1a28-4bc4-8443-9b6e145efbe9 5583fe0a-bbe6-4408-ab96-756e8e96af55
Make NGFF Filesets
Work in progress.
Use https://github.com/joshmoore/omero-mkngff to create filesets based on the mounted s3 NGFF Filesets.
See IDR/idr-utils#56 for a script for generating the inputs required for omero-mkngff.
conda create -n mkngff -c conda-forge -c ome omero-py bioformats2raw
conda activate mkngff
pip install 'omero-mkngff @ git+https://github.com/joshmoore/omero-mkngff@main'
omero login demo@localhost
omero mkngff setup > setup.sql
omero mkngff sql --secret=$SECRET 5287125 a.ome.zarr/ > my.sql
sudo -u postgres psql idr < setup.sql
sudo -u postgres psql idr < my.sql
sudo -u omero-server mkdir /data/OMERO/ManagedRepository/demo_2/Blitz-0-Ice.ThreadPool.Server-2/2023-06/22/12-46-39.975_converted/
mv a.ome.zarr /tmp
ln -s /tmp/a.ome.zarr /data/OMERO/ManagedRepository/demo_2/Blitz-0-Ice.ThreadPool.Server-2/2023-06/22/12-46-39.975_converted/a.ome.zarr
omero render test Image:14834721 # Failing here
Validation
See IDR/idr-utils#55. Check out that branch of idr-utils (if not merged yet).
The script there allows us to check the pixel data for the lowest resolution of each image in a study, validating that each plane is identical to the corresponding one in IDR.
This could take a while, so let's run it in a screen...
sudo -u omero-server -s
screen -S idr0012_check_pixels
source /opt/omero/server/venv3/bin/activate
omero login demo@localhost
cd /uod/idr/metadata/idr-utils/scripts
python check_pixels.py Plate:4299 /tmp/check_pixels_idr0012.log
Archived workflow below
The sections below describe a previous workflow (prior to the omero-mkngff approach).
Make a metadata-only copy of the data
Since we want to import NGFF data without chunks, we need to create a copy of the data without chunks for import. The easiest way to do this is to use aws to sync the data, ignoring chunks.
We want these to be owned by the omero-server user in a location they can access, so they can be imported. The location at import time isn't too important.
$ screen -S idr0010_aws_sync # can take a while if lots of data
$ mkdir idr0010
$ cd idr0010
$ aws s3 sync --no-sign-request --exclude '*' --include "*/.z*" --include "*.xml" --endpoint-url https://uk1s3.embassy.ebi.ac.uk s3://idr0010/zarr .
$ sudo mv -f ./* /ngff/idr0010/
$ cd /ngff/
$ sudo chown -R omero-server idr0010/
Import metadata-only data
We can now perform a regular import as usual. Instead of creating a bulk import config, use a for loop to iterate through each plate in the directory, setting the name (removing .ome.zarr, or .zarr for e.g. idr0036) so that the data isn't named METADATA.ome.xml and Plate names match the original data. You could also add a target Screen or Dataset (not shown) or move the data into a container with the webclient UI after import:
sudo -u omero-server -s
screen -S idr0010_ngff
source /opt/omero/server/venv3/bin/activate
export OMERODIR=/opt/omero/server/OMERO.server
omero login demo@localhost
cd /ngff/idr0010
for dir in *; do
omero import --transfer=ln_s --depth=100 --name=${dir/.ome.zarr/} --skip=all $dir --file /tmp/$dir.log --errs /tmp/$dir.err;
done
Update symlinks
Mount the s3 bucket on the IDR server machine (idr0125-pilot or idr0138-pilot):
sudo mkdir /idr0010 && sudo /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0010 /idr0010
See IDR/idr-utils#54. Check out that branch of idr-utils (if not merged yet).
We need to specify the container (e.g. Screen, Plate, Dataset, Image or Fileset) and the path where the data is mounted. If the path to the data in each Fileset is e.g. filesetPrefix/plate1.zarr/.. and the path to each mounted plate is e.g. /path/to/plates/plate1.zarr, we can run the following command to create one symlink for each plate, from /ManagedRepository/filesetPrefix/plate1.zarr to /path/to/plates/plate1.zarr.
The script also renders a single Image from each Fileset before updating symlinks, which avoids subsequent ResourceErrors.
The script can be run repeatedly on the same data without issue, e.g. if it fails part-way through and needs a re-run to complete.
A --repo option has the default value /data/OMERO/ManagedRepository. You can also use the --dry-run and --report options:
$ sudo -u omero-server -s
$ source /opt/omero/server/venv3/bin/activate
$ omero login demo@localhost
$ python idr-utils/scripts/managed_repo_symlinks.py Screen:123 /path/to/plates/ --report
Fileset: 5286929 /data/OMERO/ManagedRepository/demo_2/Blitz-0-Ice.ThreadPool.Server-6/2023-04/25/13-53-43.777/
fs_contents ['10-34.ome.zarr']
Link from /data/OMERO/ManagedRepository/demo_2/Blitz-0-Ice.ThreadPool.Server-6/2023-04/25/13-53-43.777/10-34.ome.zarr to /idr0010/zarr/10-34.ome.zarr
...
Swap Filesets
See IDR/idr-utils#53. Check out that branch of idr-utils (if not merged yet).
The first Object (Screen, Plate, Image, Fileset) is the original data that we want to update to use the NGFF Fileset, and the second is the NGFF data we imported above. In the case of Screens, Filesets are swapped between pairs of Plates matched by name (you should check that Plate names match before running this script).
The 3rd required argument is a file where you can write the sql commands that are required to update Pixels objects (we can't yet update these via the OMERO API).
The script supports --dry-run and --report flags.
$ source /opt/omero/server/venv3/bin/activate
$ omero login demo@localhost
$ python idr-utils/scripts/swap_filesets.py Screen:1202 Screen:3204 /tmp/idr0012_filesetswap.sql --report
This will write a psql command for each Fileset that we then need to execute...
$ export OMERODIR=/opt/omero/server/OMERO.server
$ omero config get --show-password
# Use the password, host etc to run the sql file generated above...
$ PGPASSWORD=****** psql -U omero -d idr -h 192.168.10.102 -f /tmp/idr0012_filesetswap.sql
The psql commands are one per Fileset and look like:
UPDATE pixels SET name = '.zattrs', path = 'demo_2/Blitz-0-Ice.ThreadPool.Server-16/2023-04/12/10-20-20.483/10x_images_plate_2.ome.zarr' where image in (select id from Image where fileset = 5286921);
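To sanity-check the result of one of these updates, you can query the same rows back (connection details as above; fileset id from the example):
$ PGPASSWORD=****** psql -U omero -d idr -h 192.168.10.102 -c "SELECT path, name FROM pixels WHERE image IN (SELECT id FROM image WHERE fileset = 5286921) LIMIT 5;"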
You can then view Images from the original data which is now using an NGFF Fileset!
Cleanup
We can now delete the uk1s3 data and buckets created above for testing.
The original Filesets will remain as "orphans".
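For example, assuming the aws CLI configured earlier, a test bucket and all of its contents can be removed with:
$ aws --endpoint-url https://uk1s3.embassy.ebi.ac.uk s3 rb s3://idr0010 --force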