NXLauetof Panel dependent dataset #102

Closed · wants to merge 14 commits
Conversation

@YooSunYoung (Member) commented Jan 27, 2025

Related to #60 and #90

@aaronfinke can you please try this and see if it crashes or not....?

Comment on lines 187 to 189

_create_dataset_from_var(
    name="data",
    root_entry=nx_detector,
-   var=sc.scalar(0, unit=''),  # TODO: Add real data
+   var=sc.fold(dg["counts"].data, dim='id', sizes={'x': num_x, 'y': num_y}),
)
@YooSunYoung (Member, Author)

@aaronfinke And this file has all the exporting methods.

Can you please review this part and check whether the fields are correct?

For example, this block folds the pixel counts into [n_x_pixels, n_y_pixels, n_tof_bins]
and then saves them as data under instrument/{detector_name}.
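For readers unfamiliar with sc.fold, here is a minimal, self-contained sketch (with made-up sizes, not the PR's actual values) of the reshaping this block performs:

import scipp as sc

num_x, num_y, num_tof = 4, 3, 5  # made-up panel sizes
# Counts with a flat pixel dimension 'id' and a tof dimension 't' are reshaped,
# without copying, into the [n_x_pixels, n_y_pixels, n_tof_bins] layout.
counts = sc.zeros(dims=['id', 't'], shape=[num_x * num_y, num_tof], unit='counts')
folded = sc.fold(counts, dim='id', sizes={'x': num_x, 'y': num_y})
print(folded.sizes)  # {'x': 4, 'y': 3, 't': 5}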

@aaronfinke (Collaborator)

Make sure the counts are unsigned int and not float, since I think that is what is expected from the EFU (they were floats before, but they shouldn't be).

Also, we need to think about how to set the detector geometries in the file. I think the way it was done before (fast_axis, slow_axis, origin as 3D vectors) is fine. If we do that, we can remove polar_angle and azimuthal_angle. I need to think about this, though.
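For illustration only (values made up, not from the PR): the fast_axis/slow_axis/origin style mentioned above could be expressed as 3-D vectors in scipp like this:

import scipp as sc

# Illustrative values only: panel geometry as 3-D vectors instead of
# polar_angle/azimuthal_angle.
fast_axis = sc.vector(value=[1.0, 0.0, 0.0])             # direction of increasing x pixel index
slow_axis = sc.vector(value=[0.0, 1.0, 0.0])             # direction of increasing y pixel index
origin = sc.vector(value=[0.25, -0.25, 0.29], unit='m')  # position of the panel's first pixel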

@YooSunYoung (Member, Author)

@aaronfinke I pushed a change to export them as uint.
Just note that the original McStas data is floating-point numbers, so if you don't give the workflow a big enough MaximumProbability, it might just wipe out all the information.
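To make this concern concrete, here is a minimal sketch (not the PR's actual provider; values are made up) of why a too-small MaximumProbability wipes out information when casting to unsigned int:

import scipp as sc

maximum_probability = 10_000  # user-set value
weights = sc.array(dims=['id'], values=[0.002, 0.75, 0.0301])  # float McStas weights
# Rescale so the largest weight maps to MaximumProbability, then cast to uint.
scaled = weights * (maximum_probability / weights.max().value)
counts = scaled.to(dtype='uint64')
print(counts.values)  # roughly [26, 10000, 401]; with maximum_probability=1 almost everything becomes 0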

@aaronfinke (Collaborator)

Usually MaximumProbability is set to 10000, which seems to be big enough. In any case, DIALS reads the data in as ints, so we might as well control that input ourselves.

@aaronfinke (Collaborator)

As to whether counts should be uint or signed int: some detectors output -1 as a count to mark defective pixels; sometimes it's MAX_VALUE-1. I don't know what our detector will do, but once I know, I will let you know.

@aaronfinke (Collaborator)

Doesn't crash for me :) (also does not give the correct processed data, but I guess it's not supposed to yet?)

@YooSunYoung (Member, Author)

> Doesn't crash for me :) (also does not give the correct processed data, but I guess it's not supposed to yet?)

What do you mean by correct processed data? The counts should be correct.
There are some fields that still need to be filled, though.

@YooSunYoung (Member, Author)

@scipp/ess-maintainers @aaronfinke

Sorry I couldn't really make it pretty yet and ended up making a PR with many fixes.

workflow_chunk.ipynb has an example of processing data chunk by chunk.

I had to reorder some providers:

  1. We have to scale the floating-point probability with the maximum existing probability and a maximum count set manually by the user. Therefore the maximum existing probability also had to be accumulated (that's the accumulator I wrote in the notebook) and then fed into the provider.
  2. The events had to be histogrammed into both pixel id and time so that they can be accumulated efficiently.

I'll be away until 11th of February so if anyone's interested, please pick this up and continue.
(Feel free to discard what I have done if needed...!)

TODO:

  • Update the docs dependencies to include essreduce.
  • Move MaxAccumulator somewhere in a package (a rough sketch of the idea follows below).
  • Add tests.
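Since MaxAccumulator itself is not shown in this thread, here is a hypothetical sketch of the idea (not the notebook's implementation), mimicking the push()/value interface of the ess.reduce.streaming accumulators that appears in the traceback further down:

import scipp as sc

class MaxAccumulator:
    """Hypothetical sketch: keep the running maximum of the values pushed chunk by
    chunk, mimicking the push()/value interface of ess.reduce.streaming accumulators."""

    def __init__(self) -> None:
        self._value: sc.Variable | None = None

    def push(self, value: sc.Variable) -> None:
        chunk_max = value.max()  # reduce the chunk to its maximum
        if self._value is None or (chunk_max > self._value).value:
            self._value = chunk_max

    @property
    def value(self) -> sc.Variable:
        return self._value


# Usage sketch: accumulate the maximum existing probability over chunks.
acc = MaxAccumulator()
acc.push(sc.array(dims=['event'], values=[0.1, 0.4]))
acc.push(sc.array(dims=['event'], values=[0.2, 0.9]))
print(acc.value)  # 0.9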

@aaronfinke (Collaborator)

from ess.reduce.streaming import EternalAccumulator, Accumulator
from ess.reduce.streaming import StreamProcessor

These are not part of essnmx; where are they from?

@jokasimr (Contributor) commented Feb 3, 2025

> These are not part of essnmx; where are they from?

They are from https://github.com/scipp/essreduce.

@aaronfinke (Collaborator)

Tried it on a McStas dataset and got this error:

---------------------------------------------------------------------------
DatasetError                              Traceback (most recent call last)
Cell In[23], line 38
     36     if any(da.sizes.values()) == 0:
     37         continue
---> 38     results = processor.add_chunk({RawEventData: da})
     40 proton_charge, histogram = results[ProtonCharge], results[NMXHistogram]
     41 reduced = format_nmx_reduced_data(
     42     counts=histogram,
     43     proton_charge=proton_charge,
   (...)
     46     crystal_rotation=crystal_rotation,
     47 )

File ~/.conda/envs/scipp/lib/python3.13/site-packages/ess/reduce/streaming.py:216, in StreamProcessor.add_chunk(self, chunks)
    214 to_accumulate = self._process_chunk_workflow.compute(self._accumulators)
    215 for key, processed in to_accumulate.items():
--> 216     self._accumulators[key].push(processed)
    217     self._finalize_workflow[key] = self._accumulators[key].value
    218 return self._finalize_workflow.compute(self._target_keys)

File ~/.conda/envs/scipp/lib/python3.13/site-packages/ess/reduce/streaming.py:62, in Accumulator.push(self, value)
     60 if self._preprocess is not None:
     61     value = self._preprocess(value)
---> 62 self._do_push(value)

File ~/.conda/envs/scipp/lib/python3.13/site-packages/ess/reduce/streaming.py:99, in EternalAccumulator._do_push(self, value)
     97     self._value = value.copy()
     98 else:
---> 99     self._value += value

DatasetError: Mismatch in coordinate 't' in operation 'add_equals':
(t: 51)    float64              [s]  [0.122206, 0.122209, ..., 0.122345, 0.122348]
vs
(t: 51)    float64              [s]  [0.078339, 0.0792191, ..., 0.121464, 0.122344]

@aaronfinke (Collaborator)

I would suggest taking advantage of the fact that these HDF5 data are stored in compressed chunks, and the chunk size is already optimal for I/O.

e.g. in one McStas file:

with h5py.File('scratch/aaronfinke/nmx_mcstas_rubredoxin_phiseries/5e11_x11000SPLIT_x10_omega_0_180/1/mccode.h5') as fp:
    dset = fp['entry1/data']['Detector_0_event_signal_dat_list_p_x_y_n_id_t']['events']
    chunk_shape = dset.chunks

chunk_shape
# output: (1332354, 6)

then you can do something like (while the file is still open):

for s in dset.iter_chunks():
    chunk = dset[s]
    # <process chunk>

That is how I am working with the McStas data directly, and it is quite fast and memory-efficient.

@SimonHeybrock (Member)

My guess is that the chunks in the file are much smaller than the optimal chunk size for the data reduction. There is a very significant per-pixel cost, so if there are too few events per chunk, performance will be dominated by the nchunk*npixel terms.

But yes, we can try to choose the chunk size as a multiple of the HDF5 chunk size of the dataset, as an added bonus.
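A hypothetical sketch of that idea (the file path and TARGET_EVENTS_PER_CHUNK are illustrative, not from the PR): read blocks whose row count is a whole multiple of the HDF5 chunk size, so each read stays aligned with the compressed chunks on disk while still being large enough for the reduction.

import h5py

TARGET_EVENTS_PER_CHUNK = 20_000_000  # illustrative tuning knob

with h5py.File('mccode.h5', 'r') as fp:  # placeholder path
    dset = fp['entry1/data']['Detector_0_event_signal_dat_list_p_x_y_n_id_t']['events']
    hdf5_chunk_rows = dset.chunks[0]
    # Round down to a whole number of HDF5 chunks, but read at least one chunk.
    read_rows = max(1, TARGET_EVENTS_PER_CHUNK // hdf5_chunk_rows) * hdf5_chunk_rows
    for start in range(0, dset.shape[0], read_rows):
        block = dset[start:start + read_rows]  # aligned, multi-chunk read
        # ... feed `block` into the streaming workflow ...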

@YooSunYoung (Member, Author)

> Tried it on a McStas dataset and got this error:
> [traceback quoted above]

@aaronfinke
This error is probably because the time binning parameter is not properly set...?
I updated the notebook to set it at the beginning. That should fix the issue.

And I also added an argument that enables automatic chunking based on iter_chunks.
You have to set CHUNK_SIZE to 0.
It might be quicker to work on this offline together...

@YooSunYoung (Member, Author)

  • We need a helper that decides the time-of-flight range for a large file.

@SimonHeybrock (Member)

> We need a helper that decides the time-of-flight range for a large file.

Why can't this be a user input, or pre-configured, as in other techniques/instruments?

@aaronfinke (Collaborator)

> Why can't this be a user input, or pre-configured, as in other techniques/instruments?

It will be pre-configured for real data, once we know precisely the tof bin limits for each chopper setting (determined experimentally). But not for McStas simulations ;)

@SimonHeybrock (Member)

> > Why can't this be a user input, or pre-configured, as in other techniques/instruments?
>
> It will be pre-configured for real data, once we know precisely the tof bin limits for each chopper setting (determined experimentally). But not for McStas simulations ;)

So we can make it a user input then?

@aaronfinke (Collaborator)

> So we can make it a user input then?

This will cause problems in DIALS if the boxes (or tof_min, tof_max) are not accurately determined; refinement of wavelength in DIALS tends to fail with larger bin sizes, for example. A helper script to determine it directly from the tof data is a better option. This is what I do now and it works. I'm not sure what the negative is, aside from taking longer to compute/having to read the file twice (when loading into memory is not an option). If processing takes twice as long, that's still significantly shorter than it takes to collect NMX data (~10 min vs hours).

I have sent Sunyoung the helper I have written for chunked data, but I will add it here for posterity:

def min_max_probability_tof(self) -> tuple[float, float, float]:
    """
    Get the minimum and maximum tof and the maximum probability.
    Ignore zero tof and zero probabilities (likely errors in McStas).
    Done in chunks to avoid memory errors for large files.
    """
    # Assumes `import h5py`, `import numpy as np`, and a module-level `logger`;
    # self.input_file and self.num_detectors come from the surrounding class.
    max_p = 0.0
    max_t, min_t = 0.0, np.inf

    with h5py.File(self.input_file, "r") as fp:
        for i in range(self.num_detectors):
            print(f"minmax_t frame {i}")
            dset = fp["entry1/data"][f"Detector_{i}_event_signal_dat_list_p_x_y_n_id_t"]["events"]

            for s in dset.iter_chunks():
                chunk = dset[s]
                chunk_t = chunk.T[5]  # tof column
                chunk_p = chunk.T[0]  # probability (weight) column
                tempmax_t = np.max(chunk_t)
                tempmin_t = np.min(chunk_t[chunk_t > 0])
                tempmax_p = np.max(chunk_p)
                if tempmax_t > max_t:
                    max_t = tempmax_t
                if tempmax_p > max_p:
                    max_p = tempmax_p
                if tempmin_t < min_t:
                    min_t = tempmin_t
    logger.info(f"max_t: {max_t}, min_t: {min_t}")
    logger.info(f"max_p: {max_p}")
    return (max_t, min_t, max_p)
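For context, a hypothetical usage sketch (`loader` and NUM_TOF_BINS are illustrative names, not from the PR): the helper's output can be turned into the fixed tof bin edges that the streaming histogramming needs, which is exactly the kind of pre-set time binning that avoids the coordinate-mismatch error quoted earlier.

import scipp as sc

max_t, min_t, max_p = loader.min_max_probability_tof()  # `loader` owns the method above
NUM_TOF_BINS = 50  # illustrative
tof_edges = sc.linspace('t', min_t, max_t, num=NUM_TOF_BINS + 1, unit='s')
# max_p would feed the MaximumProbability scaling discussed earlier in this thread.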

@YooSunYoung (Member, Author)

It's rewritten in #115.

@YooSunYoung deleted the lauetof-2 branch on February 27, 2025 at 10:06.