Skip to content

Evaluating each compressor on multiple error bounds#15

Merged
treigerm merged 23 commits intomainfrom
error_bounds
Apr 3, 2025
Merged

Evaluating each compressor on multiple error bounds#15
treigerm merged 23 commits intomainfrom
error_bounds

Conversation

@treigerm
Copy link
Copy Markdown
Member

This is a draft PR to address #11 . On a high-level this makes the following adjustments:

  • For each dataset, store the required error bounds in datasets-error-bounds/{dataset_name}/error_bounds.json.
  • Adjust the build method to take arguments data_min, data_max (minimum and maximum value in the data) and one of either abs_error, an absolute error bound, or rel_error, a relative error bound. If an absolute error bound is passed to a compressor that can only handle relative error bounds, the information from data_min, data_max is used to compute the most stringent relative error bound which ensures that the specified absolute error bound won't be exceeded (and the same vice versa, if the compressor can only handle absolute error but relative error is specified).
  • The decompressed dataset is now stored in compressed-datasets/{dataset_name}/{error_bound_name}={error_bound}/{compressor_name}

I'm happy to convinced of another design, the reasons I picked this structure is:

  • With the work on the ERA5 ensembles, the error bounds are a property of the dataset, i.e. for each dataset/variable we derive an error bound that should be adhered to.
  • For a given dataset, all the compressors should be given the same error bound type and value. How this error bound is satisfied, in the end, is a choice of the compressor. If a given compressor cannot handle e.g. absolute error bounds, then how we adapt the compressor settings to make it work with absolute error bounds is a design choice we have to make. This choice should be consistent between different datasets.

I have marked this as a draft because the whole structure/approach might change. I just wanted to have a working example to guide the discussion. Let me know what you think @juntyr !

Comment thread src/climatebenchpress/compressor/compressors/tthresh.py Outdated
Comment thread src/climatebenchpress/compressor/compressors/abc.py Outdated
@treigerm treigerm requested a review from juntyr March 25, 2025 15:09
Copy link
Copy Markdown
Collaborator

@juntyr juntyr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @treigerm! I left some initial comments, mostly about how we compute relative error bounds.

For data that is all-positive or all-negative, we could also use a less pessimistic combination of the https://numcodecs-wasm.readthedocs.io/en/latest/api/numcodecs_wasm_log/ codec (and https://numcodecs-wasm.readthedocs.io/en/latest/api/numcodecs_wasm_fixed_offset_scale/ for all-negative) to get proper relative errors.

For data that's both negative and positive, we'd have to go another route

Comment thread src/climatebenchpress/compressor/compressors/sz3.py Outdated
Comment thread src/climatebenchpress/compressor/compressors/abc.py Outdated
Comment thread src/climatebenchpress/compressor/compressors/zfp.py Outdated
Comment thread src/climatebenchpress/compressor/compressors/tthresh.py Outdated
Comment thread scripts/create_error_bounds.py
Comment thread src/climatebenchpress/compressor/compressors/abc.py Outdated
Comment thread src/climatebenchpress/compressor/compressors/bitround.py Outdated
Comment thread src/climatebenchpress/compressor/compressors/bitround.py Outdated
@treigerm
Copy link
Copy Markdown
Member Author

I like the idea of using the log transformation to have less pessimistic error bound transformation. Let's discuss with @milankl tomorrow what the best route forward is!

@juntyr juntyr mentioned this pull request Mar 27, 2025
Comment thread src/climatebenchpress/compressor/compressors/abc.py Outdated
@treigerm
Copy link
Copy Markdown
Member Author

@juntyr I have now significantly updated the PR to take the "iterator approach" and have updated it with the added JPEG2000 compressor as well. The Compressor.build function can now return multiple "variants" of the same compressor. I have done some refactoring so that all of the logic for creating different variants and converting between different error bounds is captured in the base class. Each class implementation for a compressor now only needs to specify how to construct a Codec for either an absolute or relative error bound (or both).

This PR is already quite large so I would suggest we leave the actual implementations of different ways to convert between relative to absolute error (and vice versa) to the next PR.

I have checked that this PR works on all the "tiny" datasets that we currently have in the data-loader.

Copy link
Copy Markdown
Collaborator

@juntyr juntyr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@treigerm Thank you again for all of your work on this PR! I left a few more comments, some minor, some require a design decision that we just need to commit to

Comment thread scripts/compress.py Outdated
Comment thread scripts/compress.py Outdated
Comment thread scripts/compress.py
Comment thread scripts/compress.py
Comment thread scripts/create_error_bounds.py
Comment thread src/climatebenchpress/compressor/compressors/abc.py Outdated
Comment thread src/climatebenchpress/compressor/compressors/abc.py Outdated
Comment thread src/climatebenchpress/compressor/compressors/jpeg2000.py Outdated
Comment thread src/climatebenchpress/compressor/compressors/sz3.py
Comment thread src/climatebenchpress/compressor/compressors/zfp.py
treigerm and others added 5 commits April 1, 2025 14:30
Co-authored-by: Juniper Tyree <50025784+juntyr@users.noreply.github.com>
Co-authored-by: Juniper Tyree <50025784+juntyr@users.noreply.github.com>
@treigerm
Copy link
Copy Markdown
Member Author

treigerm commented Apr 2, 2025

@juntyr Thanks a lot for all the feedbacl! I finally got around to changing the behaviour for datasets with multiple variables (and addressing the other minor points).

Based on our discussion we now build a separate codec for each variable in the dataset. Crucially, each variable can also have a different error bound as well. This makes the logic in the Compressor.build function a bit more complicated. I have added comments and ample type annotations to make it a bit more clear but let me know if anything looks wrong!

Because we can now have separate error bounds for each variable I have also adjusted the create_error_bounds.py script. It now saves separate error bounds for each variable. I also made compute some actual data dependent error bounds. Specifically, for each variable it creates three error bounds data_range * 0.01, data_range * 0.001, data_range * 0.0001. The multiplicative factors are pulled out of a hat but these error bounds should at least be a bit better than just having the same absolute error bound for all variables and we want to replace the manner with which we choose the error bounds soon anyway.

Copy link
Copy Markdown
Collaborator

@juntyr juntyr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly small nits now. I think with the small nits addressed we should merge. I'll want to have a go at the API myself but that will be a lot easier once we have your implementation working and can test it in practice

Comment thread src/climatebenchpress/compressor/compressors/abc.py Outdated
Comment thread scripts/create_error_bounds.py Outdated
Comment thread src/climatebenchpress/compressor/compressors/abc.py Outdated
Comment thread src/climatebenchpress/compressor/compressors/abc.py Outdated
Comment thread src/climatebenchpress/compressor/compressors/abc.py Outdated
Comment thread src/climatebenchpress/compressor/compressors/jpeg2000.py Outdated
@treigerm
Copy link
Copy Markdown
Member Author

treigerm commented Apr 3, 2025

@juntyr okay, so I managed to simplify things a bit further in the abstract compressor class. It still feels more messy then it should be. As I mentioned in my comment above I think part of the reason it's messy because the data structures I have chosen for the input/output of functions is not ideal. You're very welcome to have a go at refactoring it into something cleaner if you want (no need to keep any of the structure or types that I introduced).

For now, each codec that we generate should be identified by a tuple (error_bound_name, variant_name, variable_name). error_bound_name is the name of the original error bound (e.g. abs_error=0.1, right now the names are generated automatically but it might be more user friendly for the user to have to specify a concrete name); we need this name to group all compressors together which should (roughly) lead to the same error bound. variant_name contains information about whether we transformed the original error bound and if yes, how we transformed it. variable_name is the name of the variable needing to be compressed. Part of the mess of the code is that at different stages I am working with dicts or lists which group the error bounds/codecs according to one of these three categories. But it might be easier to just work with these tuples/thruples directly.

rate = 10.0 # x10 factor compression
def abs_bound_codec(dtype, error_bound):
precision = error_bound
max_pixel_val = 2**25 - 1 # maximum pixel value for our integer encoding.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still unsure here. Our current "linear quantisation" just divides the data by eb and then rounds. So the integer range we generate is round(min/eb) <= x <= round(max/eb). If the min goes below -224 or the max goes above 224 - 1 the JPEG2000 codec will error.

Comment thread src/climatebenchpress/compressor/compressors/abc.py Outdated
Comment thread src/climatebenchpress/compressor/compressors/abc.py Outdated
Comment thread src/climatebenchpress/compressor/compressors/abc.py
Copy link
Copy Markdown
Collaborator

@juntyr juntyr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the refactoring @treigerm, this PR looks good now. I left just some minor nits, feel free to squash and merge afterwards.

@treigerm treigerm merged commit 2b3d895 into main Apr 3, 2025
3 checks passed
@juntyr juntyr changed the title [Draft] Evaluating each compressor on multiple error bounds Evaluating each compressor on multiple error bounds Apr 10, 2025
@juntyr juntyr deleted the error_bounds branch April 10, 2025 13:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants