Evaluating each compressor on multiple error bounds by treigerm · Pull Request #15 · ClimateBenchPress/compressor

treigerm · 2025-03-25T15:03:47Z

This is a draft PR to address #11. At a high level, it makes the following adjustments:

For each dataset, store the required error bounds in datasets-error-bounds/{dataset_name}/error_bounds.json.
Adjust the build method to take the arguments data_min and data_max (the minimum and maximum values in the data) and exactly one of abs_error, an absolute error bound, or rel_error, a relative error bound. If an absolute error bound is passed to a compressor that can only handle relative error bounds, data_min and data_max are used to compute the most stringent relative error bound, which ensures that the specified absolute error bound won't be exceeded (and vice versa, if the compressor can only handle absolute error bounds but a relative error bound is specified).
The decompressed dataset is now stored in compressed-datasets/{dataset_name}/{error_bound_name}={error_bound}/{compressor_name}
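The bound conversion described in the list above can be sketched as follows. This is only an illustration, not the code from the PR: the function names are hypothetical, and it assumes a pointwise definition of relative error, |x - x'| <= rel_error * |x|. Compressors that define relative error against the data range would need a different formula.

```python
def abs_to_rel(abs_error: float, data_min: float, data_max: float) -> float:
    """Relative bound that is guaranteed not to exceed abs_error anywhere."""
    # The largest magnitude in the data gives the largest absolute error
    # for a fixed relative bound, so it determines the conversion.
    max_abs = max(abs(data_min), abs(data_max))
    if max_abs == 0.0:
        raise ValueError("all-zero data has no meaningful relative bound")
    return abs_error / max_abs


def rel_to_abs(rel_error: float, data_min: float, data_max: float) -> float:
    """Absolute bound that is guaranteed not to exceed rel_error anywhere."""
    # The smallest magnitude in [data_min, data_max] is the binding case.
    if data_min <= 0.0 <= data_max:
        min_abs = 0.0  # the range contains zero, so the bound degenerates
    else:
        min_abs = min(abs(data_min), abs(data_max))
    return rel_error * min_abs
```

Both conversions are pessimistic: values far from the binding magnitude are compressed more tightly than the original bound requires, and data that spans zero forces a relative-to-absolute conversion down to zero.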

I'm happy to be convinced of another design; the reasons I picked this structure are:

With the work on the ERA5 ensembles, the error bounds are a property of the dataset, i.e. for each dataset/variable we derive an error bound that should be adhered to.
For a given dataset, all the compressors should be given the same error bound type and value. How this error bound is satisfied, in the end, is a choice of the compressor. If a given compressor cannot handle e.g. absolute error bounds, then how we adapt the compressor settings to make it work with absolute error bounds is a design choice we have to make. This choice should be consistent between different datasets.

I have marked this as a draft because the whole structure/approach might change. I just wanted to have a working example to guide the discussion. Let me know what you think @juntyr !

juntyr

Thanks for working on this @treigerm! I left some initial comments, mostly about how we compute relative error bounds.

For data that is all-positive or all-negative, we could also use a less pessimistic combination of the https://numcodecs-wasm.readthedocs.io/en/latest/api/numcodecs_wasm_log/ codec (and https://numcodecs-wasm.readthedocs.io/en/latest/api/numcodecs_wasm_fixed_offset_scale/ for all-negative) to get proper relative errors.

For data that's both negative and positive, we'd have to go another route
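To make the log idea concrete, here is a minimal sketch in plain Python. It does not use the numcodecs-wasm codecs linked above; the function name and the rounding scheme are assumptions chosen only to show why an absolute bound in log space yields a relative bound in the original space.

```python
import math


def quantize_log(values, rel_error):
    """Round all-positive values on a uniform grid in log space."""
    # An absolute error of eps in log space means x_hat / x lies in
    # [exp(-eps), exp(eps)], so |x_hat - x| / x <= exp(eps) - 1 = rel_error.
    eps = math.log1p(rel_error)
    step = 2.0 * eps  # rounding to a grid of width `step` errs by at most eps
    out = []
    for x in values:
        if x <= 0.0:
            raise ValueError("the log transform requires all-positive data")
        out.append(math.exp(round(math.log(x) / step) * step))
    return out
```

All-negative data could presumably be handled by negating it before the transform and after the inverse. Data with mixed signs has no such simple mapping.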

treigerm · 2025-03-26T10:56:23Z

I like the idea of using the log transformation to get a less pessimistic error bound conversion. Let's discuss with @milankl tomorrow what the best route forward is!

Co-authored-by: Juniper Tyree <50025784+juntyr@users.noreply.github.com>

treigerm · 2025-03-31T13:33:02Z

@juntyr I have now significantly updated the PR to take the "iterator approach" and have updated it with the added JPEG2000 compressor as well. The Compressor.build function can now return multiple "variants" of the same compressor. I have done some refactoring so that all of the logic for creating different variants and converting between different error bounds is captured in the base class. Each class implementation for a compressor now only needs to specify how to construct a Codec for either an absolute or relative error bound (or both).
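As a rough picture of that split of responsibilities, the sketch below keeps all conversion logic in a base class and lets a subclass supply only a codec constructor. The class names, the variant names, the dict return type, and the placeholder codec are all hypothetical; the actual signatures in abc.py may differ.

```python
from typing import Any, Dict, Optional


class Compressor:
    """Hypothetical base class holding all bound-conversion logic."""

    @staticmethod
    def abs_bound_codec(error_bound: float) -> Any:
        raise NotImplementedError

    @staticmethod
    def rel_bound_codec(error_bound: float) -> Any:
        raise NotImplementedError

    @classmethod
    def build(cls, *, data_min: float, data_max: float,
              abs_error: Optional[float] = None,
              rel_error: Optional[float] = None) -> Dict[str, Any]:
        """Return every codec variant that can honour the requested bound."""
        if (abs_error is None) == (rel_error is None):
            raise ValueError("pass exactly one of abs_error or rel_error")
        # A subclass supports a bound type if it overrides the constructor.
        has_abs = cls.abs_bound_codec is not Compressor.abs_bound_codec
        has_rel = cls.rel_bound_codec is not Compressor.rel_bound_codec
        max_abs = max(abs(data_min), abs(data_max))
        spans_zero = data_min <= 0.0 <= data_max
        min_abs = 0.0 if spans_zero else min(abs(data_min), abs(data_max))
        variants: Dict[str, Any] = {}
        if abs_error is not None:
            if has_abs:
                variants["abs"] = cls.abs_bound_codec(abs_error)
            if has_rel and max_abs > 0.0:
                variants["abs-as-rel"] = cls.rel_bound_codec(abs_error / max_abs)
        else:
            if has_rel:
                variants["rel"] = cls.rel_bound_codec(rel_error)
            if has_abs and min_abs > 0.0:
                variants["rel-as-abs"] = cls.abs_bound_codec(rel_error * min_abs)
        return variants


class RoundingCompressor(Compressor):
    """Example subclass that only knows about absolute error bounds."""

    @staticmethod
    def abs_bound_codec(error_bound: float) -> Any:
        # Placeholder config dict standing in for a real codec object.
        return {"id": "rounding", "precision": error_bound}
```

Calling `RoundingCompressor.build(data_min=1.0, data_max=4.0, rel_error=0.1)` would then yield a single "rel-as-abs" variant, since this subclass has no native relative-bound constructor.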

This PR is already quite large, so I would suggest we leave the actual implementations of the different ways to convert from relative to absolute error (and vice versa) to the next PR.

I have checked that this PR works on all the "tiny" datasets that we currently have in the data-loader.

juntyr

@treigerm Thank you again for all of your work on this PR! I left a few more comments, some minor, some requiring a design decision that we just need to commit to.

Co-authored-by: Juniper Tyree <50025784+juntyr@users.noreply.github.com>

treigerm · 2025-04-02T10:24:41Z

@juntyr Thanks a lot for all the feedback! I finally got around to changing the behaviour for datasets with multiple variables (and addressing the other minor points).

Based on our discussion we now build a separate codec for each variable in the dataset. Crucially, each variable can also have a different error bound as well. This makes the logic in the Compressor.build function a bit more complicated. I have added comments and ample type annotations to make it a bit more clear but let me know if anything looks wrong!

Because we can now have separate error bounds for each variable, I have also adjusted the create_error_bounds.py script. It now saves separate error bounds for each variable. I also made it compute some actual data-dependent error bounds. Specifically, for each variable it creates three error bounds: data_range * 0.01, data_range * 0.001, and data_range * 0.0001. The multiplicative factors are pulled out of a hat, but these error bounds should at least be a bit better than having the same absolute error bound for all variables, and we want to replace the way we choose the error bounds soon anyway.
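For illustration, the per-variable bounds could be computed along these lines. This is a sketch, not the contents of create_error_bounds.py: the function name, the input format, and the output layout are assumptions, and the bounds are assumed to be absolute error bounds.

```python
import json

# The three factors mentioned in the comment above.
FACTORS = (0.01, 0.001, 0.0001)


def make_error_bounds(ranges):
    """Map {variable: (data_min, data_max)} to a list of bounds per variable."""
    bounds = {}
    for variable, (data_min, data_max) in ranges.items():
        data_range = data_max - data_min
        bounds[variable] = [{"abs_error": data_range * f} for f in FACTORS]
    return bounds


def to_json(ranges):
    """Serialise the bounds, e.g. for writing to an error_bounds.json file."""
    return json.dumps(make_error_bounds(ranges), indent=2)
```

For a hypothetical temperature variable spanning 220 to 320, `make_error_bounds({"t2m": (220.0, 320.0)})` gives bounds of roughly 1, 0.1, and 0.01.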

juntyr

Mostly small nits now. I think with the small nits addressed we should merge. I'll want to have a go at the API myself but that will be a lot easier once we have your implementation working and can test it in practice

treigerm · 2025-04-03T09:23:49Z

@juntyr okay, so I managed to simplify things a bit further in the abstract compressor class. It still feels messier than it should be. As I mentioned in my comment above, I think part of the reason it's messy is that the data structures I have chosen for the inputs and outputs of the functions are not ideal. You're very welcome to have a go at refactoring it into something cleaner if you want (no need to keep any of the structure or types that I introduced).

For now, each codec that we generate should be identified by a tuple (error_bound_name, variant_name, variable_name). error_bound_name is the name of the original error bound (e.g. abs_error=0.1; right now the names are generated automatically, but it might be more user-friendly to have the user specify a concrete name); we need this name to group together all compressors that should (roughly) lead to the same error bound. variant_name records whether we transformed the original error bound and, if so, how. variable_name is the name of the variable being compressed. Part of the mess in the code is that at different stages I am working with dicts or lists which group the error bounds/codecs according to one of these three categories. It might be easier to just work with these triples directly.
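Working with the triples directly might look like the sketch below. The `CodecKey` type and the grouping helper are hypothetical names for illustration; the field names follow the tuple described above.

```python
from collections import defaultdict
from typing import Any, Dict, List, NamedTuple


class CodecKey(NamedTuple):
    """Identifies one generated codec."""
    error_bound_name: str  # e.g. "abs_error=0.1"
    variant_name: str      # whether and how the bound was transformed
    variable_name: str     # the variable being compressed


def group_by_error_bound(codecs: Dict[CodecKey, Any]) -> Dict[str, List[CodecKey]]:
    """Collect the keys of all codecs that target the same original bound."""
    groups: Dict[str, List[CodecKey]] = defaultdict(list)
    for key in codecs:
        groups[key.error_bound_name].append(key)
    return dict(groups)
```

A single flat dict keyed by `CodecKey` can be regrouped by any of the three fields on demand, which avoids carrying several nested dicts and lists through the build logic.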

juntyr · 2025-04-03T10:04:38Z

-        rate = 10.0  # x10 factor compression
+    def abs_bound_codec(dtype, error_bound):
+        precision = error_bound
+        max_pixel_val = 2**25 - 1  # maximum pixel value for our integer encoding.


I'm still unsure here. Our current "linear quantisation" just divides the data by eb and then rounds. So the integer range we generate is round(min/eb) <= x <= round(max/eb). If the min goes below -2^24 or the max goes above 2^24 - 1, the JPEG2000 codec will error.
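The concern can be checked numerically with a small sketch. The helper names are hypothetical, and the limits assume the comment refers to a signed 25-bit range, -2^24 to 2^24 - 1; the diff above uses 2**25 - 1 instead, so the exact limit is an assumption here.

```python
# Assumed limits of the integer encoding (signed 25-bit range).
INT_MIN = -(2**24)
INT_MAX = 2**24 - 1


def quantised_range(data_min: float, data_max: float, error_bound: float):
    """Integer range produced by dividing by the error bound and rounding."""
    return round(data_min / error_bound), round(data_max / error_bound)


def fits_integer_range(data_min: float, data_max: float, error_bound: float) -> bool:
    """True if every quantised value stays within the assumed limits."""
    lo, hi = quantised_range(data_min, data_max, error_bound)
    return INT_MIN <= lo and hi <= INT_MAX
```

With an error bound of 1e-3, data in [0, 100] quantises to at most 100000 and fits, while data in [0, 100000] quantises to about 1e8 and does not.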

juntyr

Thank you for the refactoring @treigerm, this PR looks good now. I left just some minor nits, feel free to squash and merge afterwards.

treigerm added 2 commits March 25, 2025 12:23

Add directory to store error bounds

4e585fe

Allow to configure each compressor with an error bound

e07f66c

treigerm commented Mar 25, 2025

Comment thread src/climatebenchpress/compressor/compressors/tthresh.py Outdated

treigerm commented Mar 25, 2025

Comment thread src/climatebenchpress/compressor/compressors/abc.py Outdated

treigerm requested a review from juntyr March 25, 2025 15:09

juntyr requested changes Mar 25, 2025

Pass dtype to build fn, add docstrings, create bitround helper fn

1f43aed

juntyr mentioned this pull request Mar 27, 2025

Add a Jpeg2000 compressor #17

Merged

juntyr reviewed Mar 27, 2025

Comment thread src/climatebenchpress/compressor/compressors/abc.py Outdated

treigerm and others added 8 commits March 28, 2025 09:05

Enforce named arguments

ce5a989

Co-authored-by: Juniper Tyree <50025784+juntyr@users.noreply.github.com>

Refactor compressors to allow multiple error bound conversions

7c733b3

Refactor compressor to have transformation logic in base class

5ebe08b

Add docstring

a3530ba

Merge remote-tracking branch 'origin/main' into error_bounds

cd9ef48

Adjust JPEG2000 for absolute error bounds

8ba9df2

Refine docstring

9b0001c

Fix compressed datasets path

1ec0fe3

juntyr requested changes Mar 31, 2025

treigerm and others added 5 commits April 1, 2025 14:30

Fix grammar in docstring

d4f16f3

Co-authored-by: Juniper Tyree <50025784+juntyr@users.noreply.github.com>

Fix relative error bound conversion bug

de6fd17

Co-authored-by: Juniper Tyree <50025784+juntyr@users.noreply.github.com>

Improved comments and error handling

84c113a

Generate separate codec for each variable

3e6777b

Fix JPEG2000 maximum pixel value

6309efb

Save full stacktrace when error occurs

be3805a

juntyr reviewed Apr 2, 2025

treigerm added 2 commits April 2, 2025 18:48

Clarify control flow, address PR review comments

bcf7f3f

Simplify control flow further

388ac13

treigerm added 2 commits April 3, 2025 09:51

Adjust JPEG2000 precision

445df64

Clarifying comments

1383d9a

juntyr reviewed Apr 3, 2025

Comment thread src/climatebenchpress/compressor/compressors/abc.py Outdated

juntyr reviewed Apr 3, 2025

Comment thread src/climatebenchpress/compressor/compressors/abc.py Outdated

juntyr reviewed Apr 3, 2025

Comment thread src/climatebenchpress/compressor/compressors/abc.py

juntyr approved these changes Apr 3, 2025

treigerm added 2 commits April 3, 2025 14:10

Comment about input transformation

970f5ea

Rename dataclasses with more detailed names

d2a864d

treigerm merged commit 2b3d895 into main Apr 3, 2025
3 checks passed

treigerm mentioned this pull request Apr 3, 2025

JPEG2000 input transformation #18

Closed

juntyr mentioned this pull request Apr 10, 2025

Multiple error thresholds for each compressor #11

Closed

juntyr changed the title ~~[Draft] Evaluating each compressor on multiple error bounds~~ Evaluating each compressor on multiple error bounds Apr 10, 2025

juntyr deleted the error_bounds branch April 10, 2025 13:58


2 participants