Evaluating each compressor on multiple error bounds #15
juntyr
left a comment
Thanks for working on this @treigerm! I left some initial comments, mostly about how we compute relative error bounds.
For data that is all-positive or all-negative, we could also use a less pessimistic combination of the https://numcodecs-wasm.readthedocs.io/en/latest/api/numcodecs_wasm_log/ codec (and https://numcodecs-wasm.readthedocs.io/en/latest/api/numcodecs_wasm_fixed_offset_scale/ for all-negative) to get proper relative errors.
For data that's both negative and positive, we'd have to go another route.
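To make the log idea concrete, here is a minimal NumPy-only sketch. It does not use the numcodecs-wasm codecs linked above; a plain uniform quantiser in log space stands in for any absolute-error-bounded compressor, and the function names are hypothetical. It assumes all-positive data and a pointwise relative error defined as |x' - x| / |x|.

```python
import numpy as np


def log_space_abs_bound(rel_error: float) -> float:
    """Absolute bound in log space that implies a pointwise relative bound."""
    # |ln(x') - ln(x)| <= e  implies  |x' - x| / |x| <= exp(e) - 1,
    # so e = ln(1 + rel_error) yields the requested relative bound.
    return float(np.log1p(rel_error))


def roundtrip_all_positive(data: np.ndarray, rel_error: float) -> np.ndarray:
    """Stand-in lossy round trip: uniform quantisation of ln(data)."""
    eb = log_space_abs_bound(rel_error)
    logged = np.log(data)
    # Rounding to multiples of 2 * eb keeps the log-space error within eb.
    quantised = np.round(logged / (2 * eb)) * (2 * eb)
    return np.exp(quantised)


data = np.linspace(0.5, 2000.0, 10_000)
decoded = roundtrip_all_positive(data, rel_error=0.01)
max_rel = float(np.max(np.abs(decoded - data) / np.abs(data)))
```

For all-negative data the same sketch would apply after negating the input and negating the decoded output again.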
I like the idea of using the log transformation to get a less pessimistic error bound transformation. Let's discuss with @milankl tomorrow what the best route forward is!
Co-authored-by: Juniper Tyree <50025784+juntyr@users.noreply.github.com>
@juntyr I have now significantly updated the PR to take the "iterator approach" and have updated it with the added JPEG2000 compressor as well. This PR is already quite large, so I would suggest we leave the actual implementations of the different ways to convert from relative to absolute error bounds (and vice versa) to the next PR. I have checked that this PR works on all the "tiny" datasets that we currently have in the
@juntyr Thanks a lot for all the feedback! I finally got around to changing the behaviour for datasets with multiple variables (and addressing the other minor points). Based on our discussion, we now build a separate codec for each variable in the dataset. Crucially, each variable can also have a different error bound. This makes the logic in the Because we can now have separate error bounds for each variable, I have also adjusted the
juntyr
left a comment
Mostly small nits now. I think we should merge once they are addressed. I'll want to have a go at the API myself, but that will be a lot easier once we have your implementation working and can test it in practice.
@juntyr okay, so I managed to simplify things a bit further in the abstract compressor class. It still feels messier than it should be. As I mentioned in my comment above, I think part of the reason it's messy is that the data structures I have chosen for the inputs and outputs of the functions are not ideal. You're very welcome to have a go at refactoring it into something cleaner if you want (no need to keep any of the structure or types that I introduced). For now, each codec that we generate should be identified by a tuple
    rate = 10.0  # x10 factor compression
    def abs_bound_codec(dtype, error_bound):
        precision = error_bound
        max_pixel_val = 2**25 - 1  # maximum pixel value for our integer encoding.
I'm still unsure here. Our current "linear quantisation" just divides the data by eb and then rounds, so the integer range we generate is round(min/eb) <= x <= round(max/eb). If the min goes below -2^24 or the max goes above 2^24 - 1, the JPEG2000 codec will error.
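The concern could be checked up front with something like the sketch below. The function names are hypothetical, and the limits are taken from this comment (a signed 25-bit range), not from the JPEG2000 codec's documentation, so treat them as an assumption.

```python
import numpy as np

# Assumption from the comment above: the JPEG2000 path accepts integers
# in the signed 25-bit range [-2**24, 2**24 - 1].
INT_MIN = -(2**24)
INT_MAX = 2**24 - 1


def quantised_range(data_min: float, data_max: float, eb: float) -> tuple[int, int]:
    """Integer range produced by dividing by eb and then rounding."""
    return int(np.round(data_min / eb)), int(np.round(data_max / eb))


def fits_integer_encoding(data_min: float, data_max: float, eb: float) -> bool:
    """True if the quantised data stays inside the assumed integer range."""
    lo, hi = quantised_range(data_min, data_max, eb)
    return INT_MIN <= lo and hi <= INT_MAX
```

For example, temperatures between 200 and 320 with eb = 1e-5 would quantise to values up to 3.2e7, which is outside this range, so the codec would have to reject the bound or shift and rescale the data first.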
This is a draft PR to address #11. At a high level, it makes the following adjustments:
- datasets-error-bounds/{dataset_name}/error_bounds.json.
- build method to take arguments data_min, data_max (minimum and maximum value in the data) and one of either abs_error, an absolute error bound, or rel_error, a relative error bound. If an absolute error bound is passed to a compressor that can only handle relative error bounds, the information from data_min, data_max is used to compute the most stringent relative error bound that ensures the specified absolute error bound won't be exceeded (and vice versa, if the compressor can only handle absolute error bounds but a relative error bound is specified).
- compressed-datasets/{dataset_name}/{error_bound_name}={error_bound}/{compressor_name}

I'm happy to be convinced of another design; the reasons I picked this structure are:
I have marked this as a draft because the whole structure/approach might change. I just wanted to have a working example to guide the discussion. Let me know what you think, @juntyr!
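The conversion between absolute and relative error bounds described in the list above could look roughly like this. It is a sketch, not the code in this PR: the function names are hypothetical, and it assumes a pointwise relative error, i.e. |x' - x| <= rel_error * |x| for every value x.

```python
def abs_to_rel(abs_error: float, data_min: float, data_max: float) -> float:
    """Relative bound that never exceeds abs_error anywhere in the data."""
    # rel * |x| <= abs_error must hold for the largest |x| in the data.
    largest = max(abs(data_min), abs(data_max))
    if largest == 0.0:
        raise ValueError("all-zero data has no meaningful relative bound")
    return abs_error / largest


def rel_to_abs(rel_error: float, data_min: float, data_max: float) -> float:
    """Absolute bound that never exceeds rel_error * |x| anywhere in the data."""
    # If the range includes zero, the smallest |x| may be 0 and no positive
    # absolute bound can guarantee the relative one.
    if data_min <= 0.0 <= data_max:
        raise ValueError("data range includes zero; cannot derive an absolute bound")
    smallest = min(abs(data_min), abs(data_max))
    return rel_error * smallest
```

Both directions are pessimistic by construction: they pick the single bound that is safe for the worst-case value in [data_min, data_max], which is why data spanning several orders of magnitude ends up compressed far more tightly than requested.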