
Sub-batches with equal flux in photon pooling#513

Open
welucas2 wants to merge 9 commits into main from u/welucas2/equal-flux-subbatches

Conversation

@welucas2
Collaborator

Currently, the sub-batching in photon pooling is performed object-by-object. So, for example, if there are 100 objects in the batch, and 20 sub-batches, then the first five objects will go in the first sub-batch, the next five in the second, and so on. This means that if the objects vary a lot in brightness (and they do) then the flux per sub-batch will also vary a lot. In a typical run and setup provided by Jim, I found that most sub-batches contained a few thousand photons, and a handful contained several tens of millions. This leads to regular memory spikes when those sub-batches are processed in each new photon pool.

This PR smooths those memory spikes out by creating sub-batches with roughly equal fluxes. This is an implementation of the bin packing problem with fragmentation: our goal is to split the total flux in the pool across the sub-batches, and to achieve this we have to be able to fragment those extremely bright objects across however many sub-batches.

At the same time, each object fragmentation means another object lookup, and we had previously determined that with photon pooling these can mount up and become quite costly. For this reason, I'm allowing the flux in each sub-batch to vary slightly, letting objects fill a sub-batch up to 105% of the expected per-sub-batch flux if doing so prevents a fragmentation.
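To make the idea concrete, here is a minimal sketch of equal-flux sub-batching with fragmentation and a 5% slack allowance. All names here (`make_subbatches`, objects as `(obj_id, flux)` tuples) are illustrative, not the actual `imsim/photon_pooling.py` implementation.

```python
def make_subbatches(objects, nsubbatch, slack=1.05):
    """Split (obj_id, flux) pairs into nsubbatch lists of roughly equal flux.

    A bright object may be fragmented across consecutive sub-batches, but a
    sub-batch is allowed to fill to slack * nominal flux if that avoids a
    fragmentation (and hence an extra object lookup later).
    """
    total_flux = sum(flux for _, flux in objects)
    nominal = total_flux / nsubbatch   # target flux per sub-batch
    loose = slack * nominal            # soft cap with 5% slack

    subbatches = [[] for _ in range(nsubbatch)]
    fluxes = [0.0] * nsubbatch         # running flux total per sub-batch
    current = 0
    for obj_id, flux in objects:
        remaining = flux
        while remaining > 0.0:
            if current == nsubbatch - 1 or fluxes[current] + remaining <= loose:
                # Fits entirely (possibly using the slack), or we are in the
                # last sub-batch, which takes whatever is left.
                subbatches[current].append((obj_id, remaining))
                fluxes[current] += remaining
                remaining = 0.0
            else:
                # Fragment: fill this sub-batch to the nominal level, then
                # carry the rest of the object into the next sub-batch.
                available = nominal - fluxes[current]
                if available > 0.0:
                    subbatches[current].append((obj_id, available))
                    fluxes[current] += available
                    remaining -= available
                current += 1
    return subbatches, fluxes
```

With 2 sub-batches, an object carrying most of the flux gets split across both, while the slack lets a slightly over-nominal object avoid being fragmented at all.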

This is ready to go, but needs #511 to be merged in first and then this should be rebased onto it - so for now I'm leaving this as a draft.

@welucas2 welucas2 force-pushed the u/welucas2/equal-flux-subbatches branch from 04a30a2 to 02b52ba on March 26, 2026 at 10:46
@welucas2
Collaborator Author

This is ready for review, though CI failed earlier today during Conda setup: CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/linux-64/repodata.json>. Could someone with permission please re-run CI to see if it gets past this?

@welucas2 welucas2 marked this pull request as ready for review March 26, 2026 15:15
Contributor

@rmjarvis rmjarvis left a comment


I think I'm confused about some of this.

Comment thread imsim/photon_pooling.py
# Calculate current sub-batch flux.
# It would be nicer to store rather than recalculate.
subbatch = subbatches[current_subbatch]
subbatch_flux = sum(sb_obj.phot_flux for sb_obj in subbatch)
Contributor


As you say, it would be nicer to keep this as a running total, rather than O(N^2) recalculating it each time. Is that hard for some reason? I'm thinking the easiest way would be to keep an array with the flux in each batch so far.

subbatch_fluxes = [0. for _ in range(nsubbatch)]

Then update it each time you add something to subbatches.

There are also patterns that only keep the running sum for the current batch, but the advantage of this is there is a nice debug statement you could put at the end with the list of realized fluxes per batch, which would be handy to be able to easily see when running.
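The running-total pattern being suggested might look like the following sketch. The helper name `add_to_subbatch` and the tuple layout are assumptions for illustration, not the PR's actual code.

```python
import logging

logger = logging.getLogger(__name__)

nsubbatch = 4
subbatches = [[] for _ in range(nsubbatch)]
subbatch_fluxes = [0.0 for _ in range(nsubbatch)]  # running totals

def add_to_subbatch(i, sb_obj, phot_flux):
    """Append an entry to sub-batch i and update its running flux in O(1),
    instead of re-summing the whole sub-batch each time."""
    subbatches[i].append((sb_obj, phot_flux))
    subbatch_fluxes[i] += phot_flux

# At the end of batching, one line gives the realized fluxes per sub-batch:
# logger.debug("Realized flux per sub-batch: %s", subbatch_fluxes)
```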

Comment thread imsim/photon_pooling.py
if remaining_flux <= available_flux:
# This sub-batch can contain the entirety of the remaining
# object flux. Place it all in here, then continue without
# advancing in case more can go in.
Contributor


I expected the first test to be if remaining_flux <= loose_photons_per_subbatch - subbatch_flux, in which case put it in entirely. And if remaining_flux >= photons_per_subbatch - subbatch_flux then bump to the next batch.

I think this cleanly gets what you now have both here and the start of the else branch.
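As a sketch, the proposed ordering of the two tests could look like this (the variable names are taken from the quoted diff; the return values are illustrative). Note that since the loose limit is at least the nominal one, the second test always fires when the first fails, so it reduces to a single if/else.

```python
def handle_object(remaining_flux, subbatch_flux, photons_per_subbatch,
                  loose_photons_per_subbatch):
    """Decide what to do with the remaining flux of the current object.

    Returns an (action, flux_placed) pair; the action labels are assumed
    for illustration, not taken from the PR.
    """
    if remaining_flux <= loose_photons_per_subbatch - subbatch_flux:
        # The whole remainder fits under the loose (105%) cap: place it all.
        return ("place_all", remaining_flux)
    # Otherwise fill this sub-batch up to the nominal level and advance.
    placed = max(photons_per_subbatch - subbatch_flux, 0.0)
    return ("fill_and_advance", placed)
```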

# In general there are multiple ways to split the batch. Assert that each
# object appears with its original flux across however many sub-batches it
# appears in, that the total flux across all sub-batches equals the total
# batch flux, and that the most full batch contains <= 1.1 * the flux in the
Contributor


1.05?

Oh, actually, with the rhs using the minimum, it won't be this. And in fact, I don't think the code makes any guarantee about that ratio. There can be arbitrarily many batches up to 5% bigger than nominal, so the minimum can be arbitrarily small (even 0 I believe). So your use of 1.1 here is I think just something that happens to work for this particular case.

If we want to avoid that "failure mode" (of having a very small batch), then I think we could lessen the likelihood of it by recalculating the nominal and loose levels for the rest of the batches any time one batch goes over the nominal level. Not sure how important that is, but it's something we could implement if we see that there is often a batch at the end that is very low flux.
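The recalculation being described could be as small as a helper like this (a hypothetical sketch, not existing code): whenever a sub-batch closes above the nominal level, derive new nominal and loose targets from the flux still unassigned, so the shortfall is spread over the remaining sub-batches rather than landing entirely on the last one.

```python
def renormalize(total_remaining_flux, nsubbatch_remaining, slack=1.05):
    """Recompute the nominal and loose per-sub-batch flux targets from the
    flux not yet assigned and the number of sub-batches still open."""
    nominal = total_remaining_flux / nsubbatch_remaining
    return nominal, slack * nominal
```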

Comment thread imsim/photon_pooling.py
dataclasses.replace(obj, phot_flux=available_flux)
)
remaining_flux -= available_flux
current_subbatch = (1 + current_subbatch) % nsubbatch
Contributor


If I understand this algorithm correctly, I don't think the % nsubbatch is ever necessary. Is this a relic from a previous implementation? Or am I missing something here?

]
run_subbatch_test("fragmentation in first sub-batch", batch, 2)

# Make sure the sub-batcher can go backwards (i.e. assign to subbatches
Contributor


I didn't think the algorithm ever did go backwards? Again, am I missing something here?
