feat: add per-SP piece cleanup job to bound storage growth by Chaitu-Tatipamula · Pull Request #346 · FilOzone/dealbot

Chaitu-Tatipamula · 2026-03-10T22:10:06Z

Summary

Adds a piece_cleanup job type that periodically deletes the oldest pieces from each SP once total stored data exceeds a configurable quota (MAX_DATASET_STORAGE_SIZE_BYTES, default 24 GiB). Cleanup runs on the existing sp.work queue with singleton-per-SP semantics, ensuring it doesn't overlap with deal/retrieval work.

Closes #281

What it does

Quota-based cleanup: Deletes oldest pieces (FIFO) until SP is back under quota
Over-quota gating: Skips new deal creation for SPs that exceed the storage quota
Idempotent deletion: Treats contract reverts for already-deleted pieces as success and continues
Runtime cap: Aborts cleanup after MAX_PIECE_CLEANUP_RUNTIME_SECONDS (default 300s)
Rate-based scheduling: JOB_PIECE_CLEANUP_PER_SP_PER_HOUR (default 1), consistent with DEALS_PER_SP_PER_HOUR pattern

Changes

Config & Entity

New env vars: MAX_DATASET_STORAGE_SIZE_BYTES, JOB_PIECE_CLEANUP_PER_SP_PER_HOUR, MAX_PIECE_CLEANUP_RUNTIME_SECONDS
Deal entity: cleaned_up (bool) and cleaned_up_at (timestamptz) columns
piece_cleanup added to JobType union

Core Service (`piece-cleanup/`)

PieceCleanupService — cleanup loop, quota check, Synapse SDK integration
isProviderOverQuota() — used by deal handler to gate new deals
Shared Synapse instance with on-demand fallback
No-progress bail-out to prevent infinite loops when all deletions fail

Jobs Integration (`jobs/`)

handlePieceCleanupJob with AbortController timeout
Over-quota check in handleDealJob
Rate-based interval in getIntervalSecondsForRates
Always-scheduled in ensureScheduleRows

Documentation

docs/environment-variables.md — 3 new env vars
docs/jobs.md — piece_cleanup in job types table, capacity formula
docs/runbooks/jobs.md — pause/resume/trigger SQL examples
docs/checks/production-configuration-and-approval-methodology.md — updated FAQ

Testing

24 new tests in piece-cleanup.service.spec.ts
Updated jobs.service.spec.ts for new config structure
292/292 tests pass, biome ci clean

things to be considered

Soft-delete vs hard-delete: Currently soft-deletes deal rows (cleaned_up=true).
Failure backoff: Currently logs piece_cleanup_no_progress and retries next hour. maybe we can alert when this event appears like 3+ times in a cleanup interval (1 hour)

Copilot

Pull request overview

Adds a new “piece cleanup” mechanism to automatically delete old pieces from storage providers when they exceed a configured quota, and integrates it into the pg-boss job scheduler (including optional deal-job gating when over quota).

Changes:

Introduces PieceCleanupService (+ module + tests) to compute stored bytes per SP and delete oldest pieces until back under quota.
Extends pg-boss scheduling/worker handling with a new per-SP piece_cleanup job type and adds deal-job over-quota gating.
Adds new env/config knobs and documentation for quota size, cleanup rate, and cleanup runtime cap.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
docs/environment-variables.md	Documents new env vars for quota + cleanup scheduling/runtime.
apps/backend/src/piece-cleanup/piece-cleanup.service.ts	Implements quota check + cleanup loop and Synapse delete calls.
apps/backend/src/piece-cleanup/piece-cleanup.service.spec.ts	Adds unit tests for stored-bytes calc, cleanup looping, abort/idempotency.
apps/backend/src/piece-cleanup/piece-cleanup.module.ts	Wires cleanup service with TypeORM Deal repository + Wallet SDK.
apps/backend/src/jobs/jobs.service.ts	Adds `piece_cleanup` job handling, schedules it per SP, and gates deal jobs when over quota.
apps/backend/src/jobs/jobs.service.spec.ts	Updates schedule-row expectations and constructor deps for new service/config.
apps/backend/src/jobs/jobs.module.ts	Imports `PieceCleanupModule` so JobsService can use it.
apps/backend/src/database/entities/job-schedule-state.entity.ts	Extends `JobType` union to include `piece_cleanup`.
apps/backend/src/database/entities/deal.entity.ts	Adds `cleanedUp` / `cleanedUpAt` fields to track deletions.
apps/backend/src/config/app.config.ts	Adds Joi validation + config loading for new piece cleanup env vars.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

BigLep · 2026-03-12T15:31:02Z

NOte we may be delayed looking at this since it's not a priority for the M4.1 mainnet readiness milestone we're working on.

SgtPooki

@Chaitu-Tatipamula the main concern I have with this approach is that we’re using our DB as the source of truth for how much data an SP currently has stored. I think that needs to be inverted. But there are two other main concerns below as well

We should use synapse-sdk / filecoin-pin first to determine the provider’s actual live stored size, and only if the provider is over quota should we query our DB to decide which pieces are eligible for cleanup. The DB is useful for candidate selection and bookkeeping, but it shouldn’t be the thing that decides whether we’re over quota, because it can drift from chain/provider reality.

I’d also want the cleanup path to reconcile DB state after deletion so we don’t keep records around for pieces that are no longer live. That matters not just for quota accounting, but also because those rows are what we use to decide what retrieval work to run. If a piece is already gone from the provider, we should clean up the DB state as part of reconciliation rather than continuing to treat it as stored data.

Concretely, I think the flow should be:

Ask Synapse/filecoin-pin for the SP’s current live usage.
If over quota, query the DB for the oldest eligible pieces to delete.
After each successful delete, update/remove the DB record so retrieval scheduling stays correct.
Add tests around DB/provider drift so we don’t block deals or schedule retrievals based on stale rows.

I don’t think this should run hourly by default. #281 doesn’t require that cadence, and I’d prefer cleanup to run once per day unless we have evidence we need it to be more aggressive. The current PR models this as runs-per-hour with a default of 1, which feels too frequent for a maintenance/reconciliation job.

I think this needs a second threshold, not just a max quota. When we detect an SP is over quota, we should clean down to a configurable target or create a configurable amount of free headroom, not stop at “barely under quota.” That target needs to be chosen relative to the cleanup cadence and expected growth rate. Otherwise a daily/weekly cleanup can end up either thrashing near the threshold or spending an unbounded amount of time deleting pieces in one run. I’d like to see configurable high-water/low-water behavior here, plus a bounded per-run cleanup budget.

Chaitu-Tatipamula · 2026-03-17T07:53:56Z

We should use synapse-sdk / filecoin-pin first to determine the provider’s actual live stored size, and only if the provider is over quota should we query our DB to decide which pieces are eligible for cleanup. The DB is useful for candidate selection and bookkeeping, but it shouldn’t be the thing that decides whether we’re over quota, because it can drift from chain/provider reality.

will implement using filecoin-pin's calculateActualStorage() for the quota decision.

i've gone through both libraries and found filecoin-pin has a purpose-built calculateActualStorage() function that:

takes a Synapse instance + filtered dataset list
queries each dataset's actual pieces from the provider
returns { totalBytes, pieceCount, dataSetCount } with abort/timeout support and provider-scoped

combined with listDataSets(synapse, { filter: ds => ds.serviceProvider === spAddress && ds.isLive }) to scope to a single SP, the updated flow will be:

listDataSets(synapse, { filter }) → get live datasets for this SP
calculateActualStorage(synapse, spDatasets, { signal }) → get actual totalBytes from the provider
If totalBytes > MAX_THRESHOLD → query db for oldest cleanup candidates
db SUM(piece_size) remains for candidate selection only, not quota decisions

The PieceInfo type also has a status enum (ACTIVE, PENDING_REMOVAL, ONCHAIN_ORPHANED, OFFCHAIN_ORPHANED) which will be useful for the reconciliation concern as well.

I’d also want the cleanup path to reconcile DB state after deletion so we don’t keep records around for pieces that are no longer live. That matters not just for quota accounting, but also because those rows are what we use to decide what retrieval work to run. If a piece is already gone from the provider, we should clean up the DB state as part of reconciliation rather than continuing to treat it as stored data.

this is partially implemented, will add explicit drift tests.

the current implementation already handles reconciliation:

deletePiece() marks cleanedUp = true on success
contract reverts like "Can only schedule removal of live pieces" are caught and treated as idempotent success — the deal is still marked cleanedUp = true so it's excluded from future quota calculations and retrieval scheduling
both getCleanupCandidates() and getStoredBytesForProvider()already filter on cleanedUp = false

i can add explicit tests covering the DB/provider drift scenario:
test: piece already deleted on provider → deletePiece catches the revert → cleanedUp = true → retrieval scheduler won't pick it up
test: verify getStoredBytesForProvider() excludes cleaned-up deals
combined with the first concern, this means stale db rows wont affect the quota decision at all - theyll only remain until the next cleanup run discovers and reconciles them.

or would you suggest removing the DB record altogether?

I don’t think this should run hourly by default. #281 doesn’t require that cadence, and I’d prefer cleanup to run once per day unless we have evidence we need it to be more aggressive. The current PR models this as runs-per-hour with a default of 1, which feels too frequent for a maintenance/reconciliation job.

will change from 1/hour to 1/day.

I think this needs a second threshold, not just a max quota. When we detect an SP is over quota, we should clean down to a configurable target or create a configurable amount of free headroom, not stop at “barely under quota.” That target needs to be chosen relative to the cleanup cadence and expected growth rate. Otherwise a daily/weekly cleanup can end up either thrashing near the threshold or spending an unbounded amount of time deleting pieces in one run. I’d like to see configurable high-water/low-water behavior here, plus a bounded per-run cleanup budget.

will add a configurable low-water mark (TARGET) and document the headroom rationale

New env var: TARGET_DATASET_STORAGE_SIZE_BYTES defaulted to 20 GiB. when cleanup triggers (live usage exceeds the 24 GiB MAX), it will delete oldest pieces until usage drops below the 20 GiB TARGET, creating ~4 GiB of headroom.

at the current growth rate (4 deals/SP/hour × 10 MiB = ~960 MiB/day), 4 GiB of headroom gives ~4 days before the high-water mark is hit again, which aligns well with a daily cleanup cadence.

For the bounded budget:

runtime budget: already have MAX_PIECE_CLEANUP_RUNTIME_SECONDS (default 300s)
piece-count budget: I can add MAX_PIECES_PER_CLEANUP_RUN (default e.g., 500) as a hard cap on deletions per run
this gives operators three points: time budget, piece-count budget, and target headroom. do you think both budget types, or is just the runtime cap sufficient?

let me know if i am on right path @SgtPooki !

SgtPooki · 2026-03-27T15:54:06Z

sorry i didn't get to this. heading on vacation so i wont be able to look until 2026 APR 8+

@silent-cipher can you take over?

silent-cipher

We generally use standardized field names in structured logs to help with filtering in SPs:

providerAddress
providerId
providerName

It would be good to align logs with these conventions. I’ve added a few suggestions, but it may be worth applying this consistently across all log statements.

The current deletion flow looks potentially incorrect.
When creating a storage context, the sdk resolves the dataset using either:

dataset metadata (via options), or
dataSetId (preferred for explicit resolution)

In this case, since no metadata is provided, the SDK may resolve a dataset with empty metadata, which is risky. We should explicitly pass dataSetId to ensure correct dataset resolution before calling deletePiece.

silent-cipher

job-schedule.repository.ts likely needs to be revisited.

deleteSchedulesForInactiveProviders should include the piece_cleanup job type
countBossJobStates should also account for piece_cleanup jobs
minBossJobAgeSecondsByState should include piece_cleanup in its query as well

Chaitu-Tatipamula · 2026-04-07T20:44:58Z

job-schedule.repository.ts likely needs to be revisited.

deleteSchedulesForInactiveProviders should include the piece_cleanup job type

countBossJobStates should also account for piece_cleanup jobs

minBossJobAgeSecondsByState should include piece_cleanup in its query as well

i am adding piece_cleanup to deleteSchedulesForInactiveProviders since this hardcodes the jobType. for countBossJobStates and minBossJobAgeSecondsByState, these already handle it: piece_cleanup jobs go through the shared sp.work queue with jobType being "piece_cleanup" wdyt @silent-cipher ?

silent-cipher

Overall, this lgtm!

One last suggestion: could we pass a consistent log context throughout the piece cleanup process, similar to what we do for deal, retrieval, and dataset creation flows?
This would help keep logging consistent and make debugging easier.

- Add PieceCleanupService with while-loop cleanup until under quota - Implement idempotent deletion (not found / already deleted = success) - Add no-progress bail-out guard to prevent infinite loops - Add isProviderOverQuota() for deal creation gating - Add over-quota gating in handleDealJob (skip deal if SP above quota) - Use jobs.maxPieceCleanupRuntimeSeconds for timeout in handlePieceCleanupJob - Add PieceCleanupModule to JobsModule and AppModule imports - Update jobs.service.spec.ts config mock for new structure - Add 23 unit tests covering all edge cases

- Removed unused `PIECE_CLEANUP_QUEUE` constant. Piece cleanup jobs correctly run on the shared `sp.work` singleton queue. - Fixed a timeout leak in `handleDealJob` by moving the SP over-quota check before setting up the `AbortController` timer. - Optimized `cleanupPiecesForProvider` to create a single `StorageContext` before the batch deletion loop.

- Drop dataSetId from the deletePiece - Align getStoredBytesForProvider with getCleanupCandidates - Add unit tests for over-quota gating in handleDealJob

…ounter pieceSize is set atomically with pieceId and DEAL_CREATED status by the upload pipeline. There is no real scenario where a cleanup candidate has pieceSize = 0.

…ot a function" warnings

- Query filecoin-pin's calculateActualStorage() for real-time SP usage instead of relying on DB SUM(piece_size) for quota checks - Fall back to DB when live query is unavailable - Add high-water (MAX) / low-water (TARGET) thresholds so cleanup deletes down to 20 GiB instead of stopping at 24 GiB boundary - Change default cleanup cadence from hourly to daily - Exclude cleaned-up deals from retrieval scheduling queries - Add drift tests for deal blocking and retrieval selection - Update environment-variables docs

…gration - Fix off-by-one JobsServiceDeps indices after pieceCleanupService insertion at [8]

- Add Joi cross-validation ensuring TARGET < MAX storage thresholds - Fix pieceCleanupPerSpPerHour default mismatch (1 → 1/24) - Type cleanedUpAt as Date | null to match nullable column - Use createSynapseFromConfig for session key + RPC URL support - Import StorageContext from @filoz/synapse-sdk/storage - Pass dataSetId explicitly to createContext to prevent wrong-dataset deletion - Add dataSetId null guard in deletePiece - Normalize SP addresses to lowercase in listDataSets filter - Standardize log fields to use providerAddress

Replace all spAddress log fields with providerAddress and use static messages with structured data fields for consistent log filtering.

- Extract checkProviderQuota() to DRY up isProviderOverQuota and cleanupPiecesForProvider (both had identical live→DB fallback logic) - Filter getCleanupCandidates and getStoredBytesForProvider by wallet_address to prevent cleaning up pieces from other wallets - Add piece_cleanup to deleteSchedulesForInactiveProviders so cleanup schedules are properly removed when providers become inactive

# Conflicts: # apps/backend/src/jobs/jobs.module.ts # apps/backend/src/jobs/jobs.service.spec.ts # apps/backend/src/jobs/jobs.service.ts # docs/jobs.md # docs/runbooks/jobs.md

Use provider-reported storage as the quota source of truth instead of falling back to DB totals. Thread the cleanup abort signal through the live storage calculation and treat timed-out actual-storage results as indeterminate. Leave terminated datasets and blocked-provider cleanup behavior unchanged pending a follow-up reconciliation policy, since FWSS still allows piece removals after service termination until the PDP end epoch.

SgtPooki · 2026-04-21T18:47:35Z

Over-quota gating: Skips new deal creation for SPs that exceed the storage quota

we should not prevent deal creation.. we just need to make sure cleanup is releasing the pressure of large storage.

dealbot should constantly be running jobs that perform data-storage and retrieval checks.. im fixing the merge conflicts and addressing some things and will push up on this PR shortly.

Remove over-quota gating from deal jobs so data-storage checks continue even when an SP is above the cleanup threshold. Keep cleanup responsible for creating headroom, make dataset listing abort-aware, and update docs/tests to match the cleanup-only behavior.

Use provider-reported storage first, then fall back to DB piece-size accounting when the live query fails. Keep aborts from falling back so cleanup job runtime limits still stop the run instead of authorizing deletion from DB totals after cancellation.

Treat filecoin-pin actual-storage timeouts as indeterminate cleanup state rather than live-query failures that can fall back to DB accounting. Update the deal-job regression test to guard against real cleanup service calls instead of an obsolete quota helper.

FilOzzy added this to FOC Mar 10, 2026

github-project-automation Bot moved this to 📌 Triage in FOC Mar 10, 2026

Chaitu-Tatipamula force-pushed the feat/piece-cleanup branch from c8ea542 to c98b174 Compare March 10, 2026 22:47

BigLep requested a review from Copilot March 10, 2026 23:54

BigLep moved this from 📌 Triage to 🔎 Awaiting review in FOC Mar 10, 2026

Copilot started reviewing on behalf of BigLep March 10, 2026 23:55 View session

Copilot AI reviewed Mar 11, 2026

View reviewed changes

Comment thread apps/backend/src/database/entities/deal.entity.ts Outdated

Comment thread apps/backend/src/piece-cleanup/piece-cleanup.service.ts

Comment thread apps/backend/src/piece-cleanup/piece-cleanup.service.ts Outdated

Comment thread apps/backend/src/jobs/jobs.service.ts Outdated

Chaitu-Tatipamula force-pushed the feat/piece-cleanup branch 2 times, most recently from 8926d91 to a2a9bfc Compare March 11, 2026 07:52

BigLep assigned Chaitu-Tatipamula Mar 12, 2026

SgtPooki requested changes Mar 16, 2026

View reviewed changes

github-project-automation Bot moved this from 🔎 Awaiting review to ⌨️ In Progress in FOC Mar 16, 2026

Chaitu-Tatipamula force-pushed the feat/piece-cleanup branch 2 times, most recently from 672514d to bb16d11 Compare March 18, 2026 14:19

Chaitu-Tatipamula requested a review from SgtPooki March 18, 2026 14:20

BigLep moved this from ⌨️ In Progress to 🔎 Awaiting review in FOC Mar 19, 2026

Chaitu-Tatipamula force-pushed the feat/piece-cleanup branch from bb16d11 to 21e0b70 Compare March 23, 2026 18:16

silent-cipher requested changes Apr 6, 2026

View reviewed changes

github-project-automation Bot moved this from 🔎 Awaiting review to ⌨️ In Progress in FOC Apr 6, 2026

Chaitu-Tatipamula force-pushed the feat/piece-cleanup branch 2 times, most recently from 582e0bb to 2bed88a Compare April 6, 2026 18:18

Chaitu-Tatipamula requested a review from silent-cipher April 6, 2026 20:08

silent-cipher requested changes Apr 7, 2026

View reviewed changes

Comment thread apps/backend/src/piece-cleanup/piece-cleanup.service.ts Outdated

Comment thread apps/backend/src/piece-cleanup/piece-cleanup.service.ts

silent-cipher requested changes Apr 7, 2026

View reviewed changes

Chaitu-Tatipamula requested a review from silent-cipher April 7, 2026 20:46

silent-cipher approved these changes Apr 8, 2026

View reviewed changes

Chaitu-Tatipamula added 16 commits April 21, 2026 11:58

fix(piece-cleanup): address PR review feedback on cleanup service

72b22ce

- Drop dataSetId from the deletePiece - Align getStoredBytesForProvider with getCleanupCandidates - Add unit tests for over-quota gating in handleDealJob

chore(db): add migration for piece cleanup columns

3d829a8

docs: document piece_cleanup job

6412afd

refactor(piece-cleanup): remove fileSize fallback from bytesRemoved c…

60a4b7c

…ounter pieceSize is set atomically with pieceId and DEAL_CREATED status by the upload pipeline. There is no real scenario where a cleanup candidate has pieceSize = 0.

test(jobs): add default isProviderOverQuota mock to prevent "... is n…

cff1223

…ot a function" warnings

fix(piece-cleanup): align with synapse-sdk integration

cb5bc81

fix: resolve rebase conflicts with synapse-sdk and piece-cleanup inte…

2cbe7bb

…gration - Fix off-by-one JobsServiceDeps indices after pieceCleanupService insertion at [8]

style: standardize remaining log fields in piece-cleanup service

9511310

Replace all spAddress log fields with providerAddress and use static messages with structured data fields for consistent log filtering.

refactor: remove unused formatBytes and StorageContext

34986af

feat: add PieceCleanupLogContext and thread it through cleanup flow

cd5876b

refactor: accept ProviderJobContext at cleanup public API boundary

8a13e30

Chaitu-Tatipamula force-pushed the feat/piece-cleanup branch from dc05dc8 to 8a13e30 Compare April 21, 2026 06:28

Chaitu-Tatipamula and others added 3 commits April 21, 2026 12:00

fix: fixes code quality CI

fee6042

Merge remote-tracking branch 'origin/main' into codex/pr346-merge-fixes

161f1c3

# Conflicts: # apps/backend/src/jobs/jobs.module.ts # apps/backend/src/jobs/jobs.service.spec.ts # apps/backend/src/jobs/jobs.service.ts # docs/jobs.md # docs/runbooks/jobs.md

SgtPooki added 2 commits April 21, 2026 14:49

SgtPooki mentioned this pull request Apr 21, 2026

Define cleanup policy for blocked providers and terminated datasets #462

Open

SgtPooki approved these changes Apr 21, 2026

View reviewed changes

github-project-automation Bot moved this from 🔎 Awaiting review to ✔️ Approved by reviewer in FOC Apr 21, 2026

SgtPooki merged commit c9c50b2 into FilOzone:main Apr 21, 2026
7 checks passed

github-project-automation Bot moved this from ✔️ Approved by reviewer to 🎉 Done in FOC Apr 21, 2026

FilOzzy mentioned this pull request Apr 21, 2026

chore: release to production (main) #458

Open

Conversation

Chaitu-Tatipamula commented Mar 10, 2026

Summary

What it does

Changes

Config & Entity

Core Service (piece-cleanup/)

Jobs Integration (jobs/)

Documentation

Testing

things to be considered

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

BigLep commented Mar 12, 2026

Uh oh!

SgtPooki left a comment

Choose a reason for hiding this comment

Uh oh!

Chaitu-Tatipamula commented Mar 17, 2026

Uh oh!

SgtPooki commented Mar 27, 2026

Uh oh!

silent-cipher left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

silent-cipher left a comment

Choose a reason for hiding this comment

Uh oh!

Chaitu-Tatipamula commented Apr 7, 2026

Uh oh!

silent-cipher left a comment

Choose a reason for hiding this comment

Uh oh!

SgtPooki commented Apr 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Core Service (`piece-cleanup/`)

Jobs Integration (`jobs/`)