Skip to content

chore: add actor-level retry for infra errors#2062

Open
SantiagoPittella wants to merge 3 commits into
mainfrom
santiagopittella-fix-ntx-builder-note-discarding
Open

chore: add actor-level retry for infra errors#2062
SantiagoPittella wants to merge 3 commits into
mainfrom
santiagopittella-fix-ntx-builder-note-discarding

Conversation

@SantiagoPittella
Copy link
Copy Markdown
Collaborator

@SantiagoPittella SantiagoPittella commented May 7, 2026

closes #2052.

The NTX builder was permanently dropping notes after a run of unrelated infrastructure failures because every error type counted against the per-note max_note_attempts cap.

This PR classifies NtxError into infrastructure vs. intrinsic. Infra failures now retry the same candidate after a exponential backoff (using backon) without touching per-note state. Intrinsic failures keep the existing behaviour.

This PR also introduces a dependency for backoff, backon, and note that I didn't used the retry feature (probably the most important part) because we use a tokio::select! that needs to interleave several other things around each retry. Making use of retry will require further "plumbing" changes.

Comment thread crates/ntx-builder/src/actor/mod.rs Outdated
},
ExecutionOutcome::IntrinsicFailure => {
self.reset_infra_backoff();
self.mode = ActorMode::NoViableNotes;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is incorrect? A failure doesn't mean that there are notes to consume?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are three different backoff variations I can think of

  1. Actor based (this impl)
  2. Request based i.e. each request to the prover has its own backoff which resets next tx
  3. Client based i.e. shared by all requests and all actors

I'm actually unsure which we should be doing.. It sort of makes sense to me that it would be client based, but also maybe not? But if one considers a service that rate limits us, then it doesn't care about the distinction, beyond the entire ntxb should slow down.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, maybe per request makes the most sense.. otherwise we share backoffs for unrelated infrastructure e.g. prover, block-producer, validator..

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SantiagoPittella I think it might be more correct to have retries at the request level itself. At that stage, it doesn't really make sense to have an error kind like this, since you'd be managing different errors at each request point.

The current intrinsic errors would also be invoked at a specific point only, so either an is_intrinsic() -> bool or an inline match make sense to me.

We should go over the list to ensure we are doing the correct thing. Do we want to reject prover errors? Maybe? If its caused by a bug.. a problem becomes what if the bug is in the prover, and it gets fixed? Then the note won't be retried though it would now pass. I'm unsure what the correct answer is.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, sorry for the late response. I'm changing to a per-request retry

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this approach I did a better use of the retry crate that you mentioend in the feature

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants