chore: add actor-level retry for infra errors#2062
Conversation
| }, | ||
| ExecutionOutcome::IntrinsicFailure => { | ||
| self.reset_infra_backoff(); | ||
| self.mode = ActorMode::NoViableNotes; |
There was a problem hiding this comment.
I think this is incorrect? A failure doesn't mean that there are notes to consume?
There was a problem hiding this comment.
There are three different backoff variations I can think of
- Actor based (this impl)
- Request based i.e. each request to the prover has its own backoff which resets next tx
- Client based i.e. shared by all requests and all actors
I'm actually unsure which we should be doing.. It sort of makes sense to me that it would be client based, but also maybe not? But if one considers a service that rate limits us, then it doesn't care about the distinction, beyond the entire ntxb should slow down.
There was a problem hiding this comment.
However, maybe per request makes the most sense.. otherwise we share backoffs for unrelated infrastructure e.g. prover, block-producer, validator..
There was a problem hiding this comment.
@SantiagoPittella I think it might be more correct to have retries at the request level itself. At that stage, it doesn't really make sense to have an error kind like this, since you'd be managing different errors at each request point.
The current intrinsic errors would also be invoked at a specific point only, so either an is_intrinsic() -> bool or an inline match make sense to me.
We should go over the list to ensure we are doing the correct thing. Do we want to reject prover errors? Maybe? If its caused by a bug.. a problem becomes what if the bug is in the prover, and it gets fixed? Then the note won't be retried though it would now pass. I'm unsure what the correct answer is.
There was a problem hiding this comment.
Hey, sorry for the late response. I'm changing to a per-request retry
There was a problem hiding this comment.
With this approach I did a better use of the retry crate that you mentioend in the feature
closes #2052.
The NTX builder was permanently dropping notes after a run of unrelated infrastructure failures because every error type counted against the per-note
max_note_attemptscap.This PR classifies
NtxErrorinto infrastructure vs. intrinsic. Infra failures now retry the same candidate after a exponential backoff (usingbackon) without touching per-note state. Intrinsic failures keep the existing behaviour.This PR also introduces a dependency for backoff,
backon, and note that I didn't used theretryfeature (probably the most important part) because we use atokio::select!that needs to interleave several other things around each retry. Making use of retry will require further "plumbing" changes.