Conversation
Signed-off-by: Christopher Irving <cirving@trustworthy.systems>
Signed-off-by: Christopher Irving <cirving@trustworthy.systems>
Signed-off-by: Christopher Irving <cirving@trustworthy.systems>
Signed-off-by: Christopher Irving <cirving@trustworthy.systems>
Signed-off-by: Christopher Irving <cirving@trustworthy.systems>
Signed-off-by: Christopher Irving <cirving@trustworthy.systems>
Signed-off-by: Christopher Irving <cirving@trustworthy.systems>
Signed-off-by: Christopher Irving <cirving@trustworthy.systems>
Signed-off-by: Christopher Irving <cirving@trustworthy.systems>
Signed-off-by: Christopher Irving <cirving@trustworthy.systems>
Signed-off-by: Christopher Irving <cirving@trustworthy.systems>
Signed-off-by: Christopher Irving <cirving@trustworthy.systems>
Signed-off-by: Christopher Irving <cirving@trustworthy.systems>
Signed-off-by: Christopher Irving <cirving@trustworthy.systems>
Signed-off-by: Christopher Irving <cirving@trustworthy.systems>
I don't think that behaviour would be usable for something like CI -- in a pool with 3 boards, all 3 boards would be in use after the first 3 jobs have started. The 4th job should now be waiting on a lock until a board is free. With this behaviour it will just wait for a specified time and try again, but there is no guarantee or even likelihood that anything will be free at the times it checks (e.g. job 5 could have tried just after and gotten lucky), so it is most likely to just fail after timeout no matter how many retries and timeouts you set if there is contention. This retry + timeout thing is basically trying to implement another lock, but not doing it right. To do this properly, I think there needs to be a something like a counting semaphore on the pool -- if there are n boards in the pool, the first n jobs get into the pool and can try board locks sequentially as above. The n+1st job needs to wait on the pool lock until <= n jobs are in the pool again. This needs to be integrated with single board locks, i.e. when a single board is locked/unlocked where the board is also part of a pool, the corresponding pool lock needs to incremented/decremented accordingly. If a board is in multiple pools, all of those need to be incremented/decremented (if the board lock is free, the pool lock must be < n, but there might be race conditions between that and someone trying to get into the pool lock, so the pool lock probably always has to be taken first). |
|
The mq already works by waiting, attempting a lock, then waiting again if it fails. That's how locks on individual boards actually behave. I'm just replicating that behaviour for pools. (The When you say
by "timeout", are you referring to the timeout on a github action which will exit with failure if the job isn't complete in six hours? Or to the retry count on the machine queue scripts, where you can choose how many times lock acquisition is reattempted before the scripts exit with failure? In the latter case, I think CI jobs leave the retry count unset on their use of the mq (it's set with To be clear, Certainly contention means some jobs will have to wait for others, but the assumption is that progress is being made on at least one board in the pool, so that eventually everything will get a chance to access a board and run its job. The particular value of pools is that if something (like a human doing research) is holding the lock on one board in the pool for a long time, the jobs can still progress through the other board. On the old system the jobs would be assigned to one of the known boards and if a human happened to hold that board for a long time, everything assigned to the held board would be blocked even if the other was free. |
That's exactly it -- the individual boards use the
While you don't get full fairness guarantees from
I did mean the latter. If the default is unlimited retries that makes a big difference -- the description above did not sound like that was happening. In the worst case, one set of jobs will then wait until another set of jobs is fully complete instead of giving up on the lock. That's not always great, but not catastrophic. For the current single board locks, I do observe pretty good fairness between sets of jobs. I haven't looked at the code yet, but if the retry mechanism is local on the
That is the intention, and I'm on board with that. Once CI uses the pool, it will inevitable use all boards of the pool, though, so the contention between CI job sets will almost always be of the kind "all boards in the pool are currently taken and I need to wait until one is free". That is the case I was worried about. |
This pull request substantially changes how the machine queue scripts handle pools.
Previously, when performing an operation on a pool, the mq would immediately randomly
select one member from the pool, then proceed with the requested operation as if the
user had asked to do it on that one member only.
These changes mean that (roughly) the pools now work by retrying the requested operation
on every member of the pool. In particular, this means that when waiting for locks on a pool where all
boards are locked by someone else, you will lock the first board to be released, rather than
the old behaviour which picked just one pool member and waited until that specifc board was free.
Details of the new behaviour follow.
sem and run on individual boards
semandrunfor individual boards should work the same way it always didwith the following exception:
run -non the board,it would check that you owned the lock but not that you specified the right key.
(You would have to give the right key when unlocking the board again later).
Now, when you use
run -nto run a job on a board you've already locked,you must give the key you locked the board with as well as owning the lock.
If you give the wrong key, the scripts exit nonzero.
Behaviour for pools with
run -nmatches this behaviour for boards.sem on pools
sem -wait, the scripts will attemptlocking every member of the pool one after the other. If they succeed in locking
one of the boards, the scripts will print which board you got and exit with code 0.
If no pool member is lockable, the script will wait for the period specified by
-w, then retry locking every member of the pool one after the other.The script will retry the whole pool as many times as
-ttells it to.If no board is successfully locked within the given number of retries, the script
will exit non-zero.
and you will not own any locks you didn't before. This is the same as it used to be.
sem -signal, the scripts will attemptto unlock everything in the pool. Because of how unlocking works, this will only
unlock the locks you own, and only if you're giving the right key. It will not affect
anyone else's locks even if other people have locked boards in the same pool,
provided you don't use
-f.There are no timeouts or retries on unlocking, and it always exits 0 whether or not something
was unlocked (matching the current behaviour of
semon individuals) unlessa script error happens.
sem -infoon a pool, it prints information on the lock stateof all the boards in that pool.
run on pools
runon a pool, the script will follow thesem -waitprocedurefor attempting locking every member. Either it will run out of tries and give up,
exiting nonzero, or it will succeed in locking a board. If it succeeds, the job you
specify will proceed to run on that board. When the job is done, the board you locked
will be unlocked and the script will exit 0.
run -ron a pool, the script will follow thesem -waitprocedurefor attempting to lock every member. If it gets one, the script will wait for you to enter
Ctrl+D or Enter, and when you do it will unlock the board, clean up, and exit 0.
run -non a pool, the script will attempt to run on each board in thepool, one after the other, but only actually run on the first board it finds where
you already own the lock and have the right key. If the script finds a board where
you have the lock and right key, it will run the job then exit 0. If you don't
have the right lock/key for any of the boards in the pool, the script will continue
retrying according to your retry parameters, until either
- the lock state changes so that you have the lock and the right key on one of
the boards in the pool (the job will run then the scripts exit 0)
- the scripts have retried as many times as you told them to (the scripts will
exit nonzero)
- you manually cancel with Ctrl+C (the scripts will exit nonzero)
run, the script will stop whatever job it's beendoing, unlock any locks it has taken, and exit nonzero. This is the same as it used to be.