
Improved pool support#21

Open
TheEarthlord wants to merge 15 commits into master from repool

Conversation

@TheEarthlord
Contributor

This pull request substantially changes how the machine queue scripts handle pools.
Previously, when performing an operation on a pool, the mq would immediately select
one member of the pool at random, then proceed with the requested operation as if the
user had asked to perform it on that one member only.

With these changes, pool operations work (roughly) by retrying the requested operation
on every member of the pool. In particular, this means that when waiting for locks on a pool where all
boards are locked by someone else, you will lock the first board to be released, rather than
the old behaviour, which picked just one pool member and waited until that specific board was free.
Details of the new behaviour follow.

sem and run on individual boards

  • Using sem and run for individual boards should work the same way it always did,
    with one exception:
  • Previously if you locked a board with a key, then used run -n on the board,
    it would check that you owned the lock but not that you specified the right key.
    (You would have to give the right key when unlocking the board again later).
    Now, when you use run -n to run a job on a board you've already locked,
    you must own the lock and also supply the key you locked it with.
    If you give the wrong key, the scripts exit nonzero.
    Behaviour for pools with run -n matches this behaviour for boards.
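
The new run -n lock/key check can be sketched roughly as below. This is a minimal illustration, not the mq implementation: the lock file layout ("owner key" on one line) and the names LOCKDIR, lock_board, and run_n are all hypothetical.

```shell
# Hypothetical sketch of the run -n check: the job only runs if you both
# own the lock and supply the key you locked the board with.
LOCKDIR="/tmp/mq-key-demo.$$"
mkdir -p "$LOCKDIR"

lock_board() {   # lock_board <board> <owner> <key>
    printf '%s %s\n' "$2" "$3" > "$LOCKDIR/$1.lock"
}

run_n() {        # run_n <board> <owner> <key>
    read -r owner key < "$LOCKDIR/$1.lock" || return 1
    [ "$owner" = "$2" ] || { echo "you do not own the lock on $1" >&2; return 1; }
    [ "$key" = "$3" ]  || { echo "wrong key for $1" >&2; return 1; }
    echo "running job on $1"
}
```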

sem on pools

  • When you attempt to lock a pool using sem -wait, the scripts will attempt
    locking every member of the pool one after the other. If they succeed in locking
    one of the boards, the scripts will print which board you got and exit with code 0.
    If no pool member is lockable, the script will wait for the period specified by
    -w, then retry locking every member of the pool one after the other.
    The script will retry the whole pool as many times as -t tells it to.
    If no board is successfully locked within the given number of retries, the script
    will exit nonzero.
  • If you Ctrl+C during the wait for locking on a pool, the script will exit nonzero
    and you will not own any locks you didn't before. This is the same as it used to be.
  • When you attempt to unlock a pool using sem -signal, the scripts will attempt
    to unlock everything in the pool. Because of how unlocking works, this will only
    unlock the locks you own, and only if you're giving the right key. It will not affect
    anyone else's locks even if other people have locked boards in the same pool,
    provided you don't use -f.
    There are no timeouts or retries on unlocking, and it always exits 0 whether or not something
    was unlocked (matching the current behaviour of sem on individuals) unless
    a script error happens.
  • When you use sem -info on a pool, it prints information on the lock state
    of all the boards in that pool.
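
The sem -wait loop over a pool can be sketched as follows. In this sketch mkdir stands in for procmail's lockfile(1) as the atomic lock primitive; the pool members, LOCKDIR, and lock_pool are hypothetical, and -w / -t map to RETRY_PERIOD and MAX_TRIES here.

```shell
# Minimal sketch of locking a pool: try every member in turn, sleep for
# the retry period, and retry the whole pool up to the retry count.
LOCKDIR="/tmp/mq-pool-demo.$$"
POOL="board-a board-b board-c"
RETRY_PERIOD=${RETRY_PERIOD:-8}   # -w: seconds between passes (default 8)
MAX_TRIES=${MAX_TRIES:-3}         # -t: how many passes before giving up
mkdir -p "$LOCKDIR"

lock_pool() {
    tries=0
    while [ "$tries" -lt "$MAX_TRIES" ]; do
        for board in $POOL; do
            # try each member one after the other
            if mkdir "$LOCKDIR/$board.lock" 2>/dev/null; then
                echo "$board"          # report which board you got
                return 0
            fi
        done
        tries=$((tries + 1))
        sleep "$RETRY_PERIOD"          # wait, then retry the whole pool
    done
    return 1                           # no board locked within the retries
}
```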

run on pools

  • When you use run on a pool, the script will follow the sem -wait procedure
    for attempting locking every member. Either it will run out of tries and give up,
    exiting nonzero, or it will succeed in locking a board. If it succeeds, the job you
    specify will proceed to run on that board. When the job is done, the board you locked
    will be unlocked and the script will exit 0.
  • When you use run -r on a pool, the script will follow the sem -wait procedure
    for attempting to lock every member. If it gets one, the script will wait for you to enter
    Ctrl+D or Enter, and when you do it will unlock the board, clean up, and exit 0.
  • When you use run -n on a pool, the script will attempt to run on each board in the
    pool, one after the other, but only actually run on the first board it finds where
    you already own the lock and have the right key. If the script finds a board where
    you have the lock and right key, it will run the job then exit 0. If you don't
    have the right lock/key for any of the boards in the pool, the script will continue
    retrying according to your retry parameters, until either
    - the lock state changes so that you have the lock and the right key on one of
    the boards in the pool (the job will run then the scripts exit 0)
    - the scripts have retried as many times as you told them to (the scripts will
    exit nonzero)
    - you manually cancel with Ctrl+C (the scripts will exit nonzero)
  • If you Ctrl+C at any point during run, the script will stop whatever job it's been
    doing, unlock any locks it has taken, and exit nonzero. This is the same as it used to be.
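
The plain run-on-a-pool case (lock any member, run the job, unlock, exit 0) can be sketched like this. Again mkdir stands in for lockfile(1); run_on_pool and the pool member names are hypothetical, and the -w/-t retry loop is elided for brevity.

```shell
# Minimal sketch of run on a pool: acquire the first free member, run the
# job while holding its lock, then unlock it and return 0.
LOCKDIR="/tmp/mq-run-demo.$$"
POOL="imx8-a imx8-b"
mkdir -p "$LOCKDIR"

run_on_pool() {                        # run_on_pool <job-command> [args...]
    for board in $POOL; do
        if mkdir "$LOCKDIR/$board.lock" 2>/dev/null; then
            echo "locked $board"
            "$@"                       # the job runs while we hold the lock
            rmdir "$LOCKDIR/$board.lock"   # unlock when the job is done
            return 0
        fi
    done
    return 1    # the real scripts would keep retrying per -w/-t instead
}
```

Because the lock is released when the job finishes, a subsequent call can acquire the same board again.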

Signed-off-by: Christopher Irving <cirving@trustworthy.systems>
@TheEarthlord TheEarthlord requested a review from lsf37 July 31, 2024 03:49
@lsf37
Member

lsf37 commented Jul 31, 2024

When you attempt to lock a pool using sem -wait, the scripts will attempt
locking every member of the pool one after the other. If they succeed in locking
one of the boards, the scripts will print which board you got and exit with code 0.
If no pool member is lockable, the script will wait for the period specified by
-w, then retry locking every member of the pool one after the other.
The script will retry the whole pool as many times as -t tells it to.
If no board is successfully locked within the given number of retries, the script
will exit non-zero

I don't think that behaviour would be usable for something like CI -- in a pool with 3 boards, all 3 boards would be in use after the first 3 jobs have started. The 4th job should now be waiting on a lock until a board is free. With this behaviour it will just wait for a specified time and try again, but there is no guarantee or even likelihood that anything will be free at the times it checks (e.g. job 5 could have tried just after and gotten lucky), so it is most likely to just fail after timeout no matter how many retries and timeouts you set if there is contention.

This retry + timeout thing is basically trying to implement another lock, but not doing it right.

To do this properly, I think there needs to be something like a counting semaphore on the pool -- if there are n boards in the pool, the first n jobs get into the pool and can try board locks sequentially as above. The (n+1)st job needs to wait on the pool lock until <= n jobs are in the pool again. This needs to be integrated with single board locks, i.e. when a single board is locked/unlocked where the board is also part of a pool, the corresponding pool lock needs to be incremented/decremented accordingly. If a board is in multiple pools, all of those need to be incremented/decremented (if the board lock is free, the pool lock must be < n, but there might be race conditions between that and someone trying to get into the pool lock, so the pool lock probably always has to be taken first).
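
One way to sketch the counting-semaphore idea above is a per-pool slot counter guarded by flock(1). Everything here (COUNT_FILE, take_slot, release_slot) is hypothetical and assumes flock is available; it is not the mq implementation, just an illustration of the proposal.

```shell
# Hypothetical counting semaphore for a 3-board pool: jobs take a slot
# before trying board locks, and the (n+1)st job finds no slot free.
COUNT_FILE="/tmp/mq-slots.$$"
echo 3 > "$COUNT_FILE"          # n = 3 boards in the pool

take_slot() {                   # returns 0 iff a slot was free
    (
        flock 9                 # serialise access to the counter
        n=$(cat "$COUNT_FILE")
        [ "$n" -gt 0 ] || exit 1
        echo $((n - 1)) > "$COUNT_FILE"
    ) 9>"$COUNT_FILE.lck"
}

release_slot() {                # called when a board lock is released
    (
        flock 9
        echo $(( $(cat "$COUNT_FILE") + 1 )) > "$COUNT_FILE"
    ) 9>"$COUNT_FILE.lck"
}
```

A real version would block in take_slot rather than fail, and would decrement every pool a board belongs to, as the comment describes.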

@TheEarthlord
Contributor Author

The mq already works by waiting, attempting a lock, then waiting again if it fails. That's how locks on individual boards actually behave. I'm just replicating that behaviour for pools. (The lockfile command from the Debian procmail package is the core of the locking system, and it works by trying to create a semaphore file, then retrying a few seconds later if the file already existed.)
This pool implementation doesn't guarantee that users contending for locks will acquire them first come first served, but the mq doesn't guarantee that for individual boards either. It's always been down to the randomness of who happens to retry first after a lock is released. In my experience virtually everyone including the CI leaves the retry period (-w) unset, so everyone will be retrying at the same default rate (which is once every 8 seconds). If there are some jobs that need to have a better chance of getting the lock than others, you can give them a shorter retry period.

When you say

it is most likely to just fail after timeout no matter how many retries and timeouts you set if there is contention

by "timeout", are you referring to the timeout on a github action which will exit with failure if the job isn't complete in six hours? Or to the retry count on the machine queue scripts, where you can choose how many times lock acquisition is reattempted before the scripts exit with failure? In the latter case, I think CI jobs leave the retry count unset on their use of the mq (it's set with -t) so they'll just keep retrying forever until they get the lock.

To be clear, -T, which the mq scripts call a timeout, is another function entirely and has nothing to do with this aspect of contention to acquire a lock.

Certainly contention means some jobs will have to wait for others, but the assumption is that progress is being made on at least one board in the pool, so that eventually everything will get a chance to access a board and run its job.

The particular value of pools is that if something (like a human doing research) is holding the lock on one board in the pool for a long time, the jobs can still progress through the other board. On the old system the jobs would be assigned to one of the known boards and if a human happened to hold that board for a long time, everything assigned to the held board would be blocked even if the other was free.

@lsf37
Member

lsf37 commented Aug 14, 2024

The mq already works by waiting, attempting a lock, then waiting again if it fails. That's how locks on individual boards actually behave. I'm just replicating that behaviour for pools. (The lockfile command from the debian procmail package is the core of the locking system, and it works by trying to create a semaphore file, then retrying a few seconds later if the file already existed).

That's exactly it -- the individual boards use the lockfile command, which is a reasonably good lock implementation if everyone uses the defaults. At the level of the user, you can wait on that lock until it becomes free, not just for a limited number of tries. Under the hood it also retries of course, but doing this correctly is subtle. We shouldn't add our own lock implementation into the mix if there already is one.

This pool implementation doesn't guarantee that users contending for locks will acquire them first come first served, but the mq doesn't guarantee that for individual boards either. It's always been down to the randomness of who happens to retry first after a lock is released.

While you don't get full fairness guarantees from lockfile, you get pretty decent behaviour if everyone uses the default.

When you say

it is most likely to just fail after timeout no matter how many retries and timeouts you set if there is contention

by "timeout", are you referring to the timeout on a github action which will exit with failure if the job isn't complete in six hours? Or to the retry count on the machine queue scripts, where you can choose how many times lock acquisition is reattempted before the scripts exit with failure? In the latter case, I think CI jobs leave the retry count unset on their use of the mq (it's set with -t) so they'll just keep retrying forever until they get the lock.

I did mean the latter. If the default is unlimited retries that makes a big difference -- the description above did not sound like that was happening. In the worst case, one set of jobs will then wait until another set of jobs is fully complete instead of giving up on the lock. That's not always great, but not catastrophic. For the current single board locks, I do observe pretty good fairness between sets of jobs.

I haven't looked at the code yet, but if the retry mechanism is local on the tftp machine where the lock files are, then this might work. If it tries a new ssh connection each time, I don't think it would try often enough.

The particular value of pools is that if something (like a human doing research) is holding the lock on one board in the pool for a long time, the jobs can still progress through the other board. On the old system the jobs would be assigned to one of the known boards and if a human happened to hold that board for a long time, everything assigned to the held board would be blocked even if the other was free.

That is the intention, and I'm on board with that. Once CI uses the pool, it will inevitably use all boards of the pool, though, so the contention between CI job sets will almost always be of the kind "all boards in the pool are currently taken and I need to wait until one is free". That is the case I was worried about.
