
Add clustered migration sync for shared disks (SYNCING barrier) #403

Open
fabi200123 wants to merge 1 commit into cloudbase:master from fabi200123:cor-707

Conversation

@fabi200123
Contributor

Introduce TASK_STATUS_SYNCING and TASK_TYPES_REQUIRING_CLUSTER_SYNC (DEPLOY_TRANSFER_DISKS, SHUTDOWN_INSTANCE) so multi-instance transfers with base_transfer_action.clustered=True wait for all peer tasks before marking COMPLETED and advancing dependents.

  • Add clustered boolean on base_transfer_action (DB migration 024)
  • Plumb clustered through create_instances_transfer, REST transfers API, deployment creation (inherits from parent transfer)
  • On task_completed: set SYNCING when barrier applies; when all peers are SYNCING, finalize (for deploy: dedupe volumes_info by disk_id, leader gets replicate_disk_data=True, followers False)
  • ReplicateDisksTask: skip provider replicate_disks for volumes with replicate_disk_data=False and merge back in export disk order
  • On set_task_error: abort peer tasks stuck in SYNCING for the same type

Volumes schema already allows extra properties; replicate_disk_data is consumed by replication only (default True preserves behavior).
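The ReplicateDisksTask behavior described above can be sketched as follows. This is a minimal, self-contained illustration with invented names (not the actual Coriolis API): volumes flagged `replicate_disk_data=False` are withheld from the provider call, and the provider's results are merged back in the original export disk order.

```python
def replicate_with_skips(volumes_info, provider_replicate):
    """Hypothetical sketch: only replicate volumes whose
    replicate_disk_data flag is True (the default), then merge the
    provider's results back, preserving export disk order."""
    to_copy = [v for v in volumes_info
               if v.get("replicate_disk_data", True)]
    replicated = {v["disk_id"]: v for v in provider_replicate(to_copy)}
    # Replicated volumes are replaced by the provider's copies; skipped
    # (follower) volumes pass through untouched, order unchanged.
    return [replicated.get(v["disk_id"], v) for v in volumes_info]
```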

Comment thread coriolis/tasks/replica_tasks.py Outdated
Comment thread coriolis/conductor/rpc/client.py Outdated
Comment thread coriolis/conductor/rpc/server.py Outdated
Comment thread coriolis/conductor/rpc/server.py Outdated
Comment thread coriolis/conductor/rpc/server.py Outdated
Comment thread coriolis/conductor/rpc/server.py Outdated
shared disks. Once the owner replicate task completes, any
waiting (SYNCING) replicate tasks are moved back to SCHEDULED
so they can continue their normal flow.
"""
Contributor

So as far as I understand this, this will basically block the rest of the cluster while the main one gets transferred. I mean this is fine for shared disks, but what I proposed initially was to have them all running in parallel once the syncing tasks (which were not meant to be REPLICATE_DISKS, btw) are done. You can add SYNCING on this task alone while waiting for the rest of the tasks to complete (tasks like DEPLOY_REPLICA_DISKS, SHUTDOWN_INSTANCE), that's also feasible, but please set the volumes_info accordingly, not block all of them.

What I originally envisioned was something like this:
instance1 has: root_disk1, shared_disk1;
instance2 has: root_disk2, shared_disk1.

What's the point of blocking instance2 while instance1 is replicating, when you can have the following:
instance1 replicates root_disk1, shared_disk1;
instance2 replicates root_disk2, skips shared_disk1, based on the volumes_info that you can set up before launching the replicate_disks task.
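The reviewer's envisioned split could look something like this. A minimal sketch under assumed, illustrative names (`build_parallel_volumes_info` and its inputs are not part of Coriolis): each instance gets its own volumes_info before replicate_disks launches, with every shared disk assigned to exactly one owner instance and skipped by the rest.

```python
def build_parallel_volumes_info(instances_disks, shared_owner):
    """instances_disks: {instance: [disk_id, ...]};
    shared_owner: {disk_id: owning_instance} for disks attached to
    more than one instance. Non-shared disks are owned implicitly."""
    volumes_info = {}
    for instance, disks in instances_disks.items():
        volumes_info[instance] = [
            {"disk_id": d,
             # Non-owners keep the volume entry but skip the data copy,
             # so all instances can still replicate in parallel.
             "replicate_disk_data": shared_owner.get(d, instance) == instance}
            for d in disks]
    return volumes_info
```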

Contributor

Please let me know if there's anything preventing us from doing a parallel sync, with shared_disks being transferred by only one of the instances.

Comment thread coriolis/conductor/rpc/server.py Outdated
@fabi200123 force-pushed the cor-707 branch 3 times, most recently from 8152e8e to 9813d2b on April 9, 2026 01:09
Introduce TASK_STATUS_SYNCING and TASK_TYPES_TO_SYNC (GET_INSTANCE_INFO,
DEPLOY_TRANSFER_DISKS, REPLICATE_DISKS) so multi-instance transfers with
base_transfer_action.clustered=True wait for all peer tasks of the same
type before leaving SYNCING for COMPLETED and advancing dependents.

- clustered is set as len(instances) > 1 on transfer create
- On task_completed: enter SYNCING when the barrier applies, then when
  every peer is SYNCING, run sync hooks (GET_INSTANCE_INFO:
  promote shareable on export disks, DEPLOY_TRANSFER_DISKS:
  shared-disk volumes_info, REPLICATE_DISKS: sync change_id)
- ReplicateDisksTask: skip provider replicate for replicate_disk_data=False
- On task error: abort peers stuck in SYNCING for the same task type
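The barrier semantics in the commit message above can be reduced to a toy sketch (illustrative only; the conductor's real state handling lives in `coriolis/conductor/rpc/server.py` and these names are invented): a completed task first enters SYNCING, and only once every peer task of the same type is SYNCING do they all advance to COMPLETED so dependents can run.

```python
TASK_STATUS_SYNCING = "SYNCING"
TASK_STATUS_COMPLETED = "COMPLETED"

def on_task_completed(task_id, peer_statuses):
    """peer_statuses: {task_id: status} for all peer tasks of the same
    barrier-requiring type. Returns True once the barrier releases."""
    peer_statuses[task_id] = TASK_STATUS_SYNCING
    if all(s == TASK_STATUS_SYNCING for s in peer_statuses.values()):
        # Barrier released: this is where per-type sync hooks would run
        # (e.g. merging shared-disk volumes_info) before finalizing.
        for tid in peer_statuses:
            peer_statuses[tid] = TASK_STATUS_COMPLETED
        return True
    return False
```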
constants.TASK_STATUS_PENDING,
constants.TASK_STATUS_STARTING):
constants.TASK_STATUS_STARTING,
constants.TASK_STATUS_SYNCING):
Contributor

Wasn't this already handling whatever _abort_peer_sync_barrier_tasks_on_error does? If one of the tasks gets an error, the rest of the tasks get cancelled, even if the error is from another instance.

Comment thread coriolis/utils.py
return s if s else None


def cluster_disk_identity(disk_id_or_obj):
Contributor

If this is just some kind of workaround for the fact that we use disk_keys instead of IDs, I think you should fix this provider-side. The provider should always return a good identifier as the disk_id (one that's the same amongst clustered instances). In this case, for VMWare, you should use the disk's FCD ID instead of VM device key (which is arbitrary).
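The provider-side fix the reviewer suggests could be sketched like this (hypothetical field names; the actual VMware export provider is not shown here): prefer a cluster-stable identifier such as the disk's FCD UUID, falling back to the per-VM device key only when no such identifier exists.

```python
def export_disk_id(disk):
    """Return a disk_id that is stable across clustered instances.

    'fcd_uuid' and 'device_key' are illustrative keys: the FCD UUID is
    shared by all VMs attached to the disk, while the device key is an
    arbitrary per-VM value and must never identify a shared disk."""
    fcd_id = disk.get("fcd_uuid")
    if fcd_id:
        return fcd_id
    # Fallback for non-shared disks without an FCD identifier.
    return "devkey-%s" % disk["device_key"]
```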

or not v.get(
constants.VOLUME_INFO_REPLICATE_DISK_DATA, True))
else 1))
if sorted_volumes != volumes_info:
Contributor

Why are volumes re-ordered?

owner_vol = owner_volumes_by_disk.get(owner_id, {}).get(ident)
if not owner_vol:
continue
if "change_id" in owner_vol and (
Contributor

  1. change_id is not the only incremental key we use/might use;
  2. should we update this key for all of the instances? As long as we keep the same owner, or keep replicate_disk_data the same between syncs, I don't think so. It's not like users are able to remove the "owner" from the clustered migration anyway; that stays the same, always.

}
},
"required": ["disk_id", "volume_dev"],
"required": ["disk_id"],
Contributor

Why was volume_dev removed?
The replicate_disks heavily relies on it, even if there's no shared disks (the most common case).
I think we'd be better off just setting an empty volume_dev for the volumes we don't actually transfer (non owners). It's better to enforce it for the most common case instead of removing it entirely imo.
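The reviewer's alternative could be sketched as a small normalization step (hypothetical helper, not existing Coriolis code): keep `volume_dev` required by the schema, and set it to an explicit empty value for non-owner volumes rather than dropping the requirement for everyone.

```python
def normalize_volume_dev(volumes_info):
    """Enforce volume_dev for replicated disks; give non-owner
    (replicate_disk_data=False) volumes an empty placeholder so the
    schema contract stays intact for the common, non-shared case."""
    for vol in volumes_info:
        if not vol.get("replicate_disk_data", True):
            # Non-owner: no device is attached for this volume.
            vol.setdefault("volume_dev", "")
        elif "volume_dev" not in vol:
            raise ValueError(
                "volume_dev missing for replicated disk %s"
                % vol["disk_id"])
    return volumes_info
```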

volumes_info = _get_volumes_info(task_info)
# NOTE: parallel minion attach may run before conductor merges
# shareable.
utils.apply_export_disk_shareable_metadata_to_volumes_info(
Contributor

Is this really necessary?
How can the minion make attachments if the deploy replica disks tasks are synced?
