KVM NAS backup: resume VM and exit on backup failure#12872
jmsperu wants to merge 1 commit into apache:4.22
Fix three bugs in nasbackup.sh that caused VMs to remain paused indefinitely when backup jobs fail (e.g. storage full):

1. Add exit after cleanup on Failed backup job status to prevent an infinite polling loop in backup_running_vm()
2. Add exit after cleanup on qemu-img convert failure in backup_stopped_vm() to stop processing subsequent disks
3. Add a VM state check and virsh resume to cleanup() so paused VMs are automatically resumed after backup failure

Fixes apache#12821
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## 4.22 #12872 +/- ##
=========================================
Coverage 17.61% 17.61%
Complexity 15661 15661
=========================================
Files 5917 5917
Lines 531430 531430
Branches 64973 64973
=========================================
+ Hits 93586 93588 +2
+ Misses 427288 427286 -2
Partials 10556 10556
Pull request overview
This PR fixes failure-handling paths in the KVM NAS backup helper script (nasbackup.sh) to prevent VMs from remaining paused indefinitely when backup operations fail.
Changes:
- Exit immediately after cleaning up on a failed virsh backup job to avoid infinite polling.
- Exit immediately after cleaning up on qemu-img convert failure to avoid continuing disk processing.
- Enhance cleanup() to attempt to resume a paused VM before removing temporary files and unmounting NAS storage.
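The first change can be illustrated with a small, self-contained sketch. It is not the code from nasbackup.sh: next_status, the stub cleanup, and the Completed status name are hypothetical stand-ins, and the sketch uses return 1 where the script itself would call exit 1 so the result can be inspected afterwards. A real loop would also sleep between polls.

```shell
#!/usr/bin/env bash
# Hypothetical status source: reports "Failed" on the third poll.
POLLS=0
STATUS=""
next_status() {
  POLLS=$((POLLS + 1))
  if (( POLLS >= 3 )); then STATUS="Failed"; else STATUS="Running"; fi
}

# Hypothetical stand-in for the script's cleanup(); only records that it ran.
CLEANED=0
cleanup() { CLEANED=1; }

poll_backup() {
  while true; do
    next_status
    case "$STATUS" in
      Completed)   # assumed name for the success status
        return 0
        ;;
      Failed)
        echo "backup job failed" >&2
        cleanup
        return 1   # without this, the loop would poll the failed job forever
        ;;
    esac
  done
}

rc=0
poll_backup || rc=$?
echo "polls=$POLLS rc=$rc cleaned=$CLEANED"
```

Leaving out the return in the Failed branch reproduces the reported behaviour: cleanup runs on every iteration and the loop never ends.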
# Resume the VM if it was paused (e.g. by virsh backup-begin)
if [[ -n "$VM" ]]; then
  local vm_state
  vm_state=$(virsh -c qemu:///system domstate "$VM" 2>/dev/null)
With set -eo pipefail enabled, vm_state=$(virsh ... domstate ...) will cause cleanup() to exit immediately if virsh domstate returns non-zero (e.g., VM not found / libvirt transient error), which can prevent unmount/removal and also skip the resume attempt. Please make the domstate probe non-fatal (e.g., allow failure and treat state as empty) so cleanup always completes best-effort.
Suggested change:
- vm_state=$(virsh -c qemu:///system domstate "$VM" 2>/dev/null)
+ vm_state=$(virsh -c qemu:///system domstate "$VM" 2>/dev/null || true)
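A minimal sketch of why the `|| true` matters under set -eo pipefail. The probe_state function is a hypothetical stand-in for a failing virsh domstate call, so the example runs without libvirt.

```shell
#!/usr/bin/env bash
set -eo pipefail

# Hypothetical stand-in for `virsh domstate`; it always fails,
# as the real command would for an unknown VM.
probe_state() {
  return 1
}

# Best-effort probe: `|| true` turns the failure into an empty result
# instead of letting errexit abort the caller.
get_state_or_empty() {
  local state
  state=$(probe_state 2>/dev/null || true)
  printf '%s' "$state"
}

vm_state=$(get_state_or_empty)
echo "state='${vm_state}' (script still running)"
```

Dropping the `|| true` makes the assignment inside get_state_or_empty return non-zero, and with errexit active the script stops before the echo, which is the failure mode the comment describes for cleanup().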
if [[ -n "$VM" ]]; then
  local vm_state
  vm_state=$(virsh -c qemu:///system domstate "$VM" 2>/dev/null)
  if [[ "$vm_state" == "paused" ]]; then
virsh domstate output can include a reason suffix (e.g., paused (ioerror)), so an exact comparison to "paused" may miss paused VMs and fail to resume them. Consider matching a paused prefix (or use domstate --reason and parse the first token) so paused VMs are reliably detected.
Suggested change:
- if [[ "$vm_state" == "paused" ]]; then
+ if [[ "$vm_state" == paused* ]]; then
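The difference between the two comparisons can be checked in isolation. In this sketch is_paused is a hypothetical helper, and the sample strings are illustrative values taken from the comment above rather than captured virsh output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: true when the state string starts with "paused".
# The unquoted right-hand side makes [[ ]] treat paused* as a glob pattern.
is_paused() {
  [[ "$1" == paused* ]]
}

for state in "paused" "paused (ioerror)" "running" ""; do
  if is_paused "$state"; then
    echo "'$state' -> paused"
  else
    echo "'$state' -> not paused"
  fi
done
```

Quoting the pattern ("paused*") would turn it back into a literal string comparison, so the right-hand side has to stay unquoted.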
Summary
Fix three bugs in nasbackup.sh that caused VMs to remain paused indefinitely when backup jobs fail (e.g. storage full, I/O error):

1. Infinite polling loop: backup_running_vm() falls through the Failed case without exiting, causing the script to poll the already-failed job forever. Fixed by adding exit 1 after cleanup.
2. Continued processing after failure: backup_stopped_vm() continues processing subsequent disks after qemu-img convert fails. Fixed by adding exit 1 after cleanup.
3. VM never resumed: cleanup() removes temp files and unmounts but never resumes the VM that was paused by virsh backup-begin. Fixed by adding a VM state check and virsh resume at the top of cleanup().

Root Cause
When virsh backup-begin starts a backup, the VM may be paused. If the backup fails for any reason (storage full, network issue, I/O error), the script's cleanup path never calls virsh resume, leaving the VM paused until manual intervention.

Test Plan
- qemu-img convert failure on stopped VM exits cleanly

Fixes #12821
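
As a rough sketch of the resume step described in the summary, the following combines the state check with the two review suggestions (non-fatal probe, prefix match). It is an illustration under stated assumptions, not the merged code: the virsh function is a hypothetical stub so the example runs without libvirt, demo-vm is a made-up domain name, and resume_if_paused is a name chosen here for the part of cleanup() that would run before temp files are removed and the NAS is unmounted.

```shell
#!/usr/bin/env bash
# Hypothetical stub so the sketch runs without libvirt.
# Remove it to call the real virsh binary instead.
RESUMED=""
virsh() {
  # Arguments arrive as: -c qemu:///system <subcommand> <domain>
  case "$3" in
    domstate) echo "paused" ;;
    resume)   RESUMED="$4" ;;
  esac
}

VM="demo-vm"  # hypothetical domain name

resume_if_paused() {
  # Nothing to do when no VM name is set.
  [[ -n "$VM" ]] || return 0
  local vm_state
  # Non-fatal probe: an error from virsh yields an empty state.
  vm_state=$(virsh -c qemu:///system domstate "$VM" 2>/dev/null || true)
  if [[ "$vm_state" == paused* ]]; then
    # Best effort: a failed resume is reported but does not stop the cleanup.
    virsh -c qemu:///system resume "$VM" >/dev/null 2>&1 \
      || echo "warning: could not resume $VM" >&2
  fi
}

resume_if_paused
echo "resumed: ${RESUMED:-none}"
```

Running the resume first means that a later failure while unmounting or deleting files still leaves the VM running.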