188 changes: 188 additions & 0 deletions doc/content/toolstack/features/NUMA/lazy-reclaim.md
@@ -0,0 +1,188 @@
---
title: "Lazy memory reclaim"
weight: 10
categories:
- NUMA
---
## Xen host memory scrubbing

Xen does not immediately reclaim deallocated memory.
Instead, Xen has a host memory scrubber that runs lazily in
the background to reclaim recently deallocated memory.

Thus, there is no guarantee that Xen has finished scrubbing
when `xenopsd` is being asked to build a domain.
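The toolstack can observe this: the `physinfo` hypercall reports both the free pages and the pages still queued for scrubbing. As a minimal sketch (the helper name is illustrative; `Xenctrl.physinfo` and its `scrub_pages` field are the same bindings that `xenopsd` uses):

```ml
(* Sketch: detect whether Xen is still scrubbing freed memory by
   checking the scrub_pages counter returned by the physinfo
   hypercall. The helper name is illustrative, not an existing API. *)
let scrubbing_in_progress ~xc =
  let info = Xenctrl.physinfo xc in
  Int64.of_nativeint info.Xenctrl.scrub_pages > 0L
```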

## Waiting for enough free host memory

> [!info]
> When `xenopsd` is asked to build a VM and the reclaimed host-wide
> memory is not yet sufficient, its
> [build_pre](https://github.com/xapi-project/xen-api/blob/073373ff/ocaml/xenopsd/xc/domain.ml#L899-L964)
> function (also part of VM restore / VM migration)
> [polls](https://github.com/xapi-project/xen-api/blob/073373ff2abfa386025f2b1eee7131520df76be9/ocaml/xenopsd/xc/domain.ml#L904)
> Xen [until enough host-wide memory](https://github.com/xapi-project/xen-api/blob/073373ff2abfa386025f2b1eee7131520df76be9/ocaml/xenopsd/xc/domain.ml#L236-L272)
> has been reclaimed. See the
> [walk-through of Domain.build](../../../xenopsd/walkthroughs/VM.build/Domain.build.md#build_pre-prepare-building-the-vm)
> of `xenopsd` for more context:

```ml
let build_pre ~xc ~xs ~vcpus ~memory ~has_hard_affinity domid =
  let open Memory in
  let uuid = get_uuid ~xc domid in
  debug "VM = %s; domid = %d; waiting for %Ld MiB of free host memory"
    (Uuidx.to_string uuid) domid memory.required_host_free_mib ;
  (* CA-39743: Wait, if necessary, for the Xen scrubber to catch up. *)
  if
    not
      (wait_xen_free_mem ~xc (Memory.kib_of_mib memory.required_host_free_mib))
  then (
    error "VM = %s; domid = %d; Failed waiting for Xen to free %Ld MiB"
      (Uuidx.to_string uuid) domid memory.required_host_free_mib ;
    raise
      (Not_enough_memory (Memory.bytes_of_mib memory.required_host_free_mib))
  ) ;
```

This is the implementation of the polling function:

```ml
let wait_xen_free_mem ~xc ?(maximum_wait_time_seconds = 64) required_memory_kib
    : bool =
  let open Memory in
  let rec wait accumulated_wait_time_seconds =
    let host_info = Xenctrl.physinfo xc in
    let free_memory_kib =
      kib_of_pages (Int64.of_nativeint host_info.Xenctrl.free_pages)
    in
    let scrub_memory_kib =
      kib_of_pages (Int64.of_nativeint host_info.Xenctrl.scrub_pages)
    in
    (* At exponentially increasing intervals, write *)
    (* a debug message saying how long we've waited: *)
    if is_power_of_2 accumulated_wait_time_seconds then
      debug
        "Waited %i second(s) for memory to become available: %Ld KiB free, %Ld \
         KiB scrub, %Ld KiB required"
        accumulated_wait_time_seconds free_memory_kib scrub_memory_kib
        required_memory_kib ;
    if
      free_memory_kib >= required_memory_kib
      (* We already have enough memory. *)
    then
      true
    else if scrub_memory_kib = 0L (* We'll never have enough memory. *) then
      false
    else if
      accumulated_wait_time_seconds >= maximum_wait_time_seconds
      (* We've waited long enough. *)
    then
      false
    else (
      Thread.delay 1.0 ;
      wait (accumulated_wait_time_seconds + 1)
    )
  in
  wait 0
```

## Waiting for enough free memory on NUMA nodes

To address the same situation not host-wide but per NUMA node, the
build, restore and migrate code paths for domains on NUMA machines
need a similar algorithm.

This check should run directly before the NUMA placement algorithm,
or could even become part of an improved version of it:

The NUMA placement algorithm calls the `numainfo` hypercall to
obtain a table of NUMA nodes with the available memory on each
node and the distance matrix between the NUMA nodes as the basis
for the NUMA placement decision for the VM.

If the reported free memory of the host is lower than would be
expected at that moment, this can indicate that some memory has
not been scrubbed yet. Another indication is when the amount of
free memory increases between two checks.

Also, if other domains are in the process of being shut down,
or if a shutdown recently occurred, Xen is likely scrubbing in
the background.

For cases where the NUMA placement returns no NUMA node affinity
for the new domain, the smallest possible change would be to
simply re-run the NUMA placement algorithm.

A trivial first step would be to retry once if the initial
NUMA placement of a VM failed, and to abort retrying if the available
memory did not change since the initial failed attempt.


The system-wide polling aborts when Xen reports that no memory is
left to be scrubbed. For the NUMA memory poll, the previous per-node
results could likewise be kept and compared to the new results to
detect when no more progress is being made.

Besides, the same polling timeout as for system-wide memory
could be used.
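A per-NUMA-node variant of the host-wide polling loop could then look like this sketch. It is illustrative only: `node_free_kib` stands for a hypothetical query of one node's free memory (e.g. derived from the `numainfo` data), and an abort-on-no-progress check replaces the host-wide `scrub_pages = 0` test, since per-node scrub counters may not be available:

```ml
(* Hypothetical sketch of per-node polling; node_free_kib is an
   assumed function returning the free KiB of one NUMA node. *)
let wait_node_free_mem ~node_free_kib ?(maximum_wait_time_seconds = 64) ~node
    required_memory_kib : bool =
  let rec wait seconds previous_free_kib =
    let free_kib = node_free_kib node in
    if free_kib >= required_memory_kib then
      true (* Enough memory on this node. *)
    else if free_kib = previous_free_kib then
      false (* No progress since the last poll: stop waiting. *)
    else if seconds >= maximum_wait_time_seconds then
      false (* Waited long enough. *)
    else (
      Thread.delay 1.0 ;
      wait (seconds + 1) free_kib
    )
  in
  wait 0 (-1L)
```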

## An example scenario

This is an example scenario where not waiting for memory scrubbing
in a NUMA-aware way could fragment the VM across many NUMA nodes:

In this example, a relatively large VM is rebooted:

Fictional machine with 4 NUMA nodes, 25 GB each (for layout reasons):

```mermaid
%%{init: {"packet": {"bitsPerRow": 25, "rowHeight": 38}} }%%
packet-beta
0-18: "Memory used by other VMs"
19-24: "free: 6 GB"
25-44: "VM before restart: 20 GB"
45-49: "free: 5 GB"
50-69: "Memory used by other VMs"
70-74: "free: 5 GB"
75-94: "Memory used by other VMs"
95-99: "free: 5 GB"
```

VM is destroyed:

```mermaid
%%{init: {"packet": {"bitsPerRow": 25, "rowHeight": 38}} }%%
packet-beta
0-18: "Memory used by other VMs"
19-24: "free: 6 GB"
25-44: "VM memory to be reclaimed, but not yet scrubbed"
45-49: "free: 5 GB"
50-69: "Memory used by other VMs"
70-74: "free: 5 GB"
75-94: "Memory used by other VMs"
95-99: "free: 5 GB"
```

NUMA placement runs and sees that no NUMA node has enough memory
for the VM. Therefore:
1. NUMA placement does not return a NUMA placement solution.
2. As a result, vCPU soft pinning is not set up.
3. As a result, the domain does not get a NUMA node affinity.
4. When `xenguest` allocates the VM's memory, Xen falls back to
   round-robin memory allocation across all NUMA nodes.

Even if Xen has already scrubbed the memory by the time the
NUMA placement function returns, the decision not to select
a NUMA placement has already been made. The domain is then
built in this way:

```mermaid
%%{init: {"packet": {"bitsPerRow": 25, "rowHeight": 38}} }%%
packet-beta
0-18: "Memory used by other VMs"
19-23: "VM: 5 GB"
24-24: ""
25-44: "scrubbed/reclaimed free memory: 20 GB"
45-49: "VM: 5 GB"
50-69: "Memory used by other VMs"
70-74: "VM: 5 GB"
75-94: "Memory used by other VMs"
95-99: "VM: 5 GB"
```
62 changes: 62 additions & 0 deletions doc/content/toolstack/features/NUMA/node-fallbacks.md
@@ -0,0 +1,62 @@
---
title: "VM.build: Use neighbouring NUMA nodes"
weight: 60
mermaid:
force: true
---

{{% include "snippets/2x2.md" %}}

This shows that the distance to remote memory on the same socket
is far lower than the distance to the other socket, which is twice
the local memory distance.

Using remote memory on the same socket therefore increases the
distance by only a tenth of the local distance, whereas using the
other socket doubles it.

Hence, for VM builds during which there might not be enough NUMA
memory on one NUMA node, same-socket memory would incur only about
half the memory distance that remote-socket memory would.

At the same time, if the memory is (roughly) equally spread over two
NUMA nodes on the same socket, it could make sense to move the
vCPU affinity between the two NUMA nodes depending on their CPU load.

In the simplest case, the vCPU affinity could be set to e.g. two
NUMA nodes on the same socket (specified as having low distance),
which would cause Xen to allocate the memory from both NUMA nodes.

| Node | RAM | used | free |
| ----:| ---:| ----:| ----:|
| 1 | 50 | 35 | 15 |
| 2 | 50 | 45 | 5 |
| 3 | 50 | 35 | 15 |
| 4 | 50 | 35 | 15 |
| all | 200 | 150 | 50 |
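Such a fallback could pick the two nodes as in this sketch (illustrative only; `free` and `distance` are assumed to come from the `numainfo` data): among all pairs whose combined free memory suffices, prefer the pair with the lowest distance, i.e. two nodes on the same socket when possible:

```ml
(* Hypothetical sketch: among all node pairs with enough combined
   free memory, return the pair with the smallest distance. *)
let best_node_pair ~free ~distance required_mib =
  let n = Array.length free in
  let best = ref None in
  for i = 0 to n - 1 do
    for j = i + 1 to n - 1 do
      if free.(i) + free.(j) >= required_mib then begin
        match !best with
        | Some (_, _, d) when d <= distance.(i).(j) -> ()
        | _ -> best := Some (i, j, distance.(i).(j))
      end
    done
  done ;
  !best
```

Applied to the table above, a same-socket pair of the three nodes with 15 GB free would win over any pair involving the nearly full node 2.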

52 changes: 52 additions & 0 deletions doc/content/toolstack/features/NUMA/parallel-VM.build.md
@@ -0,0 +1,52 @@
---
title: "Parallel VM build"
categories:
- NUMA
weight: 50
mermaid:
force: true
---

## Introduction

When the `xenopsd` server receives a `VM.start` request, it:
1. splits the request into micro-ops and
2. dispatches the micro-ops into one queue per VM.

When `VM.start` requests arrive faster than the thread pool
finishes them, the thread pool runs multiple
micro-ops for different VMs in parallel. This includes the
`VM.build` micro-op that does NUMA placement and VM memory allocation.

The [Xenopsd architecture](xenopsd/architecture/_index) and the
[walkthrough of VM.start](VM.start) provide more details.

This walkthrough dives deeper into the `VM_create` and `VM_build` micro-ops
and focuses on allocating memory for different VMs in
parallel with respect to the NUMA placement of the starting VMs.

## Architecture

This diagram shows the [architecture](../../../xenopsd/architecture/_index) of Xenopsd:

At the top of the diagram, two client RPCs have been sent:
One to start a VM and the other to fetch the latest events.
The `Xenops_server` module splits them into "micro-ops" (labelled "μ op" here).
These micro-ops are enqueued in queues, one queue per VM. The thread pool pulls
from the VM queues and runs the micro-ops:

![Inside xenopsd](../../../../xenopsd/architecture/xenopsd.svg)
<center><figcaption><i>Image 1: Xenopsd architecture</i></figcaption></center>

Overview of the micro-ops for creating a new VM:

- `VM.create`: create an empty Xen domain in the hypervisor and the Xenstore
- `VM.build`: build the Xen domain: allocate guest memory and load the firmware and `hvmloader`
- Several micro-ops to attach devices and launch the device model.
- `VM.unpause`: unpause the domain
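The queue-per-VM dispatch described above can be sketched as follows (illustrative only; the real implementation lives in `Xenops_server`): micro-ops for one VM stay ordered in that VM's queue, while worker threads may serve different VMs' queues in parallel:

```ml
(* Sketch of per-VM micro-op queues; names are illustrative. *)
type micro_op = VM_create | VM_build | VM_attach_devices | VM_unpause

let queues : (string, micro_op Queue.t) Hashtbl.t = Hashtbl.create 16

let m = Mutex.create ()

(* Append a micro-op to the queue of the given VM, creating the
   queue on first use. *)
let enqueue ~vm op =
  Mutex.lock m ;
  let q =
    match Hashtbl.find_opt queues vm with
    | Some q ->
        q
    | None ->
        let q = Queue.create () in
        Hashtbl.add queues vm q ; q
  in
  Queue.push op q ;
  Mutex.unlock m
```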

## Flowchart: Parallel VM start

When multiple `VM.start` operations run concurrently, an example could look like this:

{{% include "snippets/vm-build-parallel" %}}
61 changes: 61 additions & 0 deletions doc/content/toolstack/features/NUMA/snippets/2x2.md
@@ -0,0 +1,61 @@
---
title: 2 sockets, 4 nodes
description: NUMA topology with 2 sockets, 4 nodes
---

### Example NUMA topology with 2 sockets, 4 nodes:

A topology with 2 sockets and 4 nodes results
in this NUMA memory distance matrix:

|node| 0| 2| 1| 3|
|---:|-:|-:|-:|-:|
| 0 |10|21|11|21|
| 2 |21|10|21|11|
| 1 |11|21|10|21|
| 3 |21|11|21|10|


The distance values in this diagram describe, in a normalized way, how large
the distance from a NUMA node's CPU to the memory of another node is:

- 10: This is (by convention) the distance to local memory of the NUMA node
- 11: Relative to 10, the distance to the remote memory on the same socket
- 21: Relative to 10, the distance to the remote memory on the other socket
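For illustration, the matrix can also be written down with the nodes in natural order (nodes 0 and 1 share one socket, nodes 2 and 3 the other), together with a sketched helper that normalizes a distance to the local-access cost:

```ml
(* The distance matrix above, with nodes in natural order 0..3. *)
let distance =
  [|
     [| 10; 11; 21; 21 |] (* from node 0 *)
   ; [| 11; 10; 21; 21 |] (* from node 1 *)
   ; [| 21; 21; 10; 11 |] (* from node 2 *)
   ; [| 21; 21; 11; 10 |] (* from node 3 *)
  |]

(* Cost of node j's memory seen from node i's CPU, relative to
   local access (1.0): *)
let relative_cost i j = float_of_int distance.(i).(j) /. 10.0
```

So same-socket remote access costs 1.1 times local access, while cross-socket access costs 2.1 times local access.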

This NUMA distance matrix could be visualized using this block diagram:

{{< mermaid >}}
block-beta
columns 3
%% 1st row, left column
block columns 1
Mem0[/"Memory of Node 0"/]
Dist0<["Distance: 10"]>(up)
Node0{{"CPU of Node 0"}}
end
%% 1st row, middle column
space
%% 1st row, right column
block columns 1
Mem2[/"Memory of Node 2"/]
Dist2<["Distance: 10"]>(up)
Node2{{"CPU of Node 2"}}
end
%% 2nd row
Socket_1<["Distance: 11"]>(y)
x<["Distance: 21"]>(x)
Socket_2<["Distance: 11"]>(y)
%% 3rd row
block columns 1
Node1{{"CPU of Node 1"}}
Dist1<["Distance: 10"]>(down)
Mem1[/"Memory of Node 1"/]
end
space
block columns 1
Node3{{"CPU of Node 3"}}
Dist3<["Distance: 10"]>(down)
Mem3[/"Memory of Node 3"/]
end
{{< /mermaid >}}
6 changes: 6 additions & 0 deletions doc/content/toolstack/features/NUMA/snippets/_index.md
@@ -0,0 +1,6 @@
+++
title = "NUMA topologies"
weight = 200
+++

{{% children description=true %}}