doc/content/lib/xenctrl/get_free_buddy-flowchart.md (new file, 34 additions)

---
title: Flowchart of get_free_buddy() of the Xen Buddy allocator
hidden: true
---
```mermaid
flowchart TD
alloc_round_robin--No free memory on the host-->Failure
node_affinity_exact--No free memory<br>on the Domain's
node_affinity nodes:<br>Abort exact allocation-->Failure

get_free_buddy["get_free_buddy()"]
-->MEMF_node{memflags<br>&<br>MEMF_node?}
--Yes-->
try_MEMF_node{Alloc<br>from<br>node}--Success: page-->Success
try_MEMF_node--No free memory on the node
-->MEMF_exact{memflags<br>&<br>MEMF_exact?}
MEMF_exact--"No"-->node_affinity_set{NUMA affinity set?}
node_affinity_set
--Domain->node_affinity is<br>not set: Fall back to<br>round-robin allocation
-->alloc_round_robin
MEMF_exact--No free memory on<br>the requested NUMA node:
Abort exact allocation-->Failure
MEMF_node--No NUMA node in memflags
-->node_affinity_set{domain-><br>node_affinity<br>set?}
--Set-->node_affinity{Alloc from<br>node_affinity<br>nodes}
--No free memory on<br>the node_affinity nodes<br>Check if exact request
-->node_affinity_exact{memflags<br>&<br>MEMF_exact?}
--Not exact: Fall back to<br>round-robin allocation-->alloc_round_robin
node_affinity--Success: page-->Success
alloc_round_robin{"Fall back to<br>round-robin
allocation"}--Success: page-->Success(Success: Return the page)
click get_free_buddy
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L855" _blank
```
doc/content/lib/xenctrl/populate_physmap-dataflow.md (new file, 100 additions)

---
title: Flowchart for the populate_physmap hypercall
hidden: true
---
```mermaid
flowchart

subgraph XenCtrl
xc_domain_populate_physmap["<tt>xc_domain_populate_physmap()"]
xc_domain_populate_physmap_exact["<tt>xc_domain_populate_physmap_exact()"]
end

subgraph Xen

%% sub-subgraph from memory_op() to populate_node() and back

xc_domain_populate_physmap & xc_domain_populate_physmap_exact
<--reservation,<br>and for preempt:<br>nr_start/nr_done-->
memory_op("<tt>memory_op(XENMEM_populate_physmap)")

memory_op
--struct xen_memory_reservation-->
construct_memop_from_reservation("<tt>construct_memop_from_reservation()")
--struct<br>xen_memory_reservation->mem_flags-->
propagate_node("<tt>propagate_node()")
--_struct<br>memop_args->memflags_-->
construct_memop_from_reservation
--_struct memop_args_
-->memory_op<--struct memop_args *:
struct domain *,
List of extent base addrs,
Number of extents,
Size of each extent (extent_order),
Allocation flags(memflags)-->
populate_physmap[["<tt>populate_physmap()"]]
<-.domain, extent base addrs, extent size, memflags, nr_start and nr_done.->
populate_physmap_loop--if memflags & MEMF_populate_on_demand -->guest_physmap_mark_populate_on_demand("
<tt>guest_physmap_mark_populate_on_demand()")
populate_physmap_loop@{ label: "While extents to populate,
and not asked to preempt,
for each extent left to do:", shape: notch-pent }
--domain, order, memflags-->
alloc_domheap_pages("<tt>alloc_domheap_pages()")
--zone_lo, zone_hi, order, memflags, domain-->
alloc_heap_pages
--zone_lo, zone_hi, order, memflags, domain-->
get_free_buddy("<tt>get_free_buddy()")
--_page_info_
-->alloc_heap_pages
--if no page-->
no_scrub("<tt>get_free_buddy(MEMF_no_scrub)</tt>
(honored only when order==0)")
--_dirty 4k page_
-->alloc_heap_pages
<--_dirty 4k page_-->
scrub_one_page("<tt>scrub_one_page()")
alloc_heap_pages("<tt>alloc_heap_pages()</tt>
(also splits higher-order pages
into smaller buddies if needed)")
--_page_info_
-->alloc_domheap_pages
--page_info, order, domain, memflags-->assign_page("<tt>assign_page()")
assign_page
--page_info, nr_mfns, domain, memflags-->
assign_pages("<tt>assign_pages()")
--domain, nr_mfns-->
domain_adjust_tot_pages("<tt>domain_adjust_tot_pages()")
alloc_domheap_pages
--_page_info_-->
populate_physmap_loop
--page(gpfn, mfn, extent_order)-->
guest_physmap_add_page("<tt>guest_physmap_add_page()")

populate_physmap--nr_done, preempted-->memory_op
end
click memory_op
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L1409-L1425" _blank
click construct_memop_from_reservation
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L1022-L1071" _blank
click propagate_node
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L524-L547" _blank
click populate_physmap
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L159-L314" _blank
click populate_physmap_loop
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L197-L304" _blank
click guest_physmap_mark_populate_on_demand
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L210-220" _blank
click guest_physmap_add_page
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L296" _blank
click get_free_buddy
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L855-L958" _blank
click alloc_heap_pages
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L967-L1116" _blank
click assign_page
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2540-L2633" _blank
click assign_pages
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2635-L2639" _blank
click alloc_domheap_pages
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2641-L2697" _blank
```
doc/content/lib/xenctrl/struct/xen_memory_reservation.md (new file, 53 additions)

---
title: xen_memory_reservation
description: xen_memory_reservation for memory-related hypercalls
hidden: true
---
[struct xen_memory_reservation](https://github.com/xen-project/xen/blob/96970b46/xen/include/public/memory.h#L46-80)
is used by
[these XENMEM hypercall commands](https://github.com/xen-project/xen/blob/96970b46/xen/include/public/memory.h#L48-59):

- `XENMEM_increase_reservation`: Returns the first MFN of the allocated extents
- `XENMEM_decrease_reservation`: To pass the first GPFN of extents to free
- [XENMEM_populate_physmap](../xc_domain_populate_physmap):
- In: To pass the first GPFN to populate with memory
- Out: Returns the first GMFN base of extents that were allocated
(NB. This command also updates the mach_to_phys translation table)
- `XENMEM_claim_pages`: Not used, must be passed as 0
  (this is explicitly checked; otherwise, the hypercall returns `-EINVAL`)

[struct xen_memory_reservation](https://github.com/xen-project/xen/blob/96970b46/xen/include/public/memory.h#L46-80)
is defined as:

```c
struct xen_memory_reservation {
XEN_GUEST_HANDLE(xen_pfn_t) extent_start; /* PFN of the starting extent */
xen_ulong_t nr_extents; /* number of extents of size extent_order */
unsigned int extent_order; /* an order of 0 means: 4k pages, 1: 8k, etc. */
unsigned int mem_flags;
domid_t domid; /* integer ID of the domain */
};
```

The `mem_flags` bit field is accessed using:

```c
/*
* Maximum # bits addressable by the user of the allocated region (e.g., I/O
* devices often have a 32-bit limitation even in 64-bit systems). If zero
* then the user has no addressing restriction. This field is not used by
* XENMEM_decrease_reservation.
*/
#define XENMEMF_address_bits(x) (x)
#define XENMEMF_get_address_bits(x) ((x) & 0xffu)
/* NUMA node to allocate from. */
#define XENMEMF_node(x) (((x) + 1) << 8)
#define XENMEMF_get_node(x) ((((x) >> 8) - 1) & 0xffu)
/* Flag to populate physmap with populate-on-demand entries */
#define XENMEMF_populate_on_demand (1<<16)
/* Flag to request allocation only from the node specified */
#define XENMEMF_exact_node_request (1<<17)
#define XENMEMF_exact_node(n) (XENMEMF_node(n) | XENMEMF_exact_node_request)
/* Flag to indicate the node specified is virtual node */
#define XENMEMF_vnode (1<<18)
```
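As a usage sketch, note that `XENMEMF_node()` stores the node with a +1 bias, so
that a `mem_flags` value of 0 means "no node specified". Composing and decoding
an exact-node request could look like this (the macros are copied from above to
be self-contained; `flags` is an illustrative variable, not part of the API):

```c
/* Compose mem_flags for an exact allocation on NUMA node 1, using the
 * XENMEMF_* macros shown above (copied here to be self-contained). */
#define XENMEMF_node(x)             (((x) + 1) << 8)
#define XENMEMF_get_node(x)         ((((x) >> 8) - 1) & 0xffu)
#define XENMEMF_exact_node_request  (1<<17)
#define XENMEMF_exact_node(n)  (XENMEMF_node(n) | XENMEMF_exact_node_request)

unsigned int flags = XENMEMF_exact_node(1);
/* The +1 bias in XENMEMF_node() makes mem_flags == 0 mean "no node". */
```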
doc/content/lib/xenctrl/xc_domain_node_setaffinity.md (23 additions, 9 deletions)


This function implements the functionality of `xc_domain_node_setaffinity`
to set the NUMA affinity of a domain as described above.
- If `new_affinity` does not intersect the `node_online_map`,
  it returns `-EINVAL`. Otherwise, it returns `0` on success.
- When the `new_affinity` is a specific set of NUMA nodes,
it sets `d->node_affinity` of the domain to these nodes
and disables `auto_node_affinity` for this domain.
- If `new_affinity` has all bits set, it re-enables `auto_node_affinity`
for this domain and calls
[domain_update_node_aff()](https://github.com/xen-project/xen/blob/e16acd80/xen/common/sched/core.c#L1809-L1876)
to re-set the domain's `node_affinity` mask to the NUMA nodes of the current
hard and soft affinities of the domain's online vCPUs.

Changing the domain's node affinity changes the preference
of the memory allocator to the new NUMA nodes.

Currently, the only scheduling change is that if set before vCPU creation,
the initial pCPU of a new vCPU is the first pCPU of the first NUMA node
in the domain's `node_affinity`. This is further changed when one or more
`cpupools` are set up.

When done early, before vCPU creation, domain-related data structures
could be allocated using the domain's `node_affinity` NUMA node mask.
With further changes in Xen, the vCPU structs could also be allocated
using it.
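The checks described above can be sketched as follows. This is a minimal
illustration, not Xen's implementation: it uses a plain 64-bit mask in place of
Xen's `nodemask_t`, and `set_node_affinity()`, `node_online_map`, and the
`struct domain` fields are simplified stand-ins for their Xen counterparts.

```c
#include <errno.h>
#include <stdint.h>

/* Sketch of the checks in domain_set_node_affinity(), assuming a
 * 64-bit mask in place of Xen's nodemask_t; names are simplified. */
typedef uint64_t nodemask_t;

static const nodemask_t node_online_map = 0x3; /* assume nodes 0 and 1 online */

struct domain {
    nodemask_t node_affinity;
    int auto_node_affinity;
};

static int set_node_affinity(struct domain *d, nodemask_t new_affinity)
{
    if (new_affinity == ~(nodemask_t)0) {
        /* All bits set: re-enable automatic NUMA affinity */
        d->auto_node_affinity = 1;
        return 0;
    }
    if (!(new_affinity & node_online_map))
        return -EINVAL; /* no intersection with the online nodes */
    d->auto_node_affinity = 0;
    d->node_affinity = new_affinity;
    return 0;
}
```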

## Notes on future design improvements

doc/content/lib/xenctrl/xc_domain_populate_physmap.md (new file, 135 additions)

---
title: xc_domain_populate_physmap()
description: Populate a Xen domain's physical memory map
mermaid:
force: true
---
`xc_domain_populate_physmap()` and `xc_domain_populate_physmap_exact()`
populate a Xen domain's physical memory map.
Both call the `populate_physmap`
hypercall; `xc_domain_populate_physmap_exact()` also sets the flag
for allocating memory only on the given NUMA node.

As an overview, it
[constructs](https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L1022-L1071)
a `struct memop_args` from the requested
[reservation](struct/xen_memory_reservation)
(start address, page size, how many of them, optionally on which NUMA node) and
[passes](https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L1459)
it to
[populate_physmap()](https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L159-L314)
to
[allocate](https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L197)
the requested number of pages:

{{% include "populate_physmap-dataflow.md" %}}

## construct_memop_from_reservation()

[construct_memop_from_reservation()](https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L1022-L1071)
populates `struct memop_args` using the
[hypercall arguments](struct/xen_memory_reservation). It:

- Copies `extent_start`, `nr_extents`, and `extent_order`.
- Populates `memop_args->memflags` using the given `mem_flags`.

### Converting a vNODE to a pNODE for vNUMA

When a vNUMA vnode is passed using `XENMEMF_vnode`, and `domain->vnuma` and
`domain->vnuma->nr_vnodes` are set, and the vnode maps to a pnode, it also:

- Populates the `pnode` in the `memflags` of the `struct memop_args`
- and sets a `XENMEMF_exact_node_request` in them as well.
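A hedged sketch of this conversion step: the array and function below stand in
for `domain->vnuma->vnode_to_pnode` and the corresponding code in
`construct_memop_from_reservation()`; `NUMA_NO_NODE` marks an unmapped vnode,
and `memflags_from_vnode()` is an illustrative name, not a Xen function.

```c
#define XENMEMF_node(x)             (((x) + 1) << 8)
#define XENMEMF_exact_node_request  (1u << 17)
#define NUMA_NO_NODE 0xffu

/* Assumed vNUMA layout: vnode 0 -> pnode 0, vnode 1 -> pnode 1 */
static unsigned int vnode_to_pnode[] = { 0, 1 };
static unsigned int nr_vnodes = 2;

/* Sketch of the vnode-to-pnode step: on success, populate the pnode
 * in memflags and set an exact-node request. */
static int memflags_from_vnode(unsigned int vnode, unsigned int *memflags)
{
    if (vnode >= nr_vnodes || vnode_to_pnode[vnode] == NUMA_NO_NODE)
        return -1; /* no usable mapping: leave memflags unchanged */
    *memflags |= XENMEMF_node(vnode_to_pnode[vnode])
               | XENMEMF_exact_node_request;
    return 0;
}
```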

### Using propagate_node() to pass a pNODE

If no vNUMA node is passed, `construct_memop_from_reservation`
[calls](https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L1067)
[propagate_node()](https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L524-L547)
to propagate the NUMA node and `XENMEMF_exact_node_request` for use in Xen.

## Allocate pages for the domain

`memory_op()`
[passes](https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L1459)
the populated `struct memop_args` to
[populate_physmap()](https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L159-L314)
to
[loop over the extents](https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L197)
to populate:

For each extent in the reservation,
[it calls](https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L275)
[alloc_domheap_pages()](https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2641)
which
[calls](https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2673)
[alloc_heap_pages()](https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L968)
which in turn
[calls](https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L1005)
[get_free_buddy()](https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L855)
to allocate the requested memory page.

## Find a page using the buddy allocator

[get_free_buddy()](https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L855-L1116)
is the main function of the Xen buddy allocator.
If possible, it tries to find the best NUMA node and memory zone to allocate from.

This flowchart shows an overview of the effects of the decisions described below:

{{% include "get_free_buddy-flowchart.md" %}}

Input parameters:
- `struct domain`
- Zones to allocate from (`zone_hi` until `zone_lo`)
- Page order (size of the page)
- populate_physmap() starts with 1GB pages and falls back to 2MB and 4k pages.
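With 4 KiB base pages, these page orders map to sizes as powers of two. A small
illustration (`order_to_bytes()` is not a Xen function, just arithmetic):

```c
/* Map an extent order to its size in bytes, assuming 4 KiB base pages:
 * order 0 = 4 KiB, order 9 = 2 MiB, order 18 = 1 GiB. */
static unsigned long order_to_bytes(unsigned int order)
{
    return 4096UL << order;
}
```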

Its first attempt is to find a page of matching page order
on the requested NUMA node(s).

If this is not successful, it tries to split pages of higher page orders,
and if that fails too, it lowers the zone down to `zone_lo`.

It does not attempt to use unscrubbed pages, but when `memflags`
contain `MEMF_no_scrub`, it uses `check_and_stop_scrub(pg)` on 4k
pages instead of breaking higher-order pages.

If this fails, it checks whether other NUMA nodes should be tried.

### Exact NUMA allocation (on request, e.g. for vNUMA)

For vNUMA domains, for example, the calling functions pass one specific
NUMA node and also set `MEMF_exact_node` to ensure that memory is
allocated only from this NUMA node.

If no NUMA node was passed, or the allocation from it failed, and
`MEMF_exact_node` was not set in `memflags`, the function falls
back to NUMA-affine allocation.

### NUMA-affine allocation

For local NUMA memory allocation, the domain should have one or more NUMA nodes
in its `struct domain->node_affinity` field when this function is called.

This happens as part of
[NUMA placement](../../../xenopsd/walkthroughs/VM.build/Domain.build/#numa-placement),
which writes the planned vCPU affinity of the domain's vCPUs to the XenStore.
[xenguest](../../../xenopsd/walkthroughs/VM.build/xenguest) reads it to
update the vCPU affinities of the domain's vCPUs in Xen, which in turn, by
default (when `domain->auto_node_affinity` is active), also updates the
`struct domain->node_affinity` field.

Note: In case it contains multiple NUMA nodes, this step allocates from
the next NUMA node after the NUMA node the domain previously allocated
from, in a round-robin way.

Otherwise, the function falls back to host-wide round-robin allocation.

### Host-wide round-robin allocation

When the domain's `node_affinity` is not defined, or allocation from its
nodes did not succeed, and `MEMF_exact_node` was not passed in `memflags`,
all remaining NUMA nodes are attempted in a round-robin way: each
subsequent call uses the next NUMA node after the node that the domain
previously allocated memory from.
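The round-robin step can be sketched as follows, under stated assumptions: a
bitmask of online nodes, and `next_node()` plus the "last node" parameter are
illustrative names, not Xen's actual data structures.

```c
/* Sketch of round-robin node selection: start after the node the domain
 * last allocated from, wrapping around, and skip offline nodes. */
#define NR_NODES 4

static unsigned int next_node(unsigned int last, unsigned int online_mask)
{
    for (unsigned int i = 1; i <= NR_NODES; i++) {
        unsigned int node = (last + i) % NR_NODES;
        if (online_mask & (1u << node))
            return node;
    }
    return last; /* no other online node: stay on the current one */
}
```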