
BaseLib: Strengthen X86MemoryFence#12274

Open
jaykrell wants to merge 3 commits into tianocore:master from jaykrell:barrier

Conversation


@jaykrell jaykrell commented Mar 11, 2026

Description

Under LTCG, X86MemoryFence is completely broken: it does nothing at all.
Without LTCG it was reasonable.

  • Breaking change?
    • No.
  • Impacts security?
    • It might help security.
  • Includes tests?
    • No tests.

How This Was Tested

Build.

Integration Instructions

Very easy to integrate.

While this is progress, I still feel uneasy and I think this merits further discussion.

In our code we have MemoryBarrier. That is the intrinsic __faststorefence, which emits lock or [rsp],0:
a "full" memory barrier that no loads or stores will cross, in either direction.

I think that might be what people expect, and that is effectively what ARM64 does (dmb).

But I am also afraid to slow things too much.

  • Not really needed? The weak fence is adequate for most cases?
  • Make a new function FullMemoryFence?
  • Strengthen this function?

@jaykrell jaykrell force-pushed the barrier branch 4 times, most recently from 3ed1b3f to d80569d on March 11, 2026 07:30
@jaykrell jaykrell changed the title Strengthen X86MemoryFence. BaseLib: Strengthen X86MemoryFence. Mar 11, 2026
@jaykrell jaykrell changed the title BaseLib: Strengthen X86MemoryFence. BaseLib: Strengthen X86MemoryFence Mar 11, 2026
@jaykrell jaykrell marked this pull request as ready for review March 11, 2026 07:57
@mdkinney
Member

mdkinney commented Mar 11, 2026

Please add some real examples where the order observed is incorrect.

The _MemoryBarrier() is already used in IoLib where needed to prevent compiler optimizations from reordering operations that break real use cases. That is mainly around the use of intrinsics for IoRead/Write operations to I/O ports.

For registers that are in uncached memory ranges, the Mmio*() functions use volatile to do the access so the compiler can not optimize away the access and the compiler preserves the order of the operations expressed by the C source code. For x86, uncached memory operations are strongly ordered, so the order from the code generated by the compiler is the order the operations are performed.

@jaykrell
Author

jaykrell commented Mar 12, 2026

I am not sure of real world cases, but please hear me out a bit.

Agreed, other instances of _ReadWriteBarrier, and volatile, help alleviate
any need for this function, no matter what it does.

We have shared memory between VM and host, so I/O functions are not used.
Our existing code, perhaps too conservatively written, uses MemoryBarrier.

If this function MemoryFence really can safely be completely empty, why do we have the Gcc form:

// This is a little bit of overkill and it is more about the compiler that it is
// concerned with and not the processor.

?

Presumably these three implementations should be more similar than they currently are?
Maybe the original author did not consider LTCG/LTO?

I realize the comments say the Gcc versions are overkill, but why?
Are they really?
Why should Gcc be overkill while Msvc does nothing?
Should we empty out the Gcc versions? I don't think so, but leaving Msvc as-is suggests so.

This is what motivates this PR as-is, with just _ReadWriteBarrier.

And then, of further concern, as I noted and you mentioned:

For x86, uncached memory operations are strongly ordered,
so the order from the code generated by the compiler is the order the operations are performed.

This is not quite true. Largely true, but not entirely.
x86 processors do reorder some. A little bit.
It is not for nothing that windows.h has MemoryBarrier and the Visual C++ compiler has __faststorefence.
This inhibits later loads from being reordered before earlier stores.

https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
"Loads may be reordered with older stores to different locations"

https://preshing.com/20120515/memory-reordering-caught-in-the-act/
Discusses the same thing.

That is what people might expect MemoryFence to inhibit, and what in fact MemoryBarrier does.
https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-memorybarrier
Again, on Intel this is lock or [rsp],0.
On ARM it is dmb.

I don't think there is any technical difference between "barrier" and "fence", just similar words people chose.
So if UEFI is to give this meaning to some function, and the existing
MemoryFence cannot be that slow, maybe call it FullMemoryFence?

Oh, and why I got here: I have this existing code, and I was wondering about removing
some of the "baggage" that got copied from its original environment and adopting UEFI idioms instead.
Replacing MemoryBarrier with MemoryFence seemed like one of those nice little cleanups.
But when I looked at the details, I became very uncertain and left it alone.
But if we can agree there is something here, either greatly strengthen MemoryFence,
or add a new FullMemoryFence, then I'll use that.
But that is not required, of course. We can keep our local MemoryBarrier.
Still, the existing do-nothing Visual C++ implementations, compared to the Gcc versions, seem wrong.

Thank you for your consideration here.

@jaykrell
Author

jaykrell commented Mar 13, 2026

Sorry I see I read your uncached as cached, right.
But still, you see what I am worried about -- cached, shared memory between VM and host. Not hardware.

@mdkinney
Member

Sorry I see I read your uncached as cached, right. But still, you see what I am worried about -- cached, shared memory between VM and host. Not hardware.

Yes. This is a different use case. This is the "mailbox" model for sending data between two subsystems. One example of this in the UEFI Specification is PCI DMA host controllers. Some of them use one-directional read or write transactions. Others use mailboxes with concurrent access from the CPU and the PCI controller. When the mailbox model is used, a "CommonBuffer" must be allocated that has properties to support simultaneous access from both. This could be a limited resource such as static RAM in the chipset, or regular memory that has specific cache-setting requirements. This is why there is a clearly documented set of services and algorithms in the UEFI Spec and the UEFI Driver Writer's Guide for this more complex mailbox type of interaction.

https://tianocore-docs.github.io/edk2-UefiDriverWritersGuide/draft/18_pci_driver_design_guidelines/185_pci_dma/1858_dma_bus_master_common_buffer_operation.html#1858-dma-bus-master-common-buffer-operation

For your specific use case, you are describing a mailbox in a memory buffer between a VM Host and a VM Guest. If the VM Host and VM Guest can access that mailbox at the same time from different logical processors, then the mailbox interface has to be carefully designed so that it is known who the current owner is when setting up a transaction, and when to ring the doorbell that releases the mailbox to the other agent (ownership transfer). That transfer of ownership from one agent to the other is where memory ordering is very important. We need to make sure all transactions to the mailbox are visible in the coherency domain of the other agent before the ownership transfer is triggered. Of course, using a non-cached buffer for the mailbox helps resolve a lot of issues, but it may have perf impacts. Using a cached buffer involves cache-line ownership assumptions, and that is where CPU-specific instructions related to cache-line ownership must be used. It is also where visibility of dirty cache lines (also CPU-specific) must be handled.

The edk2 SynchronizationLib and the edk2 BaseLib MemoryFence() service are the CPU specific services required for the use case you are describing if cached buffers are used.

The _ReadWriteBarrier() compiler hint should really only apply to the ownership-transfer point, and it applies to both cached and uncached mappings of the mailbox. Using _ReadWriteBarrier() before and after the write that signals ownership transfer is required. If you look at the implementations of the SynchronizationLib, you will see the use of _ReadWriteBarrier() is already there. So using SynchronizationLib for ownership transfer is the correct edk2 mechanism to apply for this use case.

BOOLEAN
EFIAPI
AcquireSpinLockOrFail (
  IN OUT  SPIN_LOCK  *SpinLock
  )
{
  SPIN_LOCK  LockValue;
  VOID       *Result;

  ASSERT (SpinLock != NULL);

  LockValue = *SpinLock;
  ASSERT (LockValue == SPIN_LOCK_ACQUIRED || LockValue == SPIN_LOCK_RELEASED);

  _ReadWriteBarrier ();
  Result = InterlockedCompareExchangePointer (
             (VOID **)SpinLock,
             (VOID *)SPIN_LOCK_RELEASED,
             (VOID *)SPIN_LOCK_ACQUIRED
             );

  _ReadWriteBarrier ();
  return (BOOLEAN)(Result == (VOID *)SPIN_LOCK_RELEASED);
}

The same approach would be used if shared memory mailboxes are created to communicate between CPUs that have been started using the MP Services Protocol. This use case is not specific to VMs.

@jaykrell
Author

I think we are agreeing?

I mean, should the UEFI library offer a portable name
to replace those _ReadWriteBarriers? Or should it remain
only the domain of UEFI internals, free to use compiler-specific intrinsics?

I mean, you want to greatly minimize use of intrinsics, and hide them
behind portable names, right? Or is this a larger non-portable chunk,
free to be as non-portable as it wants to be?

I suggest UEFI borrow from standard C here.

https://en.cppreference.com/w/cpp/header/stdatomic.h

atomic_signal_fence
This maps to _ReadWriteBarrier.
Just a compiler inhibition.
But aims to be very compiler-portable.

atomic_thread_fence
This maps to lock or / __faststorefence, or dmb.
But aims to be very compiler-portable.

In our UEFI downstream, I use both of these for portability
(except with Visual C++, which is lacking in standard conformance).

Come up with names for these, then replace each MemoryFence
and _ReadWriteBarrier with one of them as appropriate?
