loader,smp: Enable caches on all cores for shareability #465
Conversation
midnightveil left a comment
This seems like it would be correct to me, but I'd like to confirm it against the ARM ARM first.
What is the shareability of these mappings, do you know? Hopefully we set it up as ISH.
Maybe I misunderstood @KurtWu10, but I was under the impression that this would be fine, as we only disabled the MMU & L1 cache (or actually, thinking about it, I guess we also disable the cache itself; I was asking about how it works if you just disable the MMU, not the cache too).
This probably would affect spin-table booting too, though I'm inclined to leave it as-is, since nothing but Raspberry Pi uses it (looking at Linux, it does a cache clean & invalidate to PoC there).
(Not something to fix here) this does make me realise I could have switched to the kernel page tables instead of disabling the uboot ones, as that would stop the loader loading things over the current page tables since it knows to not load over itself. Although we probably want to clean caches then.
Initially, I thought this was correct, but now I am not sure whether enabling the MMU is required for cache coherency. According to the Arm ARM D8.2.12 "The effects of disabling an address translation stage", assuming that the effective value of [...]
then according to B2.3 "Ordering requirements defined by the formal concurrency model", [...]
My interpretation is: with the MMU disabled, data memory accesses use the Device attribute, and cache coherency for them is provided for load-acquire and store-release instructions. The reasoning above assumes that the effective value of [...]
I think you are correct here: cacheability and coherency can indeed apply regardless of the MMU, as long as the cores are in the same shareability domain and the cache attributes are consistent. My mistake in the description was that I confused "enabling the caches" with "enabling the MMU", because both are currently done inside enable_mmu(). A possible refinement would be to separate cache activation from enable_mmu(), with a new function called before accessing shared data. That now gives two possible fixes: [...]
I am fine with either option. Thank you for raising this inconsistency!
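As a minimal sketch of what separating cache activation from enable_mmu() could look like, assuming execution at EL1 (the helper name and standalone function are assumptions, not existing elfloader code):

```c
#include <stdint.h>

/* Hypothetical helper: enable the caches without setting SCTLR_EL1.M
 * (the MMU enable bit). Bit positions per the AArch64 SCTLR_EL1 layout. */
static inline void enable_caches_el1(void)
{
    uint64_t sctlr;
    asm volatile("mrs %0, sctlr_el1" : "=r"(sctlr));
    sctlr |= UINT64_C(1) << 2;   /* SCTLR_EL1.C: data/unified cache enable */
    sctlr |= UINT64_C(1) << 12;  /* SCTLR_EL1.I: instruction cache enable */
    asm volatile("msr sctlr_el1, %0 \n isb" :: "r"(sctlr));
}
```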
In #464 I set inner shareability in the memory translation attributes.
That's the point: we could disable the MMU and only enable the cache. But in that situation I don't know what the default shareability settings are. I tested this on STM32MP and it works, but I'm afraid we could fall into SoC-specific or undefined behaviour. Making sure the attributes in the translation tables are set (thus with the MMU enabled) is IMHO the best way to avoid ambiguity.
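To make the translation-table point concrete, these are the relevant lower-attribute bits of an AArch64 stage 1 block/page descriptor (bit positions from the Arm ARM; the macro names are assumptions):

```c
#include <stdint.h>

#define ATTR_ATTRINDX(i) ((uint64_t)(i) << 2)  /* AttrIndx: selects MAIR_ELx slot */
#define ATTR_SH_ISH      (UINT64_C(3) << 8)    /* SH[1:0] = 0b11: inner shareable */
#define ATTR_AF          (UINT64_C(1) << 10)   /* access flag */

/* e.g. normal write-back memory in MAIR slot 0, inner shareable:
 *   desc = paddr | ATTR_ATTRINDX(0) | ATTR_SH_ISH | ATTR_AF | 0x1;
 */
```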
Yes, I agree. I think it makes sense to turn the MMU and caches on.
FYI, you'll need to fix the git commit title length, since there is gitlint set up on the seL4 repos: no more than 50 chars in the title and 70 in the body. Same for the other PR.
Ensure that caches are activated on all cores before accessing shared cached synchronisation data. Caches are enabled within arch_mmu_enable(), so shareability attributes are consistently configured.

Signed-off-by: Christian Bruel <christian.bruel@foss.st.com>
```c
sp[1] = 0;

/* clean up the cache line containing sp before use by secondary core */
asm volatile("dc cvac, %0" :: "r"(sp) : "memory");
```
I guess this is an argument that we should make the loader use outer shareable instead of inner shareable, then I think this should be fine.
I don't think that would solve any coherency problems when the caches are disabled on the secondary cores.
> I don't think that would solve any coherency problems when the caches are disabled on the secondary cores.
Caches are enabled early in arm_secondary_cpu_entry(). Before that, the only data that must be coherent is the parameter read from the stack by arm_secondary_cpu_entry_asm() (without this I read 0 as the logical_cpu parameter, which fails). Instructions should be fine.
We could conservatively clean the entire cache here, but that seems like overkill. Am I missing anything, or do you see any other possible source of incoherence before the caches are enabled in this sequence?
(edit: arm_secondary_cpu_entry_asm() reads the stack, not the PSCI call)
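A hedged sketch of the handoff being described; the stack-slot layout and function name are assumptions for illustration, not the actual elfloader code:

```c
#include <stdint.h>

/* Core 0: publish the entry argument on the secondary core's stack and
 * clean the line to the PoC so the still-uncached secondary can read it. */
static void publish_secondary_args(uint64_t *sp, uint64_t logical_cpu)
{
    sp[0] = logical_cpu;  /* read later by arm_secondary_cpu_entry_asm() */
    sp[1] = 0;
    asm volatile("dc cvac, %0" :: "r"(sp) : "memory"); /* clean to PoC */
    asm volatile("dsb sy" ::: "memory");               /* order before CPU_ON */
    /* ... PSCI CPU_ON to start the secondary core ... */
}
```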
Apologies, I didn't notice this was for the stack, which probably does need a cache clean here, yes.
What I had thought of was that print_lock is a variable that exists across cores that won't be in the same shareability domain. I probably need to recheck the manual here; it might be that a "dsb sy" (full system barrier) would be OK when the memory is mapped ISH as opposed to OSH, but I also don't know if that's what the compiler atomics expect, and whether the instructions they emit work for that case.
Admittedly this isn't an issue with this PR at all...
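For reference on the compiler-atomics question, here is a sketch of a print_lock-style lock using the GCC/Clang __atomic builtins, with notes on the AArch64 instructions they typically emit (these functions are illustrative, not the loader's actual lock):

```c
static volatile int print_lock;

static void lock_acquire(void)
{
    /* Typically compiles to an ldaxr/stxr loop (or swpa with LSE).
     * Forward progress for the exclusives on non-cacheable memory
     * depends on an external global monitor and is implementation
     * defined, which is why the cacheability and shareability of
     * the mapping matter here. */
    while (__atomic_exchange_n(&print_lock, 1, __ATOMIC_ACQUIRE) != 0)
        ;
}

static void lock_release(void)
{
    __atomic_store_n(&print_lock, 0, __ATOMIC_RELEASE);  /* stlr */
}
```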
No problem. It's best to discuss the details now :). I think the print_lock exclusive accesses are in the same shareability domain, because they are accessed after enable_mmu() on all cores, with the same MMU/cache attributes. Also, we enable both ISH and OSH shareability in the translation tables.
I'm lost. Is the scenario you describe with my dc cvac w/a, or with the early enable_mmu fix?
I meant this PR as it is at this moment, without early MMU enable or other changes.
My question is why that works, as it triggers the corner case I'm concerned about. (Cached data shadowing uncached writes.)
Using my w/a, the sequence is:
[...]
That's probably why I see it working at my end.
I don't know what you mean by "w/a", but the pop I mean is the return from the function call, not retrieving the arguments passed from core 0; that's always fine.
If you mean enabling caches without enabling the MMU, that should work for L2 as that's PIPT. Not entirely sure how that interacts with a VIPT L1 cache though; that might be implementation defined.
Now, if there is an issue with the cache not being invalidated after the clean, I can use "dc civac" instead of "dc cvac".
No, that would only hide the issue, as speculative reads at the wrong moment have the same effect. Better to always trigger the scenario so it breaks immediately when something changes, instead of once in a blue moon.
> My question is why that works, as it triggers the corner case I'm concerned about. (Cached data shadowing uncached writes.)
It should only work if, when returning from the last call that enables the MMU, no (important) registers were pushed to the stack and only LR is used to return. If LR got pushed when entering the MMU-enable function, then on leaving, if we pop that value from the cache, LR would contain an invalid (stale cached) value. That's okay as long as we don't do another return, but it seems somewhat precarious; e.g. it may only work when compiling with optimisations enabled.
Same for other registers, but with so many registers on AArch64 that's unlikely, so it's mostly the return address on the stack that's of concern.
That's why I said above that writing to the stack isn't safe with the MMU disabled. I don't mean the arguments passed from core 0 to the secondary cores; I mean stack writes done by the secondary cores themselves.
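As an illustrative sketch of this failure mode (hypothetical function names; the prologue and epilogue are shown as comments, since the compiler emits them):

```c
/* Hypothetical: running on a secondary core with MMU and caches off. */
extern void mmu_and_cache_enable(void);

void secondary_cpu_setup(void)
{
    /* prologue: stp x29, x30, [sp, #-16]!
     * With the cache off, this push bypasses the cache and goes straight
     * to memory; a stale copy of the line may still sit in some cache
     * from core 0's earlier writes. */
    mmu_and_cache_enable();  /* caches come on inside this call */
    /* epilogue: ldp x29, x30, [sp], #16
     * The pop can now hit the stale cached line, loading a bogus return
     * address into x30: the precarious case described above. */
}
```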
OK, sorry, you were talking about the ABI pop; I was still thinking of the logical_cpu pop... I see your point now.
It might be a stupid question, but how can we have a cache conflict between the main core's stack and the other cores'? They are not at the same address (not even in the same cache line), and if they did conflict because of cache associativity, one of them would be evicted.
... or does the clean and invalidate of the cache clear out L2 too?
I think this is the case: doing a full cache clean enforces what Arm calls the PoC (Point of Coherency), and this is called from el2_mmu_enable -> el2_mmu_disable -> flush_dcache.
So I now see the sequence as:
1. Core 0: Write data to core 1's stack, bringing it into L1/L2.
2. Core 0: Clean, but do not invalidate, the cache line (this PR), and start core 1.
3. Core 1: Call the function that cleans and invalidates the cache and enables the MMU -> PoC (*)
4. Core 1: Pop registers.
(*) Any data cleaned to the PoC is guaranteed to be visible.
> It might be a stupid question, but how can we have a cache conflict between the main core's stack and the other cores'? They are not at the same address (not even in the same cache line).
It's not a conflict between the main core's stack and other core's stack. It's a conflict in value between the cached view and the uncached view of the other core's stacks.
> Looks like just doing a full cache clean before jumping to the secondary core is the way to enforce what Arm calls the PoC.
Yes, and that's fine. That's part of what I meant with "retrieving the arguments passed from core 0, that's always fine", as that's easily made coherent with the cache clean.
The incoherence is introduced by the other core writing any data to uncached memory while the same cache line is in any other cache. Those writes bypass the cache altogether. The writes that are most concerning are pushes to the stack.
If this happens, then after enabling the cache the core will suddenly see the cached value.
Now, I think the reason it works in practice is that flush_dcache does a clean and invalidate by set/way for all cache levels in el1/2_mmu_disable, which is the deepest function. At that point all values stored on the stack are made coherent. Very importantly, it doesn't do any other memory writes (directly or via push) between the invalidate and the cache and MMU enable.
If it did just a clean, we would hit this issue, and it would probably crash when returning from one of the functions called.
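In code form, the ordering that (per the explanation above) makes the current sequence work; the structure is paraphrased from the discussion, not copied from the elfloader source:

```c
/* Paraphrased sketch of the existing sequence on a secondary core. */
extern void flush_dcache(void); /* clean + invalidate by set/way, all levels */

void el2_mmu_enable_sketch(void)
{
    flush_dcache();  /* everything on the stack reaches the PoC,
                      * and no stale copies remain at any level */

    /* Crucial: no stores (and no pushes) between the invalidate and
     * the enable, or the stale-line window reopens. */

    asm volatile(
        "mrs x0, sctlr_el2      \n"
        "orr x0, x0, #(1 << 0)  \n"  /* M: MMU enable */
        "orr x0, x0, #(1 << 2)  \n"  /* C: data cache enable */
        "msr sctlr_el2, x0      \n"
        "isb" ::: "x0", "memory");
}
```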
Hello,
This fixes an SMP boot issue observed on the STM32MP2 platform: core 0 does not start because print_lock is never seen as released by the secondary cores. The issue is related to cacheability, which is not yet enabled at this stage.
Ensure all cores are in the same coherency domain before using cached shared data, by enabling the MMU on all cores before any shared data is used.
Thus, arch_mmu_enable() is moved out of start_kernel().
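A sketch of the resulting boot flow on a secondary core, as the description implies (call sites paraphrased; only arch_mmu_enable() and start_kernel() are names taken from the PR):

```c
extern void arch_mmu_enable(void);
extern void start_kernel(int logical_cpu);

void arm_secondary_cpu_entry(int logical_cpu)
{
    /* Previously the caches only came on inside start_kernel() via
     * arch_mmu_enable(); doing it first means shared cached data such
     * as print_lock is coherent for everything that follows. */
    arch_mmu_enable();

    /* ... take print_lock, print boot messages, synchronise ... */

    start_kernel(logical_cpu);  /* no longer calls arch_mmu_enable() */
}
```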