Skip to content

[POC][FIX] Zero cause_chain and extra_context in SafeCallContext to prevent SIGBUS on macOS arm64#592

Closed
oraluben wants to merge 1 commit into
apache:mainfrom
oraluben:fix/sigbus-v0.1.11
Closed

[POC][FIX] Zero cause_chain and extra_context in SafeCallContext to prevent SIGBUS on macOS arm64#592
oraluben wants to merge 1 commit into
apache:mainfrom
oraluben:fix/sigbus-v0.1.11

Conversation

@oraluben
Copy link
Copy Markdown
Contributor

Problem

On macOS arm64 with tvm-ffi >= 0.1.8, when a C++ exception (e.g. InternalError from LOG(FATAL)) is thrown inside a PackedFunc and forwarded through SafeCallContext::SetRaised / SetRaisedByCstr, the process crashes with SIGBUS during cleanup.

Root Cause

SafeCallContext::SetRaised stores an ErrorObj via ObjectPtrFromUnowned. The cause_chain and extra_context fields (added in PR #396 / v0.1.8) may contain stale non-null bytes from reused allocator memory on macOS arm64. When ~ErrorObj() runs, it calls DecRefObjectHandle on these garbage pointer values, causing SIGBUS (EXC_ARM_DA_ALIGN).

Crash address 0x6c616e7265746e49 = ASCII "Internal" (first 8 bytes of "InternalError") being dereferenced as a pointer.

Crash trace:
SimpleObjAllocator::Handler::Deleter_
TVMFFIObjectDecRef
dict_dealloc → BaseException_dealloc → subtype_dealloc

Fix

Explicitly zero cause_chain and extra_context before storing the ErrorObj in TLS, ensuring ~ErrorObj() correctly sees nullptr and short-circuits the DecRefObjectHandle calls.

Tested

macOS arm64, tilelang 0.1.9 release

  • Before: RC=-10 (SIGBUS)
  • After: RC=0

…t SIGBUS on macOS arm64

On macOS arm64 with tvm-ffi >= 0.1.8, when a C++ exception
(e.g. InternalError from LOG(FATAL)) is thrown inside a PackedFunc
and the error is forwarded through SafeCallContext::SetRaised /
SetRaisedByCstr, the ErrorObj's cause_chain and extra_context
fields may contain stale non-null bytes from reused allocator
memory.  ~ErrorObj() then calls DecRefObjectHandle on a garbage
pointer → SIGBUS (EXC_ARM_DA_ALIGN).

Explicitly zero these two fields before storing the ErrorObj in
TLS to ensure the destructor sees nullptr and short-circuits
safely.

Fixes the crash reported in tilelang when using T.gemm on Metal
target (which triggers ICHECK → LOG(FATAL) → InternalError).
@oraluben oraluben force-pushed the fix/sigbus-v0.1.11 branch from 4097642 to 5217c81 Compare May 19, 2026 07:53
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces manual zeroing of the cause_chain and extra_context fields in SetRaised and SetRaisedByCstr to prevent SIGBUS errors on macOS arm64 caused by stale memory. However, the reviewer noted that unconditionally zeroing these fields in SetRaised could overwrite valid data and lead to memory leaks. It was suggested that these fields should instead be initialized within the ErrorObj constructor to ensure safety across all instances. Additionally, the reviewer recommended extending this fix to SetRaisedByCstrParts for consistency.

Comment thread src/ffi/error.cc
Comment on lines +39 to +40
cell->cause_chain = nullptr;
cell->extra_context = nullptr;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Unconditionally zeroing cause_chain and extra_context in SetRaised will overwrite valid data if the error object was already initialized with a cause chain (e.g., via TVMFFIErrorCreateWithCauseAndExtraContext). This results in data loss and potential memory leaks of the objects previously held in those fields.

If these fields contain garbage due to a layout mismatch or uninitialized memory in the allocator, the root cause should ideally be addressed in the ErrorObj constructor in include/tvm/ffi/error.h. This would ensure all ErrorObj instances are correctly initialized upon creation, regardless of how they are subsequently used, and would avoid the need for manual zeroing in every call site.

Comment thread src/ffi/error.cc
details::ObjectUnsafe::TVMFFIObjectPtrFromObjectRef(error));
cell->cause_chain = nullptr;
cell->extra_context = nullptr;
last_error_ = details::ObjectUnsafe::ObjectPtrFromObjectRef<ErrorObj>(std::move(error));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The fix applied here should also be extended to SetRaisedByCstrParts (lines 56-73) to ensure consistency across all error-raising methods in SafeCallContext. Currently, SetRaisedByCstrParts remains vulnerable to the same SIGBUS issue described in the PR as it also creates a new Error and stores it in last_error_ without zeroing these fields.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@oraluben oraluben changed the title [FIX] Zero cause_chain and extra_context in SafeCallContext to prevent SIGBUS on macOS arm64 [POC][FIX] Zero cause_chain and extra_context in SafeCallContext to prevent SIGBUS on macOS arm64 May 19, 2026
@oraluben oraluben mentioned this pull request May 19, 2026
@oraluben
Copy link
Copy Markdown
Contributor Author

The current fix is not ideal for two reasons:

  1. Unconditionally clears cause_chain and extra_context — this would incorrectly discard legitimately set values from Error(kind, msg, bt, cause_chain, extra_context) (the 5-arg constructor).
  2. Scattered across two call sites (SetRaised + SetRaisedByCstr) rather than a single centralized point.

A more proper fix could be one of:

  • Zero the memory in the allocator: std::memset(data, 0, sizeof(T)) in SimpleObjAllocator::Handler::New after allocation and before placement new. This guarantees all POD fields are zeroed regardless of constructor, and is the most systematic approach.
  • Or use member initializer lists in ErrorObj to avoid relying on constructor body execution.

@tqchen @junrushao would appreciate your thoughts on the right approach here.

@tqchen
Copy link
Copy Markdown
Member

tqchen commented May 19, 2026

i think the latest code https://github.com/apache/tvm-ffi/blob/main/include/tvm/ffi/error.h#L69 already ensures the two fields are set to zero. so updating tvm-ffi to latest should resolve the issue

@oraluben
Copy link
Copy Markdown
Contributor Author

Updated findings — this is actually an ABI mismatch, not a missing constructor

After extensive testing, the root cause is not that the ErrorObj constructor fails to initialize
the fields, but rather an ABI mismatch between compile-time and runtime tvm-ffi.

Test matrix

tilelang build (vendored tvm-ffi) runtime tvm-ffi Result
release 0.1.9 (v0.1.3) 0.1.7 RC=0 ✓
release 0.1.9 (v0.1.3) 0.1.8 SIGBUS
release 0.1.9 (v0.1.3) 0.1.10 SIGBUS
release 0.1.9 (v0.1.3) 0.1.11 SIGBUS
refactor branch (v0.1.11) 0.1.10 RC=0 ✓
refactor branch (v0.1.11) 0.1.11 RC=0 ✓

Explanation

v0.1.3: TVMFFIErrorCell = 56 bytes (kind, message, backtrace, update_backtrace)
v0.1.8+: TVMFFIErrorCell = 72 bytes (adds cause_chain, extra_context)

When tilelang is compiled against v0.1.3 headers and linked against v0.1.8+ at runtime,
internal struct layouts (TVMFFIAny, TVMFFIObject, TVMFFIErrorCell) can differ.
The ErrorObj destructor in the runtime binary accesses cause_chain at the v0.1.8+
offset, but the memory at that offset may contain leftover data ("Internal") from a
different allocation, not the nullptr set by the constructor.

What this means for this PR

This POC fix (zeroing cause_chain/extra_context in SafeCallContext) is a workaround
that prevents the SIGBUS by defensively clearing the fields — but it doesn't address
the underlying ABI mismatch. The proper fix is for downstream projects to ensure their
vendored tvm-ffi headers match the runtime version.

That said, zeroing these fields is still a reasonable defensive measure in case any
code path ever creates ErrorObj without proper initialization.

@tqchen @junrushao thoughts on whether we should still land this as a safety net?

@oraluben oraluben closed this May 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants