[POC][FIX] Zero cause_chain and extra_context in SafeCallContext to prevent SIGBUS on macOS arm64#592
[POC][FIX] Zero cause_chain and extra_context in SafeCallContext to prevent SIGBUS on macOS arm64#592oraluben wants to merge 1 commit into
Conversation
…t SIGBUS on macOS arm64 On macOS arm64 with tvm-ffi >= 0.1.8, when a C++ exception (e.g. InternalError from LOG(FATAL)) is thrown inside a PackedFunc and the error is forwarded through SafeCallContext::SetRaised / SetRaisedByCstr, the ErrorObj's cause_chain and extra_context fields may contain stale non-null bytes from reused allocator memory. ~ErrorObj() then calls DecRefObjectHandle on a garbage pointer → SIGBUS (EXC_ARM_DA_ALIGN). Explicitly zero these two fields before storing the ErrorObj in TLS to ensure the destructor sees nullptr and short-circuits safely. Fixes the crash reported in tilelang when using T.gemm on Metal target (which triggers ICHECK → LOG(FATAL) → InternalError).
4097642 to
5217c81
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces manual zeroing of the cause_chain and extra_context fields in SetRaised and SetRaisedByCstr to prevent SIGBUS errors on macOS arm64 caused by stale memory. However, the reviewer noted that unconditionally zeroing these fields in SetRaised could overwrite valid data and lead to memory leaks. It was suggested that these fields should instead be initialized within the ErrorObj constructor to ensure safety across all instances. Additionally, the reviewer recommended extending this fix to SetRaisedByCstrParts for consistency.
| cell->cause_chain = nullptr; | ||
| cell->extra_context = nullptr; |
There was a problem hiding this comment.
Unconditionally zeroing cause_chain and extra_context in SetRaised will overwrite valid data if the error object was already initialized with a cause chain (e.g., via TVMFFIErrorCreateWithCauseAndExtraContext). This results in data loss and potential memory leaks of the objects previously held in those fields.
If these fields contain garbage due to a layout mismatch or uninitialized memory in the allocator, the root cause should ideally be addressed in the ErrorObj constructor in include/tvm/ffi/error.h. This would ensure all ErrorObj instances are correctly initialized upon creation, regardless of how they are subsequently used, and would avoid the need for manual zeroing in every call site.
| details::ObjectUnsafe::TVMFFIObjectPtrFromObjectRef(error)); | ||
| cell->cause_chain = nullptr; | ||
| cell->extra_context = nullptr; | ||
| last_error_ = details::ObjectUnsafe::ObjectPtrFromObjectRef<ErrorObj>(std::move(error)); |
There was a problem hiding this comment.
The fix applied here should also be extended to SetRaisedByCstrParts (lines 56-73) to ensure consistency across all error-raising methods in SafeCallContext. Currently, SetRaisedByCstrParts remains vulnerable to the same SIGBUS issue described in the PR as it also creates a new Error and stores it in last_error_ without zeroing these fields.
|
Warning Gemini encountered an error creating the review. You can try again by commenting |
|
The current fix is not ideal for two reasons:
A more proper fix could be one of:
@tqchen @junrushao would appreciate your thoughts on the right approach here. |
|
i think the latest code https://github.com/apache/tvm-ffi/blob/main/include/tvm/ffi/error.h#L69 already ensures the two fields are set to zero. so updating tvm-ffi to latest should resolve the issue |
Updated findings — this is actually an ABI mismatch, not a missing constructorAfter extensive testing, the root cause is not that the ErrorObj constructor fails to initialize Test matrix
Explanationv0.1.3: When tilelang is compiled against v0.1.3 headers and linked against v0.1.8+ at runtime, What this means for this PRThis POC fix (zeroing cause_chain/extra_context in SafeCallContext) is a workaround That said, zeroing these fields is still a reasonable defensive measure in case any @tqchen @junrushao thoughts on whether we should still land this as a safety net? |
Problem
On macOS arm64 with tvm-ffi >= 0.1.8, when a C++ exception (e.g. InternalError from LOG(FATAL)) is thrown inside a PackedFunc and forwarded through SafeCallContext::SetRaised / SetRaisedByCstr, the process crashes with SIGBUS during cleanup.
Root Cause
SafeCallContext::SetRaised stores an ErrorObj via ObjectPtrFromUnowned. The cause_chain and extra_context fields (added in PR #396 / v0.1.8) may contain stale non-null bytes from reused allocator memory on macOS arm64. When ~ErrorObj() runs, it calls DecRefObjectHandle on these garbage pointer values, causing SIGBUS (EXC_ARM_DA_ALIGN).
Crash address 0x6c616e7265746e49 = ASCII "Internal" (first 8 bytes of "InternalError") being dereferenced as a pointer.
Crash trace:
SimpleObjAllocator::Handler::Deleter_
TVMFFIObjectDecRef
dict_dealloc → BaseException_dealloc → subtype_dealloc
Fix
Explicitly zero cause_chain and extra_context before storing the ErrorObj in TLS, ensuring ~ErrorObj() correctly sees nullptr and short-circuits the DecRefObjectHandle calls.
Tested
macOS arm64, tilelang 0.1.9 release