Add failure logging and alerting for SSD offloading (#5542) by Frederick-Zhu · Pull Request #5542 · pytorch/FBGEMM

Frederick-Zhu · 2026-03-26T21:10:59Z

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2509

Add structured log messages with '[SSD Offloading]' and '[Rank N]' prefixes
to expose the top 10 most critical failure points in SSD TBE (SSD-backed
embedding tables using RocksDB) for alerting. Previously, these failures were
either invisible (silent data corruption, silent thread death), caused
unexplained crashes (generic CHECK), or produced mystery hangs. Now every
critical failure produces a searchable, structured log message.

Failure Points Addressed

C++ (3 files)

1. Background worker thread try/catch: prevents silent thread death
   that causes training to hang forever in wait_util_filling_work_done().
   Logs: `[SSD Offloading] Background worker thread caught exception: <msg>`
2. RocksDB WriteBatch failure logging: structured error before CHECK
   crash so disk-full/IO-error is visible in logs.
   Logs: `[SSD Offloading] RocksDB WriteBatch FAILED on shard N: <status>`
3. Flush/CompactRange return value checking: 4 sites where failures
   were silently discarded now log errors.
   Logs: `[SSD Offloading] Flush failed: <status>`
4. Queue depth warning: alerts when background write queue exceeds
   1000 items, indicating consumer thread is falling behind or dead.
   Logs: `[SSD Offloading] Background write queue depth is N (>1000)`
5. DB Open failure logging: structured error before CHECK crash with
   possible cause hints (disk full, permissions, corrupted DB).
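
The worker-thread and queue-depth items above are C++ changes; the following is only a Python analogue of the pattern, not the PR's code. It assumes a hypothetical `BackgroundWriter` class, a hypothetical logger name, and an `error_count` field that loosely mirrors `bg_thread_error_count_`. The two log messages and the 1000-item threshold are taken from the description above.

```python
import logging
import queue
import threading

logger = logging.getLogger("ssd_offloading_sketch")  # hypothetical logger name
QUEUE_DEPTH_WARN = 1000  # threshold quoted in the PR description


class BackgroundWriter:
    """Hypothetical stand-in for the C++ background write worker."""

    def __init__(self, write_fn):
        self._write_fn = write_fn
        self._queue = queue.Queue()
        self.error_count = 0  # loosely mirrors bg_thread_error_count_
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def enqueue(self, item):
        self._queue.put(item)
        depth = self._queue.qsize()  # approximate, which is enough for alerting
        if depth > QUEUE_DEPTH_WARN:
            logger.warning(
                "[SSD Offloading] Background write queue depth is %d (>%d)",
                depth,
                QUEUE_DEPTH_WARN,
            )

    def _run(self):
        while True:
            item = self._queue.get()
            try:
                if item is None:  # sentinel: stop the worker
                    return
                self._write_fn(item)
            except Exception as exc:
                # Log and keep looping, so one bad item cannot kill the thread
                # and leave wait_until_done() blocked forever.
                self.error_count += 1
                logger.error(
                    "[SSD Offloading] Background worker thread caught exception: %s",
                    exc,
                )
            finally:
                self._queue.task_done()

    def wait_until_done(self):
        self._queue.join()

    def close(self):
        self._queue.put(None)
        self._thread.join()


written = []


def flaky_write(item):
    # Simulated write that fails for one item, e.g. on a full disk.
    if item == "bad":
        raise OSError("disk full")
    written.append(item)


writer = BackgroundWriter(flaky_write)
for item in ["a", "bad", "b"]:
    writer.enqueue(item)
writer.wait_until_done()
writer.close()
```

Because `task_done()` runs in a `finally` block, the failed item is still counted as processed, so the waiting caller is released instead of hanging.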

Python (1 file)

6. Background _insert_all_kv exception capture: stores the error in
   _lazy_init_error and raises RuntimeError on ssd_db property access.
   Prevents silent data corruption from swallowed thread exceptions.
   The error check is hoisted outside the thread-join block so that the
   raise is permanent, and toggle_compaction is re-enabled via a finally block.
7. try/except for ALL stats reporter call sites: 8 previously
   unprotected _report_* methods (including _report_duration) now catch
   exceptions to prevent monitoring bugs from crashing training.
   Logs: `[SSD Offloading] Failed to report <category>: <error>`
8. Cross-rank failure visibility: [Rank N] prefix on flush/checkpoint
   logs so non-rank-0 failures are identifiable in multi-GPU jobs.
9. Disk space monitoring: periodic os.statvfs check on ALL
   comma-separated SSD paths with graduated warnings at 90% and 95%.
   Logs: `[SSD Offloading] [Rank N] CRITICAL: SSD disk usage at 96.3%`
10. Checkpoint flush failure logging: structured try/except with
    [SSD Offloading] [Rank N] prefix before re-raise.
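
The disk space monitoring item can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the helper names (`disk_usage_pct`, `classify`, `check_ssd_paths`), the logger name, the use of `>=` at the thresholds, the WARNING message wording, the path suffix on the messages, and the stat-failure message are all assumptions. Only the `os.statvfs` call, the comma-separated paths, the 90%/95% thresholds, and the CRITICAL message prefix come from the description. `os.statvfs` is available on Unix only.

```python
import logging
import os
import tempfile

logger = logging.getLogger("ssd_offloading_sketch")  # hypothetical logger name

WARN_PCT = 90.0      # thresholds quoted in the PR description
CRITICAL_PCT = 95.0


def disk_usage_pct(path):
    """Return the used percentage of the filesystem that holds `path`."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    if total == 0:
        return 0.0
    free = st.f_bavail * st.f_frsize  # space available to unprivileged users
    return 100.0 * (total - free) / total


def classify(pct):
    """Map a usage percentage to a severity label (None when healthy)."""
    if pct >= CRITICAL_PCT:
        return "CRITICAL"
    if pct >= WARN_PCT:
        return "WARNING"
    return None


def check_ssd_paths(paths_csv, rank):
    """Check every comma-separated path and log graduated warnings."""
    results = {}
    for path in (p.strip() for p in paths_csv.split(",")):
        if not path:
            continue
        try:
            pct = disk_usage_pct(path)
        except OSError as exc:
            # Hypothetical message: a monitoring failure must not crash training.
            logger.error(
                "[SSD Offloading] [Rank %d] Failed to stat %s: %s", rank, path, exc
            )
            continue
        level = classify(pct)
        results[path] = level
        if level == "CRITICAL":
            # The " on <path>" suffix is an addition for this sketch.
            logger.critical(
                "[SSD Offloading] [Rank %d] CRITICAL: SSD disk usage at %.1f%% on %s",
                rank, pct, path,
            )
        elif level == "WARNING":
            logger.warning(
                "[SSD Offloading] [Rank %d] WARNING: SSD disk usage at %.1f%% on %s",
                rank, pct, path,
            )
    return results


# Usage: check two locations that exist on any Unix host.
status = check_ssd_paths(f"{tempfile.gettempdir()}, .", rank=0)
```

Keeping the classification in a separate pure function makes the thresholds testable without filling a disk.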

Tests

C++ (4 new tests in ssd_table_batched_embeddings_test.cpp)

- TestBackgroundThreadErrorCountInit: verifies bg_thread_error_count_
  starts at 0, normal operations don't trigger errors
- TestFlushAndCompactWithoutCrash: exercises flush/compact with new
  return value checking, verifies no crash
- TestQueueDepthWarningPath: rapid writes to exercise queue depth
  tracking, verifies bg thread stays healthy
- TestCompactionAfterMultipleFlushes: write/flush/compact cycle,
  verifies data integrity via read-back

Python (6 new tests in ssd_alerting_test.py)

- test_reporter_exception_does_not_crash: uses CrashingStatsReporter
  to verify that exceptions in the reporter don't kill training
- test_reporter_exception_logged_with_prefix: verifies [SSD Offloading]
  prefix in caught exception logs
- test_disk_space_check_runs: verifies os.statvfs check executes
  without crash in _report_kv_backend_stats
- test_flush_logging: verifies [SSD Offloading] [Rank N] prefix in
  flush logs
- test_lazy_init_error_propagated: verifies _lazy_init_error raises
  RuntimeError with [SSD Offloading] prefix on ssd_db access
- test_all_report_methods_handle_exceptions: calls all _report_*
  methods to verify try/except coverage
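
As a rough illustration of what the reporter tests check, the sketch below guards a reporter call and asserts on the logged prefix. It is not the code in ssd_alerting_test.py: the `safe_report` helper, the logger name, the log level, and this `CrashingStatsReporter` body are hypothetical. Only the class name and the `Failed to report <category>: <error>` message come from the description.

```python
import logging
import unittest

logger = logging.getLogger("ssd_offloading_sketch")  # hypothetical logger name


class CrashingStatsReporter:
    """Reporter whose call always fails, standing in for a buggy monitoring backend."""

    def report_duration(self, *args, **kwargs):
        raise RuntimeError("reporter backend unavailable")


def safe_report(category, report_fn, *args, **kwargs):
    """Call a reporter method; log instead of raising if it fails (hypothetical helper)."""
    try:
        report_fn(*args, **kwargs)
        return True
    except Exception as exc:
        logger.warning("[SSD Offloading] Failed to report %s: %s", category, exc)
        return False


class ReporterGuardTest(unittest.TestCase):
    def test_reporter_exception_is_logged_not_raised(self):
        reporter = CrashingStatsReporter()
        with self.assertLogs(logger, level="WARNING") as captured:
            ok = safe_report("duration", reporter.report_duration, 1, "event", 0.5)
        self.assertFalse(ok)
        self.assertIn(
            "[SSD Offloading] Failed to report duration", captured.output[0]
        )


# Run the test case programmatically so the sketch works as a plain script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ReporterGuardTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Asserting on the prefix rather than the full message keeps such a test stable if the error text changes.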

Differential Revision: D97821441

meta-codesync · 2026-03-26T21:11:06Z

@Frederick-Zhu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D97821441.


meta-cla bot added the cla signed label Mar 26, 2026

meta-codesync bot added fb-exported meta-exported labels Mar 26, 2026

meta-codesync bot changed the title ~~Add failure logging and alerting for SSD offloading~~ Add failure logging and alerting for SSD offloading (#5542) Apr 3, 2026

Frederick-Zhu force-pushed the export-D97821441 branch from 962e5a4 to c19744b Compare April 3, 2026 21:37

Frederick-Zhu wants to merge 1 commit into pytorch:main from Frederick-Zhu:export-D97821441
