Add failure logging and alerting for SSD offloading (#5542)
Open
Frederick-Zhu wants to merge 1 commit into pytorch:main
@Frederick-Zhu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D97821441.
Summary:

X-link: facebookresearch/FBGEMM#2509

Add structured log messages with '[SSD Offloading]' and '[Rank N]' prefixes to expose the top 10 most critical failure points in SSD TBE (SSD-backed embedding tables using RocksDB) for alerting. Previously, these failures were either invisible (silent data corruption, silent thread death), caused unexplained crashes (generic CHECK), or produced mystery hangs. Now every critical failure produces a searchable, structured log message.

## Failure Points Addressed

### C++ (3 files)

1. **Background worker thread try/catch**: prevents silent thread death that causes training to hang forever in wait_util_filling_work_done().
   Logs: `[SSD Offloading] Background worker thread caught exception: <msg>`
2. **RocksDB WriteBatch failure logging**: structured error before CHECK crash so disk-full/IO-error is visible in logs.
   Logs: `[SSD Offloading] RocksDB WriteBatch FAILED on shard N: <status>`
3. **Flush/CompactRange return value checking**: 4 sites where failures were silently discarded now log errors.
   Logs: `[SSD Offloading] Flush failed: <status>`
4. **Queue depth warning**: alerts when background write queue exceeds 1000 items, indicating consumer thread is falling behind or dead.
   Logs: `[SSD Offloading] Background write queue depth is N (>1000)`
5. **DB Open failure logging**: structured error before CHECK crash with possible cause hints (disk full, permissions, corrupted DB).

### Python (1 file)

6. **Background _insert_all_kv exception capture**: stores error in _lazy_init_error, raises RuntimeError on ssd_db property access. Prevents silent data corruption from swallowed thread exceptions. Error check hoisted outside thread-join block for permanent raise. toggle_compaction re-enabled via finally block.
7. **try/except for ALL stats reporter call sites**: 8 previously unprotected _report_* methods (including _report_duration) now catch exceptions to prevent monitoring bugs from crashing training.
   Logs: `[SSD Offloading] Failed to report <category>: <error>`
8. **Cross-rank failure visibility**: [Rank N] prefix on flush/checkpoint logs so non-rank-0 failures are identifiable in multi-GPU jobs.
9. **Disk space monitoring**: periodic os.statvfs check on ALL comma-separated SSD paths with graduated warnings at 90% and 95%.
   Logs: `[SSD Offloading] [Rank N] CRITICAL: SSD disk usage at 96.3%`
10. **Checkpoint flush failure logging**: structured try/except with [SSD Offloading] [Rank N] prefix before re-raise.

## Tests

### C++ (4 new tests in ssd_table_batched_embeddings_test.cpp)

- `TestBackgroundThreadErrorCountInit`: verifies bg_thread_error_count_ starts at 0, normal operations don't trigger errors
- `TestFlushAndCompactWithoutCrash`: exercises flush/compact with new return value checking, verifies no crash
- `TestQueueDepthWarningPath`: rapid writes to exercise queue depth tracking, verifies bg thread stays healthy
- `TestCompactionAfterMultipleFlushes`: write/flush/compact cycle, verifies data integrity via read-back

### Python (6 new tests in ssd_alerting_test.py)

- `test_reporter_exception_does_not_crash`: CrashingStatsReporter verifies exceptions in reporter don't kill training
- `test_reporter_exception_logged_with_prefix`: verifies [SSD Offloading] prefix in caught exception logs
- `test_disk_space_check_runs`: verifies os.statvfs check executes without crash in _report_kv_backend_stats
- `test_flush_logging`: verifies [SSD Offloading] [Rank N] prefix in flush logs
- `test_lazy_init_error_propagated`: verifies _lazy_init_error raises RuntimeError with [SSD Offloading] prefix on ssd_db access
- `test_all_report_methods_handle_exceptions`: calls all _report_* methods to verify try/except coverage

Differential Revision: D97821441
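For reviewers, the exception-capture pattern described for the background `_insert_all_kv` thread (item 6) can be illustrated with a minimal sketch. This is not the code in the diff: the class `LazyInitTable`, the `insert_fn` callback, the `_compaction_enabled` flag standing in for `toggle_compaction`, and the exact error message text after the `[SSD Offloading]` prefix are all assumptions made for illustration. Only `_lazy_init_error`, the `ssd_db` property, and the `RuntimeError` come from the summary.

```python
import threading
from typing import Callable, Optional


class LazyInitTable:
    """Hypothetical stand-in for the SSD TBE module; shows only the error-capture pattern."""

    def __init__(self, insert_fn: Callable[[], None]) -> None:
        self._ssd_db = object()  # placeholder for the real DB handle
        self._lazy_init_error: Optional[BaseException] = None
        self._compaction_enabled = True  # stand-in for the toggle_compaction state
        self._lazy_init_thread: Optional[threading.Thread] = threading.Thread(
            target=self._run_insert, args=(insert_fn,), daemon=True
        )
        self._lazy_init_thread.start()

    def _run_insert(self, insert_fn: Callable[[], None]) -> None:
        self._compaction_enabled = False  # compaction is off during bulk insert
        try:
            insert_fn()
        except Exception as e:
            # Store the error instead of letting the thread die silently.
            self._lazy_init_error = e
        finally:
            # Re-enable compaction whether or not the insert succeeded.
            self._compaction_enabled = True

    @property
    def ssd_db(self) -> object:
        thread = self._lazy_init_thread
        if thread is not None:
            thread.join()
            self._lazy_init_thread = None
        # Checked outside the join block, so every access raises, not only the first.
        if self._lazy_init_error is not None:
            raise RuntimeError(
                f"[SSD Offloading] Background initialization failed: {self._lazy_init_error}"
            ) from self._lazy_init_error
        return self._ssd_db
```

Keeping the error check outside the join block is what makes the failure permanent: once the thread has been joined and cleared, later accesses still see the stored exception rather than a half-initialized handle.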
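The stats-reporter protection in item 7 could look roughly like the sketch below. The summary describes try/except blocks at the call sites; this sketch uses a decorator instead, purely to keep the example short, so it should be read as an equivalent illustration rather than the approach taken in the diff. The names `guarded_report` and `Table`, the logger name, and the reporter method signature are hypothetical; the log format and the `CrashingStatsReporter` name follow the summary.

```python
import functools
import logging

logger = logging.getLogger("ssd_offloading_sketch")  # hypothetical logger name


def guarded_report(category: str):
    """Wrap a _report_* method so a reporter failure is logged, not raised."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception as e:
                logger.error("[SSD Offloading] Failed to report %s: %s", category, e)
                return None

        return wrapper

    return decorator


class CrashingStatsReporter:
    """Reporter whose methods always fail, as used by the new tests."""

    def report_duration(self, *args, **kwargs):
        raise RuntimeError("reporter backend unavailable")


class Table:
    """Hypothetical module that owns a stats reporter."""

    def __init__(self, reporter) -> None:
        self.stats_reporter = reporter

    @guarded_report("duration")
    def _report_duration(self, iteration: int, name: str, duration_ms: float) -> None:
        self.stats_reporter.report_duration(iteration, name, duration_ms)
```

With this in place, `Table(CrashingStatsReporter())._report_duration(1, "fwd", 2.0)` returns `None` and emits one error log line instead of propagating the exception into the training loop.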
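The disk space check in item 9 can be sketched as follows, assuming a Unix host (`os.statvfs` is not available on Windows). The function names, the logger name, the use of `>=` at the 90% and 95% thresholds, the WARNING message wording, the trailing path in the log line, and the handling of unreadable paths are assumptions; the summary only specifies the `os.statvfs` call, the comma-separated path list, the two thresholds, and the CRITICAL log format.

```python
import logging
import os
from typing import List, Optional, Tuple

logger = logging.getLogger("ssd_offloading_sketch")  # hypothetical logger name

WARN_PCT = 90.0
CRITICAL_PCT = 95.0


def classify_usage(used_pct: float) -> Optional[str]:
    """Map a usage percentage to a severity label, or None when below both thresholds."""
    if used_pct >= CRITICAL_PCT:
        return "CRITICAL"
    if used_pct >= WARN_PCT:
        return "WARNING"
    return None


def disk_usage_pct(path: str) -> float:
    """Percentage of the filesystem holding `path` that is unavailable to this user."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    if total == 0:
        return 0.0
    free = st.f_bavail * st.f_frsize
    return 100.0 * (total - free) / total


def check_ssd_paths(ssd_paths: str, rank: int) -> List[Tuple[str, float, Optional[str]]]:
    """Check every path in a comma-separated list and log graduated warnings."""
    results = []
    for path in (p.strip() for p in ssd_paths.split(",")):
        if not path:
            continue
        try:
            pct = disk_usage_pct(path)
        except OSError as e:
            # Monitoring must never crash training; log and move on.
            logger.warning("[SSD Offloading] [Rank %d] Could not stat %s: %s", rank, path, e)
            continue
        level = classify_usage(pct)
        if level == "CRITICAL":
            logger.error(
                "[SSD Offloading] [Rank %d] CRITICAL: SSD disk usage at %.1f%% (%s)",
                rank, pct, path,
            )
        elif level == "WARNING":
            logger.warning(
                "[SSD Offloading] [Rank %d] WARNING: SSD disk usage at %.1f%% (%s)",
                rank, pct, path,
            )
        results.append((path, pct, level))
    return results
```

Checking every path rather than only the first matters when shards are spread across several SSDs, since a single full device is enough to make RocksDB writes fail.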
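Items 8 and 10 combine a `[Rank N]` prefix with a log-then-re-raise around the checkpoint flush. A minimal sketch of that shape is shown here; the helper name `flush_with_rank_logging`, the `flush_fn` callback, the logger name, and the message wording after the prefixes are hypothetical and not taken from the diff.

```python
import logging
from typing import Callable

logger = logging.getLogger("ssd_offloading_sketch")  # hypothetical logger name


def flush_with_rank_logging(flush_fn: Callable[[], None], rank: int) -> None:
    """Run a checkpoint flush; on failure, log with rank context and re-raise."""
    try:
        flush_fn()
    except Exception as e:
        logger.error("[SSD Offloading] [Rank %d] Checkpoint flush failed: %s", rank, e)
        raise  # the caller still sees the original exception
```

The bare `raise` preserves the original exception type and traceback, so existing error handling upstream is unaffected; the only change is that the failing rank is now identifiable in aggregated logs.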