
Add failure logging and alerting for SSD offloading (#5542)

Open
Frederick-Zhu wants to merge 1 commit into pytorch:main from Frederick-Zhu:export-D97821441

Frederick-Zhu commented Mar 26, 2026

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2509

Add structured log messages with '[SSD Offloading]' and '[Rank N]' prefixes
to expose the top 10 most critical failure points in SSD TBE (SSD-backed
embedding tables using RocksDB) for alerting. Previously, these failures were
either invisible (silent data corruption, silent thread death), caused
unexplained crashes (generic CHECK), or produced mystery hangs. Now every
critical failure produces a searchable, structured log message.
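
For illustration only, here is a minimal sketch of how an alerting rule might match these messages. The regular expression and the `parse_ssd_log` helper are hypothetical and not part of this PR; the sketch assumes the `[Rank N]` prefix, when present, immediately follows `[SSD Offloading]`.

```python
import re

# Assumed layout: "[SSD Offloading]" optionally followed by "[Rank N]",
# then a single space and the free-form message.
SSD_LOG = re.compile(
    r"\[SSD Offloading\](?: \[Rank (?P<rank>\d+)\])? (?P<message>.*)"
)


def parse_ssd_log(line):
    """Return (rank, message) for an SSD offloading log line, or None.

    rank is None when the line carries no [Rank N] prefix.
    """
    match = SSD_LOG.search(line)
    if match is None:
        return None
    rank = match.group("rank")
    return (int(rank) if rank is not None else None, match.group("message"))
```

A log pipeline could call `parse_ssd_log` on each line and route any non-`None` result to an alert, grouped by rank.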

Failure Points Addressed

C++ (3 files)

  1. Background worker thread try/catch — prevents silent thread death
    that causes training to hang forever in wait_util_filling_work_done().
    Logs: [SSD Offloading] Background worker thread caught exception: <msg>

  2. RocksDB WriteBatch failure logging — structured error before CHECK
    crash so disk-full/IO-error is visible in logs.
    Logs: [SSD Offloading] RocksDB WriteBatch FAILED on shard N: <status>

  3. Flush/CompactRange return value checking — 4 sites where failures
    were silently discarded now log errors.
    Logs: [SSD Offloading] Flush failed: <status>

  4. Queue depth warning — alerts when background write queue exceeds
    1000 items, indicating consumer thread is falling behind or dead.
    Logs: [SSD Offloading] Background write queue depth is N (>1000)

  5. DB Open failure logging — structured error before CHECK crash with
    possible cause hints (disk full, permissions, corrupted DB).
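
The actual changes for items 1 and 4 are in C++. The following is only a Python analog of the pattern, written under assumptions: `BackgroundWriter`, `write_fn`, `error_count`, and `wait_until_done` are hypothetical names, and the description does not say whether the real worker keeps running after an exception (this sketch does). The point it illustrates is that the worker catches and logs the exception and still marks the item as done, so a waiter cannot block forever.

```python
import logging
import queue
import threading

logger = logging.getLogger("ssd_offloading_sketch")
PREFIX = "[SSD Offloading]"
QUEUE_DEPTH_WARN = 1000  # threshold taken from the PR description


class BackgroundWriter:
    """Hypothetical Python analog of the C++ background write worker."""

    def __init__(self, write_fn):
        self._write_fn = write_fn
        self._queue = queue.Queue()
        self.error_count = 0  # loosely mirrors bg_thread_error_count_
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, item):
        # Warn when the consumer appears to be falling behind or dead.
        depth = self._queue.qsize()
        if depth > QUEUE_DEPTH_WARN:
            logger.warning(
                "%s Background write queue depth is %d (>%d)",
                PREFIX, depth, QUEUE_DEPTH_WARN,
            )
        self._queue.put(item)

    def _run(self):
        while True:
            item = self._queue.get()
            try:
                if item is None:  # sentinel: stop the worker
                    return
                self._write_fn(item)
            except Exception as exc:
                # Without this handler the thread would die silently.
                self.error_count += 1
                logger.error(
                    "%s Background worker thread caught exception: %s",
                    PREFIX, exc,
                )
            finally:
                # Always mark the item done so wait_until_done() returns.
                self._queue.task_done()

    def wait_until_done(self):
        self._queue.join()

    def close(self):
        self._queue.put(None)
        self._thread.join()
```

Calling `task_done()` in the `finally` clause is the design choice that matters here: a failed write is logged and counted, but it never leaves the waiter stuck.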

Python (1 file)

  1. Background _insert_all_kv exception capture — stores error in
    _lazy_init_error, raises RuntimeError on ssd_db property access.
    Prevents silent data corruption from swallowed thread exceptions.
    The error check is hoisted outside the thread-join block so that the
    error is raised on every later access, not only the first.
    toggle_compaction re-enabled via finally block.

  2. try/except for ALL stats reporter call sites — 8 previously
    unprotected _report_* methods (including _report_duration) now catch
    exceptions to prevent monitoring bugs from crashing training.
    Logs: [SSD Offloading] Failed to report <category>: <error>

  3. Cross-rank failure visibility — [Rank N] prefix on flush/checkpoint
    logs so non-rank-0 failures are identifiable in multi-GPU jobs.

  4. Disk space monitoring — periodic os.statvfs check on ALL
    comma-separated SSD paths with graduated warnings at 90% and 95%.
    Logs: [SSD Offloading] [Rank N] CRITICAL: SSD disk usage at 96.3%

  5. Checkpoint flush failure logging — structured try/except with
    [SSD Offloading] [Rank N] prefix before re-raise.
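
As a rough sketch of the disk space check in item 4 (not the PR's code): the function names, the usage formula based on `f_blocks` and `f_bavail`, the `>=` comparisons, the `WARNING` wording, and the trailing path in the message are all assumptions. Only `os.statvfs`, the 90% and 95% thresholds, and the `CRITICAL` message prefix come from the description. `os.statvfs` is available on Unix only.

```python
import os

WARN_PCT = 90.0      # thresholds from the PR description
CRITICAL_PCT = 95.0


def disk_usage_percent(path, statvfs=os.statvfs):
    """Percent of the filesystem at `path` in use (assumed formula)."""
    st = statvfs(path)
    if st.f_blocks == 0:
        return 0.0
    return (st.f_blocks - st.f_bavail) / st.f_blocks * 100.0


def check_ssd_paths(paths_csv, rank, statvfs=os.statvfs):
    """Return one alert message per path at or above a threshold.

    `paths_csv` is a comma-separated list of SSD directories; every
    entry is checked, not just the first.
    """
    alerts = []
    for path in (p.strip() for p in paths_csv.split(",")):
        if not path:
            continue
        used = disk_usage_percent(path, statvfs)
        if used >= CRITICAL_PCT:
            level = "CRITICAL"
        elif used >= WARN_PCT:
            level = "WARNING"
        else:
            continue
        alerts.append(
            f"[SSD Offloading] [Rank {rank}] {level}: "
            f"SSD disk usage at {used:.1f}% ({path})"
        )
    return alerts
```

The `statvfs` parameter exists only so the check can be exercised with a fake in tests; production code would use the default.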

Tests

C++ (4 new tests in ssd_table_batched_embeddings_test.cpp)

  • TestBackgroundThreadErrorCountInit — verifies bg_thread_error_count_
    starts at 0, normal operations don't trigger errors
  • TestFlushAndCompactWithoutCrash — exercises flush/compact with new
    return value checking, verifies no crash
  • TestQueueDepthWarningPath — rapid writes to exercise queue depth
    tracking, verifies bg thread stays healthy
  • TestCompactionAfterMultipleFlushes — write/flush/compact cycle,
    verifies data integrity via read-back

Python (6 new tests in ssd_alerting_test.py)

  • test_reporter_exception_does_not_crash — CrashingStatsReporter
    verifies exceptions in reporter don't kill training
  • test_reporter_exception_logged_with_prefix — verifies [SSD Offloading]
    prefix in caught exception logs
  • test_disk_space_check_runs — verifies os.statvfs check executes
    without crash in _report_kv_backend_stats
  • test_flush_logging — verifies [SSD Offloading] [Rank N] prefix in
    flush logs
  • test_lazy_init_error_propagated — verifies _lazy_init_error raises
    RuntimeError with [SSD Offloading] prefix on ssd_db access
  • test_all_report_methods_handle_exceptions — calls all _report_*
    methods to verify try/except coverage
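
To make the reporter tests concrete, here is a hedged sketch of what a test in the style of test_reporter_exception_logged_with_prefix could look like. It is not taken from ssd_alerting_test.py: the `report_safely` guard, the stub's `report_duration` signature, and the logger name are hypothetical. Only the `CrashingStatsReporter` name and the log message format come from the description.

```python
import logging
import unittest

logger = logging.getLogger("ssd_offloading_sketch")


class CrashingStatsReporter:
    """Stub reporter whose call always raises (signature is assumed)."""

    def report_duration(self, *args, **kwargs):
        raise RuntimeError("reporter is broken")


def report_safely(category, report_fn, *args, **kwargs):
    """Hypothetical guard: call report_fn, log and swallow any exception."""
    try:
        report_fn(*args, **kwargs)
        return True
    except Exception as exc:
        logger.error("[SSD Offloading] Failed to report %s: %s", category, exc)
        return False


class ReporterGuardTest(unittest.TestCase):
    def test_reporter_exception_logged_with_prefix(self):
        reporter = CrashingStatsReporter()
        # assertLogs fails the test if nothing is logged at ERROR level.
        with self.assertLogs(logger, level="ERROR") as captured:
            ok = report_safely("duration", reporter.report_duration, 1, "fwd", 0.5)
        self.assertFalse(ok)  # the exception was swallowed, not re-raised
        self.assertIn(
            "[SSD Offloading] Failed to report duration", captured.output[0]
        )
```

The test asserts both halves of the contract: the monitoring failure does not propagate, and it leaves a searchable log line behind.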

Differential Revision: D97821441

meta-codesync bot commented Mar 26, 2026

@Frederick-Zhu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D97821441.

meta-codesync bot changed the title from "Add failure logging and alerting for SSD offloading" to "Add failure logging and alerting for SSD offloading (#5542)" on Apr 3, 2026