tensorboard_writer: chief-rank uploader for Vertex AI TB #622
Draft
kmontemayor2-sc wants to merge 2 commits into main
Adds two optional fields on ``VertexAiResourceConfig`` for opting into Vertex AI TensorBoard. ``tensorboard_resource_name`` points at an existing ``Tensorboard`` resource; ``tensorboard_experiment_name`` is the user-chosen ``TensorboardExperiment`` ID under that resource. Multiple jobs sharing this name surface as comparable runs on the same TB page.

The fields must be set together (or both unset). The validation rule is not enforced yet (it lands in a follow-up PR); this commit only adds the proto fields and regenerates the Python and Scala stubs.

Also expands the docstring on ``TrainedModelMetadata.tensorboard_logs_uri`` to document its mapping to ``AIP_TENSORBOARD_LOG_DIR`` via ``CustomJobSpec.baseOutputDirectory``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
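The "set together or both unset" rule is described above but not enforced in this commit. As a rough illustration only, here is a minimal sketch of what such a check could look like. It uses a plain dataclass as a stand-in for the generated proto message; `VertexAiResourceConfigStandIn` and `check_tensorboard_fields` are hypothetical names, not part of GiGL, and the follow-up PR may implement the rule differently.

```python
from dataclasses import dataclass


@dataclass
class VertexAiResourceConfigStandIn:
    """Hypothetical stand-in for the proto message; models only the two new fields."""

    tensorboard_resource_name: str = ""
    tensorboard_experiment_name: str = ""


def check_tensorboard_fields(config: VertexAiResourceConfigStandIn) -> None:
    """Raise ValueError when exactly one of the two TensorBoard fields is set."""
    has_resource = bool(config.tensorboard_resource_name)
    has_experiment = bool(config.tensorboard_experiment_name)
    # Valid states: both set (opted in) or both empty (opted out).
    if has_resource != has_experiment:
        raise ValueError(
            "tensorboard_resource_name and tensorboard_experiment_name "
            "must be set together or both left unset"
        )
```

An empty string is treated as "unset" here because proto3 string fields default to empty; that is an assumption about how the real validator would interpret the fields.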
Adds ``gigl.utils.tensorboard_writer.TensorBoardWriter``, the trainer-side abstraction that writes TF event files locally and (on the chief rank only) starts an ``aiplatform.start_upload_tb_log`` background uploader streaming events to a stable, user-named ``TensorboardExperiment``.

Key design points:

- ``TensorBoardWriter.from_env(enabled=is_chief_process)`` returns a real writer on the chief rank and a no-op writer everywhere else, so trainer entry points can share the same call sites across ranks.
- Reads ``AIP_TENSORBOARD_LOG_DIR`` (Vertex's contract, set when ``baseOutputDirectory`` is configured) plus three GiGL env vars (``GIGL_TENSORBOARD_RESOURCE_NAME``, ``EXPERIMENT_NAME``, ``RUN_NAME``) injected by the launcher.
- Writes events to ``<AIP_TENSORBOARD_LOG_DIR>/<run_name>/`` so the SDK's ``LogdirLoader`` discovers each subdir as a distinct ``TensorboardRun`` instead of merging into the SDK's hardcoded ``DEFAULT_RUN_NAME = "default"``.
- Logs the named-experiment URL on uploader start so engineers can find the comparison page from trainer stdout.
- Paired ``aiplatform.end_upload_tb_log()`` on close; idempotent.

This PR introduces the writer with full test coverage but no callers; the example trainers/inferencers wire it in subsequent PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
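To make the chief/no-op split and the per-run directory layout concrete, here is a small stdlib-only sketch of the pattern. It is not the PR's implementation. The class and function names are hypothetical, the env var name `GIGL_TENSORBOARD_RUN_NAME` is an assumed expansion of the abbreviated `RUN_NAME` above, and falling back to a no-op writer when env vars are missing is also an assumption. The uploader is injected as two callables; in the real writer these would correspond to `aiplatform.start_upload_tb_log` and `aiplatform.end_upload_tb_log`.

```python
import os
from typing import Callable, Mapping, Optional, Union


class NoOpWriter:
    """Used on non-chief ranks: same interface, no side effects."""

    run_logdir: Optional[str] = None

    def close(self) -> None:
        pass


class ChiefWriter:
    """Used on the chief rank: owns the run directory and uploader lifecycle."""

    def __init__(
        self,
        base_logdir: str,
        run_name: str,
        start_uploader: Callable[[str], None],
        stop_uploader: Callable[[], None],
    ) -> None:
        # Events go under <base>/<run_name>/ so each run is a distinct subdir.
        self.run_logdir = f"{base_logdir}/{run_name}"
        self._stop_uploader = stop_uploader
        self._closed = False
        # The uploader watches the base dir and discovers run subdirs.
        start_uploader(base_logdir)

    def close(self) -> None:
        # Idempotent: the uploader is stopped at most once.
        if self._closed:
            return
        self._closed = True
        self._stop_uploader()


def writer_from_env(
    enabled: bool,
    start_uploader: Callable[[str], None],
    stop_uploader: Callable[[], None],
    env: Mapping[str, str] = os.environ,
) -> Union[ChiefWriter, NoOpWriter]:
    """Return a real writer when enabled and configured, else a no-op."""
    base = env.get("AIP_TENSORBOARD_LOG_DIR", "").rstrip("/")
    run_name = env.get("GIGL_TENSORBOARD_RUN_NAME", "")  # assumed name
    if not enabled or not base or not run_name:
        return NoOpWriter()
    return ChiefWriter(base, run_name, start_uploader, stop_uploader)
```

Because both writer types expose the same `close()` method, a trainer entry point could call `writer_from_env(enabled=is_chief_process, ...)` on every rank and use the result without branching.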
Collaborator, Author: /all_test

GiGL Automation (Contributor) @ 20:32:08 UTC: 🔄 @ 21:33:00 UTC: ✅ Workflow completed successfully.
GiGL Automation (Contributor) @ 20:32:08 UTC: 🔄 @ 20:40:26 UTC: ✅ Workflow completed successfully.
GiGL Automation (Contributor) @ 20:32:09 UTC: 🔄 @ 20:35:38 UTC: ✅ Workflow completed successfully.
GiGL Automation (Contributor) @ 20:32:11 UTC: 🔄 @ 21:57:02 UTC: ✅ Workflow completed successfully.
GiGL Automation (Contributor) @ 20:32:12 UTC: 🔄 @ 20:39:48 UTC: ✅ Workflow completed successfully.
GiGL Automation (Contributor) @ 20:32:15 UTC: 🔄 @ 21:46:54 UTC: ✅ Workflow completed successfully.