Kmonte/add tb for glt #603
Draft
kmontemayor2-sc wants to merge 53 commits into main from
Conversation
Replace tests for the four-function API with tests for the new TensorBoardWriter class. Tests fail with ImportError until the class lands in the next commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code review found that test_close_on_noop_writer_does_not_raise only called close() once, so idempotency on the no-op path was untested. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
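A minimal sketch of the strengthened test, assuming the TensorBoardWriter API described in the next commit (from_uri returning a no-op writer for a None URI); the module path is taken from a later relocation commit on this branch:

```python
import unittest

# Path from the later relocation commit; the API shape (from_uri, close)
# is assumed from the commit messages in this PR.
from gigl.utils.tensorboard_writer import TensorBoardWriter


class NoopWriterCloseTest(unittest.TestCase):
    def test_close_on_noop_writer_is_idempotent(self) -> None:
        writer = TensorBoardWriter.from_uri(None)  # no URI -> no-op writer
        writer.close()
        writer.close()  # the second close() must also not raise
```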
Collapse resolve_tensorboard_log_dir, create_tensorboard_writer, write_tensorboard_scalar, close_tensorboard_writer, and VERTEX_TENSORBOARD_LOG_DIR_ENV_KEY into a single TensorBoardWriter class with from_uri classmethod. The class is context-managerable and no-ops when disabled, eliminating Optional plumbing at call sites. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
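A minimal sketch of the consolidated class under the assumptions in this message (from_uri classmethod, context-manager support, no-op when disabled); the backend and exact method signatures are assumptions, not the branch's actual implementation:

```python
from typing import Mapping, Optional

# Backend choice is an assumption; GiGL trains with PyTorch, so the
# torch SummaryWriter is a plausible stand-in here.
from torch.utils.tensorboard import SummaryWriter


class TensorBoardWriter:
    def __init__(self, writer: Optional[SummaryWriter]) -> None:
        # None means "disabled": every method below silently no-ops.
        self._writer = writer

    @classmethod
    def from_uri(cls, uri: Optional[str]) -> "TensorBoardWriter":
        # A missing URI yields a no-op writer, so call sites never need
        # Optional[TensorBoardWriter] plumbing.
        return cls(SummaryWriter(log_dir=uri) if uri else None)

    def log(self, scalars: Mapping[str, float], step: int) -> None:
        if self._writer is not None:
            for tag, value in scalars.items():
                self._writer.add_scalar(tag, value, global_step=step)

    def close(self) -> None:
        if self._writer is not None:
            self._writer.close()
            self._writer = None  # makes close() idempotent

    def __enter__(self) -> "TensorBoardWriter":
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.close()
```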
The V1 BaseTrainer.train body never wrote scalars via TF's ambient default writer, so the file_writer + as_default() block in __run_training served no purpose. Verified with grep: no tf.summary.scalar or write_tensorboard_scalar callers anywhere in gigl/src/training/v1/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace function-based tensorboard helpers with the new TensorBoardWriter class. Collapses two back-to-back scalar writes per log step into a single dict-style log() call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
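An illustrative call-site usage against the sketch above; the log dir, tag names, and metric values are placeholders:

```python
writer = TensorBoardWriter.from_uri("/tmp/tb-logs")  # placeholder log dir
loss, auc, step = 0.42, 0.91, 100                    # placeholder metrics

# Previously two back-to-back write_tensorboard_scalar() calls per log step;
# now one dict-style call.
writer.log({"train/loss": loss, "train/auc": auc}, step=step)
writer.close()
```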
…riter Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nsorBoardWriter Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…TensorBoardWriter Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The default for the proto bool field is false, so removing the line preserves the same behavior. Avoids implying that tensorboard logging should be on by default in example task configs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Relocate from gigl/src/common/utils/tensorboard.py (internal pipeline utilities) to gigl/utils/tensorboard_writer.py (general-purpose user utilities). Update the test file path, the in-test patch paths, and all four example training scripts to import from the new location. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Examples now log to TensorBoard whenever a tensorboard_log_uri is set on the gbml config; no separate boolean to flip. Removes the should_log_to_tensorboard field from TrainingProcessArgs (and its docstring entry, extraction from the gbml config, and pass-through to _training_process) in all four link-prediction example training scripts. TensorBoardWriter.from_uri already returns a no-op writer when the URI is None, so a single signal (the URI) is sufficient. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
V1 and V2 trainer launchers previously gated tensorboard_logs_uri construction on both should_log_to_tensorboard AND a non-empty proto field. Drop the should_log_to_tensorboard half so the URI is forwarded to the Vertex AI launcher whenever the proto field is set. One signal (URI presence) instead of two. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TensorBoardWriter now reads AIP_TENSORBOARD_LOG_DIR directly and raises when unset. Vertex AI populates that env var from CustomJobSpec.baseOutputDirectory, which GiGL's launcher already derives from tensorboardLogsUri — so the prior fallback in _resolve_log_dir was dead weight. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
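A sketch of the strict lookup described here, written as a standalone function for illustration (the branch puts it on the class as from_env); the error-message wording is invented:

```python
import os

from gigl.utils.tensorboard_writer import TensorBoardWriter  # assumed path


def writer_from_env() -> TensorBoardWriter:
    log_dir = os.environ.get("AIP_TENSORBOARD_LOG_DIR")
    if not log_dir:
        # Fail fast: Vertex AI derives this var from
        # CustomJobSpec.baseOutputDirectory, so an unset value means the job
        # was not launched with TensorBoard wiring at all.
        raise ValueError("AIP_TENSORBOARD_LOG_DIR is not set")
    return TensorBoardWriter.from_uri(log_dir)
```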
…ning Now that TensorBoardWriter reads AIP_TENSORBOARD_LOG_DIR directly via from_env(), the example no longer needs to plumb tensorboard_logs_uri from the GbmlConfig through TrainingProcessArgs to the worker process. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that TensorBoardWriter reads AIP_TENSORBOARD_LOG_DIR directly via from_env(), the examples no longer need to plumb tensorboard_logs_uri from the GbmlConfig through TrainingProcessArgs to the worker process. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_name Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… set Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…oard_logs_uri" This reverts commit 99ab56d.
…t_name
Vertex AI MetadataStore Context IDs (which back ExperimentRuns) must match [a-z0-9][a-z0-9-]{0,127}. GiGL's gigl_train_<task>_<timestamp> job names contain underscores and were rejected with a 400 from the SDK. Lowercase and replace underscores with hyphens for the run ID; fail fast with a clear message when the user-supplied tensorboard_experiment_name itself is invalid (don't silently rewrite user input).
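A sketch of the two behaviors described above, under the stated regex; note that the very next commit removes the run-id sanitizer again, so only the validation half survives on the branch:

```python
import re

# Vertex AI MetadataStore Context IDs (which back ExperimentRuns).
_VERTEX_CONTEXT_ID_PATTERN = re.compile(r"^[a-z0-9][a-z0-9-]{0,127}$")


def sanitize_run_id(job_name: str) -> str:
    """Derive a valid run ID from a job name like gigl_train_cora_1700000000."""
    candidate = job_name.lower().replace("_", "-")
    if not _VERTEX_CONTEXT_ID_PATTERN.match(candidate):
        raise ValueError(f"cannot derive a valid Vertex run ID from {job_name!r}")
    return candidate


def validate_experiment_name(name: str) -> None:
    """User input is validated, never silently rewritten."""
    if not _VERTEX_CONTEXT_ID_PATTERN.match(name):
        raise ValueError(
            f"tensorboard_experiment_name {name!r} must match "
            "[a-z0-9][a-z0-9-]{0,127}"
        )
```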
Passing experiment_run=<string> invokes aiplatform.ExperimentRun's getter, which 404s when the run doesn't exist yet (every first submit does). Per the SDK's own contract — "If 'experiment' is set but 'experiment_run' is not, an ExperimentRun resource will still be auto-generated" (jobs.py:2509-2514) — the right move is to set only experiment= and let the SDK create the run. The user-supplied experiment name is still validated up front against Vertex's resource-ID regex; the run-id sanitizer is no longer needed.
…path Vertex's auto-uploader is gated on submit(tensorboard=); submit(experiment=) alone leaves AIP_TENSORBOARD_LOG_DIR populated but never reads from it, so events sit in GCS un-uploaded (job 6570151780682825728 is the smoking gun). The two submit kwargs are mutually exclusive, and tensorboard= forces a job-scoped experiment we can't rename — neither delivers multi-run comparison.

Switch approach: when tensorboard_experiment_name is set, the launcher injects GIGL_TENSORBOARD_RESOURCE_NAME and GIGL_TENSORBOARD_EXPERIMENT_NAME into the worker container, and submits with NEITHER experiment= nor tensorboard=. The trainer's TensorBoardWriter.from_env reads those env vars on the chief rank and starts aiplatform.start_upload_tb_log to stream AIP_TENSORBOARD_LOG_DIR to the named TensorboardExperiment. close() pairs with end_upload_tb_log. Multiple jobs sharing the experiment name show as comparable runs on one TB page.

Drops the now-unused _ensure_experiment_with_backing_tb helper (the SDK uploader auto-creates the TensorboardExperiment).
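A chief-rank sketch of the uploader handshake this message describes; aiplatform.start_upload_tb_log / end_upload_tb_log are real SDK calls, while the project/location values and the resource-name parsing are illustrative:

```python
import os

from google.cloud import aiplatform

# Env vars the launcher injects (names from this commit message), plus the
# log dir Vertex AI itself populates.
tb_resource = os.environ["GIGL_TENSORBOARD_RESOURCE_NAME"]
experiment_name = os.environ["GIGL_TENSORBOARD_EXPERIMENT_NAME"]
log_dir = os.environ["AIP_TENSORBOARD_LOG_DIR"]

aiplatform.init(project="my-project", location="us-central1")  # illustrative
aiplatform.start_upload_tb_log(
    # Numeric tail of projects/<p>/locations/<l>/tensorboards/<id>.
    tensorboard_id=tb_resource.rsplit("/", 1)[-1],
    tensorboard_experiment_name=experiment_name,
    logdir=log_dir,
)
try:
    pass  # training loop writes event files under log_dir
finally:
    aiplatform.end_upload_tb_log()  # what close() pairs with
```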
The "Open TensorBoard" link on the Vertex AI job page is gated on jobSpec.tensorboard. Dropping that field (which the previous experiment-name branch did) hides the link entirely. Make _submit_job set tensorboard= any time the resource name is configured. The chief-rank uploader continues to stream to the user-named TensorboardExperiment in parallel for cross-job comparison. Refresh the now-stale dual-uploader story in the dataclass docstring, launcher comment block, and proto comment. Also adds the round-2 plan to docs/plans/.
…runs The chief-rank aiplatform.start_upload_tb_log uploader was watching AIP_TENSORBOARD_LOG_DIR with no run_name_prefix. The SDK's LogdirLoader maps root-logdir events to DEFAULT_RUN_NAME = "default", and the SDK silently merges into existing runs by name — so two jobs sharing the same TensorboardExperiment would collapse into one "default" run instead of producing two comparable runs (codex round-2 issue 1). Approach (see the sketch after this list):

- Launcher injects GIGL_TENSORBOARD_RUN_NAME = sanitized(job_name) + utc timestamp. The timestamp keeps the run name unique per launch, even across reruns of the same applied_task_identifier.
- Writer uses <AIP_TENSORBOARD_LOG_DIR>/<run_name>/ as the file-writer directory; the uploader still watches the parent so the SDK's LogdirLoader discovers the subdir as a TensorboardRun via os.path.relpath.
- Sanitization mirrors the SDK's reformat_run_name regex so the GCS subdir name and the SDK-derived run name agree (codex round-2 issue 5).
- from_env constructs the file writer first, then attempts to start the uploader; on failure the writer is closed before re-raising — no leaked uploader thread (codex round-2 issue 6).
- launch_single_pool_job now returns the submitted CustomJob so the smoke script can look up the per-job TensorboardExperiment (codex round-2 issue 3).
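A sketch of the run-name and directory scheme from the list above; the sanitization regex here is an assumption standing in for the SDK's reformat_run_name behavior:

```python
import datetime
import os
import re


def build_run_log_dir(job_name: str) -> str:
    """Per-run subdirectory the file writer targets; the uploader watches
    the parent, so the SDK discovers the subdir as its own TensorboardRun."""
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d-%H%M%S")
    # Illustrative sanitizer; the branch mirrors the SDK's own regex so the
    # GCS subdir name and the SDK-derived run name agree.
    run_name = f"{re.sub(r'[^a-z0-9-]', '-', job_name.lower())}-{stamp}"
    return os.path.join(os.environ["AIP_TENSORBOARD_LOG_DIR"], run_name)
```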
…ing entrypoints Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bypasses ConfigPopulator and the full pipeline, submitting a tiny n1-standard-2 CustomJob via the production launch_single_pool_job path so env-var injection and submit-side wiring exercise the same code as a real trainer. Verifies the TensorBoard story end-to-end via the API:

- Per-job auto-named TensorboardExperiment exists, has runs, runs have TensorboardTimeSeries (R1 + scalar ingestion).
- When --experiment-name is passed, the named experiment also exists with ≥1 run and ≥1 time series (R2).

--container-uri is required so the smoke loop tests branch-local code, not a released image (codex round-2 issue 2). The launcher's launch_single_pool_job now returns the CustomJob (codex round-2 issue 3) so the script can resolve the auto-named experiment from the job's numeric ID.
Run via: python -m gigl.utils.dev.submit_smoke_job ...
After every CustomJob submission, log the per-job TensorboardExperiment URL (name == job's numeric ID) and, when the user opted into a stable TensorboardExperiment, also log the cross-job comparison URL. Saves users having to construct the URL by hand from the project/region/TB-ID/experiment template — particularly useful for the cross-job link, which the VAI job UI does NOT surface (the "Open TensorBoard" button there only resolves to the per-job experiment).
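A sketch of what the logged comparison URL plausibly looks like; _build_tensorboard_experiment_url is the helper name visible in the review diff below, but the exact URL template here is an assumption and should be checked against the console:

```python
def _build_tensorboard_experiment_url(
    project_number: str, location: str, tensorboard_id: str, experiment: str
) -> str:
    # Assumed template: the hosted Vertex TensorBoard UI addresses an
    # experiment by its '+'-joined resource path.
    resource = (
        f"projects+{project_number}+locations+{location}"
        f"+tensorboards+{tensorboard_id}+experiments+{experiment}"
    )
    return (
        f"https://{location}.tensorboard.googleusercontent.com"
        f"/experiment/{resource}"
    )
```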
The checked-in e2e CORA task config previously hard-coded ``tensorboardExperimentName: "kmonte-test-experiment"``. That value is personal to one developer's debugging, AND any e2e/CI resource config without ``tensorboard_resource_name`` would now fail validation because the experiment-name check requires a backing TB resource (codex round-2 issue 4). Leave the field unset in the example so the default e2e CORA test stays compatible with all resource configs. Users opt into cross-job comparison by setting ``tensorboardExperimentName`` in their own task config copy alongside a TB resource on the trainer resource config. Also folds in ``make format`` output across the branch (mostly mdformat-driven markdown reflow).
The smoke launcher and tb_smoke_main were used to validate the multi-run TB design end-to-end during this branch's development. Now that R1 + R2 are validated and the production trainers carry the same behavior, drop the dev tooling so it doesn't ship in the public package. The plan at docs/plans/20260505-tb-multi-job-iteration.md retains a record of the iteration loop for posterity.
The plan was load-bearing during dev; the final state of the branch is the source of truth now. Plan stays in /home/kmontemayor/.claude/plans/ for posterity.
Now that we always pass tensorboard=<resource> when set (and just None otherwise), the conditional dict is unnecessary. Reverts the dict-plus-conditional-add introduced in 4868e04 back to a single direct submit() call. Drops the now-unused Any import.
The branch's v1/trainer.py refactor (inline launch → launch_single_pool_job) and v1/lib/training_process.py tensorboard-writer removal land separately. Drops gigl/src/training/v1/trainer.py + .../training_process.py back to main; deletes the v1 trainer unit-test file. After this commit: `git diff main...HEAD -- gigl/src/training/v1/` is empty.
Cross-references in source comments and proto comments so readers can jump to the canonical Google Cloud docs for CustomJobSpec, TensorBoard data model (Tensorboard / TensorboardExperiment / TensorboardRun / TensorboardTimeSeries), and the auto-uploader contract.
Per-environment metaparam, just like tensorboard_resource_name. Single-pool trainers read it off their VertexAiResourceConfig directly; graph-store trainers read it off compute_pool (which is itself a VertexAiResourceConfig), matching how other Vertex AI metaparams already flow.

This lets the launcher signature shrink: launch_single_pool_job / launch_graph_store_enabled_job / _build_job_config no longer take tensorboard_experiment_name as a parameter — they pull it from the proto. The v2 GLT trainer correspondingly drops the gbml_config extraction.

Validation simplifies: both fields now live on the same proto object, so the single-dispatch pattern that already handled tensorboard_resource_name now handles experiment_name too. Tests rewritten to set the field on the resource config proto. The tests/unit/src/training/glt_trainer_test.py file now tests dispatch-by-type since the experiment-name extraction code path is gone.
…roto After moving tensorboard_experiment_name to VertexAiResourceConfig the comment was still pointing users at the old TrainerConfig location and gbml_config.proto.
These fixes were originally bundled with the TB enablement commit but they're independent of the TB feature. Reverting them here keeps this PR scoped to TB only; the bug-fix lands in a follow-up PR:

- compute pool: use_cuda=is_cpu_execution (re-introduces the inverted flag, matches main)
- graph-store storage pool: tracks compute pool's container_uri / use_cuda instead of always running CPU
- rename is_compute_cpu_execution back to is_cpu_execution

The corresponding test assertions (--use_cuda presence, storage container_uri) are dropped since they verified the fixed behavior.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The try/finally wrapper was meant to guarantee ``tensorboard_writer.close()`` runs even if the training body raised. But Vertex AI is the only relevant runtime here, and on a worker that crashes mid-training the process exits anyway — the paired ``aiplatform.end_upload_tb_log()`` is best-effort regardless. Plain sequential ``writer = from_env(...); ... writer.close()`` matches the rest of the codebase. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The local ``from google.cloud import aiplatform`` inside ``close()`` and ``_maybe_start_uploader`` was a defensive measure for environments without ``google-cloud-aiplatform``. It's a hard dependency here (declared in pyproject.toml and used by other modules at import time), so the local-import wrapping just adds noise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vertex_ai_test.py and glt_trainer_test.py were primarily exercising the test mocks rather than real behavior. The real behavior is covered by tensorboard_writer_test.py (writer + uploader lifecycle), vertex_ai_launcher_test.py (launcher → VertexAiJobConfig wiring), and gbml_and_resource_config_compatibility_checks_test.py (validation). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
yliu2-sc reviewed May 8, 2026
Collaborator
yliu2-sc left a comment:
Could you add some examples?
```python
match = _VERTEX_TENSORBOARD_ID_FROM_RESOURCE_PATTERN.match(
    tensorboard_resource_name
)
if not match:
```
Collaborator
should this throw an error or default to sth?
```python
logger.info(
    f"See job logs at: https://console.cloud.google.com/ai/platform/locations/{self._location}/training/{job.name}?project={self._project}"
)
if job_config.tensorboard_resource_name:
```
Collaborator
else? what if tensorboard_resource_name is not provided, should we default to sth or throw error?
```python
if per_job_url:
    logger.info(f"View TensorBoard (per-job): {per_job_url}")
if job_config.tensorboard_experiment_name:
    comparison_url = _build_tensorboard_experiment_url(
```
Collaborator
I feel like if an experiment is provided, we just need to return the experiment URL? There's not really much difference otherwise; users just have to filter for their specific jobs?
```python
)


def _sanitize_for_vertex_run(value: str) -> str:
```
Collaborator
nit:
Suggested change:

```diff
- def _sanitize_for_vertex_run(value: str) -> str:
+ def _sanitize_vertex_run_name(value: str) -> str:
```
Scope of work done
Where is the documentation for this feature?: N/A
Did you add automated tests or write a test plan?
Updated Changelog.md? NO
Ready for code review?: NO