Kmonte/add tb for glt #603
Draft
kmontemayor2-sc wants to merge 53 commits into main from
Conversation
Replace tests for the four-function API with tests for the new TensorBoardWriter class. Tests fail with ImportError until the class lands in the next commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code review found that test_close_on_noop_writer_does_not_raise only called close() once, so idempotency on the no-op path was untested. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
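A minimal sketch of the strengthened test, assuming the TensorBoardWriter API described in the next commit (from_uri returning a no-op writer for a None URI); the module path is taken from a later relocation commit on this branch:

```python
import unittest

# Path from the later relocation commit; the API shape (from_uri, close)
# is assumed from the commit messages in this PR.
from gigl.utils.tensorboard_writer import TensorBoardWriter


class NoopWriterCloseTest(unittest.TestCase):
    def test_close_on_noop_writer_is_idempotent(self) -> None:
        writer = TensorBoardWriter.from_uri(None)  # no URI -> no-op writer
        writer.close()
        writer.close()  # the second close() must also not raise
```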
Collapse resolve_tensorboard_log_dir, create_tensorboard_writer, write_tensorboard_scalar, close_tensorboard_writer, and VERTEX_TENSORBOARD_LOG_DIR_ENV_KEY into a single TensorBoardWriter class with from_uri classmethod. The class is context-managerable and no-ops when disabled, eliminating Optional plumbing at call sites. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
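A minimal sketch of the consolidated class under the assumptions in this message (from_uri classmethod, context-manager support, no-op when disabled); the backend and exact method signatures are assumptions, not the branch's actual implementation:

```python
from typing import Mapping, Optional

# Backend choice is an assumption; GiGL trains with PyTorch, so the
# torch SummaryWriter is a plausible stand-in here.
from torch.utils.tensorboard import SummaryWriter


class TensorBoardWriter:
    def __init__(self, writer: Optional[SummaryWriter]) -> None:
        # None means "disabled": every method below silently no-ops.
        self._writer = writer

    @classmethod
    def from_uri(cls, uri: Optional[str]) -> "TensorBoardWriter":
        # A missing URI yields a no-op writer, so call sites never need
        # Optional[TensorBoardWriter] plumbing.
        return cls(SummaryWriter(log_dir=uri) if uri else None)

    def log(self, scalars: Mapping[str, float], step: int) -> None:
        if self._writer is not None:
            for tag, value in scalars.items():
                self._writer.add_scalar(tag, value, global_step=step)

    def close(self) -> None:
        if self._writer is not None:
            self._writer.close()
            self._writer = None  # makes close() idempotent

    def __enter__(self) -> "TensorBoardWriter":
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.close()
```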
The V1 BaseTrainer.train body never wrote scalars via TF's ambient default writer, so the file_writer + as_default() block in __run_training served no purpose. Verified with grep: no tf.summary.scalar or write_tensorboard_scalar callers anywhere in gigl/src/training/v1/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace function-based tensorboard helpers with the new TensorBoardWriter class. Collapses two back-to-back scalar writes per log step into a single dict-style log() call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
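An illustrative call-site usage against the sketch above; the log dir, tag names, and metric values are placeholders:

```python
writer = TensorBoardWriter.from_uri("/tmp/tb-logs")  # placeholder log dir
loss, auc, step = 0.42, 0.91, 100                    # placeholder metrics

# Previously two back-to-back write_tensorboard_scalar() calls per log step;
# now one dict-style call.
writer.log({"train/loss": loss, "train/auc": auc}, step=step)
writer.close()
```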
…riter Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nsorBoardWriter Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…TensorBoardWriter Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The default for the proto bool field is false, so removing the line preserves the same behavior. Avoids implying that tensorboard logging should be on by default in example task configs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Relocate from gigl/src/common/utils/tensorboard.py (internal pipeline utilities) to gigl/utils/tensorboard_writer.py (general-purpose user utilities). Update the test file path, the in-test patch paths, and all four example training scripts to import from the new location. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Examples now log to TensorBoard whenever a tensorboard_log_uri is set on the gbml config; no separate boolean to flip. Removes the should_log_to_tensorboard field from TrainingProcessArgs (and its docstring entry, extraction from the gbml config, and pass-through to _training_process) in all four link-prediction example training scripts. TensorBoardWriter.from_uri already returns a no-op writer when the URI is None, so a single signal (the URI) is sufficient. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
V1 and V2 trainer launchers previously gated tensorboard_logs_uri construction on both should_log_to_tensorboard AND a non-empty proto field. Drop the should_log_to_tensorboard half so the URI is forwarded to the Vertex AI launcher whenever the proto field is set. One signal (URI presence) instead of two. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TensorBoardWriter now reads AIP_TENSORBOARD_LOG_DIR directly and raises when unset. Vertex AI populates that env var from CustomJobSpec.baseOutputDirectory, which GiGL's launcher already derives from tensorboardLogsUri — so the prior fallback in _resolve_log_dir was dead weight. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
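A sketch of the strict lookup described here, written as a standalone function for illustration (the branch puts it on the class as from_env); the error-message wording is invented:

```python
import os

from gigl.utils.tensorboard_writer import TensorBoardWriter  # assumed path


def writer_from_env() -> TensorBoardWriter:
    log_dir = os.environ.get("AIP_TENSORBOARD_LOG_DIR")
    if not log_dir:
        # Fail fast: Vertex AI derives this var from
        # CustomJobSpec.baseOutputDirectory, so an unset value means the job
        # was not launched with TensorBoard wiring at all.
        raise ValueError("AIP_TENSORBOARD_LOG_DIR is not set")
    return TensorBoardWriter.from_uri(log_dir)
```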
…ning Now that TensorBoardWriter reads AIP_TENSORBOARD_LOG_DIR directly via from_env(), the example no longer needs to plumb tensorboard_logs_uri from the GbmlConfig through TrainingProcessArgs to the worker process. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that TensorBoardWriter reads AIP_TENSORBOARD_LOG_DIR directly via from_env(), the examples no longer need to plumb tensorboard_logs_uri from the GbmlConfig through TrainingProcessArgs to the worker process. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_name Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… set Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…oard_logs_uri" This reverts commit 99ab56d.
…t_name
Vertex AI MetadataStore Context IDs (which back ExperimentRuns) must match [a-z0-9][a-z0-9-]{0,127}. GiGL's gigl_train_<task>_<timestamp> job names contain underscores and were rejected with a 400 from the SDK. Lowercase and replace underscores with hyphens for the run ID; fail fast with a clear message when the user-supplied tensorboard_experiment_name itself is invalid (don't silently rewrite user input).
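A sketch of the two behaviors described above, under the stated regex; note that the very next commit removes the run-id sanitizer again, so only the validation half survives on the branch:

```python
import re

# Vertex AI MetadataStore Context IDs (which back ExperimentRuns).
_VERTEX_CONTEXT_ID_PATTERN = re.compile(r"^[a-z0-9][a-z0-9-]{0,127}$")


def sanitize_run_id(job_name: str) -> str:
    """Derive a valid run ID from a job name like gigl_train_cora_1700000000."""
    candidate = job_name.lower().replace("_", "-")
    if not _VERTEX_CONTEXT_ID_PATTERN.match(candidate):
        raise ValueError(f"cannot derive a valid Vertex run ID from {job_name!r}")
    return candidate


def validate_experiment_name(name: str) -> None:
    """User input is validated, never silently rewritten."""
    if not _VERTEX_CONTEXT_ID_PATTERN.match(name):
        raise ValueError(
            f"tensorboard_experiment_name {name!r} must match "
            "[a-z0-9][a-z0-9-]{0,127}"
        )
```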
Passing experiment_run=<string> invokes aiplatform.ExperimentRun's getter, which 404s when the run doesn't exist yet (every first submit does). Per the SDK's own contract — "If 'experiment' is set but 'experiment_run' is not, an ExperimentRun resource will still be auto-generated" (jobs.py:2509-2514) — the right move is to set only experiment= and let the SDK create the run. The user-supplied experiment name is still validated up front against Vertex's resource-ID regex; the run-id sanitizer is no longer needed.
…path Vertex's auto-uploader is gated on submit(tensorboard=); submit(experiment=) alone leaves AIP_TENSORBOARD_LOG_DIR populated but never reads from it, so events sit in GCS un-uploaded (job 6570151780682825728 is the smoking gun). The two submit kwargs are mutually exclusive, and tensorboard= forces a job-scoped experiment we can't rename — neither delivers multi-run comparison.

Switch approach: when tensorboard_experiment_name is set, the launcher injects GIGL_TENSORBOARD_RESOURCE_NAME and GIGL_TENSORBOARD_EXPERIMENT_NAME into the worker container, and submits with NEITHER experiment= nor tensorboard=. The trainer's TensorBoardWriter.from_env reads those env vars on the chief rank and starts aiplatform.start_upload_tb_log to stream AIP_TENSORBOARD_LOG_DIR to the named TensorboardExperiment. close() pairs with end_upload_tb_log. Multiple jobs sharing the experiment name show as comparable runs on one TB page.

Drops the now-unused _ensure_experiment_with_backing_tb helper (the SDK uploader auto-creates the TensorboardExperiment).
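A chief-rank sketch of the uploader handshake this message describes; aiplatform.start_upload_tb_log / end_upload_tb_log are real SDK calls, while the project/location values and the resource-name parsing are illustrative:

```python
import os

from google.cloud import aiplatform

# Env vars the launcher injects (names from this commit message), plus the
# log dir Vertex AI itself populates.
tb_resource = os.environ["GIGL_TENSORBOARD_RESOURCE_NAME"]
experiment_name = os.environ["GIGL_TENSORBOARD_EXPERIMENT_NAME"]
log_dir = os.environ["AIP_TENSORBOARD_LOG_DIR"]

aiplatform.init(project="my-project", location="us-central1")  # illustrative
aiplatform.start_upload_tb_log(
    # Numeric tail of projects/<p>/locations/<l>/tensorboards/<id>.
    tensorboard_id=tb_resource.rsplit("/", 1)[-1],
    tensorboard_experiment_name=experiment_name,
    logdir=log_dir,
)
try:
    pass  # training loop writes event files under log_dir
finally:
    aiplatform.end_upload_tb_log()  # what close() pairs with
```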
The "Open TensorBoard" link on the Vertex AI job page is gated on jobSpec.tensorboard. Dropping that field (which the previous experiment-name branch did) hides the link entirely. Make _submit_job set tensorboard= any time the resource name is configured. The chief-rank uploader continues to stream to the user-named TensorboardExperiment in parallel for cross-job comparison. Refresh the now-stale dual-uploader story in the dataclass docstring, launcher comment block, and proto comment. Also adds the round-2 plan to docs/plans/.
…runs The chief-rank aiplatform.start_upload_tb_log uploader was watching AIP_TENSORBOARD_LOG_DIR with no run_name_prefix. The SDK's LogdirLoader maps root-logdir events to DEFAULT_RUN_NAME = "default", and the SDK silently merges into existing runs by name — so two jobs sharing the same TensorboardExperiment would collapse into one "default" run instead of producing two comparable runs (codex round-2 issue 1). Approach (see the sketch after this list):

- Launcher injects GIGL_TENSORBOARD_RUN_NAME = sanitized(job_name) + utc timestamp. The timestamp keeps the run name unique per launch, even across reruns of the same applied_task_identifier.
- Writer uses <AIP_TENSORBOARD_LOG_DIR>/<run_name>/ as the file-writer directory; the uploader still watches the parent so the SDK's LogdirLoader discovers the subdir as a TensorboardRun via os.path.relpath.
- Sanitization mirrors the SDK's reformat_run_name regex so the GCS subdir name and the SDK-derived run name agree (codex round-2 issue 5).
- from_env constructs the file writer first, then attempts to start the uploader; on failure the writer is closed before re-raising — no leaked uploader thread (codex round-2 issue 6).
- launch_single_pool_job now returns the submitted CustomJob so the smoke script can look up the per-job TensorboardExperiment (codex round-2 issue 3).
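A sketch of the run-name and directory scheme from the list above; the sanitization regex here is an assumption standing in for the SDK's reformat_run_name behavior:

```python
import datetime
import os
import re


def build_run_log_dir(job_name: str) -> str:
    """Per-run subdirectory the file writer targets; the uploader watches
    the parent, so the SDK discovers the subdir as its own TensorboardRun."""
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d-%H%M%S")
    # Illustrative sanitizer; the branch mirrors the SDK's own regex so the
    # GCS subdir name and the SDK-derived run name agree.
    run_name = f"{re.sub(r'[^a-z0-9-]', '-', job_name.lower())}-{stamp}"
    return os.path.join(os.environ["AIP_TENSORBOARD_LOG_DIR"], run_name)
```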
…ing entrypoints Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bypasses ConfigPopulator and the full pipeline, submitting a tiny n1-standard-2 CustomJob via the production launch_single_pool_job path so env-var injection and submit-side wiring exercise the same code as a real trainer. Verifies the TensorBoard story end-to-end via the API:

- Per-job auto-named TensorboardExperiment exists, has runs, runs have TensorboardTimeSeries (R1 + scalar ingestion).
- When --experiment-name is passed, the named experiment also exists with ≥1 run and ≥1 time series (R2).

--container-uri is required so the smoke loop tests branch-local code, not a released image (codex round-2 issue 2). The launcher's launch_single_pool_job now returns the CustomJob (codex round-2 issue 3) so the script can resolve the auto-named experiment from the job's numeric ID.
Run via: python -m gigl.utils.dev.submit_smoke_job ...
After every CustomJob submission, log the per-job TensorboardExperiment URL (name == job's numeric ID) and, when the user opted into a stable TensorboardExperiment, also log the cross-job comparison URL. Saves users having to construct the URL by hand from the project/region/TB-ID/experiment template — particularly useful for the cross-job link, which the VAI job UI does NOT surface (the "Open TensorBoard" button there only resolves to the per-job experiment).
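A sketch of what the logged comparison URL plausibly looks like; _build_tensorboard_experiment_url is the helper name visible in the review diff below, but the exact URL template here is an assumption and should be checked against the console:

```python
def _build_tensorboard_experiment_url(
    project_number: str, location: str, tensorboard_id: str, experiment: str
) -> str:
    # Assumed template: the hosted Vertex TensorBoard UI addresses an
    # experiment by its '+'-joined resource path.
    resource = (
        f"projects+{project_number}+locations+{location}"
        f"+tensorboards+{tensorboard_id}+experiments+{experiment}"
    )
    return (
        f"https://{location}.tensorboard.googleusercontent.com"
        f"/experiment/{resource}"
    )
```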
The checked-in e2e CORA task config previously hard-coded ``tensorboardExperimentName: "kmonte-test-experiment"``. That value is personal to one developer's debugging, AND any e2e/CI resource config without ``tensorboard_resource_name`` would now fail validation because the experiment-name check requires a backing TB resource (codex round-2 issue 4). Leave the field unset in the example so the default e2e CORA test stays compatible with all resource configs. Users opt into cross-job comparison by setting ``tensorboardExperimentName`` in their own task config copy alongside a TB resource on the trainer resource config. Also folds in ``make format`` output across the branch (mostly mdformat-driven markdown reflow).
The smoke launcher and tb_smoke_main were used to validate the multi-run TB design end-to-end during this branch's development. Now that R1 + R2 are validated and the production trainers carry the same behavior, drop the dev tooling so it doesn't ship in the public package. The plan at docs/plans/20260505-tb-multi-job-iteration.md retains a record of the iteration loop for posterity.
The plan was load-bearing during dev; the final state of the branch is the source of truth now. Plan stays in /home/kmontemayor/.claude/plans/ for posterity.
Now that we always pass tensorboard=<resource> when set (and just None otherwise), the conditional dict is unnecessary. Reverts the dict-plus-conditional-add introduced in 4868e04 back to a single direct submit() call. Drops the now-unused Any import.
The branch's v1/trainer.py refactor (inline launch → launch_single_pool_job) and v1/lib/training_process.py tensorboard-writer removal land separately. Drops gigl/src/training/v1/trainer.py + .../training_process.py back to main; deletes the v1 trainer unit-test file. After this commit: `git diff main...HEAD -- gigl/src/training/v1/` is empty.
Cross-references in source comments and proto comments so readers can jump to the canonical Google Cloud docs for CustomJobSpec, TensorBoard data model (Tensorboard / TensorboardExperiment / TensorboardRun / TensorboardTimeSeries), and the auto-uploader contract.
Per-environment metaparam, just like tensorboard_resource_name. Single-pool trainers read it off their VertexAiResourceConfig directly; graph-store trainers read it off compute_pool (which is itself a VertexAiResourceConfig), matching how other Vertex AI metaparams already flow.

This lets the launcher signature shrink: launch_single_pool_job / launch_graph_store_enabled_job / _build_job_config no longer take tensorboard_experiment_name as a parameter — they pull it from the proto. The v2 GLT trainer correspondingly drops the gbml_config extraction.

Validation simplifies: both fields now live on the same proto object, so the single-dispatch pattern that already handled tensorboard_resource_name now handles experiment_name too. Tests rewritten to set the field on the resource config proto. The tests/unit/src/training/glt_trainer_test.py file now tests dispatch-by-type since the experiment-name extraction code path is gone.
…roto After moving tensorboard_experiment_name to VertexAiResourceConfig the comment was still pointing users at the old TrainerConfig location and gbml_config.proto.
These fixes were originally bundled with the TB enablement commit but they're independent of the TB feature. Reverting them here keeps this PR scoped to TB only; the bug-fix lands in a follow-up PR:

- compute pool: use_cuda=is_cpu_execution (re-introduces the inverted flag, matches main)
- graph-store storage pool: tracks compute pool's container_uri / use_cuda instead of always running CPU
- rename is_compute_cpu_execution back to is_cpu_execution

The corresponding test assertions (--use_cuda presence, storage container_uri) are dropped since they verified the fixed behavior.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The try/finally wrapper was meant to guarantee ``tensorboard_writer.close()`` runs even if the training body raised. But Vertex AI is the only relevant runtime here, and on a worker that crashes mid-training the process exits anyway — the paired ``aiplatform.end_upload_tb_log()`` is best-effort regardless. Plain sequential ``writer = from_env(...); ... writer.close()`` matches the rest of the codebase. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The local ``from google.cloud import aiplatform`` inside ``close()`` and ``_maybe_start_uploader`` was a defensive measure for environments without ``google-cloud-aiplatform``. It's a hard dependency here (declared in pyproject.toml and used by other modules at import time), so the local-import wrapping just adds noise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vertex_ai_test.py and glt_trainer_test.py were primarily exercising the test mocks rather than real behavior. The real behavior is covered by tensorboard_writer_test.py (writer + uploader lifecycle), vertex_ai_launcher_test.py (launcher → VertexAiJobConfig wiring), and gbml_and_resource_config_compatibility_checks_test.py (validation). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
yliu2-sc reviewed May 8, 2026
Collaborator
yliu2-sc left a comment:
Could you add some examples?
```python
match = _VERTEX_TENSORBOARD_ID_FROM_RESOURCE_PATTERN.match(
    tensorboard_resource_name
)
if not match:
```
Collaborator
should this throw an error or default to sth?
```python
logger.info(
    f"See job logs at: https://console.cloud.google.com/ai/platform/locations/{self._location}/training/{job.name}?project={self._project}"
)
if job_config.tensorboard_resource_name:
```
Collaborator
else? what if tensorboard_resource_name is not provided, should we default to sth or throw error?
```python
if per_job_url:
    logger.info(f"View TensorBoard (per-job): {per_job_url}")
if job_config.tensorboard_experiment_name:
    comparison_url = _build_tensorboard_experiment_url(
```
Collaborator
I feel like if an experiment is provided, we just need to return the experiment URL? There's not really much difference otherwise; users just have to filter for their specific jobs?
```python
)


def _sanitize_for_vertex_run(value: str) -> str:
```
Collaborator
nit:
Suggested change:

```diff
- def _sanitize_for_vertex_run(value: str) -> str:
+ def _sanitize_vertex_run_name(value: str) -> str:
```
Scope of work done
Where is the documentation for this feature?: N/A
Did you add automated tests or write a test plan?
Updated Changelog.md? NO
Ready for code review?: NO