Dispatch CustomResourceConfig through the v2 GLT trainer + inferencer#628
kmontemayor2-sc wants to merge 9 commits into main from
Conversation
Introduces a new `oneof` arm on `TrainerResourceConfig` / `InferencerResourceConfig` that lets callers describe a launcher as a shell command + positional args, instead of a fixed-shape Vertex AI / KFP / local resource config. The proto carries no semantics here — the dispatcher is added in a follow-up PR; this commit only ships the message, regenerated bindings, and the wrapper-property update so downstream code can read `wrapper.trainer_config` and get a `CustomResourceConfig` back.

The diff includes a long tail of cosmetic Scala changes outside `gigl_resource_config/` because scalapbc regenerates every sibling proto's emitted source whenever any one proto in the same directory changes. Reviewers can scope to `CustomResourceConfig.scala` and the `*ResourceConfig.scala` siblings that gain the new oneof case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
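Illustratively, the widened property return type can be consumed with a plain `isinstance` branch. The dataclasses below are hypothetical stand-ins for the generated proto messages, not the real `gigl_resource_config` bindings:

```python
from dataclasses import dataclass, field
from typing import List, Union

# Hypothetical stand-ins for the generated proto messages; the real
# classes come from the regenerated gigl_resource_config bindings.
@dataclass
class VertexAiResourceConfig:
    machine_type: str = "n1-standard-4"

@dataclass
class CustomResourceConfig:
    command: str = ""
    args: List[str] = field(default_factory=list)

# The widened union that wrapper.trainer_config can now return.
TrainerConfig = Union[VertexAiResourceConfig, CustomResourceConfig]

def describe(cfg: TrainerConfig) -> str:
    # Downstream code branches on the concrete type of the oneof arm.
    if isinstance(cfg, CustomResourceConfig):
        return f"custom launcher: {cfg.command}"
    return f"fixed-shape machine: {cfg.machine_type}"
```

The oneof guarantees only one arm is populated, so the type check is enough to pick a branch.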
/all_test
GiGL Automation @ 23:09:28UTC : 🔄 @ 24:10:53UTC : ✅ Workflow completed successfully.
GiGL Automation @ 23:09:30UTC : 🔄 @ 23:19:49UTC : ✅ Workflow completed successfully.
GiGL Automation @ 23:09:32UTC : 🔄 @ 23:16:26UTC : ❌ Workflow failed.
GiGL Automation @ 23:09:35UTC : 🔄 @ 24:35:00UTC : ❌ Workflow failed.
GiGL Automation @ 23:09:37UTC : 🔄 @ 23:18:13UTC : ✅ Workflow completed successfully.
GiGL Automation @ 23:09:38UTC : 🔄 @ 24:20:42UTC : ✅ Workflow completed successfully.
```python
shell_line = " ".join([command, *(shlex.quote(a) for a in args)])
logger.info(f"Launching {component.name} via subprocess: {shell_line!r}")
subprocess.run(shell_line, shell=True, check=True)
```
Semgrep identified an issue in your code:
The subprocess.run() call uses shell=True on user-controlled command strings, allowing shell injection attacks that could execute arbitrary commands with the process's privileges.
More details about this
The subprocess.run() call executes shell_line with shell=True, which spawns a shell process to interpret the command. This is dangerous because if resolved_command or any element in resolved_args comes from untrusted input (e.g., user-provided configuration, external data), an attacker can inject arbitrary shell commands.
For example, if an attacker controls custom_resource_config.command and sets it to id; rm -rf /, the shell will execute both id and the destructive rm -rf / command. Even though shlex.quote() escapes individual arguments, it doesn't protect against injection in the command itself—only in the args list. An attacker who controls the command field can bypass this protection entirely.
Exploit scenario:
- Attacker provides `custom_resource_config.command = "echo test; cat /etc/passwd #"`
- After joining with args, `shell_line` becomes something like `"echo test; cat /etc/passwd #"`
- When `subprocess.run()` executes this with `shell=True`, the shell interprets the semicolon as a command separator
- Both `echo test` and `cat /etc/passwd` execute, leaking sensitive system files
- If the process runs with elevated privileges, the attacker can exfiltrate or modify sensitive data
The shell inherits environment variables and settings from the parent process, which further expands the attack surface. Using shell=False would treat the entire string as a single command name rather than allowing shell metacharacters to be interpreted.
To resolve this comment:
💡 Follow autofix suggestion
```diff
-subprocess.run(shell_line, shell=True, check=True)
+subprocess.run(shell_line, shell=False, check=True)
```
View step-by-step instructions
- Replace `subprocess.run(shell_line, shell=True, check=True)` with an invocation that does not use `shell=True`.
- Change the code to directly pass the command and arguments as a list, rather than joining into a string. Use: `subprocess.run([resolved_command] + resolved_args, check=True)`.
- Remove any uses of `shlex.quote` when building the argument list, since quoting is only necessary when passing a string to the shell.
- Ensure that `resolved_command` and every item in `resolved_args` are unquoted strings representing the command and its arguments.
Passing the command and arguments as a list with shell=False (the default) is safer, because it avoids any interpretation by a shell, preventing command injection vulnerabilities.
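As an illustration of the list-form call the instructions describe, here is a minimal, self-contained sketch; the deliberately hostile argument stays literal text because no shell ever parses it:

```python
import subprocess

# With an argv list and no shell, metacharacters in arguments are inert:
# "test; cat /etc/passwd" is delivered to echo as one literal argument.
cp = subprocess.run(
    ["echo", "test; cat /etc/passwd"],
    check=True, capture_output=True, text=True,
)
print(cp.stdout.strip())  # prints: test; cat /etc/passwd
```

The semicolon is never interpreted as a command separator, so no second command runs.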
💬 Ignore this finding
Reply with Semgrep commands to ignore this finding:
- `/fp <comment>` for false positive
- `/ar <comment>` for acceptable risk
- `/other <comment>` for all other reasons
Alternatively, triage in Semgrep AppSec Platform to ignore the finding created by subprocess-shell-true.
You can view more details about this finding in the Semgrep AppSec Platform.
`CustomResourceConfig` is launcher-pluggable — there is no concrete machine spec to validate against (no `machine_type`, `num_workers`, GPU config, etc.). The wrapper's `trainer_config` / `inferencer_config` properties now return a union that includes `CustomResourceConfig` (introduced earlier in this PR), but `_validate_machine_config` does not accept it. Add an `isinstance` early-return guard at each call site that logs the skip and returns.

Shape and backend-compatibility validation for `CustomResourceConfig` come in a follow-up PR; this commit only makes the existing validation flow type-clean against the widened union.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements `launch_custom`, a thin shim that takes a populated
`CustomResourceConfig` and shells out via
`subprocess.run(shell=True, check=True)`. The proto's `command` is
a shell snippet (so leading `KEY=VALUE` env assignments parse
naturally) and `args[]` are individually `shlex.quote`-d before
joining, so values containing whitespace survive the shell pass.
The dispatcher performs no template substitution: `command` and
`args[]` are taken verbatim, and any placeholder text reaches
`subprocess.run` literally. Consumers that want runtime-context
substitution (e.g. ${gigl:foo}) should resolve it at YAML-load
time before the proto reaches this module. No call site in the
rest of the repo invokes `launch_custom` yet — wiring is added in
a follow-up PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
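A minimal sketch of the shim described above, assuming the `command` / `args` shapes from the commit message rather than the exact GiGL signature:

```python
import shlex
import subprocess

def launch_custom(command: str, args: list) -> subprocess.CompletedProcess:
    # `command` is taken verbatim as a shell snippet, so a leading KEY=VALUE
    # env assignment parses naturally; each arg is shlex.quote-d individually
    # so values containing whitespace survive the shell pass. No template
    # substitution is performed on either field.
    shell_line = " ".join([command, *(shlex.quote(a) for a in args)])
    return subprocess.run(shell_line, shell=True, check=True)
```

Note this is exactly the `shell=True` pattern the Semgrep finding above flags; it is only safe when `command` comes from trusted configuration.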
Adds a bypass and two validators wired into the existing pre-run validation chain:

1. A bypass in `_validate_machine_config` (via the trainer / inferencer resource-config-valid checks) so a `CustomResourceConfig` trainer / inferencer skips machine-shape checks; the new oneof arm carries no machine spec to validate.
2. `check_custom_resource_config_shape`, which raises if a populated `CustomResourceConfig` has an empty `command`. Pure shape check, no subprocess spawn.
3. `check_custom_resource_config_requires_glt_backend`, an unconditional gate that pairs `CustomResourceConfig` with `feature_flags.should_run_glt_backend`. The v1 dispatchers do not route through `launch_custom`; this catches the misconfig at validate time.

Both new gates run unconditionally inside `kfp_validation_checks` for both Trainer and Inferencer components.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
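The two gates might look like the following sketch; the class and the boolean flag parameter are stand-ins for the real proto message and the `feature_flags` plumbing:

```python
class CustomResourceConfig:
    """Hypothetical stand-in for the proto message."""
    def __init__(self, command: str = "", args=()):
        self.command = command
        self.args = list(args)

def check_custom_resource_config_shape(cfg: CustomResourceConfig) -> None:
    # Pure shape check; never spawns a subprocess.
    if not cfg.command:
        raise ValueError("CustomResourceConfig.command must be non-empty")

def check_custom_resource_config_requires_glt_backend(
    cfg, should_run_glt_backend: bool
) -> None:
    # The v1 dispatchers never route through launch_custom, so a custom
    # config without the GLT backend is a misconfig caught at validate time.
    if isinstance(cfg, CustomResourceConfig) and not should_run_glt_backend:
        raise ValueError(
            "CustomResourceConfig requires feature_flags.should_run_glt_backend"
        )
```

Running both unconditionally keeps the failure at validation time instead of mid-pipeline.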
Routes the custom oneof arm of `TrainerResourceConfig` / `InferencerResourceConfig` to `launch_custom` from the v2 GLT entrypoints, alongside the existing Vertex AI single-pool / graph-store / local branches. With this in place, callers can swap any trainer or inferencer onto a shell-command launcher by populating the `custom_*_config` field on their resource config; no further GiGL-side change is required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Build a per-call env dict from os.environ.copy() plus six launcher-
managed GIGL_* keys and pass it via subprocess.run(..., env=...). The
parent process's os.environ is not mutated, so concurrent
launch_custom calls in the same process don't race or leak context
across launches.
Env-var contract:
GIGL_TASK_CONFIG_URI — str(task_config_uri)
GIGL_RESOURCE_CONFIG_URI — str(resource_config_uri)
GIGL_COMPONENT — component.name
GIGL_APPLIED_TASK_IDENTIFIER — str(applied_task_identifier)
GIGL_CUDA_DOCKER_IMAGE — cuda_docker_uri or ""
GIGL_CPU_DOCKER_IMAGE — cpu_docker_uri or ""
The kwargs that previously had ``# Accepted for back-compat; ignored``
(applied_task_identifier, task_config_uri, resource_config_uri,
cpu_docker_uri, cuda_docker_uri) are now the source of these env vars.
process_command and process_runtime_args remain accepted-but-ignored:
the receiving CLI is expected to re-derive them from the GbmlConfig at
GIGL_TASK_CONFIG_URI.
GIGL_* is reserved for launcher-managed context; custom-launcher YAML
authors must not collide with these names. Documented that a leading
KEY=VALUE shell prefix on `command` overrides the inherited env for
that command (so a stray GIGL_FOO=... prefix would silently shadow
what the dispatcher set).
The launcher still performs no template substitution on command/args,
so any ${...}-style placeholder reaches the subprocess verbatim;
consumers should drop those in favor of reading GIGL_* from
os.environ. Existing dispatch-wiring tests for the GLT trainer +
inferencer paths still pass — no signature change to launch_custom,
only the previously-ignored kwargs become live.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
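A sketch of the per-call env construction under the contract above; `build_launch_env` is a hypothetical helper name, not the actual GiGL function:

```python
import os

def build_launch_env(
    task_config_uri,
    resource_config_uri,
    component_name,
    applied_task_identifier,
    cuda_docker_uri=None,
    cpu_docker_uri=None,
):
    # Copy rather than mutate os.environ, so concurrent launch_custom calls
    # in the same process neither race nor leak context across launches.
    env = os.environ.copy()
    env.update(
        {
            "GIGL_TASK_CONFIG_URI": str(task_config_uri),
            "GIGL_RESOURCE_CONFIG_URI": str(resource_config_uri),
            "GIGL_COMPONENT": component_name,
            "GIGL_APPLIED_TASK_IDENTIFIER": str(applied_task_identifier),
            "GIGL_CUDA_DOCKER_IMAGE": cuda_docker_uri or "",
            "GIGL_CPU_DOCKER_IMAGE": cpu_docker_uri or "",
        }
    )
    return env
```

The returned dict is handed to `subprocess.run(..., env=...)` for that launch only.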
ThreadPoolExecutor(max_workers=0) raises ValueError. The previous code passed len(node_data_references) / len(edge_data_references) directly, which crashes when a preprocessor returns empty preprocessing-spec dicts (legitimate use case for an end-to-end harness). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same root cause as the enumerate fix: ThreadPoolExecutor(max_workers=0) raises ValueError. When both node_ref_to_preprocessing_spec and edge_ref_to_preprocessing_spec are empty, num_dataflow_jobs is 0 and the executor blows up in __init__. Early-return an empty PreprocessedMetadataReferences in that case — no Dataflow work to do. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
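The guard described in both fixes can be sketched generically; `run_preprocessing_jobs` is an illustrative stand-in, not the actual GiGL function:

```python
from concurrent.futures import ThreadPoolExecutor

def run_preprocessing_jobs(jobs):
    # ThreadPoolExecutor(max_workers=0) raises ValueError, so early-return
    # when both spec dicts were empty and there is no Dataflow work to do.
    if not jobs:
        return []
    with ThreadPoolExecutor(max_workers=len(jobs)) as executor:
        futures = [executor.submit(job) for job in jobs]
        return [f.result() for f in futures]
```

The empty-input branch makes an empty preprocessing spec a no-op instead of a crash in `__init__`.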
8d02ca2 to 423b0d0