Provide nvidia-imex system service #428
Forced push to:
## OS Changes
* Backport patch to prevent a race in neighbor resolution for RDMA workloads ([#427])
* Provide innactive nvidia-imex systemd service ([#428])
- * Provide innactive nvidia-imex systemd service ([#428])
+ * Provide inactive nvidia-imex systemd service ([#428])
ExecStart=/usr/bin/nvidia-imex -c /etc/nvidia-imex/config.cfg
StandardOutput=journal
StandardError=journal
LimitCORE=infinity
Based on the PR description, the missing [Install] section seems intentional (start-only, never enabled). If so, can we add a short comment in the unit explaining that?
> start-only, never enabled

This sounds confusing; the intention is never started, never enabled (which is what the PR details show). I can add the comment, but I prefer to keep a good commit message stating why the change is what it is.
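To make the "never started, never enabled" intent concrete, here is a minimal sketch of a unit without an [Install] section, plus a check for it. The [Service] directives and the description are taken from the diff and the status output in this PR; the comment wording, the [Unit] layout, and the `has_install_section` helper are assumptions for illustration, not the shipped file.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the unit file declares an [Install] section.
has_install_section() {
    grep -q '^\[Install\]' "$1"
}

unit_file="$(mktemp)"
cat > "$unit_file" <<'EOF'
# Intentionally no [Install] section: the OS never enables or starts this
# unit. Downstreams are expected to manage its lifecycle.
[Unit]
Description=NVIDIA IMEX service

[Service]
ExecStart=/usr/bin/nvidia-imex -c /etc/nvidia-imex/config.cfg
StandardOutput=journal
StandardError=journal
LimitCORE=infinity
EOF

# A unit without [Install] cannot be enabled; systemctl reports it as static.
if has_install_section "$unit_file"; then
    unit_state="enableable"
else
    unit_state="static"
fi
echo "unit is $unit_state"
rm -f "$unit_file"
```

With the unit written this way, something has to start it explicitly, which matches the "static" state shown in the testing output.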
%{_cross_bindir}/nvidia-imex-ctl
%{_cross_unitdir}/nvidia-imex.service
%{_cross_factorydir}/etc/nvidia-imex/config.cfg
%{_cross_tmpfilesdir}/nvidia-imex-tmpfiles.conf
nit: Would it make sense to rename nvidia-imex-tmpfiles.conf to nvidia-imex.conf, since it is already in the tmpfiles dir?
Yeah, I should have caught that. Nice catch!
@@ -0,0 +1,2 @@
d /etc/nvidia-imex 0755 root root -
C /etc/nvidia-imex/config.cfg
- C /etc/nvidia-imex/config.cfg
+ C /etc/nvidia-imex/config.cfg 0644 root root -
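For readers less familiar with tmpfiles.d, the sketch below emulates what the two lines are expected to do, inside a scratch directory so it can run unprivileged. It assumes the factory directory resolves to /usr/share/factory, which is where a `C` line with no source argument looks for a file of the same name. The `apply_dir` and `apply_copy` helpers are hypothetical stand-ins for systemd-tmpfiles, and ownership handling is left out because it needs root.

```shell
#!/bin/sh
# Scratch root so nothing touches the real /etc.
root="$(mktemp -d)"
factory="$root/usr/share/factory"
mkdir -p "$factory/etc/nvidia-imex"
echo "LOG_LEVEL=3" > "$factory/etc/nvidia-imex/config.cfg"

# Emulates: d /etc/nvidia-imex 0755 root root -
apply_dir() {
    mkdir -p "$root$1" && chmod "$2" "$root$1"
}

# Emulates: C /etc/nvidia-imex/config.cfg 0644 root root -
# The copy only happens when the destination does not exist yet.
apply_copy() {
    if [ ! -e "$root$1" ]; then
        cp "$factory$1" "$root$1" && chmod "$2" "$root$1"
    fi
}

apply_dir /etc/nvidia-imex 0755
apply_copy /etc/nvidia-imex/config.cfg 0644

# A downstream edit survives a second run because the file already exists.
echo "LOG_LEVEL=4" > "$root/etc/nvidia-imex/config.cfg"
apply_copy /etc/nvidia-imex/config.cfg 0644
final_cfg="$(cat "$root/etc/nvidia-imex/config.cfg")"
echo "$final_cfg"
rm -rf "$root"
```

The copy-if-missing behavior is what lets the image ship a default config.cfg while leaving room for downstreams to replace it.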
Force-pushed from e31065e to 1c0dc52
Force-pushed from 8b77147 to 46f90be
(Forced push includes rebase to fix merge conflicts)
Provide an inactive nvidia-imex systemd service, which should be managed by the downstreams. Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
Provide a modprobe override for the NVIDIA kernel module to create a default IMEX channel Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
Provide an inactive nvidia-imex systemd service, which should be managed by the downstreams Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
Provide a modprobe override for the NVIDIA kernel module to create a default IMEX channel Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
Add changelog entries for the changes introduced in this series Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
Force-pushed from 46f90be to ae2560d
(Forced push to fix typo in commit message)

nit: commit: LGTM
Description of changes:
This series adds a new inactive systemd service for nvidia-imex. The service is inactive (not started on boot) because downstreams are expected to manage its lifecycle: nvidia-imex requires a configuration file with the IPs of the nodes that belong to the same cluster, and those IPs are known only to the control plane and the capacity provider. In Kubernetes, nvidia-imex channels are managed by the NVIDIA DRA driver, so this service is intended for other orchestrators.
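As a rough sketch of what "managed by the downstreams" could look like, the snippet below writes a node list and then hands off to systemd. The one-IP-per-line layout of nodes_config.cfg, the `write_nodes_config` helper, and the example addresses are assumptions for illustration; nothing in this series ships such a helper.

```shell
#!/bin/sh
# Hypothetical helper: write one node IP per line to the given file.
write_nodes_config() {
    out="$1"
    shift
    printf '%s\n' "$@" > "$out"
}

# Scratch location; on a real host this would be
# /etc/nvidia-imex/nodes_config.cfg.
workdir="$(mktemp -d)"
nodes_cfg="$workdir/nodes_config.cfg"

# Example addresses only; the real ones come from the control plane
# or the capacity provider.
write_nodes_config "$nodes_cfg" 10.0.0.11 10.0.0.12

node_count="$(wc -l < "$nodes_cfg" | tr -d ' ')"
echo "wrote $node_count nodes"

# On a real host, the orchestrator would then start the static unit:
#   systemctl start nvidia-imex.service
rm -rf "$workdir"
```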
As part of this change, a new modprobe override was added that allows the NVIDIA kmods to create a default IMEX channel. This configuration is opt-in, as enabling it by default in all variants could interfere with the IMEX channel management performed by the NVIDIA DRA driver.
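For context, a modprobe override of this kind is usually a single `options` line. The sketch below writes one to a scratch directory. The parameter name `NVreg_CreateImexChannel0` is an assumption based on NVIDIA's driver documentation, and the file name is hypothetical; neither is confirmed by this PR, so check the actual override before relying on it.

```shell
#!/bin/sh
# Scratch stand-in for a modprobe.d directory.
modprobe_dir="$(mktemp -d)"
imex_conf="$modprobe_dir/nvidia-imex.conf"   # hypothetical file name

cat > "$imex_conf" <<'EOF'
# Assumption: NVreg_CreateImexChannel0 asks the nvidia module to create
# a default IMEX channel (channel 0) at load time.
options nvidia NVreg_CreateImexChannel0=1
EOF

imex_option="$(grep '^options' "$imex_conf")"
echo "$imex_option"
rm -rf "$modprobe_dir"
```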
No API will be provided for the time being, but one might be needed if we decide to extend the support of nvidia-imex beyond what we currently have.
The systemd service is based on the nvidia-imex.service provided by the run archives; I just adapted it slightly.

A default configuration file is provided with sensible defaults:
* LOG_LEVEL=3: log info messages
* LOG_FILE_NAME: empty to log to STDOUT and get the logs to the journal
* STATS_FILE_NAME: same as above
* DAEMONIZE=0: run as an actual process (don't fork)
* BIND_INTERFACE_IP=: set through nodes_config.cfg, managed by the downstreams
* SERVER_PORT=50000: default port, but make it explicit
* IMEX_NODE_CONFIG_FILE=/etc/nvidia-imex/nodes_config.cfg: default path, but make it explicit
* NETWORK_INTERFACE=: set through nodes_config.cfg, managed by the downstreams
* OUTGOING_NETWORK_INTERFACE=
* IMEX_WAIT_FOR_QUORUM=RECOVERY: safe default (wait for quorum)
* IMEX_CMD_ENABLED=1
* IMEX_CMD_UNIX_DOMAIN_PATH=/run/nvidia/nvidia-imex-cmd.sock: socket to send commands to the service
* IMEX_NODE_DISCONNECTED_GRACE_TIME=-1: wait indefinitely
* IMEX_GRPC_DSCP_OVERRIDE=0

NO-OP configurations, but since I have access to neither the source code nor instances that can connect through IMEX, I decided to keep them:

* LOG_APPEND_TO_LOG=1
* LOG_FILE_MAX_SIZE=1024
* LOG_MAX_ROTATE_COUNT=3
* LOG_USE_SYSLOG=1

Testing done:
nvidia-imex service doesn't start on boot:

bash-5.2# systemctl status nvidia-imex.service
○ nvidia-imex.service - NVIDIA IMEX service
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/nvidia-imex.service; static)
    Drop-In: /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/service.d
             └─00-aws-config.conf, 10-requires-tmp.conf
     Active: inactive (dead)
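Pulling the defaults listed in the description together, a config.cfg along these lines would result. This is a sketch assembled from that list, not the file shipped by the PR: the key order is a guess, and the `cfg_get` helper is hypothetical and only there to read values back.

```shell
#!/bin/sh
cfg_file="$(mktemp)"
cat > "$cfg_file" <<'EOF'
LOG_LEVEL=3
LOG_FILE_NAME=
STATS_FILE_NAME=
DAEMONIZE=0
BIND_INTERFACE_IP=
SERVER_PORT=50000
IMEX_NODE_CONFIG_FILE=/etc/nvidia-imex/nodes_config.cfg
NETWORK_INTERFACE=
OUTGOING_NETWORK_INTERFACE=
IMEX_WAIT_FOR_QUORUM=RECOVERY
IMEX_CMD_ENABLED=1
IMEX_CMD_UNIX_DOMAIN_PATH=/run/nvidia/nvidia-imex-cmd.sock
IMEX_NODE_DISCONNECTED_GRACE_TIME=-1
IMEX_GRPC_DSCP_OVERRIDE=0
LOG_APPEND_TO_LOG=1
LOG_FILE_MAX_SIZE=1024
LOG_MAX_ROTATE_COUNT=3
LOG_USE_SYSLOG=1
EOF

# Hypothetical helper: print the value of KEY from a KEY=VALUE file.
cfg_get() {
    sed -n "s/^$1=//p" "$2"
}

server_port="$(cfg_get SERVER_PORT "$cfg_file")"
quorum_mode="$(cfg_get IMEX_WAIT_FOR_QUORUM "$cfg_file")"
log_file="$(cfg_get LOG_FILE_NAME "$cfg_file")"
echo "port=$server_port quorum=$quorum_mode log_file='$log_file'"
rm -f "$cfg_file"
```

Leaving LOG_FILE_NAME empty, together with StandardOutput=journal in the unit, is what routes the daemon's output to the journal.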
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.