
Feat/second nic communication#3129

Open
sesame0224 wants to merge 6 commits into numaproj:main from sesame0224:feat/second-nic-communication

Conversation

@sesame0224
Contributor

@sesame0224 sesame0224 commented Jan 8, 2026

What this PR does / why we need it

This PR proposes a new method for enabling direct data communication between Numaflow vertices, and describes the motivation, resource specifications and workflows.

Internal discussions are ongoing, so we will revise the doc as appropriate.

Related issues

#2990
This PR is an initial design derived from the above post.
We would like to discuss this with the community.

Testing

This PR includes only documentation, so no tests were performed.

Special notes for reviewers

Since the base branch was wrong in #3125, I recreated this PR.

@codecov

codecov bot commented Jan 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.08%. Comparing base (6cbe3a7) to head (46565e3).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3129      +/-   ##
==========================================
+ Coverage   81.00%   81.08%   +0.07%     
==========================================
  Files         317      317              
  Lines       73287    73287              
==========================================
+ Hits        59368    59425      +57     
+ Misses      13367    13307      -60     
- Partials      552      555       +3     


@vigith
Member

vigith commented Jan 9, 2026

Thank you for putting this together. I have a couple of basic questions.

  1. How do you replay a message if we rely purely on direct communication? E.g., if a pod in VertexN is talking to a pod in VertexN+1 and the pod in VertexN crashes due to an underlying node failure, the message will be lost.
    Today we use a buffer so that the producer (VertexN) and the next vertex (VertexN+1) are decoupled. We did this because a pod can crash at any time, while the buffer is persistent and can be used by VertexN+1 to read the data.

  2. How do you know which pod in VertexN should talk to which pod in VertexN+1?

@sesame0224
Contributor Author

sesame0224 commented Jan 14, 2026

  1. How do you replay a message if we rely purely on direct communication?

As a premise, we assume a use case of object detection using video inference. Therefore, video frames are used as input data. In this use case, losing some video frames is not a critical issue. As a result, retransmission of lost data is out of scope.
If this functionality is implemented, as you pointed out, users will need to choose the type of path depending on their use case.

  2. How do you know which pod in VertexN should talk to which pod in VertexN+1?

First, the user defines the connectivity between Vertices in spec.edges of the manifest file, as before.
When a Pipeline resource is deployed, Pods are deployed in a chained manner by the Vertex controller.

An external controller watches Pod deployments and checks whether the domain name of the MultiNetwork Service corresponding to each Vertex and the Pod’s SecondNIC IP are registered in CoreDNS. If not, it registers them.

Now, let me move on to application execution.
Currently, the UDF container in each Vertex already has the destination Vertex specified as an environment variable. Therefore, we assume that it can also have the domain name of the corresponding MultiNetwork Service.
By querying CoreDNS using this domain name, we believe that the container can obtain the SecondNIC IP addresses of candidate destination Pods.
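As a sketch of what this lookup could look like from inside a UDF container (the env var names NUMAFLOW_MULTINET_SERVICE and NAMESPACE and the Service domain layout are assumptions for illustration, not existing Numaflow variables):

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// buildFQDN assembles the cluster-internal domain of the assumed
// MultiNetwork headless Service for the destination vertex.
func buildFQDN(svc, ns string) string {
	return fmt.Sprintf("%s.%s.svc.cluster.local", svc, ns)
}

func main() {
	// Hypothetical variables; today Numaflow exposes the destination
	// vertex name, not a MultiNetwork Service domain.
	svc := os.Getenv("NUMAFLOW_MULTINET_SERVICE")
	ns := os.Getenv("NAMESPACE")
	if svc == "" || ns == "" {
		fmt.Println("MultiNetwork Service env vars not set; skipping lookup")
		return
	}
	fqdn := buildFQDN(svc, ns)

	// With headless-Service-style records in CoreDNS, LookupHost
	// returns one A record per candidate pod (its SecondNIC IP).
	ips, err := net.LookupHost(fqdn)
	if err != nil {
		fmt.Fprintf(os.Stderr, "lookup %s: %v\n", fqdn, err)
		return
	}
	for _, ip := range ips {
		fmt.Println(ip)
	}
}
```

The application would then pick one of the returned SecondNIC IPs to establish the RDMA connection.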

Additionally, standardization of the Service for MultiNetwork via the Gateway API is currently under consideration, but the specification is still undecided.

Member

@vigith vigith left a comment


Could you please provide links to all the important technologies you are referencing? It will help us understand better.

Comment thread on proposal/README.md (Outdated)

Therefore, we consider introducing a high-speed communication method with low transfer overhead, such as GPUDirect RDMA, for inter-vertex communication. To achieve this, the following elements are required.

1. GPUDirect RDMA, which enables direct device-to-device communication between GPUs, requires RDMA-capable NICs. In other words, it is necessary to introduce a high-speed network by assigning a second NIC for RDMA to each pod, separate from the default network.
Member


GPUDirect RDMA

Could you please provide a link for this?

Contributor Author


I’ve updated the README and embedded the links.

@sesame0224
Contributor Author

I will likely add supplementary explanations for the MultiNetwork-related part later.

I will explain the MultiNetwork-related links as a supplement to my response to the previous question (How do you know which pod in VertexN should talk to which pod in VertexN+1?).

As a prerequisite, in order to use GPUDirect RDMA, the application running inside the container must know the IP addresses of the destination Pods.
Therefore, what we essentially want to achieve is to obtain the list of Pod IPs handled by a Service, by using a Headless Service.

The related components currently work as follows:

  • When a Service resource is deployed, EndpointSlice controller collects the IP addresses of the target Pods and creates EndpointSlice resources.
  • CoreDNS watches Service and EndpointSlice resources and caches the corresponding DNS records internally.
  • When a DNS query is received, CoreDNS refers to this cache and returns the candidate Pod IP addresses.

The problem is that EndpointSlice only collects IP addresses from the default network. As a result, this mechanism does not work for MultiNetwork paths that use a Second NIC.

To address this issue, we are considering two possible approaches.

The first approach is to use the CoreDNS etcd plugin to register a domain name corresponding to each Vertex, along with the list of Second NIC IP addresses assigned to the Pods belonging to that Vertex.
This registration would be performed by a component outside of Numaflow at the time when Pods with assigned Second NIC IPs are deployed.
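A minimal sketch of what one such registration entry could look like, assuming the SkyDNS key/value layout used by the CoreDNS etcd plugin (the domain name and SecondNIC IP below are illustrative, not actual Numaflow names):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// skydnsKey converts a domain name into the reversed-path key layout
// expected by the CoreDNS etcd plugin (SkyDNS format), e.g.
// "a.b.c" -> "/skydns/c/b/a".
func skydnsKey(prefix, domain string) string {
	labels := strings.Split(domain, ".")
	for i, j := 0, len(labels)-1; i < j; i, j = i+1, j-1 {
		labels[i], labels[j] = labels[j], labels[i]
	}
	return prefix + "/" + strings.Join(labels, "/")
}

// skydnsRecord is the JSON value the etcd plugin resolves into an A record.
type skydnsRecord struct {
	Host string `json:"host"`
	TTL  int    `json:"ttl,omitempty"`
}

func main() {
	// Illustrative names: a pod entry under the vertex's assumed
	// MultiNetwork Service domain, pointing at its SecondNIC IP.
	key := skydnsKey("/skydns", "pod-0.my-vertex-multinet.default.svc.cluster.local")
	val, _ := json.Marshal(skydnsRecord{Host: "192.168.100.5", TTL: 60})
	fmt.Println(key, string(val))
	// The external component would write this pair with an etcd client
	// (e.g. clientv3.Put(ctx, key, string(val))); CoreDNS with an etcd
	// stanza in its Corefile then serves it as an A record.
}
```

CoreDNS would answer A queries for the Vertex domain with all SecondNIC IPs registered under it.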

Given the current Numaflow specification, where each Vertex already has the name of the next destination Vertex as an environment variable, we believe it is also feasible to provide the domain name of the destination Vertex. By querying CoreDNS during application execution, the application can then obtain the candidate destination IP addresses.

The second approach is based on ongoing efforts (GEP-3539) to evolve the Service API for MultiNetwork support using the Gateway API.
Since this work is still in progress, it is not possible to fully rely on it at this stage. Therefore, the idea is to implement our own solution while monitoring the direction of the Gateway API.
In this approach, we would create HeadlessService-like and EndpointSlice-like resources for the Gateway API, along with a CoreDNS plugin that can handle these new resources.
This implementation would be positioned as a reference or sample implementation that could potentially be contributed back to Kubernetes as a standard feature in the future.

At this point, we prefer to start with the first approach because it has a lower implementation cost. However, if it turns out to be infeasible, we plan to move forward with the second approach.

@syayi syayi self-requested a review January 26, 2026 02:04
@sesame0224 sesame0224 force-pushed the feat/second-nic-communication branch from 906242b to dd7c203 Compare March 25, 2026 08:38
Signed-off-by: Kazuki Yamamoto <yamamoto.kazuki.24@gmail.com>
Signed-off-by: Kazuki Yamamoto <yamamoto.kazuki.24@gmail.com>
Signed-off-by: Kazuki Yamamoto <yamamoto.kazuki.24@gmail.com>
- Change doc structure
  - add a new chapter(Functionality)
  - Swap the order Workflow and Resource Specification
- Update Workflow and Resource Specification

Signed-off-by: Kazuki Yamamoto <yamamoto.kazuki.24@gmail.com>
Signed-off-by: Kazuki Yamamoto <yamamoto.kazuki.24@gmail.com>
Signed-off-by: Kazuki Yamamoto <yamamoto.kazuki.24@gmail.com>
@sesame0224 sesame0224 force-pushed the feat/second-nic-communication branch from 47e3644 to 46565e3 Compare March 25, 2026 12:08
@sesame0224 sesame0224 marked this pull request as ready for review March 25, 2026 23:46
@sesame0224
Contributor Author

sesame0224 commented Mar 26, 2026

I have organized the changes introduced by this proposal into three main points (see "Changes").
Among them, I believe that "Direct Communication Processing During Application Execution" (Architecture) will have the biggest impact on how Numaflow works.
I would especially appreciate your review and feedback on this part.

@vigith
Member

vigith commented Apr 1, 2026

My understanding is that you are proposing two changes:

UDF-to-UDF

move the data plane out of numaflow's main container entirely; let UDF containers talk directly to each other over RDMA, keeping GPU data in GPU memory across vertices, and use numaflow only for orchestration/topology/scaling.

"multi-isbsvc" mode

This just routes ISB traffic over the faster second NIC and doesn't need most of this proposal's complexity (no direct UDF communication, no CUDA graphs, no vertexDomainManager).

My Concerns

My concern is mostly on UDF-to-UDF:

  1. It essentially creates a shadow data plane that runs parallel to numaflow. The numaflow runtime (ISB, WAL, watermarks, exactly-once, backpressure) is bypassed for "direct" edges.
  2. The UDF is no longer a simple handler - it becomes a network-aware GPU program that manages its own connections, memory registration, and data transfer. The simplicity of "write a function, numaflow handles the rest" is lost.

My Recommendation

Perhaps we should do "multi-isbsvc" first and then look into UDF-to-UDF, because UDF-to-UDF is a major change.

/cc @whynowy @yhl25 WDYT?

@sesame0224
Contributor Author

@vigith
Thank you for your review.

We would like to clarify a few points where your understanding of the proposed changes may differ from our intent.

Our main objective is to introduce a direct communication path using GPUDirect RDMA for existing UDF-to-UDF communication. (This path is designed to be independent of the current ISBSVC-based communication.)

The description of multi-isbsvc in the Pipeline field is intended only as an example of possible future extensions and is not the primary focus of this proposal. At this stage, we would consider supporting acceleration using a second NIC only if there is clear demand for improving the performance of the existing ISBSVC communication.

We agree with your approach of starting with changes that have a smaller impact. At the same time, we would like to frame this discussion with the introduction of direct communication as the baseline.

@yhl25
Contributor

yhl25 commented Apr 7, 2026

Thanks for the proposal. I share @vigith's concerns about the scope of changes to core.
I'd suggest building this as a separate controller in a contrib repo rather than touching the platform. Numaflow already exposes Pipeline and Vertex resources; your controller can watch those and handle everything externally:

  • Watch Vertex pod deployments and manage second NIC provisioning via DRA
  • Run your own vertexDomainManager as a standalone controller that registers RDMA IPs in CoreDNS
  • Ship a shared library or SDK extension that UDF authors import to opt into direct RDMA communication

This way the core platform stays untouched, your controller reacts to the same events, and teams that want RDMA can install your controller alongside Numaflow. If this gets traction and the community sees demand, we can discuss promoting pieces into core.
Would you be open to this approach?

@sesame0224
Contributor Author

sesame0224 commented Apr 10, 2026

Thank you for the thoughtful suggestion, @yhl25. I really appreciate the direction you're pointing us toward.

I want to make sure I understand your intent correctly. Here is how I'm now thinking about the separation:

  1. Components such as the second NIC IP manager and service discovery can be implemented independently from the numaflow platform in the contrib repo.

  2. Manifest-level behavior (e.g., routing data over a second NIC per edge) might be achievable without modifying the Pipeline/Vertex CRD or their controllers in core, by using annotations on existing resources combined with a Mutating Admission Webhook — as an alternative to adding a new field like spec.edges[].numaNetwork to the Pipeline CRD.

  3. However, to achieve GPU direct communication, changes to the platform-side main container and the SDK running inside each vertex are inevitable and cannot be implemented outside of core Numaflow.

Given this, I understand your suggestion as a phased approach — rather than implementing all proposed components in the contrib repo at once (which is not fully feasible for item 3), I should first implement item 1 independently to gain traction, and then open the discussion about the necessary platform-side changes once there is community alignment.

Does this match your intent? I want to make sure we're aligned before proceeding.
If so, I will start by implementing the parts that are highly independent from numaflow core in the contrib repo.
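To make item 2 concrete, the annotation-based alternative might look like the sketch below. The annotation key numaflow.numaproj.io/second-nic-edges is hypothetical and would be consumed by an external Mutating Admission Webhook, not by Numaflow itself; the vertex names and images are placeholders:

```yaml
apiVersion: numaflow.numaproj.io/v1alpha1
kind: Pipeline
metadata:
  name: video-inference
  annotations:
    # Hypothetical key read by an external mutating webhook;
    # not part of the current Numaflow API.
    numaflow.numaproj.io/second-nic-edges: "preprocess->infer"
spec:
  vertices:
    - name: preprocess
      udf:
        container:
          image: example/preprocess:latest
    - name: infer
      udf:
        container:
          image: example/infer:latest
  edges:
    - from: preprocess
      to: infer
```

The webhook would see the generated vertex pods, match them against the annotated edges, and patch in the second-NIC attachment without any change to the Pipeline/Vertex CRDs.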

@whynowy
Member

whynowy commented Apr 16, 2026

Thank you for the thoughtful suggestion, @yhl25. I really appreciate the direction you're pointing us toward.

I want to make sure I understand your intent correctly. Here is how I'm now thinking about the separation:

  1. Components such as the second NIC IP manager and service discovery can be implemented independently from the numaflow platform in the contrib repo.
  2. Manifest-level behavior (e.g., routing data over a second NIC per edge) might be achievable without modifying the Pipeline/Vertex CRD or their controllers in core, by using annotations on existing resources combined with a Mutating Admission Webhook — as an alternative to adding a new field like spec.edges[].numaNetwork to the Pipeline CRD.
  3. However, to achieve GPU direct communication, changes to the platform-side main container and the SDK running inside each vertex are inevitable and cannot be implemented outside of core Numaflow.

Given this, I understand your suggestion as a phased approach — rather than implementing all proposed components in the contrib repo at once (which is not fully feasible for item 3), I should first implement item 1 independently to gain traction, and then open the discussion about the necessary platform-side changes once there is community alignment.

Does this match your intent? I want to make sure we're aligned before proceeding. If so, I will start by implementing the parts that are highly independent from numaflow core in the contrib repo.

I agree that steps 1 and 2 can be explored first. In terms of platform changes, I'm inclined to make it support a ud (user-defined) isbsvc implementation, which means the customized isbsvc implementation runs in a sidecar, and all the customization in steps 1 and 2 should be abstracted as much as possible into the customized isbsvc, e.g.

apiVersion: numaflow.numaproj.io/v1alpha1
kind: InterStepBufferService
metadata:
  name: default
spec:
  customized:
    image: xxxx

@vigith
Member

vigith commented Apr 16, 2026

@sesame0224 please let us know what the name of the repo in numaproj-contrib should be; we can create it and make you an admin of that repo

@sesame0224
Contributor Author

@vigith
Could you please name the repository gpu-direct-comm?

@vigith
Member

vigith commented Apr 16, 2026

Created https://github.com/numaproj-contrib/gpu-direct-comm and invited you as admin

@sesame0224
Contributor Author

sesame0224 commented Apr 17, 2026

@vigith
Thank you for creating the repository.
I will start working on it.

@whynowy
Thank you for replying.
Which milestone is ud isbsvc planned for? I'd like to follow how the specifications evolve.
