
Feat/second nic communication#3129

Open
sesame0224 wants to merge 6 commits into numaproj:main from sesame0224:feat/second-nic-communication

Conversation

@sesame0224
Contributor

@sesame0224 sesame0224 commented Jan 8, 2026

What this PR does / why we need it

This PR proposes a new method for enabling direct data communication between Numaflow vertices, and describes the motivation, resource specifications and workflows.

Internal discussions are ongoing, so we will revise the doc as appropriate.

Related issues

#2990
This PR is an initial design derived from the above post.
We would like to discuss this with the community.

Testing

This PR includes only documentation, so no tests were performed.

Special notes for reviewers

Since the base branch was wrong in #3125, I recreated this PR.

@codecov

codecov bot commented Jan 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.08%. Comparing base (6cbe3a7) to head (46565e3).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3129      +/-   ##
==========================================
+ Coverage   81.00%   81.08%   +0.07%     
==========================================
  Files         317      317              
  Lines       73287    73287              
==========================================
+ Hits        59368    59425      +57     
+ Misses      13367    13307      -60     
- Partials      552      555       +3     


@vigith
Member

vigith commented Jan 9, 2026

Thank you for putting this together. I have a couple of basic questions.

  1. How do you replay a message if we rely purely on direct communication? E.g., if a pod in VertexN is talking to a pod in VertexN+1 and the pod in VertexN crashes due to an underlying node failure, the message will be lost.
    Today we use a buffer so that the producer (VertexN) and the next vertex (VertexN+1) are decoupled. We did this because a pod can crash at any time, while the buffer is persistent and can be used by VertexN+1 to read the data.

  2. How do you know which pod in VertexN should talk to which pod in VertexN+1?

@sesame0224
Contributor Author

sesame0224 commented Jan 14, 2026

  1. How do you replay a message if we rely purely on direct communication?

As a premise, we assume a use case of object detection using video inference. Therefore, video frames are used as input data. In this use case, losing some video frames is not a critical issue. As a result, retransmission of lost data is out of scope.
If this functionality is implemented, as you pointed out, users will need to choose the type of path depending on their use case.

  2. How do you know which pod in VertexN should talk to which pod in VertexN+1?

First, the user defines the connectivity between Vertices in spec.edges of the manifest file, as before.
When a Pipeline resource is deployed, Pods are deployed in a chained manner by the Vertex controller.

An external controller watches Pod deployments and checks whether the domain name of the MultiNetwork Service corresponding to each Vertex and the Pod’s SecondNIC IP are registered in CoreDNS. If not, it registers them.

Now, let me move on to application execution.
Currently, the UDF container in each Vertex already has the destination Vertex specified as an environment variable. Therefore, we assume that it can also have the domain name of the corresponding MultiNetwork Service.
By querying CoreDNS using this domain name, we believe that the container can obtain the SecondNIC IP addresses of candidate destination Pods.
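As a sketch of what this lookup could look like from inside a UDF container (the env var names NUMAFLOW_MULTINET_SERVICE and NAMESPACE and the Service domain layout are assumptions for illustration, not existing Numaflow variables):

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// buildFQDN assembles the cluster-internal domain of the assumed
// MultiNetwork headless Service for the destination vertex.
func buildFQDN(svc, ns string) string {
	return fmt.Sprintf("%s.%s.svc.cluster.local", svc, ns)
}

func main() {
	// Hypothetical variables; today Numaflow exposes the destination
	// vertex name, not a MultiNetwork Service domain.
	svc := os.Getenv("NUMAFLOW_MULTINET_SERVICE")
	ns := os.Getenv("NAMESPACE")
	if svc == "" || ns == "" {
		fmt.Println("MultiNetwork Service env vars not set; skipping lookup")
		return
	}
	fqdn := buildFQDN(svc, ns)

	// With headless-Service-style records in CoreDNS, LookupHost
	// returns one A record per candidate pod (its SecondNIC IP).
	ips, err := net.LookupHost(fqdn)
	if err != nil {
		fmt.Fprintf(os.Stderr, "lookup %s: %v\n", fqdn, err)
		return
	}
	for _, ip := range ips {
		fmt.Println(ip)
	}
}
```

The application would then pick one of the returned SecondNIC IPs to establish the RDMA connection.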

Additionally, standardization of the Service for MultiNetwork via the Gateway API is currently under consideration, but the specification is still undecided.

Member

@vigith vigith left a comment


Could you please provide links to all the important technologies you are referencing? It will help us understand better.

Comment thread on proposal/README.md (Outdated)

Therefore, we consider introducing a high-speed communication method with low transfer overhead, such as GPUDirect RDMA, for inter-vertex communication. To achieve this, the following elements are required.

1. GPUDirect RDMA, which enables direct device-to-device communication between GPUs, requires RDMA-capable NICs. In other words, it is necessary to introduce a high-speed network by assigning a second NIC for RDMA to each pod, separate from the default network.
Member


GPUDirect RDMA

Could you please provide a link for this?

Contributor Author


I’ve updated the README and embedded the links.

@sesame0224
Contributor Author

I will likely add supplementary explanations for the MultiNetwork-related part later.

I will explain the MultiNetwork-related links as a supplement to my response to the previous question (How do you know which pod in VertexN should talk to which pod in VertexN+1?).

As a prerequisite, in order to use GPUDirect RDMA, the application running inside the container must know the IP addresses of the destination Pods.
Therefore, what we essentially want to achieve is to obtain the list of Pod IPs handled by a Service, by using a Headless Service.

The related components currently work as follows:

  • When a Service resource is deployed, EndpointSlice controller collects the IP addresses of the target Pods and creates EndpointSlice resources.
  • CoreDNS watches Service and EndpointSlice resources and caches the corresponding DNS records internally.
  • When a DNS query is received, CoreDNS refers to this cache and returns the candidate Pod IP addresses.

The problem is that EndpointSlice only collects IP addresses from the default network. As a result, this mechanism does not work for MultiNetwork paths that use a Second NIC.

To address this issue, we are considering two possible approaches.

The first approach is to use the CoreDNS etcd plugin to register a domain name corresponding to each Vertex, along with the list of Second NIC IP addresses assigned to the Pods belonging to that Vertex.
This registration would be performed by a component outside of Numaflow at the time when Pods with assigned Second NIC IPs are deployed.
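A minimal sketch of what one such registration entry could look like, assuming the SkyDNS key/value layout used by the CoreDNS etcd plugin (the domain name and SecondNIC IP below are illustrative, not actual Numaflow names):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// skydnsKey converts a domain name into the reversed-path key layout
// expected by the CoreDNS etcd plugin (SkyDNS format), e.g.
// "a.b.c" -> "/skydns/c/b/a".
func skydnsKey(prefix, domain string) string {
	labels := strings.Split(domain, ".")
	for i, j := 0, len(labels)-1; i < j; i, j = i+1, j-1 {
		labels[i], labels[j] = labels[j], labels[i]
	}
	return prefix + "/" + strings.Join(labels, "/")
}

// skydnsRecord is the JSON value the etcd plugin resolves into an A record.
type skydnsRecord struct {
	Host string `json:"host"`
	TTL  int    `json:"ttl,omitempty"`
}

func main() {
	// Illustrative names: a pod entry under the vertex's assumed
	// MultiNetwork Service domain, pointing at its SecondNIC IP.
	key := skydnsKey("/skydns", "pod-0.my-vertex-multinet.default.svc.cluster.local")
	val, _ := json.Marshal(skydnsRecord{Host: "192.168.100.5", TTL: 60})
	fmt.Println(key, string(val))
	// The external component would write this pair with an etcd client
	// (e.g. clientv3.Put(ctx, key, string(val))); CoreDNS with an etcd
	// stanza in its Corefile then serves it as an A record.
}
```

CoreDNS would answer A queries for the Vertex domain with all SecondNIC IPs registered under it.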

Given the current Numaflow specification, where each Vertex already has the name of the next destination Vertex as an environment variable, we believe it is also feasible to provide the domain name of the destination Vertex. By querying CoreDNS during application execution, the application can then obtain the candidate destination IP addresses.

The second approach is based on ongoing efforts (GEP-3539) to evolve the Service API for MultiNetwork support using the Gateway API.
Since this work is still in progress, it is not possible to fully rely on it at this stage. Therefore, the idea is to implement our own solution while monitoring the direction of the Gateway API.
In this approach, we would create HeadlessService-like and EndpointSlice-like resources for the Gateway API, along with a CoreDNS plugin that can handle these new resources.
This implementation would be positioned as a reference or sample implementation that could potentially be contributed back to Kubernetes as a standard feature in the future.

At this point, we prefer to start with the first approach because it has a lower implementation cost. However, if it turns out to be infeasible, we plan to move forward with the second approach.

@syayi syayi self-requested a review January 26, 2026 02:04
@sesame0224 sesame0224 force-pushed the feat/second-nic-communication branch from 906242b to dd7c203 Compare March 25, 2026 08:38
Signed-off-by: Kazuki Yamamoto <yamamoto.kazuki.24@gmail.com>
Signed-off-by: Kazuki Yamamoto <yamamoto.kazuki.24@gmail.com>
Signed-off-by: Kazuki Yamamoto <yamamoto.kazuki.24@gmail.com>
- Change doc structure
  - add a new chapter(Functionality)
  - Swap the order Workflow and Resource Specification
- Update Workflow and Resource Specification

Signed-off-by: Kazuki Yamamoto <yamamoto.kazuki.24@gmail.com>
Signed-off-by: Kazuki Yamamoto <yamamoto.kazuki.24@gmail.com>
Signed-off-by: Kazuki Yamamoto <yamamoto.kazuki.24@gmail.com>
@sesame0224 sesame0224 force-pushed the feat/second-nic-communication branch from 47e3644 to 46565e3 Compare March 25, 2026 12:08
@sesame0224 sesame0224 marked this pull request as ready for review March 25, 2026 23:46
@sesame0224
Contributor Author

sesame0224 commented Mar 26, 2026

I have organized the changes introduced by this proposal into three main points (see "Changes").
Among them, I believe that "Direct Communication Processing During Application Execution" (Architecture) will have the biggest impact on how Numaflow works.
I would especially appreciate your review and feedback on this part.

@vigith
Member

vigith commented Apr 1, 2026

My understanding is that you are proposing two changes:

UDF-to-UDF

move the data plane out of numaflow's main container entirely; let UDF containers talk directly to each other over RDMA, keeping GPU data in GPU memory across vertices, and use numaflow only for orchestration/topology/scaling.

"multi-isbsvc" mode

This just routes ISB traffic over the faster second NIC and doesn't need most of this proposal's complexity (no direct UDF communication, no CUDA graphs, no vertexDomainManager).

My Concerns

My concern is mostly on UDF-to-UDF:

  1. It essentially creates a shadow data plane that runs parallel to numaflow. The numaflow runtime (ISB, WAL, watermarks, exactly-once, backpressure) is bypassed for "direct" edges.
  2. The UDF is no longer a simple handler - it becomes a network-aware GPU program that manages its own connections, memory registration, and data transfer. The simplicity of "write a function, numaflow handles the rest" is lost.

My Recommendation

Perhaps we should do "multi-isbsvc" first and then look into UDF-to-UDF, because UDF-to-UDF is a major change.

/cc @whynowy @yhl25 WDYT?

@sesame0224
Contributor Author

@vigith
Thank you for your review.

We would like to clarify a few points where your understanding of the proposed changes may differ from our intent.

Our main objective is to introduce a direct communication path using GPUDirect RDMA for existing UDF-to-UDF communication. (This path is designed to be independent of the current ISBSVC-based communication.)

The description of multi-isbsvc in the Pipeline field is intended only as an example of possible future extensions and is not the primary focus of this proposal. At this stage, we would consider supporting acceleration using a second NIC only if there is clear demand for improving the performance of the existing ISBSVC communication.

We agree with your approach of starting with changes that have a smaller impact. At the same time, we would like to frame this discussion with the introduction of direct communication as the baseline.

@yhl25
Contributor

yhl25 commented Apr 7, 2026

Thanks for the proposal. I share @vigith's concerns about the scope of changes to core.
I'd suggest building this as a separate controller in a contrib repo rather than touching the platform. Numaflow already exposes Pipeline and Vertex resources; your controller can watch those and handle everything externally:

  • Watch Vertex pod deployments and manage second NIC provisioning via DRA
  • Run your own vertexDomainManager as a standalone controller that registers RDMA IPs in CoreDNS
  • Ship a shared library or SDK extension that UDF authors import to opt into direct RDMA communication

This way the core platform stays untouched, your controller reacts to the same events, and teams that want RDMA can install your controller alongside Numaflow. If this gets traction and the community sees demand, we can discuss promoting pieces into core.
Would you be open to this approach?

@sesame0224
Contributor Author

sesame0224 commented Apr 10, 2026

Thank you for the thoughtful suggestion, @yhl25. I really appreciate the direction you're pointing us toward.

I want to make sure I understand your intent correctly. Here is how I'm now thinking about the separation:

  1. Components such as the second NIC IP manager and service discovery can be implemented independently from the numaflow platform in the contrib repo.

  2. Manifest-level behavior (e.g., routing data over a second NIC per edge) might be achievable without modifying the Pipeline/Vertex CRD or their controllers in core, by using annotations on existing resources combined with a Mutating Admission Webhook — as an alternative to adding a new field like spec.edges[].numaNetwork to the Pipeline CRD.

  3. However, to achieve GPU direct communication, changes to the platform-side main container and the SDK running inside each vertex are inevitable and cannot be implemented outside of core Numaflow.

Given this, I understand your suggestion as a phased approach — rather than implementing all proposed components in the contrib repo at once (which is not fully feasible for item 3), I should first implement item 1 independently to gain traction, and then open the discussion about the necessary platform-side changes once there is community alignment.

Does this match your intent? I want to make sure we're aligned before proceeding.
If so, I will start by implementing the parts that are highly independent from numaflow core in the contrib repo.
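To make item 2 concrete, the annotation-based alternative might look like the sketch below. The annotation key numaflow.numaproj.io/second-nic-edges is hypothetical and would be consumed by an external Mutating Admission Webhook, not by Numaflow itself; the vertex names and images are placeholders:

```yaml
apiVersion: numaflow.numaproj.io/v1alpha1
kind: Pipeline
metadata:
  name: video-inference
  annotations:
    # Hypothetical key read by an external mutating webhook;
    # not part of the current Numaflow API.
    numaflow.numaproj.io/second-nic-edges: "preprocess->infer"
spec:
  vertices:
    - name: preprocess
      udf:
        container:
          image: example/preprocess:latest
    - name: infer
      udf:
        container:
          image: example/infer:latest
  edges:
    - from: preprocess
      to: infer
```

The webhook would see the generated vertex pods, match them against the annotated edges, and patch in the second-NIC attachment without any change to the Pipeline/Vertex CRDs.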

@whynowy
Member

whynowy commented Apr 16, 2026

Thank you for the thoughtful suggestion, @yhl25. I really appreciate the direction you're pointing us toward.

I want to make sure I understand your intent correctly. Here is how I'm now thinking about the separation:

  1. Components such as the second NIC IP manager and service discovery can be implemented independently from the numaflow platform in the contrib repo.
  2. Manifest-level behavior (e.g., routing data over a second NIC per edge) might be achievable without modifying the Pipeline/Vertex CRD or their controllers in core, by using annotations on existing resources combined with a Mutating Admission Webhook — as an alternative to adding a new field like spec.edges[].numaNetwork to the Pipeline CRD.
  3. However, to achieve GPU direct communication, changes to the platform-side main container and the SDK running inside each vertex are inevitable and cannot be implemented outside of core Numaflow.

Given this, I understand your suggestion as a phased approach — rather than implementing all proposed components in the contrib repo at once (which is not fully feasible for item 3), I should first implement item 1 independently to gain traction, and then open the discussion about the necessary platform-side changes once there is community alignment.

Does this match your intent? I want to make sure we're aligned before proceeding. If so, I will start by implementing the parts that are highly independent from numaflow core in the contrib repo.

I agree that steps 1 and 2 can be explored first. In terms of platform changes, I'm inclined to make it support a ud (user-defined) isbsvc implementation, which means the customized isbsvc implementation runs in a sidecar, and all the customization in steps 1 and 2 should be abstracted as much as possible into the customized isbsvc, e.g.

apiVersion: numaflow.numaproj.io/v1alpha1
kind: InterStepBufferService
metadata:
  name: default
spec:
  customized:
    image: xxxx

@vigith
Member

vigith commented Apr 16, 2026

@sesame0224 please let us know what the name of the repo in numaproj-contrib should be; we can create it and make you an admin of that repo

@sesame0224
Contributor Author

@vigith
Could you please name the repository gpu-direct-comm?

@vigith
Member

vigith commented Apr 16, 2026

Created https://github.com/numaproj-contrib/gpu-direct-comm and invited you as admin

@sesame0224
Contributor Author

sesame0224 commented Apr 17, 2026

@vigith
Thank you for creating the repository.
I will start working on it.

@whynowy
Thank you for replying.
Which milestone is ud isbsvc planned for? I'd like to follow how the specifications evolve.
