Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #3129      +/-   ##
==========================================
+ Coverage   81.00%   81.08%   +0.07%
==========================================
  Files         317      317
  Lines       73287    73287
==========================================
+ Hits        59368    59425      +57
+ Misses      13367    13307      -60
- Partials      552      555       +3
```
Thank you for putting this together. I have a couple of basic questions.
As a premise, we assume a use case of object detection using video inference. Therefore, video frames are used as input data. In this use case, losing some video frames is not a critical issue. As a result, retransmission of lost data is out of scope.
First, the user defines the connectivity between Vertices in `spec.edges` of the manifest file, as before. An external controller watches Pod deployments and checks whether the domain name of the MultiNetwork Service corresponding to each Vertex and the Pod's second-NIC IP are registered in CoreDNS; if not, it registers them. Now, let me move on to application execution. Additionally, standardization of a Service for MultiNetwork via the Gateway API is currently under consideration, but the specification is still undecided.
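To make the registration step concrete: the CoreDNS etcd plugin reads SkyDNS-style records, where the domain labels are reversed into an etcd key path and each host IP is stored as a JSON value under a unique leaf. A minimal sketch of building such an entry; the `skydns_entry` helper, the Vertex domain, and the IPs below are illustrative, not part of the proposal's code:

```python
import json

def skydns_entry(domain: str, ip: str, idx: int, prefix: str = "/skydns"):
    """Build an etcd key/value pair in the SkyDNS format the CoreDNS
    etcd plugin reads: reversed domain labels form the key path, and
    the host IP sits in a JSON value. One entry per Pod IP, with a
    unique leaf (x1, x2, ...) so one Vertex domain can map to many Pods."""
    labels = domain.rstrip(".").split(".")
    key = prefix + "/" + "/".join(reversed(labels)) + f"/x{idx}"
    value = json.dumps({"host": ip, "ttl": 60})
    return key, value

# Hypothetical Vertex domain and second-NIC Pod IPs:
for i, ip in enumerate(["192.168.100.11", "192.168.100.12"], start=1):
    print(skydns_entry("vertex-out.my-pipeline.svc.cluster.local", ip, i))
```

A controller would write these pairs with an etcd client; CoreDNS, with the `etcd` plugin enabled in its Corefile, then answers A-record queries for the Vertex domain with all registered second-NIC IPs.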
vigith left a comment:
Could you please provide links to all the important technologies you are referencing? It would help us understand better.
> Therefore, we consider introducing a high-speed communication method with low transfer overhead, such as GPUDirect RDMA, for inter-vertex communication. To achieve this, the following elements are required.
> 1. GPUDirect RDMA, which enables direct device-to-device communication between GPUs, requires RDMA-capable NICs. In other words, it is necessary to introduce a high-speed network by assigning a second NIC for RDMA to each pod, separate from the default network.
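For context on how a second NIC is typically attached in Kubernetes: one common pattern is Multus CNI with a `NetworkAttachmentDefinition` that Pods request via an annotation. The network name, master interface, and IPAM range below are placeholders, not values from this proposal:

```yaml
# Hypothetical secondary network for RDMA traffic (all names are placeholders).
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-net
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens2f0",
      "ipam": { "type": "whereabouts", "range": "192.168.100.0/24" }
    }
---
apiVersion: v1
kind: Pod
metadata:
  name: vertex-udf
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net  # attaches the second NIC
spec:
  containers:
    - name: udf
      image: example/udf:latest
```

In practice, GPUDirect RDMA would need this network backed by an RDMA-capable NIC (e.g. via an SR-IOV or host-device CNI plus the matching device plugin); macvlan here only illustrates the attachment mechanics.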
> GPUDirect RDMA

Could you please provide a link for this?
I’ve updated the README and embedded the links.
Here is a list of links that I have come up with so far.
I will explain the MultiNetwork-related links as a supplement to my response to the previous question. As a prerequisite, in order to use GPUDirect RDMA, the application running inside the container must know the IP addresses of the destination Pods. The related components currently work as follows:
The problem is that EndpointSlice only collects IP addresses from the default network. As a result, this mechanism does not work for MultiNetwork paths that use a second NIC. To address this issue, we are considering two possible approaches.

1. Use the CoreDNS etcd plugin to register a domain name corresponding to each Vertex, along with the list of second-NIC IP addresses assigned to the Pods belonging to that Vertex. Given the current Numaflow specification, where each Vertex already has the name of the next destination Vertex as an environment variable, we believe it is also feasible to provide the domain name of the destination Vertex. By querying CoreDNS during application execution, the application can then obtain the candidate destination IP addresses.
2. Build on the ongoing effort (GEP-3539) to evolve the Service API for MultiNetwork support using the Gateway API.

At this point, we prefer to start with the first approach because it has a lower implementation cost. However, if it turns out to be infeasible, we plan to move forward with the second approach.
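Under the first approach, the application side reduces to an ordinary DNS lookup against the destination Vertex's domain. A sketch under that assumption; the `NUMAFLOW_DEST_VERTEX_DOMAIN` environment variable is hypothetical, not part of Numaflow today:

```python
import os
import socket

def resolve_destination_ips(domain: str) -> list[str]:
    """Ask the configured resolver (CoreDNS in-cluster) for all A records
    of the destination Vertex domain; each returned IP is a candidate
    peer for the direct GPUDirect RDMA connection."""
    infos = socket.getaddrinfo(domain, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})

if __name__ == "__main__":
    # "localhost" stands in for a Vertex domain such as
    # vertex-out.my-pipeline.svc.cluster.local (hypothetical).
    domain = os.environ.get("NUMAFLOW_DEST_VERTEX_DOMAIN", "localhost")
    print(resolve_destination_ips(domain))
```

Re-querying periodically (or on connection failure) would let the application pick up Pods that scale in or out, since the controller keeps the CoreDNS records in sync.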
906242b to
dd7c203
Signed-off-by: Kazuki Yamamoto <yamamoto.kazuki.24@gmail.com>
- Change doc structure
- Add a new chapter (Functionality)
- Swap the order of Workflow and Resource Specification
- Update Workflow and Resource Specification

Signed-off-by: Kazuki Yamamoto <yamamoto.kazuki.24@gmail.com>
47e3644 to
46565e3
I have organized the changes introduced by this proposal into three main points (see "Changes").
My understanding is that you are proposing two changes:

**UDF-to-UDF**: move the data plane out of Numaflow's main container entirely; let UDF containers talk directly to each other over RDMA, keeping GPU data in GPU memory across vertices, and use Numaflow only for orchestration/topology/scaling.

**"multi-isbsvc" mode**: this just routes ISB traffic over the faster second NIC and doesn't need most of this proposal's complexity (no direct UDF communication, no CUDA graphs, no vertexDomainManager).

**My Concerns**: my concern is mostly on UDF-to-UDF:

**My Recommendation**: perhaps we should do "multi-isbsvc" first and then look into UDF-to-UDF, because UDF-to-UDF is a major change.
@vigith We would like to clarify a few points where your understanding of the proposed changes may differ. Our main objective is to introduce a direct communication path using GPUDirect RDMA for existing UDF-to-UDF communication. (This path is designed to be independent of the current ISBSVC-based communication.)

We agree with your approach of starting with changes that have a smaller impact. At the same time, we would like to frame this discussion with the introduction of direct communication as the baseline.
Thanks for the proposal. I share @vigith's concerns about the scope of changes to core.
This way the core platform stays untouched, your controller reacts to the same events, and teams that want RDMA can install your controller alongside Numaflow. If this gets traction and the community sees demand, we can discuss promoting pieces into core.
Thank you for the thoughtful suggestion, @yhl25. I really appreciate the direction you're pointing us toward. I want to make sure I understand your intent correctly. Here is how I'm now thinking about the separation:
Given this, I understand your suggestion as a phased approach: rather than implementing all proposed components in the contrib repo at once (which is not fully feasible for item 3), I should first implement item 1 independently to gain traction, and then open the discussion about the necessary platform-side changes once there is community alignment. Does this match your intent? I want to make sure we're aligned before proceeding.
I agree that steps 1 and 2 can be explored first. In terms of platform change, I'm inclined to make it support a user-defined ISB Service implementation, which means the customized ISB Service implementation runs in a sidecar, and all the customization in steps 1 and 2 should be abstracted as much as possible into the customized ISB Service, e.g.:

```yaml
apiVersion: numaflow.numaproj.io/v1alpha1
kind: InterStepBufferService
metadata:
  name: default
spec:
  customized:
    image: xxxx
```
@sesame0224 please let us know what the name of the repo in numaproj-contrib should be; we can create it and make you an admin of that repo.
@vigith |
Created https://github.com/numaproj-contrib/gpu-direct-comm and invited you as admin |
What this PR does / why we need it
This PR proposes a new method for enabling direct data communication between Numaflow vertices, and describes the motivation, resource specifications, and workflows.
Internal discussions are ongoing, so we will revise the doc as appropriate.
Related issues
#2990
This PR is an initial design derived from the above post.
We would like to discuss this with the community.
Testing
This PR includes only documentation, so no tests were performed.
Special notes for reviewers
Since the base branch was wrong in #3125, I recreated this PR.