There are two levels of assembly for Nvidia GPU's The high level virtual machine assembly is PTX (Portable Thread Execution). The lower level assembly that directly corresponds to the GPU ISA is SASS.
The specification for PTX : https://docs.nvidia.com/cuda/parallel-thread-execution/index.html
Example output can be obtained using the -ptx switch to nvcc. The standard -S switch to clang can also be used.
The venerable vector add example:
__global__ void vector_add(const RealType *a, const RealType *b, RealType *c, const int N)
{
int idx = blockDim.x * blockIdx.x + threadIdx.x;
if (idx < N) {
c[idx] = a[idx] + b[idx];
}
}Running nvcc -ptx vector_add.cu produces vector_add.ptx
//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-26907403
// Cuda compilation tools, release 10.1, V10.1.243
// Based on LLVM 3.4svn
//
.version 6.4
.target sm_30
.address_size 64
// .globl _Z10vector_addPKfS0_Pfi
.visible .entry _Z10vector_addPKfS0_Pfi(
.param .u64 _Z10vector_addPKfS0_Pfi_param_0,
.param .u64 _Z10vector_addPKfS0_Pfi_param_1,
.param .u64 _Z10vector_addPKfS0_Pfi_param_2,
.param .u32 _Z10vector_addPKfS0_Pfi_param_3
)
{
.reg .pred %p<2>;
.reg .f32 %f<4>;
.reg .b32 %r<6>;
.reg .b64 %rd<11>;
ld.param.u64 %rd1, [_Z10vector_addPKfS0_Pfi_param_0];
ld.param.u64 %rd2, [_Z10vector_addPKfS0_Pfi_param_1];
ld.param.u64 %rd3, [_Z10vector_addPKfS0_Pfi_param_2];
ld.param.u32 %r2, [_Z10vector_addPKfS0_Pfi_param_3];
mov.u32 %r3, %ntid.x;
mov.u32 %r4, %ctaid.x;
mov.u32 %r5, %tid.x;
mad.lo.s32 %r1, %r4, %r3, %r5;
setp.ge.s32 %p1, %r1, %r2;
@%p1 bra BB0_2;
cvta.to.global.u64 %rd4, %rd1;
mul.wide.s32 %rd5, %r1, 4;
add.s64 %rd6, %rd4, %rd5;
cvta.to.global.u64 %rd7, %rd2;
add.s64 %rd8, %rd7, %rd5;
ld.global.f32 %f1, [%rd8];
ld.global.f32 %f2, [%rd6];
add.f32 %f3, %f2, %f1;
cvta.to.global.u64 %rd9, %rd3;
add.s64 %rd10, %rd9, %rd5;
st.global.f32 [%rd10], %f3;
BB0_2:
ret;
}
NVCC: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html
Compile CUDA with clang: https://llvm.org/docs/CompileCudaWithLLVM.html
Compiling CUDA -> PTX assembly (text file) -> SASS (.cubin binary file in ELF format) -> possibly collect mulitple .cubin files (.fatbin binary in ELF format)
-> put fatbin in ELF section in final executable
The cuobjdump utility is useful for inspecting .cubin files (https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#cuobjdump)
One import difference from CPU executables is the GPU section of the executable can be in PTX or SASS form. The CUDA module loading functions will accept PTX, .cubin, or .fatbin as input (compiling the PTX as needed)
Fat binaries: https://developer.nvidia.com/blog/cuda-pro-tip-understand-fat-binaries-jit-caching/