Skip to content

Latest commit

 

History

History
98 lines (75 loc) · 2.8 KB

File metadata and controls

98 lines (75 loc) · 2.8 KB

Nvidia Assembly

There are two levels of assembly for Nvidia GPU's The high level virtual machine assembly is PTX (Portable Thread Execution). The lower level assembly that directly corresponds to the GPU ISA is SASS.

PTX

The specification for PTX : https://docs.nvidia.com/cuda/parallel-thread-execution/index.html

Example output can be obtained using the -ptx switch to nvcc. The standard -S switch to clang can also be used.

The venerable vector add example:

__global__ void vector_add(const RealType *a, const RealType *b, RealType *c, const int N)
{
  int idx = blockDim.x * blockIdx.x + threadIdx.x;
  if (idx < N) {
    c[idx] = a[idx] + b[idx];
  }
}

Running nvcc -ptx vector_add.cu produces vector_add.ptx

//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-26907403
// Cuda compilation tools, release 10.1, V10.1.243
// Based on LLVM 3.4svn
//

.version 6.4
.target sm_30
.address_size 64

	// .globl	_Z10vector_addPKfS0_Pfi

.visible .entry _Z10vector_addPKfS0_Pfi(
	.param .u64 _Z10vector_addPKfS0_Pfi_param_0,
	.param .u64 _Z10vector_addPKfS0_Pfi_param_1,
	.param .u64 _Z10vector_addPKfS0_Pfi_param_2,
	.param .u32 _Z10vector_addPKfS0_Pfi_param_3
)
{
	.reg .pred 	%p<2>;
	.reg .f32 	%f<4>;
	.reg .b32 	%r<6>;
	.reg .b64 	%rd<11>;


	ld.param.u64 	%rd1, [_Z10vector_addPKfS0_Pfi_param_0];
	ld.param.u64 	%rd2, [_Z10vector_addPKfS0_Pfi_param_1];
	ld.param.u64 	%rd3, [_Z10vector_addPKfS0_Pfi_param_2];
	ld.param.u32 	%r2, [_Z10vector_addPKfS0_Pfi_param_3];
	mov.u32 	%r3, %ntid.x;
	mov.u32 	%r4, %ctaid.x;
	mov.u32 	%r5, %tid.x;
	mad.lo.s32 	%r1, %r4, %r3, %r5;
	setp.ge.s32	%p1, %r1, %r2;
	@%p1 bra 	BB0_2;

	cvta.to.global.u64 	%rd4, %rd1;
	mul.wide.s32 	%rd5, %r1, 4;
	add.s64 	%rd6, %rd4, %rd5;
	cvta.to.global.u64 	%rd7, %rd2;
	add.s64 	%rd8, %rd7, %rd5;
	ld.global.f32 	%f1, [%rd8];
	ld.global.f32 	%f2, [%rd6];
	add.f32 	%f3, %f2, %f1;
	cvta.to.global.u64 	%rd9, %rd3;
	add.s64 	%rd10, %rd9, %rd5;
	st.global.f32 	[%rd10], %f3;

BB0_2:
	ret;
}

Compilers and toolchain

NVCC: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html

Compile CUDA with clang: https://llvm.org/docs/CompileCudaWithLLVM.html

Compiling CUDA -> PTX assembly (text file) -> SASS (.cubin binary file in ELF format) -> possibly collect mulitple .cubin files (.fatbin binary in ELF format) -> put fatbin in ELF section in final executable

The cuobjdump utility is useful for inspecting .cubin files (https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#cuobjdump)

One import difference from CPU executables is the GPU section of the executable can be in PTX or SASS form. The CUDA module loading functions will accept PTX, .cubin, or .fatbin as input (compiling the PTX as needed)

Fat binaries: https://developer.nvidia.com/blog/cuda-pro-tip-understand-fat-binaries-jit-caching/