Sign-replication is an often-used operation that replicates the sign bit of a SIMD lane into all bits of the lane. There are two reasons why we need to pay attention to sign-replication. First, it can be encoded in WebAssembly SIMD in several ways:
i8x16.shr_s(v, -1)/i16x8.shr_s(v, -1)/i32x4.shr_s(v, -1)/i64x2.shr_s(v, -1)
i8x16.shr_s(v, 7)/i16x8.shr_s(v, 15)/i32x4.shr_s(v, 31)/i64x2.shr_s(v, 63)
i8x16.neg(i8x16.shr_u(v, -1))/i16x8.neg(i16x8.shr_u(v, -1))/i32x4.neg(i32x4.shr_u(v, -1))/i64x2.neg(i64x2.shr_u(v, -1))
i8x16.neg(i8x16.shr_u(v, 7))/i16x8.neg(i16x8.shr_u(v, 15))/i32x4.neg(i32x4.shr_u(v, 31))/i64x2.neg(i64x2.shr_u(v, 63))
i8x16.lt_s(v, v128.const(0))/i16x8.lt_s(v, v128.const(0))/i32x4.lt_s(v, v128.const(0))/i64x2.lt_s(v, v128.const(0))
Second, sign-replication can be lowered in many ways depending on the data type and the target instruction set, as noted by @jan-wassenberg in Sign Select instructions #124.
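For concreteness, here is a minimal scalar sketch in C modeling a single i32 lane (the helper names are hypothetical, and it assumes `>>` on a signed integer is an arithmetic shift, which all mainstream compilers provide). Every encoding listed above yields 0 for a non-negative lane and all-ones for a negative lane:

```c
#include <stdint.h>

/* i32x4.shr_s(v, 31): arithmetic shift replicates the sign bit. */
int32_t sign_replicate_shift(int32_t v) {
    return v >> 31;  /* implementation-defined in C, but arithmetic in practice */
}

/* i32x4.neg(i32x4.shr_u(v, 31)): logical shift isolates the sign bit
   (0 or 1), and negation turns 1 into all-ones. */
int32_t sign_replicate_neg(int32_t v) {
    return -(int32_t)((uint32_t)v >> 31);
}

/* i32x4.lt_s(v, v128.const(0)): signed compare against zero. */
int32_t sign_replicate_cmp(int32_t v) {
    return v < 0 ? -1 : 0;
}
```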
My suggestion is:
To standardize i8x16.shr_s(v, -1)/i16x8.shr_s(v, -1)/i32x4.shr_s(v, -1)/i64x2.shr_s(v, -1) and i8x16.shr_s(v, 7)/i16x8.shr_s(v, 15)/i32x4.shr_s(v, 31)/i64x2.shr_s(v, 63) as the canonical sign-replication instructions, and to recommend that WebAssembly engines lower them differently than other arithmetic shift instructions.
To provide an informative recommendation on optimal lowering depending on the instruction set (see below).
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with AVX512F and AVX512VL instruction sets
i64x2.shr_s(v, -1)/i64x2.shr_s(v, 63)
y = i64x2.shr_s(x, 63) is lowered to VPSRAQ xmm_y, xmm_x, 63
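In C, this lowering corresponds to the `_mm_srai_epi64` intrinsic (AVX512F + AVX512VL), which compiles to VPSRAQ. A sketch with a hypothetical helper name:

```c
#include <immintrin.h>

/* Sign-replicate both 64-bit lanes: a single VPSRAQ on AVX512VL targets. */
__m128i i64x2_sign_replicate_avx512(__m128i x) {
    return _mm_srai_epi64(x, 63);
}
```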
x86/x86-64 processors with AVX instruction set
i8x16.shr_s(v, -1)/i8x16.shr_s(v, 7)
y = i8x16.shr_s(x, 7) (where y is not x) is lowered to:
VPXOR xmm_y, xmm_y, xmm_y
VPCMPGTB xmm_y, xmm_y, xmm_x
x = i8x16.shr_s(x, 7) is lowered to:
VPXOR xmm_tmp, xmm_tmp, xmm_tmp
VPCMPGTB xmm_x, xmm_tmp, xmm_x
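The compare-based pattern is needed because SSE/AVX has no 8-bit shift instructions. A sketch with intrinsics (hypothetical helper name); compiled with AVX enabled, it becomes the VPXOR + VPCMPGTB pair above:

```c
#include <immintrin.h>

/* 0 > x per 8-bit lane: all-ones where x is negative, zero elsewhere. */
__m128i i8x16_sign_replicate_avx(__m128i x) {
    return _mm_cmpgt_epi8(_mm_setzero_si128(), x);
}
```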
i16x8.shr_s(v, -1)/i16x8.shr_s(v, 15)
y = i16x8.shr_s(x, 15) is lowered to VPSRAW xmm_y, xmm_x, 15
i32x4.shr_s(v, -1)/i32x4.shr_s(v, 31)
y = i32x4.shr_s(x, 31) is lowered to VPSRAD xmm_y, xmm_x, 31
i64x2.shr_s(v, -1)/i64x2.shr_s(v, 63)
y = i64x2.shr_s(x, 63) is lowered to:
VPSRAD xmm_y, xmm_x, 31
VPSHUFD xmm_y, xmm_y, 0xF5
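Since AVX (without AVX512) lacks a 64-bit arithmetic shift, the sign of each 64-bit lane is first replicated within its high 32-bit half, and VPSHUFD then copies that high half into both halves of the lane. A sketch with a hypothetical helper name; 0xF5 is _MM_SHUFFLE(3, 3, 1, 1), selecting dwords 1, 1, 3, 3:

```c
#include <immintrin.h>

__m128i i64x2_sign_replicate_avx(__m128i x) {
    __m128i y = _mm_srai_epi32(x, 31);  /* VPSRAD: sign-fill each dword */
    return _mm_shuffle_epi32(y, 0xF5);  /* VPSHUFD: broadcast the high dwords */
}
```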
x86/x86-64 processors with SSE2 instruction set
i8x16.shr_s(v, -1)/i8x16.shr_s(v, 7)
y = i8x16.shr_s(x, 7) (where y is not x) is lowered to:
PXOR xmm_y, xmm_y
PCMPGTB xmm_y, xmm_x
x = i8x16.shr_s(x, 7) is lowered to:
MOVDQA xmm_tmp, xmm_x
PXOR xmm_x, xmm_x
PCMPGTB xmm_x, xmm_tmp
i16x8.shr_s(v, -1)/i16x8.shr_s(v, 15)
y = i16x8.shr_s(x, 15) is lowered to:
PXOR xmm_y, xmm_y
PCMPGTW xmm_y, xmm_x
x = i16x8.shr_s(x, 15) is lowered to PSRAW xmm_x, 15
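Both i16x8 spellings produce the same mask; which one an engine picks depends on whether the destination register may clobber the source, since SSE2 instructions are destructive two-operand forms. In intrinsics (hypothetical helper names) that distinction disappears and is left to the register allocator:

```c
#include <emmintrin.h>

/* Compare against zero: compiles to PXOR + PCMPGTW. */
__m128i i16x8_sign_replicate_cmp(__m128i x) {
    return _mm_cmpgt_epi16(_mm_setzero_si128(), x);
}

/* Arithmetic shift: compiles to a single PSRAW. */
__m128i i16x8_sign_replicate_shift(__m128i x) {
    return _mm_srai_epi16(x, 15);
}
```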
i32x4.shr_s(v, -1)/i32x4.shr_s(v, 31)
y = i32x4.shr_s(x, 31) is lowered to:
PXOR xmm_y, xmm_y
PCMPGTD xmm_y, xmm_x
x = i32x4.shr_s(x, 31) is lowered to PSRAD xmm_x, 31
i64x2.shr_s(v, -1)/i64x2.shr_s(v, 63)
y = i64x2.shr_s(x, 63) is lowered to:
PSHUFD xmm_y, xmm_x, 0xF5
PSRAD xmm_y, 31
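The SSE2 sequence performs the same shuffle-and-shift trick as the AVX one, but with PSHUFD first: PSHUFD writes a fresh destination register, so the source stays intact without an extra MOVDQA. A sketch with a hypothetical helper name:

```c
#include <emmintrin.h>

__m128i i64x2_sign_replicate_sse2(__m128i x) {
    __m128i y = _mm_shuffle_epi32(x, 0xF5);  /* PSHUFD: duplicate high dwords */
    return _mm_srai_epi32(y, 31);            /* PSRAD: replicate their sign bits */
}
```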
ARM64 processors
i8x16.shr_s(v, -1)/i8x16.shr_s(v, 7)
y = i8x16.shr_s(x, 7) is lowered to CMLT Vy.16B, Vx.16B, #0
i16x8.shr_s(v, -1)/i16x8.shr_s(v, 15)
y = i16x8.shr_s(x, 15) is lowered to CMLT Vy.8H, Vx.8H, #0
i32x4.shr_s(v, -1)/i32x4.shr_s(v, 31)
y = i32x4.shr_s(x, 31) is lowered to CMLT Vy.4S, Vx.4S, #0
i64x2.shr_s(v, -1)/i64x2.shr_s(v, 63)
y = i64x2.shr_s(x, 63) is lowered to CMLT Vy.2D, Vx.2D, #0
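In ACLE terms these are the vcltz family of intrinsics. A sketch of the 8-bit case with a hypothetical helper name; the reinterpret is needed because the comparison returns an unsigned mask:

```c
#include <arm_neon.h>

/* CMLT Vy.16B, Vx.16B, #0 via the AArch64-only vcltzq_s8 intrinsic. */
int8x16_t i8x16_sign_replicate_arm64(int8x16_t x) {
    return vreinterpretq_s8_u8(vcltzq_s8(x));
}
```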
ARMv7 processors with NEON extension
i8x16.shr_s(v, -1)/i8x16.shr_s(v, 7)
y = i8x16.shr_s(x, 7) is lowered to VCLT.S8 Qy, Qx, #0
i16x8.shr_s(v, -1)/i16x8.shr_s(v, 15)
y = i16x8.shr_s(x, 15) is lowered to VCLT.S16 Qy, Qx, #0
i32x4.shr_s(v, -1)/i32x4.shr_s(v, 31)
y = i32x4.shr_s(x, 31) is lowered to VCLT.S32 Qy, Qx, #0
i64x2.shr_s(v, -1)/i64x2.shr_s(v, 63)
y = i64x2.shr_s(x, 63) is lowered to VSHR.S64 Qy, Qx, #63
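ARMv7 NEON has compare-less-than-zero only for 8-, 16-, and 32-bit lanes, which is why the 64-bit case falls back to an arithmetic shift. A sketch with ARMv7-compatible intrinsics (hypothetical helper names):

```c
#include <arm_neon.h>

/* VCLT.S32 Qy, Qx, #0 (vcltq_s32 against a zero vector). */
int32x4_t i32x4_sign_replicate_neon(int32x4_t x) {
    return vreinterpretq_s32_u32(vcltq_s32(x, vdupq_n_s32(0)));
}

/* No 64-bit VCLT on ARMv7: VSHR.S64 Qy, Qx, #63 instead. */
int64x2_t i64x2_sign_replicate_neon(int64x2_t x) {
    return vshrq_n_s64(x, 63);
}
```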