cuSPARSELt: A High-Performance CUDA Library for Sparse Matrix-Matrix Multiplication

NVIDIA cuSPARSELt is a high-performance CUDA library dedicated to general matrix-matrix operations in which at least one operand is a structured sparse matrix with a 50% sparsity ratio:

$D = Activation(\alpha op(A) \cdot op(B) + \beta op(C) + bias)$

where $op(A)/op(B)$ refers to in-place operations such as transpose/non-transpose, and $\alpha, \beta$ are scalars or vectors.
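To make the operation concrete, the following is a minimal CPU reference sketch of the formula in plain Python. It does not use cuSPARSELt. The function name `reference_matmul`, the choice of ReLU as the activation, scalar $\alpha$ and $\beta$, a non-transposed $C$, and a bias vector with one entry per row of $D$ are all assumptions made for illustration.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def relu(x):
    """ReLU, used here only as an example activation (an assumption)."""
    return x if x > 0 else 0.0


def reference_matmul(a, b, c, alpha=1.0, beta=0.0, bias=None,
                     trans_a=False, trans_b=False, activation=None):
    """Hypothetical helper: D = activation(alpha * op(A) @ op(B) + beta * C + bias)."""
    op_a = transpose(a) if trans_a else a
    op_b = transpose(b) if trans_b else b
    m, k, n = len(op_a), len(op_b), len(op_b[0])
    assert len(op_a[0]) == k, "inner dimensions of op(A) and op(B) must match"
    d = []
    for i in range(m):
        row = []
        for j in range(n):
            acc = sum(op_a[i][p] * op_b[p][j] for p in range(k))
            val = alpha * acc + beta * c[i][j]
            if bias is not None:
                val += bias[i]  # assumed: one bias entry per output row
            if activation is not None:
                val = activation(val)
            row.append(val)
        d.append(row)
    return d


# A has two nonzeros in every group of four entries (50% structured sparsity).
A = [[1, 0, 2, 0],
     [0, 3, 0, 4]]
B = [[1, 2], [3, 4], [5, 6], [7, 8]]
C = [[1, 1], [1, 1]]
D = reference_matmul(A, B, C, alpha=2.0, beta=1.0,
                     bias=[-30.0, 0.0], activation=relu)
print(D)  # [[0.0, 0.0], [75.0, 89.0]]
```

A reference like this is mainly useful for checking the results of a GPU run on small inputs.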

The cuSPARSELt APIs allow flexibility in the algorithm/operation selection, epilogue, and matrix characteristics, including memory layout, alignment, and data types.

Download: developer.nvidia.com/cusparselt/downloads

Provide Feedback: Math-Libs-Feedback@nvidia.com

Examples: cuSPARSELt Example 1, cuSPARSELt Example 2

Blog post:

Key Features

NVIDIA Sparse MMA tensor core support

Mixed-precision computation support:

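Input A/B	Input C	Output D	Compute	Block scaled	Support SM arch
`FP32`	`FP32`	`FP32`	`FP32`	No	`8.0, 8.6, 8.7, 9.0, 10.0, 10.1, 11.0, 12.0, 12.1`
`BF16`	`BF16`	`BF16`	`FP32`	No	`8.0, 8.6, 8.7, 9.0, 10.0, 10.1, 11.0, 12.0, 12.1`
`FP16`	`FP16`	`FP16`	`FP32`	No	`8.0, 8.6, 8.7, 9.0, 10.0, 10.1, 11.0, 12.0, 12.1`
`FP16`	`FP16`	`FP16`	`FP16`	No	`9.0`
`INT8`	`INT8`	`INT8`	`INT32`	No	`8.0, 8.6, 8.7, 9.0, 10.0, 10.1, 11.0, 12.0, 12.1`
`INT8`	`INT32`	`INT32`	`INT32`	No	`8.0, 8.6, 8.7, 9.0, 10.0, 10.1, 11.0, 12.0, 12.1`
`INT8`	`FP16`	`FP16`	`INT32`	No	`8.0, 8.6, 8.7, 9.0, 10.0, 10.1, 11.0, 12.0, 12.1`
`INT8`	`BF16`	`BF16`	`INT32`	No	`8.0, 8.6, 8.7, 9.0, 10.0, 10.1, 11.0, 12.0, 12.1`
`E4M3`	`FP16`	`E4M3`	`FP32`	No	`9.0, 10.0, 10.1, 11.0, 12.0, 12.1`
`E4M3`	`BF16`	`E4M3`	`FP32`	No	`9.0, 10.0, 10.1, 11.0, 12.0, 12.1`
`E4M3`	`FP16`	`FP16`	`FP32`	No	`9.0, 10.0, 10.1, 11.0, 12.0, 12.1`
`E4M3`	`BF16`	`BF16`	`FP32`	No	`9.0, 10.0, 10.1, 11.0, 12.0, 12.1`
`E4M3`	`FP32`	`FP32`	`FP32`	No	`9.0, 10.0, 10.1, 11.0, 12.0, 12.1`
`E5M2`	`FP16`	`E5M2`	`FP32`	No	`9.0, 10.0, 10.1, 11.0, 12.0, 12.1`
`E5M2`	`BF16`	`E5M2`	`FP32`	No	`9.0, 10.0, 10.1, 11.0, 12.0, 12.1`
`E5M2`	`FP16`	`FP16`	`FP32`	No	`9.0, 10.0, 10.1, 11.0, 12.0, 12.1`
`E5M2`	`BF16`	`BF16`	`FP32`	No	`9.0, 10.0, 10.1, 11.0, 12.0, 12.1`
`E5M2`	`FP32`	`FP32`	`FP32`	No	`9.0, 10.0, 10.1, 11.0, 12.0, 12.1`
`E4M3`	`FP16`	`E4M3`	`FP32`	A/B/D_OUT_SCALE = `VEC64_UE8M0`; D_SCALE = `32F`	`10.0, 10.1, 11.0, 12.0, 12.1`
`E4M3`	`BF16`	`E4M3`	`FP32`	A/B/D_OUT_SCALE = `VEC64_UE8M0`; D_SCALE = `32F`	`10.0, 10.1, 11.0, 12.0, 12.1`
`E4M3`	`FP16`	`FP16`	`FP32`	A/B_SCALE = `VEC64_UE8M0`	`10.0, 10.1, 11.0, 12.0, 12.1`
`E4M3`	`BF16`	`BF16`	`FP32`	A/B_SCALE = `VEC64_UE8M0`	`10.0, 10.1, 11.0, 12.0, 12.1`
`E4M3`	`FP32`	`FP32`	`FP32`	A/B_SCALE = `VEC64_UE8M0`	`10.0, 10.1, 11.0, 12.0, 12.1`
`E2M1`	`FP16`	`E2M1`	`FP32`	A/B/D_SCALE = `VEC32_UE4M3`; D_SCALE = `32F`	`10.0, 10.1, 11.0, 12.0, 12.1`
`E2M1`	`BF16`	`E2M1`	`FP32`	A/B/D_SCALE = `VEC32_UE4M3`; D_SCALE = `32F`	`10.0, 10.1, 11.0, 12.0, 12.1`
`E2M1`	`FP16`	`FP16`	`FP32`	A/B_SCALE = `VEC32_UE4M3`	`10.0, 10.1, 11.0, 12.0, 12.1`
`E2M1`	`BF16`	`BF16`	`FP32`	A/B_SCALE = `VEC32_UE4M3`	`10.0, 10.1, 11.0, 12.0, 12.1`
`E2M1`	`FP32`	`FP32`	`FP32`	A/B_SCALE = `VEC32_UE4M3`	`10.0, 10.1, 11.0, 12.0, 12.1`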

Matrix pruning and compression functionalities
Activation functions, bias vector, and output scaling
Batched computation (multiple matrices in a single run)
GEMM Split-K mode
Auto-tuning functionality (see cusparseLtMatmulSearch())
NVTX ranging and logging functionalities
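The pruning and compression features listed above can be illustrated with a small CPU sketch. It assumes the 50% structured sparsity takes the common 2:4 form (at most two nonzeros in each group of four consecutive row elements) and prunes by magnitude. The function names and the compressed layout (kept values plus their in-group positions) are hypothetical and chosen for readability; they are not the library's API or its internal storage format.

```python
def top2_positions(group):
    """Positions of the two largest-magnitude entries in a group of four."""
    order = sorted(range(4), key=lambda i: (-abs(group[i]), i))
    return sorted(order[:2])


def prune_2_4(row):
    """Zero all but the two largest-magnitude entries in each group of four."""
    assert len(row) % 4 == 0, "row length must be a multiple of 4"
    pruned = []
    for g in range(0, len(row), 4):
        group = row[g:g + 4]
        keep = top2_positions(group)
        pruned.extend(v if i in keep else 0 for i, v in enumerate(group))
    return pruned


def compress_2_4(pruned_row):
    """Store two values and two in-group positions per group of four."""
    values, positions = [], []
    for g in range(0, len(pruned_row), 4):
        group = pruned_row[g:g + 4]
        for i in top2_positions(group):
            values.append(group[i])
            positions.append(i)
    return values, positions


def decompress_2_4(values, positions):
    """Rebuild the dense pruned row from the illustrative compressed form."""
    row = [0] * (2 * len(values))
    for n, (v, p) in enumerate(zip(values, positions)):
        row[(n // 2) * 4 + p] = v
    return row


row = [1, -5, 3, 2, 0, 0, 7, -1]
pruned = prune_2_4(row)
values, positions = compress_2_4(pruned)
print(pruned)     # [0, -5, 3, 0, 0, 0, 7, -1]
print(values)     # [-5, 3, 7, -1]
print(positions)  # [1, 2, 2, 3]
```

The compressed form holds half as many values as the dense row plus a small position index per value, which is the property that makes a fixed 50% structured pattern attractive for hardware.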


Support

Supported SM Architectures: SM 8.0, SM 8.6, SM 8.7, SM 8.9, SM 9.0, SM 10.0, SM 10.1, SM 11.0, SM 12.0, SM 12.1
Supported CPU architectures and operating systems:

OS	CPU archs
`Windows`	`x86_64`
`Linux`	`x86_64`, `Arm64`
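As a small illustration, the supported SM architecture list above can be encoded as a lookup for a quick pre-flight check. The names `SUPPORTED_SM` and `is_supported_sm` are hypothetical, and the sketch only mirrors the list on this page; it does not query the library or the device.

```python
# Supported SM architectures as (major, minor) pairs, copied from the list above.
SUPPORTED_SM = {
    (8, 0), (8, 6), (8, 7), (8, 9), (9, 0),
    (10, 0), (10, 1), (11, 0), (12, 0), (12, 1),
}


def is_supported_sm(major, minor):
    """Return True if the compute capability appears in the supported list."""
    return (major, minor) in SUPPORTED_SM


print(is_supported_sm(8, 9))  # True
print(is_supported_sm(7, 5))  # False
```

In practice the major and minor values would come from the device properties reported by the CUDA runtime. Individual data type combinations are restricted further, as shown in the mixed-precision table.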
