for (unsigned int i = 0; i < hA; ++i) {
    for (unsigned int j = 0; j < wB; ++j) {
        float Cij = 0;
        for (unsigned int k = 0; k < wA; ++k) {
            Cij += A[i][k] * B[k][j];
        }
        C[i][j] = Cij;
    }
}
Next, we also need to understand how matrix computations work on one-dimensional data. A matrix is stored in memory in a linear layout (row-major or column-major); row-major storage, used here, lays out row 0 first, then row 1, and so on. Any element can then be located by index arithmetic: the element at row i, column j sits at position i * w + j in linear memory, where w is the width of the matrix.
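As a small illustration, the index computation can be wrapped in a helper (the name rowMajorIndex is ours, for illustration only):

// Row-major indexing: row i contributes i * w elements before column j.
// Helper name rowMajorIndex is ours, not from the original code.
inline unsigned int rowMajorIndex(unsigned int i, unsigned int j, unsigned int w) {
    return i * w + j;
}
// e.g. in a 3 x 4 matrix, element (1, 2) lives at 1 * 4 + 2 = 6.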
The CPU implementation of this computation is shown below:
/*
 * float *C, *A, *B: data pointers of matrices C, A, B respectively.
 * unsigned int wA: width of A.
 * unsigned int wC: width of C, which equals the width of B.
 * unsigned int hC: height of C, which equals the height of A.
 */
void matrixMulCPU(float *C, const float *A, const float *B,
                  unsigned int wA, unsigned int wC, unsigned int hC) {
    unsigned int hA = hC;
    unsigned int wB = wC;
    for (unsigned int i = 0; i < hA; ++i)
        for (unsigned int j = 0; j < wB; ++j) {
            double sum = 0;
            for (unsigned int k = 0; k < wA; ++k) {
                sum += (double)A[i * wA + k] * (double)B[k * wB + j];
            }
            C[i * wB + j] = (float)sum;
        }
}
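A minimal driver for this function, to make the row-major layout concrete (the sizes and values here are ours; assumes matrixMulCPU above is in scope):

#include <cstdio>
#include <vector>

int main() {
    const unsigned int hA = 2, wA = 3, wB = 2;   // C is hA x wB
    std::vector<float> A = {1, 2, 3,
                            4, 5, 6};            // 2 x 3, row-major
    std::vector<float> B = {1, 0,
                            0, 1,
                            1, 1};               // 3 x 2, row-major
    std::vector<float> C(hA * wB, 0.0f);
    matrixMulCPU(C.data(), A.data(), B.data(), wA, wB, hA);
    for (unsigned int i = 0; i < hA; ++i)        // expected: 4 5 / 10 11
        printf("%6.1f %6.1f\n", C[i * wB + 0], C[i * wB + 1]);
    return 0;
}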
Data that is accessed frequently can be moved from global memory into shared memory, which speeds up the computation. In matrix multiplication (C = A x B), to obtain one row of C (say row i), row i of A must be multiplied with every column of B. Ordinarily, data is loaded from global memory into registers as the computation proceeds, so row i of A would be loaded once per column of B, i.e. K times if B has K columns. With shared memory, we load that data from global memory into shared memory just once and then reuse it repeatedly. The data path thus changes from global memory -> registers to global memory -> shared memory -> registers.
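A sketch of this idea (the kernel name, the one-block-per-row layout, and the launch configuration are our assumptions, not the post's actual MatMul_1D_KERNEL_WITH_SHARED_MEMORY implementation): each block computes one row of C, stages the corresponding row of A in shared memory once, then reuses it for every column of B.

// Each block handles one row of C; the row of A is loaded into shared
// memory once per block and reused for all wB columns of B.
__global__ void MatMulRowSharedKernel(float *C, const float *A, const float *B,
                                      int wA, int wB) {
    extern __shared__ float sA[];                 // one row of A (wA floats)
    int row = blockIdx.x;                         // one block per row of C
    // Cooperative load: global -> shared, once per block.
    for (int k = threadIdx.x; k < wA; k += blockDim.x) {
        sA[k] = A[row * wA + k];
    }
    __syncthreads();
    // Each thread computes one or more elements of this row of C,
    // reading A from shared memory instead of global memory.
    for (int col = threadIdx.x; col < wB; col += blockDim.x) {
        float sum = 0.0f;
        for (int k = 0; k < wA; ++k) {
            sum += sA[k] * B[k * wB + col];
        }
        C[row * wB + col] = sum;
    }
}
// Example launch: one block per row, dynamic shared memory sized to one row:
// MatMulRowSharedKernel<<<hA, 256, wA * sizeof(float)>>>(C, A, B, wA, wB);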
Compared with a 1D block, the biggest difference of a 2D block is that the thread index idx goes from one dimension to two (see the snippet just below). For matrix multiplication, we can split the matrices into sub-matrices and let each block compute one sub-matrix. As the figure below shows, when computing C = A x B, obtaining a single sub-matrix Cs of C (say of size M x M) only requires the corresponding M rows of A and M columns of B.
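For reference, with a 2D grid of 2D blocks each thread derives its two-dimensional global coordinates as follows (a standard CUDA idiom, not code quoted from this post):

// Global 2D coordinates of this thread under 2D blocks and a 2D grid:
int row = blockIdx.y * blockDim.y + threadIdx.y;  // which row of C
int col = blockIdx.x * blockDim.x + threadIdx.x;  // which column of C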
The computation of Cs decomposes into: Cs = As0 x Bs0 + As1 x Bs1 + ... + Asm x Bsm. As the figure below shows, we partition the data into square tiles of width M, so each small matrix is M x M. Why split the computation at all, when computing it directly would be simpler? Precisely to use shared memory and reduce the number of data loads. In the expression above, a product such as As0 x Bs0 involves matrices small enough to fit entirely in shared memory, saving each element M - 1 global-memory accesses.
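To make the savings concrete, the decomposition and the load-count arithmetic can be written out (the tile notation $A_{s_k}$, $B_{s_k}$ is ours):

$$
C_s = \sum_{k=0}^{m} A_{s_k} B_{s_k}, \qquad m = \frac{w_A}{M} - 1,
$$

where each $A_{s_k}$ and $B_{s_k}$ is an $M \times M$ tile. Without shared memory, a tile element $A_{s_k}[i][j]$ is fetched from global memory by each of the $M$ threads computing row $i$ of the tile product, i.e. $M$ times; staged in shared memory, it is fetched once, which is exactly the saving of $M - 1$ reads per element.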
In general, the tile size M x M is chosen to match the size of the 2D CUDA block, which simplifies the computation.
The key parts of the optimized code are as follows:
template <int BLOCK_SIZE>
__global__ void MatMulKernel2DBlockMultiplesSize(float *C, float *A, float *B,
                                                 int wA, int wB) {
    // ... omit init ...

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        // Shared memory for the tiles As and Bs:
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Each thread loads one element of each tile from global memory:
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two tiles together, accumulating into Csub:
#pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k) {
            Csub += As[ty][k] * Bs[k][tx];
        }
        __syncthreads();
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
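The omitted initialization follows the NVIDIA CUDA matrixMul sample that this kernel is modeled on; the sketch below is a reconstruction of what it sets up, not code quoted from the post:

// 2D block and thread indices:
int bx = blockIdx.x,  by = blockIdx.y;
int tx = threadIdx.x, ty = threadIdx.y;

int aBegin = wA * BLOCK_SIZE * by;  // first tile of A processed by this block
int aEnd   = aBegin + wA - 1;       // bound marking the last tile of A
int aStep  = BLOCK_SIZE;            // offset to step A to its next tile
int bBegin = BLOCK_SIZE * bx;       // first tile of B processed by this block
int bStep  = BLOCK_SIZE * wB;       // offset to step B to its next tile

float Csub = 0;                     // accumulator for this thread's C element

A matching launch, assuming both matrix dimensions are multiples of the tile size:

dim3 threads(32, 32);                          // BLOCK_SIZE = 32
dim3 grid(wB / threads.x, hA / threads.y);     // hA: height of A (and of C)
MatMulKernel2DBlockMultiplesSize<32><<<grid, threads>>>(C, A, B, wA, wB);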
$ ./matMul help
Usage -device=n (n >= 0 for deviceID)
      -wA=WidthA -hA=HeightA (Width x Height of Matrix A)
      -wB=WidthB -hB=HeightB (Width x Height of Matrix B)
      -iter=n Iteration numbers of algorithm. Default: 500
      -algo=[0|1|2|3|4|5]
          0: Test all
          1: MatMul_1D_KERNEL
          2: MatMul_1D_KERNEL_WITH_SHARED_MEMORY
          3: MatMul_2D_KERNEL_BLOCK_MULTIPLES_SIZE
          4: MatMul_2D_KERNEL_ANY_SIZE
          5: MatMul_CUBLAS_SGEMM_KERNEL
Note: Outer matrix dimensions of A & B matrices must be equal.
Results (tested on an A100):
./matMul -wA=1024 -hA=256 -wB=128 -hB=1024
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
MatrixA(1024,256), MatrixB(128,1024)
========================= 1D blocks without shared memory =================
Computing result using MatrixMul1DTest Shared Mem: 0
Warmup operation done
Performance= 883.88 GFlop/s, Time= 0.076 msec, Size= 67108864 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
========================= 1D blocks with shared memory ===================
Computing result using MatrixMul1DTest Shared Mem: 1
Warmup operation done
Performance= 227.81 GFlop/s, Time= 0.295 msec, Size= 67108864 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
========================= 2D blocks with block multiples size =============
Computing result using MatMul2DTest Kernel.
Warmup operation done
Performance= 1120.85 GFlop/s, Time= 0.060 msec, Size= 67108864 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
========================= 2D blocks with any size ========================
Computing result using MatMul2DTest Kernel. Support any size, e.g. wA=1000 hA=312 wB=11 hB=1000.
Warmup operation done
Performance= 1303.89 GFlop/s, Time= 0.051 msec, Size= 67108864 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
========================= CUBLAS Sgemm kernel ========================
Computing result using CUBLAS Sgemm Kernel.
Warmup operation done
Performance= 7189.46 GFlop/s, Time= 0.009 msec, Size= 67108864 Ops,
Checking computed result for correctness: Result = PASS