Gurwinder GPU SDE | AI & Graphics @ Intel, Loves to Work on AI, Games & AR/VR

Comparing SYCL, OpenCL, and CUDA: Matrix Multiplication Example

Matrix multiplication is a core operation in scientific and engineering applications, often accelerated using specialized programming models like SYCL, OpenCL, and CUDA. These models leverage GPUs for parallel computation. Let’s delve into how matrix multiplication is implemented in each framework and compare their approaches.

Overview

Feature     | SYCL                  | OpenCL                                              | CUDA
Language    | C++                   | C, C++                                              | C, C++
Hardware    | CPUs, GPUs, FPGAs     | CPUs, GPUs, FPGAs, DSPs                             | NVIDIA GPUs
Portability | High, built on OpenCL | High, open standard                                 | Low, NVIDIA GPUs only
Ease of Use | High, modern C++      | Moderate, requires knowledge of parallel computing  | High, but with a learning curve

CUDA

CUDA, developed by NVIDIA, is a widely used parallel computing platform and programming model for NVIDIA GPUs.


In CUDA programming, efficient GPU utilization involves understanding threads, blocks, and grids:

Threads

A thread is the smallest unit of execution in CUDA. Each thread executes the same kernel function independently, but with different data. Threads within the same block can cooperate via shared memory and synchronization mechanisms.
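To make this concrete, here is a minimal, hypothetical kernel (separate from the article's matrix example) in which each thread computes a global index from its block and thread indices and processes a single array element:

// Minimal illustrative kernel: every thread runs the same function,
// but each one derives a different global index and touches one element.
__global__ void scaleArray(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // guard threads past the end of the array
        data[i] *= factor;
    }
}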

Blocks

A block is a group of threads that execute the same kernel code and are scheduled together on a single streaming multiprocessor. Threads within a block can synchronize with each other and share data through shared memory.

  • Block Dimensions: A block can contain at most 1,024 threads in total (for compute capability 2.x and later). Blocks are organized in up to three dimensions (x, y, and z), with per-dimension limits of 1,024, 1,024, and 64 threads respectively; the product blockDim.x * blockDim.y * blockDim.z must not exceed the per-block maximum supported by the GPU architecture.
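As a small illustrative sketch of the block-level cooperation described above (again separate from the matrix example, and assuming the kernel is launched with 256 threads per block), threads can stage data in shared memory and synchronize with __syncthreads() before reading each other's values:

// Illustrative sketch: each block produces one partial sum of its input slice.
// Assumes a launch configuration of 256 threads per block.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];                     // one slot per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;     // each thread loads one element
    __syncthreads();                                // wait until the whole tile is filled
    if (threadIdx.x == 0) {                         // thread 0 sums its block's tile
        float sum = 0.0f;
        for (int t = 0; t < blockDim.x; ++t) sum += tile[t];
        out[blockIdx.x] = sum;                      // one partial sum per block
    }
}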

Grids

A grid is a collection of blocks that execute the kernel function. Blocks in a grid can execute independently of each other and are scheduled on streaming multiprocessors (SMs) of the GPU.

  • Grid Dimensions: Grids are likewise organized in up to three dimensions (gridDim.x, gridDim.y, and gridDim.z). On GPUs with compute capability 3.0 and later, gridDim.x can be as large as 2^31 - 1 blocks, while gridDim.y and gridDim.z are limited to 65,535. The total number of blocks in a grid (gridDim.x * gridDim.y * gridDim.z) depends on the size of the computation and the resources available on the GPU.
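Because these limits vary with the GPU's compute capability, it is safest to query them at runtime. A minimal sketch using the standard cudaGetDeviceProperties call:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}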

Relationship and Usage

Thread Indexing

Each thread has an index (threadIdx.x, threadIdx.y, threadIdx.z) that identifies its position within its block; combined with the block index and block dimensions, it gives the thread a unique position in the grid.

Block Indexing

Each block within a grid has a unique index (blockIdx.x, blockIdx.y, blockIdx.z) that identifies its position within the grid.

Grid Configuration

When launching a kernel, you specify the dimensions of the grid and the dimensions of each block (dim3 type in CUDA). This configuration determines how the kernel threads are organized and executed on the GPU.

CUDA Matrix Multiplication Example

#include <cuda_runtime.h>
#include <iostream>

__global__ void matrixMul(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N) {
        float sum = 0.0f;
        for (int i = 0; i < N; ++i) {
            sum += A[row * N + i] * B[i * N + col];
        }
        C[row * N + col] = sum;
    }
}
int main() {
    const int N = 1024;
    float *h_A, *h_B, *h_C;
    float *d_A, *d_B, *d_C;
    size_t size = N * N * sizeof(float);

    // Allocate memory on host
    h_A = (float*)malloc(size);
    h_B = (float*)malloc(size);
    h_C = (float*)malloc(size);

    // Initialize matrices h_A and h_B
    // ...

    // Allocate memory on device
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Copy matrices from host to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Define grid and block dimensions
    dim3 blockSize(16, 16);
    dim3 gridSize((N + blockSize.x - 1) / blockSize.x, (N + blockSize.y - 1) / blockSize.y);

    // Launch kernel
    matrixMul<<<gridSize, blockSize>>>(d_A, d_B, d_C, N);
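    // Illustrative addition (not in the original listing): check for launch
    // errors and wait for the kernel to finish before reading results.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::cerr << "Kernel launch failed: " << cudaGetErrorString(err) << std::endl;
    }
    cudaDeviceSynchronize();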

    // Copy result matrix from device to host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // Free device and host memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(h_A);
    free(h_B);
    free(h_C);
    return 0;
}

OpenCL

OpenCL is an open standard for parallel programming of heterogeneous systems, supporting CPUs, GPUs, and FPGAs.

OpenCL Matrix Multiplication Example

#include <CL/cl.h>
#include <iostream>
#define MAX_SOURCE_SIZE (0x100000)
int main() {
    const int N = 1024;
    float *h_A, *h_B, *h_C;
    cl_mem d_A, d_B, d_C;
    cl_platform_id platform_id = NULL;
    cl_device_id device_id = NULL;
    cl_context context = NULL;
    cl_command_queue command_queue = NULL;
    cl_program program = NULL;
    cl_kernel kernel = NULL;
    cl_uint ret_num_devices;
    cl_uint ret_num_platforms;
    cl_int ret;

    size_t size = N * N * sizeof(float);
    h_A = (float*)malloc(size);
    h_B = (float*)malloc(size);
    h_C = (float*)malloc(size);

    // Initialize matrices h_A and h_B
    // ...
    // Get platform and device information
    ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
    ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, &ret_num_devices);

    // Create OpenCL context
    context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &ret);

    // Create command queue
    command_queue = clCreateCommandQueue(context, device_id, 0, &ret);

    // Create memory buffers for matrices
    d_A = clCreateBuffer(context, CL_MEM_READ_ONLY, size, NULL, &ret);
    d_B = clCreateBuffer(context, CL_MEM_READ_ONLY, size, NULL, &ret);
    d_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, size, NULL, &ret);

    // Copy matrices from host to device
    ret = clEnqueueWriteBuffer(command_queue, d_A, CL_TRUE, 0, size, h_A, 0, NULL, NULL);
    ret = clEnqueueWriteBuffer(command_queue, d_B, CL_TRUE, 0, size, h_B, 0, NULL, NULL);
    // Create and build the compute program
    const char *source_str = "kernel void matrixMul(global float *A, global float *B, global float *C, int N) { "
                             "   int row = get_global_id(0); "
                             "   int col = get_global_id(1); "
                             "   if (row < N && col < N) { "
                             "       float sum = 0.0f; "
                             "       for (int i = 0; i < N; ++i) { "
                             "           sum += A[row * N + i] * B[i * N + col]; "
                             "       } "
                             "       C[row * N + col] = sum; "
                             "   } "
                             "}

";

    program = clCreateProgramWithSource(context, 1, (const char **)&source_str, NULL, &ret);
    ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
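    // Illustrative addition (not in the original listing): if the build fails,
    // fetch and print the compiler log to aid debugging.
    if (ret != CL_SUCCESS) {
        size_t log_size = 0;
        clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
        char *log = (char *)malloc(log_size);
        clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
        std::cerr << "Build log:\n" << log << std::endl;
        free(log);
    }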

    // Create the OpenCL kernel
    kernel = clCreateKernel(program, "matrixMul", &ret);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&d_A);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&d_B);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&d_C);
    clSetKernelArg(kernel, 3, sizeof(int), (void *)&N);
    // Execute the OpenCL kernel
    size_t global_item_size[2] = {N, N};
    size_t local_item_size[2] = {16, 16};
    ret = clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_item_size, local_item_size, 0, NULL, NULL);

    // Copy result matrix from device to host
    ret = clEnqueueReadBuffer(command_queue, d_C, CL_TRUE, 0, size, h_C, 0, NULL, NULL);

    // Clean up
    ret = clFlush(command_queue);
    ret = clFinish(command_queue);
    ret = clReleaseKernel(kernel);
    ret = clReleaseProgram(program);
    ret = clReleaseMemObject(d_A);
    ret = clReleaseMemObject(d_B);
    ret = clReleaseMemObject(d_C);
    ret = clReleaseCommandQueue(command_queue);
    ret = clReleaseContext(context);
    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);
    return 0;
}

SYCL

SYCL is a higher-level, single-source C++ programming model for heterogeneous devices. It was originally specified on top of OpenCL and provides a more user-friendly approach to GPU programming.


SYCL Matrix Multiplication Example

#include <CL/sycl.hpp>
#include <iostream>

using namespace sycl;

int main() {
    const int N = 1024;
    float *h_A = new float[N * N];
    float *h_B = new float[N * N];
    float *h_C = new float[N * N];
    // Initialize matrices h_A and h_B
    // ...
    // Create SYCL queue
    queue q;

    {
        // Create buffers
        buffer<float, 2> d_A(h_A, range<2>(N, N));
        buffer<float, 2> d_B(h_B, range<2>(N, N));
        buffer<float, 2> d_C(h_C, range<2>(N, N));

        // Submit command group to queue
        q.submit([&](handler &cgh) {
            // Accessors
            auto A = d_A.get_access<access::mode::read>(cgh);
            auto B = d_B.get_access<access::mode::read>(cgh);
            auto C = d_C.get_access<access::mode::write>(cgh);

            // Kernel
            cgh.parallel_for<class matrixMul>(range<2>(N, N), [=](id<2> index) {
                int row = index[0];
                int col = index[1];
                float sum = 0.0f;
                for (int i = 0; i < N; ++i) {
                    sum += A[row][i] * B[i][col];
                }
                C[row][col] = sum;
            });
        });
    }
    // Buffers go out of scope here and the result is copied back to h_C

    delete[] h_A;
    delete[] h_B;
    delete[] h_C;
    return 0;
}

Comparison

Programming Model

  • CUDA: Kernel-based extension of C/C++ with fine-grained control over the GPU.
  • OpenCL: C host API plus a C-based kernel language for parallel computation across diverse architectures.
  • SYCL: Single-source C++ programming with simplified memory management.

Memory Management

  • CUDA: Explicit allocation (cudaMalloc, cudaMemcpy) and deallocation (cudaFree).
  • OpenCL: Command queue manages data transfer (clEnqueueWriteBuffer, clEnqueueReadBuffer) and kernel execution.
  • SYCL: Buffers and accessors handle allocation and data movement automatically, tied to standard C++ scoping.

Ease of Use and Portability

  • CUDA: High performance and NVIDIA GPU integration but less portable.
  • OpenCL: Broad platform support but may require more optimization effort.
  • SYCL: Combines OpenCL’s portability with easier C++ programming, ideal for maintainability.

Conclusion

Choosing between CUDA, OpenCL, and SYCL depends on performance needs, hardware availability, and developer expertise. CUDA excels in NVIDIA GPU performance, while OpenCL offers broader hardware support and SYCL provides a user-friendly C++ interface with portability.
