Introduction

In this post, I explore the performance characteristics of a simple vector addition task implemented on both CPU and GPU using CUDA.
This experiment is part of my hands-on learning in GPU programming and performance profiling.


Environment

  • CUDA Toolkit: 12.6.20
  • Platform: Windows 11
  • GPU: NVIDIA RTX 3060
  • Timing: cudaEventElapsedTime (GPU), std::chrono (CPU)

Implementation

CPU Version

void vectorAddCPU(const float* A, const float* B, float* C, const int N) {
    for (int i = 0; i < N; ++i)
        C[i] = A[i] + B[i];
}

GPU Version

__global__ void vectorAdd(const float* A, const float* B, float* C, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // global thread index
    if (i < N)  // guard: the last block may contain threads past the end of the array
        C[i] = A[i] + B[i];
}
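
The post focuses on the kernel, so here is a minimal host-side sketch of how it might be allocated, copied, and launched. The problem size, initialization values, and 256-thread block configuration are my assumptions, not the original harness:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int N = 1'000'000;                  // assumed problem size
    const size_t bytes = N * sizeof(float);

    // Host buffers with simple test data
    float* hA = new float[N];
    float* hB = new float[N];
    float* hC = new float[N];
    for (int i = 0; i < N; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Device buffers
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    // Copy inputs to the device
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all N elements
    const int threads = 256;
    const int blocks = (N + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(dA, dB, dC, N);

    // Copy the result back and spot-check it
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", hC[0]);             // expect 3.0

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    delete[] hA; delete[] hB; delete[] hC;
    return 0;
}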

Benchmark Results

I tested both versions across varying data sizes. ★ marks the faster method for each size.

N (Elements)      CPU Time (ms)    GPU Time (ms)
512               ★ 0.0009         0.5263
1,000,000         1.5718           ★ 0.5168
1,000,000,000     2670.95          ★ 36.4035
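
For context, the Environment section lists std::chrono for the CPU and CUDA events for the GPU. Below is a sketch of how such measurements are typically taken; the actual harness may differ, and the variable names carry over from the launch sketch above:

#include <chrono>
#include <cuda_runtime.h>

// CPU side: wall-clock time around the plain loop
auto t0 = std::chrono::high_resolution_clock::now();
vectorAddCPU(hA, hB, hC, N);
auto t1 = std::chrono::high_resolution_clock::now();
double cpuMs = std::chrono::duration<double, std::milli>(t1 - t0).count();

// GPU side: events bracket the kernel only, so transfers are excluded
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
vectorAdd<<<blocks, threads>>>(dA, dB, dC, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpuMs = 0.0f;
cudaEventElapsedTime(&gpuMs, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);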

Analysis

  • For small arrays (N ≤ 512), the CPU is faster because the fixed cost of launching a GPU kernel dominates. (Host-device memory transfer time is excluded from these measurements.)
  • Around N = 1 million, the GPU starts to outperform the CPU thanks to its massive parallelism; at N = 1 billion the measured speedup is roughly 73×.
  • Kernel execution itself is extremely fast on the GPU, but for smaller problems the cost of moving data between host and device can dominate the end-to-end time.

What I Learned

  • Always guard against out-of-bounds access in CUDA kernels with a bounds check such as if (i < N).
  • For real-time or low-latency systems, overlapping transfers (cudaMemcpyAsync) with kernel execution may be essential (see the sketch after this list).
  • CUDA features
    • __global__ : Marks a function as device code that is launched from the host
    • cudaMalloc : Allocates device memory (the GPU counterpart of malloc())
    • cudaMemcpy : Copies data between host and device
    • cudaMemcpyHostToDevice / cudaMemcpyDeviceToHost : Specify the direction of the transfer
    • cudaEventCreate : Creates an event used as a timing marker
    • cudaEventRecord : Records an event at a point in the GPU's work queue
    • cudaEventSynchronize : Blocks the host until the event has completed
    • cudaEventElapsedTime : Returns the elapsed time in milliseconds between two recorded events
    • cudaFree : Frees device memory (the GPU counterpart of free())
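
To illustrate the cudaMemcpyAsync point above, here is a hypothetical sketch (the function name, two-stream split, and chunking scheme are mine, not from the post) that lets copies for one chunk overlap with the kernel for another. It assumes the host buffers were allocated with cudaMallocHost (pinned memory), which async copies need in order to actually overlap:

#include <cuda_runtime.h>

// Hypothetical: process the array in chunks on two streams so that
// transfers and kernel execution from different chunks can overlap.
void vectorAddOverlapped(const float* hA, const float* hB, float* hC,
                         float* dA, float* dB, float* dC, int N) {
    const int kStreams = 2;
    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    const int chunk = (N + kStreams - 1) / kStreams;
    const int threads = 256;

    for (int s = 0; s < kStreams; ++s) {
        const int offset = s * chunk;
        const int count = (offset + chunk <= N) ? chunk : N - offset;
        if (count <= 0) break;
        const size_t bytes = count * sizeof(float);
        const int blocks = (count + threads - 1) / threads;

        // Each stream copies its chunk in, runs the kernel, and copies out
        cudaMemcpyAsync(dA + offset, hA + offset, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        cudaMemcpyAsync(dB + offset, hB + offset, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        vectorAdd<<<blocks, threads, 0, streams[s]>>>(dA + offset, dB + offset,
                                                      dC + offset, count);
        cudaMemcpyAsync(hC + offset, dC + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}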

GitHub

👉 CUDA Examples GitHub Repository


Thanks for reading!