Overview

In this session, I explored how memory transfer times vary across several CUDA memory allocation strategies by measuring the time it takes to transfer 2GB of data using each of the following (a setup sketch follows the list):

  • cudaMalloc (device memory)
  • cudaMallocManaged (unified memory)
  • cudaHostAllocMapped (zero-copy)
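For context, here is a minimal sketch of how the three setups might be allocated. This is my illustration rather than the exact code from the repo; error checking is omitted:

    #include <cuda_runtime.h>

    const size_t kBytes = 1ULL << 31;  // 2GB, matching the experiments

    void allocate_buffers() {
        // 1) cudaMalloc: plain device memory; data moves via explicit cudaMemcpy.
        void* dev_ptr = nullptr;
        cudaMalloc(&dev_ptr, kBytes);

        // 2) cudaMallocManaged: unified memory; pages migrate on demand,
        //    or ahead of time via cudaMemPrefetchAsync.
        unsigned char* unified_ptr = nullptr;
        cudaMallocManaged((void**)&unified_ptr, kBytes);

        // 3) cudaHostAlloc + cudaHostAllocMapped: pinned host memory the GPU
        //    accesses directly over PCIe (zero-copy); the device-side alias
        //    comes from cudaHostGetDevicePointer.
        unsigned char* host_ptr = nullptr;
        cudaHostAlloc((void**)&host_ptr, kBytes, cudaHostAllocMapped);
        unsigned char* dev_alias = nullptr;
        cudaHostGetDevicePointer((void**)&dev_alias, host_ptr, 0);
    }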

All timings were measured by a custom RAII-style C++ timer class that automatically logs elapsed time when leaving scope.

Measurement Setup

Each experiment allocates 2GB of memory and then does one of the following:

  • copies it from host to device,
  • copies it from device to host,
  • or lets the device access host memory directly (zero-copy).

The elapsed time is measured on the CPU side using std::chrono::high_resolution_clock.
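As a rough sketch, an RAII timer of this kind (the AutoTimeLogger mentioned in the notes below) could look like the following; the real implementation lives in the linked repo:

    #include <chrono>
    #include <iostream>
    #include <string>

    // Sketch of an RAII timer in the spirit of AutoTimeLogger: it records the
    // start time on construction and prints the elapsed wall-clock time when
    // the object goes out of scope.
    class AutoTimeLogger {
    public:
        explicit AutoTimeLogger(std::string label)
            : label_(std::move(label)),
              start_(std::chrono::high_resolution_clock::now()) {}

        ~AutoTimeLogger() {
            auto end = std::chrono::high_resolution_clock::now();
            std::chrono::duration<double, std::milli> ms = end - start_;
            std::cout << label_ << " Elapsed time: " << ms.count() << " ms\n";
        }

    private:
        std::string label_;
        std::chrono::high_resolution_clock::time_point start_;
    };

    // Usage: the destructor fires when the scope closes, stopping the clock.
    // {
    //     AutoTimeLogger t("cudaMalloc HtoD");
    //     cudaMemcpy(dev_ptr, host_src, kBytes, cudaMemcpyHostToDevice);
    // }

One caveat with CPU-side scope timers: asynchronous CUDA calls return as soon as the work is enqueued, so unless a cudaDeviceSynchronize() happens inside the timed scope, the timer captures only launch overhead. This matters for interpreting two of the results below.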


Results

cudaMalloc HtoD Elapsed time: 186.46 ms
cudaMalloc DtoH Elapsed time: 213.094 ms
cudaMallocManaged HtoD PrefetchAsync Elapsed time: 0.0199 ms
cudaMallocManaged HtoD Call kernel Elapsed time: 1655.28 ms
Refer unified_ptr from CPU: 0
cudaMallocManaged DtoH Elapsed time: 0.6585 ms
Zero-Copy HtoD Elapsed time: 82.9358 ms
host_ptr[0] = 0
Zero-Copy DtoH Elapsed time: 0.0522 ms

Observations

  • cudaMalloc provides stable transfer speeds in both directions (roughly 11.5 GB/s HtoD and 10 GB/s DtoH for the 2GB buffer), which is consistent with explicit copies bounded by the PCIe link.
  • ⚠️ cudaMallocManaged performs very poorly when a GPU kernel first touches the data. This is likely due to page fault–based migration via unified virtual memory: the 1655 ms figure bundles on-demand migration of the full 2GB with the kernel's own work, and the kernel's execution time is not yet isolated.
  • ⚠️ Prefetching unified memory with cudaMemPrefetchAsync() looks extremely fast, but the 0.0199 ms figure is almost certainly not the transfer itself: the call is asynchronous and returns as soon as the migration is enqueued, so a CPU-side timer records only launch overhead unless the stream is synchronized inside the timed scope (see the sketch after this list). It is also possible that no migration was pending because the pages were already resident.
  • ⚠️ Zero-Copy memory appears very fast, especially in the "DtoH" direction. But mapped memory already resides in host RAM, so there is no bulk device-to-host copy to perform; the 0.0522 ms reading is more plausibly access or launch overhead than a hidden 2GB transfer.
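To check the prefetch hypothesis, the measurement could be repeated with a synchronization inside the timed scope. A sketch, reusing kBytes, unified_ptr, and AutoTimeLogger from the snippets above:

    // Time the actual page migration, not just the enqueue. Without the
    // synchronize, the CPU timer stops as soon as the prefetch is submitted,
    // which would explain a ~0.02 ms reading for a 2GB buffer.
    int device = 0;
    cudaGetDevice(&device);
    {
        AutoTimeLogger t("cudaMallocManaged HtoD PrefetchAsync (synced)");
        cudaMemPrefetchAsync(unified_ptr, kBytes, device, /*stream=*/0);
        cudaDeviceSynchronize();  // wait until the migration has completed
    }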

Technical Notes

  • Memory size: 2GB (1ULL << 31 bytes)
  • Custom class AutoTimeLogger logs time upon destruction, simplifying measurement.
  • Experiments were built and run using nvcc and MSVC on Windows 11 with a GeForce RTX 3060.

Next Steps

To improve the precision and fairness of future measurements:

  • 🔍 Use cudaEventRecord() to measure only GPU-side kernel execution time, separating transfer from computation (see the sketch after this list).
  • ⚖️ Use the same kernel across all memory types to enable fair comparison.
  • 🧠 Design access patterns that defeat GPU-side caching to better reflect raw transfer performance.
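As a sketch of the event-based approach (touch_kernel is a hypothetical placeholder for whatever kernel reads the buffer, not code from the repo):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel that touches every byte of the buffer.
    __global__ void touch_kernel(unsigned char* data, size_t n) {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    // CUDA events are recorded into the stream, so the elapsed time brackets
    // only the GPU-side work between them, independent of CPU overhead.
    void time_kernel(unsigned char* buf, size_t n) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        touch_kernel<<<(unsigned)((n + 255) / 256), 256>>>(buf, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);  // block until `stop` has been reached

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        std::printf("Kernel Elapsed time: %f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }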

Thanks for reading!
👉 GitHub Repo (cuda-examples)