Overview

In this session, I explored how memory transfer times vary across several CUDA memory allocation strategies by measuring the time it takes to transfer 2GB of data using each of the following (a setup sketch follows the list):

  • cudaMalloc (device memory)
  • cudaMallocManaged (unified memory)
  • cudaHostAllocMapped (zero-copy)
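For context, here is a minimal sketch of how the three setups might be allocated. This is my illustration rather than the exact code from the repo; error checking is omitted:

    #include <cuda_runtime.h>

    const size_t kBytes = 1ULL << 31;  // 2GB, matching the experiments

    void allocate_buffers() {
        // 1) cudaMalloc: plain device memory; data moves via explicit cudaMemcpy.
        void* dev_ptr = nullptr;
        cudaMalloc(&dev_ptr, kBytes);

        // 2) cudaMallocManaged: unified memory; pages migrate on demand,
        //    or ahead of time via cudaMemPrefetchAsync.
        unsigned char* unified_ptr = nullptr;
        cudaMallocManaged((void**)&unified_ptr, kBytes);

        // 3) cudaHostAlloc + cudaHostAllocMapped: pinned host memory the GPU
        //    accesses directly over PCIe (zero-copy); the device-side alias
        //    comes from cudaHostGetDevicePointer.
        unsigned char* host_ptr = nullptr;
        cudaHostAlloc((void**)&host_ptr, kBytes, cudaHostAllocMapped);
        unsigned char* dev_alias = nullptr;
        cudaHostGetDevicePointer((void**)&dev_alias, host_ptr, 0);
    }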

All timings were measured by a custom RAII-style C++ timer class that automatically logs elapsed time when leaving scope.

Measurement Setup

Each experiment allocates 2GB of memory and then does one of the following:

  • copies it from host to device,
  • copies it from device to host,
  • or lets the device access host memory directly (zero-copy).

The elapsed time is measured on the CPU side using std::chrono::high_resolution_clock.
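As a rough sketch, an RAII timer of this kind (the AutoTimeLogger mentioned in the notes below) could look like the following; the real implementation lives in the linked repo:

    #include <chrono>
    #include <iostream>
    #include <string>

    // Sketch of an RAII timer in the spirit of AutoTimeLogger: it records the
    // start time on construction and prints the elapsed wall-clock time when
    // the object goes out of scope.
    class AutoTimeLogger {
    public:
        explicit AutoTimeLogger(std::string label)
            : label_(std::move(label)),
              start_(std::chrono::high_resolution_clock::now()) {}

        ~AutoTimeLogger() {
            auto end = std::chrono::high_resolution_clock::now();
            std::chrono::duration<double, std::milli> ms = end - start_;
            std::cout << label_ << " Elapsed time: " << ms.count() << " ms\n";
        }

    private:
        std::string label_;
        std::chrono::high_resolution_clock::time_point start_;
    };

    // Usage: the destructor fires when the scope closes, stopping the clock.
    // {
    //     AutoTimeLogger t("cudaMalloc HtoD");
    //     cudaMemcpy(dev_ptr, host_src, kBytes, cudaMemcpyHostToDevice);
    // }

One caveat with CPU-side scope timers: asynchronous CUDA calls return as soon as the work is enqueued, so unless a cudaDeviceSynchronize() happens inside the timed scope, the timer captures only launch overhead. This matters for interpreting two of the results below.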


Results

cudaMalloc HtoD Elapsed time: 186.46 ms
cudaMalloc DtoH Elapsed time: 213.094 ms
cudaMallocManaged HtoD PrefetchAsync Elapsed time: 0.0199 ms
cudaMallocManaged HtoD Call kernel Elapsed time: 1655.28 ms
Refer unified_ptr from CPU: 0
cudaMallocManaged DtoH Elapsed time: 0.6585 ms
Zero-Copy HtoD Elapsed time: 82.9358 ms
host_ptr[0] = 0
Zero-Copy DtoH Elapsed time: 0.0522 ms

Observations

  • cudaMalloc provides stable transfer speeds in both directions (roughly 11.5 GB/s HtoD and 10 GB/s DtoH for the 2GB buffer), which is consistent with explicit copies bounded by the PCIe link.
  • ⚠️ cudaMallocManaged performs very poorly when a GPU kernel first touches the data. This is likely due to page fault–based migration via unified virtual memory: the 1655 ms figure bundles on-demand migration of the full 2GB with the kernel's own work, and the kernel's execution time is not yet isolated.
  • ⚠️ Prefetching unified memory with cudaMemPrefetchAsync() looks extremely fast, but the 0.0199 ms figure is almost certainly not the transfer itself: the call is asynchronous and returns as soon as the migration is enqueued, so a CPU-side timer records only launch overhead unless the stream is synchronized inside the timed scope (see the sketch after this list). It is also possible that no migration was pending because the pages were already resident.
  • ⚠️ Zero-Copy memory appears very fast, especially in the "DtoH" direction. But mapped memory already resides in host RAM, so there is no bulk device-to-host copy to perform; the 0.0522 ms reading is more plausibly access or launch overhead than a hidden 2GB transfer.
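To check the prefetch hypothesis, the measurement could be repeated with a synchronization inside the timed scope. A sketch, reusing kBytes, unified_ptr, and AutoTimeLogger from the snippets above:

    // Time the actual page migration, not just the enqueue. Without the
    // synchronize, the CPU timer stops as soon as the prefetch is submitted,
    // which would explain a ~0.02 ms reading for a 2GB buffer.
    int device = 0;
    cudaGetDevice(&device);
    {
        AutoTimeLogger t("cudaMallocManaged HtoD PrefetchAsync (synced)");
        cudaMemPrefetchAsync(unified_ptr, kBytes, device, /*stream=*/0);
        cudaDeviceSynchronize();  // wait until the migration has completed
    }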

Technical Notes

  • Memory size: 2GB (1ULL << 31 bytes)
  • Custom class AutoTimeLogger logs time upon destruction, simplifying measurement.
  • Experiments were built and run using nvcc and MSVC on Windows 11 with a GeForce RTX 3060.

Next Steps

To improve the precision and fairness of future measurements:

  • 🔍 Use cudaEventRecord() to measure only GPU-side kernel execution time, separating transfer from computation (see the sketch after this list).
  • ⚖️ Use the same kernel across all memory types to enable fair comparison.
  • 🧠 Design access patterns that defeat GPU-side caching to better reflect raw transfer performance.
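As a sketch of the event-based approach (touch_kernel is a hypothetical placeholder for whatever kernel reads the buffer, not code from the repo):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel that touches every byte of the buffer.
    __global__ void touch_kernel(unsigned char* data, size_t n) {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    // CUDA events are recorded into the stream, so the elapsed time brackets
    // only the GPU-side work between them, independent of CPU overhead.
    void time_kernel(unsigned char* buf, size_t n) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        touch_kernel<<<(unsigned)((n + 255) / 256), 256>>>(buf, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);  // block until `stop` has been reached

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        std::printf("Kernel Elapsed time: %f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }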

Thanks for reading!
👉 GitHub Repo (cuda-examples)