Optimization: asynchronous kernel launches and copies
CUDA kernels and copies between host and GPU memory could be issued asynchronously, which would reduce synchronization stalls between CPU and GPU and hide kernel launch overhead.
This would require rigorous, fine-grained synchronization. One option is to extend the caching mechanism with a "modification_finished" routine that creates a CUDA fence/synchronization object (for example, a CUDA event) once the modifying work has been queued. The next call to "read_cache" could then wait on that object until the GPU has finished the modifications, instead of synchronizing the whole device.
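A minimal sketch of this idea, under stated assumptions: only the names "modification_finished" and "read_cache" come from the text above. The CacheEntry struct, the scale kernel, run_demo, and both function signatures are hypothetical and chosen for illustration, not taken from the existing caching code. The sketch uses a CUDA event as the synchronization object and a non-blocking stream, so the read waits for one cache entry rather than for the whole device.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Print the CUDA error and bail out of the calling function with status 1.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s\n", cudaGetErrorString(err_));  \
            return 1;                                                \
        }                                                            \
    } while (0)

// Hypothetical cache entry: a device buffer plus an event that marks
// the end of the most recent GPU-side modification.
struct CacheEntry {
    float*      dev     = nullptr;
    size_t      n       = 0;
    cudaEvent_t done    = nullptr;
    bool        pending = false;
};

// Stand-in for any kernel that modifies cached data.
__global__ void scale(float* data, size_t n, float factor) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Record a fence behind all work queued on `stream` so far.
// This call does not block the CPU.
cudaError_t modification_finished(CacheEntry& e, cudaStream_t stream) {
    cudaError_t err = cudaEventRecord(e.done, stream);
    if (err == cudaSuccess) e.pending = true;
    return err;
}

// Wait only for this entry's fence, then copy the data back to the host.
cudaError_t read_cache(CacheEntry& e, float* host) {
    if (e.pending) {
        cudaError_t err = cudaEventSynchronize(e.done);
        if (err != cudaSuccess) return err;
        e.pending = false;
    }
    return cudaMemcpy(host, e.dev, e.n * sizeof(float), cudaMemcpyDeviceToHost);
}

int run_demo() {
    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Non-blocking stream: no implicit synchronization with the default stream.
    cudaStream_t stream;
    CHECK(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));

    CacheEntry e;
    e.n = n;
    CHECK(cudaMalloc(&e.dev, bytes));
    CHECK(cudaEventCreateWithFlags(&e.done, cudaEventDisableTiming));

    // Pinned host memory lets the upload overlap with CPU work.
    float* pinned = nullptr;
    CHECK(cudaMallocHost(&pinned, bytes));
    for (size_t i = 0; i < n; ++i) pinned[i] = 1.0f;

    // Queue the upload, the kernel, and the fence without waiting for any of them.
    CHECK(cudaMemcpyAsync(e.dev, pinned, bytes, cudaMemcpyHostToDevice, stream));
    const unsigned blocks = static_cast<unsigned>((n + 255) / 256);
    scale<<<blocks, 256, 0, stream>>>(e.dev, n, 2.0f);
    CHECK(cudaGetLastError());
    CHECK(modification_finished(e, stream));

    // ... the CPU is free to do unrelated work here ...

    std::vector<float> out(n);
    CHECK(read_cache(e, out.data()));  // blocks until the fence has passed

    int status = 0;
    for (size_t i = 0; i < n; ++i) {
        if (out[i] != 2.0f) { status = 1; break; }
    }

    cudaFreeHost(pinned);
    cudaFree(e.dev);
    cudaEventDestroy(e.done);
    cudaStreamDestroy(stream);
    return status;
}

int main() { return run_demo(); }
```

Built with nvcc and run on a machine with a CUDA-capable GPU, the program should exit with status 0 if every element was scaled before the read. In a real cache, each entry would probably also need to remember which stream last modified it, so that later kernels on other streams can order themselves behind the event with cudaStreamWaitEvent rather than blocking the CPU.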
It remains to be determined whether such modifications are feasible and worthwhile: how many full synchronizations would still be necessary, how many kernel launches could actually happen asynchronously or simultaneously, and so on.