2012.05.11 in AMD, GCN, GPU, ISA, Nvidia, OpenCL, PTX, and Tahiti
GCN OpenCL memory fences update (and inline PTX)
This is an update to
my previous
post about the global behaviour of mem_fence() on existing GPUs,
covering hardware that has appeared since then.
On previous AMD architectures the caches were not really used except
for read-only images. The latest Tahiti/GCN GPUs have a read/write,
incoherent L1 cache local to each compute unit. Since a single workgroup
always runs on a single compute unit, memory accessed through that cache
is consistent within the group.
According to the OpenCL specification, global memory is only
guaranteed to be consistent across workgroups after an execution
barrier, i.e. once the kernel has finished, so memory will be consistent
before the next invocation. I found this really annoying: it would
ruin my kernels, and in some cases the extra kernel launches added
significant overhead.
The write does seem to be committed to memory as the IL
documentation indicates; however, the read is still problematic
outside of the workgroup. You must bypass the L1 cache in order to
ensure you read an updated value.
In some cases I found it faster and more convenient to use atomics to
bypass the L1 cache (e.g. reading any given int value with
atomic_or(&address, 0)).
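As a rough sketch of what I mean (the kernel and argument names here are made up for illustration, not taken from a real kernel), a work-item can force a read past the L1 cache by routing it through an atomic that leaves the value unchanged:

inline int read_bypassing_l1(volatile __global int *p)
{
    /* atomic_or with 0 does not modify the value, but the atomic path
       goes to L2/memory, so a write from another compute unit is seen. */
    return atomic_or(p, 0);
}

__kernel void poll_value(volatile __global int *src, __global int *dst)
{
    dst[get_global_id(0)] = read_bypassing_l1(&src[0]);
}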
In the future when the GDS hardware becomes available as an extension, it will probably be a better option for global synchronization. It's been in the hardware at least since Cayman (and maybe Evergreen?) but we don't (yet) have a way to access it from the OpenCL layer.
On the Nvidia side, there is the potential that mem_fence() will stop providing a truly global fence in a future compiler update. Since CUDA 4.0 or so the OpenCL compiler has supported inline PTX. You can get the same effect as __threadfence() by using the membar.gl instruction directly:
inline void strong_global_mem_fence_ptx()
{
    asm("{\n\t"
        "membar.gl;\n\t"
        "}\n\t"
        );
}
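For what it's worth, a hedged usage sketch (the kernel and buffer names are invented for this example): write a result, issue the strong fence so the write is visible device-wide, then raise a flag that other workgroups can poll:

__kernel void publish_result(__global int *data,
                             volatile __global int *flag)
{
    if (get_global_id(0) == 0) {
        data[0] = 42;                    /* payload write                  */
        strong_global_mem_fence_ptx();   /* membar.gl: device-wide fence   */
        atomic_xchg(&flag[0], 1);        /* flag readers should also use
                                            atomics to dodge stale L1      */
    }
}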