
Bug #445697

Computing density modes for low density on GPU

Added by Jörg Bartnick about 6 years ago. Updated about 6 years ago.

When I try to compute the density modes on the GPU for configurations with a low number density (~0.001), my simulation fails. A similar configuration with a slightly higher number density does not produce an error.

[15-06-2015 05:41:43.253392] number of A particles: 2500
[15-06-2015 05:41:43.326026] number of B particles: 2500
[15-06-2015 05:41:43.444043] edge lengths of simulation domain: 2402.81 2080.9
[15-06-2015 06:03:31.098551] blocking_scheme: register intermediate scattering function (A/B)
[15-06-2015 06:03:31.114045] blocking_scheme: register intermediate scattering function (B/B)
[15-06-2015 06:03:31.781614] [ERROR] density_mode (A): failed to compute density modes on GPU
[15-06-2015 06:03:31.923111] [ERROR] [CUDA] invalid configuration argument
stack traceback:
[C]: in function 'sample'
... [...] : in function 'liquid'
... [...] :325: in main chunk
[15-06-2015 06:03:31.923181] [WARNING] HALMD aborted


#1 Updated by Felix Höfling about 6 years ago

Hi Jörg,

Thank you for reporting the issue! What are the CUDA dimensions for
launching the density_mode kernel? (Compile in Debug or RelWithDebInfo
mode and run with higher verbosity.)

One reason for the exception could be that the requested shared memory
exceeds the available memory. Is there a concurrent process running on the
GPU? If so, you may try to reduce DEVICE_SCALE in the CMake configuration;
this reduces the number of threads per block and the resources needed for
each block.
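
To make the check concrete, here is a minimal host-side sketch of testing a block configuration against per-block limits. All names (block_limits, block_fits, shared_bytes) are hypothetical and not part of HALMD. The assumption that shared memory grows linearly with the thread count is mine and may not match the density_mode kernel. On a real device the limits would come from cudaGetDeviceProperties (fields maxThreadsPerBlock and sharedMemPerBlock).

```cpp
#include <cstddef>

// Hypothetical container for the per-block limits of a device. On a real
// system these values would be read from cudaDeviceProp
// (maxThreadsPerBlock, sharedMemPerBlock).
struct block_limits
{
    unsigned int max_threads;
    std::size_t max_shared_bytes;
};

// Assumption: each thread needs a fixed amount of shared memory, so the
// request per block scales linearly with the number of threads.
constexpr std::size_t shared_bytes(unsigned int threads, std::size_t bytes_per_thread)
{
    return threads * bytes_per_thread;
}

// True if a block with `threads` threads and `shared` bytes of shared memory
// stays within the given limits. A launch that violates either limit is one
// way to obtain "invalid configuration argument".
constexpr bool block_fits(unsigned int threads, std::size_t shared, block_limits limits)
{
    return threads > 0
        && threads <= limits.max_threads
        && shared <= limits.max_shared_bytes;
}
```

With illustrative limits of 512 threads and 16384 bytes, a block of 512 threads at 64 bytes per thread requests 32768 bytes and does not fit. Lowering DEVICE_SCALE by one halves the block to 256 threads and the request to 16384 bytes, which fits.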

The following line looks a bit dangerous to me:
, dim_(50, 64 << DEVICE_SCALE) // at most 512 threads per block

Where does the 50 come from? It means that the kernel will run in a grid
of 50 blocks of 512 threads each. In your configuration of 5000 particles,
10 blocks would be sufficient.
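
The arithmetic behind these numbers can be sketched as follows. The helper names are hypothetical, DEVICE_SCALE is passed as a plain parameter rather than taken from the CMake configuration, and I assume one thread per particle, which may not be how the kernel actually distributes its work.

```cpp
// Threads per block as in the quoted line: 64 << DEVICE_SCALE.
// device_scale is a function parameter here; in the build it is a CMake setting.
constexpr unsigned int threads_per_block(unsigned int device_scale)
{
    return 64u << device_scale;
}

// Smallest number of blocks such that blocks * threads >= nparticle
// (ceiling division), assuming one thread per particle.
constexpr unsigned int blocks_needed(unsigned int nparticle, unsigned int threads)
{
    return (nparticle + threads - 1) / threads;
}
```

For example, threads_per_block(3) gives 512, and blocks_needed(5000, 512) gives 10, compared with the 50 blocks that are hard-coded.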

Finally, devices of CUDA compute capability ≥ 2.0 (Fermi and later) allow
up to 1024 threads per block. This shouldn't be a problem in your case, but
the code needs some changes with respect to this limit as well. See also

By the way, note that very large box sizes may give rise to additional
round-off errors. Your choice of L=2400 may still be safe; this is just to
alert you.

Best regards,

