Bug #699590

Failing test: verlet_nvt_hoover/gpu/dsfloat/2d (energy drift)

Added by Felix Höfling 28 days ago. Updated 11 days ago.

Status: In Progress
Priority: Normal
Assignee: -
Target version: -
Start date:
Due date:
% Done: 0%


Description

The unit test verlet_nvt_hoover/gpu/dsfloat/2d fails consistently on (old) Tesla T10 GPUs (compute capability 1.3) due to an energy drift (5e-3 during a short run). The issue has been occurring since 26/7 and does not occur on more recent hardware, hence the low priority.

For details, see, e.g., http://my.cdash.org/testDetails.php?test=35063120&build=1293332

.../test/unit/mdsim/integrators/verlet_nvt_hoover.cpp(205): error in "verlet_nvt_hoover_gpu_dsfloat_2d": absolute value of max_en_diff / fabs(en_nhc0){0.0050164498956183183} exceeds 5.0000000000000002e-05
Absolute tolerance on centre-of-mass velocity: 1.19209e-08
Relative tolerance on instantaneous temperature: 0.11619
.../test/unit/mdsim/integrators/verlet_nvt_hoover.cpp(229): error in "verlet_nvt_hoover_gpu_dsfloat_2d": difference{0.125799} between thermodynamics->temp(){0.88825795959221432} and temp{1} exceeds 0.1161895
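
For readers unfamiliar with the first failing check, the quantity it reports can be sketched as follows. This is a minimal illustration, not the code from verlet_nvt_hoover.cpp: the function names and the way max_en_diff is accumulated are assumptions; only the expression max_en_diff / fabs(en_nhc0) and the tolerance 5e-5 are taken from the message above.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical helper (not from the test source): largest deviation of the
// conserved energy from its initial value, divided by the magnitude of the
// initial value. Assumes en_nhc holds one sample per measurement and that
// the first sample is non-zero.
double relative_energy_drift(std::vector<double> const& en_nhc)
{
    if (en_nhc.empty()) {
        return 0.0;
    }
    double const en_nhc0 = en_nhc.front();
    double max_en_diff = 0.0;
    for (double en : en_nhc) {
        max_en_diff = std::max(max_en_diff, std::fabs(en - en_nhc0));
    }
    return max_en_diff / std::fabs(en_nhc0);
}

// Hypothetical wrapper: the default tolerance is the 5e-5 bound printed in
// the failure message.
bool drift_within_tolerance(std::vector<double> const& en_nhc, double tol = 5e-5)
{
    return relative_energy_drift(en_nhc) <= tol;
}
```

Under these assumptions, the reported value of about 5.0e-3 exceeds the bound by roughly a factor of 100.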

History

#1 Updated by Felix Höfling 28 days ago

I could reproduce the issue several times with commit:0c71140 (13/7). Still, bisecting is not straightforward since the problem appears to depend sensitively on details of the configuration and the way the source is checked out.

Perhaps I should do a memory check next.

#2 Updated by Felix Höfling 28 days ago

cuda_memtest didn't show any problems.

I could not reproduce the issue with commit:c5ee629, which immediately precedes commit:0c71140. The latter, however, merely changes the H5MD writers, mostly in the Lua files.

#3 Updated by Felix Höfling 28 days ago

  • Status changed from New to In Progress

#4 Updated by Felix Höfling 26 days ago

The same problem occurs for an old GTX 260 card, which has compute capability 1.3:
http://my.cdash.org/testDetails.php?test=35136024&build=1295596

We will drop support for these devices very soon. Still, it would be nice to understand what is going on. The bug appears to be at the heart of the MD integration.

#5 Updated by Felix Höfling 11 days ago

  • Subject changed from Failing dsfloat tests on Tesla T10 hardware to Failing dsfloat test (energy drift)
  • Priority changed from Low to Normal

Now, the energy drift is also present on recent hardware, see http://my.cdash.org/testDetails.php?test=35532928&build=1304667

[13-10-2017 10:14:23.909773] NVIDIA driver version: NVIDIA UNIX x86_64 Kernel Module  375.66  Mon May  1 15:29:16 PDT 2017
[13-10-2017 10:14:23.909889] CUDA driver version: 8.0
[13-10-2017 10:14:23.909898] CUDA runtime version: 8.0
[13-10-2017 10:14:23.910274] GPU: 0
[13-10-2017 10:14:23.910283] GPU name: GeForce GTX 960
[13-10-2017 10:14:23.910286] GPU total global memory: 4234280960 bytes
[13-10-2017 10:14:23.910289] GPU shared memory per block: 49152 bytes
[13-10-2017 10:14:23.910292] GPU registers per block: 65536
[13-10-2017 10:14:23.910294] GPU warp size: 32
[13-10-2017 10:14:23.910296] GPU maximum number of threads per block: 1024
[13-10-2017 10:14:23.910298] GPU total constant memory: 65536
[13-10-2017 10:14:23.910300] GPU clock frequency: 1291000 kHz
[13-10-2017 10:14:23.910302] GPU multiprocessor count: 8
[13-10-2017 10:14:23.910304] GPU maximum resident threads per multiprocessor: 2048
[13-10-2017 10:14:23.910306] CUDA compute capability: 5.2
[13-10-2017 10:14:23.910308] CUDA compute version: 2.0

[13-10-2017 10:14:23.913277] random (GPU): set RNG seed: 1577688792

.../src/test/unit/mdsim/integrators/verlet_nvt_hoover.cpp(205): error: in "verlet_nvt_hoover_gpu_dsfloat_2d": absolute value of max_en_diff / fabs(en_nhc0){0.00011049097294000284} exceeds 5.0000000000000002e-05
Absolute tolerance on centre-of-mass velocity: 1.19209e-08
Relative tolerance on instantaneous temperature: 0.11619
Absolute tolerance on centre-of-mass velocity: 0.00402331
Relative tolerance on temperature: 0.00424547
Relative tolerance on kinetic part of specific heat: 0.232611

#6 Updated by Felix Höfling 11 days ago

  • Subject changed from Failing dsfloat test (energy drift) to Failing test: verlet_nvt_hoover/gpu/dsfloat/2d (energy drift)
