I've been having an odd problem with the dcopy function in the ACML BLAS library. When copying a large number of elements (~10**9), dcopy seems to fail at a certain point. I tested this with the included small Fortran program. The odd thing is that everything works fine with my system BLAS. When I tried MKL, it failed at exactly the same point. When I run the same code on an Intel processor, it also runs fine.
I used ACML 5.2.0 on AMD Opteron 6276, Linux 3.2.0-23 x86_64, and compiled as follows: gfortran test_dcopy.f /opt/acml5.2.0/gfortran64/lib/libacml.a
Update: it seems to be a problem that was introduced in version 4.3.0. The new features section of 4.3.0 mentions: "Level 1 BLAS routines have been tuned for AMD Istanbul processors. Routines affected include xDOT, xCOPY, xAXPY, and xSCAL routines."
When I tried my small test case with dcopy from ACML 4.2.0 it works, with 4.3.0 it fails.
I think this will be considered a bug. The problem is caused by an intermediate result that multiplies N by the element size. Whenever this product overflows a 32-bit integer, the routine fails. The problem can occur in any of the Level 1 copy routines, at different values of N depending on the size of an array element.
The test does work if you build and link with the 64-bit integer library, which may be considered a workaround.
For all of the BLAS routines, at some point arrays become too large for 32-bit address computation and it is necessary to use the 64-bit integer libraries. We can change the copy routines to raise the size at which this occurs. As you point out, it used to work!
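To make the overflow concrete, here is a minimal sketch in C of the arithmetic described above. It assumes the intermediate byte count is held in a signed 32-bit integer, which the thread does not confirm, and the function names are made up for illustration; none of this is ACML code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if n * elem_size does not fit in a signed
   32-bit integer (assumed width of the intermediate result).
   The product is formed in 64 bits so the check itself cannot overflow. */
bool byte_count_overflows_int32(int32_t n, int32_t elem_size)
{
    assert(n >= 0 && elem_size > 0);
    return (int64_t)n * (int64_t)elem_size > INT32_MAX;
}

/* Hypothetical helper: smallest n for which n * elem_size exceeds INT32_MAX. */
int64_t first_overflowing_n(int32_t elem_size)
{
    assert(elem_size > 0);
    return (int64_t)INT32_MAX / elem_size + 1;
}
```

Under that assumption, first_overflowing_n(8) gives 268435456 (2**28) for 8-byte elements as in dcopy, first_overflowing_n(4) gives 536870912 (2**29) for 4-byte elements as in scopy, and first_overflowing_n(16) gives 134217728 (2**27) for 16-byte elements as in zcopy. A copy of ~10**9 doubles is well past the first of these.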
This will be resolved in our upcoming 5.3 release.
Since I can't link ScaLAPACK, which is an integer*4 build, against the integer*8 version of ACML, the workaround isn't really an option for me.
Interestingly, the MKL bug only affects Opteron processors, and it also appeared after some optimization of the ?copy routines in version 10.2. So I wonder: do ACML and MKL share some code?
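For calls made directly from one's own code, one possible stopgap that stays with the 32-bit integer library is to split a large copy into pieces small enough that N times the element size stays below 2**31. The following is only a sketch under that assumption: chunked_dcopy and reference_dcopy are hypothetical names, the function-pointer type mirrors the usual Fortran-style dcopy argument list as seen from C, and only unit stride is handled. It would not help where dcopy is called internally by another library such as ScaLAPACK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fortran-style dcopy argument list as seen from C with 32-bit integers. */
typedef void (*dcopy_fn)(const int *n, const double *x, const int *incx,
                         double *y, const int *incy);

/* Hypothetical wrapper: copy n doubles in pieces of at most max_chunk
   elements, so each call passes a small N to the underlying routine.
   Handles unit stride only. */
void chunked_dcopy(dcopy_fn copy, int64_t n, const double *x, double *y,
                   int max_chunk)
{
    assert(max_chunk > 0);
    const int one = 1;
    int64_t done = 0;
    while (done < n) {
        int64_t left = n - done;
        int len = left < max_chunk ? (int)left : max_chunk;
        copy(&len, x + done, &one, y + done, &one);
        done += len;
    }
}

/* Stand-in for the library routine so the sketch can be exercised
   without linking a BLAS; positive increments only. */
void reference_dcopy(const int *n, const double *x, const int *incx,
                     double *y, const int *incy)
{
    for (int i = 0; i < *n; ++i)
        y[(ptrdiff_t)i * *incy] = x[(ptrdiff_t)i * *incx];
}
```

In real use the copy argument would be the library's own dcopy symbol, and a max_chunk of 2**27 elements (1 GiB of doubles per call) would keep the byte count below 2**31.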
Anyway, thanks for the info, I'll keep an eye out for 5.3.