Error when running benchmark on GPU

I'm running benchmarks in concrete-ml; the file is benchmarks/deep_learning.py. I added device='cuda' to the compile_torch_model call on line 673.
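For context, the modified call looks roughly like this (torch_model and torch_inputset are placeholders, not the benchmark's actual variable names):

    from concrete.ml.torch.compile import compile_torch_model

    # placeholder model and calibration data; the benchmark supplies its own
    quantized_module = compile_torch_model(
        torch_model,      # the torch.nn.Module under benchmark
        torch_inputset,   # calibration inputs
        device="cuda",    # the change made on line 673
    )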
Then I got this error:

python: /concrete/compilers/concrete-compiler/compiler/lib/Runtime/GPUDFG.cpp:1769: void* stream_emulator_init(): Assertion `cudaGetDeviceProperties(&properties, 0) == cudaSuccess' failed.
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0 libLLVM-17git-0add3cf5.so 0x00007fcabbfd3b51 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 225
1 libLLVM-17git-0add3cf5.so 0x00007fcabbfd1564
2 libc.so.6 0x00007fccbf401520
3 libc.so.6 0x00007fccbf4559fc pthread_kill + 300
4 libc.so.6 0x00007fccbf401476 raise + 22
5 libc.so.6 0x00007fccbf3e77f3 abort + 211
6 libc.so.6 0x00007fccbf3e771b
7 libc.so.6 0x00007fccbf3f8e96
8 libConcretelangRuntime-2f46564e.so 0x00007fcab84e85fb stream_emulator_init + 2555
9 sharedlib.so 0x00007fcb6a05e1ff concrete__clear_forward_proxy + 63

Versions:
concrete-python 2.9.0 (gpu)
concrete-ml 1.8.0

I can't solve this by downgrading, nor by installing LLVM via apt.

Hello @Eric_Jay,

The error looks like you don't have CUDA capabilities; are you sure CUDA is set up correctly on your host?

Can you try to compile and execute a simple program that accesses the CUDA device, to see if that works outside Concrete?
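For instance, a quick sanity check with PyTorch (which concrete-ml already depends on) could look like this:

    import torch

    # queries the same device information the failing assertion relies on
    print("cuda available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))
        print("compute capability:", torch.cuda.get_device_capability(0))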

@yundsi, thank you for your reply.
Yes, I have an NVIDIA A100 80GB PCIe in my server, and I can run the built-in model via the compile function (rather than compile_torch_model) with device='cuda'.
In addition, I printed check_gpu_available() and got True.
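For reference, the check was just the following (import path per concrete-python's compiler module; it may differ across versions):

    # import path is an assumption, not verified against concrete-python 2.9.0
    from concrete.compiler import check_gpu_available

    print(check_gpu_available())  # prints True on this machine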

In detail:
Driver Version: 570.124.04
CUDA Version: 12.8

Besides, is there any guide for building Concrete (especially the GPU version) from source, so I can trace and debug?

You can find a guide here to build Concrete from source. However, that builds the CPU-only version; there is no specific guide for building the GPU version, but you just need to specify CUDA_PATH and set CUDA_SUPPORT=ON.

Then the last step to build the wheel is to run make whl; after that you should have the .whl file in the dist directory.
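Putting it together, the sequence would look roughly like this (paths assume the concrete monorepo layout; adjust CUDA_PATH to your installation):

    # in the compiler directory
    cd concrete/compilers/concrete-compiler/compiler
    make CUDA_SUPPORT=ON CUDA_PATH=/usr/local/cuda python-bindings

    # then in the concrete-python frontend
    cd ../../../frontends/concrete-python
    export COMPILER_BUILD_DIRECTORY=/path/to/concrete/compilers/concrete-compiler/compiler/build
    make whl    # the .whl ends up in the dist directory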

You can also take a look at the release workflow, but with my previous instructions you should have enough.

Let me know if I can help more.

PS: On our side we only test on GPU hosts with H100s; that said, it should work on an A100.

@yundsi Thank you very much! Over the past two days I have tried building according to the documentation in concrete-python and concrete-compiler/compiler, but I still have some issues. I have listed my steps below; could you please check whether there are any problems?
I cloned the concrete repo with --recursive, then checked out tag v2.9.0-gpu.
I set CUDA_SUPPORT to ON in concrete-compiler/compiler/Makefile, then ran "make install-hpx-from-source", "make", and "make python-bindings".
In the concrete-python directory, I ran export COMPILER_BUILD_DIRECTORY=$(pwd)/build.
I changed the version in version.txt to 2.9.0 and ran "pip install --use-pep517 -e .".
But "make pytest" in the concrete-python directory and "make run-python-tests" in the compiler directory reported lots of FAILs. And when I ran the benchmarks in concrete-ml using the library I built, I got errors like:
"/tmp/tmpepoubyyn/sharedlib.so: undefined symbol: _dfr_start
concrete/compilers/concrete-compiler/compiler/include/concretelang/Runtime/runtime_api.h:28: void _dfr_start(int64_t, void *);"
I suspect there might be an issue with the installation of the compiler, but I am at a loss as to how to resolve it.

_dfr_start is a symbol of our runtime; I guess you did not load the RUNTIME_LIBRARY (with LD_PRELOAD). I suspect this is because your COMPILER_BUILD_DIRECTORY is not set correctly: you set it from the concrete-python directory rather than the concrete-compiler/compiler directory.

You can debug using make cp_activate, which will print the relevant environment variables.
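Concretely, something along these lines (the runtime library filename follows the one in your stack trace; the exact path under the build tree may differ):

    # point COMPILER_BUILD_DIRECTORY at the compiler build tree, not concrete-python's
    export COMPILER_BUILD_DIRECTORY=/path/to/concrete/compilers/concrete-compiler/compiler/build

    # preload the runtime so symbols like _dfr_start resolve
    export LD_PRELOAD=$COMPILER_BUILD_DIRECTORY/lib/libConcretelangRuntime.so

    # print the environment variables the Python frontend expects
    make cp_activate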