Error when running benchmark on GPU

I'm running benchmarks in concrete-ml; the file is benchmarks/deep_learning.py. I added device='cuda' to the compile_torch_model call on line 673.
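For context, the modified call looks roughly like this (torch_model and torch_inputset are placeholders, not the benchmark's actual variable names):

    from concrete.ml.torch.compile import compile_torch_model

    # placeholder model and calibration data; the benchmark supplies its own
    quantized_module = compile_torch_model(
        torch_model,      # the torch.nn.Module under benchmark
        torch_inputset,   # calibration inputs
        device="cuda",    # the change made on line 673
    )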
Then I got this error:

python: /concrete/compilers/concrete-compiler/compiler/lib/Runtime/GPUDFG.cpp:1769: void* stream_emulator_init(): Assertion `cudaGetDeviceProperties(&properties, 0) == cudaSuccess' failed.
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0 libLLVM-17git-0add3cf5.so 0x00007fcabbfd3b51 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 225
1 libLLVM-17git-0add3cf5.so 0x00007fcabbfd1564
2 libc.so.6 0x00007fccbf401520
3 libc.so.6 0x00007fccbf4559fc pthread_kill + 300
4 libc.so.6 0x00007fccbf401476 raise + 22
5 libc.so.6 0x00007fccbf3e77f3 abort + 211
6 libc.so.6 0x00007fccbf3e771b
7 libc.so.6 0x00007fccbf3f8e96
8 libConcretelangRuntime-2f46564e.so 0x00007fcab84e85fb stream_emulator_init + 2555
9 sharedlib.so 0x00007fcb6a05e1ff concrete__clear_forward_proxy + 63

Versions:
concrete-python 2.9.0 (gpu)
concrete-ml 1.8.0

I can't solve this by downgrading, nor by installing LLVM via apt.

Hello @Eric_Jay,

The error looks like you don't have CUDA capabilities; are you sure CUDA is set up correctly on your host?

Can you try to compile and execute a simple program that accesses the CUDA device, to see if that works outside Concrete?
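For instance, a quick sanity check with PyTorch (which concrete-ml already depends on) could look like this:

    import torch

    # queries the same device information the failing assertion relies on
    print("cuda available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))
        print("compute capability:", torch.cuda.get_device_capability(0))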

@yundsi, thank you for your reply.
Yes, I have an NVIDIA A100 80GB PCIe in my server, and I can run the built-in model via the compile function (rather than compile_torch_model) with device='cuda'.
In addition, I printed check_gpu_available() and got True.
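For reference, the check was just the following (import path per concrete-python's compiler module; it may differ across versions):

    # import path is an assumption, not verified against concrete-python 2.9.0
    from concrete.compiler import check_gpu_available

    print(check_gpu_available())  # prints True on this machine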

In detail:
Driver Version: 570.124.04
CUDA Version: 12.8

Besides, is there any guide for building Concrete (especially the GPU version) from source, so I can trace and debug?

You can find a guide here to build Concrete from source. However, that builds the CPU-only version; there is no specific guide for building the GPU version, but you just need to specify CUDA_PATH and set CUDA_SUPPORT=ON.

Then the last step to build the wheel is to run make whl; after that you should have the .whl file in the dist directory.
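Putting it together, the sequence would look roughly like this (paths assume the concrete monorepo layout; adjust CUDA_PATH to your installation):

    # in the compiler directory
    cd concrete/compilers/concrete-compiler/compiler
    make CUDA_SUPPORT=ON CUDA_PATH=/usr/local/cuda python-bindings

    # then in the concrete-python frontend
    cd ../../../frontends/concrete-python
    export COMPILER_BUILD_DIRECTORY=/path/to/concrete/compilers/concrete-compiler/compiler/build
    make whl    # the .whl ends up in the dist directory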

You can also take a look at the release workflow, but with my previous instructions you should have enough.

Let me know if I can help more.

PS: On our side we only test on GPU hosts with H100s; that said, it should work on an A100.

@yundsi Thank you very much! Over the past two days I have tried building according to the documentation in concrete-python and concrete-compiler/compiler, but I still have some issues. I have listed my steps below; could you please check whether there are any problems?
I cloned the concrete repo with --recursive, then checked out tag v2.9.0-gpu.
I set CUDA_SUPPORT to ON in concrete-compiler/compiler/Makefile, then ran "make install-hpx-from-source", "make", and "make python-bindings".
In the concrete-python directory, I ran export COMPILER_BUILD_DIRECTORY=$(pwd)/build.
I changed the version in version.txt to 2.9.0 and ran "pip install --use-pep517 -e .".
But "make pytest" in the concrete-python directory and "make run-python-tests" in the compiler directory reported lots of FAILs. And when I ran the benchmarks in concrete-ml using the library I built, I got errors like:
"/tmp/tmpepoubyyn/sharedlib.so: undefined symbol: _dfr_start
concrete/compilers/concrete-compiler/compiler/include/concretelang/Runtime/runtime_api.h:28: void _dfr_start(int64_t, void *);"
I suspect there might be an issue with the installation of the compiler, but I am at a loss as to how to resolve it.

_dfr_start is a symbol of our runtime; I guess you did not load the RUNTIME_LIBRARY (with LD_PRELOAD). I suspect this is because your COMPILER_BUILD_DIRECTORY is not set correctly: you set it from the concrete-python directory rather than the concrete-compiler/compiler directory.

You can debug using make cp_activate, which will print the relevant environment variables.
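Concretely, something along these lines (the runtime library filename follows the one in your stack trace; the exact path under the build tree may differ):

    # point COMPILER_BUILD_DIRECTORY at the compiler build tree, not concrete-python's
    export COMPILER_BUILD_DIRECTORY=/path/to/concrete/compilers/concrete-compiler/compiler/build

    # preload the runtime so symbols like _dfr_start resolve
    export LD_PRELOAD=$COMPILER_BUILD_DIRECTORY/lib/libConcretelangRuntime.so

    # print the environment variables the Python frontend expects
    make cp_activate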