I’m running benchmarks in concrete-ml; the file is benchmarks/deep_learning.py. I added `device='cuda'` to the `compile_torch_model` call at line 673.
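For context, the change amounts to something like this (a minimal sketch; `model` and `inputset` stand in for the benchmark's actual model and calibration data, and the other arguments in deep_learning.py differ):

```python
from concrete.ml.torch.compile import compile_torch_model

# Compile the torch model to an FHE circuit, targeting the GPU backend.
quantized_module = compile_torch_model(
    model,            # the benchmark's torch.nn.Module
    inputset,         # representative calibration inputs
    n_bits=4,
    device="cuda",    # the change: the default is "cpu"
)
```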
Then I got error:
python: /concrete/compilers/concrete-compiler/compiler/lib/Runtime/GPUDFG.cpp:1769: void* stream_emulator_init(): Assertion `cudaGetDeviceProperties(&properties, 0) == cudaSuccess' failed. Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0 libLLVM-17git-0add3cf5.so 0x00007fcabbfd3b51 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 225
1 libLLVM-17git-0add3cf5.so 0x00007fcabbfd1564
2 libc.so.6 0x00007fccbf401520
3 libc.so.6 0x00007fccbf4559fc pthread_kill + 300
4 libc.so.6 0x00007fccbf401476 raise + 22
5 libc.so.6 0x00007fccbf3e77f3 abort + 211
6 libc.so.6 0x00007fccbf3e771b
7 libc.so.6 0x00007fccbf3f8e96
8 libConcretelangRuntime-2f46564e.so 0x00007fcab84e85fb stream_emulator_init + 2555
9 sharedlib.so 0x00007fcb6a05e1ff concrete__clear_forward_proxy + 63
@yundsi, thank you for your reply.
Yeah, I have an NVIDIA A100 80GB PCIe on my server, and I can run the built-in model through the `compile` function (rather than `compile_torch_model`) with `device='cuda'`.
In addition, I printed the result of `check_gpu_available()` and got `True`.
In detail:
Driver Version: 570.124.04
CUDA Version: 12.8
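For reference, this is roughly how I gathered those details (a small sketch; `check_gpu_available` comes from concrete-python's compiler bindings, and note that `torch.version.cuda` reports the toolkit torch was built with, not the driver version):

```python
import torch
from concrete.compiler import check_gpu_available

print(check_gpu_available())          # True on this host
print(torch.cuda.get_device_name(0))  # NVIDIA A100 80GB PCIe
print(torch.version.cuda)             # CUDA toolkit version used by torch
```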
You can find a guide here to build concrete from source. However, that builds the CPU-only version; there is no specific guide for building the GPU version, but you just need to specify `CUDA_PATH` and set `CUDA_SUPPORT=ON`.
Then the last step, to build the wheel, is `make whl`; after that you should have the .whl file in the dist directory.
You can also take a look at the release workflow, but my previous instructions should be enough.
Let me know if I can help more.
PS: On our side we only test on GPU hosts with H100; that said, it should work on A100.
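Once such a wheel is installed, you can confirm the CUDA build took effect before rerunning the benchmark. A hedged sketch, assuming `check_gpu_enabled` (the companion to the `check_gpu_available` helper mentioned above) is exported by your build:

```python
from concrete.compiler import check_gpu_available, check_gpu_enabled

# check_gpu_enabled(): was this compiler build compiled with CUDA support?
# check_gpu_available(): is a CUDA device actually visible at runtime?
print(check_gpu_enabled(), check_gpu_available())  # expect: True True
```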
@yundsi Thank you very much! Over the past two days, I have already tried building according to the documentation in concrete-python and concrete-compiler/compiler, but I still have some issues. I have listed my steps here. Could you please help me check if there are any problems?
1. Clone the concrete repo with the --recursive option, then check out the tag v2.9.0-gpu.
2. Set CUDA_SUPPORT to ON in concrete-compiler/compiler/Makefile, then run `make install-hpx-from-source`, `make`, and `make python-bindings`.
3. In the concrete-python directory, `export COMPILER_BUILD_DIRECTORY=$(pwd)/build`.
4. Change the version in version.txt to 2.9.0 and run `pip install --use-pep517 -e .`
But `make pytest` in the concrete-python directory and `make run-python-tests` in the compiler directory reported lots of FAILs. When running benchmarks in concrete-ml using the library I built, I got errors like:
" /tmp/tmpepoubyyn/sharedlib.so: undefined symbol: _dfr_start
concrete/compilers/concrete-compiler/compiler/include/concretelang/Runtime/runtime_api.h:28:void _dfr_start (int64_t, void *);"
I suspect there might be an issue with the installation of the compiler, but I am at a loss as to how to resolve it.
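(For anyone hitting the same thing: you can check from Python whether a given shared library actually exports `_dfr_start`; the path below is a placeholder for wherever your built runtime lives.)

```python
import ctypes

# Placeholder path: point this at the runtime library from your compiler build.
rt = ctypes.CDLL("/path/to/compiler/build/lib/libConcretelangRuntime.so")

# ctypes resolves symbols lazily on attribute access, so hasattr doubles
# as an "is this symbol exported?" probe.
print(hasattr(rt, "_dfr_start"))  # True if the runtime exports _dfr_start
```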
_dfr_start is a symbol of our runtime; I guess you did not load the runtime library (with LD_PRELOAD). I suspect this is because your COMPILER_BUILD_DIRECTORY is not set correctly, since you set it from the concrete-python directory and not from the concrete-compiler/compiler directory.
You can debug using `make cp_activate`, which will print the environment variables.
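A quick way to eyeball the same variables from Python (variable names as used in this thread; the values will differ on your machine):

```python
import os

# The three environment variables this thread turned out to hinge on.
for var in ("COMPILER_BUILD_DIRECTORY", "LD_PRELOAD", "PYTHONPATH"):
    print(f"{var}={os.environ.get(var, '<unset>')}")
```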
@yundsi My previous statement was incorrect: my COMPILER_BUILD_DIRECTORY variable is indeed set to compiler/build, but the LD_PRELOAD and PYTHONPATH variables were missing. After properly configuring these two variables, a new error occurred:
```
RuntimeError: Can't emit artifacts: Command failed:ar rcs /tmp/tmp0v13lc8v/staticlib.a /tmp/tmp0v13lc8v/program.module-0.mlir.o 2>&1
Code:35584
Segmentation fault
```
There is no program.module-0.mlir.o in /tmp/tmp0v13lc8v/, and the staticlib.a that is generated is empty.
The object file name is generated by the function setCompilationResult in CompilerEngine.cpp, with this code:

```cpp
auto sourceName = module->getSourceFileName();
if (sourceName == "" || sourceName == "LLVMDialectModule") {
  sourceName = this->outputDirPath + "/program.module-" +
               std::to_string(objectsPath.size()) + ".mlir";
}
```
@yundsi I just successfully compiled the library in the nvidia/cuda image, so it seems to have been an issue with my environment. However, I then hit exactly the same error as with the GPU version of concrete-python from pip, as I described in my first post. Could you let me know what system environment you used during the build? Also, could you recommend a base Docker image?
Update: I successfully ran this benchmark in the manylinux_2_28_x86_64 image. Thanks so much for your patience and help, @yundsi!