I have two ciphertext vectors, each containing 100 ciphertexts. I need to perform element-wise multiplication of these two vectors. Currently, I am using Rayon for parallel processing, but the speed is still not satisfactory. I am looking for a more efficient method to perform this operation, ideally in a single step, and I want to fully utilize parallel processing capabilities.

Specific Requirements:

Input: Two vectors of length 100 containing ciphertexts.

Output: A vector of length 100, where each element is the product of the corresponding elements from the input vectors.

Optimization Goal: Improve the execution speed of the element-wise multiplication operation.

Methods Tried:

Using the Rayon library for parallel processing, as shown in the code below (illustrative, not real code):

use rayon::prelude::*;

fn main() {
    let vec1: Vec<Ciphertext> = vec![...]; // Assuming Ciphertext is your type for ciphertexts
    let vec2: Vec<Ciphertext> = vec![...];
    let result: Vec<Ciphertext> = vec1.par_iter()
        .zip(vec2.par_iter())
        .map(|(a, b)| a * b)
        .collect();
}

Issue:

Even with the use of the Rayon library, the processing speed is not ideal. I need a more efficient solution.

Assistance Needed:

How to utilize more efficient computation methods (such as SIMD) for element-wise multiplication?

How to further optimize the use of Rayon to ensure parallel processing fully utilizes hardware resources?

Is it possible to use GPU acceleration to enhance the computation speed?

The multiplication is already a heavy operation and is itself multithreaded on the CPU.

SIMD is not usable on a big operation such as the radix multiplication: a single PBS already makes heavy use of SIMD registers, and there are potentially hundreds of those in a radix multiplication.

You can try using par_chunks from Rayon instead of processing all multiplications at once; chunking distributes the load among threads in batches and may yield better performance.

So if you have reached the limit of your current hardware, you may want to move to a bigger machine, or indeed see whether GPU acceleration improves things. For the high-level API, you can enable the GPU by following this guide: GPU acceleration | TFHE-rs

Is GPU acceleration only applicable to single operators, or can GPUs also accelerate operations such as the element-wise multiplication of two arrays described above? Theoretically, GPUs have enough threads to perform these operations directly in parallel (if my understanding is incorrect, please point it out). Does TFHE-rs provide such GPU-accelerated optimizations for these operations? If not, is there a simple way to implement this kind of operation?

Currently we don't have optimized array operations, and GPUs already use quite a few threads to compute the PBSs required for a multiplication, i.e. one multiplication requires a lot of threads on CPU or GPU.

I think you can schedule more operations on the GPU and the driver should be able to interleave them. @Agnes may be able to tell you more about GPU usage.

Indeed, you should be able to launch multiplications in parallel from Rust with Rayon and have them executed on the GPU. Since a single multiplication does not always use up all the GPU capacity, you should get some improvement compared to a sequential loop of multiplications on the GPU. We haven't tried to benchmark it yet, though.

Thank you very much. Yes, I did exactly as you said before, and the results do come out fairly quickly, but it still doesn't meet my requirements, which is why I'm asking whether you have a better method for multiplying the two arrays.

Could you maybe share the code you’re using to execute the multiplication of the two vectors? What would you need to achieve in terms of latency for the multiplication of 2 vectors of 100 ciphertexts?

At the moment I think the best you can do is something like this:

use std::time::Instant;
use tfhe::{ConfigBuilder, set_server_key, FheUint8, ClientKey, CompressedServerKey};
use tfhe::prelude::*;
use tfhe::shortint::parameters::PARAM_GPU_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_3_KS_PBS;
use rayon::prelude::*;

fn main() {
    let config = ConfigBuilder::with_custom_parameters(PARAM_GPU_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_3_KS_PBS, None).build();
    let client_key = ClientKey::generate(config);
    let compressed_server_key = CompressedServerKey::new(&client_key);
    let gpu_key = compressed_server_key.decompress_to_gpu();

    // Encrypt the two input vectors of 100 ciphertexts each.
    let clear_a = 7u8;
    let clear_b = 2u8;
    let mut vec_a: Vec<FheUint8> = Vec::with_capacity(100);
    let mut vec_b: Vec<FheUint8> = Vec::with_capacity(100);
    for _ in 0..100 {
        let a = FheUint8::encrypt(clear_a, &client_key);
        let b = FheUint8::encrypt(clear_b, &client_key);
        vec_a.push(a);
        vec_b.push(b);
    }

    let start = Instant::now();
    let _: Vec<FheUint8> = vec_a.par_iter()
        .zip(vec_b.par_iter())
        .map(|(l, r)| {
            // Each Rayon worker thread needs its own copy of the server key.
            set_server_key(gpu_key.clone());
            l * r
        })
        .collect();
    let duration = start.elapsed();
    println!("Time elapsed in vec mul is: {:?}", duration);
}

With this, each multiplication between two encrypted integers will happen asynchronously on the GPU. Note that if you have a machine with several GPUs that have peer access to each other via NVLink, performance will be further improved as the multiplications will happen on several GPUs.

The multiplication between two encrypted integers is an expensive operation requiring the execution of many programmable bootstraps. The GPU execution leverages as much parallelism as possible for each of these bootstraps, but still one PBS on GPU takes several milliseconds (performance varies a lot depending on the GPU model). Using bigger GPUs like A100 or H100 will give you the best performance.