About TFHE-rs Parallel operation

Rocky_Lopez · July 10, 2024, 12:29pm

Problem Description:

I have two ciphertext vectors, each containing 100 ciphertexts. I need to perform element-wise multiplication of these two vectors. Currently, I am using Rayon for parallel processing, but the speed is still not satisfactory. I am looking for a more efficient method to perform this operation, ideally in a single step, and I want to fully utilize parallel processing capabilities.

Specific Requirements:

Input: Two vectors of length 100 containing ciphertexts.
Output: A vector of length 100, where each element is the product of the corresponding elements from the input vectors.
Optimization Goal: Improve the execution speed of the element-wise multiplication operation.

Methods Tried:

Using the Rayon library for parallel processing, as shown in the code like below（not real code）:

use rayon::prelude::*;

fn main() {
    let vec1: Vec<Ciphertext> = vec![...];  // Assuming Ciphertext is your type for ciphertexts
    let vec2: Vec<Ciphertext> = vec![...];

    let result: Vec<Ciphertext> = vec1.par_iter()
                                      .zip(vec2.par_iter())
                                      .map(|(a, b)| a * b)
                                      .collect();
}

Issue:

Even with the use of the Rayon library, the processing speed is not ideal. I need a more efficient solution.

Assistance Needed:

How to utilize more efficient computation methods (such as SIMD) for element-wise multiplication?
How to further optimize the use of Rayon to ensure parallel processing fully utilizes hardware resources?
Is it possible to use GPU acceleration to enhance the computation speed?

IceTDrinker · July 10, 2024, 12:52pm

hello @Rocky_Lopez

the multiplication is already a heavy operation and is itself multithreaded on CPU.

SIMD is not usable on a big operation such as the radix multiplication, a single PBS makes heavy use of SIMD registers already, and there are potentially hundreds of those in a radix multiplication.

you can try using the par_chunks from rayon, this can distribute the load among threads and may yield better performance, instead of processing all multiplications at once

So if you have reached the limit of the current hardware you may want to go bigger or indeed potentially see if GPU acceleration will improve things, normally for the high level API you can use the GPU using this guide: GPU acceleration | TFHE-rs

Cheers

Rocky_Lopez · July 10, 2024, 1:11pm

Is GPU acceleration only applicable to operators, or can GPUs also accelerate operations such as the element-wise multiplication of two arrays as described above? Theoretically, GPUs have enough threads to perform these operations directly in parallel. (If my understanding is incorrect, please point it out). Does TFHE-rs provide such GPU-accelerated optimizations for these operations? If not, is there a simple implementation for this kind of operation?

IceTDrinker · July 10, 2024, 1:26pm

currently we don’t have optimized array operations, and GPUs already use quite a bit of threads to compute a PBS required for a multiplication i.e. 1 mul requires a lot of threads on CPU or GPU

I think you can schedule more operations on the GPU and the driver should be able to schedule them @Agnes may be able to tell you more on the usage of the GPU

IceTDrinker · July 10, 2024, 1:26pm

Did you make sure you are running with the --release flag btw @Rocky_Lopez ?

Rocky_Lopez · July 10, 2024, 1:35pm

thankyou， yes I used this command .
cargo run --release

Agnes · July 10, 2024, 2:14pm

Indeed you should be able to launch multiplications in parallel from Rust with rayon, and have them be executed on the GPU. Since a multiplication does not always use up all the GPU capacity you should get some improvement compared to a sequential loop of multiplications on the GPU. We haven’t tried to benchmark it yet though.

Rocky_Lopez · July 10, 2024, 2:40pm

thankyou very much。Yes, I did exactly as you said before, but the result will come out quickly, yet it still doesn’t meet my requirements, which is why I’m asking if you have a better method to achieve the multiplication of two arrays.

Agnes · July 10, 2024, 3:07pm

Could you maybe share the code you’re using to execute the multiplication of the two vectors? What would you need to achieve in terms of latency for the multiplication of 2 vectors of 100 ciphertexts?

Rocky_Lopez · July 10, 2024, 4:24pm

my function is very easy(because I learnt rust to use TFHE-rs. only two days.)


pub fn homo_vec_mul(left: &Vec<FheUint8>, right: &Vec<FheUint8>) -> Vec<FheUint8> {
    assert_eq!(left.len(), right.len());

    left.par_iter() 
        .zip(right.par_iter()) 
        .map(|(l, r)| l * r)
        .collect() 
}

my main function use the default set.
So if you have any idea to improve the execution efficiency.please tell me！thank you very much！Agnes

Agnes · July 11, 2024, 7:53am

Hey @Rocky_Lopez!

At the moment I think the best you can do is something like this:

use std::time::Instant;
use tfhe::{ConfigBuilder, set_server_key, FheUint8, ClientKey, CompressedServerKey};
use tfhe::prelude::*;
use tfhe::shortint::parameters::PARAM_GPU_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_3_KS_PBS;
use rayon::prelude::*;

fn main() {

    let config = ConfigBuilder::with_custom_parameters(PARAM_GPU_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_3_KS_PBS, None).build();

    let client_key= ClientKey::generate(config);
    let compressed_server_key = CompressedServerKey::new(&client_key);
    let gpu_key = compressed_server_key.decompress_to_gpu();

    let clear_a = 7u8;
    let clear_b = 2u8;

    let mut vec_a: Vec<FheUint8> = Vec::with_capacity(100);
    let mut vec_b: Vec<FheUint8> = Vec::with_capacity(100);
    for _ in 0..100 {
        let a = FheUint8::encrypt(clear_a, &client_key);
        let b = FheUint8::encrypt(clear_b, &client_key);
        vec_a.push(a);
        vec_b.push(b);
    }

    let start = Instant::now();
    let _: Vec<FheUint8> = vec_a.par_iter()
        .zip(vec_b.par_iter())
        .map(|(l, r)| {
            set_server_key(gpu_key.clone());
            l * r})
        .collect();
    let duration = start.elapsed();

    println!("Time elapsed in vec mul is: {:?}", duration);

}

With this, each multiplication between two encrypted integers will happen asynchronously on the GPU. Note that if you have a machine with several GPUs that have peer access to each other via NVLink, performance will be further improved as the multiplications will happen on several GPUs.

The multiplication between two encrypted integers is an expensive operation requiring the execution of many programmable bootstraps. The GPU execution leverages as much parallelism as possible for each of these bootstraps, but still one PBS on GPU takes several milliseconds (performance varies a lot depending on the GPU model). Using bigger GPUs like A100 or H100 will give you the best performance.

I hope this helps