Hi, I’m trying to find the most performant way to handle 256-bit data in TFHE-rs.
My computation is:
- XOR of a 256-bit ciphertext with a 256-bit plaintext
- count_ones on the 256-bit result
- Possibly, choosing the minimum among several 256-bit ciphertexts
I can come up with several approaches to represent the 256-bit ciphertext, and I wonder about the tradeoffs between them.
Approach 1: use a single FheUint256 for all operations
Approach 2: use [FheUint8; 32] or [FheUint32; 8] for XOR and count_ones
Approach 3: use [FheBool; 256] for XOR
I’m trying to understand whether I should favor the simple high-level API or get my hands dirty with manual decomposition for better performance (a rough sketch of what I mean by Approach 2 is below).
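Here is roughly what Approach 2 would look like, just to illustrate the amount of manual work involved (an untested sketch; the limb order and the scalar-XOR syntax are my assumptions):

use tfhe::prelude::*;
use tfhe::{generate_keys, set_server_key, ConfigBuilder, FheUint32};

fn main() {
    let config = ConfigBuilder::default().build();
    let (client_key, server_key) = generate_keys(config);
    set_server_key(server_key);

    // 256 bits represented as eight 32-bit limbs (the limb order is an arbitrary choice here).
    let clear_limbs: [u32; 8] = [0xdead_beef; 8];
    let mask_limbs: [u32; 8] = [0x0f0f_0f0f; 8];

    let enc_limbs: Vec<FheUint32> = clear_limbs
        .iter()
        .map(|&limb| FheUint32::encrypt(limb, &client_key))
        .collect();

    // XOR each encrypted limb with its plaintext counterpart, popcount each limb,
    // then add the per-limb counts to get the popcount of the full 256-bit value.
    let total_ones = enc_limbs
        .iter()
        .zip(mask_limbs.iter())
        .map(|(ct, &mask)| (ct ^ mask).count_ones())
        .reduce(|acc, x| acc + x)
        .unwrap();

    let ones_clear: u32 = total_ones.decrypt(&client_key);
    println!("popcount = {ones_clear}");
}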
Thanks for any help!
Given that the operations you want to perform are all implemented in tfhe-rs, your best bet is to use a single FheUint256.
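Something along these lines should work (a minimal sketch; the U256 constructor and the scalar-XOR syntax are assumptions to double-check against your tfhe-rs version):

use tfhe::integer::U256;
use tfhe::prelude::*;
use tfhe::{generate_keys, set_server_key, ConfigBuilder, FheUint256};

fn main() {
    let config = ConfigBuilder::default().build();
    let (client_key, server_key) = generate_keys(config);
    set_server_key(server_key);

    // Clear-side 256-bit values; U256::from((low, high)) is assumed here.
    let clear_a = U256::from((0x1234_5678_9abc_def0u128, 0u128));
    let clear_mask = U256::from((u128::MAX, 0u128));

    // Encrypt the 256-bit input as a single FheUint256.
    let a = FheUint256::encrypt(clear_a, &client_key);

    // XOR with the plaintext, then popcount on the encrypted result.
    let xored = &a ^ clear_mask;
    let ones = xored.count_ones(); // returns an FheUint32

    // Encrypted minimum between two ciphertexts, for the optional last step.
    let b = FheUint256::encrypt(clear_mask, &client_key);
    let _smallest = a.min(&b);

    let ones_clear: u32 = ones.decrypt(&client_key);
    println!("popcount = {ones_clear}");
}

Since count_ones, min and the bitwise operators are all available on FheUint256, there is no need to decompose anything yourself.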
Thanks for your answer! It feels much easier to use a single FheUint256 than [FheUint8; 32], and it works well!
I noticed that tfhe-rs also supports GPU acceleration, so I tried the xor and count_ones operations on FheUint256 with a CudaServerKey. However, it panics, saying that count_ones is not supported on CUDA devices:
pub fn count_ones(&self) -> super::FheUint32 {
    global_state::with_internal_keys(|key| match key {
        InternalServerKey::Cpu(cpu_key) => {
            let result = cpu_key
                .pbs_key()
                .count_ones_parallelized(&*self.ciphertext.on_cpu());
            let result = cpu_key.pbs_key().cast_to_unsigned(
                result,
                super::FheUint32Id::num_blocks(cpu_key.pbs_key().message_modulus()),
            );
            super::FheUint32::new(result, cpu_key.tag.clone())
        }
        #[cfg(feature = "gpu")]
        InternalServerKey::Cuda(_) => {
            panic!("Cuda devices do not support count_ones yet");
        }
        #[cfg(feature = "hpu")]
        InternalServerKey::Hpu(_device) => {
            panic!("Hpu does not support this operation yet.")
        }
    })
}
This is slightly off topic, but is there any workaround for this? For example, moving the FheUint256 result to the CPU, like in torch?
I found that most operations are already supported on GPU (this is great, thanks to the team!), but count_ones/count_zeros is still missing. Is there some technical reason that makes it difficult to implement? I don’t really understand what’s going on under the hood.
Update: I found cuda_memcpy_async_to_cpu in tfhe-cuda-backend, so it does seem possible to move data from the GPU back to the CPU. But that looks like a very low-level API, and I don’t know how to use it from the high-level API.
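What I have in mind, conceptually, is to run the XOR with the CUDA server key set and then switch back to the CPU server key for count_ones. Judging from the on_cpu() call in the snippet above, the high-level types seem to copy data back to the CPU lazily, so something like the untested sketch below might already work (the key-setup calls follow the GPU documentation, but the whole thing is an assumption on my side):

use tfhe::integer::U256;
use tfhe::prelude::*;
use tfhe::{set_server_key, ClientKey, CompressedServerKey, ConfigBuilder, FheUint256};

// Requires the `gpu` cargo feature and a CUDA-capable device.
fn main() {
    let config = ConfigBuilder::default().build();
    let client_key = ClientKey::generate(config);

    // One compressed key, decompressed once for the CPU and once for the GPU.
    let compressed = CompressedServerKey::new(&client_key);
    let cpu_key = compressed.decompress();
    let gpu_key = compressed.decompress_to_gpu();

    let a = FheUint256::encrypt(U256::from((42u128, 0u128)), &client_key);

    // Run the XOR on the GPU.
    set_server_key(gpu_key);
    let xored = &a ^ U256::from((0xffu128, 0u128));

    // Switch back to the CPU key for the operation that is missing on CUDA.
    // Assumption: the ciphertext is copied back to the CPU automatically when
    // count_ones() calls on_cpu(), as in the snippet above.
    set_server_key(cpu_key);
    let ones = xored.count_ones();

    let ones_clear: u32 = ones.decrypt(&client_key);
    println!("popcount = {ones_clear}");
}

Is that the recommended way to mix CPU-only and GPU-supported operations, or is there a better pattern?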