Hi, I’m trying to find the most performant way to handle 256-bit data in TFHE-rs.
My computation is:
- XOR of a 256-bit ciphertext with a 256-bit plaintext
- count_ones on the 256-bit result
- Possibly, choosing the minimum among several 256-bit ciphertexts
I can come up with several approaches to represent the 256-bit ciphertext, and I wonder about the tradeoffs between them.
Approach 1: use a single FheUint256 for all operations
Approach 2: use [FheUint8; 32] or [FheUint32; 8] for XOR and count_ones
Approach 3: use [FheBool; 256] for XOR
I’m trying to understand whether I should favor the simple high-level API or get my hands dirty with manual decomposition for better performance (a rough sketch of what I mean by Approach 2 is below).
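Here is roughly what Approach 2 would look like, just to illustrate the amount of manual work involved (an untested sketch; the limb order and the scalar-XOR syntax are my assumptions):

use tfhe::prelude::*;
use tfhe::{generate_keys, set_server_key, ConfigBuilder, FheUint32};

fn main() {
    let config = ConfigBuilder::default().build();
    let (client_key, server_key) = generate_keys(config);
    set_server_key(server_key);

    // 256 bits represented as eight 32-bit limbs (the limb order is an arbitrary choice here).
    let clear_limbs: [u32; 8] = [0xdead_beef; 8];
    let mask_limbs: [u32; 8] = [0x0f0f_0f0f; 8];

    let enc_limbs: Vec<FheUint32> = clear_limbs
        .iter()
        .map(|&limb| FheUint32::encrypt(limb, &client_key))
        .collect();

    // XOR each encrypted limb with its plaintext counterpart, popcount each limb,
    // then add the per-limb counts to get the popcount of the full 256-bit value.
    let total_ones = enc_limbs
        .iter()
        .zip(mask_limbs.iter())
        .map(|(ct, &mask)| (ct ^ mask).count_ones())
        .reduce(|acc, x| acc + x)
        .unwrap();

    let ones_clear: u32 = total_ones.decrypt(&client_key);
    println!("popcount = {ones_clear}");
}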
Thanks for any help!
Given that the operations you want to perform are all implemented in tfhe-rs, your best bet is to use a single FheUint256.
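Something along these lines should work (a minimal sketch; the U256 constructor and the scalar-XOR syntax are assumptions to double-check against your tfhe-rs version):

use tfhe::integer::U256;
use tfhe::prelude::*;
use tfhe::{generate_keys, set_server_key, ConfigBuilder, FheUint256};

fn main() {
    let config = ConfigBuilder::default().build();
    let (client_key, server_key) = generate_keys(config);
    set_server_key(server_key);

    // Clear-side 256-bit values; U256::from((low, high)) is assumed here.
    let clear_a = U256::from((0x1234_5678_9abc_def0u128, 0u128));
    let clear_mask = U256::from((u128::MAX, 0u128));

    // Encrypt the 256-bit input as a single FheUint256.
    let a = FheUint256::encrypt(clear_a, &client_key);

    // XOR with the plaintext, then popcount on the encrypted result.
    let xored = &a ^ clear_mask;
    let ones = xored.count_ones(); // returns an FheUint32

    // Encrypted minimum between two ciphertexts, for the optional last step.
    let b = FheUint256::encrypt(clear_mask, &client_key);
    let _smallest = a.min(&b);

    let ones_clear: u32 = ones.decrypt(&client_key);
    println!("popcount = {ones_clear}");
}

Since count_ones, min and the bitwise operators are all available on FheUint256, there is no need to decompose anything yourself.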
Thanks for your answer! It feels much easier to use a single FheUint256 than [FheUint8; 32], and it works well!
I noticed that tfhe-rs also supports GPU acceleration, so I tried the xor and count_ones operations on FheUint256 with a CudaServerKey. However, it panics, saying that count_ones is not supported on CUDA devices:
pub fn count_ones(&self) -> super::FheUint32 {
    global_state::with_internal_keys(|key| match key {
        InternalServerKey::Cpu(cpu_key) => {
            let result = cpu_key
                .pbs_key()
                .count_ones_parallelized(&*self.ciphertext.on_cpu());
            let result = cpu_key.pbs_key().cast_to_unsigned(
                result,
                super::FheUint32Id::num_blocks(cpu_key.pbs_key().message_modulus()),
            );
            super::FheUint32::new(result, cpu_key.tag.clone())
        }
        #[cfg(feature = "gpu")]
        InternalServerKey::Cuda(_) => {
            panic!("Cuda devices do not support count_ones yet");
        }
        #[cfg(feature = "hpu")]
        InternalServerKey::Hpu(_device) => {
            panic!("Hpu does not support this operation yet.")
        }
    })
}
This is slightly off topic, but is there any workaround for this? For example, moving the FheUint256 result to the CPU, like in torch?
I found that most operations are already supported on GPU (this is great, thanks to the team!), but count_ones/count_zeros is still missing. Is there some technical reason that makes it difficult to implement? I don’t really understand what’s going on under the hood.
Update: I found cuda_memcpy_async_to_cpu in tfhe-cuda-backend, so it does seem possible to move data from the GPU back to the CPU. But that looks like a very low-level API, and I don’t know how to use it from the high-level API.
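What I have in mind, conceptually, is to run the XOR with the CUDA server key set and then switch back to the CPU server key for count_ones. Judging from the on_cpu() call in the snippet above, the high-level types seem to copy data back to the CPU lazily, so something like the untested sketch below might already work (the key-setup calls follow the GPU documentation, but the whole thing is an assumption on my side):

use tfhe::integer::U256;
use tfhe::prelude::*;
use tfhe::{set_server_key, ClientKey, CompressedServerKey, ConfigBuilder, FheUint256};

// Requires the `gpu` cargo feature and a CUDA-capable device.
fn main() {
    let config = ConfigBuilder::default().build();
    let client_key = ClientKey::generate(config);

    // One compressed key, decompressed once for the CPU and once for the GPU.
    let compressed = CompressedServerKey::new(&client_key);
    let cpu_key = compressed.decompress();
    let gpu_key = compressed.decompress_to_gpu();

    let a = FheUint256::encrypt(U256::from((42u128, 0u128)), &client_key);

    // Run the XOR on the GPU.
    set_server_key(gpu_key);
    let xored = &a ^ U256::from((0xffu128, 0u128));

    // Switch back to the CPU key for the operation that is missing on CUDA.
    // Assumption: the ciphertext is copied back to the CPU automatically when
    // count_ones() calls on_cpu(), as in the snippet above.
    set_server_key(cpu_key);
    let ones = xored.count_ones();

    let ones_clear: u32 = ones.decrypt(&client_key);
    println!("popcount = {ones_clear}");
}

Is that the recommended way to mix CPU-only and GPU-supported operations, or is there a better pattern?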