How Does Concrete Utilize All CPU Cores for Parallel Execution?

I noticed that Concrete-ML utilizes all the cores of my computer. I would like to ask how it achieves parallel execution. I don’t think it’s using OpenMP because setting OMP_NUM_THREADS doesn’t affect the execution time. I suspect tfhe-rs implements parallelism using rayon. I looked into the implementation of programmable_bootstrap_lwe_ciphertext_mem_optimized, and it ultimately calls functions like polynomial_wrapping_monic_monomial_mul_and_subtract. The implementation of these functions is a for loop:

for ((dst, src), src_orig) in dst.iter_mut().zip(src).zip(src_orig) {
    *dst = src.wrapping_neg().wrapping_sub(*src_orig);

But it doesn’t seem to be executing in parallel as it doesn’t use rayon. So, I would like to ask, how exactly does Concrete achieve parallel execution? Is it tfhe-rs using rayon? If so, then functions like polynomial_wrapping_monic_monomial_mul_and_subtract wouldn’t be the main computational functions. Where can I find the files that contain the main computational functions?

Hello @IsaacQuebec

The polynomial multiplication is already optimized and does not benefit from threading for reasonable parameters sets as tasks are too small to benefit from threading given the cost of synchronization

Main function is the add_external_product_assign