How to force single-thread execution with C API?

Hi there,
I am trying to use the C API of TFHE-rs. I observed that it can utilize multiple threads even when only one of my threads is calling a computation function (e.g. fhe_uint32_xxx).
However, my program itself is multithreaded, and the multithreading inside TFHE-rs oversubscribes the CPU, so the other threads of my program suffer a performance loss.
Is it possible to force the library to execute within the current thread? (i.e. it uses exactly X threads when X concurrent threads of my program call the library.) I have looked through tfhe.h but there seems to be nothing about threading.
I would appreciate any help!

Hello

Limiting the number of threads used by tfhe-rs is not possible via the C API.
Potentially we could add an entry point to globally limit the number of threads tfhe-rs can use (it may not apply to every single thing happening in the lib), but this would have a drawback:
the limit would be global, not per thread/operation. For example, if you limit the number of threads to 1 and your application calls tfhe-rs from 2 threads, those operations would fight for the same thread.

Also, as you can imagine, operations are multithreaded because that’s one of the ways we make FHE faster; if you limit the number of threads, operations will become much slower.

FHE is not really in a state where it’s usable/meant to be used on ‘consumer’-grade hardware; e.g. our benchmarks are run on machines with 192 threads.

What kind of ciphertexts are you using, what operations do you run, and how many CPU threads/cores do you have?

Hi,

First, thank you for your reply!

To be precise, my target workload is similar to an FHE server. Each request it receives needs to operate on two FHE ciphertexts (e.g. FheUint32) and return the result. My server currently has 144 threads.

If there is only one request at a time, the desired solution is obviously to utilize all 144 threads to process that request as fast as possible.

However, in my use case, there is a large number of independent requests (say 100000) arriving at the same time. I don’t care about the latency of each request, but I do care about the overall throughput. I mainly have these choices:

  1. Process the requests one by one, each using all 144 threads. This is currently implementable with the C API, but the CPU is not fully utilized, because a thread that finishes its share of the work early has to wait for the others.

  2. Process the requests in parallel (max. 144 at a time) through the C API, so that each request may itself use up to 144 threads. This is implementable and keeps the CPU fully busy, but requests issued simultaneously fight over the 144 threads, and each request also pays the overhead of synchronizing among threads.

  3. (Desired) Process the requests in parallel (max. 144 at a time), but execute each request sequentially on its own thread. If this were possible, the CPU would be fully utilized, requests would not fight over threads, and the inter-thread synchronization overhead would be avoided. However, this is currently unimplementable with the C API, and I am not sure whether it is possible directly in Rust. (Even if it is, the other parts of my application are in C++, so connecting them would be a huge headache for me.)

As far as I know, the part of TFHE that mainly consumes computing resources is the FFT. Is it possible to configure the FFT part so that it limits computation to the current thread? If so, the problem may be solved.

The multi-threading you see is not happening at the FFT level; it’s much higher in the stack.

Can you try the C API in the c-threading branch? I added constructs that should allow limiting the number of threads per operation; please try it and see what it does to your example.

There is an example of how to use it in test_high_level_threading.c

Hi,

I have tried the new C API in the c-threading branch, and it totally solves my problem: no more oversubscription. On my example, the FHE part achieves 10–20% higher throughput than the previous approach (144 requests fighting over 144 threads).

Thank you very much for your prompt reply and for adding this new feature. The API design is also excellent: it takes the server key into account and allows multiple (not necessarily just one) threads per API caller context.

Glad it worked. It’s going to make its way into the main branch soon™