Efficiency of Single-Function (Tensorized) vs. Composed Circuit Designs in Concrete

Hi all,

I’m comparing two FHE circuit designs using Concrete:

  • In one design, I compose several smaller functions (each handling one step of the computation) and use operations like np.dot for inner products.
  • In the other design, I implement the entire computation in a single function that uses a tensorized summation (i.e. np.sum(a * b)).
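
For concreteness, here is a minimal sketch of the two inner-product styles I mean (the fhe.compiler usage and shapes are placeholders; my real circuits are larger):

from concrete import fhe
import numpy as np

# Composed style: one small function per step, inner product via np.dot
@fhe.compiler({"a": "encrypted", "b": "clear"})
def inner_dot(a, b):
    return np.dot(a, b)

# Tensorized style: the same inner product expressed as np.sum(a * b)
@fhe.compiler({"a": "encrypted", "b": "clear"})
def inner_sum(a, b):
    return np.sum(a * b)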

I’ve observed that the single‐function (tensorized) approach is noticeably faster and more efficient. Could you explain why using np.sum might lead to better performance? For example, is it due to reduced table lookups or improved parallelism (e.g., better utilization of loop parallelism and tensorization)?

Also, assuming both implementations provide equivalent mathematical outputs, is the single‐function approach just as secure (from an FHE perspective) as the composed design? In other words, can I confidently use one integrated function instead of composing several smaller ones without compromising security?

Any insights or pointers to relevant parts of the optimization guide would be greatly appreciated.

Thanks in advance!

Hello @mahzad_mahdavisharif

I’ll let the tech teams answer, but it would certainly help them if you could share the two functions you are comparing in terms of speed:

  • the first one with np.dot
  • the second one with np.sum

Regarding security, I can already help you: any function compiled by Concrete has 128 bits of security, so neither design is weaker than the other.
Cheers

Hi @benoit,

Thanks for your reply.

I ran an extra experiment where I changed the inner product implementation from using np.dot to np.sum(a * b) in the composed design. Based on my experiments, the difference between using np.dot and np.sum(a * b) for the inner product wasn’t significant. Instead, the main performance difference appears when comparing a composed design (where the circuit is broken into several smaller functions that are chained together) versus a single, integrated (monolithic) function that performs the complete computation in one go. The single‑function design is noticeably faster and more efficient.

My interpretation is that composing several FHE functions introduces additional overhead—for example, extra refresh operations, inter‑function data transfers, or the fact that the composed functions must run sequentially because each function’s output is needed for the next step. The compiler may not optimize these inter‑function dependencies as aggressively as it can when the entire computation is expressed in a single circuit.

I’d like to know if others have observed similar behavior and whether there are any recommendations on mitigating this composition overhead without compromising security.

Thanks in advance!

Hello @mahzad_mahdavisharif

To really help you, we would need your code; otherwise it’s all very vague and we can’t say much.

When you speak about smaller functions, are you speaking about separate functions combined within an fhe.module, or about independently compiled circuits?

In particular, you might want to read With modules | Concrete if you use modules.

Cheers

Hello @mahzad_mahdavisharif,

There might be several reasons why the composed functions run slower than a single monolithic function. As @benoit says, to understand clearly we would need the actual use case and an analysis of the compiler outputs (and/or the compiler verbosity), but here are some reasons I can imagine that would explain the performance difference:

  • A monolithic function gives more information on the computation graph dependencies, i.e. the compiler can better infer the noise propagation in the computation graph and choose cryptographic parameters that are better in terms of performance. The compiler should not insert automatic refreshes, as you mention. Specifying the right wiring (With modules | Concrete) in the composed functions may give the compiler enough information to optimize with the same crypto parameters; see the sketch after this list.
  • A monolithic function gives more information about the concurrency of the program and lets the compiler optimize it better. As an example, if there are two concurrent tasks in the computation graph, the compiler runtime will run them concurrently; if they live in composed functions, the concurrency is handled at another level (and only if you enable auto_schedule, Configure | Concrete), which may be less efficient, as it involves another scheduler and can create conflicts between the two schedulers.
  • Then, giving a monolithic function to the compiler offers more opportunities to find patterns to optimize, merge operations, reuse intermediate buffers, etc.
  • Finally, there is a slight runtime overhead to using composition, more or less what you interpret as inter-function data transfers, plus additional overhead from the function calls themselves, maybe cache invalidation, etc.
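
As a rough sketch of what I mean by specifying the wiring (the function names here are placeholders, and you should check With modules | Concrete for the exact syntax):

from concrete import fhe

@fhe.module()
class Pipeline:
    @fhe.function({"x": "encrypted"})
    def step1(x):
        return x * 2

    @fhe.function({"x": "encrypted"})
    def step2(x):
        return x + 1

    # Declare that only step1's output feeds step2's input, instead of the
    # default assumption that every output may be reused as every input.
    composition = fhe.Wired(
        [
            fhe.Wire(fhe.Output(step1, 0), fhe.Input(step2, 0)),
        ]
    )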

But as @benoit mentions, nothing here will impact the security.

Thanks @benoit and @yundsi for the discussion and your valuable insights. In my tests, the single-function implementation is noticeably faster. Surprisingly, the encryption size is also significantly larger—about twice as large—when using multiple modular functions. It seems that handling the logic as one integrated function helps reduce the overhead associated with multiple function calls in the modular implementation, allowing the underlying FHE process to run more efficiently.

Here’s the simplified modular approach:

from concrete import fhe
import numpy as np

@fhe.module()
class GeneralizedModule:
    # Step 1: elementwise product of encrypted x with clear y
    @fhe.function({"x": "encrypted", "y": "clear"})
    def step1_func(x, y):
        result = (x * y).astype(np.int64)
        return fhe.refresh(result)

    # Step 2: elementwise product of the complement (1 - x) with clear y
    @fhe.function({"x": "encrypted", "y": "clear"})
    def step2_func(x, y):
        result = ((1 - x) * y).astype(np.int64)
        return fhe.refresh(result)

And here’s the single-function implementation:

@fhe.module()
class GeneralizedSingleFunction:
    # Both outputs are computed inside one traced function
    @fhe.function({"x": "encrypted", "y": "clear"})
    def combined_func(x, y):
        product = x * y
        output1 = np.sum(product).astype(np.int64)
        output2 = np.sum((1 - product) * y).astype(np.int64)
        return output1, output2

I appreciate your insights; they’ve helped refine my understanding of the performance differences.

Many thanks

Hey @mahzad_mahdavisharif,

First, I don’t understand the motivation for using astype; it is indeed often used to convert float values into integer values, but here I don’t see the point. Also, the two pieces of code are not equivalent from what I understand: in the second module you have a np.sum that is not present in the first one.

Putting that aside, here are my insights on your code example, after removing the astype in both examples. To be honest it should be a no-op, but from what I see it involves a PBS (we will take a look and fix the bug), which explains why GeneralizedSingleFunction compiles without complaining about the noise propagation. So I will comment on the code below.

Here are my insights:

  • Modules are intended to be composable, and by default fully composable: the compiler expects that every output can be reused as an input. You can specify the wiring rules by hand or with the automatic wiring. In the code you share, this means intermediate results must be usable as inputs of every function, which makes the two designs not equivalent. Moreover, since the compiler expects that step1_func’s result may become the input of step1_func again, you need to add an fhe.refresh, which basically implies a PBS, a costly operation that is not present in combined_func. This also leads to a different noise analysis, which explains why the crypto parameters are not equal (and what you see in the size of the encrypted data).
  • As I said before, with combined_func the compiler can see that output1 and output2 can be evaluated concurrently; that is not the case with GeneralizedModule (assuming it had a function computing the equivalent of output1). This can be mitigated with the auto_schedule feature that I mentioned in my previous reply.

I hope this helps. For more analysis, could you provide a full example with inputsets, compilation options and evaluation code?
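
For example, a skeleton along these lines would be enough to reproduce (the inputset values and shapes are placeholders):

import numpy as np

# Placeholder inputset: pairs of (encrypted x, clear y) samples
inputset = [
    (np.random.randint(0, 2, size=(8,)), np.random.randint(0, 4, size=(8,)))
    for _ in range(20)
]

# One inputset per function of the module
module = GeneralizedModule.compile(
    {"step1_func": inputset, "step2_func": inputset}
)

# Evaluate one sample
x, y = inputset[0]
args = module.step1_func.encrypt(x, y)
result = module.step1_func.run(*args)
print(module.step1_func.decrypt(result))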

Thanks, @yundsi , for your detailed reply. To be honest, my main codebase is quite lengthy and complex, so I created this simplified example to highlight the issue. In my implementation, I also have some other computations and functions that rely on these function results, but to avoid confusion, I excluded those details from this discussion. My earlier response mistakenly omitted the sum for the example, so here is the updated version of my code, which more closely resembles the real implementation:

Modular Approach:

@fhe.module()
class GeneralizedModule:
    @fhe.function({"x": "encrypted", "y": "clear"})
    def step1_func(x, y):
        result = np.sum((x * y)).astype(np.int64)
        return fhe.refresh(result)

    @fhe.function({"x": "encrypted", "y": "clear"})
    def step2_func(x, y):
        result = np.sum(((1 - x) * y)).astype(np.int64)
        return fhe.refresh(result)

Single-Function Implementation:

@fhe.module()
class GeneralizedSingleFunction:
    @fhe.function({"x": "encrypted", "y": "clear"})
    def combined_func(x, y):
        product = x * y
        output1 = np.sum(product).astype(np.int64)
        output2 = np.sum((1 - product) * y).astype(np.int64)
        return output1, output2

1. Using astype to Prevent Errors

Regarding the point you raised in your previous answer, I consistently encountered errors when I removed the astype function. Without astype, the implementations failed to compile, producing the following error:

“Compilation Failed: Program cannot be composed (see Common errors | Concrete): At location HP_chunked_concrete.py:109:0: The noise of the node 0 is contaminated by noise coming straight from the input (partition: 0, coeff: 1000.00).”

The only change I made was removing astype. It seems weird, but without astype the noise level apparently rises to a point that prevents successful composition. Although it doesn’t seem entirely logical at first glance, I haven’t found a workaround other than retaining astype. It would be helpful if you have any insight into the reason behind this.

2. Overhead in Modular Composition

I’ve tested fhe.Wired and automatic module tracing based on the examples in the library documentation, but neither had a significant impact on computation time or size. The observations we discussed seem accurate: modular composition introduces overhead, and that likely explains why the single-function implementation is faster. I believe this reasoning is correct, but if you have any additional insights, I’d appreciate them.

3. The Impact of Chunking

I also implemented my model in different FHE schemes (e.g., BGV, BFV) to compare the same unified logic across libraries. As you know, FHE parameters can only be pushed so far before hitting the noise ceiling, and in other FHE libraries we need to set them manually. For that reason, my implementations chunk the input data, and I kept the Concrete implementation consistent by using a similar chunking method (a rough sketch is below). However, I found that chunking has surprising effects on both computation time and ciphertext size: when I increase the chunk size (i.e., decrease the number of chunks), the time decreases, and the encryption size increases up to a certain level, after which it stabilizes and does not change.
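
To clarify, here is roughly the chunking pattern I use in the Concrete version (the vector length and chunk size are placeholders):

from concrete import fhe
import numpy as np

LENGTH = 16  # placeholder vector length
CHUNK = 4    # placeholder chunk size

@fhe.compiler({"a": "encrypted", "b": "clear"})
def chunked_inner(a, b):
    # The Python loop is unrolled at tracing time, so the circuit contains
    # one partial inner product per chunk.
    acc = fhe.zero()
    for i in range(0, LENGTH, CHUNK):
        acc = acc + np.sum(a[i:i + CHUNK] * b[i:i + CHUNK])
    return acc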

After observing this behavior, I tried to learn more about the parameters Concrete selects automatically behind the scenes to optimize and compile the code. So I tried to understand the limitations on the input and, based on your documentation about common errors, I thought it would be useful to use show_bit_width_constraints=True to understand the size of the input, as well as the limitations on the input chunk size and its optimal value. However, the result I got was confusing and not understandable to me. It was:

Bit-Width Constraints for all
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
%0:
    all.%0 >= 1
%1:
    all.%1 >= 1
%2:
    all.%2 >= 1
    all.%0 == all.%1
    all.%1 == all.%2
%3:
    all.%3 >= 1
    all.%2 == all.%3
%4:
    all.%4 >= 1
%5:
    all.%5 >= 1
%6:
    all.%6 >= 1
    all.%5 == all.%2
    all.%2 == all.%6
%7:
    all.%7 >= 1
    all.%6 == all.%1
    all.%1 == all.%7
%8:
    all.%8 >= 1
    all.%7 == all.%8
%9:
    all.%9 >= 1
%10:
    all.%10 >= 1
    all.%4 == all.%9
    all.%4 >= 2
    all.%9 >= 2
%11:
    all.%11 >= 1

I would appreciate it if you could tell me how I can see exactly the parameters we usually set for an FHE model, like the polynomial modulus, the plaintext modulus, and others, and whether there are any limitations on the size of the input. How can I tell when I will hit the 16-bit limit you mentioned as the input size limitation?

4. Matrix Operations and Supported Functions

As my input data resembles a database table, I’ve considered using matrices for some computations. I noticed np.matmul is listed among the supported operations, and I also found this discussion (General NxN matrix multiplication - #6 by dalvi) on NxN matrix multiplication. I think using matrix formats might be advantageous in my computations. However, I’m unsure whether concrete-python fully supports matrix operations. If you have any guidance or examples, I’d greatly appreciate it.

5. Process Killed Message

After running my code, I can compile it completely without any error, and the results are correct and accurate. However, every time I run it, I get this message as the last line of my output:

zsh: killed     python name_of_my_code.py

I cannot understand what the reason for this is or if it’s something I should be concerned about.

Sorry, it became long and complex. Thank you in advance for your help.

1. Using astype to prevent errors
So for the first point: the astype, as I said in my previous answer, introduces a PBS (which should not be the case, btw), so basically it refreshes the noise.

Removing astype and replacing it with fhe.refresh should work. The error you got by removing astype means that there is a possible chain without noise refreshing: without any astype or refresh (or any op that implies a PBS), the noise strictly grows at each call of a function, and the compiler cannot find a “maximum noise” since it is not bounded.

The following code without astype should compile

from concrete import fhe
import numpy as np

@fhe.module()
class GeneralizedModule:
    @fhe.function({"x": "encrypted", "y": "clear"})
    def step1_func(x, y):
        result = np.sum(x * y)
        return fhe.refresh(result)

    @fhe.function({"x": "encrypted", "y": "clear"})
    def step2_func(x, y):
        result = np.sum((1 - x) * y)
        return fhe.refresh(result)

@fhe.module()
class GeneralizedSingleFunction:
    @fhe.function({"x": "encrypted", "y": "clear"})
    def combined_func(x, y):
        product = x * y
        output1 = np.sum(product)
        output2 = np.sum((1 - product) * y)
        return fhe.refresh(output1), fhe.refresh(output2)

In your example, the step1_func has both astype and fhe.refresh, which makes two layers of PBS instead of the one in combined_func (and I guess the compiler doesn’t simplify this).

3. The Impact of Chunking

You can run the compiler with compiler_verbose_mode=True and/or compiler_debug_mode=True to see exactly what happens both in the transpilation pipeline (from Python to MLIR) and in the compilation pipeline, including all the bit-width assignments and then the FHE parameter assignments.
It’s hard to say why you see performance improve as you increase the chunk size; that depends on your implementation. I guess increasing the chunk size decreases the number of operations but makes the crypto parameters bigger, so there is a tradeoff here. As an example, in tfhe-rs integers are encoded in a radix way with different chunk sizes, and the optimum for the current crypto optimization was chunks of 4 bits.
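
For instance, a minimal sketch combining these flags with the show_bit_width_constraints one you already found (the function and inputset are placeholders):

from concrete import fhe
import numpy as np

@fhe.compiler({"x": "encrypted", "y": "clear"})
def f(x, y):
    return np.sum(x * y)

inputset = [
    (np.random.randint(0, 2, size=(8,)), np.random.randint(0, 4, size=(8,)))
    for _ in range(20)
]

# Dump the bit-width constraints and the details of the transpilation
# and compilation pipelines during compile.
circuit = f.compile(
    inputset,
    show_bit_width_constraints=True,
    compiler_verbose_mode=True,
)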

4. Matrix Operations and Supported Functions
Matrices (np.array) are supported by concrete-python, and here (Supported operations | Concrete) is the list of supported operations. If something is missing, you can create a feature request.
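
As a quick sketch (the sizes and value ranges are placeholders):

from concrete import fhe
import numpy as np

@fhe.compiler({"a": "encrypted", "b": "clear"})
def matvec(a, b):
    # np.matmul between an encrypted matrix and a clear vector
    return np.matmul(a, b)

inputset = [
    (np.random.randint(0, 4, size=(3, 3)), np.random.randint(0, 4, size=(3,)))
    for _ in range(20)
]
circuit = matvec.compile(inputset)

a = np.random.randint(0, 4, size=(3, 3))
b = np.random.randint(0, 4, size=(3,))
assert np.array_equal(circuit.encrypt_run_decrypt(a, b), np.matmul(a, b))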

5. Process Killed Message
Hard to say; maybe your program takes too much memory and the OOM killer kills your process. There are statistics on your circuit that evaluate the memory footprint.
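
For example, on a compiled circuit (a sketch; the exact attributes available may depend on your concrete-python version):

from concrete import fhe

@fhe.compiler({"x": "encrypted"})
def f(x):
    return x + 42

circuit = f.compile(list(range(16)))

# Sizes are in bytes and give a rough idea of the memory footprint.
print(circuit.size_of_inputs)
print(circuit.size_of_outputs)
print(circuit.size_of_keyswitch_keys)
print(circuit.size_of_bootstrap_keys)
print(circuit.statistics)  # full dictionary of circuit statistics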

Thanks!