Faster Compilation and Simulation Tips

I am working on compiling a simple function on a local server, but the compile times become very long as the vector length increases. Ideally, I would compile both a 32-bit and a 64-bit variation, but the program takes too long.

def encrypted_compare(y_pred, ranks):
    args = y_pred
    # argmax via pairwise ranking
    argcnt = args.size

    for argi in range(argcnt):
        for argj in range(argi + 1, argcnt):
            dist = (args[argi] > args[argj])
            ranks[argi] += dist
            ranks[argj] += (1 - dist)
    return ranks

Timing outputs (in seconds) are below; compilation and simulation cover 150 samples, while the FHE evaluation (Run Time) is for a single sample:

Compile Time: 109.4667980670929 for vector length 16
Sim Time: 104.05286407470703 for vector length 16
Run Time: 2.2025949954986572 for vector length 16
Compile Time: 2072.8259332180023 for vector length 23
Sim Time: 2047.2249283790588 for vector length 23
Run Time: 13.927557945251465 for vector length 23

So my questions are:

  1. Are there any tips or tricks to speed up compilation time for my function?
  2. I noticed compilation uses a single core out of 24 available. Is there a way to use all CPU cores for compilation?
  3. In concrete version 1.0.0, simulation takes seconds. In concrete version 2.2.0, simulation takes almost as long as full FHE. Why did the concrete simulation slow down across versions?

The full code and version 2.2.0/1.0.0 output timings are located here.

Hi @lfolkerts,

You have an O(n^2) algorithm, and loops are unrolled during compilation. So lots and lots of operations are created for this circuit, which takes a lot of time to compile.

We’re aware of the issue, and it’s one of our goals to support loops efficiently in the near future!

As for your questions:

  1. Are there any tips or tricks to speed up compilation time for my function?

You can try to tensorize the program!

def compare2(args, ranks):
    for i in range(args.size - 1):
        dist = (args[i] > args[i+1:])
        ranks[i] += np.sum(dist)
        ranks[i+1:] += (1 - dist)
    return ranks

This results in 182 nodes compared to 1321 nodes for the original code (for an array of size 16), and the number of nodes now grows linearly rather than quadratically, so it should be much better for larger sizes!
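As a sanity check, here's a plaintext NumPy comparison (no FHE involved, so names and types are just stand-ins) showing that the tensorized version produces the same ranks as the original double loop:

```python
import numpy as np

def compare_naive(args):
    # plaintext version of the original O(n^2) double loop
    ranks = np.zeros(args.size, dtype=int)
    for i in range(args.size):
        for j in range(i + 1, args.size):
            dist = int(args[i] > args[j])
            ranks[i] += dist
            ranks[j] += 1 - dist
    return ranks

def compare2_plain(args):
    # tensorized: one vector comparison per outer iteration
    ranks = np.zeros(args.size, dtype=int)
    for i in range(args.size - 1):
        dist = (args[i] > args[i + 1:]).astype(int)
        ranks[i] += dist.sum()
        ranks[i + 1:] += 1 - dist
    return ranks

args = np.array([5, 1, 4, 2, 3])
print(compare_naive(args))   # [4 0 3 1 2]
print(compare2_plain(args))  # [4 0 3 1 2]
```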

You can also make the loop parallel by moving the modification of ranks into a separate loop, like so:

def compare3(args):
    dists = []
    for i in range(args.size - 1):
        dists.append(args[i] > args[i+1:])

    ranks = fhe.zeros(args.size)
    for i in range(args.size - 1):
        dist = dists[i]
        ranks[i] += np.sum(dist)
        ranks[i+1:] += (1 - dist)
    return ranks

I haven’t tested it, but it should result in much shorter runtime!
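A quick plaintext sanity check of this split-loop version against the original double loop (with np.zeros standing in for fhe.zeros, so no FHE types are involved):

```python
import numpy as np

def compare_naive(args):
    # plaintext equivalent of the original O(n^2) double loop
    ranks = np.zeros(args.size, dtype=int)
    for i in range(args.size):
        for j in range(i + 1, args.size):
            dist = int(args[i] > args[j])
            ranks[i] += dist
            ranks[j] += 1 - dist
    return ranks

def compare3_plain(args):
    # first loop: comparisons only, all independent of each other
    dists = [(args[i] > args[i + 1:]).astype(int) for i in range(args.size - 1)]
    # second loop: accumulate into ranks (np.zeros stands in for fhe.zeros)
    ranks = np.zeros(args.size, dtype=int)
    for i, dist in enumerate(dists):
        ranks[i] += dist.sum()
        ranks[i + 1:] += 1 - dist
    return ranks

args = np.array([5, 1, 4, 2, 3])
print(compare_naive(args))   # [4 0 3 1 2]
print(compare3_plain(args))  # [4 0 3 1 2]
```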

Lastly, you can try to remove the first dependency in the second loop (ranks[i] += np.sum(dist)) by building ranks directly from dists, instead of creating zeros and modifying them:

def compare4(args):
    dists = []
    for i in range(args.size - 1):
        dists.append(args[i] > args[i+1:])

    ranks = fhe.array([np.sum(dist) for dist in dists] + [0])
    for i in range(args.size - 1):
        ranks[i+1:] += (1 - dists[i])
    return ranks

This version results in 138 nodes :wink:
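The same plaintext sanity check works for this variant too; the only change is that ranks is built directly from the per-row win counts (the trailing 0 is for the last element, which starts no comparisons of its own):

```python
import numpy as np

def compare_naive(args):
    # plaintext equivalent of the original O(n^2) double loop
    ranks = np.zeros(args.size, dtype=int)
    for i in range(args.size):
        for j in range(i + 1, args.size):
            dist = int(args[i] > args[j])
            ranks[i] += dist
            ranks[j] += 1 - dist
    return ranks

def compare4_plain(args):
    dists = [(args[i] > args[i + 1:]).astype(int) for i in range(args.size - 1)]
    # build ranks directly from the per-row win counts (np.array stands in for fhe.array)
    ranks = np.array([d.sum() for d in dists] + [0])
    for i, dist in enumerate(dists):
        ranks[i + 1:] += 1 - dist
    return ranks

args = np.array([5, 1, 4, 2, 3])
print(compare_naive(args))   # [4 0 3 1 2]
print(compare4_plain(args))  # [4 0 3 1 2]
```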

You can probably make the last addition in a tree style (explained just below) and enable dataflow to make it even faster, but that part will only take a fraction of the whole execution so I don’t think it’s necessary :slight_smile:

(tree style addition)

a     b      c     d
 \   /        \   /
  a+b          c+d
    \         /
     \       /
      \     /
       \   /
      a+b+c+d
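For illustration, here's a plaintext sketch of that pairwise reduction pattern; each level halves the number of pending additions, so the dependency depth is O(log n) instead of O(n):

```python
def tree_sum(values):
    """Sum a list pairwise: [a, b, c, d] -> [a+b, c+d] -> [a+b+c+d]."""
    values = list(values)
    while len(values) > 1:
        # add adjacent pairs; each level's additions are independent
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:          # odd leftover carries to the next level
            paired.append(values[-1])
        values = paired
    return values[0]

print(tree_sum([1, 2, 3, 4]))  # 10
```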
  2. I noticed compilation uses a single core out of 24 available. Is there a way to use all CPU cores for compilation?

Not at the moment. We're compiling a single function; it's just that the function is big (imagine a 3000-line C++ function made up of nothing but calls to other functions), and LLVM tries to optimize it very aggressively, which takes a lot of time. In normal programming languages, this pattern doesn't happen often: usually there are many small functions which are fast to compile, and you can compile a lot of them in parallel. In this case we have a single huge function, so it's not as straightforward.

  3. In concrete version 1.0.0, simulation takes seconds. In concrete version 2.2.0, simulation takes almost as long as full FHE. Why did the concrete simulation slow down across versions?

The new simulation requires a one-time compilation, and it's not enabled by default. The first time you call .simulate(...), another compilation happens silently for the simulator. That's why it might be taking very long. Try the fhe_simulation=True configuration option: .compile(...) will take roughly twice as long, but .simulate(...) calls will then be very fast!
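As a rough sketch of how this fits together (assuming the concrete-python 2.x API, with fhe.Compiler and configuration options passed to compile; adjust names to your version and replace the toy lambda with your own function):

```python
import numpy as np
from concrete import fhe

# toy circuit just to show where the option goes
compiler = fhe.Compiler(lambda x: x + 1, {"x": "encrypted"})
inputset = [np.random.randint(0, 16, size=(4,)) for _ in range(10)]

# pay the simulator's one-time compilation here, during compile(...)
circuit = compiler.compile(inputset, fhe_simulation=True)

# subsequent simulate(...) calls are then fast
result = circuit.simulate(np.array([1, 2, 3, 4]))
```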

Hope this helps, let us know if you have other questions!