Hi @lfolkerts,
You have an O(n^2) algorithm and loops are unrolled, so a very large number of operations are created for this circuit, which takes a long time to compile.
We’re aware of the issue, and it’s one of our goals to support loops efficiently in the near future!
As for your questions:
- Are there any tips/tricks to speed up compilation time for my function?
You can try to tensorize the program!
```python
import numpy as np

def compare2(args, ranks):
    for i in range(args.size - 1):
        # compare args[i] against all of the later elements at once
        dist = (args[i] > args[i+1:])
        ranks[i] += np.sum(dist)
        ranks[i+1:] += (1 - dist)
    return ranks
```
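Outside of FHE, the same loop runs on plain NumPy arrays, which makes it easy to sanity-check that the tensorized version still computes a ranking (a standalone check for illustration, not part of the circuit):

```python
import numpy as np

def compare2(args, ranks):
    for i in range(args.size - 1):
        dist = (args[i] > args[i+1:])
        ranks[i] += np.sum(dist)
        ranks[i+1:] += (1 - dist)
    return ranks

# each element's rank is the number of elements smaller than it
ranks = compare2(np.array([3, 1, 2]), np.zeros(3, dtype=np.int64))
print(ranks.tolist())  # → [2, 0, 1]
```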
This results in 182 nodes compared to 1321 nodes for the original code (for an array of size 16), and the node count now grows linearly rather than quadratically with the array size, so it should be much better for larger sizes!
You can also make the loop parallel by moving the modification of `ranks` into a second loop, like so:
```python
def compare3(args):
    # first loop: independent comparisons, which can run in parallel
    dists = []
    for i in range(args.size - 1):
        dists.append(args[i] > args[i+1:])

    # second loop: accumulate the comparison results into ranks
    ranks = fhe.zeros(args.size)
    for i in range(args.size - 1):
        dist = dists[i]
        ranks[i] += np.sum(dist)
        ranks[i+1:] += (1 - dist)
    return ranks
```
I haven’t tested it, but it should result in a much shorter runtime!
Lastly, you can try to remove the first dependency in the second loop (`ranks[i] += np.sum(dist)`) by creating `ranks` from `dists` directly, instead of creating zeros and modifying them:
```python
def compare4(args):
    dists = []
    for i in range(args.size - 1):
        dists.append(args[i] > args[i+1:])

    # build ranks directly from dists; the last element starts at 0
    ranks = fhe.array([np.sum(dist) for dist in dists] + [0])
    for i in range(args.size - 1):
        ranks[i+1:] += (1 - dists[i])
    return ranks
```
This version results in 138 nodes.
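If you want to sanity-check `compare4` outside of FHE, a NumPy stand-in (swapping `fhe.array` for `np.array`; purely illustrative) produces the same ranking as the earlier versions:

```python
import numpy as np

# same structure as compare4, with fhe.array replaced by np.array
# so it can be run and checked without compiling a circuit
def compare4_np(args):
    dists = [args[i] > args[i+1:] for i in range(args.size - 1)]
    ranks = np.array([np.sum(dist) for dist in dists] + [0])
    for i in range(args.size - 1):
        ranks[i+1:] += (1 - dists[i])
    return ranks

print(compare4_np(np.array([3, 1, 2])).tolist())  # → [2, 0, 1]
```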
You can probably make the last addition tree-style (explained just below) and enable dataflow to make it even faster, but that part only takes a fraction of the whole execution, so I don’t think it’s necessary.
(tree style addition)

```
a   b   c   d
 \ /     \ /
 a+b     c+d
   \     /
    \   /
     \ /
  a+b+c+d
```
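The diagram above can be sketched as a plain-Python pairwise reduction (a hypothetical helper, not a Concrete API); reducing in pairs keeps the addition depth at O(log n) instead of O(n), which is what lets dataflow run the additions in parallel:

```python
def tree_sum(values):
    # add adjacent pairs until one value remains;
    # depth is O(log n) instead of O(n) for a left-to-right sum
    values = list(values)
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2 == 1:
            paired.append(values[-1])  # carry the odd element forward
        values = paired
    return values[0]

print(tree_sum([1, 2, 3, 4]))  # → 10
```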
- I noticed compilation uses a single core out of 24 available. Is there a way to use all CPU cores for compilation?
Not at the moment. We’re compiling a single function; it’s just that the function is big (imagine a 3000-line C++ function made of nothing but calls to other functions), and LLVM tries to optimize it very aggressively, which takes a lot of time. This pattern doesn’t happen often in normal programming languages: usually there are many small functions, which are fast to compile and can be compiled in parallel, but in this case we have a single huge function, so it’s not as straightforward.
- In concrete version 1.0.0, simulation takes seconds. In concrete version 2.2.0, simulation takes almost as long as full FHE. Why did the concrete simulation slow down across versions?
The new simulation requires a one-time compilation, and it’s not enabled by default. The first time you call `.simulate(...)`, it silently performs another compilation for the simulator, which is why it can take very long. Try using the `fhe_simulation=True` configuration option: the time to call `.compile(...)` will be doubled, but `.simulate(...)` calls will be very fast!
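As a minimal configuration sketch (the function `f` and inputset are illustrative; this assumes Concrete Python’s `fhe.compiler` / `fhe.Configuration` API):

```python
from concrete import fhe

@fhe.compiler({"x": "encrypted"})
def f(x):
    return x + 1

# fhe_simulation=True makes compile() also build the simulator up front,
# so compilation takes longer but later simulate() calls are fast
circuit = f.compile(range(10), configuration=fhe.Configuration(fhe_simulation=True))
print(circuit.simulate(3))
```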
Hope this helps, let us know if you have other questions!