Optimizing mean calculation

Hello,

I am working through trivial examples before tackling my big project, to make sure I understand the best way to do things. Along those lines, I thought I’d ask: what is the best way to get the fastest performance (and maybe parallelization) when averaging a large list of fhe.int16 pairs?

Basically, I have two files of int16 samples, and I want to integer-average the pairs as (a+b)//2. There are about 80k samples, and it is taking hours and hours.

I have a Threadripper CPU with 64 cores and an NVIDIA RTX 3070, and I am looking for any way to optimize the execution. One thought was to try to enable data parallelization, but I can’t see how to do that with a direct circuit, or even tell whether that’s a good idea.

Does anyone have any thoughts on the optimal way to do this?

See below for some of the approaches I’ve tried.

First I was doing this:

with open("new_line1.raw", mode="rb") as file_line1:
        line1=file_line1.read()
with open("new_line2.raw", mode="rb") as file_line2:
        line2=file_line2.read()
outfile = open("out_linear_fhe_onestep.raw", mode="wb")
stream1 = struct.unpack("h" * ((len(line1)) // 2), line1)
stream2 = struct.unpack("h" * ((len(line2)) // 2), line2)

@fhe.circuit({"s1": "encrypted", "s2": "encrypted"})
def circuit(s1:fhe.int16, s2: fhe.int16):
        return ((s1+s2)//2)

print (circuit)

for e1,e2 in itertools.zip_longest(stream1,stream2):
        if e1 is None:
                e1 = 0;
        if e2 is None:
                e2 = 0;
        outfile.write(struct.pack("h",circuit.decrypt(circuit.run(circuit.encrypt(e1,e2)))))

But that approach doesn’t lend itself to parallelization.

Now I am doing:

import struct

import numpy as np
from concrete import fhe

BATCH_SIZE = 1024

# Read both sample files and unpack them as int16 streams.
with open("data/line1_8k.raw", mode="rb") as file_line1:
    line1 = file_line1.read()
with open("data/line2_8k.raw", mode="rb") as file_line2:
    line2 = file_line2.read()
stream1 = struct.unpack("h" * (len(line1) // 2), line1)
stream2 = struct.unpack("h" * (len(line2) // 2), line2)
np_stream1 = np.array(stream1, dtype="i2")
np_stream2 = np.array(stream2, dtype="i2")

# Zero-pad both streams to the same length, rounded up to a multiple of
# BATCH_SIZE. (My first version left `new` undefined when the streams were
# already the same length, so this uses ceiling division instead.)
longest = max(np_stream1.size, np_stream2.size)
new = -(-longest // BATCH_SIZE) * BATCH_SIZE
np_stream1.resize(new)
np_stream2.resize(new)

# Interleave the two streams into (BATCH_SIZE, 2) blocks.
stream = np.stack((np_stream1, np_stream2), axis=1)
loops = np_stream1.size // BATCH_SIZE
split_stream = np.split(stream, loops)

# One tensor circuit: average BATCH_SIZE pairs per run.
@fhe.circuit({"a": "encrypted"})
def circuit(a: fhe.tensor[fhe.int16, BATCH_SIZE, 2]):
    return np.floor_divide(np.sum(a, axis=1), 2).astype(fhe.int16)

print(circuit)

enc_stream = [circuit.encrypt(arr) for arr in split_stream]
mix_stream = [circuit.run(param) for param in enc_stream]

with open("out_linear_fhe_np.raw", mode="wb") as outfile:
    for param in mix_stream:
        circuit.decrypt(param).astype("int16").tofile(outfile)

Any thoughts?
All help is extremely appreciated.

Thank you.
Ron.

Hi Ron,

When you’re using tensors, execution is highly parallelized using OpenMP under the hood :)

I’d suggest gradually increasing the batch size until you run out of memory. You don’t need sum, floor_divide, and astype though; it can be:

@fhe.circuit({"a": "encrypted", "b": "encrypted"})
def circuit(a: fhe.tensor[fhe.int16, BATCH_SIZE], b: fhe.tensor[fhe.int16, BATCH_SIZE]):
        return (a + b) // 2
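
If you keep the padding from your second attempt, the batched loop could then look roughly like this (an untested sketch; it assumes the padded np_stream1 and np_stream2 arrays from your code):

# Untested sketch: feed the two padded streams into the two-input circuit
# above, batch by batch, instead of stacking them into (BATCH_SIZE, 2) blocks.
batches1 = np.split(np_stream1, np_stream1.size // BATCH_SIZE)
batches2 = np.split(np_stream2, np_stream2.size // BATCH_SIZE)

with open("out_linear_fhe_np.raw", mode="wb") as outfile:
    for a_batch, b_batch in zip(batches1, batches2):
        encrypted = circuit.encrypt(a_batch, b_batch)
        result = circuit.run(encrypted)
        circuit.decrypt(result).astype("int16").tofile(outfile)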

You can also play with the p_error and global_p_error configuration options to trade exactness for performance. See the Exactness section of the Concrete documentation for more details.
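
For instance, something along these lines (the value is a placeholder, and I’m assuming fhe.circuit accepts a Configuration the same way compiler.compile does, so double-check against the docs):

# Placeholder value; a higher p_error trades exactness for speed.
configuration = fhe.Configuration(p_error=1e-3)

@fhe.circuit({"a": "encrypted", "b": "encrypted"}, configuration)
def circuit(a: fhe.tensor[fhe.int16, BATCH_SIZE], b: fhe.tensor[fhe.int16, BATCH_SIZE]):
    return (a + b) // 2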

Lastly, you could try Python’s multiprocessing module and let multiple processes work on multiple batches at the same time; a sketch of the idea follows. However, the first option, increasing the batch size, should work much better.
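
A rough sketch of the multiprocessing idea (it assumes the encrypted values can be pickled to and from the worker processes, which you’d want to verify first; if they can’t, move encrypt/run/decrypt into the worker):

import multiprocessing

def run_batch(encrypted_batch):
    # Workers inherit `circuit` on fork-based platforms; verify that the
    # encrypted inputs and outputs pickle cleanly before relying on this.
    return circuit.run(encrypted_batch)

if __name__ == "__main__":
    with multiprocessing.Pool(processes=8) as pool:
        mix_stream = pool.map(run_batch, enc_stream)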

Let us know if you have more questions!

Thank you! This helped me a lot.

Ron.
