Load Model - Complex Circuit

Hello,

I was wondering whether we could load the weights of our own linear regression into the LinearRegression class instead of retraining every time, for example from the state dict of a torch.nn.Linear layer, if it is already quantized.

Also, how can I compile a circuit that is a bit more complex than just the LinearRegression? For example, I want to do some preprocessing on my encrypted inputs, such as applying a LookupTable from Concrete-Numpy, and then call the LinearRegression on the result, i.e. have a function like this:

concrete_LR = LinearRegression(args)
concrete_LR = concrete_LR.load(weights)
table = cnp.LookupTable([ ... ])

@cnp.compiler({"x": "encrypted"})
def f(x):
    y = table[x]
    return concrete_LR(y)

Instead of having just a circuit from the LinearRegression or the LookupTable

Thank you !

Hi @tricycl3,

Currently there is no straightforward way to load weights into the built-in LinearRegression Concrete-ML model. If you think this feature could come in handy, you could open an issue on GitHub. May I ask what the reason is for not retraining?

That being said, there are solutions that you can use already. The most straightforward one is to use concrete-numpy directly since the LinearRegression is quite simple to write in numpy. This however assumes that your weights are quantized to integers. Is this the case? How did you get your quantized weights?

If you have the quantization parameters as well, then you can easily go from floating-point quantized weights to integer weights.

With quantized integer weights, and if you don’t need retraining, you can just use concrete-numpy and write down the linear regression like this:
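
As a rough illustration of that conversion, here is a minimal sketch assuming a uniform affine quantization scheme, i.e. w_float = scale * (w_int - zero_point). The names scale, zero_point and to_integer_weights are hypothetical; your quantizer may use a different convention, so adapt accordingly.

```python
import numpy as np

def to_integer_weights(w_dequantized, scale, zero_point):
    """Recover integer weights from floating-point values that lie on a
    uniform quantization grid: w_dequantized = scale * (w_int - zero_point)."""
    w_int = np.rint(w_dequantized / scale) + zero_point
    return w_int.astype(np.int64)

# Hypothetical 4-bit example: integers in [0, 15], scale 0.05, zero point 8.
scale, zero_point = 0.05, 8
w_int_true = np.array([[0, 3], [8, 15]])
w_dequantized = scale * (w_int_true - zero_point)

w_int = to_integer_weights(w_dequantized, scale, zero_point)
print(w_int)
```

The rounding step absorbs the small floating-point error introduced by the division.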

def linear_regression(x):
    return x @ weights + bias

@cnp.compiler({"x": "encrypted"})
def f(x):
    y = table[x]
    return linear_regression(y)

where weights and bias are integers (check the shapes of your input and weights so that the matrix multiplication works properly). This way, you can easily combine the TLU and the linear_regression.

Another solution would be to have a torch module implementing the LinearRegression, so that you can save/load the weights using a state_dict and use the compile_torch_model method available in concrete.ml.torch.compile to convert your torch model with pretrained weights to Concrete-ML, and so to FHE. Let me know if you want more detail on this.

I assume you use the built-in LinearRegression from concrete.ml.sklearn models. You can rewrite the function as follows to combine both the table and the LinearRegression:

@cnp.compiler({"x": "encrypted"})
def f(x):
    y = table[x]
    return lr.quantized_module_._forward(y)

Let us know if you have any questions!

Hello,

Thank you for your answer, it helped a lot. I have tried the two different methods and I have a few questions.

Regarding my quantized model, PyTorch already gives some tips and methods in "Introduction to Quantization on PyTorch | PyTorch". It may differ from Brevitas or ONNX; I haven’t checked yet as I am still playing with Concrete-Numpy.

I am trying to do a linear regression with 10 labels on the output of a lookup table, i.e. a 1D vector of dimension 576, but I fail to understand where the time increase comes from.

My code :

import time
import numpy as np
import concrete.numpy as cnp

def table_block():
    return np.array([
[0., 0., 1., 0., 1., 1., 0., 1., 1., 1., 0., 0., 1., 0., 0., 0.],
 [1., 1., 0., 0., 1., 1., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0.],
 [1., 0., 1., 0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
 [1., 1., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 1., 0.],
 [0., 0., 1., 0., 1., 1., 0., 1., 1., 0., 1., 1., 1., 1., 1., 0.],
 [1., 1., 0., 0., 0., 1., 0., 0., 1., 0., 1., 1., 1., 1., 1., 1.],
 [1., 1., 0., 1., 1., 1., 1., 0., 0., 0., 1., 1., 0., 0., 1., 1.],
 [1., 1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 0., 0., 1., 1.],
 [1., 0., 0., 0., 0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 0., 0.],
 [1., 1., 0., 0., 0., 1., 1., 0., 0., 0., 1., 0., 1., 0., 1., 1.],
 [1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 0., 0., 0., 1.],
 [1., 1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 0., 0., 1., 1.],
 [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 1., 1.],
 [1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 0., 1., 1., 1.],
 [1., 1., 0., 0., 0., 1., 1., 0., 0., 0., 1., 1., 0., 0., 1., 1.],
 [1., 1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 0., 0., 1., 1.]
     ]).astype(np.uint8)


lk = []
block = table_block()
npatches = 36

# transforming my truth table into 576 look up tables
for lk_filter in block.transpose():  # on each filter
        lk.extend([cnp.LookupTable(lk_filter)] * npatches)  # npatches per filter
tables = cnp.LookupTable(lk)

nfeat = 576
nclasses = 10
bits = 4
X_train = np.random.randint(0,16, (1000, nfeat)).astype(np.uint8)
y_train = np.random.randint(0,nclasses, (1000,))
w = np.random.randint(0,256,(nfeat,nclasses))
b = np.random.randint(0,256,(nclasses))


@cnp.compiler({"x": "encrypted"})
def f(x):
    y = tables[x]
    res = y @ w + b
    return res 

inputset = X_train[0:4,:]
inpt = X_train[5,:]
print("compiling")
t = time.time()
circuit = f.compile(inputset)
print('compiled in ', time.time()-t)
t = time.time()
circuit.keygen()
print("Keygen done in ", time.time()-t)
enc = circuit.encrypt(inpt)
t = time.time()
res_enc = circuit.run(enc)
print(time.time()-t)
dec = circuit.decrypt(res_enc)
print(dec)

When I just call the lookup table I have an inference of approx 6s, with the following function f_lu :

@cnp.compiler({"x": "encrypted"})
def f_lu(x):
    y = tables[x]
    return y 

When I just call the linear regression I have an inference of approx 0.1s, with the following function f_lr :

@cnp.compiler({"x": "encrypted"})
def f_lr(x):
    res = x @ w + b
    return res 

But when I call the two of them together, both my inference time and my compilation time increase dramatically:

compiling
compiled in  4890.356591701508
Keygen done in  51.013827323913574
1253.6375329494476
[70387 70784 11730 46806 41651 11664 21623 51500 18479  6932]

Whereas it would be done in seconds with only the lookup table, or only the linear regression.

I suppose it comes from the bootstrapping, but I expected only a bootstrap between the call of the lookup table and the call of the matrix multiplication of approx bootstrap time * number of values = 0.014 * 576 = 8 s

Is my intuition correct or am I totally missing something ?
Should I try a bigger inputset for the compilation ?

Also, I tried to use the LinearRegression from concrete.ml.sklearn, but the timings are very different from the simple x @ w + b, with the following code:

import time
import numpy as np
from concrete.ml.sklearn import LinearRegression as ConcreteLinearRegression
nfeat = 576
nclasses = 10
bits = 4
X_train = np.random.randint(0,2, (1000, nfeat)).astype(np.float32)
y_train = np.random.randint(0,nclasses, (1000,))

X_test = np.random.randint(0,2, (100, nfeat)).astype(np.float32)
y_test = np.random.randint(0,nclasses, (100,))

print("Creating the model ...")
model_lr = ConcreteLinearRegression(n_bits=bits)#{"model_inputs": 4, "op_inputs": 4, "op_weights": 4, "model_outputs": 4})
print("Fitting the model")
# Fit the model
model_lr.fit(X_train, y_train)
print("Evaluate on some inputs")
# Evaluate the model on the test set in clear
y_pred_clear = model_lr.predict(X_test)
res = np.sum((y_pred_clear == y_test))/len(y_test)*100
print("Accuracy on test set : ",res)

# Compile the model
print("Compiling model ...")
circuit = model_lr.compile(X_train)
print("Compiled")
test_input = X_test[:3]
time_begin = time.time()
circuit.client.keygen(force=False)
print(f"Key generation time: {time.time() - time_begin:.2f} seconds")

# Now predict using the FHE-quantized model on the testing set
time_begin = time.time()
y_pred_fhe = model_lr.predict(test_input, execute_in_fhe=True)
print(f"Execution time: {(time.time() - time_begin) / len(test_input):.2f} seconds per sample")

It took my computer approx 3600s per sample there vs the 0.1s of the matrix multiplication. I suppose there is some hidden logic behind that I am not aware of.

Thank you !

Hi @tricycl3

I will try to address all your points. Let me know if I miss something.

Let’s first look at the problem you encounter with Concrete-Numpy only. And thanks for sharing your code, it helps a lot to explain what happens.

This is how you define your data:


nfeat = 576
nclasses = 10
bits = 4
X_train = np.random.randint(0,16, (1000, nfeat)).astype(np.uint8)
y_train = np.random.randint(0,nclasses, (1000,))
w = np.random.randint(0,256,(nfeat,nclasses))
b = np.random.randint(0,256,(nclasses))

Here your input is 4 bits while your weights and biases are 8 bits. Your table lookup seems to take a 4-bit value and output a 1-bit value. Note that the precision you use has a huge impact on the FHE execution time (higher precision → slower FHE evaluation).
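
A quick way to double-check these bit widths in the clear is sketched below. The helper bit_width is hypothetical (it is not a Concrete-Numpy function) and only handles unsigned values.

```python
import numpy as np

def bit_width(values):
    """Number of bits needed to represent the largest unsigned value."""
    return int(np.max(values)).bit_length()

# Value ranges taken from the data definition above.
input_bits = bit_width(np.array([0, 15]))    # inputs drawn from [0, 15]
weight_bits = bit_width(np.array([0, 255]))  # weights and biases drawn from [0, 255]
table_bits = bit_width(np.array([0, 1]))     # the lookup tables only output 0 or 1

print(input_bits, weight_bits, table_bits)
```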

Before I get to the function you want to compile, you can set:

conf_simulation = cnp.Configuration(
    enable_unsafe_features=True,
    show_mlir=False,
    show_graph=True,
)

and then do

circuit = f.compile(inputset, configuration=conf_simulation, virtual = True)

This will allow you to debug Concrete-Numpy quickly without having to run into the large compilation/keygen and FHE inference times. That way you will see what the circuit and precision look like at each step (plus intermediate steps) in the function you want to compile.

Let’s look at what happens when you only do the table lookup:

@cnp.compiler({"x": "encrypted"})
def f_lu(x):
    y = tables[x]
    return y 

If you use the above configuration you should see the following graph:

Computation Graph
--------------------------------------------------------------------------------
%0 = x                                               # EncryptedTensor<uint4, shape=(576,)>        ∈ [0, 15]
%1 = tlu(%0, table=[[0, 1, 1, ...  1, 1, 1]])        # EncryptedTensor<uint1, shape=(576,)>        ∈ [0, 1]
return %1
--------------------------------------------------------------------------------

An unsigned 4 bits input is used along with a tlu that returns a 1 bit unsigned output. Here the tlu has a 4 bits precision. This tlu operation is the most expensive computation part in FHE (as opposed to only doing multiplication or additions) and you do it 576 times but fortunately, the precision is relatively low (4 bits) so the execution is pretty fast.
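
For reference, what tables[x] computes can be emulated in plain NumPy, which is handy for checking the expected output in the clear. This is only a sketch: the random block below is a stand-in for the real truth table, and per_feature_tables is a name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
block = rng.integers(0, 2, size=(16, 16))  # stand-in truth table: 16 input values x 16 filters
npatches = 36

# One table per feature: each filter (a column of `block`) is repeated npatches times in a row.
per_feature_tables = np.repeat(block.T, npatches, axis=0)   # shape (576, 16)

x = rng.integers(0, 16, size=576)                 # 4-bit inputs
y = per_feature_tables[np.arange(x.size), x]      # y[i] = table_i[x[i]]

print(per_feature_tables.shape, y.shape)
```

Each of the 576 outputs corresponds to one table lookup, i.e. one PBS in FHE.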

Now let’s take the second example,

@cnp.compiler({"x": "encrypted"})
def f_lr(x):
    res = x @ w + b
    return res 

and take a look at the graph:

Computation Graph
--------------------------------------------------------------------------------
%0 = x                                # EncryptedTensor<uint4, shape=(576,)>        ∈ [0, 15]
%1 = [[200 136  ...  173 255]]        # ClearTensor<uint8, shape=(576, 10)>         ∈ [0, 255]
%2 = matmul(%0, %1)                   # EncryptedTensor<uint20, shape=(10,)>        ∈ [511906, 599870]
%3 = [ 48 132 2 ... 1 171  93]        # ClearTensor<uint8, shape=(10,)>             ∈ [28, 245]
%4 = add(%2, %3)                      # EncryptedTensor<uint20, shape=(10,)>        ∈ [512094, 599918]
return %4
--------------------------------------------------------------------------------

As you can see, the precision here is much higher than for the tlu example with a max precision of 20 bits. You can check this max precision easily with circuit.graph.maximum_integer_bit_width().

This example is pretty fast as there is no tlu, only multiplications and additions. Note that here the input to the linear regression, x, is on 4 bits, but when you use the tlu before the linear regression, its input will be on 1 bit.

If you want to combine both tlu and linear model as follows:

@cnp.compiler({"x": "encrypted"})
def f(x):
    y = tables[x]
    res = y @ w + b
    return res 

you will get a graph that looks like this:

Computation Graph
--------------------------------------------------------------------------------
%0 = x                                               # EncryptedTensor<uint4, shape=(576,)>        ∈ [0, 15]
%1 = tlu(%0, table=[[0, 1, 1, ...  1, 1, 1]])        # EncryptedTensor<uint1, shape=(576,)>        ∈ [0, 1]
%2 = [[165  99  ...   42 109]]                       # ClearTensor<uint8, shape=(576, 10)>         ∈ [0, 255]
%3 = matmul(%1, %2)                                  # EncryptedTensor<uint16, shape=(10,)>        ∈ [32767, 38413]
%4 = [190 181   ... 2 116 223]                       # ClearTensor<uint8, shape=(10,)>             ∈ [0, 223]
%5 = add(%3, %4)                                     # EncryptedTensor<uint16, shape=(10,)>        ∈ [32948, 38622]
return %5
--------------------------------------------------------------------------------

Earlier, I mentioned that the max bitwidth is very important as it changes the crypto parameters underneath, which can impact the overall runtime. Since you were able to run all the examples, I suppose you are using concrete-numpy==0.9.0. In this version of CN, linear numpy functions can use high bitwidths (>16 bits), while tlus work on 16 bits at most. For the example above, the accumulator of the matmul is right at the limit with 16 bits.
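
To make the bitwidth reasoning concrete, here is a small back-of-the-envelope sketch in plain Python. The helper accumulator_bits is hypothetical (not part of Concrete-Numpy); it computes the worst-case width of x @ w + b for unsigned values. The widths shown in the graphs (20 and 16 bits) are smaller than the worst case, which is consistent with the ranges being measured on the inputset rather than derived from the types.

```python
def accumulator_bits(n_features, x_max, w_max, b_max):
    """Bits needed for the largest possible value of x @ w + b,
    assuming unsigned inputs, weights and bias."""
    worst_case = n_features * x_max * w_max + b_max
    return worst_case.bit_length()

# 4-bit inputs fed directly to the matmul (f_lr)
print(accumulator_bits(576, 15, 255, 255))  # 22

# 1-bit table outputs fed to the matmul (f)
print(accumulator_bits(576, 1, 255, 255))   # 18

# Upper bounds reported in the two graphs above
print((599918).bit_length(), (38622).bit_length())  # 20 16
```

So an inputset that covers more extreme values would raise the reported precision rather than lower it.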

I hope that helped a bit to understand what is happening. Now let’s get into the problems.

Currently, there is no multi-precision support for tlu. This means that the tlu complexity will match the max bitwidth of the entire circuit. In this last case, the tlu that used to be on 4 bits now has its crypto parameters raised to account for 16 bits of precision, which is the most expensive tlu you can run today in Concrete-Numpy. Until multi-precision is supported (an important feature to come), there isn’t any straightforward solution to this problem. What you could do here is split the matmul into N smaller matmuls, which will decrease the accumulator bitwidth. You will end up with N values that can be decrypted and summed in the clear without losing any precision.
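
The splitting idea can be checked in the clear with plain NumPy before touching FHE. The sketch below is only an illustration: the chunk size of 48 is an arbitrary choice, and the data is random.

```python
import numpy as np

rng = np.random.default_rng(0)
nfeat, nclasses, chunk = 576, 10, 48       # chunk size is an arbitrary choice

y = rng.integers(0, 2, size=nfeat)         # 1-bit outputs of the table lookup
w = rng.integers(0, 256, size=(nfeat, nclasses))
b = rng.integers(0, 256, size=nclasses)

# What the circuit would return: one partial result per chunk, shape (12, 10).
partials = np.stack([
    y[i:i + chunk] @ w[i:i + chunk]
    for i in range(0, nfeat, chunk)
])

# What the client does after decryption: sum the partials and add the bias.
result = partials.sum(axis=0) + b

full_bits = (nfeat * 255 + 255).bit_length()   # worst case for the full matmul plus bias
chunk_bits = (chunk * 255).bit_length()        # worst case for one partial matmul
print(partials.shape, full_bits, chunk_bits)
```

Smaller chunks lower the accumulator width further, at the cost of returning more values to the client.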

Concrete-ML being much slower than just x @ w + b is expected as Concrete-ML 0.5.x introduces tlus in linear models. You can easily check this by doing:

circuit = model_lr.compile(X_train, conf_simulation, use_virtual_lib=True)

You will see that tlus appear before and after the matmul which makes the execution time much longer. By installing Concrete-Numpy 0.9 with Concrete-ML 0.5 you were able to do things that were not yet well handled in Concrete-ML. This should be fixed in Concrete-ML 0.6 (being released in a matter of days!).

Note that in your example with Concrete-ML, you use quantized input to train your model. It can be a problem since the LinearRegression fit will re-quantize your inputs using the n_bits parameter provided.

Finally, if you want to go further with quantization-aware training of neural networks, I strongly suggest using Brevitas, as the PyTorch quantization feature does not allow quantization below 8 bits (not ideal in FHE right now). Concrete-ML also has great support for Brevitas, so it will make your life easier.

Hope that helps!

Thanks a lot for your very precise answer, it is very insightful.

Why does Concrete-Numpy use only additions and multiplications, without tlu, in the f_lr case and not in the f case?
Is it because a tlu object is created in the graph after the tlu call and thus impacts the linear regression?

Is it possible to use only multiplications and additions (no tlu) in f while keeping the function f_lu? Is there a way to use an object other than a tlu, to keep the speed of the linear regression while keeping the representation of a boolean function (here represented as a tlu)?

What you could do here is split the matmul into N smaller matmuls, which will decrease the accumulator bitwidth. You will end up with N values that can be decrypted and summed in the clear without losing any precision.

This is a very good idea, I’ll try it and let you know! Thanks a lot

Thank you again,

tricyl3

You are very welcome.

Why does Concrete-Numpy use only additions and multiplications, without tlu, in the f_lr case and not in the f case?
Is it because a tlu object is created in the graph after the tlu call and thus impacts the linear regression?

In FHE, a tlu is converted to a Programmable Bootstrapping (PBS), which is a pretty intensive cryptographic (FHE) operation. So basically, whenever you use a tlu, you combine multiplications/additions + PBS. TLUs are also created automatically by Concrete-Numpy (e.g. a simple division will create a tlu), so there might be cases where a tlu appears even though you did not explicitly create one.

Is it possible to use only multiplications and additions (no tlu) in f while keeping the function f_lu? Is there a way to use an object other than a tlu, to keep the speed of the linear regression while keeping the representation of a boolean function (here represented as a tlu)?

It seems that you have some non-linearity as you want to assign 0 or 1 depending on the value of the feature. So, unless you apply your table lookup prior to sending it to the FHE circuit (in clear), I don’t see how you could get rid of the tlu.

Using a tlu is fine as long as you can keep the precision relatively low (as you did). The fact that it takes a very long time is due to the tlu precision being raised to the precision of the following matmul accumulator. This will be fixed sooner or later for sure.

If you really want to keep this tlu in the function, for now I think that your best shot is to reduce the matmul accumulator as much as you can. For example, the best case would be having 4-bit weights and returning x * w to the user rather than sum(x * w). Since you have 1 bit of precision in x and 4 bits of precision in w, you will end up with a 4-bit accumulator. But of course, depending on the use case, it’s probably not ideal to return so many values.

If you think of other general questions, do not hesitate to open a new thread! This will help other users find the relevant information (since we drifted quite a bit from the original topic).
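
In clear NumPy, that extreme case looks roughly like the sketch below (random stand-in data; products and scores are names introduced here for illustration). The circuit would return the element-wise products and the client would do the summation after decryption.

```python
import numpy as np

rng = np.random.default_rng(0)
nfeat, nclasses = 576, 10

y = rng.integers(0, 2, size=nfeat)                 # 1-bit table outputs
w = rng.integers(0, 16, size=(nfeat, nclasses))    # 4-bit weights

# Returned by the circuit: one product per (feature, class) pair, no summation.
products = y[:, None] * w                          # shape (576, 10), values <= 15

# Done by the client after decryption.
scores = products.sum(axis=0)

print(products.shape, int(products.max()).bit_length())
```

The trade-off is visible in the shape: 5760 values are returned instead of 10.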

Thanks a lot for all the relevant advice. I will update and close this post with the new timings once I have them.

Hi again,

So I have implemented the solution proposed:

What you could do here is split the matmul into N smaller matmuls, which will decrease the accumulator bitwidth. You will end up with N values that can be decrypted and summed in the clear without losing any precision.

I have changed it a bit so that my weights are all binary.

E.g. if I have a 4-bit matrix of dimensions (features, classes), I change it to (4, features, classes)

import time
import numpy as np
import concrete.numpy as cnp


nfeatures, nclasses = 576, 10
w_quant_L = np.random.randint(0,16, (nclasses, nfeatures))
w_bits = np.unpackbits(w_quant_L.astype(np.uint8), axis=0)
print(w_bits.shape) # (80, 576)
w_bits2 = []
# extract the relevant 4 bits
for row in range(10):
    w_bits2.append(w_bits[8*row+4:8*row+8,:])
w_bits2 = np.array(w_bits2)
w_bits2 = w_bits2.transpose(1,0,-1) # (4, 10, 576)
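
For comparison, the same bit-plane decomposition can be sketched with shifts and masks instead of np.unpackbits, together with a check that the planes recombine into the original weights. This is an illustrative variant, not the code used for the timings; it uses the (4, features, classes) layout and the names planes, shifts and powers are introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
nfeatures, nclasses, nbits = 576, 10, 4

w_quant = rng.integers(0, 2**nbits, size=(nfeatures, nclasses))

# planes[k] holds bit (nbits - 1 - k) of every weight, most significant bit first.
shifts = np.arange(nbits - 1, -1, -1)                          # [3, 2, 1, 0]
planes = (w_quant[None, :, :] >> shifts[:, None, None]) & 1    # shape (4, 576, 10)

# Recombining the planes with powers of two gives back the original weights...
powers = 2 ** shifts                                           # [8, 4, 2, 1]
w_rebuilt = np.tensordot(powers, planes, axes=1)

# ...and, by linearity, the same recombination works on the matmul results.
y = rng.integers(0, 2, size=nfeatures)
scores = sum(p * (y @ plane) for p, plane in zip(powers, planes))

print(planes.shape)
```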

Then compute the subsums :


block = np.random.randint(0,2, (16,16))
npatches = 36
nfilters = 16


lk = []
for i,lk_filter in enumerate(block.transpose()):  # on each filter
    lk.extend([cnp.LookupTable(lk_filter)] * npatches)
    # lk.extend([cnp.LookupTable(lk_filter*np.random.randint(0,16,1))] * npatches)  # npatches per filter
tables = cnp.LookupTable(lk)


N = 12  # features per sub-sum (tried 18 and 12)
end = 48  # number of sub-sums = nfeatures // N (32 for N = 18, 48 for N = 12)
X_train = np.random.randint(0,16, (1000, nfeatures)).astype(np.uint8)


cfg = cnp.Configuration(p_error=0.1, show_graph=True)

@cnp.compiler({"x": "encrypted"})
def h(x):
    y = tables[x]
    for t in range(3):
        for num_bits in range(4):
            w = w_bits2[num_bits].transpose()
            res = np.expand_dims(y[int(N * 0):int(N + N * 0)] @ w[int(N * 0):int(N + N * 0), :],axis=0)  # + b
            start = 1
            for i in range(start, end):
                res = np.concatenate((res, np.expand_dims(y[int(N * i):int(N + N * i)] @ w[int(N * i):int(N + N * i)],axis=0)), axis=0)
            # res now has shape (48, 10)

            if t == 0 and num_bits == 0:
                res_f = np.expand_dims(res, axis=0)
            else:
                res_f = np.concatenate((res_f, np.expand_dims(res, axis=0)), axis = 0)

    print(res_f.shape)
    return res_f # np.greater(res,0)#res #(res>0)*1.0

inputset = X_train[0:4,:]#[np.zeros(576).astype(np.uint8), (16 * np.ones(576)).astype(np.uint8)] # X_train[0:4,:]
inpt = X_train[5,:]
print("compiling")
t = time.time()
circuit = h.compile(inputset, configuration=cfg)
print('compiled in ', time.time()-t)
t = time.time()
circuit.keygen()
print("Keygen done in ", time.time()-t)
enc = circuit.encrypt(inpt)
t = time.time()
res_enc = circuit.run(enc)
print(time.time()-t)

cloud_output = circuit.decrypt(res_enc)

Then sum the results as W = 2**3 * w_bits2[0] + 2**2 * w_bits2[1] + 2**1 * w_bits2[2] + 2**0 * w_bits2[3]:


imgs_fin = np.sum(cloud_output,axis=1)
#print(imgs_fin.shape)
imgs_fin = 2**3 * imgs_fin[0] + 2**2 * imgs_fin[1] + 2**1 * imgs_fin[2] + 2**0 * imgs_fin[3]
#print(imgs_fin.shape)

cloud_pred = np.argmax(imgs_fin,axis=0)

So this takes approx 5s (which is even less than the function that calls only the lookup table).

Thanks for your advice!
