Neural Network with concrete numpy: Precision / Quantization issue

Hello everybody,

I was trying to implement my own NN, using the example from the documentation (Fully Connected Neural Network — Concrete Numpy Manual)

The trained (cleartext) model performs really well; however, once quantized (and hence in FHE), the model (almost) always returns the same output.

Differences between my model and the documentation example:

  • My model doesn’t tackle a classification problem; it outputs a single scalar value.
  • Therefore I use nn.MSELoss() as the loss function.
  • I use a much bigger training and test data set (~40k).
  • I use ReLU6 as the activation function.

The model itself is quite similar:

  • 4 input neurons (inputs are within [0,1])
  • 2 hidden layers of 3 neurons each
  • 1 output neuron (the only difference)

Is this a quantization problem? Any help is highly appreciated! I am happy to provide more detailed information if needed.
Thanks in advance!

Martin


Hi Martin,

Thank you for your interest in concrete-numpy and its machine learning applications!

As a first approach to debugging the problem, I would suggest you track the activations throughout your network. For example, the attached code does this for a 4-layer network with the same topology as yours (if I understood your explanation correctly). I set it up so that it’s easy to track the activations, and I plot them at the end (see the graph attached). Would it be possible to do the same thing for your network?

Our current approach to quantization (post-training) can significantly degrade performance on larger NNs, but in your case the NN seems compatible at first glance, provided the test set and train set have very similar distributions.

Could you confirm whether this network is the same as yours and can be used to reproduce the problem? Could you otherwise share the code and maybe the data that you are using?

Could you please also confirm that you are getting good performance in fp32 before quantization?

 
import numpy as np
import matplotlib.pyplot as plt
import torch
from torch import nn

Nfeatures = 4

class Net4(nn.Module):
    def __init__(self, n_feat) -> None:
        super().__init__()

        self.fc1 = nn.Linear(n_feat, 3)
        self.relu1 = nn.ReLU6()
        self.fc2 = nn.Linear(3, 3)
        self.relu2 = nn.ReLU6()
        self.fc3 = nn.Linear(3, 3)
        self.relu3 = nn.ReLU6()
        self.fc4 = nn.Linear(3, 1)

        for m in self.modules():
            if isinstance(m, nn.Linear):
                torch.nn.init.uniform_(m.weight)
                torch.nn.init.zeros_(m.bias)

        self.activations = {}
        self.track_activations = False

    def set_track_activations(self, enable):
        self.track_activations = enable
        if enable:  # reset
            self.activations = {}
            for n, m in self.named_modules():  # in-order traversal
                if isinstance(m, nn.Linear):
                    self.activations[n] = np.zeros((0, m.weight.shape[0]), np.float32)
                    last_neurons = m.weight.shape[0]
                elif isinstance(m, nn.ReLU6):
                    self.activations[n] = np.zeros((0, last_neurons), np.float32)

    def forward(self, x):
        for n, m in self.named_children():  # in-order traversal
            x = m(x)
            if self.track_activations:
                self.activations[n] = np.vstack((self.activations[n], x.detach().numpy()))

        return x

# model, X, batch_size and n_batches are assumed to come from your own
# training script (e.g. model = Net4(Nfeatures)).
model.eval()
model.set_track_activations(True)
for bi in range(n_batches):
    data = torch.from_numpy(X[bi*batch_size:(bi+1)*batch_size, :].astype(np.float32))
    model(data)
model.set_track_activations(False)

_, ax = plt.subplots(1, 7, figsize=(24, 12))
for idx, key in enumerate(model.activations):
    ax[idx].hist(model.activations[key].flatten(), 100)
    ax[idx].set_title(key)
plt.show()

Hi Andrei,

Thank you very much for your helpful and quick response!

Here is my network, including your debug code:

class MyNet(torch.nn.Module):

    def __init__(self, input_size):
        super().__init__()

        self.fc1 = nn.Linear(input_size, 3)
        self.relu1 = nn.ReLU6()
        self.fc2 = nn.Linear(3, 3)
        self.relu2 = nn.ReLU6()
        self.fc3 = nn.Linear(3, 1)
        
        for m in self.modules():
            if isinstance(m, nn.Linear):
                torch.nn.init.uniform_(m.weight)
                torch.nn.init.zeros_(m.bias)
 
        self.activations = {}
        self.track_activations = False

    def set_track_activations(self, enable):
        self.track_activations = enable
        if enable: #reset
            self.activations = {}
            for n, m in self.named_modules(): #in order traversal
                if isinstance(m, nn.Linear):
                    self.activations[n] = np.zeros( (0, m.weight.shape[0]), np.float32)
                    last_neurons = m.weight.shape[0]
                elif isinstance(m, nn.ReLU6):
                    self.activations[n] = np.zeros( (0, last_neurons), np.float32)

    def forward(self, x):
        for n, m in self.named_children(): # in order traversal
            x = m(x)
            if self.track_activations:
                self.activations[n] = np.vstack( (self.activations[n], x.detach().numpy() ))

        return x

I have a 4-layer network, with the first layer being the inputs, 2 (FC) hidden layers with ReLU6, and one (FC) output layer. I think I wasn’t clear enough before, sorry about that.

I want the network to output the slope of a linear regression (LR) line fitted through 4 inputs (no rocket science, just to play with and learn concrete-numpy).
Therefore I generate the data randomly. The x-values for the LR are always [0.0, 0.33, 0.66, 1.0] (for 4 points); the y-values are uniformly sampled within [0,1].

def LR_slope(y):
    X = np.linspace(0,1,len(y))
    k,d = np.polyfit(X,y,1)
    
    return [k]

# synthetic training data
l = 50000  # data set length
N = 4      # number of inputs

X = np.array([np.random.uniform(0,1,N) for _ in range(l)])
Y = np.array([LR_slope(x) for x in X])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.1, random_state=42)

X_train = torch.tensor(X_train).float()
X_test = torch.tensor(X_test).float()
y_train = torch.tensor(y_train).float()
y_test = torch.tensor(y_test).float()

After training, performance looks good using the cleartext model. I plotted y_test (expected result) vs. y_pred (actual result).
[scatter plot: expected y_test vs. predicted y_pred]
This answers your question about fp32 performance, or am I wrong?

Finally, here are the histograms:

The activations cover a broader range; is this already the problem?

Do you want to see more code? Preferably “inline” here or as a notebook file?

A side note about what I experienced: I sometimes have to train multiple times (with fresh training data as well), as training occasionally seems to get stuck in a local minimum.

Thanks,
Martin

Thank you for the very detailed response. The learning problem that you are describing is very interesting. I modified my code to use the input data you generate (50,000 sets of 4 values; for each we fit a line and take the slope as the target variable).

In our current approach we quantize the inputs to 3 bits. Since you draw the 4 values uniformly, we can assume that the min/max of each of the 4 variables will be 0/1. Quantizing these to 3 bits means we will convert the values in [0…1] to one of the values [0.0, 0.125, 0.25, …, 0.875, 1.0].
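
(Just to illustrate the idea, here is a minimal sketch of uniform affine quantization over [0, 1]; it is not the exact Concrete quantizer, and the exact grid of values may differ slightly.)

import numpy as np

# Minimal sketch of uniform quantization to n_bits over [0, 1]
# (not the exact Concrete implementation).
def quantize_unsigned(x, n_bits=3):
    n_levels = 2**n_bits
    scale = 1.0 / (n_levels - 1)  # assumes the inputs already span [0, 1]
    q = np.clip(np.round(x / scale), 0, n_levels - 1).astype(np.int64)
    return q, q * scale           # integer codes and dequantized values

q, x_hat = quantize_unsigned(np.array([0.07, 0.4, 0.93]))
print(q, x_hat)  # [0 3 7] and [0. 0.42857143 1.]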

If you used the Concrete Numpy torch API to compile your network you probably did something like this:

quantized_compiled_module = compile_torch_model(
    model,
    X,
    n_bits=3,
)

But this hides a lot of the complexity behind quantization and makes it harder to debug problems. If we look at the code of this function, we can rewrite the quantization and inference parts as shown below, adding tracing of the quantized activations. This will not be done in FHE but in the clear, on quantized values. The FHE circuit should produce identical results (with occasional epsilon-sized errors).

from concrete.quantization import QuantizedArray, PostTrainingAffineQuantization
from concrete.torch import NumpyModule

 # Create corresponding numpy model
numpy_model = NumpyModule(model)

# Quantize with post-training static method, to have a model with integer weights
post_training_quant = PostTrainingAffineQuantization(3, numpy_model, is_signed=True)
quantized_module = post_training_quant.quantize_module(X)

# Quantize input
q_x = QuantizedArray(3, X)
q_input = q_x

q_act = {}
for layer_name, layer in quantized_module.quant_layers_dict.items():
    q_x = layer(q_x)
    q_act[layer_name] = q_x

I modified the histogram code to also include the input. I get the following for fp32 and uint3 (3 bit) inference:

_, ax = plt.subplots(1,8,figsize=(24,12))
ax[0].hist(q_input.values.flatten(), 8)
ax[0].set_title("input")
for idx, key in enumerate(q_act):
    ax[idx+1].hist(q_act[key].values.flatten(), 8)
    ax[idx+1].set_title(key)
plt.savefig("distrib_w_int3.png")


Would you be able to apply this code to your model? Could you do it for the training and test data separately?

It would really help if you could post your full code, if this is something you can share.

Regards,


Thanks Andrei,

yes you are right, I used compile_torch_model with n_bits=3.

Histograms:

Training data (45k)

Test data (5k)

Code:

from torch import nn

import torch
import numpy as np
import matplotlib.pyplot as plt

1. Generate the data

def LR_slope(y):
    X = np.linspace(0,1,len(y))
    k,d = np.polyfit(X,y,1)
    
    return [k]

# synthetic training data
l = 50000  # data set length
N = 4     # number of inputs

X = np.array([np.random.uniform(0,1,N) for _ in range(l)])
Y = np.array([LR_slope(x) for x in X])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.1, random_state=42)

X_train = torch.tensor(X_train).float()
X_test = torch.tensor(X_test).float()
y_train = torch.tensor(y_train).float()
y_test = torch.tensor(y_test).float()

2. Define the model

class MyNet(torch.nn.Module):

    def __init__(self, input_size):
        super().__init__()

        self.fc1 = nn.Linear(input_size, 3)
        self.relu1 = nn.ReLU6()
        self.fc2 = nn.Linear(3, 3)
        self.relu2 = nn.ReLU6()
        self.fc3 = nn.Linear(3, 1)
        
        for m in self.modules():
            if isinstance(m, nn.Linear):
                torch.nn.init.uniform_(m.weight)
                torch.nn.init.zeros_(m.bias)
 
        self.activations = {}
        self.track_activations = False

    def set_track_activations(self, enable):
        self.track_activations = enable
        if enable: #reset
            self.activations = {}
            for n, m in self.named_modules(): #in order traversal
                if isinstance(m, nn.Linear):
                    self.activations[n] = np.zeros( (0, m.weight.shape[0]), np.float32)
                    last_neurons = m.weight.shape[0]
                elif isinstance(m, nn.ReLU6):
                    self.activations[n] = np.zeros( (0, last_neurons), np.float32)

    def forward(self, x):        
        for n, m in self.named_children(): # in order traversal
            x = m(x)
            if self.track_activations:
                self.activations[n] = np.vstack( (self.activations[n], x.detach().numpy() ))

        return x
def train():
    for epoch in range(epochs):
        # Get a random batch of training data
        idx = torch.randperm(X_train.size()[0])
        X_batch = X_train[idx][:batch_size]
        y_batch = y_train[idx][:batch_size]

        # Forward pass
        y_pred = model(X_batch)

        # Compute loss
        loss = criterion(y_pred, y_batch)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()

        # Update weights
        optimizer.step()


        if (epoch+1) % 5000 == 0:
            # Print epoch number, loss
            print(f'Epoch: {epoch+1:5} | Loss: {loss.item():.9f}')
            if loss.item() < 1e-9:
                break
    print("Training done")

3. Initialize and train the model

# Initialize our model
model = MyNet(X.shape[1])

# Define our loss function
criterion = nn.MSELoss()

# Define our optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Define the number of iterations
epochs = 40000

# Define the batch size
batch_size = 16
train()

4. fp32 evaluation

y_pred = model(X_test)
print(f"mean error: {(y_pred - y_test).abs().mean()}")

y_test_numpy = y_test.numpy()
plt.plot(y_test_numpy, y_pred.cpu().detach().numpy(), "o", markersize=1)
plt.grid()

plt.xlabel("expected result y_test")
plt.ylabel("actual result y_pred")

plt.show()
X_eval = X_train.numpy()
# X_eval = X_test.numpy()
n_batches = X_eval.shape[0] // batch_size

model.eval()
model.set_track_activations(True)
for bi in range(n_batches):
    data = torch.from_numpy(X_eval[bi*batch_size:(bi+1)*batch_size, :].astype(np.float32))
    model(data)
model.set_track_activations(False)

_, ax = plt.subplots(1,6,figsize=(24,10))
ax[0].hist(X_eval.flatten(), 100)
ax[0].set_title("input")
for idx, key in enumerate(model.activations):
    ax[idx+1].hist(model.activations[key].flatten(), 100)
    ax[idx+1].set_title(key)
plt.savefig("distrib_w_fp32.png")  # save before show(), otherwise a blank figure is written
plt.show()

5. concrete: quantization

from concrete.quantization import QuantizedArray, PostTrainingAffineQuantization
from concrete.torch import NumpyModule

 # Create corresponding numpy model
numpy_model = NumpyModule(model)

# Quantize with post-training static method, to have a model with integer weights
post_training_quant = PostTrainingAffineQuantization(3, numpy_model, is_signed=True)
quantized_module = post_training_quant.quantize_module(X_eval)

# Quantize input
q_x = QuantizedArray(3, X_eval)
q_input = q_x

q_act = {}
for layer_name, layer in quantized_module.quant_layers_dict.items():
    q_x = layer(q_x)
    q_act[layer_name] = q_x
_, ax = plt.subplots(1,6,figsize=(24,10))
ax[0].hist(q_input.values.flatten(), 8)
ax[0].set_title("input")
for idx, key in enumerate(q_act):
    ax[idx+1].hist(q_act[key].values.flatten(), 8)
    ax[idx+1].set_title(key)
plt.savefig("distrib_w_int3.png")

Regards

(sorry about the last reply, my stupid fault)

Hey Martin,

Just sneaking in here to tell you that Andrei is off for a few days. When he is back, he will take a look and continue to help you.

Regarding the concrete.ml.* imports, they are a teaser of what’s coming: Concrete Numpy will be split in two, Concrete Numpy (for numpy compilation) and ConcreteML on top of it for ML things, for clarity. Andrei will update his post to use only the public imports. So, not an error on your side!

Cheers


Thanks for the message :)
Appreciate your help,
Martin


Hi Martin,

Looking over the graphs that you provided, it seems that the quantized network output takes one of several values, with a higher probability of returning a value around 0, which matches the fp32 result. It does not look like it is always returning the same value.

We quantize weights & activations in the network to 3 bits, with the exception of the output layer’s activations, which are 6 bits. Thus the last histogram, plotted with the code I gave, can be made clearer by using 64 bins. Here is a better version of that code, which also prints all the possible values taken by the activation tensor of each layer.

Could you please confirm that you are still having the problem when the fp32 solution is not the trivial one? I would say that if you were getting the same output value every time, the quantized activation distribution for fc3 would not resemble the fp32 one at all.

_, ax = plt.subplots(1,6,figsize=(24,10))
ax[0].hist(q_input.values.flatten(), 8)
ax[0].set_title("input")
for idx, key in enumerate(q_act):    
    nbins = 2**6 if idx == len(q_act) - 1 else 2**3
    unique_values = np.unique(q_act[key].values.flatten())
    print(key, unique_values)
    ax[idx+1].hist(q_act[key].values.flatten(), nbins)
    ax[idx+1].set_title(key)
plt.savefig("distrib_w_int3.png")


And here’s the output with a trivial solution:

Here are a few tips:

  • Reduce the learning rate to 0.01; the network will then converge to a good solution every time. With 0.1 it often converged to a trivial solution (the same slope whatever the inputs).
  • Use the quantization code to validate the output of the network before compiling to FHE, which is quite slow (this is what you are doing now).
  • When compiling to FHE, use fewer samples; 100-200 should suffice. With this network you might (rarely) encounter an exception during compilation saying you are exceeding 7 bits. A small sketch of the first and last tips follows below.
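
Here is a rough sketch of the first and last tips (reusing the variable names from the earlier code; treat it as illustrative rather than the exact call to make):

import torch
from concrete.torch.compile import compile_torch_model

# Lower learning rate, as suggested above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Compile on a small calibration slice (100-200 samples) instead of the full set.
quantized_compiled_module = compile_torch_model(
    model,
    X_train[:200],
    n_bits=3,
)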

Dear @andrei-stoian-zama

I hope I don’t bother you during your off-days!
Thank you for your reply and the tips!

Now I (almost) always get a 7-bit overflow during compilation.
Once it worked, this was the output:

Then I plotted a histogram of the fp32 model’s predictions (y_pred) and the same for the quantized model (showing the initial problem of the same output):
[histograms: fp32 predictions vs. quantized-model predictions]

Here is the code for compiling and plotting the output histograms:

from concrete.torch.compile import compile_torch_model

try:
    print("Compiling the model to FHE.")
    quantized_compiled_module = compile_torch_model(
        model,
        X_train[:100],
        n_bits=3,
    )
    print("The network is trained and FHE friendly.")
except Exception as e:
    if str(e).startswith("max_bit_width of some nodes is too high"):
        print(f'The network is not fully FHE friendly, retrain.')
    else:
        raise e
plt.hist(y_pred.detach().numpy())
plt.show()
X_test_numpy = X_test.numpy()
quant_model_predictions = quantized_compiled_module(X_test_numpy)
plt.hist(quant_model_predictions)
plt.show()

Martin,

First of all, thank you very much for doing all this work to try to get this running and for giving us valuable feedback on concrete-numpy!

I believe the issue is caused by a bug in our library. Let me explain:

While I believed the code I gave for quantized inference was exactly equivalent to using compile_torch_model and then calling forward on the resulting QuantizedModule, I had forgotten about something: the input we give to this code is not exactly the same as the one taken by QuantizedModule.forward.

  • The code I gave creates a QuantizedArray from X_eval (basically computing quantization parameters for X_eval and quantizing it) and passes this to the inference. This works here because we calibrate with X_eval and test with X_test, which has exactly the same distribution. In general this is not a good idea, as the distribution of new inputs could differ from that of the original calibration data. We need to use the quantizer obtained during calibration and compilation of the network. Here we would do:

    q_calib = QuantizedArray(3, X_calib)   # compute the quantizer of the calibration data
    q_calib.update_values(X_test.numpy())  # apply the quantizer to the test data
    q_input = q_calib
    q_x = q_input

    q_act = {}
    for layer_name, layer in quantized_module.quant_layers_dict.items():
        q_x = layer(q_x)
        q_act[layer_name] = q_x

Now the bug is the following: the code in the QuantizedModule wants to do this step, so it takes the quantization parameters from the calibration data that was given to compile_torch_model, and quantizes the forward input with them. However, there is a typo in the code:

        # If the q_x is a numpy module then we reuse self.q_input parameters
        # computed during calibration.
        # Later we might want to only allow nympy.array input
        if not isinstance(q_x, QuantizedArray):
            assert self.q_input is not None
            self.q_input.update_qvalues(q_x)
            q_x = self.q_input

In the module we released, we call update_qvalues, while the correct function is update_values, as shown above. update_qvalues computes raw unquantized values from quantized values, while update_values computes quantized values from raw values. Clearly we need update_values here.

Could you please go to your virtualenv, find the concrete-numpy package source (it should be at
.venv/lib/python3.8/site-packages/concrete/quantization/quantized_module.py), and modify line 51 (in the forward function), changing it to self.q_input.update_values(q_x)?
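
For reference, after that edit the relevant lines of the forward function should read (the same snippet as quoted above, with the fixed call):

        # If the q_x is a numpy module then we reuse self.q_input parameters
        # computed during calibration.
        if not isinstance(q_x, QuantizedArray):
            assert self.q_input is not None
            self.q_input.update_values(q_x)  # was: update_qvalues(q_x)
            q_x = self.q_input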

Could you please rerun with this change?

Furthermore, just a note: when you do
quant_model_predictions = quantized_compiled_module(X_test_numpy)

this does not actually run inference in FHE; you need to call forward_fhe for that. You will need to quantize your data to integers before calling forward_fhe, and the correct way to do it would be:

from tqdm import tqdm
N_run_fhe = 100
quantized_compiled_module.q_input.update_values(X_test_numpy[0:N_run_fhe,:])
fhe_input = quantized_compiled_module.q_input.qvalues
fhe_predictions = np.zeros((N_run_fhe,))
for idx, q_i in enumerate(tqdm(fhe_input)):
    assert(np.all(np.abs(q_i - q_i.astype(np.uint8)) < 0.0001 ))
    y_fhe = quantized_compiled_module.forward_fhe.run(q_i.astype(np.uint8).reshape(1,-1))
    fhe_predictions[idx] = quantized_compiled_module.dequantize_output(y_fhe)

Note that this takes 27 seconds per iteration on my machine, so about 2700 seconds = 45 minutes in total for the 100 test examples.


Thanks @andrei-stoian-zama. So, obviously, @Martin, this is a bug that we’re going to fix in our next release. Thanks a lot for helping us pinpoint this problem. And yes, of course, we’re also going to add more tests, in order to catch this kind of error automatically before releasing.


Thank you guys!

I fixed the typo and reran the code.

I still mostly get an error during compile_torch_model. I thought modifying the data might help; it actually did help, but only once. Instead of uniformly sampled data, the 4 input points now (almost) lie on a straight line, plus some noise, e.g.
[example plot of such an input]

However, I am a bit confused. I don’t know whether my code is as intended:

X_calib = X_train.numpy()
n_batches = X_calib.shape[0] // batch_size
numpy_model = NumpyModule(model)
post_training_quant = PostTrainingAffineQuantization(3, numpy_model, is_signed=True)
quantized_module = post_training_quant.quantize_module(X_calib)

q_calib = QuantizedArray(3, X_calib)  # compute the quantizer of the calibration data
q_calib.update_values(X_test.numpy()) # apply the quantizer to the test data
q_input = q_calib
q_x = q_input

q_act = {}
for layer_name, layer in quantized_module.quant_layers_dict.items():
    q_x = layer(q_x)
    q_act[layer_name] = q_x

Histograms (for the one time it worked):


After compile_torch_model, the outputs are distributed as follows (left: y_pred = model(X_test); right: quant_model_predictions = quantized_compiled_module(X_test_numpy)):

[histograms: fp32 predictions vs. quantized-model predictions]

Looks better (? :) )

Some questions from my side:

  • X_calib can be anything, right? If X_calib = X_test, q_calib wouldn’t change after the update, but it is good practice to separate them!?
  • I understand that quant_model_predictions = quantized_compiled_module(X_test_numpy) does not run FHE inference, but it DOES run inference in quantized (clear), right?
  • What is the preferred data set for compile_torch_model? X_test? X_calib?

Hi Martin,

Thank you very much for your feedback! I think your current version has run into the limitations of this very early version of Concrete-Numpy, and that you are getting the correct results from the quantized model.

Your modification of the data should simplify the learning problem, but it does not impact the quantization. Two things determine whether the compilation works or fails:

  • When doing a multiply-accumulate (each neuron in the network computes a sum of weight * activation products), the accumulator must be at most 7 bits wide. You are quantizing to 3 bits, so you can afford a maximum of 2 connections per neuron in the worst-case scenario where all weights and activations equal 2^3 - 1 (see the back-of-the-envelope check after these bullets). This scenario of course won’t work for your network with 4 inputs, unless you get lucky values for the weights (low values or some zeros). But for regression the weight distributions are not so lucky.
  • The support of the distributions of the weights and activations (i.e. the min/max of the distributions) can sometimes introduce higher accumulation bit widths, but this is hard to quantify.
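
To make the first point concrete, here is a back-of-the-envelope check (my own sketch; it assumes unsigned values and ignores zero points and biases, so it is only an upper-bound estimate):

import math

# Worst-case accumulator width for a neuron with n_inputs connections,
# assuming unsigned n_bits weights and activations.
def worst_case_acc_bits(n_bits, n_inputs):
    max_val = 2**n_bits - 1                 # 7 for 3 bits, 3 for 2 bits
    acc_max = n_inputs * max_val * max_val  # sum of worst-case products
    return math.ceil(math.log2(acc_max + 1))

print(worst_case_acc_bits(3, 2))  # 7 -> still fits the 7-bit limit
print(worst_case_acc_bits(3, 4))  # 8 -> too wide for the 4-input first layer
print(worst_case_acc_bits(2, 8))  # 7 -> why 2-bit quantization allows more connections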

I would suggest that you use 2-bit weights and activations. You can then increase the number of connections per neuron (even 6-8 would work).

A second approach would be to increase the depth of the network while keeping the connections of the neurons sparse (zero weights for some inputs). This is classical network pruning, and you can use the torch functions for this (see torch.nn.utils.prune.l1_unstructured in the PyTorch documentation; make sure to remove the pruning reparametrization before compilation).
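
A rough sketch of what that could look like with the torch pruning utilities (the pruning amount is only an illustrative choice):

import torch.nn.utils.prune as prune
from torch import nn

def prune_linear_layers(model, amount=0.5):
    # Zero out a fraction of the weights (by L1 magnitude) in every Linear layer;
    # fine-tune the model afterwards with the pruning masks in place.
    for m in model.modules():
        if isinstance(m, nn.Linear):
            prune.l1_unstructured(m, name="weight", amount=amount)

def remove_pruning(model):
    # Fold the masks back into the weight tensors so the compiler sees plain
    # (sparse) weights; call this before compiling the model.
    for m in model.modules():
        if isinstance(m, nn.Linear):
            prune.remove(m, "weight")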

  • X_calib can be anything, right? If X_calib = X_test, q_calib wouldn’t change after the update, but it is good practice to separate them!?

To be unbiased, X_calib should normally be a held-out set of the data, or you can use a part of X_train. In an operational setting you would not know the real X_test at quantization time.
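
For example (just a sketch reusing X_train/y_train from your code; the new names and sizes are only illustrative), you could hold out a calibration set like this:

from sklearn.model_selection import train_test_split

# Split off ~1000 samples from the training data for calibration/compilation.
X_train_fit, X_calib, y_train_fit, y_calib = train_test_split(
    X_train.numpy(), y_train.numpy(), test_size=1000, random_state=0
)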

  • I understand that quant_model_predictions = quantized_compiled_module(X_test_numpy) does not run FHE inference, but it DOES run inference in quantized (clear), right?

Yes, you are correct: this runs quantized inference in the clear.

  • What is the preferred data set for compile_torch_model ? X_test? X_calib?

I would suggest X_calib. Normally we don’t have X_test at training/calibration/compilation time.

Regards and thanks for all your work!


Thanks @Martin for your questions and answers, and for your interest in our work.

As we arrive at a conclusion in this long thread, I would like to thank you once again, @Martin, for your involvement with the current version, and encourage you to stay up to date with future product updates. As @andrei said (thanks to him too for the quality and dedication of his answers), we’re currently at the first version of our tool, and we’ll have much more to show in further releases.

The best way to hear about future updates is to subscribe to Zama’s newsletter and to watch the GitHub repo, if you haven’t already done so.

Cheers!