Discrepancies between scikit and concrete when processing similar samples

I’ve been building a fuzzer to compare the concrete-ml FHE models against their scikit-learn counterparts. The goal is to find differences that could point to a logic bug. So far I’ve started with the logistic regression model: I trained both the concrete-ml and the scikit-learn implementations on the same dataset and then gave them both the same random inputs. On the happy path, both outputs should be exactly the same. However, when the input consists of essentially the same value repeated across all samples except one, concrete-ml and scikit-learn give different results. Is this expected? Should there be some tolerance for differences between concrete-ml and scikit-learn?
Here is one of the examples:
100 samples, 5 features each:

[ [-0.7098039215686274, -0.9999999952758206, -1.0, -1.0, -1.0], [-1.0, -1.0, -1.0, -1.0, -1.0], … (all other 99 samples are the same) [-1.0, -1.0, -1.0, -1.0, -1.0] ]

[ 0 1 … 1 ]

[ 1 1 … 1 ]

And here is the code I’m testing:

import sys
import atheris
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression
from concrete.ml.sklearn import LogisticRegression as ConcreteLogisticRegression

# Dataset to train
x, y = make_classification(n_samples=100, class_sep=2, n_features=5, random_state=42)

# Split the data-set into a train and test set,
# each set is split into input and result.
input_train, _, result_train, _ = train_test_split(
  x, y, test_size=0.2, random_state=42
)

# Start the concrete-ml logistic regression model, train (unencrypted data) and quantize the weights.
concrete_model = ConcreteLogisticRegression()
concrete_model.fit(input_train, result_train)

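# Compile FHE (the call is missing from the snippet as posted;
# using the training set as the calibration input is an assumption)
concrete_model.compile(input_train)
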
# Start the sklearn logistic regression model
scikit_model = SklearnLogisticRegression()
# Train
scikit_model.fit(input_train, result_train)

def compare_models(input_bytes):
  fdp = atheris.FuzzedDataProvider(input_bytes)
  data = [fdp.ConsumeFloatListInRange(5, -1.0, 1.0) for _ in range(100)]
  # Run the inference, encryption and decryption is done in the background
  fhe_pred = concrete_model.predict(data, execute_in_fhe=True)
  # Get scikit prediction
  prediction = scikit_model.predict(data)

  # Compare both outputs
  assert((fhe_pred == prediction).all())

atheris.Setup(sys.argv, compare_models)
# Start the fuzzing loop; without this call the fuzzer never runs
atheris.Fuzz()

Concrete-ML uses quantization to perform ML computations on integers instead of the floating-point values used by typical frameworks such as sklearn and PyTorch. Thus, depending on the quantization level, the results from Concrete-ML will be more or less accurate with respect to those of the sklearn models.
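To make the effect of the quantization level more tangible, here is a minimal sketch of uniform quantization in plain Python. It is only an illustration of the general idea, not Concrete-ML's actual quantizer; the helper names (`quantize`, `dequantize`, `max_roundtrip_error`) and the fixed input range of [-1, 1] are assumptions made for this example.

```python
def quantize(values, n_bits, lo=-1.0, hi=1.0):
    """Map floats in [lo, hi] to integers in [0, 2**n_bits - 1].

    Hypothetical helper for illustration only.
    """
    levels = 2 ** n_bits - 1
    scale = (hi - lo) / levels
    # Clip to the range, then snap to the nearest integer level.
    q_values = [round((min(max(v, lo), hi) - lo) / scale) for v in values]
    return q_values, scale


def dequantize(q_values, scale, lo=-1.0):
    """Map quantized integers back to approximate floats."""
    return [lo + q * scale for q in q_values]


def max_roundtrip_error(values, n_bits):
    """Largest absolute error introduced by quantizing and dequantizing."""
    q_values, scale = quantize(values, n_bits)
    restored = dequantize(q_values, scale)
    return max(abs(v - r) for v, r in zip(values, restored))


sample = [-0.7098039215686274, -0.9999999952758206, -1.0, 0.25, 0.9]
for n_bits in (2, 4, 8, 12):
    print(n_bits, max_roundtrip_error(sample, n_bits))
```

In this sketch the round-trip error is bounded by half a quantization step, so it shrinks as n_bits grows. With only 2 bits, -0.7098039215686274 and -1.0 land on the same integer, which is the kind of information loss that can change a prediction.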

For LogisticRegression, as for all linear models in Concrete-ML, you only need light quantization, that is, you can keep a relatively high number of bits. By default you are using 8-bit quantization (see concrete.ml.sklearn.linear_model.md in the Concrete ML documentation).

You can probably increase the number of quantization bits for this model to 10-12, which will reduce the quantization error and improve the results.
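As a rough illustration of why more bits help, the sketch below compares a tiny float linear classifier with a version whose weights and inputs are snapped to an n-bit grid, and measures how often the two agree. This is a toy model in plain Python, not Concrete-ML's implementation; the weights, bias, value ranges, and all function names are made up for the example, and the bias is left unquantized for simplicity.

```python
def quantize_to_grid(v, n_bits, lo, hi):
    """Snap v to the nearest of 2**n_bits evenly spaced values in [lo, hi]."""
    levels = 2 ** n_bits - 1
    scale = (hi - lo) / levels
    clipped = min(max(v, lo), hi)
    return lo + round((clipped - lo) / scale) * scale


def predict(weights, bias, x):
    """Float linear classifier: class 1 if the score is positive, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def quantized_predict(weights, bias, x, n_bits):
    """Same classifier, with weights and inputs quantized first."""
    q_weights = [quantize_to_grid(w, n_bits, -2.0, 2.0) for w in weights]
    q_x = [quantize_to_grid(xi, n_bits, -1.0, 1.0) for xi in x]
    return predict(q_weights, bias, q_x)


def agreement(weights, bias, inputs, n_bits):
    """Fraction of inputs on which float and quantized predictions match."""
    same = sum(
        predict(weights, bias, x) == quantized_predict(weights, bias, x, n_bits)
        for x in inputs
    )
    return same / len(inputs)


# Hypothetical model and a regular grid of 2-feature inputs in [-1, 1].
WEIGHTS = [1.3, -0.7]
BIAS = 0.1025
STEPS = 41
INPUTS = [
    [-1 + 2 * i / (STEPS - 1), -1 + 2 * j / (STEPS - 1)]
    for i in range(STEPS)
    for j in range(STEPS)
]

for n_bits in (2, 4, 8, 12):
    print(n_bits, agreement(WEIGHTS, BIAS, INPUTS, n_bits))
```

The disagreements in this toy setup come from inputs whose score is close to the decision boundary, where a small quantization error is enough to flip the sign.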

However, you should not expect the results to match the sklearn results 100% of the time. Increasing n_bits will bring you asymptotically closer to 100%.
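Since exact equality is not guaranteed, a fuzzer comparing the two models could check an agreement rate against a threshold instead of asserting that every prediction matches. The following is a minimal sketch in plain Python; `agreement_rate`, `check_agreement`, and the 0.99 threshold are hypothetical choices for illustration, and the two prediction lists simply mimic the shape of the example in the question (one mismatch out of 100).

```python
def agreement_rate(expected, actual):
    """Fraction of positions where the two prediction sequences match."""
    if len(expected) != len(actual):
        raise ValueError("prediction sequences must have the same length")
    if len(expected) == 0:
        return 1.0
    matches = sum(1 for e, a in zip(expected, actual) if e == a)
    return matches / len(expected)


def check_agreement(expected, actual, min_rate=0.99):
    """Raise AssertionError if the agreement rate is below min_rate."""
    rate = agreement_rate(expected, actual)
    if rate < min_rate:
        raise AssertionError(
            f"agreement {rate:.4f} is below the required {min_rate:.4f}"
        )
    return rate


# Illustrative data: 100 predictions that differ in a single position.
sklearn_pred = [1] * 100
fhe_pred = [0] + [1] * 99

print(check_agreement(sklearn_pred, fhe_pred, min_rate=0.99))
```

In the fuzz target, a call like `check_agreement(prediction, fhe_pred)` would take the place of the strict `assert`. The right threshold depends on the model and on n_bits, so it should be tuned rather than taken from this sketch.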

Thank you so much for the reply. I will play around with the n_bits field and add some error tolerance when evaluating the final results.
Follow-up question: do you have any estimate of how much the results can differ between sklearn and concrete-ml? I’m guessing it’s very dependent on the model and the number of bits used for quantization, but maybe you have some more concrete numbers?

The more bits you use (up to the limit of the FHE constraints), the better. With linear models, especially low-dimensional ones, you can probably get more than 99% agreement with the results of the float classifier.

For neural networks, when you use quantization-aware training with a low number of bits (2-6), the accuracy you get when running with PyTorch should be within 1% of the accuracy you get with Concrete-ML when you import the model with compile_brevitas_qat_model.
