I cannot tell from your documentation how to interpret the output of a model’s predictions. Consider the following code (uninteresting imports omitted for brevity):
from concrete.ml.sklearn import LogisticRegression as ConcreteLogisticRegression
from sklearn.linear_model import LogisticRegression as SkLogisticRegression
# Create the data for classification:
X, y = make_classification(
    n_features=30,
    n_redundant=0,
    n_informative=2,
    random_state=2,
    n_clusters_per_class=1,
    n_samples=250,
)
# Retrieve train and test sets:
X_model_owner, X_client, y_model_owner, y_client = train_test_split(X, y, test_size=0.4, random_state=42)
# Train the model and compile it
concrete_model = ConcreteLogisticRegression()
concrete_model.fit(X_model_owner, y_model_owner)
concrete_model.compile(X_model_owner)
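# Simulate deployment: three temporary directories stand in for the dev, server, and client machines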
server_dir = TemporaryDirectory()
client_dir = TemporaryDirectory()
dev_dir = TemporaryDirectory()
fhemodel_dev = FHEModelDev(dev_dir.name, concrete_model)
fhemodel_dev.save()
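# Distribute the saved artifacts: server.zip to the server, client.zip and serialized_processing.json to the client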
copyfile(dev_dir.name + "/server.zip", server_dir.name + "/server.zip")
copyfile(dev_dir.name + "/client.zip", client_dir.name + "/client.zip")
copyfile(
    dev_dir.name + "/serialized_processing.json",
    client_dir.name + "/serialized_processing.json",
)
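# Client side: load the client artifacts and generate the private and evaluation keys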
fhemodel_client = FHEModelClient(client_dir.name, key_dir=client_dir.name)
fhemodel_client.generate_private_and_evaluation_keys()
serialized_evaluation_keys = fhemodel_client.get_serialized_evaluation_keys()
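# Encrypt a single query on the client, run it on the server under FHE, then decrypt on the client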
clear_input = X_client[[0], :]
encrypted_input = fhemodel_client.quantize_encrypt_serialize(clear_input)
encrypted_prediction = FHEModelServer(server_dir.name).run(encrypted_input, serialized_evaluation_keys)
decrypted_prediction = fhemodel_client.deserialize_decrypt_dequantize(encrypted_prediction)
print("Concrete-ML decrypted prediction:")
print(decrypted_prediction)
direct_prediction = concrete_model.predict_proba(clear_input)
print("Concrete-ML direct prediction (predict_proba):")
print(direct_prediction)
direct_prediction2 = concrete_model.predict(clear_input)
print("Concrete-ML direct prediction (predict):")
print(direct_prediction2)
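# For comparison, train and query a plain scikit-learn model on the same data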
sklearn_model = SkLogisticRegression()
sklearn_model.fit(X_model_owner, y_model_owner)
sklearn_prediction = sklearn_model.predict(clear_input)
print("Sklearn prediction:")
print(sklearn_prediction)
This allows me to see and compare (all for the same query):
- The result of encrypting the query, sending it to the server to process in an FHE fashion, and having the client decrypt it,
- The result of calling model.predict_proba(),
- The result of calling model.predict(), and
- The result of calling a straight sklearn model.
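Concretely, the kind of check I am trying to write on top of these outputs looks like the snippet below. It continues from the script above, and it assumes the decrypted FHE result and the predict_proba() output are meant to be directly comparable probability arrays, which is exactly the part I am unsure about.

import numpy as np

# Continuing from the script above -- my intended check, under the assumption
# that both arrays hold per-class probabilities for the same query:
print(decrypted_prediction.shape, direct_prediction.shape)
print(np.allclose(decrypted_prediction, direct_prediction, atol=1e-6))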
What do I get?
Concrete-ML decrypted prediction:
[[0.03565675 0.96434325]
[0.03565675 0.96434325]]
Concrete-ML direct prediction (predict_proba):
[[0.03565675 0.96434325]
[0.03565675 0.96434325]]
Concrete-ML direct prediction (predict):
[1 1]
Sklearn prediction:
[1]
How the heck do I interpret the first three results? The documentation (concrete.ml.sklearn.base.md - Concrete ML) merely tells me that the result is a ‘numpy ndarray with probabilities (if applicable)’ for predict_proba() and a ‘numpy ndarray with predictions’ for predict(). Can I get some more detail? I can guess what’s going on, but (1) I would prefer not to have to guess, and (2) I sent in one query. Why am I getting two answers?
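For what it’s worth, my current guess at the semantics is below; the per-class layout and the argmax relationship are assumptions on my part, not anything I found in the documentation.

import numpy as np

# Guess: each row of predict_proba() is [P(class 0), P(class 1)] for one query,
# the rows sum to 1, and predict() returns the argmax of each row.
proba = np.array([[0.03565675, 0.96434325]])  # copied from the output above
print(proba.sum(axis=1))         # [1.] -- consistent with probabilities
print(np.argmax(proba, axis=1))  # [1]  -- matches the predict() output

Even if that guess is right, it does not explain why a single query comes back as two identical rows.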